
Vector Space Model & Text Vectorization Techniques

The document explains the Vector Space Model (VSM) and various vectorization approaches including One-Hot Encoding, Bag of Words, Bag of N-Grams, and TF-IDF, highlighting their pros and cons. It also discusses distributed representations in NLP, emphasizing the importance of dense vectors for capturing semantic meaning and the use of Word2Vec for generating pre-trained word embeddings. Additionally, it describes the Continuous Bag of Words (CBOW) and Skip-gram models, which are techniques for learning word embeddings from context in text data.

Uploaded by

yashu99marupilli

UNIT-2

13. What is a vector space model? Explain the basic vectorization approaches: one-hot
encoding, bag of words, bag of n-grams, and TF-IDF.
Vector Space Model (VSM) and Basic Vectorization Approaches
1. Vector Space Model (VSM)
• The Vector Space Model (VSM) is a mathematical model used to represent text as vectors of numbers.
• Since ML algorithms cannot process raw text directly, text is converted into numerical vectors using a chosen representation scheme.
• Each text unit (word, sentence, or document) is represented as a vector in an n-dimensional space, where n is the vocabulary size.
• Similarity between two text vectors is often calculated using cosine similarity or Euclidean distance.
Cosine Similarity Formula:
\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}
where A and B are two document vectors.
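The cosine similarity formula can be sketched in a few lines of plain Python. This is a minimal illustration rather than a library implementation: the function name, the two count vectors, and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors (an assumption)
    return dot / (norm_a * norm_b)

# Two illustrative document vectors that share exactly one word.
doc_a = [1, 1, 1, 0, 0, 0]
doc_b = [0, 0, 1, 0, 1, 1]
print(cosine_similarity(doc_a, doc_b))  # about 0.333
```

A value near 1 means the vectors point in nearly the same direction; a value of 0 means the documents share no weighted terms.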
2. Basic Vectorization Approaches
(a) One-Hot Encoding
In one-hot encoding, each word w in the corpus vocabulary is given a unique integer ID wid
between 1 and |V|, where V is the set of words in the corpus vocabulary. Each word is then represented by a
|V|-dimensional binary vector that is all 0s except at index = wid, where we put a 1.
For the example below, we first map each of the six words to a unique ID: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6.
• Each word is represented by a binary vector of length equal to the vocabulary size.
• A single “1” at the word’s unique index identifies the word; all other positions are 0.

Example:
Vocabulary = [dog, bites, man, meat, food, eats]
• “dog” → [1,0,0,0,0,0]
• “bites” → [0,1,0,0,0,0]
• Document D1 (“dog bites man”) → [[1,0,0,0,0,0], [0,1,0,0,0,0], [0,0,1,0,0,0]]
Pros: Simple, easy to implement.
Cons: Sparse vectors, no semantic meaning, OOV (Out-of-Vocabulary) issue.
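One-hot encoding can be sketched as follows. The helper name `one_hot` is hypothetical, and the code uses 0-based list positions where the text uses IDs starting at 1, so ID 1 corresponds to position 0. Raising `KeyError` for an unknown word is one way to surface the OOV issue, chosen here for illustration.

```python
def one_hot(word, vocabulary):
    """Return the one-hot vector for `word`; raise KeyError if it is out of vocabulary."""
    if word not in vocabulary:
        raise KeyError(f"'{word}' is out of vocabulary")
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

# Vocabulary ordered by word ID: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6.
vocabulary = ["dog", "bites", "man", "meat", "food", "eats"]

# A document becomes a list of one-hot vectors, one per word.
d1 = [one_hot(w, vocabulary) for w in "dog bites man".split()]
print(d1)  # [[1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0]]
```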

(b) Bag of Words (BoW)


BoW maps words to unique integer IDs between 1 and |V|. Each document in the corpus is then
converted into a |V|-dimensional vector whose ith component, with i = wid, is the number of
times the word w occurs in the document. In other words, we score each word in V by its
occurrence count in the document.
• Represents a document as a vector of word frequencies.
• Ignores word order and context.
Example (using the word IDs dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6):
D1: “dog bites man” → [1,1,1,0,0,0]
D4: “man eats food” → [0,0,1,0,1,1]
Pros: Fixed-length vectors, captures word frequency.
Cons: Sparsity, ignores word order, cannot capture semantic similarity.
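A bag-of-words vector can be sketched with `collections.Counter`. The function name `bag_of_words` and the whitespace tokenization with lowercasing are assumptions for this example; real pipelines usually use a proper tokenizer.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    # Counter returns 0 for words that never occur, so absent words score 0.
    return [counts[word] for word in vocabulary]

# Vocabulary ordered by word ID: dog = 1, bites = 2, man = 3, meat = 4, food = 5, eats = 6.
vocabulary = ["dog", "bites", "man", "meat", "food", "eats"]
print(bag_of_words("dog bites man", vocabulary))  # [1, 1, 1, 0, 0, 0]
print(bag_of_words("man eats food", vocabulary))  # [0, 0, 1, 0, 1, 1]
```

Note that "dog bites man" and "man bites dog" produce the same vector, which is exactly the loss of word order listed under the cons.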


(c) Bag of N-Grams (BoN)
• Extends BoW by considering n consecutive words (n-grams).
• Captures some word order and context.
Example (bigrams):
Corpus bigrams = {dog bites, bites man, man eats, eats food}
• D1: “dog bites man” → [1,1,0,0]
• D4: “man eats food” → [0,0,1,1]
Pros: Captures local context (phrases).
Cons: Increases dimensionality and sparsity with higher n.
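The bigram example can be reproduced with a short sketch. The helper names `ngrams` and `bag_of_ngrams` are hypothetical, and the bigram vocabulary is built only from the two documents shown, in order of first appearance.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(document, ngram_vocabulary, n=2):
    """Count how often each vocabulary n-gram occurs in the document."""
    grams = ngrams(document.split(), n)
    return [grams.count(g) for g in ngram_vocabulary]

documents = ["dog bites man", "man eats food"]

# Build the bigram vocabulary in order of first appearance.
bigram_vocabulary = []
for doc in documents:
    for gram in ngrams(doc.split(), 2):
        if gram not in bigram_vocabulary:
            bigram_vocabulary.append(gram)

print(bigram_vocabulary)                                 # ['dog bites', 'bites man', 'man eats', 'eats food']
print(bag_of_ngrams("dog bites man", bigram_vocabulary))  # [1, 1, 0, 0]
print(bag_of_ngrams("man eats food", bigram_vocabulary))  # [0, 0, 1, 1]
```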
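(d) TF-IDF (Term Frequency – Inverse Document Frequency)
• Assigns weights to words based on their importance in a document relative to the corpus.
• Words that occur frequently in one document but rarely across others get higher scores.
Formula (one common variant):
TF(t, d) = \frac{\text{number of occurrences of } t \text{ in } d}{\text{total number of terms in } d}
IDF(t) = \log\left(\frac{N}{\text{number of documents containing } t}\right), where N is the total number of documents in the corpus
TF\text{-}IDF(t, d) = TF(t, d) \times IDF(t)
Example:
• Common words like “the” get low weights.
• Rare words like “meat” get high weights.
Pros: Captures importance of words, widely used in IR & text classification.
Cons: Still sparse, cannot capture deep semantic meaning.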
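A small sketch of TF-IDF weighting follows. It assumes one common variant, TF = term count divided by document length and IDF = log(N / number of documents containing the term); libraries often use smoothed variants, so their numbers will differ. The four-document corpus is an assumption for illustration (only "dog bites man" and "man eats food" appear in the examples above).

```python
import math

def tf(term, tokens):
    """Term frequency: share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, tokenized_corpus):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    containing = sum(1 for tokens in tokenized_corpus if term in tokens)
    if containing == 0:
        return 0.0  # term never seen in the corpus; convention chosen for this sketch
    return math.log(len(tokenized_corpus) / containing)

def tf_idf(term, tokens, tokenized_corpus):
    return tf(term, tokens) * idf(term, tokenized_corpus)

# Assumed toy corpus.
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
tokenized = [doc.split() for doc in corpus]

# "dog" occurs in three of four documents, "meat" in only one.
print(tf_idf("dog", tokenized[2], tokenized))   # lower weight
print(tf_idf("meat", tokenized[2], tokenized))  # higher weight
```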

Summary Table
Approach       | Captures Word Frequency | Captures Word Order | Handles Importance | Sparse? | OOV Issue
One-Hot        | No                      | No                  | No                 | Yes     | Yes
Bag of Words   | Yes                     | No                  | No                 | Yes     | Yes
Bag of N-Grams | Yes                     | Partially (local)   | No                 | Yes     | Yes
TF-IDF         | Yes                     | No                  | Yes                | Yes     | Yes

14. Explain the term Distributed Representation and other key terms associated with it?
Ans: In NLP, a distributed representation is a way of representing words as dense, low-dimensional
vectors rather than large, sparse vectors like in one-hot encoding or TF-IDF.
These dense vectors capture semantic meaning by placing words with similar meanings close
to each other in the vector space. This makes computation faster and more efficient, and it
captures relationships between words better.
Key associated terms:
1. Distributional Similarity:
This is the idea that the meaning of a word can be understood from the context in which it
appears. For example, in the sentence “NLP rocks,” the literal meaning of “rocks” is “stones,”
but based on the context, it actually means something exciting or cool. So, meaning comes
from context, not just the dictionary definition.
2. Distributional Hypothesis:
This is a key linguistic theory that says: “Words that occur in similar contexts tend to have
similar meanings.” For example, “dog” and “cat” often appear in similar sentences, so their
meanings are considered related. In NLP, this is translated into vectors: if two words appear
in similar contexts, their vectors should also be close in the embedding space.
3. Distributional Representation:
These are word representations built directly from word co-occurrence statistics in text.
Methods like one-hot encoding, bag of words, bag of n-grams, and TF-IDF all fall into this
category. The downside is that they create high-dimensional, sparse vectors (mostly zeros),
which are inefficient to compute and don’t capture deeper meaning.
4. Distributed Representation:
This is the improved version of distributional representation. Instead of using high-dimensional,
sparse vectors, distributed representations compress the information into low-dimensional,
dense vectors. These vectors are much more powerful because they capture
semantic similarity while being computationally efficient.
5. Embedding:
An embedding is the mapping from a distributional representation (like one-hot vectors) into
a distributed representation (dense vectors). For example, Word2Vec, GloVe, and
FastText are embedding techniques that convert sparse vectors into meaningful dense ones.
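To make the idea of an embedding as a mapping concrete, the sketch below multiplies a one-hot vector by an embedding matrix and shows that the result is simply one row of that matrix. The three-word vocabulary and the matrix values are made up for illustration; in practice the matrix is learned by a model such as Word2Vec.

```python
def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def vector_matrix_product(vector, matrix):
    """Multiply a row vector (length R) by an R x C matrix."""
    columns = len(matrix[0])
    return [sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(columns)]

vocabulary = ["dog", "cat", "car"]

# Hypothetical 3 x 2 embedding matrix: one dense 2-dimensional row per word.
embedding_matrix = [
    [0.9, 0.8],    # dog
    [0.85, 0.75],  # cat
    [0.1, 0.2],    # car
]

cat_index = vocabulary.index("cat")
dense = vector_matrix_product(one_hot(cat_index, len(vocabulary)), embedding_matrix)
print(dense)  # [0.85, 0.75], the row of the matrix that belongs to "cat"
```

Because the one-hot vector has a single 1, the multiplication only selects a row, which is why embedding layers are implemented as table lookups rather than full matrix products.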
6. Vector Semantics:
This is the general term for NLP approaches that use vectors to represent words, based on the
distributional properties of language. It’s the broader field that includes both distributional
and distributed representations.

15. Utilize Word2vec Keyed Vector model to generate Pre-trained Word Embeddings?
Word Embeddings are a crucial text representation technique that captures distributional
similarities between words. The idea is that the meaning of a word can be understood from the
context in which it appears.
Word2vec, a neural network-based model introduced by Mikolov et al. in 2013, was a seminal
work that demonstrated the ability to capture word analogy relationships like "King – Man +
Woman ≈ Queen". These learned representations are typically low-dimensional (e.g., 50–500
dimensions) and dense, making Machine Learning (ML) tasks more tractable and efficient.
Pre-trained Word Embeddings are models that have already been trained on large text corpora
(e.g., billions of words from Wikipedia and news articles). These models can be loaded and
used directly in new projects, providing a strong baseline without the need for extensive
training.
The gensim library provides functionalities for loading and working with Word2vec models.
The following code demonstrates how to load a pre-trained Word2vec model (e.g., Google's
pre-trained model on Google News data) and then query it for word similarity and vector
representation:

import os

from gensim.models import KeyedVectors

# Path to the folder that holds the pre-trained Word2vec model file.
# Replace 'path/to/your/folder' with the location where you downloaded the model.
data_path = "path/to/your/folder"
path_to_model = os.path.join(data_path, 'GoogleNews-vectors-negative300.bin')

# Load the pre-trained Word2vec model.
# This can take some time because the file is large (about 3.6 GB).
try:
    w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
    print('Done loading Word2Vec model.')

    # Number of words in the vocabulary.
    # gensim 4.x exposes the vocabulary as key_to_index; older 3.x releases used .vocab.
    print(f"Number of words in the vocabulary: {len(w2v_model.key_to_index)}")

    # Find the words semantically most similar to 'beautiful'.
    print("\nWords most similar to 'beautiful':")
    print(w2v_model.most_similar('beautiful', topn=5))

    # Get the 300-dimensional vector for 'beautiful'.
    print("\nVector representation for 'beautiful':")
    print(w2v_model['beautiful'])

    # Example of an out-of-vocabulary (OOV) word.
    oov_word = 'practicalnlp'
    if oov_word in w2v_model.key_to_index:
        print(f"\nVector for '{oov_word}': {w2v_model[oov_word]}")
    else:
        print(f"\n'{oov_word}' is an out-of-vocabulary (OOV) word and is not in the model's vocabulary.")
except FileNotFoundError:
    print(f"Error: Model file not found at {path_to_model}. Ensure the file is downloaded and the path is correct.")
except Exception as e:
    print(f"An error occurred: {e}")
Explanation: This code first loads a pre-trained Word2vec model using
KeyedVectors.load_word2vec_format. It then demonstrates two key functionalities:
• w2v_model.most_similar('beautiful', topn=5): This function finds the 5 words whose vectors
are closest (highest cosine similarity) to the vector of "beautiful," effectively identifying
semantically similar words.
• w2v_model['beautiful']: This directly retrieves the numerical vector (an array of floats) that
represents the word "beautiful" in the learned embedding space. The example demonstrates a
300-dimensional vector.
A crucial consideration when using pre-trained embeddings is the Out-of-Vocabulary (OOV)
problem. If a word in your input text is not present in the pre-trained model's vocabulary, the
model cannot return a vector for it, leading to a "key not found" error. This highlights the
importance of checking if a word exists in the vocabulary before attempting to retrieve its
vector.
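The vocabulary check described above can be wrapped in a small helper. This is a sketch under stated assumptions: the function names are hypothetical, and a plain dict with made-up values stands in for a loaded model so the example runs without the 3.6 GB file. The helpers rely only on the `in` test and `[]` lookup.

```python
def lookup_vector(model, word, default=None):
    """Return the vector for `word`, or `default` when the word is out of vocabulary."""
    return model[word] if word in model else default

def known_vectors(model, tokens):
    """Keep only the vectors of in-vocabulary tokens, silently skipping OOV words."""
    return [model[t] for t in tokens if t in model]

# Stand-in for a loaded model; the vectors are invented for illustration.
toy_model = {"beautiful": [0.1, 0.3], "lovely": [0.12, 0.28]}

print(lookup_vector(toy_model, "beautiful"))     # [0.1, 0.3]
print(lookup_vector(toy_model, "practicalnlp"))  # None
print(len(known_vectors(toy_model, ["beautiful", "practicalnlp", "lovely"])))  # 2
```

Skipping OOV words is only one possible policy; depending on the task, a zero vector or a subword-based model such as FastText may be a better fit.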

16. Explain a Continuous Bag of Words (CBOW) model and apply it for sample text?
A. The Continuous Bag of Words (CBOW) model is a type of neural word embedding model
introduced in Word2Vec. It is widely used in Natural Language Processing (NLP) for learning
word representations (embeddings) from text data.
It works in the following steps:
1. Context Window:
o You define a window size k (e.g., 2 words to the left and 2 to the right).
o Example: For the sentence “The quick brown fox jumps over the lazy dog”, if target = fox, then context = quick, brown, jumps, over.
2. Input:
o Context words are converted to one-hot vectors.
o These are then averaged or summed.
3. Hidden Layer:
o A weight matrix converts the input into a dense vector (embedding space).
4. Output:
o The model predicts the target word (softmax over the vocabulary).
5. Training:
o Using a large corpus of text, the model updates weights to minimize the
prediction loss (cross-entropy).
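Steps 2 to 4 can be sketched as a single forward pass in plain Python. The weights below are hypothetical, untrained values chosen to keep the example deterministic, and the embedding size of 3 is arbitrary. Averaging one-hot inputs and multiplying by the input weight matrix is equivalent to averaging the matching rows of that matrix, which is what the code does.

```python
import math

vocabulary = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_index = {w: i for i, w in enumerate(vocabulary)}
embedding_dim = 3

# Hypothetical, untrained weights (real models start from random values).
W_in = [[0.01 * (i + j) for j in range(embedding_dim)] for i in range(len(vocabulary))]
W_out = [[0.02 * (i - j) for j in range(len(vocabulary))] for i in range(embedding_dim)]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_words):
    # Input + hidden layer: average the embedding rows of the context words.
    rows = [W_in[word_to_index[w]] for w in context_words]
    hidden = [sum(column) / len(rows) for column in zip(*rows)]
    # Output layer: one score per vocabulary word, turned into probabilities.
    scores = [sum(hidden[d] * W_out[d][v] for d in range(embedding_dim))
              for v in range(len(vocabulary))]
    return softmax(scores)

probs = cbow_forward(["quick", "brown", "jumps", "over"])
loss = -math.log(probs[word_to_index["fox"]])  # cross-entropy for target "fox"
print(len(probs), round(sum(probs), 6), loss > 0)
```

Training (step 5) would repeat this pass over many (context, target) pairs and adjust both weight matrices to reduce the loss.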
Sample Sentence:
“The cat sat on the mat.”
Let’s use a window size of 2 (up to 2 words on each side of the target).
Target Word | Context Words
The         | cat, sat
cat         | The, sat, on
sat         | The, cat, on, the
on          | cat, sat, the, mat
the         | sat, on, mat
mat         | on, the
CBOW uses these (context → target) training pairs:
• (cat, sat) → The
• (The, sat, on) → cat
• (The, cat, on, the) → sat
• (cat, sat, the, mat) → on
• (sat, on, mat) → the
• (on, the) → mat
Note: Words are mapped to one-hot vectors → processed through the hidden layer → trained to
predict the target word.
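Generating (context → target) pairs is mechanical, so a short sketch may help. It assumes a symmetric window, meaning up to `window` words are taken on each side of the target; the function name `cbow_pairs` is hypothetical.

```python
def cbow_pairs(tokens, window):
    """Return (context_words, target_word) for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

tokens = "The cat sat on the mat".split()
pairs = cbow_pairs(tokens, window=2)
for context, target in pairs:
    print(context, "->", target)
```

Words near the sentence boundaries get fewer context words because the window is cut off at the edges.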
CBOW is one of the foundational models for learning word embeddings, which are used in:
• Sentiment Analysis
• Machine Translation
• Question Answering
• Named Entity Recognition (NER)
• And many other downstream NLP tasks
It produces dense, low-dimensional vector representations of words where semantically
similar words are close in the vector space.

Summary:
Feature               | CBOW
Predicts              | Target word from context
Input                 | Context words (surrounding words)
Output                | Target word (center word)
Type                  | Word embedding (neural network-based)
Application           | Feature learning for many NLP models
Compared to Skip-gram | Faster but less effective for rare words

17. Explain the skip-gram model and apply it to the sample text?
Skip-gram Model in NLP
What is Skip-gram?
The Skip-gram model is one of the two Word2Vec architectures for learning word embeddings (the other is CBOW).
• Goal: Predict context words given a center word.
• It learns word embeddings such that words that appear in similar contexts have similar vectors.
• Famous for capturing semantic meaning (e.g., king – man + woman ≈ queen).
How it Works (Step-by-Step)
Step 1: Input
We take a sentence and define a context window size.
Example sentence: “The dog barks loudly”
Vocabulary: {The, dog, barks, loudly}
Window size = 1 (each word predicts its immediate neighbors).
Step 2: Create Training Pairs
For each word (center), pair it with nearby context words.
• Center = “The” → Context = “dog” → Pair: (The, dog)
• Center = “dog” → Context = “The”, “barks” → Pairs: (dog, The), (dog, barks)
• Center = “barks” → Context = “dog”, “loudly” → Pairs: (barks, dog), (barks, loudly)
• Center = “loudly” → Context = “barks” → Pair: (loudly, barks)
Final training dataset:
{(The, dog), (dog, The), (dog, barks), (barks, dog), (barks, loudly), (loudly, barks)}
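The pair-generation step can be sketched as follows. The function name `skipgram_pairs` is hypothetical; the window is symmetric, and every (center, context) combination inside it becomes one training example.

```python
def skipgram_pairs(tokens, window):
    """Return one (center, context) pair for every neighbor inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("The dog barks loudly".split(), window=1)
print(pairs)
# [('The', 'dog'), ('dog', 'The'), ('dog', 'barks'),
#  ('barks', 'dog'), ('barks', 'loudly'), ('loudly', 'barks')]
```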
Step 3: One-Hot Encoding
Each word is represented as a one-hot vector.
Vocabulary size = 4
• The = [1, 0, 0, 0]
• dog = [0, 1, 0, 0]
• barks = [0, 0, 1, 0]
• loudly = [0, 0, 0, 1]
Step 4: Neural Network Architecture
• Input layer: One-hot vector of the center word.
• Hidden layer: Weight matrix (size = vocab × embedding_dim). This is where embeddings are learned.
• Output layer: Softmax probability over the vocabulary (predicts the context word).
Training: For each (center, context) pair, update the weights so that the model assigns a higher
probability to the true context word. As a result, the center word ends up close to its
context words in vector space.
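The architecture and training rule above can be sketched end to end in plain Python. This is a teaching sketch, not how Word2Vec is implemented in practice: it uses a full softmax over a four-word vocabulary, whereas real implementations rely on approximations such as negative sampling. The embedding size, learning rate, epoch count, and random seed are arbitrary assumptions.

```python
import math
import random

random.seed(0)
vocabulary = ["The", "dog", "barks", "loudly"]
index = {w: i for i, w in enumerate(vocabulary)}
pairs = [("The", "dog"), ("dog", "The"), ("dog", "barks"),
         ("barks", "dog"), ("barks", "loudly"), ("loudly", "barks")]
V, D, lr = len(vocabulary), 2, 0.1

W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]   # embeddings (vocab x dim)
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]  # output weights (dim x vocab)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_epoch():
    total_loss = 0.0
    for center, context in pairs:
        c, o = index[center], index[context]
        h = W_in[c][:]  # one-hot input times W_in selects the center word's row
        probs = softmax([sum(h[d] * W_out[d][v] for d in range(D)) for v in range(V)])
        total_loss -= math.log(probs[o])  # cross-entropy for the true context word
        err = [probs[v] - (1.0 if v == o else 0.0) for v in range(V)]  # gradient w.r.t. scores
        grad_h = [sum(W_out[d][v] * err[v] for v in range(V)) for d in range(D)]
        for d in range(D):
            for v in range(V):
                W_out[d][v] -= lr * h[d] * err[v]
            W_in[c][d] -= lr * grad_h[d]
    return total_loss / len(pairs)

losses = [train_epoch() for _ in range(200)]
print(f"average loss: first epoch {losses[0]:.3f}, last epoch {losses[-1]:.3f}")
```

After training, the rows of `W_in` are the word embeddings. The loss cannot reach zero here because "dog" and "barks" each have two different valid context words.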
Example: Apply Skip-gram on a Small Sample
Sentence:
“The dog barks”
Step 1: Vocabulary
Vocabulary = {The, dog, barks}
Index: The=0, dog=1, barks=2
Step 2: Choose Window Size = 1
Generate training pairs:
• (The → dog)
• (dog → The), (dog → barks)
• (barks → dog)
So, Training Data = {(The, dog), (dog, The), (dog, barks), (barks, dog)}
Step 3: One-Hot Encoding
• “The” = [1,0,0]
• “dog” = [0,1,0]
• “barks” = [0,0,1]
Step 4: Neural Network (Simplified)
• Input (center word one-hot) → Hidden layer → Output (probability of context word).
• After training, hidden layer weights become embeddings.
Example embeddings after training (just for intuition):
• The → [0.2, 0.1]
• dog → [0.9, 0.8]
• barks → [0.85, 0.75]
We can see that dog and barks are close in vector space, because they often co-occur.
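The closeness claim can be checked numerically on the illustrative embeddings. The sketch below compares Euclidean distance and cosine similarity; the vectors are the intuition-only values from the example, not the output of a trained model.

```python
import math

embeddings = {
    "The": [0.2, 0.1],
    "dog": [0.9, 0.8],
    "barks": [0.85, 0.75],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

for left, right in [("dog", "barks"), ("The", "dog")]:
    print(left, right,
          "distance:", round(euclidean(embeddings[left], embeddings[right]), 3),
          "cosine:", round(cosine(embeddings[left], embeddings[right]), 3))
```

With these values the Euclidean distance separates the pairs much more clearly than cosine similarity, because all three vectors point in roughly the same direction and differ mainly in length.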

18. What are visual embeddings, and why are they called a universal text representation?
Visual embeddings are dense numerical vector representations of words or documents that
capture semantic relationships in continuous vector space. They transform discrete text into
numerical vectors that machine learning algorithms can process. They are called a universal text
representation because they provide a unified mathematical framework for any textual element
regardless of language or domain. Key characteristics include:
Embeddings capture semantic similarity through geometric relationships. Similar words are
positioned closer in multidimensional space, enabling mathematical operations to reveal
relationships like "king - man + woman ≈ queen".
These representations learn from large text corpora, capturing broad linguistic patterns that
generalize across domains. The same embedding space represents words from technical
documents, literature, and social media.
They support multiple languages and can align representations across languages. Modern
techniques like multilingual BERT create shared spaces where similar concepts from different
languages are positioned nearby.
Dense embeddings efficiently encode semantic information in 50-1000 dimensions while
preserving linguistic relationships. Pre-trained embeddings transfer to various downstream
tasks without task-specific feature engineering.
19. Utilize Doc2Vec from gensim to create document vectors?
A) Doc2Vec
• Word2Vec gives vector representations for individual words.
• Doc2Vec (Document to Vector) extends this idea to entire documents, sentences, or paragraphs.
• It was proposed by Le and Mikolov (2014) as “Paragraph Vector.”
How it works:
1. Each document is assigned a unique document ID (tag).
2. Just like Word2Vec, a neural model learns to predict words based on context, but now
it also considers the document ID.
3. This way, the model learns a vector that represents the semantic meaning of the whole
document.
There are two main versions:
• PV-DM (Distributed Memory) → Predicts a word from its context words together with the document ID.
• PV-DBOW (Distributed Bag of Words) → Ignores context words, predicts words only from the document ID.
Implementation with Gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Sample corpus
documents = [
    "Natural Language Processing is fun",
    "I love learning NLP",
    "Doc2Vec creates document embeddings",
    "Word2Vec is for word embeddings",
    "Deep learning improves NLP tasks"
]

# Step 1: Tag each document with a unique ID
tagged_data = [TaggedDocument(words=doc.split(), tags=[str(i)]) for i, doc in enumerate(documents)]

# Step 2: Train the Doc2Vec model
model = Doc2Vec(vector_size=50, window=2, min_count=1, workers=4, epochs=40)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

# Step 3: Infer a vector for a new document
doc_vector = model.infer_vector(["I", "enjoy", "NLP"])
print("Vector for sample text:", doc_vector)

# Step 4: Find similar documents
# (gensim 4.x stores document vectors in model.dv; older 3.x releases used model.docvecs)
similar_docs = model.dv.most_similar([doc_vector], topn=3)
print("Most similar documents:", similar_docs)
In summary:
• Doc2Vec → Converts whole documents into vectors.
• Useful for tasks like document similarity, clustering, classification, and recommendation.
• Gensim makes implementation simple:
o Prepare TaggedDocument dataset
o Train Doc2Vec model
o Use .infer_vector() for new documents
