Word embeddings

word2vec - Understanding and implementation

This notebook covers word2vec, a popular NLP algorithm for embedding words into a low-dimensional space while preserving some notion of semantic similarity.

The sources are drawn from:

Intuition about word similarity measures and word embeddings

Distributional hypothesis: In linguistics, this hypothesis states that words that are used and occur in the same contexts tend to have similar meanings (i.e. a word is characterized by the company it keeps).

In the context of this hypothesis, we can assign each word $w$ a vector $q_w$. We'll come back to how $q_w$ is defined, but if we have two words $w$ and $w'$, we can define a similarity between them by using the cosine of the angle between $q_{w}$ and $q_{w'}$.

$s(w,w') = \frac{q_{w}\cdot q_{w'}}{|q_{w}|\cdot|q_{w'}|}$


This is known as cosine similarity. Similar words have $s$ close to 1, while dissimilar words have $s$ close to 0 or below (down to -1 for vectors pointing in opposite directions).
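
As a quick illustration of the formula (a minimal sketch with two arbitrary vectors, not learned embeddings), the manual computation matches PyTorch's built-in cosine_similarity:

In [ ]:
import torch
import torch.nn.functional as F

# Two arbitrary 5-dimensional vectors standing in for word embeddings.
q_w1 = torch.tensor([0.9, 0.1, 0.3, 0.0, 0.2])
q_w2 = torch.tensor([0.8, 0.2, 0.1, 0.1, 0.3])

# Cosine similarity by hand: dot product divided by the product of the norms.
s_manual = (q_w1 @ q_w2) / (q_w1.norm() * q_w2.norm())

# PyTorch's built-in version gives the same value.
s_builtin = F.cosine_similarity(q_w1, q_w2, dim=0)
print(s_manual, s_builtin)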

How should we create $q_w$ for a word $w$? We want $q_w$ to contain entries that describe a "similarity score" of $w$ with other entities in various sentences (words, word groups etc).

For instance, for the word 'dog', $q_{\text{dog}}$ would have a high score under the 'animal' component, but a low score under the 'culinary' component.

The problem is that there are far too many possible components to consider when building such a vector by hand, and it is very difficult to design components that capture similarity well in the way described above. This is where neural networks come to the rescue: they learn representations from data, automating the feature engineering. We can therefore build a neural network that learns good word embeddings as a by-product of training on some task we give it.

The important point is that it is not the task itself we care about - it is the representation the network learns while solving it. That representation consists of latent semantic attributes that, while not individually interpretable to us, yield similarity scores that are semantically accurate.

Word embeddings in Pytorch

Embeddings are stored in a $|V| \times D$ matrix, where $V$ is our vocabulary and $D$ is the embedding dimension. We also keep a dictionary that maps each word to its row index in this matrix.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

import matplotlib.pyplot as plt
In [2]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)
tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)

N-gram language modeling

In an n-gram language model, we are given a sequence of words $w_1,...,w_T$, and we want to compute $P(w_i | w_{i-1},...,w_{i-n+1})$, i.e. the probability of a word given the $n-1$ words that precede it (the context; CONTEXT_SIZE in the code below). The bigger the context, the more data we have to store.
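
Before the neural version, here is a minimal count-based sketch of the same idea (using a toy token list, not the Shakespeare text below), estimating $P(w_i | w_{i-1}, w_{i-2})$ directly from counts:

In [ ]:
from collections import Counter

# Toy token list; estimate P(w_i | w_{i-1}, w_{i-2}) from raw trigram/bigram counts.
tokens = "the cat sat on the mat the cat ate the fish".split()

trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p(w, w_prev1, w_prev2):
    # P(w | w_prev1, w_prev2) = count(w_prev2, w_prev1, w) / count(w_prev2, w_prev1)
    num = trigram_counts[(w_prev2, w_prev1, w)]
    den = bigram_counts[(w_prev2, w_prev1)]
    return num / den if den else 0.0

# "the cat" is followed by "sat" half of the time in this toy corpus.
print(p("sat", "cat", "the"))  # 0.5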

In [4]:
# Set the context size and embedding dimensions to something reasonable.
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# We should tokenize the input, but we will ignore that for now
# Build a list of tuples.
# Each tuple is ([ word_i-1, ..., word_i-CONTEXT_SIZE ], target word)
ngrams = [
    (
        [test_sentence[i - j - 1] for j in range(CONTEXT_SIZE)],
        test_sentence[i]
    )
    for i in range(CONTEXT_SIZE, len(test_sentence))
]

# Print the first 3, just so you can see what they look like.
print(ngrams[:3])
[(['forty', 'When'], 'winters'), (['winters', 'forty'], 'shall'), (['shall', 'winters'], 'besiege')]
In [10]:
# Remove duplicate words to build the vocabulary.
vocab = set(test_sentence)

# Build the word to index map.
word_to_ix = {word: i for i, word in enumerate(vocab)}

The neural network below is composed of 3 layers:

  1. The embedding layer.
  2. A hidden linear layer.
  3. An output layer.

The embedding layer does not do much; it just stores the embedding of each word in our vocabulary. The output of a forward pass is a vector of log-probabilities, one for each word in the vocabulary.

In [7]:
# The class below builds the N-gram neural network.
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        
        # Outputs an embedding in D dimensions of the |V| words in our vocabulary.
        # When the network is trained, this layer will contain the embedding
        # we desire. 
        # If we feed this layer k indices in [0, vocab_size - 1], it outputs the corresponding
        # word embeddings (a k x D tensor) by looking them up in its |V| x D table.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # Takes the concatenated embeddings of the context_size words we fed to the
        # embedding layer and runs them through a linear layer.
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        
        # Outputs one score per word in the vocabulary (a vector of size |V|);
        # the log-softmax in forward() turns these scores into log-probabilities.
        self.linear2 = nn.Linear(128, vocab_size)

    # Forward propagation. `inputs` is a 1-D tensor of context_size word indices,
    # one per context word. For instance, with a context window of size 2 the input
    # is a tensor of shape (2,).
    def forward(self, inputs):
        
        # First, look up the embedding of each word in the context window, then
        # flatten them into a single row vector of shape 1 x (context_size * D).
        embeds = self.embeddings(inputs).view((1, -1))
        
        # Apply the first linear layer followed by a ReLU.
        out = F.relu(self.linear1(embeds))
        
        # Apply the second linear layer with a log softmax.
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        
        return log_probs
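
Before training, a quick shape check (a minimal sketch; it assumes the cells above have been run and uses an untrained throwaway model) confirms that a forward pass returns a 1 x |V| row of log-probabilities:

In [ ]:
# Run one untrained forward pass on the first n-gram to inspect the output shape.
_model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
_context, _target = ngrams[0]
_idxs = torch.tensor([word_to_ix[w] for w in _context], dtype=torch.long)
with torch.no_grad():
    _log_probs = _model(_idxs)
print(_log_probs.shape)        # (1, len(vocab))
print(_log_probs.exp().sum())  # close to 1, since the row is a log-probability vector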

After defining our model, we are now ready to train it. A forward pass through the model gives us a vector of log-probabilities $\log p(w | w_{i-1}, w_{i-2})$, one entry per word $w \in V$. The true target word corresponds to a one-hot vector $c$ over the vocabulary.

What is the probability that the answer is actually $c$, given that the parameters $\theta$ of the model lead us to $p$? That is captured by the likelihood of the model's parameters:

$l(p_{\theta},c) = P(c | \theta) = \prod\limits_{j=1}^{|V|} p(w_j | w_{i-1},w_{i-2})^{c_j}$

Since $c$ is one-hot, this product reduces to the probability the model assigns to the true target word. Instead of maximizing this value, we will use gradient descent to minimize the negative log-likelihood.
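
Concretely, nn.NLLLoss applied to the log-softmax output simply picks out the negative log-probability of the target index. A minimal sketch of that equivalence, with made-up scores over a tiny 4-word vocabulary:

In [ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up scores over a tiny 4-word vocabulary, and the index of the true word.
scores = torch.tensor([[1.0, 2.0, 0.5, -1.0]])
target = torch.tensor([1])

log_probs = F.log_softmax(scores, dim=1)

# NLLLoss on log-probabilities equals minus the log-probability of the target word.
print(nn.NLLLoss()(log_probs, target))  # same value...
print(-log_probs[0, target.item()])     # ...as picking out -log p(target)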

In [20]:
losses = []

# We use the negative log-likelihood loss.
loss_function = nn.NLLLoss()

# Define our model.
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)

# Our optimizer is plain stochastic gradient descent. Rather than computing the loss
# over the entire dataset at once, we update the parameters after each individual
# training example.
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(500):
    total_loss = 0
    for context, target in ngrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
    
# Plot the loss decreasing.
plt.plot(losses)    
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.show()

# To get the embedding of a particular word, e.g. "beauty"
print(model.embeddings.weight[word_to_ix["beauty"]])
tensor([-0.6917, -1.4360, -0.2075, -0.5217, -1.5193,  0.4163, -1.0064,  0.0406,
         0.3711, -0.2062], grad_fn=<SelectBackward0>)
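
With a trained embedding matrix we can revisit the cosine similarity from the introduction (a minimal sketch; it assumes the cells above have been run, and the word pair is arbitrary):

In [ ]:
# Compare two learned embeddings with cosine similarity.
with torch.no_grad():
    v1 = model.embeddings.weight[word_to_ix["beauty"]]
    v2 = model.embeddings.weight[word_to_ix["praise"]]
    print(F.cosine_similarity(v1, v2, dim=0))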

Word2vec model and architecture

Word2vec is just another model that generates word embeddings from context. Here the context window consists of the $N$ words before and the $N$ words after a given word, where $N$ is a hyperparameter. We'll keep it at 2.

In the original word2vec paper, two architectures are proposed:

  • CBOW (Continuous Bag of Words) - a model that predicts a current word based on its context words
  • Skip-Gram - a model that predicts context words based on the current word.

We'll look at CBOW first.
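
To make the difference concrete, here is a minimal sketch (a toy sentence and a window of $N = 2$, not the corpus used below) of the training pairs each architecture generates:

In [ ]:
# Toy sentence and a window of N = 2 words on each side.
sentence = "the quick brown fox jumps".split()
N = 2

# CBOW: (context words) -> center word.
cbow_pairs = [
    ([sentence[j] for j in range(max(0, i - N), min(len(sentence), i + N + 1)) if j != i],
     sentence[i])
    for i in range(len(sentence))
]

# Skip-Gram: center word -> each context word.
skipgram_pairs = [
    (sentence[i], sentence[j])
    for i in range(len(sentence))
    for j in range(max(0, i - N), min(len(sentence), i + N + 1))
    if j != i
]

print(cbow_pairs[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]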

Continuous Bag-Of-Words

CBOW is typically used to pretrain word embeddings that are then fed into some more complicated model. Given a target word $w_i$ and a context window of $N$ words on each side, $C = \{w_{i-N},...,w_{i-1},w_{i+1},...,w_{i+N}\}$, we try to minimize

$-\log p(w_i | C) = -\log\text{Softmax}\left(A\left(\sum\limits_{w \in C}q_w\right) + b\right)_{w_i}$

i.e. the negative log of the softmax entry corresponding to $w_i$.


The model consists of an embedding layer as before (this is the embedding matrix we are trying to learn) and a dense linear layer followed by a softmax. In the implementation below we average the context embeddings rather than summing them; the two differ only by a constant factor.

In [22]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
EMBEDDING_DIM = 5

raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

# Data pre-processing.
word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = (
        [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]
        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]
    )
    target = raw_text[i]
    data.append((context, target))
# print(data[:5])

# Define the neural network.
class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(CBOW, self).__init__()
        
        # The embedding layer that stores the embedding the model learns.
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # The linear layer: maps the averaged context embedding to one score per vocabulary word.
        self.linear1 = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        
        # Obtain the embedding for each context word. Embeds here is a 
        # (2 * context_size) x embedding_dim tensor. 
        embeds = self.embeddings(inputs)
        
        # Average over the context words to obtain a single 1 x embedding_dim vector.
        embeds = embeds.mean(dim=0, keepdim=True)
        
        # Go through the linear layer and the log-softmax activation.
        out = self.linear1(embeds)
        
        out = F.log_softmax(out, dim=1)
        
        return out
       
# A function to create a tensor of indices from the context.
def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)

print ("Start training!")

# Create and train the model.
losses = []

# We use the negative log-likelihood loss.
loss_function = nn.NLLLoss()

# Define our model.
model = CBOW(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)

# Our optimizer is plain stochastic gradient descent. Rather than computing the loss
# over the entire dataset at once, we update the parameters after each individual
# training example.
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(500):
    total_loss = 0
    for context, target in data:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = make_context_vector(context, word_to_ix)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities for the target (center) word
        probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
    
# Plot the loss decreasing.
plt.plot(losses)    
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.show()
Start training!
In [23]:
# To get the embedding of a particular word, e.g. "process"
print(model.embeddings.weight[word_to_ix["process"]])
tensor([-0.0736, -0.9770, -0.9789,  2.4915, -0.9673],
       grad_fn=<SelectBackward0>)
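
Finally, we can rank words by cosine similarity to a query word (a minimal sketch; it assumes the cells above have been run, and with such a tiny corpus the neighbors are not expected to be very meaningful):

In [ ]:
# Rank vocabulary words by cosine similarity to the embedding of "process".
with torch.no_grad():
    W = model.embeddings.weight                 # the |V| x D embedding matrix
    q = W[word_to_ix["process"]].unsqueeze(0)   # 1 x D query vector
    sims = F.cosine_similarity(q, W, dim=1)     # one similarity per vocabulary word
    ix_to_word = {i: w for w, i in word_to_ix.items()}
    top = sims.argsort(descending=True)[:5]
    print([(ix_to_word[i.item()], round(sims[i].item(), 3)) for i in top])
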
In [ ]: