Seq2Seq

Sequence to Sequence models with Attention

An important milestone on the way to understanding more modern neural network architectures is grasping and implementing RNNs, encoder-decoder models and the attention mechanism.

These ideas allowed Deep Learning techniques to perform remarkably well in machine translation around 2014-2018, before Transformer models began to dominate the state of the art. But before we get to Transformers, we need to understand sequence-to-sequence models.


RNNs

Recurrent Neural Networks (RNNs) are models that can process sequential inputs, like sentences for translation or generative tasks. Inherently sequential, RNNs work by passing their hidden state forward and reusing it at future timesteps.

Slide6.jpg
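To make the "reused hidden state" idea concrete, here is a minimal sketch of a single vanilla RNN step in PyTorch. The sizes and weight names are made up purely for illustration; the actual model later in these notes uses nn.GRU instead.

import torch

# Hypothetical sizes, for illustration only.
input_size, hidden_size = 8, 16
W_xh = torch.randn(input_size, hidden_size)   # input -> hidden weights
W_hh = torch.randn(hidden_size, hidden_size)  # hidden -> hidden weights
b_h = torch.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h)
    return torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = torch.zeros(hidden_size)              # initial hidden state
for x_t in torch.randn(5, input_size):    # a toy "sequence" of 5 inputs
    h = rnn_step(x_t, h)                  # the same hidden state is reused at every timestep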

Types of RNNs

There are multiple types of RNNs, each suited for a different type of problem:

  • One-to-One: Vanilla Feed-Forward NNs suited for classification tasks.
  • One-to-Many: Suited for tasks like picture descriptions
  • Many-to-One: Suited for tasks like sentiment analysis
  • Many-to-Many (same length): Suited for tasks like frame-by-frame video classification
  • Encoder-decoder models: Suited for tasks like machine translation and many others.

karpathy.jpg

Disadvantages of RNNs

  • Bottleneck phenomenon: the encoder has to squeeze a good-enough description of the entire input into a single vector, which limits how long its input can usefully be.
  • Loss of context: being able to handle sequential data is good, but we also need to learn dependencies between entries of the input sequence that are far apart. In a vanilla RNN this is very hard.
  • Memory constraints: unless the hidden layers are massive, the RNN cannot remember much about previous timesteps (nor access future ones). Models known as LSTMs (Long Short-Term Memory) offer a solution to this problem.
  • Parallelizability: sequential computation means longer training times.

Attention

The attention mechanism is a solution to the first three of the above concerns. By directing its attention to specific parts of the sequence, we are able to:

  • Retain larger parts of it (memory)
  • Draw connections between different parts of it (context)
  • Represent it more freely (bottleneck)

The attention mechanism is simple and rests on two key points:

  1. The encoder no longer outputs only a single final hidden state - we retain all intermediate states and return them as a matrix (one column per input timestep).
  2. Each decoder timestep consumes that matrix and uses its own hidden state to decide which columns of the matrix to focus on. It does so by using the hidden state to produce a set of weights over the columns of the encoder output matrix, as sketched below.
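Here is a rough numerical sketch of that second point, with made-up shapes (each encoder state is stored as a row here). The decoder's state scores every encoder state, a softmax turns the scores into weights, and the context vector is the resulting weighted sum. Note that this dot-product scoring is only one simple way to produce the weights; the decoder we implement below learns them with a small linear layer instead.

import torch
import torch.nn.functional as F

hidden_size, src_len = 16, 7                        # hypothetical sizes
encoder_states = torch.randn(src_len, hidden_size)  # one row per source timestep
decoder_hidden = torch.randn(hidden_size)           # the decoder's current hidden state

scores = encoder_states @ decoder_hidden            # one score per source position
weights = F.softmax(scores, dim=0)                  # attention weights, summing to 1
context = weights @ encoder_states                  # weighted sum: a hidden_size vector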

Slide36.jpg

Network architectures may change, but in essence the idea of attention is as simple as that: choose what to focus on, based on parameters which the network learns. It's much like real life, really: you can't learn the contents of a book by going through it strictly sequentially. You need to learn to focus on specific pieces of content - multiple ones at times.

The idea is so revolutionary that it turns out to be essentially the only thing we need to make models really powerful. That is what Transformers did, but we'll leave that for another time.

For now, we'll focus on implementation. We'll implement a sequence-to-sequence model - a fancy term for an encoder-decoder model - that translates French sentences (sequences of words) into English sentences using RNNs and attention in PyTorch. The implementation will be based on the ideas described above, so don't worry about the exact architecture details.

The basic data processing

First, we'll need to lay some groundwork for processing words, strings and dictionaries. We'll use the English-French sentence-pair dataset found here: https://download.pytorch.org/tutorial/data.zip

In [1]:
# Import essential libraries
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
In [2]:
SOS_token = 0 # Start of sentence.
EOS_token = 1 # End of sentence

# This class will help us handle our language dataset.
# We'll represent words as one-hot vectors. We could also use word-embeddings here!
class Lang:
    def __init__(self, name):
        self.name = name                         # name of language
        self.word2index = {}                     # word -> index map
        self.word2count = {}                     # word -> word count map
        self.index2word = {0: "SOS", 1: "EOS"}   # index -> word map
        self.n_words = 2                         # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            # Append new word to end of dictionary
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1
            
# Other utility functions            
            
# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s
In [3]:
# 
# This function will help us parse through the 'dictionary' dataset.
# 

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines. Each line has the form
    # [English] [\t] [Other language]
    lines = open(r'C:\Users\themi\OneDrive\Projects\Machine Learning\data\%s-%s.txt' % (lang1, lang2), encoding='utf-8').\
        read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs
In [4]:
# 
# We want to trim down our dataset because it is massive!
# We will only consider sentence pairs where both sentences have fewer than MAX_LENGTH words.
# We also restrict the English sentences to start with some pre-determined prefixes. 
# All that for simplicity. With enough time and resources, we could train our model on the entirety of the dataset!
#

MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

#
# Function to prepare our data:
#

def prepareData(lang1, lang2, reverse=False):
    
    # Parse the dictionary file into pairs
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    
    # Filter the pairs to simplify
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    
    # Add the sentences and words to our language objects so that we can create our one-hot encodings.
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    
    # Return the objects that we will work with.
    return input_lang, output_lang, pairs

# Let's try it for French->English!
input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))
Reading lines...
Read 135842 sentence pairs
Trimmed to 10599 sentence pairs
Counting words...
Counted words:
fra 4345
eng 2803
['tu es celui qui m a entraine .', 'you re the one who trained me .']

The model

Now that the data processing stuff is out of the way and we have a nice, organized way of dealing with our bags of words in both languages, we are ready to build our model!

Recall that our model consists of:

  • An encoder that takes in the one-hot vectors (and previous hidden states) and generates representation vectors.
  • An attention-enhanced decoder that takes the entirety of the hidden state matrix and applies weight-based attention to it in order to produce a context vector.
  • An output-producing mechanism that takes in the context and hidden state vectors and produces a word in the other language.

The encoder

Let's talk about the encoder first. Here is the architecture we will use. Each encoder "layer" corresponds to a single timestep and looks like a feedforward neural network, except that across timesteps these layers "borrow" information from each other.

encoder-network.png

A detail we need to address is the use of Gated Recurrent Units (GRUs) in this RNN.

I can probably make a more extensive set of notes on this topic, but GRUs are an 'evolved' version of an RNN architecture called Long Short-Term Memory (LSTM). Both GRUs and LSTMs use the concept of gates, which are quantities calculated at each timestep to serve different purposes. There are gates that help the network retain information across timesteps, gates that help it forget, and gates that combine all the information collected in each timestep. The values of the gates are combined in certain ways to produce the hidden states of the RNN at each timestep.

Screenshot-from-2021-03-17-14-24-12.webp

Let's stick to a rough description of GRUs: as shown above, a GRU takes the input at time $t$, $x_t$, and the hidden state from the previous timestep, $h_{t-1}$, and produces a new hidden state $h_t$.

Let's see how it does that.

440px-Gated_Recurrent_Unit,_base_type.svg.png

This is what a fully gated unit looks like architecturally.

We have two gates ($\sigma$ is the sigmoid function):

  • Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
  • Reset gate: $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$

The reset gate is used to create the candidate activation vector $\hat{h}_t = \phi(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$, in which the network has the chance to reset its memory ($\phi$ is the hyperbolic tangent).

And that is finally combined with the update gate, in which the network has the chance to retain memory long term: $$h_t = (1-z_t)\odot h_{t-1} + z_t \odot \hat{h}_t$$
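As a sanity check of the formulas, here is a direct, unoptimized transcription of the equations above into PyTorch. The shapes and parameter names are illustrative only; in the actual model we simply call nn.GRU, which bundles all of this.

import torch

input_size, hidden_size = 8, 16   # hypothetical sizes

# One (W, U, b) triple per gate / candidate, matching the equations above.
Wz, Uz, bz = torch.randn(input_size, hidden_size), torch.randn(hidden_size, hidden_size), torch.zeros(hidden_size)
Wr, Ur, br = torch.randn(input_size, hidden_size), torch.randn(hidden_size, hidden_size), torch.zeros(hidden_size)
Wh, Uh, bh = torch.randn(input_size, hidden_size), torch.randn(hidden_size, hidden_size), torch.zeros(hidden_size)

def gru_step(x_t, h_prev):
    z_t = torch.sigmoid(x_t @ Wz + h_prev @ Uz + bz)           # update gate
    r_t = torch.sigmoid(x_t @ Wr + h_prev @ Ur + br)           # reset gate
    h_cand = torch.tanh(x_t @ Wh + (r_t * h_prev) @ Uh + bh)   # candidate activation
    return (1 - z_t) * h_prev + z_t * h_cand                   # new hidden state h_t

h_t = gru_step(torch.randn(input_size), torch.zeros(hidden_size))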

Again, the architecture of our encoder is portrayed in the following figure:

encoder-network.png

In [5]:
#
# This class represents the encoder RNN
#

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size

        # We generate embeddings of hidden_size dimensions for each word.
        # Think of `self.embedding` as a |V| x hidden_size matrix, where if an input index is provided we output
        # the row that represents that word's embedding.
        self.embedding = nn.Embedding(input_size, hidden_size)
        
        # Our RNN is composed of a single GRU layer whose input and hidden dimensions are both `hidden_size`.
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        
        # `input` is an index (1x1 tensor) for a word in the dictionary.
        # Generate its embedding: 1x1x`hidden_size` tensor.
        embedded = self.embedding(input).view(1, 1, -1)
        
        # Pass the embedded vector into our GRU layer, which returns the same information twice:
        #   * `output`: 1x1x`hidden_size` -> the hidden state produced for this timestep.
        #   * `hidden`: 1x1x`hidden_size` -> the same hidden state, to be passed to the next timestep.
        # We'll be saving `output` (it becomes one row of the encoder output matrix) and feeding `hidden`
        # back into the GRU together with the next word.
        # Granted, we could also feed the whole input sequence to the GRU in a single call, instead of
        # invoking forward once per word.
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    # The initial hidden state is a vector of zeros (it could also be random)
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
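As a quick, purely illustrative smoke test of the shapes involved, we could run the encoder on a single dummy word index (the toy_encoder and dummy_word names below are made up; the real encoding loop appears in the train function further down):

toy_encoder = EncoderRNN(input_lang.n_words, hidden_size=256).to(device)
toy_hidden = toy_encoder.initHidden()                     # 1 x 1 x 256, all zeros
dummy_word = torch.tensor([[SOS_token]], device=device)   # any valid vocabulary index works
out, toy_hidden = toy_encoder(dummy_word, toy_hidden)
# out and toy_hidden are both 1 x 1 x 256; `out` would become one row of the encoder output matrix.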

The decoder

Non-attention based

Without attention, we only have access to the last hidden state of the encoder. The decoder is an RNN resembling the encoder:

decoder-network.png

In [6]:
# Implementing the non-attention based RNN.
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        # An embedding layer over the entire vocabulary of the target language (English)
        # `output_size` is simply the size of that vocabulary.
        self.embedding = nn.Embedding(output_size, hidden_size)
        
        # Again, we use a GRU layer as before.
        self.gru = nn.GRU(hidden_size, hidden_size)
        
        # To produce the output word, the mechanism is reminiscent of a FFNN: a linear layer and a log softmax to 
        # generate probabilities for each word.
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        
        # First generate the embedding of the word.
        # Our input word initially is `SOS` (start of sentence)
        output = self.embedding(input).view(1, 1, -1)
        
        # Introduce a layer of non-linearity with an element-wise ReLU.
        output = F.relu(output)
        
        # The 'meat' of the decoder - the GRU layer.
        output, hidden = self.gru(output, hidden)
        
        # We have our hidden state - let's now generate the prediction!
        output = self.softmax(self.out(output[0]))
        return output, hidden

    # As before, the initial hidden state is defined by default to be zeros, even though we'll end up using the encoder's
    # final hidden state.
    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)

Attention based decoders

We'll be focusing on attention-based decoders for the remainder of this exposition. Recall the main idea behind them: we preserve all of the encoder's hidden states and use the decoder's own hidden state as a weighting mechanism that tells us which source words to focus on when producing each next word.

attention-decoder-network.png

In [7]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        # The hidden state of the decoder will be used to compute the attention weights.
        # The input word from the target language is embedded at each timestep...
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        
        # ... then fed into a dropout layer that has empirically been found to reduce overfitting.
        self.dropout = nn.Dropout(self.dropout_p)
        
        # To actually calculate the attention weights, we'll use a FFNN where our feature vectors
        # are the concatenations of the embeddings from before and the previous hidden state.
        # The attention weights have `max_length` entries.
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        
        # When we finally have the weights applied on the encoder's output, we have our context vector. 
        # We combine it with the embedding of the previous target word and after a last FFNN-GRU-FFNN 
        # architecture we have both our hidden state for the next iteration and our output target word!
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        
        # `input` is an index in the target language vocabulary. 
        # Generate the embedding and mask it through the drop out.
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # Now we generate the attention weights and apply them to the hidden states of the encoder.
        # `attn_weights` is a `max_length` vector.
        # `encoder_outputs` is typically a |source sequence length| x `hidden_size` tensor,
        # but we pad it with zeros to `max_length` rows so the BMM (Batch Matrix-Matrix product) shapes always work out.
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        
        # attn_applied ends up having a length of `hidden_size`!

        # Concatenate the applied weights and the embedding and combine them through another linear layer.
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        output = self.attn_combine(output).unsqueeze(0)

        # Finally, our RNN GRU layer will give us the output (which will become the translated word) and the next 
        # timestep's hidden layer.
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)
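In the same hypothetical spirit, a single attention-decoder step consumes a target-word index, the previous hidden state and the (zero-padded) matrix of encoder outputs, and returns log-probabilities over the target vocabulary plus the attention weights. The names below are illustrative only; the real decoding loops appear in the training and evaluation code.

toy_decoder = AttnDecoderRNN(hidden_size=256, output_size=output_lang.n_words).to(device)
toy_hidden = toy_decoder.initHidden()                              # 1 x 1 x 256
toy_enc_outputs = torch.zeros(MAX_LENGTH, 256, device=device)      # padded encoder states
toy_input = torch.tensor([[SOS_token]], device=device)             # decoding starts with SOS

log_probs, toy_hidden, attn = toy_decoder(toy_input, toy_hidden, toy_enc_outputs)
# log_probs: 1 x output_lang.n_words,  attn: 1 x MAX_LENGTH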

Training

Let's train our model!

A couple of things to keep in mind as we do this:

  • We initially give the decoder the SOS token as its first input, and the last hidden state of the encoder as its first hidden state.
  • We use teacher forcing, a technique that uses the real target outputs as the next input to the decoder rather than the decoder's guesses. This has both good and bad consequences, but it helps the model converge faster.
In [8]:
# Some helper functions with evaluation and data processing...

import time
import math
import matplotlib.pyplot as plt
import numpy as np

# Seconds -> Minutes
def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

# Returns the time elapsed since `since` and an estimate of the time remaining, given the fraction of work completed.
def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))

# Extracts indices from a sentence
def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]

# Creates a tensor of indices from a sentence.
def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

# Creates a pair of tensors of indices from a pair of sentences.
def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)
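As a quick hypothetical usage example (picking a random pair from the filtered training set, so every word is guaranteed to be in the vocabulary), tensorFromSentence maps each word to its index and appends the EOS token:

example_pair = random.choice(pairs)                      # [French sentence, English sentence]
example_tensor = tensorFromSentence(input_lang, example_pair[0])
print(example_tensor.shape)   # (number of words + 1) x 1 -- the +1 is the appended EOS token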
In [9]:
teacher_forcing_ratio = 0.5

# This function will train our model for *one* training example.
def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion, max_length=MAX_LENGTH):
    
    # Initialize the hidden layer of the encoder to all zeroes. 
    encoder_hidden = encoder.initHidden()

    # Reset the gradients accumulated by both optimizers.
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # `input_tensor` and `target_tensor` represent the sequences of words to translate between.
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    # Initialize the encoder outputs matrix. As mentioned above, only the first `input_length` rows get filled;
    # the rest stay zero so that every example has the same `max_length` x `hidden_size` shape.
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    # Keep track of the loss.
    loss = 0

    # Go through all timesteps (words in the input sequence) and run the encoder.
    for ei in range(input_length):
        # The returned hidden state is fed back in at the next iteration.
        encoder_output, encoder_hidden = encoder(
            input_tensor[ei], encoder_hidden)
        
        # The output (the hidden state at this timestep) is stored as a row of the encoder outputs matrix.
        encoder_outputs[ei] = encoder_output[0, 0]

    # The initial token to give to the decoder is the SOS token.
    decoder_input = torch.tensor([[SOS_token]], device=device)

    # The initial hidden state of the decoder is the final state of the encoder.
    decoder_hidden = encoder_hidden

    # Decide how much teacher forcing to use.
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            
            # For each target word, run the decoder with the input, its previous hidden layer and the 
            # encoder outputs.
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            
            # See how we did.
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            # See how we did.
            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    # Backpropagate to compute the gradients.
    loss.backward()

    # Step through the optimizers
    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length
In [10]:
# End-to-end training!
def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    # Stochastic gradient descent with negative log-likelihood loss.
    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    criterion = nn.NLLLoss()

    # For each iteration (one randomly chosen training pair)...
    for iter in range(1, n_iters + 1):
        
        # Collect the input and target sentence tensors (aka French -> English)
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        # Calculate the loss for this training example.
        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        # Print some progress stuff.
        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        # Gather data for our plot.
        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
            
    # Show our training plot. 
    plt.plot(plot_losses)
    plt.show()

Evaluation

Let's evaluate our model by seeing how it does on our dataset. There are plenty of more rigorous evaluation setups, such as cross-validation or a held-out test set, but we won't be very pedantic here.

In [11]:
# Evaluating our model on a sentence. This basically runs a forward pass through the encoder and decoder
# and spits out the decoded words!
def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]
        encoder_hidden = encoder.initHidden()

        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            decoder_attentions[di] = decoder_attention.data
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]
    
# We'll evaluate sentences from the training set randomly.
def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')
In [15]:
# Moment of truth!
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)

trainIters(encoder1, attn_decoder1, 50000, print_every=5000)
1m 50s (- 16m 31s) (5000 10%) 2.8606
3m 35s (- 14m 20s) (10000 20%) 2.2917
5m 20s (- 12m 28s) (15000 30%) 1.9570
7m 7s (- 10m 41s) (20000 40%) 1.7184
8m 54s (- 8m 54s) (25000 50%) 1.5189
10m 41s (- 7m 7s) (30000 60%) 1.3428
12m 28s (- 5m 20s) (35000 70%) 1.2049
14m 16s (- 3m 34s) (40000 80%) 1.0803
16m 46s (- 1m 51s) (45000 90%) 0.9977
19m 10s (- 0m 0s) (50000 100%) 0.8790
[training-loss plot: the per-100-iteration average loss falls from roughly 4.6 at the start of training to below 1.0 after 50000 iterations]
In [17]:
evaluateRandomly(encoder1, attn_decoder1)
> j y vais maintenant .
= i m going there now .
< i m going now . <EOS>

> elle attend famille .
= she s pregnant .
< she is pregnant . <EOS>

> j ai honte du comportement de mon fils .
= i am ashamed of my son s conduct .
< i m ashamed of my my . <EOS>

> je viens du bresil .
= i am from brazil .
< i m from . . <EOS>

> tu n es pas aussi maligne que moi .
= you re not as smart as me .
< you re not as smart as me . <EOS>

> je vais piquer un somme .
= i m going to go take a nap .
< i m going to take a a . <EOS>

> vous etes extraverti .
= you re extroverted .
< you re stuck . <EOS>

> je suis habitue a ce probleme .
= i am familiar with this subject .
< i m used to the . . <EOS>

> elle cherche un meilleur emploi .
= she is after a better job .
< she s looking for a job . <EOS>

> vous n etes pas contusionnee .
= you re not bruised .
< you re not bruised . <EOS>

Visualizing Attention

Just out of curiosity, let's examine the attention weights at each decoding timestep for a specific translation task. That intuitively shows us which encoder hidden states the network learned to focus on when translating each word.

In [18]:
output_words, attentions = evaluate(
    encoder1, attn_decoder1, "je suis trop froid .")
plt.matshow(attentions.numpy())
Out[18]:
<matplotlib.image.AxesImage at 0x169913c1070>
In [20]:
# Let's add some extra finery for the viewing experience.
import matplotlib.ticker as ticker

def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(
        encoder1, attn_decoder1, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


evaluateAndShowAttention("elle a cinq ans de moins que moi .")

evaluateAndShowAttention("elle est trop petit .")

evaluateAndShowAttention("je ne crains pas de mourir .")

evaluateAndShowAttention("c est un jeune directeur plein de talent .")
input = elle a cinq ans de moins que moi .
output = she s five years younger than i am . <EOS>
C:\Users\themi\AppData\Local\Temp\ipykernel_1428\4028537085.py:12: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_xticklabels([''] + input_sentence.split(' ') +
C:\Users\themi\AppData\Local\Temp\ipykernel_1428\4028537085.py:14: UserWarning: FixedFormatter should only be used together with FixedLocator
  ax.set_yticklabels([''] + output_words)
input = elle est trop petit .
output = she s too trusting . <EOS>
input = je ne crains pas de mourir .
output = i m not scared of dying . <EOS>
input = c est un jeune directeur plein de talent .
output = he s an intelligent young man . <EOS>