There’s a whole class of NLP tasks that rely on sequential output, or outputs that are sequences of potentially varying length. For example,

Translation: taking a sentence in one language as input and outputting the same sentence in another language.
Conversation: taking a statement or question as input and responding to it.
Summarization: taking a large body of text as input and outputting a summary of it.

Therefore, in this post, sequence-to-sequence models are introduced, a deep learning-based framework for handling these types of problems.

The Seq2Seq Model

The sequence-to-sequence model is an example of a Conditional Language Model.

Language Model because the decoder is predicting the next word of the target sentence y
Conditional because its predictions are also conditioned on the source sentence x

NMT directly calculates $P(y | x)$,

$P(y | x)=P\left(y_{1} | x\right) P\left(y_{2} | y_{1}, x\right) P\left(y_{3} | y_{1}, y_{2}, x\right) \ldots P\left(y_{T} | y_{1}, \ldots, y_{T-1}, x\right)$

Here $P\left(y_{T} | y_{1}, \ldots, y_{T-1}, x\right)$ is the probability of next target word given target words so far and source sentence $x$.

creen Shot 2019-08-05 at 9.59.19 P

Output decoding

Greedy decoding

When generating next word, we always take the most probable word on next step and we stop untl the model produces a <END> token.

But this method has no way to undo decisions.

creen Shot 2019-08-05 at 10.04.18 P

Beam search decoding

Core idea

On each step of decoder, keep track of the k most probable partial translations (which we call hypotheses). Here k is the beam size.

A hypothesis has a score which is its log probability:

$\operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right)$

Scores are all negative, and higher score is better
We search for high-scoring hypotheses, tracking top k on each step

Stop criterion

creen Shot 2019-08-05 at 10.24.32 P

Selection criterion

We have our list of completed hypotheses, but How to select top one with highest score?

Say if we choose the hypothese with highest score, which is computed by:

$\operatorname{score}\left(y_{1}, \ldots, y_{t}\right)=\log P_{\mathrm{LM}}\left(y_{1}, \ldots, y_{t} | x\right)=\sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right)$

Then it will tend to choose the shorter sentence, which usually has a higher score.

In this case, we fix it by normalizing by length:

$\frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}\left(y_{i} | y_{1}, \ldots, y_{i-1}, x\right)$

and use this to select top one.

Example

Say we set k=2. We start generation with <START> token. Suppose the top two candidates are he and I:

creen Shot 2019-08-05 at 10.15.39 P

Then we keep track of the two lines to generate next words:

creen Shot 2019-08-05 at 10.16.44 P

The new score is parent’s score plus current probablity. Here we have four candidates, but since our k is 2, we choose top 2 among these 4 candidates.

creen Shot 2019-08-05 at 10.18.38 P

creen Shot 2019-08-05 at 10.18.54 P

creen Shot 2019-08-05 at 10.19.21 P

Machine Translation Evaluation

BLEU (Bilingual Evaluation Understudy) is the one we use for evaluation. BLEU compares the machine-written translation to one or several human-written translation(s), and computes a similarity score based on:

n-gram precision (usually for 1, 2, 3 and 4-grams)
Plus a penalty for too-short system translations

BLEU is useful but imperfect:

There are many valid ways to translate a sentence
So a good translation can get a poor BLEU score because it
has low n-gram overlap with the human translation

The Attention Mechanism

Problem in Seq2Seq

The fixed-length vector carries the burden of encoding the the entire “meaning” of the input sequence, no matter how long that may be. With all the variance in language, this is a very hard problem.

creen Shot 2019-08-05 at 10.35.49 PM-506257

Core ideas

creen Shot 2019-08-05 at 10.39.50 P

creen Shot 2019-08-05 at 10.43.43 P

creen Shot 2019-08-05 at 10.44.10 P

In decoder, insted of just use the hidden state at the first step of the decoder, we operate the dot product between the decoder hidden state with the encoder hidden state of each step in encoder. So that we can get several scalar scores.

creen Shot 2019-08-05 at 10.44.43 P

Once we have these scores, we use a softmax function to turn the scores into a distribution.

creen Shot 2019-08-05 at 10.46.38 P

Once we have the attention output, we use the attention distribution to take a weighted sum of the encoder hidden states. creen Shot 2019-08-05 at 10.48.26 P

Finally Concatenate weighted sum of the encoder hidden states with decoder hidden state, then use to compute $\widehat{y}_{1}$ as before.

creen Shot 2019-08-05 at 10.50.52 P

creen Shot 2019-08-05 at 10.51.04 P

Attention in Equations

Remember that our seqzseq model is made of two parts, an encoder that encodes the input sentence, and a decoder that leverages the information extracted by the decoder to produce the translated sentence. Basically, our input is a sequence of words $x_1,…, x_n$ that we want to translate, and our target sentence is a sequence of words $y_1,…,y_m$.

Encoder

Let $(h_1,., h_n)$ be the hidden vectors representing the input sentence. These vectors are the output of a BI-LSTM for instance, ane capture contextual representation of each word in the sentence
Decoder

We want to compute the hidden states $s_i$ of the decoder using a recursive formula of the form

$s_i =f (s_{i-1}, y_{i-1}, c_i)$

where $s_{i-1}$ is the previous hidden vector, $y_{i-1}$ is the generated word at the previous step, and $c_i$ is a context vector that capture the context from the original sentence that is relevant to the time step i of the decoder. The context vector $c_i$ captures relevant information for the i-th decoding time step (unlike the standard Seg2 seq in which theres only one context vector). For each hidden vector from the original sentence hi, compute a score

$e_{i,i}= a (s_{i-1}, h_i)$

where a is any function, for instance a single layer fully-connected neural network. Then we end up with a sequence of scalar value $e_{i,1},…, e_{i,n}$. Normalize these scores into a vector $\alpha_{i}=\left(\alpha_{i, 1}, \ldots, \alpha_{i, n}\right)$, using a softmax layer.

$\alpha_{i, j}=\frac{\exp \left(e_{i, j}\right)}{\sum_{k=1}^{n} \exp \left(e_{i, k}\right)}$

Then, compute the context vector $c_i$ as the weighted average of the hidden vectors from the original sentence

$c_{i}=\sum_{j=1}^{n} \alpha_{i, j} h_{j}$

Intuitively, this vector captures the relevant contextual information from the original sentence for the i-th step of the decoder.

creen Shot 2019-08-05 at 10.52.05 P

Attention Variants

creen Shot 2019-08-05 at 10.52.34 P

creen Shot 2019-08-05 at 10.52.43 P

Conncetion with translation alignment

The attention-based model learns to assign significance to different parts of the input for each step of the output. In the context of translation, attention can be thought of as “alignment.” Bahdanau et al. argue that the attention scores aij at decoding step i signify the words in the source sentence that align with word i in the target. Noting this, we can use attention scores to build an alignment table – a table mapping words in the source to corresponding words in the target sentence – based on the learned encoder and decoder from our Seq2Seq NMT system.

creen Shot 2019-07-13 at 5.50.09 P

NLP-Seq2Seq