Recurrent Neural Networks, which are a type of artificial neural network designed to recognize patterns in sequences of data, such as text, genomes, handwriting, the spoken word, or numerical times series data emanating from sensors, stock markets and government agencies. These algorithms take time and sequence into account, they have a temporal dimension.

Problems with Vanilla NN link

A glaring limitation of Vanilla Neural Networks (and also Convolutional Networks) is that their API is too constrained: they accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes).

iags-157883

Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue and green vectors hold the RNN’s state (more on this soon). From left to right: (1) Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case are no pre-specified constraints on the lengths sequences because the recurrent transformation (green) is fixed and can be applied as many times as we like.

RNN

At a high level, a recurrent neural network (RNN) processes sequences — whether daily stock prices, sentences, or sensor measurements — one element at a time while retaining a memory (called a state) of what has come previously in the sequence.

A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation

given an observation sequence $\mathbf{x}=\left\{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{T}\right\}$ and its corresponding label $y=\left\{y_{1}, y_{2}, \ldots, y_{T}\right\}$, we want to learn a map $f : \mathbf{x} \mapsto y$.

creen Shot 2019-06-19 at 12.15.53 P

Suppose that we have the following RNN model, such that

$\begin{array}{l}{\mathbf{h}_{t}=\tanh \left(W_{h h} \mathbf{h}_{t-1}+W_{x h} \mathbf{x}_{t}+\mathbf{b}_{\mathbf{h}}\right)} \\ {z_{t}=\operatorname{softmax}\left(W_{h z} \mathbf{h}_{t}+\mathbf{b}_{z}\right)}\end{array}$

where $z_t$ is the prediction at the time step $t$.

We can minimize the negative log likelihood objective function:

$\mathcal{L}(\mathrm{x}, \mathrm{y})=-\sum_{t} y_{t} \log z_{t}$

In the following, we will use notation $L$ as the objective function for simplicity. And further we will use $L(t + 1)$ to indicate the output at the time step t + 1, s.t. $L(t + 1) = -y_{t+1}logz_{t+1}$.

Let’s set $\alpha_{t}=W_{h z} \mathbf{h}_{t}+\mathbf{b}_{z}$ and then we have $z_t=softmax(\alpha_t)$. By taking the derivative with respect to $ \alpha_t$, we get the following:

$\frac{\partial \mathcal{L}}{\partial \alpha_{t}}=-\left(y_{t}-z_{t}\right)$

Note the weight $W_{hz}$ is shared across all time sequence, thus we can dierentiate to it at each time

step and sum all together

$\frac{\partial \mathcal{L}}{\partial W_{h z}}=\sum_{t} \frac{\partial \mathcal{L}}{\partial z_{t}} \frac{\partial z_{t}}{\partial W_{h z}}$

Similarly, we can get the gradient w.r.t. bias $b_z$

$\frac{\partial \mathcal{L}}{\partial b_{z}}=\sum_{t} \frac{\partial \mathcal{L}}{\partial z_{t}} \frac{\partial z_{t}}{\partial b_{z}}$

Now let’s go through the details to derive the gradient w.r.t. $W_{hh}$. Considering at the time step $t \to t+1$ in figure1,

$\frac{\partial \mathcal{L}(t+1)}{\partial W_{h h}}=\frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{h h}}$

where we only consider one step $t \to t+1$. And because the hidden state $h_{t+1}$ partially dependents

on $h_t$, so we can use backpropagation to compute the above partial derivative. Think further $W_{hh}$ is shared cross the whole time sequence. Thus, at the time step $(t -1) \to t$, we can further get the partial derivative w.r.t. $W_{hh}$ as follows

Thus, at the time step $t+1$, we can compute gradient w.r.t. $z_{t+1}$ and further use backpropagation

through time (BPTT) from t to 0 to calculate gradient w.r.t. $W_{hh}$, shown as the red chain in Fig. 1.

Thus, if we only consider the output $z_{t+1}$ at the time step $t+1$, we can yield the following gradient

w.r.t. $W_{hh}$

$\frac{\partial \mathcal{L}(t+1)}{\partial W_{h h}}=\sum_{k=1}^{t} \frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{h h}}$

Aggregate the gradients w.r.t. $W_{hh}$ over the whole time sequence with back propagation, we can

finally yield the following gradient w.r.t. $W_{hh}$

$\frac{\partial \mathcal{L}}{\partial W_{h h}}=\sum_{t} \sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{h h}}$

Now we turn to derive the gradient w.r.t. $W_{xh}$. Similarly, we consider the time step $t + 1$ (only

contribution from $x_{t+1}$) and calculate the gradient w.r.t. to $W_{xh}$ as follows

$\frac{\partial \mathcal{L}(t+1)}{\partial W_{x h}}=\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{x h}}$

Because $h_t$ and $x_{t+1}$ both make contribution to $h_{t+1}$, we need to backpropagte to $h_t$ as well. If we

consider the contribution from the time step $t$, we can further get

$\begin{aligned} & \frac{\partial \mathcal{L}(t+1)}{\partial W_{x h}}=\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{x h}}+\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t}} \frac{\partial \mathbf{h}_{t}}{\partial W_{x h}} \\=& \frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial W_{x h}}+\frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{t}} \frac{\partial \mathbf{h}_{t}}{\partial W_{x h}} \end{aligned}$

Thus, summing up all contributions from $t$ to 0 via backpropagation, we can yield the gradient at

the time step $t + 1$

$\frac{\partial \mathcal{L}(t+1)}{\partial W_{x h}}=\sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{x h}}$

Further, we can take derivative w.r.t. $W_{xh}$ over the whole sequence as

$\frac{\partial \mathcal{L}}{\partial W_{x h}}=\sum_{t} \sum_{k=1}^{t+1} \frac{\partial \mathcal{L}(t+1)}{\partial z_{t+1}} \frac{\partial z_{t+1}}{\partial \mathbf{h}_{t+1}} \frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}} \frac{\partial \mathbf{h}_{k}}{\partial W_{x h}}$

However, there are gradient vanishing or exploding problems to RNNs. Notice that $\frac{\partial \mathbf{h}_{t+1}}{\partial \mathbf{h}_{k}}$ indicates matrix multiplication over the sequence. Because RNNs need to backpropagate gradients over a long sequence (with small values in the matrix multiplication), gradient value will shrink layer over layer, and eventually vanish after a few time steps. Thus, the states that are far away from the current time step does not contribute to the parameters’ gradient computing (or parameters that RNNs is learning). Another direction is the gradient exploding, which attributed to large values in matrix multiplication.

Forward propagation

042406-20170306142253375-17597177

$x$: input, $h$: hidden layer, $o$: output, $y$: target label, $L$: loss function, $t$: time

At time t, we have hidden state:

$h^{(t)}=\phi(Ux^{(t)}+Wh^{(t)}+b)$

where $\phi()$ is activation function, typically $tanh()$, and $b$ is the bias.

The output is at time t is:

$o^{(t)}=Vh^{(t)}+c$

Then the prediction is:

$\hat{y}^{(t)}=\sigma(o^{(t)})$

where $\sigma()$ is the activation function, typically $softmax()$.

Back Propagation through Time

ref ref2 ref4

Let’s quickly recap the basic equations of our RNN.

$\begin{aligned} s_{t} &=\tanh \left(U x_{t}+W s_{t-1}\right) \\ \hat{y}_{t} &=\operatorname{softmax}\left(V s_{t}\right) \end{aligned}$

We also defined our loss, or error, to be the cross entropy loss, given by:

$\begin{aligned} E_{t}\left(y_{t}, \hat{y}_{t}\right) &=-y_{t} \log \hat{y}_{t} \\ E(y, \hat{y}) &=\sum_{t} E_{t}\left(y_{t}, \hat{y}_{t}\right) \\ &=-\sum_{t} y_{t} \log \hat{y}_{t} \end{aligned}$

Here, $y_t$ is the correct word at time step $t$, and $\hat y_t$ is our prediction. We typically treat the full sequence (sentence) as one training example, so the total error is just the sum of the errors at each time step (word).

nn-bptt

Remember that our goal is to calculate the gradients of the error with respect to our parameters $U$, $V$ and $W$.

$\begin{aligned} \frac{\partial E_{3}}{\partial V} &=\frac{\partial E_{3}}{\partial \hat{y}_{3}} \frac{\partial \hat{y}_{3}}{\partial V} \\ &=\frac{\partial E_{3}}{\partial \hat{y}_{3}} \frac{\partial \hat{y}_{3}}{\partial z_{3}} \frac{\partial z_{3}}{\partial V} \\ &=\left(\hat{y}_{3}-y_{3}\right) \otimes s_{3} \end{aligned}$

In the above, $z_{3}=V s_{3}$ and $\oplus$ is the outer product of two vectors. We can find that $\frac{\partial E_{3}}{\partial V}$ only depends on the values at the current time step, $\hat{y}_{3}, y_{3}, s_{3}$.

But the story is different for $\frac{\partial E_{3}}{\partial W}$ (and for $U$). To see why, we write out the chain rule, just as above:

$\frac{\partial E_{3}}{\partial W}=\frac{\partial E_{3}}{\partial \hat{y}_{3}} \frac{\partial \hat{y}_{3}}{\partial s_{3}} \frac{\partial s_{3}}{\partial W}$

Now, note that $s_{3}=\tanh \left(U x_{t}+W s_{2}\right)$ depends on $s_2$, which depends on $W$ and $s_1$, and so on. So if we take the derivative with respect to $W$ we can’t simply treat $s_2$ as a constant! We need to apply the chain rule again and what we really have is this:

$\frac{\partial E_{3}}{\partial W}=\sum_{k=0}^{3} \frac{\partial E_{3}}{\partial \hat{y}_{3}} \frac{\partial \hat{y}_{3}}{\partial s_{3}} \frac{\partial s_{3}}{\partial s_{k}} \frac{\partial s_{k}}{\partial W}$

We sum up the contributions of each time step to the gradient. In other words, because $W$ is used in every step up to the output we care about, we need to backpropagate gradients from $t=3$ through the network all the way to $t=0$.

nn-bptt-with-gradient

Hand-Written RNN

hand-writtrn-deduction-1 hand-writtrn-deduction-2

Here I am going to give an simple problem about how RNN can be used to solve problems and aim to have a better understanding of how forward and backward propagation in RNN work.

Anyways here we go. The problem is very simple, we are going to use RNN to count how many ones there are in the given data.

$\begin{array}{l}{x=[ } \\ {[0,0,1]} \\ {[0,1,1]} \\ {[1,1,1]} \\ {\text { ]}}\end{array}\\ y = [1,2,3]$

As seen above, the training data is $x$ and $y$ is the groundtruth.

The corresponding RNN architecture looks like the following:

creen Shot 2019-06-18 at 9.36.36 P

which is the unrolled version of our network architecture.

For this network, we have two weight matrixs, $W_x$ and $W_{rec}$, and the forward propagation is:

$S_{1}=S_0 \cdot W_{r e c}+X_{1} \cdot W_{x} \\ S_{2}=S_1 \cdot W_{r e c}+X_{2} \cdot W_{x}\\ S_{3}=S_2 \cdot W_{r e c}+X_{3} \cdot W_{x}$

We define the MSE loss function:

$\cos t=\frac{1}{m}\left(S_{3}-y\right)^{2}$

Now lets perform back propagation through time. We have to get derivative respect to $W_x$ and $W_{rec}$ for each state.

State3:

$\frac{d \xi}{d W_{x}}=\frac{d \xi}{d S_{1}} \cdot \frac{d S_{3}}{d W_x}=\frac{2}{m}\left(S_{3}-y\right) \cdot x_{3}\\ \frac{d \xi}{d W_{r e c}}=\frac{d \xi}{d S_{1}} \cdot \frac{d S_{3}}{d W r e c}=\frac{2}{m}\left(S_{3}-y\right) \cdot S_{2}$

State2:

$\frac{d \xi}{d W_{x}}=\frac{d \xi}{d S_{1}} \cdot \frac{d S_{3}}{d S_2}\cdot \frac{d S_{2}}{d W_x}=\frac{2}{m}\left(S_{3}-y\right) \cdot W_{rec}\cdot x_{2}\\ \frac{d \xi}{d W_{rec}}=\frac{d \xi}{d S_{1}} \cdot \frac{d S_{3}}{d S_2}\cdot \frac{d S_{2}}{d W_{rec}}=\frac{2}{m}\left(S_{3}-y\right) \cdot W_{rec}\cdot S_{1}\\$

State1:

$\frac{d \xi}{d W_{x}}=\frac{d \xi}{d S_{1}} \cdot \frac{d S_{3}}{d S_2}\cdot \frac{d S_{2}}{d S_1}\cdot \frac{d S_{1}}{d W_x}=\frac{2}{m}\left(S_{3}-y\right) \cdot W_{rec}\cdot W_{rec}\cdot x_{2}\\ \frac{d \xi}{d W_{rec}}=\frac{d \xi}{d S_{1}} \cdot \frac{d S_{3}}{d S_2}\cdot \frac{d S_{2}}{d S_1}\cdot \frac{d S_{1}}{d W_{rec}}=\frac{2}{m}\left(S_{3}-y\right) \cdot W_{rec}\cdot W_{rec}\cdot S_{1}\\$

So that’s it! The simple math behind training RNN.

creen Shot 2019-06-18 at 9.53.34 P

Numpy Implementation

点击显/隐内容

def rnn_step_forward(x, prev_h, Wx, Wh, b):
    """
    Run the forward pass for a single timestep of a vanilla RNN that uses a tanh
    activation function.

    The input data has dimension D, the hidden state has dimension H, and we use
    a minibatch size of N.

    Inputs:
    - x: Input data for this timestep, of shape (N, D).
    - prev_h: Hidden state from previous timestep, of shape (N, H)
    - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
    - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
    - b: Biases of shape (H,)

    Returns a tuple of:
    - next_h: Next hidden state, of shape (N, H)
    - cache: Tuple of values needed for the backward pass.
    """
    next_h, cache = None, None
    ##############################################################################
    # TODO: Implement a single forward step for the vanilla RNN. Store the next  #
    # hidden state and any values you need for the backward pass in the next_h   #
    # and cache variables respectively.                                          #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    raw_h = x.dot(Wx) + prev_h.dot(Wh) + b
    next_h = np.tanh(raw_h)
    cache = (x, prev_h, Wx, Wh, raw_h, next_h)

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return next_h, cache

点击显/隐内容

def rnn_step_backward(dnext_h, cache):

    """

    Backward pass for a single timestep of a vanilla RNN.

Inputs:
- dnext_h: Gradient of loss with respect to next hidden state, of shape (N, H)
- cache: Cache object from the forward pass

Returns a tuple of:
- dx: Gradients of input data, of shape (N, D)
- dprev_h: Gradients of previous hidden state, of shape (N, H)
- dWx: Gradients of input-to-hidden weights, of shape (D, H)
- dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
- db: Gradients of bias vector, of shape (H,)
"""
dx, dprev_h, dWx, dWh, db = None, None, None, None, None
##############################################################################
# TODO: Implement the backward pass for a single step of a vanilla RNN.      #
#                                                                            #
# HINT: For the tanh function, you can compute the local derivative in terms #
# of the output value from tanh.                                             #
##############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

x, prev_h, Wx, Wh, raw_h, next_h = cache

draw_h = (1-next_h**2)*dnext_h
db = np.sum(draw_h,axis=0)
dx = draw_h.dot(Wx.T)
dWx = x.T.dot(draw_h)
dprev_h = draw_h.dot(Wh.T)
dWh = prev_h.T.dot(draw_h)

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
##############################################################################
#                               END OF YOUR CODE                             #
##############################################################################
return dx, dprev_h, dWx, dWh, db

点击显/隐内容

def rnn_forward(x, h0, Wx, Wh, b):

    """

    Run a vanilla RNN forward on an entire sequence of data. We assume an input

    sequence composed of T vectors, each of dimension D. The RNN uses a hidden

    size of H, and we work over a minibatch containing N sequences. After running

    the RNN forward, we return the hidden states for all timesteps.

Inputs:
- x: Input data for the entire timeseries, of shape (N, T, D).
- h0: Initial hidden state, of shape (N, H)
- Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
- Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
- b: Biases of shape (H,)

Returns a tuple of:
- h: Hidden states for the entire timeseries, of shape (N, T, H).
- cache: Values needed in the backward pass
"""
h, cache = None, None
##############################################################################
# TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
# input data. You should use the rnn_step_forward function that you defined  #
# above. You can use a for loop to help compute the forward pass.            #
##############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
N, T, D = x.shape
_, H = h0.shape
h = np.zeros((N,T,H))
cache = {}
prev_h = h0
for i in range(T):
    x_ = x[:,i,:]
    next_h, cache_ = rnn_step_forward(x_, prev_h, Wx, Wh, b)
    prev_h = next_h
    h[:,i,:] = next_h
    cache[i] = cache_

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
##############################################################################
#                               END OF YOUR CODE                             #
##############################################################################
return h, cache

点击显/隐内容

def rnn_backward(dh, cache):
    """
    Compute the backward pass for a vanilla RNN over an entire sequence of data.

    Inputs:
    - dh: Upstream gradients of all hidden states, of shape (N, T, H). 
    
    NOTE: 'dh' contains the upstream gradients produced by the 
    individual loss functions at each timestep, *not* the gradients
    being passed between timesteps (which you'll have to compute yourself
    by calling rnn_step_backward in a loop).

    Returns a tuple of:
    - dx: Gradient of inputs, of shape (N, T, D)
    - dh0: Gradient of initial hidden state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
    - db: Gradient of biases, of shape (H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    ##############################################################################
    # TODO: Implement the backward pass for a vanilla RNN running an entire      #
    # sequence of data. You should use the rnn_step_backward function that you   #
    # defined above. You can use a for loop to help compute the backward pass.   #
    ##############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
    N, T, H = dh.shape
    x = cache[0][0]
    _,D  = x.shape
    dx = np.zeros((N, T, D))
    dh0 = np.zeros((N, H))
    dWx = np.zeros((D, H))
    dWh = np.zeros((H, H))
    db = np.zeros(H)
    
    dprev_h = np.zeros((N, H))
    for i in reversed(range(T)):
        dnext_h = dh[:,i,:] + dprev_h
        dx[:, i, :], dprev_h, dWx_, dWh_, db_ = rnn_step_backward(dnext_h, cache[i])
        dWx += dWx_
        dWh += dWh_
        db += db_
    dh0 = dprev_h
        

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################
    return dx, dh0, dWx, dWh, db

RNN Vanishing Gradients Problem

What is gradients vanishing?

creen Shot 2019-08-04 at 5.58.02 P

creen Shot 2019-08-04 at 6.01.14 P

Why it is a problem?

The information from long-term timesteps is gone.

creen Shot 2019-08-04 at 6.11.11 P

If the gradient becomes vanishingly small over longer distances (step t to step t+n), then we can’t tell whether:

There’s no dependency between step t and t+n in the data
We have wrong parameters to capture the true dependency between t and t+n

creen Shot 2019-08-04 at 6.18.31 P

Long Short Term Memory networks

It is usually just called “LSTMs” – are a special kind of RNN, capable of learning long-term dependencies.

All recurrent neural networks have the form of a chain of repeating modules of neural network. In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

STM3-SimpleRN

LSTMs also have this chain like structure, but the repeating module has a different structure. Instead of having a single neural network layer, there are four, interacting in a very special way.

STM3-chai

For now, let’s just try to get comfortable with the notation we’ll be using.

STM2-notatio

In the above diagram, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent pointwise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denote its content being copied and the copies going to different locations.

The core idea behind LSTMs

The key to LSTMs is the cell state, the horizontal line running through the top of the diagram.

The cell state is kind of like a conveyor belt. It runs straight down the entire chain, with only some minor linear interactions. It’s very easy for information to just flow along it unchanged.

STM3-C-lin

The LSTM does have the ability to remove or add information to the cell state, carefully regulated by structures called gates.

Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation. STM3-gat

The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!” An LSTM has three of these gates, to protect and control the cell state.

Step-by-Step LSTM Walk Through

The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at ht−1ht−1and xtxt, and outputs a number between 00 and 11 for each number in the cell state Ct−1Ct−1. A 11represents “completely keep this” while a 00 represents “completely get rid of this.”

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

STM3-focus-

The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde C$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

STM3-focus-

It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t* \tilde C_{t}$.

This is the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.

STM3-focus-

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.

STM3-focus-o-207940

Backpropagation

Recall that the forward pass of LSTM is like:

creen Shot 2019-07-01 at 4.30.58 P

The unrolled network during the forward pass is shown above. The cell state at time T, $c^T$ is responsible for computing h as well as the next cell state $c^{T+1}$. At each time step, the cell output h is shown to be passed to some more layers on which cost function C is computed, as the way an LSTM would be used in a typical application like captioning or language modeling. source

creen Shot 2019-07-01 at 4.34.08 P

The unrolled network during the backward pass is shown below. All the arrows in the previous slide have now changed their direction. The cell state at time T, $c^{T}$ receive gradients from $h^T$ as well as the next cell state $c^{T+1}$.

Note that for the last node, since it dosen’t have a next time stamp, it receive no gradients from $c^{T+1}$ and $h^T$, which means $d_{next_state} = 0$ and $d_{next_hidden}=0$.

Numpy Implementation

点击显/隐内容

def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):

    """

    Forward pass for a single timestep of an LSTM.
The input data has dimension D, the hidden state has dimension H, and we use
a minibatch size of N.

Note that a sigmoid() function has already been provided for you in this file.

Inputs:
- x: Input data, of shape (N, D)
- prev_h: Previous hidden state, of shape (N, H)
- prev_c: previous cell state, of shape (N, H)
- Wx: Input-to-hidden weights, of shape (D, 4H)
- Wh: Hidden-to-hidden weights, of shape (H, 4H)
- b: Biases, of shape (4H,)

Returns a tuple of:
- next_h: Next hidden state, of shape (N, H)
- next_c: Next cell state, of shape (N, H)
- cache: Tuple of values needed for backward pass.
"""
next_h, next_c, cache = None, None, None
#############################################################################
# TODO: Implement the forward pass for a single timestep of an LSTM.        #
# You may want to use the numerically stable sigmoid implementation above.  #
#############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

N,D = x.shape
_,H = prev_h.shape

gates = x.dot(Wx) + prev_h.dot(Wh) + b # (N, 4H)
gate_i = sigmoid(gates[:,:H]) # (N,H)
gate_f = sigmoid(gates[:,H:2*H])
gate_o = sigmoid(gates[:,2*H:3*H])
gate_g = np.tanh(gates[:,3*H:])

next_c = prev_c*gate_f + gate_g*gate_i
next_h = gate_o*np.tanh(next_c)

cache = (x, prev_h, prev_c, Wx, Wh, gate_i, gate_f, gate_o, gate_g, next_c)

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
##############################################################################
#                               END OF YOUR CODE                             #
##############################################################################

return next_h, next_c, cache

点击显/隐内容

def lstm_step_backward(dnext_h, dnext_c, cache):
    """
    Backward pass for a single timestep of an LSTM.

    Inputs:
    - dnext_h: Gradients of next hidden state, of shape (N, H)
    - dnext_c: Gradients of next cell state, of shape (N, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data, of shape (N, D)
    - dprev_h: Gradient of previous hidden state, of shape (N, H)
    - dprev_c: Gradient of previous cell state, of shape (N, H)
    - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dprev_h, dprev_c, dWx, dWh, db = None, None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for a single timestep of an LSTM.       #
    #                                                                           #
    # HINT: For sigmoid and tanh you can compute local derivatives in terms of  #
    # the output value from the nonlinearity.                                   #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    
    (x, prev_h, prev_c, Wx, Wh, gate_i, gate_f, gate_o, gate_g, next_c) = cache
    N,H = dnext_h.shape
    _,D = x.shape
    
    dgate_o = dnext_h*np.tanh(next_c)
    dtanh_next_c = dnext_h*gate_o #(N, H)
    
    dnext_c_ = dtanh_next_c*(1-np.tanh(next_c)**2)
    dnext_c = dnext_c_ + dnext_c
    
    dgate_f = dnext_c*prev_c
    dprev_c = dnext_c*gate_f
    
    dgate_g = dnext_c*gate_i
    dgate_i = dnext_c*gate_g
    
    dgate_g = dgate_g*(1-gate_g**2)
    dgate_o = dgate_o*gate_o*(1-gate_o)
    dgate_f = dgate_f*gate_f*(1-gate_f)
    dgate_i = dgate_i*gate_i*(1-gate_i)
    
    dgate = np.concatenate([dgate_i,dgate_f,dgate_o,dgate_g],axis=1)
    db = np.sum(dgate,axis=0)
    dx = dgate.dot(Wx.T)
    dprev_h = dgate.dot(Wh.T)
    dWx = x.T.dot(dgate)
    dWh = prev_h.T.dot(dgate)
    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return dx, dprev_h, dprev_c, dWx, dWh, db

点击显/隐内容

def lstm_forward(x, h0, Wx, Wh, b):
    """
    Forward pass for an LSTM over an entire sequence of data. We assume an input
    sequence composed of T vectors, each of dimension D. The LSTM uses a hidden
    size of H, and we work over a minibatch containing N sequences. After running
    the LSTM forward, we return the hidden states for all timesteps.

    Note that the initial cell state is passed as input, but the initial cell
    state is set to zero. Also note that the cell state is not returned; it is
    an internal variable to the LSTM and is not accessed from outside.

    Inputs:
    - x: Input data of shape (N, T, D)
    - h0: Initial hidden state of shape (N, H)
    - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
    - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
    - b: Biases of shape (4H,)

    Returns a tuple of:
    - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
    - cache: Values needed for the backward pass.
    """
    h, cache = None, None
    #############################################################################
    # TODO: Implement the forward pass for an LSTM over an entire timeseries.   #
    # You should use the lstm_step_forward function that you just defined.      #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    cache = {}
    N, T, D = x.shape
    _, H = h0.shape
    h = np.zeros((N,T,H))
    prev_h = h0
    prev_c = np.zeros_like(h0)
    for t in range(T):
        prev_h, prev_c, cache[t] = lstm_step_forward(x[:,t,:], prev_h, prev_c, Wx, Wh, b)
        h[:,t,:] = prev_h

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return h, cache

点击显/隐内容

def lstm_backward(dh, cache):
    """
    Backward pass for an LSTM over an entire sequence of data.]

    Inputs:
    - dh: Upstream gradients of hidden states, of shape (N, T, H)
    - cache: Values from the forward pass

    Returns a tuple of:
    - dx: Gradient of input data of shape (N, T, D)
    - dh0: Gradient of initial hidden state of shape (N, H)
    - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H)
    - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H)
    - db: Gradient of biases, of shape (4H,)
    """
    dx, dh0, dWx, dWh, db = None, None, None, None, None
    #############################################################################
    # TODO: Implement the backward pass for an LSTM over an entire timeseries.  #
    # You should use the lstm_step_backward function that you just defined.     #
    #############################################################################
    # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

    N, T, H = dh.shape
    _, D = cache[0][0].shape
    dx = np.zeros((N,T,D))
    dWx = np.zeros((D,4*H))
    dWh = np.zeros((H,4*H))
    db = np.zeros((4*H))
    dprev_c = np.zeros((N, H)) 
    dprev_h = np.zeros((N, H)) 
    for t in range(T-1,-1,-1):
        dnext_h = dprev_h + dh[:,t,:]
        dnext_c = dprev_c
        dx[:,t,:], dprev_h, dprev_c, dWx_, dWh_, db_ = lstm_step_backward(dnext_h, dnext_c, cache[t])
        dWx += dWx_
        dWh += dWh_
        db += db_
    dh0 = dprev_h

    # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
    ##############################################################################
    #                               END OF YOUR CODE                             #
    ##############################################################################

    return dx, dh0, dWx, dWh, db

GRU

So now we know how an LSTM work, let’s briefly look at the GRU. The GRU is the newer generation of Recurrent Neural networks and is pretty similar to an LSTM. GRU’s got rid of the cell state and used the hidden state to transfer information. It also only has two gates, a reset gate and update gate.

creen Shot 2019-06-20 at 12.39.27 P

Reset Gate

creen Shot 2019-06-20 at 12.49.39 P

$\mathbf{R}_{t}=\sigma\left(\mathbf{X}_{t} \mathbf{W}_{x r}+\mathbf{H}_{t-1} \mathbf{W}_{h r}+\mathbf{b}_{r}\right)$ $\mathbf{Z}_{t}=\sigma\left(\mathbf{X}_{t} \mathbf{W}_{x z}+\mathbf{H}_{t-1} \mathbf{W}_{h z}+\mathbf{b}_{z}\right)$

creen Shot 2019-06-20 at 12.51.45 P

$\tilde{\mathbf{H}}_{t}=\tanh \left(\mathbf{X}_{t} \mathbf{W}_{x h}+\left(\mathbf{R}_{t} \odot \mathbf{H}_{t-1}\right) \mathbf{W}_{h h}+\mathbf{b}_{h}\right)$ $\mathbf{H}_{t}=\mathbf{Z}_{t} \odot \mathbf{H}_{t-1}+\left(1-\mathbf{Z}_{t}\right) \odot \tilde{\mathbf{H}}_{t}$

How LSTM/GRU Solve Vanishing Gradients

In order to understand this question, we need to introdcue shortcut firstly. In ResNet, the shortcut is like:

creen Shot 2019-08-04 at 8.51.23 P

which is $x^{(t)}=x^{(t-1)}+F(x^{(t-1)})$. The next output is the combination of two parts: shortcut($x^{(t-1)}$) and non-linear transformation of $x^{(t-1)}$. Because of the shortcut, there is always a “1” gradient flowing back in backpropagation. In some extent, it sovles the problem of vanishing gradient.

Now back to our fancy RNN, take GRU as example, the updating rules of GRU is:

creen Shot 2019-08-04 at 8.51.07 P

Let’s focus on how we get the new hidden state $h^{(t)}$ and let’s roll out the formulation:

$\boldsymbol{h}^{(t)}=\boldsymbol{h}^{(t-1)}+ \left(\tilde{\boldsymbol{h}}^{(t)}-\boldsymbol{h}^{(t-1)}\right) \circ \boldsymbol{u}^{(t)}$

which is very similar to the ResNet becasue there is a direct flow from previous state to the next state.

Keras - Learn the Alphabet

In this implementation, we are going to develop and constrast a number of different LSTM networks. The task we are going to perform is that given a letter of the alphabet, predict the next letter. This is a simple sequence prediction problem that once understood can be generalized to other sequence prediction problems like time series prediction and sequence classification.

Data Preparation

import numpy as np
from keras.models import Sequential
from keras import layers
from keras.utils import np_utils
from collections import defaultdict
np.random.seed(7)
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char2int = defaultdict(int)
int2char = defaultdict(str)
for idx,val in enumerate(alphabet):
    char2int[val] = idx
    int2char[idx] = val

Because neural network can only process number, we map the letters to integer value.

seq_length = 1
dataX = []
dataY = []
for i in range(0,len(alphabet)-seq_length):
    seq_in = alphabet[i:i+seq_length]
    seq_out = alphabet[i+seq_length]
    dataX.append([char2int[idx] for idx in seq_in])
    dataY.append([char2int[idx] for idx in seq_out])
    print(seq_in,'->',seq_out)

Here, we create our input dataset and corresponding output dataset; we use an input length of 1. Running the code will produce the following output:

A -> B
B -> C
C -> D
...
X -> Y
Y -> Z

Then, we need to reshape our data into a format expected by the LSTM networks, that is, [data_size,time_steps,feature_num]. Then we can normalize the input integers to the range [0,1]. Finally, we can think of this problem as a sequence classification task, where each of the 26 letters represents a different class. As such, we can convert the output (y) to a one hot encoding, using the Keras built-in function to_categorical().

input_shape = (len(dataX),seq_length,1)
X = np.reshape(dataX,input_shape)
X = X/float(len(alphabet))
y = np_utils.to_categorical(dataY)

One-Char to One-Char

Let’s start off by designing a simple LSTM to learn how to predict the next character in the alphabet given the context of just one character.

Let’s define an LSTM network with 32 units and an output layer with a softmax activation function for making predictions. Because this is a multi-class classification problem, we can use the log loss function (called “categorical_crossentropy” in Keras), and optimize the network using the ADAM optimization function.

model = Sequential()
model.add(LSTM(32,input_shape=(X.shape[1],X.shape[2])))
model.add(Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.fit(X,y,epochs=500,batch_size=1)

After we fit the model we can evaluate and summarize the performance on the entire training dataset.

1 2	scores = model.evaluate(X,y) print("Model Accuracy: %.2f%%"%(scores[1]*100))

We can then re-run the training data through the network and generate predictions, converting both the input and output pairs back into their original character format to get a visual idea of how well the network learned the problem.

for pattern in dataX:
    x = np.reshape(pattern,(1,len(pattern),1))
    x = x/float(len(alphabet))
    prediction = model.predict(x)
    index = np.argmax(prediction)
    result = int2char[index]
    seq_in = [int2char[id] for id in pattern]
    print(seq_in,'->',result)

Model Accuracy: 84.00%
['A'] -> B
['B'] -> C
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> W
['V'] -> Y
['W'] -> Z
['X'] -> Z
['Y'] -> Z

We can see that this problem is indeed difficult for the network to learn. The reason is, the poor LSTM units do not have any context to work with. Each input-output pattern is shown to the network in a random order and the state of the network is reset after each pattern (each batch where each batch contains one pattern). This is abuse of the LSTM network architecture, treating it like a standard multilayer Perceptron. Next, let’s try a different framing of the problem in order to provide more sequence to the network from which to learn.

A Three-Char Feature Window to One-Char Mapping

A popular approach to adding more context to data for multilayer Perceptrons is to use the window method. We can do this by increasing the input sequence length length from 1 to 3, for example:seq_length=3, which creates training patterns like:

1
2
3

ABC -> D
BCD -> E
CDE -> F

Each element in the sequence is then provided as a new input feature to the network. This requires a modification of how the input sequences reshaped in the data preparation step:

1 2	# reshape X to be #[batch_size,time_stamps,features] X = numpy.reshape(dataX, (len(dataX), 1, seq_length))

The entire code is provided below for completeness.

点击显/隐内容

import numpy as np
from keras.models import Sequential
from keras import layers
from keras.utils import np_utils
from collections import defaultdict

np.random.seed(7)
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char2int = defaultdict(int)
int2char = defaultdict(str)
for idx,val in enumerate(alphabet):
    char2int[val] = idx
    int2char[idx] = val

seq_length = 3
dataX = []
dataY = []
for i in range(0,len(alphabet)-seq_length):
    seq_in = alphabet[i:i+seq_length]
    seq_out = alphabet[i+seq_length]
    dataX.append([char2int[idx] for idx in seq_in])
    dataY.append([char2int[idx] for idx in seq_out])
    print(seq_in,'->',seq_out)

input_shape = (len(dataX),1,seq_length)
# input_shape = (len(dataX),1,seq_length)

#[batch_size,time_stamps,features]
X = np.reshape(dataX,input_shape)
X = X/float(len(alphabet))

y = np_utils.to_categorical(dataY)

model = Sequential()
model.add(layers.LSTM(32,input_shape=(X.shape[1],X.shape[2])))
model.add(layers.Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

def predict(model):
    for pattern in dataX:
        x = np.reshape(pattern,(1,1,len(pattern)))
        x = x/float(len(alphabet))
        prediction = model.predict(x)
        index = np.argmax(prediction)
        result = int2char[index]
        seq_in = [int2char[id] for id in pattern]
        print(seq_in,'->',result)

def naive_lstm():
    model.fit(X,y,epochs=500,batch_size=1,verbose=1)
    scores = model.evaluate(X,y,verbose=1)
    print("Model Accuracy: %.2f%%"%(scores[1]*100))
    predict(model)


if __name__ == '__main__':
    naive_lstm()

Model Accuracy: 86.96%
['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
...
['T', 'U', 'V'] -> X
['U', 'V', 'W'] -> Z
['V', 'W', 'X'] -> Z
['W', 'X', 'Y'] -> Z

We can see a little improvement in the performance that may or may not be true.

Again, this is a misuse of the LSTM network by a poor framing of the problem. Indeed, the sequences of letters are time steps of one feature rather than one time step of separate features. We have given more context to the network, but not more sequence as it expected.

In the next section, we will give more context to the network in the form of time steps.

A Three-Char Time Step Window to One-Char Mapping

We still take as input a sequence with length being 3, seq_length=3. The difference is that the reshaping of the input data takes the sequence as a time step sequence of one feature, rather than a single time step of multiple features.

1
2
3

input_shape = (len(dataX),seq_length,1)
#[batch_size,time_stamps,features]
X = np.reshape(dataX,input_shape)

This is the correct intended use of providing sequence context to your LSTM in Keras. The full code example is provided below for completeness.

点击显/隐内容

import numpy as np
from keras.models import Sequential
from keras import layers
from keras.utils import np_utils
from collections import defaultdict

np.random.seed(7)
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char2int = defaultdict(int)
int2char = defaultdict(str)
for idx,val in enumerate(alphabet):
    char2int[val] = idx
    int2char[idx] = val

seq_length = 3
dataX = []
dataY = []
for i in range(0,len(alphabet)-seq_length):
    seq_in = alphabet[i:i+seq_length]
    seq_out = alphabet[i+seq_length]
    dataX.append([char2int[idx] for idx in seq_in])
    dataY.append([char2int[idx] for idx in seq_out])
    print(seq_in,'->',seq_out)

input_shape = (len(dataX),seq_length,1)
#[batch_size,time_stamps,features]
X = np.reshape(dataX,input_shape)
X = X/float(len(alphabet))

y = np_utils.to_categorical(dataY)

model = Sequential()
model.add(layers.LSTM(32,input_shape=(X.shape[1],X.shape[2])))
model.add(layers.Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
def predict(model):
    for pattern in dataX:
        x = np.reshape(pattern,(1,len(pattern),1))
        x = x/float(len(alphabet))
        prediction = model.predict(x)
        index = np.argmax(prediction)
        result = int2char[index]
        seq_in = [int2char[id] for id in pattern]
        print(seq_in,'->',result)
def naive_lstm():
    model.fit(X,y,epochs=500,batch_size=1,verbose=1)
    scores = model.evaluate(X,y,verbose=1)
    print("Model Accuracy: %.2f%%"%(scores[1]*100))
    predict(model)
if __name__ == '__main__':
    naive_lstm()

Model Accuracy: 100.00%
['A', 'B', 'C'] -> D
...
['W', 'X', 'Y'] -> Z

We can see that the model learns the problem perfectly as evidenced by the model evaluation and the example predictions.

LSTM State Within A Batch

The Keras implementation of LSTMs resets the state of the network after each batch.

This suggests that if we had a batch size large enough to hold all input patterns and if all the input patterns were ordered sequentially, that the LSTM could use the context of the sequence within the batch to better learn the sequence.

We can demonstrate this easily by modifying the first example for learning a one-to-one mapping and increasing the batch size from 1 to the size of the training dataset.

Additionally, Keras shuffles the training dataset before each training epoch. To ensure the training data patterns remain sequential, we can disable this shuffling.

And the training epoch becomes 5000 from 500.

1	model.fit(X,y,epochs=5000,batch_size=len(X),verbose=1)

点击显/隐内容

import numpy as np
from keras.models import Sequential
from keras import layers
from keras.utils import np_utils
from collections import defaultdict
from keras.preprocessing.sequence import pad_sequences
np.random.seed(7)
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char2int = defaultdict(int)
int2char = defaultdict(str)
for idx,val in enumerate(alphabet):
    char2int[val] = idx
    int2char[idx] = val
seq_length = 1
dataX = []
dataY = []
for i in range(0,len(alphabet)-seq_length):
    seq_in = alphabet[i:i+seq_length]
    seq_out = alphabet[i+seq_length]
    dataX.append([char2int[idx] for idx in seq_in])
    dataY.append([char2int[idx] for idx in seq_out])
    print(seq_in,'->',seq_out)
dataX = pad_sequences(dataX,maxlen=seq_length,dtype='float32')
input_shape = (len(dataX),seq_length,1)
#[batch_size,time_stamps,features]
X = np.reshape(dataX,input_shape)
X = X/float(len(alphabet))
y = np_utils.to_categorical(dataY)
model = Sequential()
model.add(layers.LSTM(32,input_shape=(X.shape[1],X.shape[2])))
model.add(layers.Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
def predict(model):
    for pattern in dataX:
        x = np.reshape(pattern,(1,len(pattern),1))
        x = x/float(len(alphabet))
        prediction = model.predict(x)
        index = np.argmax(prediction)
        result = int2char[index]
        seq_in = [int2char[id] for id in pattern]
        print(seq_in,'->',result)
    print("Random Test >>>>>>>>>>>")
    for i in range(20):
        idx = np.random.randint(len(X))
        pattern = dataX[idx] # X is normalized float but dataX is integer.
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(len(alphabet))
        prediction = model.predict(x)
        index = np.argmax(prediction)
        result = int2char[index]
        seq_in = [int2char[id] for id in pattern]
        print(seq_in, '->', result)
def naive_lstm():
    model.fit(X,y,epochs=5000,batch_size=len(X),verbose=1)
    scores = model.evaluate(X,y,verbose=1)
    print("Model Accuracy: %.2f%%"%(scores[1]*100))
    predict(model)
if __name__ == '__main__':
    naive_lstm()

Model Accuracy: 100.00%
['A'] -> B
['B'] -> C
...
['X'] -> Y
['Y'] -> Z
Random Test >>>>>>>>>>>
['T'] -> U
['N'] -> O
...
['S'] -> T
['T'] -> U
['R'] -> S

As we expected, the network is able to use the within-sequence context to learn the alphabet, achieving 100% accuracy on the training data.

Importantly, the network can make accurate predictions for the next letter in the alphabet for randomly selected characters.

Stateful LSTM for a One-Char to One-Char Mapping

Ideally, we want to expose the network to the entire sequence and let it learn the inter-dependencies, rather than us define those dependencies explicitly in the framing of the problem.

We can do this in Keras by making the LSTM layers stateful and manually resetting the state of the network at the end of the epoch, which is also the end of the training sequence.

This is truly how the LSTM networks are intended to be used. We find that by allowing the network itself to learn the dependencies between the characters, that we need a smaller network (half the number of units) and fewer training epochs (almost half).

We first need to define our LSTM layer as stateful. In so doing, we must explicitly specify the batch size as a dimension on the input shape. This also means that when we evaluate the network or make predictions, we must also specify and adhere to this same batch size. This is not a problem now as we are using a batch size of 1. This could introduce difficulties when making predictions when the batch size is not one as predictions will need to be made in batch and in sequence.

1 2	batch_size = 1 model.add(LSTM(50, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))

An important difference in training the stateful LSTM is that we train it manually one epoch at a time and reset the state after each epoch. We can do this in a for loop. Again, we do not shuffle the input, preserving the sequence in which the input training data was created.

1
2
3

for i in range(300):
	model.fit(X, y, epochs=1, batch_size=batch_size, verbose=2, shuffle=False)
	model.reset_states()

As mentioned, we specify the batch size when evaluating the performance of the network on the entire training dataset.

# summarize performance of the model
scores = model.evaluate(X, y, batch_size=batch_size, verbose=0)
model.reset_states()
print("Model Accuracy: %.2f%%" % (scores[1]*100))

Finally, we can demonstrate that the network has indeed learned the entire alphabet. We can seed it with the first letter “A”, request a prediction, feed the prediction back in as an input, and repeat the process all the way to “Z”.

# demonstrate some model predictions
seed = [char_to_int[alphabet[0]]]
for i in range(0, len(alphabet)-1):
	x = numpy.reshape(seed, (1, len(seed), 1))
	x = x / float(len(alphabet))
	prediction = model.predict(x, verbose=0)
	index = numpy.argmax(prediction)
	print(int_to_char[seed[0]], "->", int_to_char[index])
	seed = [index]
model.reset_states()

We can also see if the network can make predictions starting from an arbitrary letter.

# demonstrate a random starting point
letter = "K"
seed = [char_to_int[letter]]
print("New start: ", letter)
for i in range(0, 5):
	x = numpy.reshape(seed, (1, len(seed), 1))
	x = x / float(len(alphabet))
	prediction = model.predict(x, verbose=0)
	index = numpy.argmax(prediction)
	print(int_to_char[seed[0]], "->", int_to_char[index])
	seed = [index]
model.reset_states()

The entire code listing is provided below for completeness.

点击显/隐内容

import numpy as np
from keras.models import Sequential
from keras import layers
from keras.utils import np_utils
from collections import defaultdict
from keras.preprocessing.sequence import pad_sequences
np.random.seed(7)
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char2int = defaultdict(int)
int2char = defaultdict(str)
for idx,val in enumerate(alphabet):
    char2int[val] = idx
    int2char[idx] = val

seq_length = 1
dataX = []
dataY = []
for i in range(0,len(alphabet)-seq_length):
    seq_in = alphabet[i:i+seq_length]
    seq_out = alphabet[i+seq_length]
    dataX.append([char2int[idx] for idx in seq_in])
    dataY.append([char2int[idx] for idx in seq_out])
    print(seq_in,'->',seq_out)
dataX = pad_sequences(dataX,maxlen=seq_length,dtype='float32')
input_shape = (len(dataX),seq_length,1)
#[batch_size,time_stamps,features]
X = np.reshape(dataX,input_shape)
X = X/float(len(alphabet))

y = np_utils.to_categorical(dataY)

batch_size = 1
model = Sequential()
model.add(layers.LSTM(50,batch_input_shape=(batch_size,X.shape[1],X.shape[2]),stateful=True))
model.add(layers.Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

def predict(model):
    seed = [char2int[alphabet[0]]]
    for i in range(len(alphabet)-1):
        x = np.reshape(seed,(1,len(seed),1))
        x = x/float(len(alphabet))
        prediction = model.predict(x)
        index = np.argmax(prediction)
        result = int2char[index]
        print(int2char[seed[0]],'->',result)
        seed = [index]
    model.reset_states()

    print("Random Test >>>>>>>>>>>")
    seed = [char2int['K']]
    print("New start: ", 'K')
    for i in range(5):
        x = np.reshape(seed, (1, len(seed), 1))
        x = x / float(len(alphabet))
        prediction = model.predict(x)
        index = np.argmax(prediction)
        result = int2char[index]
        print(int2char[seed[0]], '->', result)
        seed = [index]
    model.reset_states()

def naive_lstm():
    for i in range(300):
        model.fit(X,y,epochs=1,batch_size=batch_size,verbose=1,shuffle=False)
        model.reset_states()
    scores = model.evaluate(X,y,batch_size=batch_size,verbose=1)
    model.reset_states()
    print("Model Accuracy: %.2f%%"%(scores[1]*100))
    predict(model)


if __name__ == '__main__':
    naive_lstm()

Running the example provides the following output.

Model Accuracy: 96.00%
A -> B
B -> C
...
W -> X
X -> Y
Y -> Y
Random Test >>>>>>>>>>>
New start:  K
K -> B
B -> C
C -> D
D -> E
E -> F

We can see that the network has memorized the entire alphabet perfectly. It used the context of the samples themselves and learned whatever dependency it needed to predict the next character in the sequence.

We can also see that if we seed the network with the first letter, that it can correctly rattle off the rest of the alphabet.

We can also see that it has only learned the full alphabet sequence and that from a cold start. When asked to predict the next letter from “K” that it predicts “B” and falls back into regurgitating the entire alphabet.

To truly predict “K” the state of the network would need to be warmed up iteratively fed the letters from “A” to “J”. This tells us that we could achieve the same effect with a “stateless” LSTM by preparing training data like:

---a -> b
--ab -> c
-abc -> d
abcd -> e

Where the input sequence is fixed at 25 (a-to-y to predict z) and patterns are prefixed with zero-padding.

Finally, this raises the question of training an LSTM network using variable length input sequences to predict the next character.

LSTM with Variable-Length Input to One-Char Output

In the previous section, we discovered that the Keras “stateful” LSTM was really only a shortcut to replaying the first n-sequences, but didn’t really help us learn a generic model of the alphabet.

In this section we explore a variation of the “stateless” LSTM that learns random subsequences of the alphabet and an effort to build a model that can be given arbitrary letters or subsequences of letters and predict the next letter in the alphabet.

Firstly, we are changing the framing of the problem. To simplify we will define a maximum input sequence length and set it to a small value like 5 to speed up training. This defines the maximum length of subsequences of the alphabet will be drawn for training. In extensions, this could just as set to the full alphabet (26) or longer if we allow looping back to the start of the sequence.

We also need to define the number of random sequences to create, in this case 1000. This too could be more or less. I expect less patterns are actually required.

num_inputs = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_inputs):
    start = np.random.randint(len(alphabet)-2)
    end = np.random.randint(start,min(start+max_len,len(alphabet)-1))
    seq_in = alphabet[start:end+1]
    seq_out = alphabet[end+1]
    dataX.append([char2int[idx] for idx in seq_in])
    dataY.append([char2int[idx] for idx in seq_out])
    print(seq_in,'->',seq_out)
dataX = pad_sequences(dataX,maxlen=max_len,dtype='float32')
input_shape = (len(dataX),max_len,1)
#[batch_size,time_stamps,features]
X = np.reshape(dataX,input_shape)
X = X/float(len(alphabet))
y = np_utils.to_categorical(dataY)

Running the code, we create input patterns that look like the following:

STUV -> W
IJKLM -> N
STUV -> W
TUVWX -> Y
RSTU -> V

The input sequences vary in length between 1 and max_len and therefore require zero padding. Here, we use left-hand-side (prefix) padding with the Keras built in pad_sequences() function.

1	dataX = pad_sequences(dataX,maxlen=max_len,dtype='float32')

The trained model is evaluated on randomly selected input patterns. This could just as easily be new randomly generated sequences of characters. I also believe this could also be a linear sequence seeded with “A” with outputs fes back in as single character inputs.

The full code listing is provided below for completeness.

点击显/隐内容

import numpy as np
from keras.models import Sequential
from keras import layers
from keras.utils import np_utils
from collections import defaultdict
from keras.preprocessing.sequence import pad_sequences
np.random.seed(7)
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char2int = defaultdict(int)
int2char = defaultdict(str)
for idx,val in enumerate(alphabet):
    char2int[val] = idx
    int2char[idx] = val

num_inputs = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_inputs):
    start = np.random.randint(len(alphabet)-2)
    end = np.random.randint(start,min(start+max_len,len(alphabet)-1))
    seq_in = alphabet[start:end+1]
    seq_out = alphabet[end+1]
    dataX.append([char2int[idx] for idx in seq_in])
    dataY.append([char2int[idx] for idx in seq_out])
    print(seq_in,'->',seq_out)
dataX = pad_sequences(dataX,maxlen=max_len,dtype='float32')
input_shape = (len(dataX),max_len,1)
#[batch_size,time_stamps,features]
X = np.reshape(dataX,input_shape)
X = X/float(len(alphabet))
y = np_utils.to_categorical(dataY)

batch_size = 1
model = Sequential()
model.add(layers.LSTM(32,input_shape=(X.shape[1],X.shape[2])))
model.add(layers.Dense(y.shape[1],activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

def predict(model):
    for i in range(20):
        pattern_index = np.random.randint(len(dataX))
        pattern = dataX[pattern_index]
        x = pad_sequences([pattern], maxlen=max_len, dtype='float32')
        x = np.reshape(x, (1, max_len, 1))
        x = x / float(len(alphabet))
        prediction = model.predict(x, verbose=0)
        index = np.argmax(prediction)
        result = int2char[index]
        seq_in = [int2char[value] for value in pattern]
        print(seq_in, "->", result)

def naive_lstm():
    model.fit(X,y,epochs=500,batch_size=batch_size,verbose=1)
    model.reset_states()
    scores = model.evaluate(X,y,verbose=1)
    model.reset_states()
    print("Model Accuracy: %.2f%%"%(scores[1]*100))
    predict(model)


if __name__ == '__main__':
    naive_lstm()

Running this code produces the following output:

Model Accuracy: 98.90%
['Q', 'R'] -> S
['W', 'X'] -> Y
['W', 'X'] -> Y
['C', 'D'] -> E
['E'] -> F
['S', 'T', 'U'] -> V
['G', 'H', 'I', 'J', 'K'] -> L
['O', 'P', 'Q', 'R', 'S'] -> T
['C', 'D'] -> E
['O'] -> P
['N', 'O', 'P'] -> Q
['D', 'E', 'F', 'G', 'H'] -> I
['X'] -> Y
['K'] -> L
['M'] -> N
['R'] -> T
['K'] -> L
['E', 'F', 'G'] -> H
['Q'] -> R
['Q', 'R', 'S'] -> T

We can see that although the model did not learn the alphabet perfectly from the randomly generated subsequences, it did very well. The model was not tuned and may require more training or a larger network, or both (an exercise for the reader).

This is a good natural extension to the “all sequential input examples in each batch” alphabet model learned above in that it can handle ad hoc queries, but this time of arbitrary sequence length (up to the max length).

PyTorch - Classify Names with a Character-Level RNN

Preparing data

Each line contains a name and we need to convert them from Unicode to ASCII.

Once we have read all files, we need to create two dataset: languages=[] containing all target categories ,languages2names={}, mapping each languages to corresponding names.

Then we use one-hot tensor to represent each letter, whose size is <1,n_letters>. Therefore, a name is represented as <name_len, 1, n_letters>.

After that, we are going to define a RNN structure like the following:

8747470733a2f2f692e696d6775722e636f6d2f5a32786279534f2e706e6