Book Notes: Neural Network Methods for Natural Language Processing -- Part 3 Specialized Architectures, Ch14-16 RNNs

For the pdf slides, click here

Ch14 Recurrent Neural Networks: Modeling Sequences and Stacks

RNNs overview

  • RNNs allow representing arbitrarily sized sequential inputs in fixed-sized vectors, while paying attention to the structured properties of the inputs

  • This chapter describes RNNs as an abstraction: an interface for translating a sequence of inputs into a fixed-sized output, which can be plugged as a component into larger networks

  • RNNs allow for language models that do not make the Markov assumption, and condition the next word on the entire sentence history

  • It is important to understand that the RNN does not do much on its own, but serves as a trainable component in a larger network

The RNN Abstraction

The RNN abstraction

  • On a high level, the RNN is a function that

    • Takes as input an arbitrary-length ordered sequence of $n$ $d_{in}$-dimensional vectors $x_{1:n} = x_1, \ldots, x_n$, and
    • Returns as output a single $d_{out}$-dimensional vector $y_n$
    • The output vector $y_n$ is then used for further prediction: $y_n = RNN(x_{1:n})$, with $x_i \in \mathbb{R}^{d_{in}}$ and $y_n \in \mathbb{R}^{d_{out}}$
  • This implicitly defines an output vector $y_i$ for each prefix $x_{1:i}$. We denote by $RNN^*$ the function returning this sequence: $y_{1:n} = RNN^*(x_{1:n})$, $y_i = RNN(x_{1:i})$, with $x_i \in \mathbb{R}^{d_{in}}$ and $y_i \in \mathbb{R}^{d_{out}}$

  • The RNN function provides a framework for conditioning on the entire history $x_1, \ldots, x_i$ without the Markov assumption

The R function and the O function

  • The RNN is defined recursively, by means of a function $R$ taking as input the state vector $s_{i-1}$ and an input vector $x_i$, and returning a new state vector $s_i$

  • The state vector $s_i$ is then mapped to an output vector $y_i$ using a simple deterministic function $O(\cdot)$: $s_i = R(s_{i-1}, x_i)$, $y_i = O(s_i)$

  • The functions $R$ and $O$ are the same across the sequence positions, but the RNN keeps track of the states of computation through the state vector $s_i$, which is kept and passed across invocations of $R$
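
To make the $R$/$O$ interface concrete, here is a minimal Python/NumPy sketch of the abstraction. The particular `R` used below is a toy running-sum update chosen only to exercise the interface; the function names and dimensions are illustrative, not from the book.

```python
import numpy as np

def rnn(x_seq, s_0, R, O):
    """y_n = RNN(x_1:n): apply R across the sequence and return the final output."""
    s = s_0
    for x in x_seq:
        s = R(s, x)          # s_i = R(s_{i-1}, x_i)
    return O(s)              # y_n = O(s_n)

def rnn_star(x_seq, s_0, R, O):
    """y_1:n = RNN*(x_1:n): also return an output for every prefix x_1:i."""
    s, ys = s_0, []
    for x in x_seq:
        s = R(s, x)
        ys.append(O(s))
    return ys

# Toy instantiation: a "running sum" R and an identity O (illustrative only).
R = lambda s, x: s + x
O = lambda s: s
x_seq = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 3.0])]
print(rnn(x_seq, np.zeros(2), R, O))        # [4. 5.]
print(rnn_star(x_seq, np.zeros(2), R, O))   # one output per prefix
```

Note how the same `R` and `O` are reused at every position; only the state `s` carries information across invocations.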

An illustration of the RNN

  • We include here the parameters $\theta$ in order to highlight the fact that the same parameters are shared across all time steps

An illustration of the RNN (unrolled)

Common RNN usages

Common RNN usage patterns: acceptor

  • Acceptor: the supervision signal is placed only on the final output vector $y_n$

    • Typically, the RNN’s output vector $y_n$ is fed into a fully connected layer or an MLP, which produces a prediction

Common RNN usage patterns: encoder

  • Encoder: also uses only the final output vector $y_n$. Here $y_n$ is treated as an encoding of the information in the sequence, and is used as additional information together with other signals

Common RNN usage patterns: transducer

  • Transducer: an output $\hat{y}_i$ is produced for each input $x_i$ that the RNN reads in, and the loss of the unrolled sequence is used (typically the sum of the per-position losses)

  • A natural use case of transduction is language modeling, where the sequence of words $x_{1:i}$ is used to predict a distribution over the $(i+1)$th word

  • RNN-based language models are shown to provide vastly better perplexities than traditional language models

  • Using RNNs as transducers allows us to relax the Markov assumption and condition on the entire prediction history

An illustration of transducer

Bidirectional RNNs and Deep RNNs

Bidirectional RNNs (biRNN)

  • A useful elaboration of an RNN is a biRNN

  • Consider an input sequence $x_{1:n}$. The biRNN works by maintaining two separate states, $s^f_i$ and $s^b_i$, for each input position $i$

    • The forward state $s^f_i$ is based on $x_1, x_2, \ldots, x_i$
    • The backward state $s^b_i$ is based on $x_n, x_{n-1}, \ldots, x_i$
  • The output at position $i$ is based on the concatenation of the two output vectors: $y_i = [y^f_i; y^b_i] = [O^f(s^f_i); O^b(s^b_i)]$

  • Thus, we define the biRNN as $biRNN(x_{1:n}, i) = y_i = [RNN^f(x_{1:i}); RNN^b(x_{n:i})]$
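
A minimal sketch of the biRNN combination, assuming the forward and backward RNNs are given as black-box state-update and output functions (the names and setup are illustrative, not from the book):

```python
import numpy as np

def birnn(x_seq, R_f, O_f, s0_f, R_b, O_b, s0_b):
    """Return y_i = [y^f_i ; y^b_i] for every position i of the input sequence."""
    n = len(x_seq)
    # Forward pass: s^f_i is based on x_1, ..., x_i
    ys_f, s = [], s0_f
    for x in x_seq:
        s = R_f(s, x)
        ys_f.append(O_f(s))
    # Backward pass: s^b_i is based on x_n, ..., x_i
    ys_b, s = [None] * n, s0_b
    for i in reversed(range(n)):
        s = R_b(s, x_seq[i])
        ys_b[i] = O_b(s)
    # Concatenate the two views of each position
    return [np.concatenate([yf, yb]) for yf, yb in zip(ys_f, ys_b)]
```

Under this sketch, $biRNN(x_{1:n}, i)$ from the definition above is simply the $i$-th element of the returned list.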

An illustration of biRNN

Deep (multi-layer stacked) RNNs

  • The input to the first RNN is $x_{1:n}$, while the input to the $j$th RNN ($j \ge 2$) is the output of the RNN below it, $y^{j-1}_{1:n}$

  • While it is not theoretically clear what additional power is gained by the deeper architecture, it was observed empirically that deep RNNs work better than shallower ones on some tasks

  • The author’s experience: using two or more layers indeed often improves over using a single one

A note on reading the literature

  • Unfortunately, it is often the case that inferring the exact model form from reading its description in a research paper can be quite challenging

  • For example,
    • The inputs to the RNN can be either one-hot vectors or embedded representations
    • The input sequence can be padded with start-of-sequence and/or end-of-sequence symbols, or not

An illustration of deep RNN

Ch15 Concrete Recurrent Neural Network Architectures

Simple RNN

Simple RNN (SRNN)

  • The nonlinear function $g$ is usually tanh or ReLU
  • The output function $O(\cdot)$ is the identity function (a concrete sketch follows this list):
    $$s_i = R_{SRNN}(x_i, s_{i-1}) = g(s_{i-1} W^s + x_i W^x + b)$$
    $$y_i = O_{SRNN}(s_i) = s_i$$
    $$s_i, y_i \in \mathbb{R}^{d_s}, \quad x_i \in \mathbb{R}^{d_x}, \quad W^x \in \mathbb{R}^{d_x \times d_s}, \quad W^s \in \mathbb{R}^{d_s \times d_s}, \quad b \in \mathbb{R}^{d_s}$$

  • SRNN is hard to train effectively because of the vanishing gradients problem
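
A concrete NumPy sketch of the SRNN equations above; the sizes and random initialization are illustrative assumptions, and no training is shown.

```python
import numpy as np

d_x, d_s = 4, 3                                   # illustrative sizes
rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.1, size=(d_x, d_s))      # W^x
W_s = rng.normal(scale=0.1, size=(d_s, d_s))      # W^s
b = np.zeros(d_s)

def R_srnn(x, s_prev):
    # s_i = g(s_{i-1} W^s + x_i W^x + b), with g = tanh
    return np.tanh(s_prev @ W_s + x @ W_x + b)

def O_srnn(s):
    # y_i = s_i (identity output function)
    return s

s = np.zeros(d_s)
for x in rng.normal(size=(6, d_x)):               # a length-6 input sequence
    s = R_srnn(x, s)
y_n = O_srnn(s)                                    # final d_s-dimensional output
print(y_n.shape)                                   # (3,)
```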

Gated Architectures: LSTM and GRU

Gated architectures

  • An apparent problem with SRNN is that the memory access is not controlled. At each step of the computation, the entire memory state is read, and the entire memory state is written

  • We denote the Hadamard product (element-wise product) as $\odot$

  • To control memory access, consider a binary vector $g \in \{0,1\}^n$

  • For a memory $s \in \mathbb{R}^d$ and an input $x \in \mathbb{R}^d$, the computation

    $$s' \leftarrow g \odot x + (1-g) \odot s$$

    “reads” the entries in $x$ that correspond to the 1 values in $g$, and writes them to the new memory $s'$. Locations that weren’t read to are copied from the old memory $s$ to the new memory $s'$ through the use of the gate $(1-g)$
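
A tiny numeric illustration of the hard gate (the values are chosen arbitrarily):

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0, 4.0])   # old memory s
x = np.array([9.0, 8.0, 7.0, 6.0])   # new input x
g = np.array([1, 0, 0, 1])           # binary gate: overwrite positions 0 and 3

s_new = g * x + (1 - g) * s          # s' <- g (.) x + (1 - g) (.) s
print(s_new)                         # [9. 2. 3. 6.]: gated entries from x, rest copied from s
```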

An illustration of binary gate

Differentiable gates

  • The gates should not be static, but be controlled by the current memory state and the input, and their behavior should be learned

  • Obstacle: learning in our framework entails being differentiable (because of the backpropagation algorithm), but the binary 0-1 values used in the gates are not differentiable

  • Solution: approximate the hard gating mechanism with a soft, but differentiable, gating mechanism

  • To achieve these differentiable gates, we drop the requirement that $g \in \{0,1\}^n$ and instead allow arbitrary real numbers $g' \in \mathbb{R}^n$, which are passed through a sigmoid function $\sigma(g')$ to obtain values in the range $(0,1)$
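
Continuing the toy example above, the soft relaxation simply replaces the 0/1 gate with a sigmoid of unconstrained values (the values of `g_raw` here are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

s = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([9.0, 8.0, 7.0, 6.0])
g_raw = np.array([5.0, -5.0, 0.0, 2.0])  # unconstrained, learnable values g'
g = sigmoid(g_raw)                        # soft gate in (0, 1), ~[0.99, 0.01, 0.50, 0.88]
s_new = g * x + (1 - g) * s               # differentiable blend of x and s
print(np.round(s_new, 2))                 # ~[8.95, 2.04, 5.0, 5.76]
```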

LSTM

  • Long Short-Term Memory (LSTM): explicitly splits the state vector $s_j$ into two halves, where one half is treated as “memory cells” $c_j$ and the other half as the hidden state component $h_j$

$$s_j = R_{LSTM}(s_{j-1}, x_j) = [c_j; h_j]$$

  • There are three gates, $i$ (input), $f$ (forget), and $o$ (output):
    $$c_j = f \odot c_{j-1} + i \odot z$$
    $$z = \tanh(x_j W^{xz} + h_{j-1} W^{hz})$$
    $$h_j = o \odot \tanh(c_j)$$
    $$y_j = O_{LSTM}(s_j) = h_j$$

LSTM gates

  • The gates are based on $x_j$ and $h_{j-1}$ and are passed through a sigmoid activation function (a cell-level sketch follows this list):
    $$i = \sigma(x_j W^{xi} + h_{j-1} W^{hi})$$
    $$f = \sigma(x_j W^{xf} + h_{j-1} W^{hf})$$
    $$o = \sigma(x_j W^{xo} + h_{j-1} W^{ho})$$

  • When training LSTM networks, it is strongly recommended to always initialize the bias term of the forget gate to be close to one
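
A cell-level NumPy sketch of the LSTM equations above. The sizes and initialization are illustrative; biases other than the forget-gate bias are omitted, matching the equations in these notes, and the forget-gate bias is initialized to one as recommended.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d_x, d_h = 4, 3                                    # illustrative sizes
rng = np.random.default_rng(0)
mat = lambda rows, cols: rng.normal(scale=0.1, size=(rows, cols))
W_xi, W_hi = mat(d_x, d_h), mat(d_h, d_h)          # input gate weights
W_xf, W_hf = mat(d_x, d_h), mat(d_h, d_h)          # forget gate weights
W_xo, W_ho = mat(d_x, d_h), mat(d_h, d_h)          # output gate weights
W_xz, W_hz = mat(d_x, d_h), mat(d_h, d_h)          # candidate-update weights
b_f = np.ones(d_h)                                 # forget-gate bias initialized near one

def R_lstm(c_prev, h_prev, x):
    i = sigmoid(x @ W_xi + h_prev @ W_hi)          # input gate
    f = sigmoid(x @ W_xf + h_prev @ W_hf + b_f)    # forget gate
    o = sigmoid(x @ W_xo + h_prev @ W_ho)          # output gate
    z = np.tanh(x @ W_xz + h_prev @ W_hz)          # candidate update
    c = f * c_prev + i * z                         # new memory cell c_j
    h = o * np.tanh(c)                             # new hidden state h_j = y_j
    return c, h

c, h = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):                # run over a length-5 sequence
    c, h = R_lstm(c, h, x)
```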

GRU

  • Gated Recurrent Unit (GRU) is shown to perform comparably to the LSTM on several datasets

  • The GRU has substantially fewer gates than the LSTM and doesn’t have a separate memory component:
    $$s_j = R_{GRU}(s_{j-1}, x_j) = (1-z) \odot s_{j-1} + z \odot \tilde{s}_j$$
    $$\tilde{s}_j = \tanh(x_j W^{xs} + (r \odot s_{j-1}) W^{sg})$$
    $$y_j = O_{GRU}(s_j) = s_j$$

  • Gate $r$ controls access to the previous state $s_{j-1}$ in the computation of $\tilde{s}_j$
  • Gate $z$ controls the proportions of the interpolation between $s_{j-1}$ and $\tilde{s}_j$ when computing the updated state $s_j$:
    $$z = \sigma(x_j W^{xz} + s_{j-1} W^{sz})$$
    $$r = \sigma(x_j W^{xr} + s_{j-1} W^{sr})$$
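
A matching NumPy sketch of the GRU update, again with illustrative sizes and untrained weights:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d_x, d_s = 4, 3                                    # illustrative sizes
rng = np.random.default_rng(0)
mat = lambda rows, cols: rng.normal(scale=0.1, size=(rows, cols))
W_xz, W_sz = mat(d_x, d_s), mat(d_s, d_s)          # update gate z
W_xr, W_sr = mat(d_x, d_s), mat(d_s, d_s)          # reset gate r
W_xs, W_sg = mat(d_x, d_s), mat(d_s, d_s)          # candidate state

def R_gru(s_prev, x):
    z = sigmoid(x @ W_xz + s_prev @ W_sz)          # how much of the state to update
    r = sigmoid(x @ W_xr + s_prev @ W_sr)          # how much of s_{j-1} the candidate sees
    s_tilde = np.tanh(x @ W_xs + (r * s_prev) @ W_sg)
    return (1 - z) * s_prev + z * s_tilde          # s_j = interpolation; y_j = s_j

s = np.zeros(d_s)
for x in rng.normal(size=(5, d_x)):
    s = R_gru(s, x)
```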

Ch16 Modeling with Recurrent Networks

Sentiment Classification

Acceptors

  • The simplest use of an RNN: read in an input sequence, and produce a binary or multi-class answer at the end

  • The power of the RNN is often not needed for many natural language classification tasks, because word order and sentence structure turn out not to be very important in many cases, and a bag-of-words or bag-of-ngrams classifier often works just as well as, or even better than, RNN acceptors

Sentiment classification: sentence level

  • The sentence level sentiment classification is straightforward to model using an RNN acceptor:

    • Tokenization
    • The RNN reads in the words of the sentence one at a time
    • The final RNN state is then fed into an MLP, followed by a softmax layer with two outputs
    • The network is trained with cross-entropy loss based on the gold sentiment labels (a minimal forward-pass sketch follows this list):
      $$p(\text{label}=k \mid w_{1:n}) = \hat{y}_{[k]}$$
      $$\hat{y} = \text{softmax}(\text{MLP}(RNN(x_{1:n})))$$
      $$x_{1:n} = E_{[w_1]}, \ldots, E_{[w_n]}$$
  • biRNN: it is often helpful to extend the RNN model into a biRNN:
    $$\hat{y} = \text{softmax}(\text{MLP}([RNN^f(x_{1:n}); RNN^b(x_{n:1})]))$$
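
A minimal forward-pass sketch of the acceptor recipe. For brevity it uses the Simple RNN update instead of a gated RNN and a single linear layer in place of the MLP; the vocabulary, sizes, and (untrained) weights are all illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab = {"this": 0, "movie": 1, "was": 2, "great": 3}     # toy vocabulary
d_emb, d_s, n_labels = 5, 4, 2
E   = rng.normal(scale=0.1, size=(len(vocab), d_emb))     # embedding matrix E
W_x = rng.normal(scale=0.1, size=(d_emb, d_s))
W_s = rng.normal(scale=0.1, size=(d_s, d_s))
W_y = rng.normal(scale=0.1, size=(d_s, n_labels))         # stand-in for the MLP

def predict(words):
    x_seq = [E[vocab[w]] for w in words]                  # x_1:n = E[w_1], ..., E[w_n]
    s = np.zeros(d_s)
    for x in x_seq:                                       # acceptor: only the final state is used
        s = np.tanh(s @ W_s + x @ W_x)
    return softmax(s @ W_y)                               # distribution over sentiment labels

print(predict(["this", "movie", "was", "great"]))         # ~uniform, since nothing is trained
```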

Hierarchical biRNN

  • For longer sentences, it can be useful to use a hierarchical architecture, in which the sentence is split into smaller spans based on punctuation

  • Suppose a sentence $w_{1:n}$ is split into $m$ spans $w^1_{1:\ell_1}, \ldots, w^m_{1:\ell_m}$; the architecture is then
    $$p(\text{label}=k \mid w_{1:n}) = \hat{y}_{[k]}$$
    $$\hat{y} = \text{softmax}(\text{MLP}(RNN(z_{1:m})))$$
    $$z_i = [RNN^f(x^i_{1:\ell_i}); RNN^b(x^i_{\ell_i:1})]$$
    $$x^i_{1:\ell_i} = E_{[w^i_1]}, \ldots, E_{[w^i_{\ell_i}]}$$

  • Each of the $m$ different spans may convey a different sentiment

  • The higher-level acceptor reads the summary $z_{1:m}$ produced by the lower-level encoder, and decides on the overall sentiment

Document level sentiment classification

  • Document-level sentiment classification is harder than sentence-level classification

  • A hierarchical architecture is useful:

    • Each sentence $s_i$ is encoded using a gated RNN, producing a vector $z_i$
    • The vectors $z_{1:n}$ are then fed into a second gated RNN, producing a vector $h = RNN(z_{1:n})$
    • $h$ is then used for prediction: $\hat{y} = \text{softmax}(\text{MLP}(h))$
  • Keeping all intermediate vectors of the document-level RNN and averaging them produces slightly higher results in some cases:
    $$h_{1:n} = RNN^*(z_{1:n})$$
    $$\hat{y} = \text{softmax}\Big(\text{MLP}\Big(\frac{1}{n}\sum_{i=1}^{n} h_i\Big)\Big)$$
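
A structural sketch of the hierarchical recipe, treating both the sentence-level and document-level RNNs as the same simple update for brevity (gated RNNs would be used in practice); all names and sizes are illustrative, and the cross-entropy training loss is omitted.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_emb, d_z, d_h, n_labels = 5, 4, 4, 2
W1_x, W1_s = rng.normal(scale=0.1, size=(d_emb, d_z)), rng.normal(scale=0.1, size=(d_z, d_z))
W2_x, W2_s = rng.normal(scale=0.1, size=(d_z, d_h)), rng.normal(scale=0.1, size=(d_h, d_h))
W_y = rng.normal(scale=0.1, size=(d_h, n_labels))          # stand-in for the MLP

def run_rnn(x_seq, W_x, W_s, d):
    """Return all states h_1:n of a simple RNN over x_seq."""
    s, states = np.zeros(d), []
    for x in x_seq:
        s = np.tanh(s @ W_s + x @ W_x)
        states.append(s)
    return states

def predict(document):
    """document: a list of sentences, each a list of word-embedding vectors."""
    z = [run_rnn(sent, W1_x, W1_s, d_z)[-1] for sent in document]  # z_i per sentence
    h = run_rnn(z, W2_x, W2_s, d_h)                                # h_1:n at document level
    return softmax(np.mean(h, axis=0) @ W_y)                       # average the h_i, then predict

doc = [[rng.normal(size=d_emb) for _ in range(4)] for _ in range(3)]  # 3 sentences, 4 words each
print(predict(doc))
```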

References

  • Goldberg, Yoav. (2017). Neural Network Methods for Natural Language Processing, Morgan & Claypool