Book Notes: Neural Network Methods for Natural Language Processing -- Part 3 Specialized Architectures, Ch13 CNN

For the pdf slides, click here

Overview on CNN and RNN for NLP

  • CNN and RNN architectures explored in this part of the book are primarily used as feature extractors

  • CNNs and RNNs as Lego bricks: one just needs to make sure that input and output dimensions of the different components match

Ch13 Ngram Detectors: Convolutional Neural Networks

CNN Overview

CNN overview for NLP

  • CBOW assigns the following two sentences the same representations

    • “it was not good, it was actually quite bad”
    • “it was not bad, it was actually quite good”
  • Looking at ngrams is much more informative than looking at a bag-of-words

  • This chapter introduces the convolution-and-pooling architecture (also called convolutional neural network, or CNN), which is tailored to this modeling problem

Benefits of CNN for NLP

  • CNNs will identify ngrams that are predictive for the task at hand, without the need to pre-specify an embedding vector for each possible ngram

  • CNNs also allow sharing predictive behavior between ngrams that share similar components, even if the exact ngram seen at test time was never observed during training

  • Example paper: link

Basic Convolution + Pooling

Convolution

  • The main idea behind a convolution and pooling architecture for language tasks is to apply a non-linear (learned) function over each instantiation of a k-word sliding window over the sentence

  • This function (also called “filter”) transforms a window of k words into a scalar value

    • Intuitively, when the sliding window of size k is run over a sequence, the filter function learns to identify informative kgrams
  • Several such filters can be applied, resulting in an $\ell$-dimensional vector (each dimension corresponding to one filter) that captures important properties of the words in the window

Pooling

  • Then a “pooling” operation is used to combine the vectors resulting from the different windows into a single $\ell$-dimensional vector, by taking the max or the average value observed in each of the $\ell$ dimensions over the different windows

    • The intention is to focus on the most important “features” in the sentence, regardless of their location
  • The resulting $\ell$-dimensional vector is then fed further into a network that is used for prediction

1D convolutions over text

  • A filter is a dot-product with a weight vector parameter $u$, which is often followed by a nonlinear activation function

  • Define the operator $\oplus(w_{i:i+k-1})$ to be the concatenation of the vectors $w_i, \ldots, w_{i+k-1}$. The concatenated vector of the $i$th window is then $x_i = \oplus(w_{i:i+k-1}) = [w_i; w_{i+1}; \ldots; w_{i+k-1}] \in \mathbb{R}^{k \cdot d_{\text{emb}}}$

  • Apply the filter to each window-vector, resulting in a scalar value $p_i$: $p_i = g(x_i \cdot u)$, with $p_i \in \mathbb{R}$, $x_i \in \mathbb{R}^{k \cdot d_{\text{emb}}}$, $u \in \mathbb{R}^{k \cdot d_{\text{emb}}}$
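
As a concrete illustration, here is a minimal NumPy sketch of a single filter applied to every k-word window. The random embeddings, the toy sizes, and the choice of tanh for $g$ are illustrative assumptions, not from the book.

```python
# Hypothetical sketch (NumPy): one convolution filter u applied to every
# k-word window of a toy sentence, giving one scalar p_i per window.
import numpy as np

rng = np.random.default_rng(0)

n, d_emb, k = 7, 4, 3                 # sentence length, embedding dim, window size
W = rng.normal(size=(n, d_emb))       # stand-in word embeddings w_1 ... w_n
u = rng.normal(size=(k * d_emb,))     # filter weights, u in R^{k * d_emb}

def g(x):                             # nonlinearity (tanh here; the choice is illustrative)
    return np.tanh(x)

# Narrow convolution: windows start at i = 1 ... n-k+1 (0-based indexing below).
p = np.array([g(W[i:i + k].reshape(-1) @ u) for i in range(n - k + 1)])
print(p.shape)                        # (n - k + 1,) -> one scalar per window
```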

Joint formulation of 1D convolutions

  • It is customary to use $\ell$ different filters $u_1, \ldots, u_\ell$, which can be arranged into a matrix $U$, and a bias vector $b$ is often added

$$p_i = g(x_i \cdot U + b) \qquad p_i \in \mathbb{R}^{\ell},\ x_i \in \mathbb{R}^{k \cdot d_{\text{emb}}},\ U \in \mathbb{R}^{k \cdot d_{\text{emb}} \times \ell},\ b \in \mathbb{R}^{\ell}$$

  • Ideally, each dimension captures a different kind of indicative information

  • The main idea behind the convolution layer: to apply the same parameterized function over all kgrams in the sequence. This creates a sequence of m vectors, each representing a particular kgram in the sequence
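
The matrix form can be sketched the same way; again the random embeddings, the tanh nonlinearity, and the toy dimensions are assumptions for illustration only.

```python
# Hypothetical sketch (NumPy) of the joint formulation: l filters arranged as
# a matrix U plus a bias b, so every window maps to an l-dimensional vector.
import numpy as np

rng = np.random.default_rng(1)

n, d_emb, k, l = 7, 4, 3, 5           # l = number of filters
W = rng.normal(size=(n, d_emb))       # stand-in embeddings
U = rng.normal(size=(k * d_emb, l))   # U in R^{(k*d_emb) x l}
b = rng.normal(size=(l,))             # bias in R^l

# x_i is the concatenation of the k word vectors in window i.
X = np.stack([W[i:i + k].reshape(-1) for i in range(n - k + 1)])  # (m, k*d_emb)
P = np.tanh(X @ U + b)                # (m, l): row i is the vector p_i
print(P.shape)                        # (5, 5) for these toy sizes
```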

Narrow vs wide convolutions

  • For a sentence of length n with a window of size k

  • Narrow convolutions: there are $n-k+1$ positions to start a window, and we get $n-k+1$ vectors $p_{1:n-k+1}$

  • Wide convolutions: an alternative is to pad the sentence with $k-1$ padding-words on each side, resulting in $n+k-1$ vectors $p_{1:n+k-1}$ (see the sketch after this list)

  • We use m to denote the number of resulting vectors
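
A small sketch of the two window counts, assuming a zero vector as the embedding of a hypothetical *PAD* symbol:

```python
# Hypothetical sketch: counting windows for narrow vs. wide convolutions.
# Wide convolution pads with k-1 padding words on each side, so m = n + k - 1.
import numpy as np

n, k, d_emb = 6, 3, 4
W = np.zeros((n, d_emb))                      # stand-in embeddings
PAD = np.zeros((k - 1, d_emb))                # embeddings of k-1 *PAD* symbols
W_wide = np.concatenate([PAD, W, PAD])        # length n + 2(k-1)

m_narrow = n - k + 1                          # windows over the unpadded sentence
m_wide = W_wide.shape[0] - k + 1              # = n + k - 1
print(m_narrow, m_wide)                       # 4 8  (here: 6-3+1 and 6+3-1)
```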

Vector pooling

  • Applying the convolution over the text results in $m$ vectors $p_{1:m}$, each $p_i \in \mathbb{R}^{\ell}$

  • These vectors are then combined (pooled) into a single vector $c \in \mathbb{R}^{\ell}$ representing the entire sequence

  • During training, the vector c is fed into downstream network layers (e.g., an MLP), culminating in an output layer which is used for prediction

Different pooling methods

  • Max pooling: the most common, taking the maximum value across each dimension $j = 1, \ldots, \ell$: $c_{[j]} = \max_{1 \le i \le m} p_{i,[j]}$

    • The effect of the max-pooling operation is to get the most salient information across window positions
  • Average pooling: $c = \frac{1}{m} \sum_{i=1}^{m} p_i$

  • K-max pooling: the top k values in each dimension are retained instead of only the best one, while preserving the order in which they appeared in the text
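
The three pooling variants can be sketched over a toy $m \times \ell$ matrix $P$; the random values and the helper function k_max_pool are illustrative assumptions, not the book's code.

```python
# Hypothetical sketch: max, average, and k-max pooling over the m x l matrix P
# whose rows are the per-window vectors p_1 ... p_m.
import numpy as np

rng = np.random.default_rng(2)
m, l = 6, 4
P = rng.normal(size=(m, l))

c_max = P.max(axis=0)                 # max pooling: most salient value per dimension
c_avg = P.mean(axis=0)                # average pooling: (1/m) * sum_i p_i

def k_max_pool(P, k):
    # Keep the k largest values in each dimension, preserving their order in the text.
    keep = np.sort(np.argsort(P, axis=0)[-k:], axis=0)    # top-k indices, re-sorted by position
    return np.take_along_axis(P, keep, axis=0)             # shape (k, l)

c_kmax = k_max_pool(P, 2)
print(c_max.shape, c_avg.shape, c_kmax.shape)              # (4,) (4,) (2, 4)
```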

An illustration of convolution and pooling

Variations

  • Rather than a single convolutional layer, several convolutional layers may be applied in parallel

  • For example, we may have four different convolutional layers, each with a different window size in the range 2-5, capturing kgram sequences of varying lengths
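
A hedged sketch of this parallel-windows variant, reusing the toy convolution from above (random parameters, tanh, and max pooling are illustrative choices):

```python
# Hypothetical sketch: four parallel convolution layers with window sizes 2-5,
# each max-pooled and then concatenated into one representation of the sentence.
import numpy as np

rng = np.random.default_rng(3)
n, d_emb, l = 9, 4, 5
W = rng.normal(size=(n, d_emb))

def conv_and_pool(W, k, l, rng):
    U = rng.normal(size=(k * W.shape[1], l))
    b = rng.normal(size=(l,))
    X = np.stack([W[i:i + k].reshape(-1) for i in range(W.shape[0] - k + 1)])
    return np.tanh(X @ U + b).max(axis=0)      # max-pool over windows -> (l,)

c = np.concatenate([conv_and_pool(W, k, l, rng) for k in (2, 3, 4, 5)])
print(c.shape)                                  # (20,): 4 window sizes x l filters each
```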

Hierarchical Convolutions

Hierarchical convolutions

  • The 1D convolution approach described so far can be thought of as an ngram detector: a convolution layer with a window of size $k$ learns to identify indicative k-grams in the input

$$p_{1:m} = \text{CONV}^{k}_{U,b}(w_{1:n})$$

  • We can extend this into a hierarchy of $r$ convolutional layers that feed into each other:

    $$p^{1}_{1:m_1} = \text{CONV}^{k_1}_{U^1,b^1}(w_{1:n})$$
    $$p^{2}_{1:m_2} = \text{CONV}^{k_2}_{U^2,b^2}(p^{1}_{1:m_1})$$
    $$\cdots$$
    $$p^{r}_{1:m_r} = \text{CONV}^{k_r}_{U^r,b^r}(p^{r-1}_{1:m_{r-1}})$$

Hierarchical convolutions, continued

  • For $r$ layers with a window of size $k$, each vector $p^{r}_{i}$ will be sensitive to a window of $r(k-1)+1$ words

  • Moreover, the vector $p^{r}_{i}$ can be sensitive to gappy-ngrams of $k+r-1$ words, potentially capturing patterns such as “not ___ good” or “obvious ___ predictable ___ plot”, where ___ stands for a short sequence of words
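
A minimal sketch of stacking $r$ narrow convolution layers, assuming toy sizes and random parameters; it only shows how the sequence shrinks by $k-1$ vectors per layer while the receptive field grows to $r(k-1)+1$ words.

```python
# Hypothetical sketch: stacking r narrow convolution layers with window size k.
# After r layers, each output vector is sensitive to r(k-1)+1 input words.
import numpy as np

rng = np.random.default_rng(4)

def conv_layer(P_in, k, l, rng):
    # Narrow 1D convolution over a sequence of vectors (words or lower-layer outputs).
    m_in, d = P_in.shape
    U = rng.normal(size=(k * d, l))
    b = rng.normal(size=(l,))
    X = np.stack([P_in[i:i + k].reshape(-1) for i in range(m_in - k + 1)])
    return np.tanh(X @ U + b)

n, d_emb, k, l, r = 12, 4, 3, 5, 3
P = rng.normal(size=(n, d_emb))           # stand-in embeddings w_1 ... w_n
for _ in range(r):
    P = conv_layer(P, k, l, rng)

print(P.shape)                            # (n - r*(k-1), l): each row covers r(k-1)+1 = 7 words
```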

Strides

  • So far, the convolution operation is applied to each $k$-word window in the sequence, i.e., windows starting at indices $1, 2, 3, \ldots$ This is said to have a stride of size 1

  • Larger strides are also possible. For example, with a stride of size 2, the convolution operation will be applied to windows starting at indices $1, 3, 5, \ldots$

  • Convolution with window size $k$ and stride size $s$: $p_{1:m} = \text{CONV}^{k,s}_{U,b}(w_{1:n})$, where $p_i = g(\oplus(w_{1+(i-1)\cdot s \,:\, (i-1)\cdot s + k}) \cdot U + b)$
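
A sketch of strided window extraction under the same toy setup (random embeddings and parameters are assumptions):

```python
# Hypothetical sketch: convolution with window size k and stride s, so window i
# covers words 1+(i-1)s ... (i-1)s+k (0-based slicing below).
import numpy as np

rng = np.random.default_rng(5)
n, d_emb, k, s, l = 10, 4, 3, 2, 5
W = rng.normal(size=(n, d_emb))
U = rng.normal(size=(k * d_emb, l))
b = rng.normal(size=(l,))

starts = range(0, n - k + 1, s)                          # 0, 2, 4, ... instead of every position
X = np.stack([W[i:i + k].reshape(-1) for i in starts])
P = np.tanh(X @ U + b)
print(P.shape)                                            # (4, 5): fewer windows than with stride 1
```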

An illustration of stride size

References

  • Goldberg, Yoav. (2017). Neural Network Methods for Natural Language Processing, Morgan & Claypool