# Book Notes: Neural Network Methods for Natural Language Processing -- Part 3 Specialized Architectures, Ch13 CNN

### Overview on CNN and RNN for NLP

• CNN and RNN architectures explored in this part of the book are primarily used as feature extractors

• CNNs and RNNs as Lego bricks: one just needs to make sure that input and output dimensions of the different components match

# Ch13 Ngram Detectors: Convolutional Neural Networks

## CNN Overview

### CNN overview for NLP

• CBOW assigns the following two sentences the same representation (see the sketch below)

• “it was not good, it was actually quite bad”
• “it was not bad, it was actually quite good”
• Looking at ngrams is much more informative than looking at a bag-of-words

• This chapter introduces the convolution-and-pooling architecture (also called convolutional neural network, or CNN), which is tailored to this modeling problem
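A minimal numpy sketch of the CBOW point above (the toy vocabulary and random embeddings are mine, not from the book): averaging word vectors ignores order, so the two sentences get identical representations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb = 4  # toy embedding dimension
vocab = ["it", "was", "not", "good", ",", "actually", "quite", "bad"]
E = {w: rng.normal(size=d_emb) for w in vocab}  # toy word embeddings

sent_a = "it was not good , it was actually quite bad".split()
sent_b = "it was not bad , it was actually quite good".split()

cbow_a = np.mean([E[w] for w in sent_a], axis=0)
cbow_b = np.mean([E[w] for w in sent_b], axis=0)

# Same multiset of words -> identical averaged representation
print(np.allclose(cbow_a, cbow_b))  # True
```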

### Benefits of CNN for NLP

• CNNs will identify ngrams that are predictive for the task at hand, without the need to pre-specify an embedding vector for each possible ngram

• CNNs also allow sharing predictive behavior between ngrams that share similar components, even if the exact ngram was never seen during training

## Basic Convolution $$+$$ Pooling

### Convolution

• The main idea behind a convolution and pooling architecture for language tasks is to apply a non-linear (learned) function over each instantiation of a $$k$$-word sliding window over the sentence

• This function (also called “filter”) transforms a window of $$k$$ words into a scalar value

• Intuitively, when the sliding window of size $$k$$ is run over a sequence, the filter function learns to identify informative $$k$$grams
• Several such filters can be applied, resulting in an $$\ell$$-dimensional vector (each dimension corresponding to one filter) that captures important properties of the words in the window

### Pooling

• Then a “pooling” operation is used to combine the vectors resulting from the different windows into a single $$\ell$$-dimensional vector, by taking the max or the average value observed in each of the $$\ell$$ dimensions over the different windows

• The intention is to focus on the most important “features” in the sentence, regardless of their location
• The resulting $$\ell$$-dimensional vector is then fed further into a network that is used for prediction

### 1D convolutions over text

• A filter is a dot-product with a weight vector parameter $$\mathbf{u}$$, which is often followed by a nonlinear activation function

• Define the operation $$\oplus(\mathbf{w}_{i:i+k-1})$$ to be the concatenation of the vectors $$\mathbf{w}_i, \ldots, \mathbf{w}_{i+k-1}$$. The concatenated vector of the $$i$$th window is then $\mathbf{x}_i = \oplus(\mathbf{w}_{i:i+k-1}) = [\mathbf{w}_i; \mathbf{w}_{i+1}; \ldots; \mathbf{w}_{i+k-1}] \in \mathbb{R}^{k \cdot d_{\text{emb}}}$

• Apply the filter to each window-vector, resulting scalar value $$p_i$$: \begin{align*} &p_i = g(\mathbf{x}_i \cdot \mathbf{u})\\ &p_i \in \mathbb{R},\quad \mathbf{x}_i \in \mathbb{R}^{k \cdot d_{\text{emb}}}, \quad \mathbf{u} \in \mathbb{R}^{k \cdot d_{\text{emb}}} \end{align*}
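A minimal numpy sketch of a single filter applied to every window; the toy dimensions, the tanh choice for $$g$$, and the function name are my assumptions, not the book's.

```python
import numpy as np

def conv_single_filter(W, u, k, g=np.tanh):
    """Apply one filter u to every k-word window of W.

    W: (n, d_emb) matrix of word vectors w_1..w_n (rows).
    u: (k * d_emb,) filter weights.
    Returns p_1..p_{n-k+1}, one scalar per window (narrow convolution).
    """
    n = W.shape[0]
    # x_i = [w_i; ...; w_{i+k-1}], then p_i = g(x_i . u)
    return np.array([g(W[i:i + k].reshape(-1) @ u) for i in range(n - k + 1)])

rng = np.random.default_rng(1)
n, d_emb, k = 7, 4, 3
W = rng.normal(size=(n, d_emb))   # toy sentence of 7 word vectors
u = rng.normal(size=k * d_emb)    # one filter
p = conv_single_filter(W, u, k)
print(p.shape)                    # (5,) = n - k + 1 windows
```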

### Joint formulation of 1D convolutions

• It is customary to use $$\ell$$ different filters $$\mathbf{u}_1, \ldots, \mathbf{u}_{\ell}$$, which can be arranged into a matrix $$\mathbf{U}$$, and a bias vector $$\mathbf{b}$$ is often added

\begin{align*} &\mathbf{p}_i = g(\mathbf{x}_i \cdot \mathbf{U} + \mathbf{b})\\ &\mathbf{p}_i \in \mathbb{R}^{\ell},\quad \mathbf{x}_i \in \mathbb{R}^{k \cdot d_{\text{emb}}}, \quad \mathbf{U} \in \mathbb{R}^{k \cdot d_{\text{emb}} \times \ell}, \quad \mathbf{b} \in \mathbb{R}^{\ell} \end{align*}

• Ideally, each dimension captures a different kind of indicative information

• The main idea behind the convolution layer: to apply the same parameterized function over all $$k$$grams in the sequence. This creates a sequence of $$m$$ vectors, each representing a particular $$k$$gram in the sequence
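The same sketch in the joint matrix form, with the $$\ell$$ filters arranged as columns of $$\mathbf{U}$$ (again a toy numpy illustration under assumed dimensions):

```python
import numpy as np

def conv_layer(W, U, b, k, g=np.tanh):
    """Narrow 1D convolution with ell filters arranged as columns of U.

    W: (n, d_emb) word vectors; U: (k * d_emb, ell); b: (ell,).
    Returns P of shape (m, ell), m = n - k + 1, with row i = p_i = g(x_i U + b).
    """
    n = W.shape[0]
    X = np.stack([W[i:i + k].reshape(-1) for i in range(n - k + 1)])  # (m, k*d_emb)
    return g(X @ U + b)

rng = np.random.default_rng(2)
n, d_emb, k, ell = 7, 4, 3, 6
W = rng.normal(size=(n, d_emb))
U = rng.normal(size=(k * d_emb, ell))
b = rng.normal(size=ell)
P = conv_layer(W, U, b, k)
print(P.shape)  # (5, 6): one ell-dimensional vector per window
```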

### Narrow vs wide convolutions

• For a sentence of length $$n$$ with a window of size $$k$$

• Narrow convolutions: there are $$n-k+1$$ positions in which to start a window, and we get $$n-k+1$$ vectors $$\mathbf{p}_{1:n-k+1}$$

• Wide convolutions: an alternative is to pad the sentence with $$k-1$$ padding-words on each side, resulting in $$n+k-1$$ vectors $$\mathbf{p}_{1:n+k-1}$$

• We use $$m$$ to denote the number of resulting vectors
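A small sketch contrasting the two window counts; here I pad with all-zero vectors purely for illustration (the book pads with special padding-words, i.e. dedicated padding embeddings):

```python
import numpy as np

def windows(W, k, wide=False):
    """Return the stacked window vectors x_i for a narrow or wide convolution.

    Wide convolution pads with k-1 all-zero "padding word" vectors on each
    side (a learned padding embedding is another common choice).
    """
    if wide:
        pad = np.zeros((k - 1, W.shape[1]))
        W = np.vstack([pad, W, pad])
    n = W.shape[0]
    return np.stack([W[i:i + k].reshape(-1) for i in range(n - k + 1)])

n, d_emb, k = 9, 4, 3
W = np.ones((n, d_emb))
print(windows(W, k).shape[0])             # 7  = n - k + 1 (narrow)
print(windows(W, k, wide=True).shape[0])  # 11 = n + k - 1 (wide)
```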

### Vector pooling

• Applying the convolution over the text results in $$m$$ vectors $$\mathbf{p}_{1:m}$$, each $$\mathbf{p}_i \in \mathbb{R}^{\ell}$$

• These vectors are then combined (pooled) into a single vector $$\mathbf{c}\in \mathbb{R}^{\ell}$$ representing the entire sequence

• During training, the vector $$\mathbf{c}$$ is fed into downstream network layers (e.g., an MLP), culminating in an output layer which is used for prediction

### Different pooling methods

• Max pooling: the most common, taking the maximum value across each dimension $$j = 1, \ldots, \ell$$ $\mathbf{c}_{[j]} = \max_{1\leq i \leq m} \mathbf{p}_{i,[j]}$

• The effect of the max-pooling operation is to get the most salient information across window positions
• Average pooling $\mathbf{c} = \frac{1}{m}\sum_{i=1}^m \mathbf{p}_i$

• $$k$$-max pooling: the top $$k$$ values in each dimension are retained instead of only the best one, while preserving the order in which they appeared in the text
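Minimal numpy sketches of the three pooling operations over the matrix whose rows are $$\mathbf{p}_{1:m}$$; the toy numbers and function names are mine.

```python
import numpy as np

def max_pool(P):
    """c[j] = max_i P[i, j] -- most salient value per dimension."""
    return P.max(axis=0)

def avg_pool(P):
    """c = (1/m) * sum_i p_i."""
    return P.mean(axis=0)

def k_max_pool(P, k_top):
    """Keep the top k_top values per dimension, in their original text order."""
    m = P.shape[0]
    idx = np.sort(np.argsort(P, axis=0)[m - k_top:], axis=0)  # top positions, re-sorted by position
    return np.take_along_axis(P, idx, axis=0)                 # (k_top, ell)

P = np.array([[0.1, 0.9],
              [0.7, 0.2],
              [0.4, 0.8]])   # m=3 windows, ell=2 filters
print(max_pool(P))           # [0.7 0.9]
print(avg_pool(P))           # [0.4 0.6333...]
print(k_max_pool(P, 2))      # [[0.7 0.9]
                             #  [0.4 0.8]]
```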

### Variations

• Rather than a single convolutional layer, several convolutional layers may be applied in parallel

• For example, we may have four different convolutional layers, each with a different window size in the range 2-5, capturing $$k$$gram sequences of varying lengths
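A sketch of this variation under my toy dimensions: four parallel narrow convolutions with window sizes 2-5, each followed by max pooling, with the pooled vectors concatenated into a single sentence representation.

```python
import numpy as np

def conv_and_pool(W, U, b, k, g=np.tanh):
    """Narrow convolution with window size k followed by max pooling."""
    m = W.shape[0] - k + 1
    X = np.stack([W[i:i + k].reshape(-1) for i in range(m)])
    return g(X @ U + b).max(axis=0)           # (ell,)

rng = np.random.default_rng(3)
n, d_emb, ell = 12, 4, 5
W = rng.normal(size=(n, d_emb))               # toy sentence

# Four parallel convolution layers with window sizes 2..5; their pooled
# outputs are concatenated into one representation of the sentence.
pooled = []
for k in (2, 3, 4, 5):
    U = rng.normal(size=(k * d_emb, ell))
    b = rng.normal(size=ell)
    pooled.append(conv_and_pool(W, U, b, k))
c = np.concatenate(pooled)                    # (4 * ell,) = 20-dimensional
print(c.shape)
```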

## Hierarchical Convolutions

### Hierarchical convolutions

• The 1D convolution approach described so far can be thought of as an ngram detector: a convolution layer with a window of size $$k$$ is learning to identify indicative $$k$$-grams in the input

$\mathbf{p}_{1:m} = \text{CONV}^k_{\mathbf{U}, \mathbf{b}}(\mathbf{w}_{1:n})$

• We can extend this into a hierarchy of convolutional layers with $$r$$ layers that feed into each other \begin{align*} \mathbf{p}_{1:m_1}^1 & =\text{CONV}^{k_1}_{\mathbf{U}^1,\mathbf{b}^1}(\mathbf{w}_{1:n})\\ \mathbf{p}_{1:m_2}^2 & =\text{CONV}^{k_2}_{\mathbf{U}^2,\mathbf{b}^2}(\mathbf{p}_{1:m_1}^1)\\ &\cdots\\ \mathbf{p}_{1:m_r}^r & =\text{CONV}^{k_r}_{\mathbf{U}^r,\mathbf{b}^r}(\mathbf{p}_{1:m_{r-1}}^{r-1}) \end{align*}
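A numpy sketch of stacking $$r=3$$ such layers (toy dimensions assumed; each layer's filter matrix is sized to consume the previous layer's $$\ell$$-dimensional vectors):

```python
import numpy as np

def conv(P_in, U, b, k, g=np.tanh):
    """One narrow convolution layer: a sequence of vectors in, a shorter sequence out."""
    m = P_in.shape[0] - k + 1
    X = np.stack([P_in[i:i + k].reshape(-1) for i in range(m)])
    return g(X @ U + b)

rng = np.random.default_rng(4)
n, d_emb, ell, k = 12, 4, 6, 3
W = rng.normal(size=(n, d_emb))

# Stack of r = 3 convolution layers feeding into each other.
P = W
for layer in range(3):
    d_in = P.shape[1]                        # d_emb for layer 1, ell afterwards
    U = rng.normal(size=(k * d_in, ell))
    b = rng.normal(size=ell)
    P = conv(P, U, b, k)
    print(f"layer {layer + 1}: {P.shape}")   # (10, 6), (8, 6), (6, 6)
```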

### Hierarchical convolutions, continued

• For $$r$$ layers with a window of size $$k$$, each vector $$\mathbf{p}_i^r$$ will be sensitive to a window of $$r(k-1)+1$$ words

• Moreover, the vector $$\mathbf{p}_i^r$$ can be sensitive to gappy-ngrams of $$k+r-1$$ words, potentially capturing patterns such as “not ___ good” or “obvious ___ predictable ___ plot”, where ___ stands for a short sequence of words
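• Worked example (mine, to make the formula concrete): with $$k=3$$ and $$r=3$$ layers, each $$\mathbf{p}_i^3$$ is sensitive to a window of $$r(k-1)+1 = 3\cdot(3-1)+1 = 7$$ consecutive words, even though every individual filter only ever looks at $$3$$ vectors at a time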

### Strides

• So far, the convolution operation is applied to each $$k$$-word window in the sequence, i.e., windows starting at indices $$1, 2, 3, \ldots$$. This is said to have a stride of size $$1$$

• Larger strides are also possible. For example, with a stride of size $$2$$, the convolution operation will be applied to windows starting at indices $$1, 3, 5, \ldots$$

• Convolution with window size $$k$$ and stride size $$s$$: \begin{align*} \mathbf{p}_{1:m} &= \text{CONV}^{k,s}_{\mathbf{U}, \mathbf{b}}(\mathbf{w}_{1:n})\\ \mathbf{p}_i &= g\left(\oplus \left( \mathbf{w}_{1+(i-1)\cdot s \,:\, (i-1)\cdot s + k} \right) \cdot \mathbf{U} + \mathbf{b} \right) \end{align*}
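A numpy sketch of the strided version (toy dimensions are mine); note how the number of windows shrinks roughly by a factor of $$s$$:

```python
import numpy as np

def strided_conv(W, U, b, k, s, g=np.tanh):
    """Narrow convolution with window size k and stride s.

    Window i covers words 1+(i-1)s .. (i-1)s+k (1-based), i.e. rows
    start .. start+k-1 below; with s > 1 consecutive windows overlap less.
    """
    n = W.shape[0]
    starts = range(0, n - k + 1, s)
    X = np.stack([W[start:start + k].reshape(-1) for start in starts])
    return g(X @ U + b)

rng = np.random.default_rng(5)
n, d_emb, k, ell = 11, 4, 3, 6
W = rng.normal(size=(n, d_emb))
U = rng.normal(size=(k * d_emb, ell))
b = rng.normal(size=ell)
print(strided_conv(W, U, b, k, s=1).shape)  # (9, 6): windows start at 1, 2, ..., 9
print(strided_conv(W, U, b, k, s=2).shape)  # (5, 6): windows start at 1, 3, 5, 7, 9
```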

### References

• Goldberg, Yoav. (2017). Neural Network Methods for Natural Language Processing, Morgan & Claypool