Ch1 Introduction

Three most common types of NNs in NLP

Feed-forward networks, i.e., multi-layer perceptrons (MLPs), or fully connected layers
- Work with fixed-size inputs
- Or with variable-length inputs in which the order of the elements can be disregarded (continuous bag of words)
Recurrent neural networks (RNNs)
- Specialized models for sequential data
- Produce a fixed size vector that summarizes the sequence
- Do not require fixed-size input (e.g., the length of the input sequence can vary)
Convolutional feed-forward networks (CNNs)
- Good at extracting local patterns in the data

About neural networks

Some neural network techniques (e.g., MLPs) are simple generalizations of linear models and can be used as almost drop-in replacements for linear classifiers
RNNs and CNNs are rarely used as standalone components. They serve as feature extractors whose outputs are fed into other network components, and they are trained to work in tandem with those components.

Success stories

Multi-layer feed-forward networks can provide competitive results on sentiment classification and factoid question answering
Networks with convolutional and pooling layers are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input
- Convolutional and pooling layers allow the model to learn to find such local indicators, regardless of their position

Note

In this book, vectors are assumed to be row vectors

Ch2 Linear Models

One-hot and dense vector representations (p23)

The input \(\mathbf{x}\) in the language classification example contains the normalized bigram counts in the document \(D\)
\(D_{[i]}\) is the bigram at document position \(i\)
Each vector \(\mathbf{x}^{D_{[i]}} \in \mathbb{R}^{d}\) is a one-hot vector
The following \(\mathbf{x}\) is an averaged bag of words, or just bag of words: \[ \mathbf{x} = \frac{1}{|D|} \sum_{i=1}^{|D|} \mathbf{x}^{D_{[i]}} \]
A bag of words ignores the order of the words
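As a minimal sketch of the averaged bag above, the following builds \(\mathbf{x}\) for character bigrams using only the Python standard library. The function name `averaged_bag`, the toy document, and the vocabulary built from that single document are illustrative assumptions, not taken from the book.

```python
def averaged_bag(items, vocab):
    """Average of one-hot vectors, i.e., normalized counts over a fixed vocabulary."""
    index = {v: i for i, v in enumerate(vocab)}
    x = [0.0] * len(vocab)
    for item in items:
        # Adding 1/|D| at the item's index is the same as averaging one-hot vectors.
        x[index[item]] += 1.0 / len(items)
    return x

doc = "abab"                                            # toy document (assumption)
bigrams = [doc[i:i + 2] for i in range(len(doc) - 1)]   # ['ab', 'ba', 'ab']
vocab = sorted(set(bigrams))                            # ['ab', 'ba']
x = averaged_bag(bigrams, vocab)
print(x)
```

Reversing `bigrams` yields the same vector, which is the sense in which the representation discards order.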

Minibatch stochastic gradient descent (SGD) algorithm (p32)

Goal: set the parameters \(\Theta\) to minimize the total loss \[ \mathcal{L}(\Theta) = \sum_{i=1}^n L\left(f(\mathbf{x}_i; \Theta), \mathbf{y}_i \right) \] over the training set
Learning rate: \(\eta_t\)
Minibatch size \(m\), can vary from \(m=1\) to \(m=n\)
After the inner loop over the minibatch, \(\hat{\mathbf{g}}\) contains the gradient estimate, which is then used to update \(\Theta\)
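The loop structure described above can be sketched as follows. This is an illustrative toy, not the book's pseudocode: the names `minibatch_sgd`, `grad_fn`, and `squared_loss_grad` are hypothetical, the learning rate is held constant instead of following a schedule \(\eta_t\), and the model is a one-dimensional linear regression chosen only so that the result is easy to check.

```python
import random

def minibatch_sgd(data, grad_fn, theta, lr=0.1, m=4, epochs=200, seed=0):
    """Minibatch SGD with a constant learning rate (an assumption of this sketch)."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), m):
            batch = data[start:start + m]
            g_hat = [0.0] * len(theta)
            # Inner loop: accumulate the average gradient over the minibatch.
            for x, y in batch:
                g = grad_fn(theta, x, y)
                g_hat = [a + b / len(batch) for a, b in zip(g_hat, g)]
            # After the inner loop, g_hat is the gradient estimate.
            theta = [t - lr * g for t, g in zip(theta, g_hat)]
    return theta

def squared_loss_grad(theta, x, y):
    """Gradient of (w*x + b - y)^2 with respect to (w, b)."""
    w, b = theta
    err = (w * x + b) - y
    return [2.0 * err * x, 2.0 * err]

# Noise-free toy data generated from y = 3x + 1.
data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = minibatch_sgd(data, squared_loss_grad, [0.0, 0.0])
print(w, b)
```

Setting `m=1` gives plain per-example SGD, and `m=len(data)` gives full-batch gradient descent.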

Ch4 Feed-Forward Neural Networks

A feed-forward neural network with one hidden layer

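\[ \text{NN}_{\text{MLP1}}(\mathbf{x}) = g(\mathbf{x}\mathbf{W}^1 + \mathbf{b}^1)\mathbf{W}^2 + \mathbf{b}^2 \]
\(g\) is a nonlinear function, applied element-wise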
The first layer transforms the data into a good representation, while the second layer applies a linear classifier to that representation
Layers resulting from linear transformations are called fully connected, or affine
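A minimal sketch of such a network's forward pass, assuming the usual form for the row-vector convention: an affine map, an element-wise nonlinearity \(g\), then a second affine map. The helper names (`vec_mat`, `add`, `mlp1`) and the toy weights are hypothetical and chosen only for illustration.

```python
import math

def vec_mat(x, W):
    """Row vector x (length d_in) times matrix W (d_in rows, d_out columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def mlp1(x, W1, b1, W2, b2, g=math.tanh):
    # First layer: affine map followed by the nonlinearity g.
    h = [g(z) for z in add(vec_mat(x, W1), b1)]
    # Second layer: a linear classifier applied to the hidden representation.
    return add(vec_mat(h, W2), b2)

# Toy parameters (assumptions): 2 inputs, 3 hidden units, 1 output.
x = [1.0, -1.0]
W1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0], [1.0], [1.0]]
b2 = [0.5]
y = mlp1(x, W1, b1, W2, b2)
print(y)
```

Replacing `g` with the identity collapses the two layers into a single affine map, which is why the nonlinearity is essential.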

Common nonlinearities (p45)

Sigmoid (currently considered to be deprecated for use in internal layers of NN)
Hyperbolic tangent (tanh): common
ReLU: common
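For reference, the three nonlinearities can be written as scalar functions. This is a sketch using the standard definitions, \(\sigma(z) = 1/(1+e^{-z})\) and \(\text{ReLU}(z) = \max(0, z)\); the two-branch form of `sigmoid` is an implementation choice made here to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, mapping the reals into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # avoids overflow when z is very negative
    return e / (1.0 + e)

def relu(z):
    """Rectified linear unit: clips negative values to zero."""
    return max(0.0, z)

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), math.tanh(z), relu(z))
```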

Regularization and dropout

L2 regularization, also called weight decay, is effective, and tuning the regularization strength \(\lambda\) is advisable
Dropout training: randomly drop (set to zero) half of the neurons in the network for each training example during stochastic-gradient training
Dropout is effective in NLP applications of NNs
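Both regularizers can be sketched in a few lines. The names `dropout` and `l2_penalty` are hypothetical. Two choices here are assumptions not stated in these notes: the surviving activations are rescaled by \(1/(1-p)\) (so-called inverted dropout), so that no rescaling is needed at test time, and the L2 penalty is taken to be \(\lambda \sum_w w^2\).

```python
import random

def dropout(h, p_drop=0.5, rng=random, train=True):
    """Zero each activation with probability p_drop during training."""
    if not train:
        return list(h)       # no units are dropped at test time
    keep = 1.0 - p_drop
    # Inverted-dropout scaling by 1/keep is an assumption of this sketch.
    return [v / keep if rng.random() < keep else 0.0 for v in h]

def l2_penalty(weights, lam):
    """L2 regularization term lam * sum(w^2), added to the training loss."""
    return lam * sum(w * w for w in weights)

rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, 0.5, rng))
print(l2_penalty([3.0, 4.0], 0.1))
```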

Ch5 Neural Network Training

Comment

The objective function for nonlinear neural networks is not convex, and gradient-based methods may get stuck in a local minimum
Still, gradient-based methods produce good results in practice
Choice of optimization algorithm: the book author likes Adam, as it is very effective and relatively robust to the choice of learning rate

Initialization

It is advisable to run several restarts of the training, each starting from a different random initialization, and to choose the best one based on a development set
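Xavier initialization: \[ \mathbf{W} \sim U\left[ -\frac{\sqrt{6}}{\sqrt{d_{\text{in}} + d_{\text{out}}}}, +\frac{\sqrt{6}}{\sqrt{d_{\text{in}} + d_{\text{out}}}} \right] \]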
When using ReLU nonlinearities, the following Gaussian initialization may work better than Xavier: sample the weights from a zero-mean Gaussian distribution whose standard deviation is \(\sqrt{2/d_{\text{in}}}\)
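A sketch of both initializers for a \(d_{\text{in}} \times d_{\text{out}}\) weight matrix follows. It assumes the commonly cited uniform form of Xavier initialization, with bound \(\sqrt{6}/\sqrt{d_{\text{in}} + d_{\text{out}}}\), and the Gaussian scheme with standard deviation \(\sqrt{2/d_{\text{in}}}\) described above. The function names are hypothetical.

```python
import math
import random

def xavier_uniform(d_in, d_out, rng):
    """Uniform samples in [-bound, bound], bound = sqrt(6) / sqrt(d_in + d_out)."""
    bound = math.sqrt(6.0) / math.sqrt(d_in + d_out)
    return [[rng.uniform(-bound, bound) for _ in range(d_out)] for _ in range(d_in)]

def relu_gaussian(d_in, d_out, rng):
    """Zero-mean Gaussian samples with standard deviation sqrt(2 / d_in)."""
    sd = math.sqrt(2.0 / d_in)
    return [[rng.gauss(0.0, sd) for _ in range(d_out)] for _ in range(d_in)]

# Different seeds give the different random restarts mentioned above.
for seed in (0, 1, 2):
    rng = random.Random(seed)
    W = xavier_uniform(4, 3, rng)
    print(seed, W[0])
```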

Vanishing and exploding gradients

Vanishing gradients: possible remedies
- Batch normalization, i.e., for every minibatch, normalize the inputs to each of the network layers to have zero mean and unit variance
- Or use specialized architectures that are designed to assist gradient flow (e.g., LSTM and GRU)
Exploding gradients: clip the gradients if their norm exceeds a given threshold
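Norm-based clipping can be sketched as below. The function name `clip_by_norm` is hypothetical, and the gradient is represented as a flat list of floats for simplicity.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad to have L2 norm equal to threshold if its norm exceeds it."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)        # small gradients pass through unchanged

print(clip_by_norm([3.0, 4.0], 1.0))   # norm 5 exceeds the threshold, so it is rescaled
print(clip_by_norm([0.3, 0.4], 1.0))   # norm 0.5 is below the threshold
```

Clipping changes only the length of the gradient, not its direction.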

Learning rate

Experiment with initial learning rates in the range \([0, 1]\), e.g., 0.001, 0.01, 0.1, 1.
Learning rate scheduling decreases the rate as a function of the number of observed minibatches
- A common schedule is dividing the initial learning rate by the iteration number
- Bottou’s recommendation: \[ \eta_t = \frac{\eta_0}{1 + \eta_0 \lambda t} \]
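The two schedules above can be compared with a short sketch. The function names are hypothetical; \(\lambda\) is treated here as an extra hyperparameter of the schedule, and the values of \(\eta_0\) and \(\lambda\) are arbitrary choices for illustration.

```python
def lr_inverse_iteration(eta0, t):
    """Initial rate divided by the iteration number (t starts at 1)."""
    return eta0 / t

def lr_bottou(eta0, lam, t):
    """eta_t = eta0 / (1 + eta0 * lam * t)."""
    return eta0 / (1.0 + eta0 * lam * t)

eta0, lam = 0.1, 0.01      # illustrative values (assumptions)
for t in (1, 10, 100, 1000):
    print(t, lr_inverse_iteration(eta0, t), lr_bottou(eta0, lam, t))
```

With a small product \(\eta_0 \lambda\), the second schedule decays much more slowly than plain division by \(t\).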

References

Goldberg, Yoav. (2017). Neural Network Methods for Natural Language Processing, Morgan & Claypool

Book Notes: Neural Network Methods for Natural Language Processing -- Part 1 Supervised Classification and Feed-forward Neural Networks
