
# Ch1 Introduction

### Three most common types of NNs in NLP

Feed-forward networks, i.e., multi-layer perceptrons (MLPs), or fully connected layers

- Work with fixed-size inputs
- Or with variable-length inputs where the order of the elements can be disregarded (continuous bag of words)

Recurrent neural networks (RNNs)

- Specialized models for sequential data
- Produce a fixed-size vector that summarizes the sequence
- Don't require fixed-size inputs (e.g., input sequence lengths can vary)

Convolutional feed-forward networks (CNNs)

- Good at extracting local patterns in the data

### About neural networks

Some neural network techniques (e.g., the MLP) are simple generalizations of linear models and can be used as almost drop-in replacements for linear classifiers

RNNs and CNNs are rarely used as standalone components; instead, they extract features that are fed into other network components and are trained to work in tandem with them.

### Success stories

Multi-layer feed-forward networks can provide competitive results on sentiment classification and factoid question answering

Networks with convolutional and pooling layers are useful for classification tasks in which we expect to find strong local clues regarding class membership, but these clues can appear in different places in the input

- Convolutional and pooling layers allow the model to learn to find such local indicators, regardless of their position (see the sketch below)
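
A minimal NumPy sketch of this idea, assuming a toy window size, embedding dimension, and filter count; the `conv_and_pool` name and all shapes are illustrative, not from the book:

```python
import numpy as np

def conv_and_pool(embeddings, W, b):
    """1D convolution over a sequence of word vectors, followed by
    max-pooling over time. embeddings: (seq_len, emb_dim),
    W: (window * emb_dim, num_filters), b: (num_filters,)."""
    window = W.shape[0] // embeddings.shape[1]
    feats = []
    for i in range(embeddings.shape[0] - window + 1):
        x = embeddings[i:i + window].reshape(-1)   # concatenate the window
        feats.append(np.tanh(x @ W + b))           # one feature vector per position
    # max over positions: the strongest activation of each filter,
    # wherever in the input it occurred
    return np.max(np.stack(feats), axis=0)

rng = np.random.default_rng(0)
emb = rng.normal(size=(7, 4))                      # 7 words, 4-dim embeddings
W, b = rng.normal(size=(3 * 4, 5)), np.zeros(5)    # window of 3, 5 filters
print(conv_and_pool(emb, W, b).shape)              # (5,)
```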

### Note

**In this book, vectors are assumed to be row vectors**

# Ch2 Linear Models

### One-hot and dense vector representations (p23)

The input \(\mathbf{x}\) in the language classification example contains the normalized bigram counts of the document \(D\)

- \(D_{[i]}\) is the bigram at document position \(i\)
- Each vector \(\mathbf{x}^{D_{[i]}} \in \mathbb{R}^{d}\) is a one-hot vector

The following \(\mathbf{x}\) is an averaged bag of words, or just bag of words: \[ \mathbf{x} = \frac{1}{|D|} \sum_{i=1}^{|D|} \mathbf{x}^{D_{[i]}} \]

The bag-of-words representation disregards the order of the words
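
A minimal sketch of the averaged bag of words above, assuming a toy bigram vocabulary; `vocab`, `doc`, and the function name are illustrative:

```python
import numpy as np

def averaged_bag_of_words(doc, vocab):
    """Average of the one-hot vectors x^{D[i]} over all positions i in doc.
    Entry j of the result is the normalized count of vocab item j."""
    x = np.zeros(len(vocab))
    for token in doc:               # token plays the role of D[i]
        x[vocab[token]] += 1.0      # add the one-hot vector for D[i]
    return x / len(doc)             # divide by |D|

vocab = {"ab": 0, "ba": 1, "ac": 2}        # toy bigram vocabulary
doc = ["ab", "ba", "ab", "ac"]             # D as a list of bigrams
print(averaged_bag_of_words(doc, vocab))   # [0.5  0.25 0.25]
```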

### Minibatch stochastic gradient descent (SGD) algorithm (p32)

Goal: set the parameters \(\Theta\) to minimize the total loss \[ \mathcal{L}(\Theta) = \sum_{i=1}^{n} L\left(f(\mathbf{x}_i; \Theta), \mathbf{y}_i \right) \] over the training set

Learning rate: \(\eta_t\)

Minibatch size \(m\) can vary from \(m=1\) (pure SGD) to \(m=n\) (full-batch gradient descent)

After the inner loop, \(\hat{\mathbf{g}}\) contains the gradient estimate
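
A minimal sketch of the minibatch SGD loop, using squared loss on a linear model as a stand-in for \(f\) and \(L\); all names and hyperparameter values are illustrative:

```python
import numpy as np

def minibatch_sgd(X, y, theta, grad_fn, eta=0.1, m=8, epochs=10):
    """Minibatch SGD: accumulate per-example gradients over each
    minibatch into g_hat, then step against the averaged estimate.
    grad_fn(x, y, theta) returns the gradient of L(f(x; theta), y)."""
    n = X.shape[0]
    for _ in range(epochs):
        idx = np.random.permutation(n)
        for start in range(0, n, m):
            batch = idx[start:start + m]
            # inner loop: build the gradient estimate g_hat
            g_hat = np.zeros_like(theta)
            for i in batch:
                g_hat += grad_fn(X[i], y[i], theta)
            theta -= eta * g_hat / len(batch)
    return theta

# toy usage: squared loss for a linear model f(x; theta) = x . theta
grad = lambda x, y, th: 2 * (x @ th - y) * x
X = np.random.randn(100, 3)
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
print(minibatch_sgd(X, y, np.zeros(3), grad))   # approaches true_theta
```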

# Ch4 Feed-Forward Neural Networks

### Common nonlinearities (p45)

Sigmoid: currently considered deprecated for use in internal layers of neural networks

Hyperbolic tangent (tanh): common

ReLU: common
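
The three nonlinearities as NumPy one-liners (a sketch, not code from the book):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); saturates at both ends

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # zero for negatives, identity for positives

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```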

### Regularization and dropout

L2 regularization, also called weight decay, is effective, and tuning the regularization strength \(\lambda\) is advisable

Dropout training: for each training example in stochastic-gradient training, randomly drop (set to zero) half of the neurons in the network

Dropout is effective in NLP applications of NNs
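
A sketch of dropout applied to a layer's activations; the \(1/(1-p)\) rescaling is the common "inverted dropout" variant, which is an assumption beyond the description above:

```python
import numpy as np

def dropout(h, p=0.5, train=True):
    """Dropout on a layer's activations h. During training, each neuron
    is zeroed independently with probability p. The 1/(1-p) rescaling
    keeps the expected activation unchanged, so no correction is needed
    at test time."""
    if not train:
        return h
    mask = (np.random.rand(*h.shape) >= p).astype(h.dtype)
    return h * mask / (1.0 - p)

h = np.ones(10)
print(dropout(h))               # roughly half the entries zeroed, rest scaled to 2.0
print(dropout(h, train=False))  # unchanged at test time
```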

# Ch5 Neural Network Training

### Comment

The objective function for nonlinear neural networks is not convex, and gradient-based methods may get stuck in a local minimum

Still, gradient-based methods produce good results in practice

Choice of optimization algorithm: the book author likes Adam, as it is very effective and relatively robust to the choice of learning rate

### Initialization

It is advised to run several restarts of training from different random initializations and choose the best model based on a development set

Xavier initialization: sample the weights from \(W \sim U\left[ -\frac{\sqrt{6}}{\sqrt{d_{\text{in}} + d_{\text{out}}}}, +\frac{\sqrt{6}}{\sqrt{d_{\text{in}} + d_{\text{out}}}} \right]\)

When using ReLU nonlinearities, the following initialization may work better than Xavier: sample the weights from a zero-mean Gaussian distribution whose standard deviation is \(\sqrt{2/d_{\text{in}}}\)
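
A sketch of both initializations in NumPy; the function names and layer sizes are illustrative:

```python
import numpy as np

def xavier_init(d_in, d_out):
    """Xavier/Glorot: uniform in [-sqrt(6)/sqrt(d_in + d_out), +sqrt(6)/sqrt(d_in + d_out)]."""
    bound = np.sqrt(6.0 / (d_in + d_out))
    return np.random.uniform(-bound, bound, size=(d_in, d_out))

def he_init(d_in, d_out):
    """For ReLU layers: zero-mean Gaussian with sd = sqrt(2 / d_in)."""
    return np.random.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out))

W1 = xavier_init(300, 100)   # e.g., a tanh hidden layer
W2 = he_init(100, 50)        # e.g., a ReLU hidden layer
```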

### Vanishing and exploding gradients

Vanishing gradients: two remedies

- Batch normalization: for every minibatch, normalize the inputs to each of the network layers to have zero mean and unit variance (a forward-pass sketch follows this list)
- Use specialized architectures that are designed to assist in gradient flow (e.g., LSTM and GRU)
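
A minimal forward-pass sketch of batch normalization for a fully connected layer; it omits the running statistics used at test time, and all names are illustrative:

```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    """Normalize each unit's activations to zero mean / unit variance
    over the minibatch, then apply a learned scale and shift.
    H: (batch, dim)."""
    mu = H.mean(axis=0)
    var = H.var(axis=0)
    H_hat = (H - mu) / np.sqrt(var + eps)
    return gamma * H_hat + beta

H = np.random.randn(32, 100) * 5 + 3               # poorly scaled activations
out = batch_norm(H, np.ones(100), np.zeros(100))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # ~0 and ~1 per unit
```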

Exploding gradients: clip the gradients if their norm exceeds a given threshold
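
A sketch of norm-based clipping; the threshold value is illustrative:

```python
import numpy as np

def clip_gradient(g, threshold=5.0):
    """Rescale g so its L2 norm never exceeds the threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([30.0, 40.0])   # norm 50, well above the threshold
print(clip_gradient(g))      # [3. 4.], norm clipped to 5
```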

### Learning rate

Experiment with initial learning rates in the range \([0, 1]\), e.g., 0.001, 0.01, 0.1, 1.

Learning rate scheduling decreases the rate as a function of the number of observed minibatches

A common schedule is dividing the initial learning rate by the iteration number

Bottou’s recommendation: \[ \eta_t = \frac{\eta_0}{1 + \eta_0 \lambda t} \]
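
Bottou's schedule in code; the values of \(\eta_0\) and \(\lambda\) are illustrative:

```python
def bottou_lr(eta0, lam, t):
    """Bottou's schedule: eta_t = eta_0 / (1 + eta_0 * lambda * t)."""
    return eta0 / (1.0 + eta0 * lam * t)

for t in [0, 100, 1000, 10000]:
    print(t, bottou_lr(eta0=0.1, lam=0.01, t=t))
# the rate decays smoothly: 0.1, ~0.0909, 0.05, ~0.0091
```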

### References

- Goldberg, Yoav (2017). *Neural Network Methods for Natural Language Processing*. Morgan & Claypool.