
# Ch9 Language Modeling

## Language Modeling with the Markov Assumption

### Language modeling with the Markov assumption

The task of language modeling is to assign a probability to any sequence of words \(w_{1:n}\), i.e., to estimate \[ P(w_{1:n}) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_{1:2}) \cdots P(w_n \mid w_{1:n-1}) \]

Non-RNN language models make use of the Markov assumption: the future is independent of the past given the present

A \(k\)th order Markov assumption assumes \[ P(w_{i+1} \mid w_{1:i}) \approx P(w_{i+1} \mid w_{i-k+1:i}) \]

Thus, the probability of the sentence becomes \[ P(w_{1:n}) = \prod_{i=1}^n P(w_i \mid w_{i-k:i-1}) \] where \(w_{-k}, \ldots, w_0\) are special padding symbols
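As a minimal sketch, a \(k\)th order model can be estimated from corpus counts by MLE; the helpers `train_ngram_lm` and `sentence_prob` below are illustrative, not a reference implementation:

```python
from collections import Counter

def train_ngram_lm(corpus, k=2, pad="<pad>"):
    """Collect MLE counts for a k-th order model from a list of
    tokenized sentences (hypothetical helper, for illustration)."""
    ctx_counts, full_counts = Counter(), Counter()
    for sent in corpus:
        padded = [pad] * k + list(sent)  # special padding symbols
        for i in range(k, len(padded)):
            ctx = tuple(padded[i - k:i])
            ctx_counts[ctx] += 1
            full_counts[ctx + (padded[i],)] += 1
    return ctx_counts, full_counts

def sentence_prob(ctx_counts, full_counts, sent, k=2, pad="<pad>"):
    """P(w_1:n) = prod_i P(w_i | w_{i-k:i-1}) under the MLE counts."""
    p = 1.0
    padded = [pad] * k + list(sent)
    for i in range(k, len(padded)):
        ctx = tuple(padded[i - k:i])
        if ctx_counts[ctx] == 0:
            return 0.0  # unseen context: MLE assigns zero probability
        p *= full_counts[ctx + (padded[i],)] / ctx_counts[ctx]
    return p
```

Note that unsmoothed MLE assigns zero probability to unseen events; practical \(k\)-gram models add smoothing.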

This chapter discusses \(k\)th-order language models. Chapter 14 will discuss language models without the Markov assumption

### Perplexity: evaluation of language models

An intrinsic evaluation of language models is perplexity over unseen sentences

Given a text corpus of \(n\) words \(w_1, \ldots, w_n\) and a language model function \(LM\), the perplexity of LM with respect to the corpus is \[ 2^{-\frac{1}{n} \sum_{i=1}^n \log_2 LM(w_i \mid w_{1:i-1})} \]

Good language models will assign high probabilities to the events in the corpus, resulting in lower perplexity values

Perplexities are corpus specific, so perplexities of two language models are only comparable with respect to the same evaluation corpus
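As a sketch, the perplexity above can be computed directly from the per-token probabilities an LM assigns to the corpus (the input list is an assumed interface):

```python
import math

def perplexity(token_probs):
    """2^(-1/n * sum_i log2 p_i) for the probabilities p_i that some
    LM assigned to each of the n corpus tokens."""
    n = len(token_probs)
    return 2.0 ** (-sum(math.log2(p) for p in token_probs) / n)
```

For example, a model that assigns uniform probability \(1/4\) to every token has perplexity 4.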

## Neural Language Models

### Neural language models

- Input to the neural network is a \(k\)-gram of words \(w_{1:k}\), and the output is a probability distribution over the next word

### Approximation of the softmax operation in cross entropy

Cross entropy loss works very well, but requires the use of a costly softmax operation which can be prohibitive for very large vocabularies

- This motivates the use of alternative losses and/or approximations:
  - Hierarchical softmax (using a tree)
  - Self-normalizing approaches, e.g., noise-contrastive estimation (NCE)
  - Sampling approaches

NCE: replaces the cross-entropy objective with a collection of binary classification problems, requiring the evaluation of the assigned scores for \(k\) random words rather than the entire vocabulary

### Using language models for generation

Predict a probability distribution over the first word conditioned on the start symbol, and draw a random word according to the predicted distribution

Then predict a probability distribution over the second word conditioned on the first

And so on, until predicting the end-of-sequence \(</s>\) symbol

Already with \(k=3\) this produces very passable text, and the quality improves with higher orders
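The generation loop can be sketched as follows, assuming a hypothetical `lm_probs` function that maps a \(k\)-word context tuple to a distribution over the next word:

```python
import random

def generate(lm_probs, k, start="<s>", end="</s>", max_len=50):
    """Sample words left to right until </s>; `lm_probs` is an assumed
    interface mapping a k-word context tuple to a dict {word: prob}."""
    out = [start] * k  # pad the initial context with start symbols
    while len(out) < max_len + k:
        dist = lm_probs(tuple(out[-k:]))
        words = list(dist)
        w = random.choices(words, weights=[dist[x] for x in words])[0]
        if w == end:
            break
        out.append(w)
    return out[k:]
```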

Another option is to use beam search in order to find a sequence with a globally high probability

# Ch10 Pre-trained Word Representations

### Random initialization of word embedding models

The Word2Vec model initializes word vectors to uniformly sampled numbers in the range \(\left[-\frac{1}{2d}, \frac{1}{2d}\right]\)

Another option is Xavier initialization, initializing with uniformly sampled numbers in the range \(\left[-\frac{\sqrt{6}}{\sqrt{d}}, \frac{\sqrt{6}}{\sqrt{d}}\right]\)

### Unsupervised training of word embedding vectors

Key idea: one would like “similar” words to have similar embedding vectors

Word similarity is from the distributional hypothesis: words are similar if they appear in similar contexts

The different methods all create supervised training instances in which the goal is to

- either predict the word from its context,
- or predict the context from the word

An important benefit of training word embeddings on large amounts of unannotated data: it provides vector representations for words that do not appear in the supervised training set

## Word Similarity Matrices and SVD

### Word-context matrices

Denote \(V_W\) the set of words and \(V_C\) the set of possible contexts

Assume that \(w_i\) is the \(i\)th word in the words vocabulary and \(c_j\) is the \(j\)th word in the context vocabulary

The matrix \(\mathbf{M}^f \in \mathbb{R}^{|V_W| \times |V_C|}\) is the word-context matrix, with \(f\) being an association measure of the strength between a word and a context \[ \mathbf{M}^f_{[i,j]} = f(w_i, c_j) \]

### Similarity measures

- When words are represented as vectors, one can compute similarity by the cosine similarity \[\begin{align*} \text{sim}_{\text{cos}}(\mathbf{u}, \mathbf{v}) &= \frac{\mathbf{u} \cdot \mathbf{v}} {\|\mathbf{u}\|_2 \|\mathbf{v}\|_2}\\ &=\frac{\sum_i\mathbf{u}_{[i]} \cdot \mathbf{v}_{[i]}} {\sqrt{\sum_i(\mathbf{u}_{[i]})^2} \sqrt{\sum_i(\mathbf{v}_{[i]})^2}} \end{align*}\]
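A minimal implementation of the formula above:

```python
import numpy as np

def sim_cos(u, v):
    """Cosine similarity between two word vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```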

### Word-context weighting and PMI

Denote by \(\#(w, c)\) the number of times word \(w\) occurred in the context \(c\) in the corpus \(D\), and let \(|D|\) be the corpus size

Pointwise mutual information (PMI) \[ \text{PMI}(w, c) = \log \frac{P(w, c)}{P(w) P(c)} = \log \frac{\#(w, c) |D|}{\#(w) \#(c)} \]

To resolve the \(\log 0\) issue for pairs \((w, c)\) never observed in the corpus, we can use the positive PMI (PPMI) \[ \text{PPMI}(w,c) = \max\{\text{PMI}(w, c), 0\} \]

A deficiency of PMI: it tends to assign high value to rare events

Solution: it is advisable to apply a count threshold (to discount rare events) before using the PMI metric
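A sketch of computing PPMI from a list of observed \((w, c)\) pairs, with the count threshold mentioned above; the `ppmi` helper is illustrative, not a library function:

```python
import math
from collections import Counter

def ppmi(pairs, threshold=1):
    """PPMI(w,c) = max(log(#(w,c)|D| / (#(w)#(c))), 0) from a list of
    (word, context) pairs; pairs seen fewer than `threshold` times are
    dropped to discount rare events."""
    wc = Counter(pairs)
    w_cnt = Counter(w for w, _ in pairs)
    c_cnt = Counter(c for _, c in pairs)
    D = len(pairs)
    return {
        (w, c): max(math.log(n * D / (w_cnt[w] * c_cnt[c])), 0.0)
        for (w, c), n in wc.items()
        if n >= threshold
    }
```

Pairs never observed simply receive no entry, which corresponds to a PPMI of 0 and keeps the matrix sparse.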

### Dimensionality reduction through matrix factorization

Potential obstacle of representing words as the explicit set of contexts: data sparsity, some entries in \(\mathbf{M}\) may be incorrect because we don’t have enough data points

Also, the explicit word vectors (row in \(\mathbf{M}\)) are of a very high dimension

Both issues can be alleviated by using dimension reduction techniques, e.g., singular value decomposition (SVD)

### Mathematics of SVD

A \(m\times n\) matrix \(\mathbf{M}\) can be factorized into \[ \begin{array}{ccccc} \mathbf{M} & = & \mathbf{U} &\mathbf{D} & \mathbf{V}^T\\ m\times n & & m\times m & m\times n & n\times n \end{array} \]

- Matrix \(\mathbf{D}\) is diagonal. Matrices \(\mathbf{U}\) and \(\mathbf{V}\) are orthonormal, i.e., their rows are unit-length and orthogonal to each other

Dimension reduction under SVD: with a small value \(d\), \[ \begin{array}{ccccc} \mathbf{M}' & = & \tilde{\mathbf{U}} & \tilde{\mathbf{D}} & \tilde{\mathbf{V}}^T \\ m\times n & & m\times d & d\times d & d\times n \end{array} \]

- \(\mathbf{M}'\) is the best rank-\(d\) approximation of \(\mathbf{M}\) under the \(L_2\) loss

### Use SVD to obtain word vectors

The low-dimensional rows of \[ \mathbf{W} = \tilde{\mathbf{U}} \tilde{\mathbf{D}} \] are low-rank approximations of the high-dimensional rows of the original matrix \(\mathbf{M}\)

- In the sense that the dot product between rows of \(\mathbf{W}\) equals the dot product between the corresponding rows of the reconstructed matrix \(\mathbf{M}'\): \[ \mathbf{W}_{[i]} \cdot \mathbf{W}_{[j]} = \mathbf{M}'_{[i]} \cdot \mathbf{M}'_{[j]} \]

When using SVD for word similarity, the rows of \(\mathbf{M}\) correspond to words, the columns to contexts. Thus the rows of \(\mathbf{W}\) are low-dimensional word representations.

In practice, it is often better to not use \(\mathbf{W} = \tilde{\mathbf{U}} \tilde{\mathbf{D}}\), but instead to use the more balanced version \(\mathbf{W} = \tilde{\mathbf{U}} \sqrt{\tilde{\mathbf{D}}}\), or even directly using \(\mathbf{W} = \tilde{\mathbf{U}}\)
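The truncated factorization can be sketched with NumPy; the `weight` parameter selects among the variants just mentioned (\(\tilde{\mathbf{U}}\tilde{\mathbf{D}}\), \(\tilde{\mathbf{U}}\sqrt{\tilde{\mathbf{D}}}\), or \(\tilde{\mathbf{U}}\)):

```python
import numpy as np

def svd_word_vectors(M, d, weight=0.5):
    """Rank-d truncated SVD of a word-context matrix M.
    weight=1.0 -> U D, weight=0.5 -> U sqrt(D), weight=0.0 -> U."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :d] * (s[:d] ** weight)  # rows are word vectors
```

With `weight=1.0`, dot products between rows of the result equal dot products between the corresponding rows of the reconstruction \(\mathbf{M}'\).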

### Collobert and Weston’s algorithm

Instead of computing a probability distribution over target words given a context, Collobert and Weston’s model only attempts to assign a score to each word, such that the correct word scores above the incorrect ones (p123)

Denote \(w\) the target word, \(c_{1:k}\) an ordered list of context items

Let \(v_w(w)\) and \(v_c(c)\) be embedding functions mapping word and context indices to \(d_{\text{emb}}\) dimensional vectors

## Word2Vec Model

### Word2Vec model: overview

- Word2Vec is a software package implementing
  - two different context representations (CBOW and Skip-Gram) and
  - two different optimization objectives (Negative-Sampling and Hierarchical Softmax)
- Here, we focus on the Negative-Sampling (NS) objective

### Word2Vec model: negative sampling

Consider a set \(D\) of correct word-context pairs, and a set \(\bar{D}\) of incorrect word-context pairs

Goal: estimate the probability \(P(D=1 \mid w, c)\), which should be high (1) for pairs from \(D\) and low (0) for pairs from \(\bar{D}\)

The probability function: a sigmoid over the score \(s(w, c)\) \[ P(D=1 \mid w, c) = \frac{1}{1 + e^{-s(w, c)}} \]

The corpus-wide objective function is to maximize the log-likelihood of the data \(D \cup \bar{D}\) \[ \mathcal{L}(\Theta; D, \bar{D}) = \sum_{(w, c)\in D}\log P(D=1\mid w, c) + \sum_{(w, c)\in \bar{D}}\log P(D=0\mid w, c) \]

NS approximates the softmax function (normalizing term expensive to compute) with sigmoid functions

### Word2Vec: NS, continued

The positive examples \(D\) are generated from a corpus

The negative samples \(\bar{D}\) can be generated as follows

For each good pair \((w, c)\in D\), sample \(k\) words \(w_{1:k}\) and add each of \((w_i, c)\) as a negative example to \(\bar{D}\). This results in \(\bar{D}\) being \(k\) times as large as \(D\). The number of negative samples \(k\) is a parameter of the algorithm

The negative words \(w\) can be sampled according to their corpus-based frequency. The Word2Vec implementation actually uses a smoothed version in which the counts are raised to the power of \(\frac{3}{4}\) before normalizing: \[ \frac{\#(w)^{0.75}}{\sum_{w'} \#(w')^{0.75}} \] This version gives more relative weight to less frequent words, and results in better word similarities in practice.
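The smoothed sampling distribution can be sketched as:

```python
from collections import Counter

def neg_sampling_dist(tokens, alpha=0.75):
    """Smoothed unigram distribution #(w)^alpha / sum_w' #(w')^alpha
    used to draw negative samples."""
    counts = Counter(tokens)
    z = sum(c ** alpha for c in counts.values())
    return {w: c ** alpha / z for w, c in counts.items()}
```

Compared with raw relative frequency, rare words receive a larger share of the samples.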

### Word2Vec: CBOW

For a multi-word context \(c_{1:k}\), the CBOW variant of Word2Vec defines the context vector \(\mathbf{c}\) to be a sum of the embedding vectors of the context components \[ \mathbf{c} = \sum_{i=1}^k \mathbf{c}_i \]

The score of the word-context pair is simply defined as \[ s(w, c) = \mathbf{w} \cdot \mathbf{c} \]

Thus, the probability of a true pair is \[ P(D = 1\mid w, c_{1:k}) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{c}_1 + \mathbf{w} \cdot \mathbf{c}_2 + \cdots + \mathbf{w} \cdot \mathbf{c}_k)}} \]

The CBOW variant loses the order information between the context’s elements

In return, it allows the use of variable-length contexts

### Word2Vec: Skip-Gram

For a \(k\)-element context \(c_{1:k}\), the skip-gram variant assumes that the elements \(c_i\) in the context are independent from each other, essentially treating them as \(k\) different contexts: \((w, c_1), (w, c_2), \ldots, (w, c_k)\)

The scoring function is the same as the CBOW version \[ s(w, c_i) = \mathbf{w} \cdot \mathbf{c_i} \]

The probability is a product of \(k\) terms \[ P(D = 1\mid w, c_{1:k}) = \prod_{i=1}^k P(D = 1\mid w, c_i) = \prod_{i=1}^k \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{c}_i}} \]
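A sketch of the skip-gram true-pair probability above, given (assumed pre-trained) word and context vectors:

```python
import numpy as np

def p_true_pair(w_vec, context_vecs):
    """Skip-gram: P(D=1 | w, c_1:k) = prod_i sigmoid(w . c_i).
    (CBOW would instead use sigmoid(w . sum_i c_i).)"""
    w = np.asarray(w_vec, dtype=float)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    return float(np.prod([sigmoid(w @ np.asarray(c, dtype=float))
                          for c in context_vecs]))
```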

While the independence assumption is strong, the skip-gram variant is very effective in practice

### GloVe

- GloVe constructs an explicit word-context matrix, and trains the word and context vectors \(\mathbf{w}\) and \(\mathbf{c}\) attempting to satisfy \[ \mathbf{w} \cdot \mathbf{c} + \mathbf{b}_{[w]} + \mathbf{b}_{[c]} = \log \#(w, c), \quad \forall (w, c) \in D \] where \(\mathbf{b}_{[w]}\) and \(\mathbf{b}_{[c]}\) are word-specific and context-specific trained biases

## Choice of Contexts

### Choice of contexts: window approach

The most common choice is a sliding window approach: each window contains a sequence of \(2m+1\) words, where the middle word is the focus word and the \(m\) words to each side are the contexts

Effective window size: usually 2-5.

Larger windows tend to produce more topical similarities (e.g., “dog”, “bark”, and “leash” will be grouped together, as well as “walked”, “run”, and “walking”)

Smaller windows tend to produce more functional and syntactic similarities (e.g., “Poodle”, “Pitbull”, and “Rottweiler”, or “walking”, “running”, and “approaching”)

Many variants on the window approach are possible. One may

- lemmatize words before learning
- apply text normalization
- filter too short or too long sentences
- remove capitalization

### Limitations of distributional methods

Black sheep: people are less likely to mention known information than they are to mention novel information

- For example, when people talk of *white sheep*, they will likely refer to them as *sheep*, while for black sheep they are much more likely to retain the color information and say *black sheep*

Antonyms: words that are opposites of each other (*good* vs *bad*, *buy* vs *sell*, *hot* vs *cold*) tend to appear in similar contexts

# Ch11 Using Word Embeddings

## Resources of Common Pre-Training Word Embeddings

### Common pre-training word embeddings

Efficient implementation of Word2Vec

- GenSim python package: https://radimrehurek.com/gensim/

Efficient implementation of GloVe

## Usages: Find Similarity, Word Analogies

### Pre-trained word embedding usages

Calculate word similarity, e.g., using cosine similarity

Word clustering, e.g., using KMeans

Find similar words

With row-normalized embedding matrix, the cosine similarity between two words \(w_1\) and \(w_2\) is \[ \text{sim}_{\text{cos}}(w_1, w_2) = \mathbf{E}_{[w_1]} \cdot \mathbf{E}_{[w_2]} \]

We are often interested in the \(k\) most similar words to a given word \(w\). Let \(\mathbf{w} = \mathbf{E}_{[w]}\), then the similarity to all other words can be computed by the matrix-vector multiplication \[ \mathbf{s} = \mathbf{E} \mathbf{w} \]
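A sketch of the computation above, assuming a row-normalized embedding matrix `E` and a `vocab` list:

```python
import numpy as np

def most_similar(E, vocab, w, k=5):
    """Top-k neighbors of w by cosine similarity; assumes E is
    row-normalized, so s = E w gives all similarities at once."""
    i = vocab.index(w)
    s = E @ E[i]  # one matrix-vector product
    order = np.argsort(-s)
    return [vocab[j] for j in order if j != i][:k]
```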

### More similarity measures

Similarity to a group of words \(w_{1:k}\): average similarity to the items in the group, computed for all words at once as \[ \mathbf{s} = \mathbf{E} (\mathbf{w}_1 + \cdots + \mathbf{w}_k) / k \] where \(\mathbf{s}_{[w]} = \text{sim}(w, w_{1:k})\)

Short document similarity: consider two documents \(D_1 = w_1^1, \ldots, w_m^1\) and \(D_2 = w_1^2, \ldots, w_n^2\); with unit-normalized word vectors, \[\begin{align*} \text{sim}_{\text{doc}}(D_1, D_2) &= \sum_{i=1}^m \sum_{j=1}^n \text{cos}(\mathbf{w}_i^1, \mathbf{w}_j^2)\\ &= \left(\sum_{i=1}^m \mathbf{w}_i^1 \right) \cdot \left(\sum_{j=1}^n \mathbf{w}_j^2 \right) \end{align*}\]

### Word analogies

One can perform “algebra” on the word vectors and get meaningful results

- For example, \[ \mathbf{w}_{\text{king}} - \mathbf{w}_{\text{man}} + \mathbf{w}_{\text{woman}} \approx \mathbf{w}_{\text{queen}} \]

Analogy solving task: to answer analogy questions of the form \[ man : woman \rightarrow king:? \]

Solve the analogy question by maximization \[\begin{align*} \text{analogy}(m:w \rightarrow k:?) & = \arg\max_{v \in V\backslash \{m, w, k\}} \text{cos}(\mathbf{v}, \mathbf{k} - \mathbf{m} + \mathbf{w}) \end{align*}\]
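The maximization above can be sketched as follows, again assuming a row-normalized embedding matrix and a toy vocabulary:

```python
import numpy as np

def analogy(E, vocab, m, w, k_word):
    """Solve m:w -> k:? as argmax_v cos(v, k - m + w), excluding the
    three query words; assumes E is row-normalized."""
    idx = {word: i for i, word in enumerate(vocab)}
    t = E[idx[k_word]] - E[idx[m]] + E[idx[w]]
    s = E @ (t / np.linalg.norm(t))  # cosine to the target direction
    exclude = {idx[m], idx[w], idx[k_word]}
    return vocab[max((j for j in range(len(vocab)) if j not in exclude),
                     key=lambda j: s[j])]
```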

### Practicalities and pitfalls

While off-the-shelf, pre-trained word embeddings can be downloaded and used, it is advisable not to blindly download word embeddings and treat them as a black box

Be aware of choices such as the source of the training corpus

- Larger training corpus is not always better. A smaller but cleaner, or smaller but more domain-focused corpus are often more effective

**When using off-the-shelf embedding vectors, it is better to use the same tokenization and text normalization schemes that were used when deriving the corpus**

### References

- Goldberg, Yoav. (2017). Neural Network Methods for Natural Language Processing, Morgan & Claypool