Ch9 Language Modeling
Language Modeling with the Markov Assumption
Language modeling with the Markov assumption
The task of language modeling is to assign a probability to any sequence of words \(w_{1:n}\), i.e., to estimate \[ P(w_{1:n}) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_{1:2}) \cdots P(w_n \mid w_{1:n-1}) \]
Non-RNN language models make use of the Markov assumption: the future is independent of the past given the present
A \(k\)th order Markov assumption assumes \[ P(w_{i+1} \mid w_{1:i}) \approx P(w_{i+1} \mid w_{i-k+1:i}) \]
Thus, the probability of the sentence becomes \[ P(w_{1:n}) = \prod_{i=1}^n P(w_i \mid w_{i-k:i-1}) \] where \(w_{-k+1}, \ldots, w_0\) are special padding symbols
This chapter discusses \(k\)th order language models. Chapter 14 will discuss language models that do not make the Markov assumption
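As a concrete illustration, here is a minimal sketch of a count-based model with \(k = 2\) (a trigram model) estimated by maximum likelihood; the toy corpus and symbol names are illustrative assumptions:

```python
from collections import Counter

PAD, EOS = "<s>", "</s>"                                       # illustrative padding / end symbols
corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]   # toy corpus

# Count trigrams and their bigram contexts: P(w_i | w_{i-2}, w_{i-1})
tri, bi = Counter(), Counter()
for sent in corpus:
    padded = [PAD, PAD] + sent + [EOS]
    for i in range(2, len(padded)):
        tri[tuple(padded[i-2:i+1])] += 1
        bi[tuple(padded[i-2:i])] += 1

def sentence_prob(sent):
    """P(w_{1:n}) = prod_i P(w_i | w_{i-2:i-1}) with MLE count ratios."""
    padded = [PAD, PAD] + sent + [EOS]
    p = 1.0
    for i in range(2, len(padded)):
        ctx, tg = tuple(padded[i-2:i]), tuple(padded[i-2:i+1])
        p *= tri[tg] / bi[ctx] if bi[ctx] else 0.0
    return p

print(sentence_prob(["the", "dog", "barks"]))   # 0.5 on this toy corpus
```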
Perplexity: evaluation of language models
An intrinsic evaluation of language models is perplexity over unseen sentences
Given a text corpus of \(n\) words \(w_1, \ldots, w_n\) and a language model function \(LM\), the perplexity of LM with respect to the corpus is \[ 2^{-\frac{1}{n} \sum_{i=1}^n \log_2 LM(w_i \mid w_{1:i-1})} \]
Good language models will assign high probabilities to the events in the corpus, resulting in lower perplexity values
Perplexities are corpus specific, so perplexities of two language models are only comparable with respect to the same evaluation corpus
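A minimal sketch of this computation, assuming a hypothetical `lm_prob(w, history)` function that returns \(LM(w \mid \text{history})\):

```python
import math

def perplexity(corpus, lm_prob):
    """corpus: a list of n words; lm_prob(w, history) -> LM(w | history)."""
    log_sum = sum(math.log2(lm_prob(w, corpus[:i])) for i, w in enumerate(corpus))
    return 2 ** (-log_sum / len(corpus))

# A uniform model over a 1000-word vocabulary has perplexity 1000
print(perplexity(["a", "b", "c"], lambda w, history: 1 / 1000))   # ~1000.0
```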
Neural Language Models
Neural language models
- The input to the neural network is a \(k\)-gram of words \(w_{1:k}\), and the output is a probability distribution over the next word; a minimal sketch follows
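A minimal NumPy sketch of the forward pass of such a network (embed the \(k\) words, concatenate, apply an MLP, normalize with a softmax); all sizes and names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
V, k, d_emb, d_hid = 10000, 3, 50, 100        # illustrative sizes

E  = rng.normal(0, 0.1, (V, d_emb))           # word embedding matrix
W1 = rng.normal(0, 0.1, (k * d_emb, d_hid)); b1 = np.zeros(d_hid)
W2 = rng.normal(0, 0.1, (d_hid, V));         b2 = np.zeros(V)

def next_word_dist(context_ids):
    """context_ids: the k previous word indices."""
    x = E[context_ids].reshape(-1)            # concatenate the k embeddings
    h = np.tanh(x @ W1 + b1)                  # hidden layer
    scores = h @ W2 + b2                      # one score per vocabulary word
    e = np.exp(scores - scores.max())         # softmax -> probability distribution
    return e / e.sum()

p = next_word_dist([12, 7, 4052])
print(p.shape, p.sum())                       # (10000,) and sums to ~1.0
```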
Approximation of the softmax operation in cross entropy
Cross entropy loss works very well, but requires the use of a costly softmax operation which can be prohibitive for very large vocabularies
- This promotes the use of alternative losses and/or approximations
- Hierarchical softmax (organizing the vocabulary as a tree)
- Self-normalizing approaches, e.g., noise-contrastive estimation (NCE)
- Sampling approaches
NCE: replaces the cross-entropy objective with a collection of binary classification problems, requiring the evaluation of the assigned scores for \(k\) random words rather than the entire vocabulary
Using language models for generation
Predict a probability distribution over the first word conditioned on the start symbol, and draw a random word according to the predicted distribution
Then predict a probability distribution over the second word conditioned on the first
And so on, until predicting the end-of-sequence \(</s>\) symbol
Already with \(k=3\) this produces very passable text, and the quality improves with higher orders
Another option is to use beam search in order to find a sequence with a globally high probability
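A minimal sketch of the sampling procedure (not of beam search), assuming a hypothetical `next_word_dist(history)` function that returns a word-to-probability dictionary:

```python
import random

def generate(next_word_dist, max_len=50, eos="</s>"):
    """Sample a sentence word by word from a language model.
    next_word_dist(history) -> {word: P(word | history)}."""
    words = []
    while len(words) < max_len:
        dist = next_word_dist(words)
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == eos:
            break
        words.append(word)
    return words
```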
Ch10 Pre-trained Word Representations
Random initialization of word embedding models
The Word2Vec model initializes word vectors to uniformly sampled numbers in the range \(\left[-\frac{1}{2d}, \frac{1}{2d}\right]\)
Another option is Xavier initialization, initializing with uniformly sampled numbers in the range \(\left[-\frac{\sqrt{6}}{\sqrt{d}}, \frac{\sqrt{6}}{\sqrt{d}}\right]\)
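Both initializations are one-liners in NumPy; the vocabulary size and dimension below are illustrative:

```python
import numpy as np

vocab_size, d = 10000, 100                    # illustrative sizes

# Word2Vec-style: uniform in [-1/(2d), 1/(2d)]
E_w2v = np.random.uniform(-1 / (2 * d), 1 / (2 * d), size=(vocab_size, d))

# Xavier-style: uniform in [-sqrt(6)/sqrt(d), sqrt(6)/sqrt(d)]
bound = np.sqrt(6) / np.sqrt(d)
E_xavier = np.random.uniform(-bound, bound, size=(vocab_size, d))
```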
Unsupervised training of word embedding vectors
Key idea: one would like “similar” words to have similar embedding vectors
Word similarity here follows the distributional hypothesis: words are similar if they appear in similar contexts
The different methods all create supervised training instances in which the goal is to
- either predict the word from its context,
- or predict the context from the word
An important benefit of training word embeddings on large amounts of unannotated data: it provides vector representations for words that do not appear in the supervised training set
Word Similarity Matrices and SVD
Word-context matrices
Denote by \(V_W\) the set of words and by \(V_C\) the set of possible contexts
Assume that \(w_i\) is the \(i\)th word in the words vocabulary and \(c_j\) is the \(j\)th word in the context vocabulary
The matrix \(\mathbf{M}^f \in \mathbb{R}^{|V_W| \times |V_C|}\) is the word-context matrix, with \(f\) being an association measure of the strength between a word and a context \[ \mathbf{M}^f_{[i,j]} = f(w_i, c_j) \]
Similarity measures
- When words are represented as vectors, one can compute their similarity using the cosine similarity (a minimal code sketch follows) \[\begin{align*} \text{sim}_{\text{cos}}(\mathbf{u}, \mathbf{v}) &= \frac{\mathbf{u} \cdot \mathbf{v}} {\|\mathbf{u}\|_2 \|\mathbf{v}\|_2}\\ &=\frac{\sum_i\mathbf{u}_{[i]} \mathbf{v}_{[i]}} {\sqrt{\sum_i(\mathbf{u}_{[i]})^2} \sqrt{\sum_i(\mathbf{v}_{[i]})^2}} \end{align*}\]
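A minimal NumPy sketch of the same formula:

```python
import numpy as np

def cos_sim(u, v):
    """Cosine similarity between two word vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_sim(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.0])))   # 1.0 (same direction)
```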
Word-context weighting and PMI
Denote by \(\#(w, c)\) the number of times word \(w\) occurred in the context \(c\) in the corpus \(D\), and let \(|D|\) be the corpus size
Pointwise mutual information (PMI) \[ \text{PMI}(w, c) = \log \frac{P(w, c)}{P(w) P(c)} = \log \frac{\#(w, c) |D|}{\#(w) \#(c)} \]
To resolve the \(\log 0\) issue for pairs \((w, c)\) never observed in the corpus, we can use the positive PMI (PPMI) \[ \text{PPMI}(w,c) = \max\{\text{PMI}(w, c), 0\} \]
A deficiency of PMI: it tends to assign high values to rare events
Solution: it is advisable to apply a count threshold (to discount rare events) before using the PMI metric
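A minimal NumPy sketch of building a PPMI matrix from a small, illustrative matrix of \(\#(w, c)\) counts:

```python
import numpy as np

# Illustrative #(w, c) count matrix: rows are words, columns are contexts
counts = np.array([[3.0, 0.0, 1.0],
                   [1.0, 2.0, 0.0]])

total = counts.sum()                              # |D|
p_w = counts.sum(axis=1, keepdims=True) / total   # P(w)
p_c = counts.sum(axis=0, keepdims=True) / total   # P(c)
p_wc = counts / total                             # P(w, c)

with np.errstate(divide="ignore"):                # log 0 -> -inf for unseen pairs
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)                       # PPMI = max(PMI, 0)
print(ppmi)
```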
Dimensionality reduction through matrix factorization
A potential obstacle of representing words as explicit sets of contexts is data sparsity: some entries in \(\mathbf{M}\) may be incorrect because we do not have enough data points
Also, the explicit word vectors (rows in \(\mathbf{M}\)) are of a very high dimension
Both issues can be alleviated by using dimension reduction techniques, e.g., singular value decomposition (SVD)
Mathematics of SVD
A \(m\times n\) matrix \(\mathbf{M}\) can be factorized into \[ \begin{array}{ccccc} \mathbf{M} & = & \mathbf{U} &\mathbf{D} & \mathbf{V}^T\\ m\times n & & m\times m & m\times n & n\times n \end{array} \]
- Matrix \(\mathbf{D}\) is diagonal. Matrices \(\mathbf{U}\) and \(\mathbf{V}\) are orthonormal, i.e., their rows are unit-length and orthogonal to each other
Dimension reduction under SVD: with a small value \(d\), \[ \begin{array}{ccccc} \mathbf{M}' & = & \tilde{\mathbf{U}} & \tilde{\mathbf{D}} & \tilde{\mathbf{V}}^T \\ m\times n & & m\times d & d\times d & d\times n \end{array} \]
- \(\mathbf{M}'\) is the best rank-\(d\) approximation of \(\mathbf{M}\) under the \(L_2\) loss
Use SVD to obtain word vectors
The low-dimensional rows of \[ \mathbf{W} = \tilde{\mathbf{U}} \tilde{\mathbf{D}} \] are low-rank approximations of the high-dimensional rows of the original matrix \(\mathbf{M}\)
- In the sense that computing the dot product between rows of \(\mathbf{W}\) is equivalent to computing the dot product between the corresponding rows of the reconstructed matrix \(\mathbf{M}'\) \[ \mathbf{W}_{[i]} \cdot \mathbf{W}_{[j]} = \mathbf{M}'_{[i]} \cdot \mathbf{M}'_{[j]} \]
When using SVD for word similarity, the rows of \(\mathbf{M}\) correspond to words, the columns to contexts. Thus the rows of \(\mathbf{W}\) are low-dimensional word representations.
In practice, it is often better to not use \(\mathbf{W} = \tilde{\mathbf{U}} \tilde{\mathbf{D}}\), but instead to use the more balanced version \(\mathbf{W} = \tilde{\mathbf{U}} \sqrt{\tilde{\mathbf{D}}}\), or even directly using \(\mathbf{W} = \tilde{\mathbf{U}}\)
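A minimal NumPy sketch of the three variants, using a random placeholder in place of a real word-context matrix:

```python
import numpy as np

M = np.random.rand(300, 1000)              # placeholder word-context matrix
d = 50                                     # target dimensionality

# Thin SVD; the singular values in S come sorted in decreasing order
U, S, Vt = np.linalg.svd(M, full_matrices=False)

W_wide     = U[:, :d] * S[:d]              # W = U~ D~
W_balanced = U[:, :d] * np.sqrt(S[:d])     # W = U~ sqrt(D~), the more balanced version
W_plain    = U[:, :d]                      # W = U~
```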
Collobert and Weston’s algorithm
Instead of computing a probability distribution over target words given a context, Collobert and Weston’s model only attempts to assign a score to each word, such that the correct word scores above the incorrect ones (p123)
Denote \(w\) the target word, \(c_{1:k}\) an ordered list of context items
Let \(v_w(w)\) and \(v_c(c)\) be embedding functions mapping word and context indices to \(d_{\text{emb}}\) dimensional vectors
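The training objective is not spelled out above; one standard way to make “the correct word score above the incorrect ones” is a margin-based (hinge) ranking loss between a correct pair and a corrupted pair, sketched here with an illustrative margin of 1:

```python
def margin_rank_loss(score_correct, score_corrupt, margin=1.0):
    """Hinge loss pushing the correct word's score above a corrupted
    (randomly substituted) word's score by at least `margin`."""
    return max(0.0, margin - score_correct + score_corrupt)

print(margin_rank_loss(5.0, 2.0))   # 0.0 -- correct word already wins by a wide margin
print(margin_rank_loss(2.0, 1.5))   # 0.5 -- margin not yet satisfied
```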
Word2Vec Model
Word2Vec model: overview
- Word2Vec is a software package implementing
- two different context representations (CBOW and Skip-Gram) and
- two different optimization objectives (Negative-Sampling and Hierarchical Softmax)
- Here, we focus on the Negative-Sampling (NS) objective
Word2Vec model: negative sampling
Consider a set \(D\) of correct word-context pairs, and a set \(\bar{D}\) of incorrect word-context pairs
Goal: estimate the probability \(P(D=1 \mid w, c)\), which should be high (1) for pairs from \(D\) and low (0) for pairs from \(\bar{D}\)
The probability function: a sigmoid over the score \(s(w, c)\) \[ P(D=1 \mid w, c) = \frac{1}{1 + e^{-s(w, c)}} \]
The corpus-wide objective function is to maximize the log-likelihood of the data \(D \cup \bar{D}\) \[ \mathcal{L}(\Theta; D, \bar{D}) = \sum_{(w, c)\in D}\log P(D=1\mid w, c) + \sum_{(w, c)\in \bar{D}}\log P(D=0\mid w, c) \]
NS approximates the softmax function (whose normalization term is expensive to compute) with sigmoid functions
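A minimal NumPy sketch of evaluating this objective, given precomputed scores \(s(w, c)\) for a handful of illustrative positive and negative pairs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ns_log_likelihood(pos_scores, neg_scores):
    """Sum of log P(D=1 | w, c) over pairs in D plus
    log P(D=0 | w, c) = log(1 - sigmoid(s)) over pairs in D-bar."""
    return (np.log(sigmoid(pos_scores)).sum()
            + np.log(1.0 - sigmoid(neg_scores)).sum())

# Illustrative scores s(w, c) for two positive and three negative pairs
print(ns_log_likelihood(np.array([2.0, 1.5]), np.array([-1.0, -0.5, 0.3])))
```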
Word2Vec: NS, continued
The positive examples \(D\) are generated from a corpus
The negative samples \(\bar{D}\) can be generated as follows
For each good pair \((w, c)\in D\), sample \(k\) words \(w_{1:k}\) and add each of \((w_i, c)\) as a negative example to \(\bar{D}\). This results in \(\bar{D}\) being \(k\) times as large as \(D\). The number of negative samples \(k\) is a parameter of the algorithm
The negative words \(w\) can be sampled according to their corpus-based frequency. The Word2Vec implementation actually uses a smoothed version in which the counts are raised to the power of \(\frac{3}{4}\) before normalizing: \[ \frac{\#(w)^{0.75}}{\sum_{w'} \#(w')^{0.75}} \] This version gives more relative weight to less frequent words and results in better word similarities in practice.
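A minimal NumPy sketch of the smoothed sampling distribution (the toy counts are illustrative):

```python
import numpy as np

def neg_sampling_dist(word_counts, alpha=0.75):
    """Smoothed unigram distribution: #(w)^0.75 / sum_w' #(w')^0.75."""
    words = list(word_counts)
    probs = np.array([word_counts[w] for w in words], dtype=float) ** alpha
    return words, probs / probs.sum()

words, probs = neg_sampling_dist({"the": 1000, "dog": 50, "leash": 2})
negatives = np.random.choice(words, size=5, p=probs)   # k = 5 negative words
print(dict(zip(words, probs.round(3))), negatives)
```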
Word2Vec: CBOW
For a multi-word context \(c_{1:k}\), the CBOW variant of Word2Vec defines the context vector \(\mathbf{c}\) to be a sum of the embedding vectors of the context components \[ \mathbf{c} = \sum_{i=1}^k \mathbf{c}_i \]
The score of the word-context pair is simply defined as \[ s(w, c) = \mathbf{w} \cdot \mathbf{c} \]
Thus, the probability of a true pair is \[ P(D = 1\mid w, c_{1:k}) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{c}_1 + \mathbf{w} \cdot \mathbf{c}_2 + \cdots + \mathbf{w} \cdot \mathbf{c}_k)}} \]
The CBOW variant loses the order information between the context’s elements
In return, it allows the use of variable-length contexts
Word2Vec: Skip-Gram
For a \(k\)-element context \(c_{1:k}\), the skip-gram variant assumes that the elements \(c_i\) in the context are independent from each other, essentially treating them as \(k\) different contexts: \((w, c_1), (w, c_2), \ldots, (w, c_k)\)
The scoring function is the same as the CBOW version \[ s(w, c_i) = \mathbf{w} \cdot \mathbf{c}_i \]
The probability is a product of \(k\) terms \[ P(D = 1\mid w, c_{1:k}) = \prod_{i=1}^k P(D = 1\mid w, c_i) = \prod_{i=1}^k \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{c}_i}} \]
While the independence assumption is strong, the skip-gram variant is very effective in practice
GloVe
- GloVe constructs an explicit word-context matrix, and trains the word and context vectors \(\mathbf{w}\) and \(\mathbf{c}\) attempting to satisfy \[ \mathbf{w} \cdot \mathbf{c} + \mathbf{b}_{[w]} + \mathbf{b}_{[c]} = \log \#(w, c), \quad \forall (w, c) \in D \] where \(\mathbf{b}_{[w]}\) and \(\mathbf{b}_{[c]}\) are word-specific and context-specific trained biases
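A minimal sketch of one plain stochastic gradient step toward this constraint for a single \((w, c)\) pair; note that the actual GloVe objective additionally weights each pair by a function of \(\#(w, c)\), which is omitted here:

```python
import numpy as np

def glove_step(w, c, b_w, b_c, log_count, lr=0.05):
    """One squared-error gradient step toward w . c + b_w + b_c = log #(w, c)."""
    err = w @ c + b_w + b_c - log_count
    grad_w, grad_c = err * c, err * w      # gradients of 0.5 * err ** 2
    w -= lr * grad_w                       # vectors are updated in place
    c -= lr * grad_c
    return b_w - lr * err, b_c - lr * err  # scalar biases are returned

rng = np.random.default_rng(0)
w, c = 0.1 * rng.standard_normal(50), 0.1 * rng.standard_normal(50)
b_w = b_c = 0.0
for _ in range(200):
    b_w, b_c = glove_step(w, c, b_w, b_c, log_count=np.log(42.0))
print(w @ c + b_w + b_c)                   # approaches log(42) ~ 3.74
```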
Choice of Contexts
Choice of contexts: window approach
The most common is a sliding window approach, containing a sequence of \(2m+1\) words. The middle word is called the focus word and the \(m\) words to each side are the contexts
Effective window size: usually 2-5.
Larger windows tend to produce more topical similarities (e.g., “dog”, “bark”, and “leash” will be grouped together, as well as “walked”, “run”, and “walking”)
Smaller windows tend to produce more functional and syntactic similarities (e.g., “Poodle”, “Pitbull”, and “Rottweiler”, or “walking”, “running”, and “approaching”)
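A minimal sketch of extracting (focus word, context word) pairs with a sliding window (here \(m = 2\)):

```python
def window_contexts(words, m=2):
    """Yield (focus word, context word) pairs from a 2m+1 sliding window."""
    for i, focus in enumerate(words):
        lo, hi = max(0, i - m), min(len(words), i + m + 1)
        for j in range(lo, hi):
            if j != i:
                yield focus, words[j]

print(list(window_contexts(["the", "dog", "barks", "loudly"], m=2)))
```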
Many variants on the window approach are possible. One may
- lemmatize words before learning
- apply text normalization
- filter out too short or too long sentences
- remove capitalization
Limitations of distributional methods
Black sheep: people are less likely to mention known information than they are to mention novel information
- For example, when people talk of white sheep, they will likely refer to them simply as sheep, while for black sheep they are much more likely to retain the color information and say black sheep
Antonyms: words that are opposites of each other (good vs bad, buy vs sell, hot vs cold) tend to appear in similar contexts
Ch11 Using Word Embeddings
Resources of Common Pre-Trained Word Embeddings
Common pre-trained word embeddings
Efficient implementation of Word2Vec
- GenSim python package: https://radimrehurek.com/gensim/
Efficient implementation of GloVe
Usages: Find Similarity, Word Analogies
Pre-trained word embedding usages
Calculate word similarity, e.g., using cosine similarity
Word clustering, e.g., using KMeans
Find similar words
With a row-normalized embedding matrix \(\mathbf{E}\), the cosine similarity between two words \(w_1\) and \(w_2\) is \[ \text{sim}_{\text{cos}}(w_1, w_2) = \mathbf{E}_{[w_1]} \cdot \mathbf{E}_{[w_2]} \]
We are often interested in the \(k\) most similar words to a given word \(w\). Let \(\mathbf{w} = \mathbf{E}_{[w]}\), then the similarity to all other words can be computed by the matrix-vector multiplication \[ \mathbf{s} = \mathbf{E} \mathbf{w} \]
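A minimal sketch of retrieving the \(k\) most similar words, assuming a row-normalized embedding matrix `E` and illustrative `word2idx` / `idx2word` lookup tables:

```python
import numpy as np

def most_similar(E, word2idx, idx2word, w, k=5):
    """Return the k most cosine-similar words to w.
    Assumes the rows of E have been normalized to unit length."""
    s = E @ E[word2idx[w]]                 # similarities to every word
    ranked = np.argsort(-s)                # most similar first
    return [idx2word[i] for i in ranked if idx2word[i] != w][:k]
```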
More similarity measures
Similarity to a group of words \(w_{1:k}\): the average similarity to the items in the group, computed for all words at once as \[ \mathbf{s} = \mathbf{E} (\mathbf{w}_1 + \cdots + \mathbf{w}_k) / k, \qquad \mathbf{s}_{[w]} = \text{sim}(w, w_{1:k}) \]
Short document similarity: consider two documents \(D_1 = w_1^1, \ldots, w_m^1\) and \(D_2 = w_1^2, \ldots, w_n^2\); with row-normalized word vectors, \[\begin{align*} \text{sim}_{\text{doc}}(D_1, D_2) &= \sum_{i=1}^m \sum_{j=1}^n \text{cos}(\mathbf{w}_i^1, \mathbf{w}_j^2)\\ &= \left(\sum_{i=1}^m \mathbf{w}_i^1 \right) \cdot \left(\sum_{j=1}^n \mathbf{w}_j^2 \right) \end{align*}\]
Word analogies
One can perform “algebra” on the word vectors and get meaningful results
- For example, \[ \mathbf{w}_{\text{king}} - \mathbf{w}_{\text{man}} + \mathbf{w}_{\text{woman}} \approx \mathbf{w}_{\text{queen}} \]
Analogy solving task: to answer analogy questions of the form \[ man : woman \rightarrow king:? \]
Solve the analogy question by maximization \[\begin{align*} \text{analogy}(m:w \rightarrow k:?) & = \arg\max_{v \in V\backslash \{m, w, k\}} \text{cos}(\mathbf{v}, \mathbf{k} - \mathbf{m} + \mathbf{w}) \end{align*}\]
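A minimal sketch of this maximization under the same assumptions as above (row-normalized `E`, illustrative lookup tables):

```python
import numpy as np

def analogy(E, word2idx, idx2word, m, w, k):
    """Solve m : w -> k : ? by maximizing cos(v, k - m + w).
    With unit-length rows in E, ranking by dot product with the target
    vector is equivalent to ranking by cosine similarity."""
    target = E[word2idx[k]] - E[word2idx[m]] + E[word2idx[w]]
    ranked = np.argsort(-(E @ target))
    exclude = {m, w, k}
    return next(idx2word[i] for i in ranked if idx2word[i] not in exclude)
```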
Practicalities and pitfalls
While off-the-shelf, pre-trained word embeddings can be downloaded and used, it is advised to not just blindly download word embeddings and treat them as a black box
Be aware of choices such as the source of the training corpus
- Larger training corpora are not always better. A smaller but cleaner corpus, or a smaller but more domain-focused one, is often more effective
When using off-the-shelf embedding vectors, it is better to use the same tokenization and text normalization schemes that were used when deriving the corpus
References
- Goldberg, Yoav. (2017). Neural Network Methods for Natural Language Processing, Morgan & Claypool