Book Notes: Neural Network Methods for Natural Language Processing -- Part 2 Working with Natural Language Data, Ch9-11

For the pdf slides, click here

Ch9 Language Modeling

Language Modeling with the Markov Assumption

Language modeling with the Markov assumption

  • The task of language modeling is to assign a probability to any sequence of words \(w_{1:n}\), i.e., to estimate \[ P(w_{1:n}) = P(w_1) P(w_2 \mid w_1) P(w_3 \mid w_{1:2}) \cdots P(w_n \mid w_{1:n-1}) \]

  • Non-RNN language models make use of the Markov assumption: the future is independent of the past given the present

    • A \(k\)th order Markov assumption assumes \[ P(w_{i+1} \mid w_{1:i}) \approx P(w_{i+1} \mid w_{i-k+1:i}) \]

    • Thus, the probability of the sentence becomes \[ P(w_{1:n}) = \prod_{i=1}^n P(w_i \mid w_{i-k:i-1}) \] where \(w_{-k}, \ldots, w_0\) are special padding symbols

    • This chapter discusses \(k\)th order language model. Chapter 14 will discuss language models without the Markov assumption

Perplexity: evaluation of language models

  • An intrinsic evaluation of language models is perplexity over unseen sentences

  • Given a text corpus of \(n\) words \(w_1, \ldots, w_n\) and a language model function \(LM\), the perplexity of LM with respect to the corpus is \[ 2^{-\frac{1}{n} \sum_{i=1}^n \log_2 LM(w_i \mid w_{1:i-1})} \]

  • Good language models will assign high probabilities to the events in the corpus, resulting in lower perplexity values

  • Perplexities are corpus specific, so perplexities of two language models are only comparable with respect to the same evaluation corpus

Neural Language Models

Neural language models

  • Input to the neural network is a \(k\)gram of words \(w_{1:k}\), and the output is a probability distribution over the next word

Approximation of the softmax operation in cross entropy

  • Cross entropy loss works very well, but requires the use of a costly softmax operation which can be prohibitive for very large vocabularies

  • This promotes the use of alternative losses and/or approximations
    • Hierarchical softmax (using tree)
    • Self-normalizing approaches, e.g., noise-contrastive estimation (NCE)
    • Sampling approaches
  • NCE: replaces the cross-entropy objective with a collection of binary classification problems, requiring the evaluation of the assigned scores for \(k\) random words rather than the entire vocabulary

Using language models for generation

  • Predict a probability distribution over the first word conditioned on the start symbol, and draw a random word according to the predicted distribution

  • Then predict a probability distribution over the second word conditioned on the first

  • And so on, until predicting the end-of-sequence \(</s>\) symbol

  • Already with \(k=3\) this produces very passable text, and the quality improves with higher orders

  • Another option is to use beam search in order to find a sequence with a globally high probability

Ch10 Pre-trained Word Representations

Random initialization of word embedding models

  • The Word2Vec model initializes word vectors to uniformly sampled numbers in the range \(\left[-\frac{1}{2d}, \frac{1}{2d}\right]\)

  • Another option is xavier initialization, initializing with uniformly sampled numbers in the range \(\left[-\frac{\sqrt{6}}{\sqrt{d}}, \frac{\sqrt{6}}{\sqrt{d}}\right]\)

Unsupervised training of word embedding vectors

  • Key idea: one would like the embedding vectors of “similar” words to have similar vectors

  • Word similarity is from the distributional hypothesis: words are similar if they appear in similar contexts

  • The different methods all create supervised training instances in which the goal is to

    • either predict the word from its context,
    • or predict the context from the word
  • An important benefit of training word embedding on large amount of unannotated data: it provides vector representations for words that do not appear in the supervised training set

Word Simiarlity Matrices and SVD

Word-context matrices

  • Denote \(V_W\) the st of words and \(V_C\) the set of possible contexts

  • Assume that \(w_i\) is the \(i\)th word in the words vocabulary and \(c_j\) is the \(j\)th word in the context vocabulary

  • The matrix \(\mathbf{M}^f \in \mathbb{R}^{|V_W| \times |V_C|}\) is the word-context matrix, with \(f\) being an association measure of the strength between a word and a context \[ \mathbf{M}^f_{[i,j]} = f(w_i, c_j) \]

Similarity measures

  • When words are represented as vectors, one can computing similarity by cosine similarity \[\begin{align*} \text{sim}_{\text{cos}} &= \frac{\mathbf{u} \cdot \mathbf{v}} {\|\mathbf{u}\|_2 \|\mathbf{v}\|_2}\\ &=\frac{\sum_i\mathbf{u}_{[i]} \cdot \mathbf{v}_{[i]}} {\sqrt{\sum_i(\mathbf{u}_{[i]})^2} \sqrt{\sum_i(\mathbf{v}_{[i]})^2}} \end{align*}\]

Word-context weighting and PMI

  • Denote by \(\#(w, c)\) the number of times word \(w\) occurred in the context \(c\) in the corpus \(D\), and let \(|D|\) be the corpus size

  • Pointwise mutual information (PMI) \[ \text{PMI}(w, c) = \log \frac{P(w, c)}{P(w) P(c)} = \log \frac{\#(w, c) |D|}{\#(w) \#(c)} \]

  • To resolve the \(\log 0\) issue for pairs \((w, c)\) never observed in the corpus, we can use the positive PMI (PPMI) \[ \text{PPMI}(w,c) = max\{\text{PMI}(w, c), 0\} \]

  • A deficiency of PMI: it tends to assign high value to rare events

  • Solution: it is advisable to apply a count threshold (to discount rare events) before using the PMI metric

Dimensionality reduction through matrix factorization

  • Potential obstacle of representing words as the explicit set of contexts: data sparsity, some entries in \(\mathbf{M}\) may be incorrect because we don’t have enough data points

  • Also, the explicit word vectors (row in \(\mathbf{M}\)) are of a very high dimension

  • Both issues can be alleviated by using dimension reduction techniques, e.g., singular value decomposition (SVD)

Mathematics of SVD

  • A \(m\times n\) matrix \(\mathbf{M}\) can be factorized into \[ \begin{array}{ccccc} \mathbf{M} & = & \mathbf{U} &\mathbf{D} & \mathbf{V}^T\\ m\times n & & m\times m & m\times n & n\times n \end{array} \]

    • Matrix \(\mathbf{D}\) is diagonal. Matrices \(\mathbf{U}\) and \(\mathbf{V}\) are orthonormal, i.e., their rows are unit-length and orthogonal to each other
  • Dimension reduction under SVD: with a small value \(d\), \[ \begin{array}{ccccc} \mathbf{M}' & = & \tilde{\mathbf{U}} & \tilde{\mathbf{D}} & \tilde{\mathbf{V}}^T \\ m\times n & & m\times d & d\times d & d\times n \end{array} \]

    • \(\mathbf{M}'\) is the best rank-\(d\) approximation of \(\mathbf{M}\) under the \(L_2\) loss

Use SVD to obtain word vectors

  • The low-dimensional rows of \[ \mathbf{W} = \tilde{\mathbf{U}} \tilde{\mathbf{D}} \] are low-rank approximations of the high-dimensional rows of the original matrix \(\mathbf{M}\)

    • In the sense that computing the dot product between rows of \(\mathbf{W}\) is equivalent to computing dot product between the reconstructed matrix \(\mathbf{M}'\). \[ \mathbf{W}_{[i]} \cdot \mathbf{W}_{[j]} = \mathbf{M}'_{[i]} \cdot \mathbf{M}'_{[j]} \]
  • When using SVD for word similarity, the rows of \(\mathbf{M}\) correspond to words, the columns to contexts. Thus the rows of \(\mathbf{W}\) are low-dimensional word representations.

  • In practice, it is often better to not use \(\mathbf{W} = \tilde{\mathbf{U}} \tilde{\mathbf{D}}\), but instead to use the more balanced version \(\mathbf{W} = \tilde{\mathbf{U}} \sqrt{\tilde{\mathbf{D}}}\), or even directly using \(\mathbf{W} = \tilde{\mathbf{U}}\)

Collobert and Weston’s algorithm

  • Instead of computing a probability distribution over target words given a context, Collobert and Weston’s model only attempts to assign a score to each word, such that the correct word scores above the incorrect ones (p123)

  • Denote \(w\) the target word, \(c_{1:k}\) an ordered list of context items

  • Let \(v_w(w)\) and \(v_c(c)\) be embedding functions mapping word and context indices to \(d_{\text{emb}}\) dimensional vectors

Word2Vec Model

Word2Vec model: overview

  • Word2Vec is a software package implementing
    • two different context representations (CBOW and Skip-Gram) and
    • two different optimization objectives (Negative-Sampling and Hierarchical Softmax)
  • Here, we focus on the Negative-Sampling (NS) objective

Word2Vec model: negative sampling

  • Consider a set \(D\) of correct word-context pairs, and a set \(\bar{D}\) of incorrect word-context pairs

  • Goal: estimate the probability \(P(D=1 \mid w, c)\), which should be high (1) for pairs from \(D\) and low (0) for pairs from \(\bar{D}\)

  • The probability function: a sigmoid over the score \(s(w, c)\) \[ P(D=1 \mid w, c) = \frac{1}{1 + e^{-s(w, c)}} \]

  • The corpus-wide objective function is to maximize the log-likelihood of the data \(D \cup \bar{D}\) \[ \mathcal{L}(\Theta; D, \bar{D}) = \sum_{(w, c)\in D}\log P(D=1\mid w, c) + \sum_{(w, c)\in \bar{D}}\log P(D=0\mid w, c) \]

  • NS approximates the softmax function (normalizing term expensive to compute) with sigmoid functions

Word2Vec: NS, continued

  • The positive examples \(D\) are generated from a corpus

  • The negative samples \(\bar{D}\) can be generated as follows

    • For each good pair \((w, c)\in D\), sample \(k\) words \(w_{1:k}\) and add each of \((w_i, c)\) as a negative example to \(\bar{D}\). This results in \(\bar{D}\) being \(k\) times as large as \(D\). The number of negative samples \(k\) is a parameter of the algorithm

    • The negative words \(w\) can be sampled according to their corpus-based frequency. Actually in Word2Vec implementation, a smoothed version in which the counts are raised to the power of \(\frac{3}{4}\) before normalizing: \[ \frac{\#(w)^{0.75}}{\sum_{w'} \#(w')^{0.75}} \] This version gives more relative weights to less frequent words, and results in better word similarities in practice.

Word2Vec: CBOW

  • For a multi-word context \(c_{1:k}\), the CBOW variant of Word2Vec defines the context vector \(\mathbf{c}\) to be a sum of the embedding vectors of the context components \[ \mathbf{c} = \sum_{i=1}^k \mathbf{c}_i \]

  • The score of the word-context pair is simply defined as \[ s(w, c) = \mathbf{w} \cdot \mathbf{c} \]

  • Thus, the probability of a true pair is \[ P(D = 1\mid w, c_{1:k}) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{c}_1 + \mathbf{w} \cdot \mathbf{c}_2 + \cdots + \mathbf{w} \cdot \mathbf{c}_k)}} \]

  • The CBOW variant loses the order information between the context’s elements

  • In return, it allows the use of variable-length contexts

Word2Vec: Skip-Gram

  • For a \(k\)-element context \(c_{1:k}\), the skip-gram variant assumes that the elements \(c_i\) in the context are independent from each other, essentially treating them as \(k\) different contexts: \((w, c_1), (w, c_2), \ldots, (w, c_k)\)

  • The scoring function is the same as the CBOW version \[ s(w, c_i) = \mathbf{w} \cdot \mathbf{c_i} \]

  • The probability is a product of \(k\) terms \[ P(D = 1\mid w, c_{1:k}) = \prod_{i=1}^k P(D = 1\mid w, c_i) = \prod_{i=1}^k \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{c}_i}} \]

  • While the independence assumption is strong, the skip-gram variant is very effective in practice


  • GloVe constructs an explicit word-context matrix, and trains the word and context vectors \(\mathbf{w}\) and \(\mathbf{c}\) attempting to satisfy \[ \mathbf{w} \cdot \mathbf{c} + \mathbf{b}_{[w]} + \mathbf{b}_{[c]} = \log \#(w, c), \quad \forall (w, c) \in D \] where \(\mathbf{b}_{[w]}\) and \(\mathbf{b}_{[c]}\) are word-specific and context-specific trained biases

Choice of Contexts

Choice of contexts: window approach

  • The most common is a sliding window approach, containing a sequence of \(2m+1\) words. The middle word is called the focus word and the \(m\) words to each side are the contexts

  • Effective window size: usually 2-5.

    • Larger windows tend to produce more topical similarities (e.g., “dog”, “bark”, and “leash” will be grouped together, as well as “walked”, “run”, and “walking”)

    • Smaller windows tend to produce more functional and syntactic similarities (e.g., “Poodle”, “Pitbull”, and “Rottweiler”, or “walking”, “running”, and “approaching”)

  • Many variants on the window approach are possible. One may

    • lemmatize words before learning
    • apply text normalization
    • filter too short of too long sentences
    • remove capitalization

Limitations of distributional methods

  • Black sheep: people are less likely to mention known information than they are to mention novel ones

    • For example, when people talk of white sheep, they will likely refer to them as sheep, while for black sheep are are much more likely to retain the color information and say black sheep
  • Antonyms: words are opposite of each other (good vs bad, buy vs sell, hot vs cold) tend to appear in similar contexts

Ch11 Using Word Embeddings

Resources of Common Pre-Training Word Embeddings

Common pre-training word embeddings

Usages: Find Similarity, Word Analogies

Pre-trained word embedding usages

  • Calculate word similarity, e.g., using cosine similarity

  • Word clustering, e.g., using KMeans

  • Find similar words

    • With row-normalized embedding matrix, the cosine similarity between two words \(w_1\) and \(w_2\) is \[ \text{sim}_{\text{cos}}(w_1, w_2) = \mathbf{E}_{[w_1]} \cdot \mathbf{E}_{[w_2]} \]

    • We are often interested in the \(k\) most similar words to a given word \(w\). Let \(\mathbf{w} = \mathbf{E}_{[w]}\), then the similarity to all other words can be computed by the matrix-vector multiplication \[ \mathbf{s} = \mathbf{E} \mathbf{w} \]

More similarity measures

  • Similarity to a group of words: average similarity to the items in the group \[ \mathbf{s}_{[w]} = \text{sim}(w, w_{1:k}) = \mathbf{E} (\mathbf{w}_1 + \cdots + \mathbf{w}_k) / k \]

  • Short document similarity: consider two documents \(D_1 = w_1^1, \ldots, w_m^1\) and \(D_2 = w_1^2, \ldots, w_n^2\), \[\begin{align*} \text{sim}_{\text{doc}}(D_1, D_2) &= \sum_{i=1}^m \sum_{j=1}^n \text{cos}(\mathbf{w}_i^1, \mathbf{w}_j^2)\\ &= \left(\sum_{i=1}^m \mathbf{w}_i^1 \right) \cdot \left(\sum_{j=1}^n \mathbf{w}_j^2 \right) \end{align*}\]

Word analogies

  • One can perform “algebra” on the word vectors and get meaningful results

    • For example, \[ \mathbf{w}_{\text{king}} - \mathbf{w}_{\text{man}} + \mathbf{w}_{\text{woman}} \approx \mathbf{w}_{\text{queen}} \]
  • Analogy solving task: to answer analogy questions of the form \[ man : woman \rightarrow king:? \]

  • Solve the analogy question by maximization \[\begin{align*} \text{analogy}(m:w \rightarrow k:?) & = \arg\max_{v \in V\backslash \{m, w, k\}} \text{cos}(\mathbf{v}, \mathbf{k} - \mathbf{m} + \mathbf{w}) \end{align*}\]

Practicalities and pitfalls

  • While off-the-shelf, pre-trained word embeddings can be downloaded and used, it is advised to not just blindly download word embeddings and treat them as a black box

  • Be aware of choices such as the source of the training corpus

    • Larger training corpus is not always better. A smaller but cleaner, or smaller but more domain-focused corpus are often more effective
  • When using off-the-shelf embedding vectors, it is better to use the same tokenization and text normalization schemes that were used when deriving the corpus


  • Goldberg, Yoav. (2017). Neural Network Methods for Natural Language Processing, Morgan & Claypool