CS221 Artificial Intelligence, Notes on Variables (Week 6-8)

  • State-based vs variable-based search problems: order

    • There is usually an (obvious) order for state-based search problems, e.g., certain start and end points.

    • For variable-based search problems:

      • Ordering doesn’t affect correctness (e.g., map coloring problem), so we might dynamically choose a better ordering of the variables (e.g., lookahead).
      • Variables are interdependent in a local way.
    • Equivalent terminologies

      • Variable-based models \(=\) graphical models
      • Probabilistic graphical models \(=\) \(\{\)Markov networks, Bayesian networks\(\}\)
      • Markov networks \(=\) undirected graphical models
      • Bayesian networks \(=\) directed graphical models

Constraint Satisfaction Problems (CSPs)

Factor Graphs

[Figure source: Stanford CS221 spring 2025, lecture slides week 6]
  • Variables (and their domains): \[ X = (X_1, \ldots, X_n), \text{ where } X_i \in \text{Domain}_i \]

  • Factors: constraints or preferences \[ f_1, \ldots, f_m, \text{ where } f_j(X) \geq 0 \]

    • Factors measure how good an assignment is. We prefer \(X\) that can achieve higher value of \(f_j\).
    • If \(f_j(X) \in \{0, 1\}\), the factor describes a constraint.
      • For example, a factor forcing \(X_1\) and \(X_2\) to be equal can be written as \(\mathbf{1}[X_1 = X_2]\).
    • Scope of a factor: set of variables it depends on.
    • Arity of a factor: number of variables in the scope
      • Unary factor: arity 1
      • Binary factor: arity 2
  • Assignment weights: each assignment \(x = (x_1, \ldots, x_n)\) has a weight \[\text{Weight}(x) = \prod_{j=1}^m f_j (x)\]

    • An assignment \(x\) is consistent if Weight\((x) > 0\)
  • Goal: find the best assignment of values to the variables to maximize the weight \[ \arg\max_x \text{Weight}(x) \]

    • A CSP is satisfiable if \(\max_x \text{Weight}(x) > 0\).
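
To make the weight and consistency definitions concrete, here is a minimal Python sketch (my own illustration, not course code) that enumerates all assignments of a tiny made-up factor graph and returns the maximum-weight one; the variables X1–X3 and the three factors are invented for the example.

```python
import itertools

# A tiny factor graph: three binary variables and three factors (illustrative).
domains = {"X1": [0, 1], "X2": [0, 1], "X3": [0, 1]}

# Each factor maps a full assignment (a dict) to a non-negative number.
factors = [
    lambda x: 1.0 if x["X1"] == x["X2"] else 0.0,   # constraint: X1 = X2
    lambda x: 2.0 if x["X2"] != x["X3"] else 1.0,   # preference: X2 != X3
    lambda x: 3.0 if x["X1"] == 1 else 1.0,         # unary preference on X1
]

def weight(x):
    """Weight(x) = product of all factor values on assignment x."""
    w = 1.0
    for f in factors:
        w *= f(x)
    return w

def all_assignments():
    names = list(domains)
    for values in itertools.product(*(domains[n] for n in names)):
        yield dict(zip(names, values))

# Goal: argmax_x Weight(x); the CSP is satisfiable if the max weight is > 0.
best = max(all_assignments(), key=weight)
print(best, weight(best))   # e.g. {'X1': 1, 'X2': 1, 'X3': 0} 6.0
```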

Markov Networks

  • A Markov network is a factor graph which defines a joint distribution over random variables \(X = (X_1, \ldots, X_n)\): \[ \mathbb{P}(X = x) = \frac{\text{Weight}(x)}{Z} \] where \(Z = \sum_{x^\prime} \text{Weight}(x^\prime)\) is the normalization constant.

    • Markov network \(=\) factor graph \(+\) probability
  • Marginal probability of \(X_i = v\) is \[ \mathbb{P}(X_i = v) = \sum_{x: x_i = v} \mathbb{P}(X = x) \]
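
As a quick numerical check of these definitions, here is a small sketch (my own, not from the slides) that normalizes the weights of a two-variable toy network to get a joint distribution and then computes a marginal by brute-force summation.

```python
from itertools import product

# Two binary variables with an agreement factor and a unary factor (made up).
def f_agree(x1, x2):
    return 2.0 if x1 == x2 else 1.0   # prefer X1 == X2

def f_unary(x1, x2):
    return 3.0 if x1 == 1 else 1.0    # prefer X1 == 1

def weight(x1, x2):
    return f_agree(x1, x2) * f_unary(x1, x2)

# Normalization constant Z = sum of weights over all assignments.
Z = sum(weight(x1, x2) for x1, x2 in product([0, 1], repeat=2))

# Marginal P(X2 = 1) = sum over assignments with x2 = 1 of Weight / Z.
p_x2_is_1 = sum(weight(x1, 1) for x1 in [0, 1]) / Z
print(Z, p_x2_is_1)   # 12.0, 7/12 ~ 0.583
```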

Gibbs Sampling

  • Gibbs sampling algorithm (a short Python sketch follows the table below):
    • Initialize \(x\) to a random complete assignment
    • Loop through \(i = 1, \ldots, n\) until convergence:
      • Set \(x_i = v\) with probability \[\mathbb{P}(X_i = v \mid X_{-i} = x_{-i})\]
      • Increment count\(_i(x_i)\)
    • Estimate \[\hat{\mathbb{P}}(X_i = x_i) = \frac{\text{count}_i (x_i)}{\sum_v \text{count}_i (v)}\]
  • Search vs sampling

| Iterated Conditional Modes | Gibbs Sampling |
| --- | --- |
| maximum weight assignment | marginal probabilities |
| choose best value | sample a value |
| converges to local optimum | marginals converge to correct answer |
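
Below is a minimal Gibbs sampling sketch (my own illustration) for the same toy two-variable network as in the previous sketch: at each step it resamples one variable from its conditional distribution, which is proportional to the assignment weight, and tallies counts to estimate marginals.

```python
import random

# Toy Markov network: agreement factor on (X1, X2) and a unary factor on X1.
def weight(x):
    agree = 2.0 if x["X1"] == x["X2"] else 1.0
    unary = 3.0 if x["X1"] == 1 else 1.0
    return agree * unary

def gibbs_marginals(num_iters=10000, seed=0):
    rng = random.Random(seed)
    x = {"X1": rng.choice([0, 1]), "X2": rng.choice([0, 1])}  # random complete assignment
    counts = {name: {0: 0, 1: 0} for name in x}
    for _ in range(num_iters):
        for name in x:
            # P(X_i = v | X_{-i}) is proportional to Weight with X_i set to v.
            weights = []
            for v in [0, 1]:
                x[name] = v
                weights.append(weight(x))
            total = sum(weights)
            v = 1 if rng.random() < weights[1] / total else 0
            x[name] = v
            counts[name][v] += 1
    # Estimate marginals from counts.
    return {name: c[1] / (c[0] + c[1]) for name, c in counts.items()}

print(gibbs_marginals())   # P(X2 = 1) should approach 7/12 ~ 0.583
```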

Bayesian Networks

  • Markov networks vs Bayesian networks
[Figure source: Stanford CS221 spring 2025, lecture slides week 7]

| Markov networks | Bayesian networks |
| --- | --- |
| arbitrary factors | local conditional probabilities |
| set of preferences | generative process |
| undirected graphs | directed graphs |
  • Let \(X = (X_1, \ldots, X_n)\) be random variables. A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over \(X\) as a product of local conditional distributions, one for each node (a small numerical sketch appears at the end of this section): \[ \mathbb{P}(X_1 =x_1, \ldots, X_n = x_n) \stackrel{\text{def}}{=} \prod_{i=1}^n p\left(x_i \mid x_{\text{Parents}(i)} \right) \]

  • Reducing Bayesian networks to Markov networks

    • Remember to use a single factor connecting each node to all of its parents (rather than a separate factor per parent).
[Figure source: Stanford CS221 spring 2025, lecture slides week 7]
  • Leveraging additional structure
    • Throw away any unobserved leaves before running inference
    • Throw away any disconnected components before running inference (independence)
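
Here is the small numerical sketch referenced above (my own example with made-up probabilities): a three-node network where Rain and Sprinkler are independent parents of WetGrass, and the joint probability is the product of the local conditional distributions.

```python
# Local conditional distributions (illustrative numbers).
p_rain = {True: 0.2, False: 0.8}
p_sprinkler = {True: 0.4, False: 0.6}
# p(wet | rain, sprinkler): probability the grass is wet.
p_wet_given = {
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.05,
}

def joint(rain, sprinkler, wet):
    """P(Rain, Sprinkler, WetGrass) = p(rain) p(sprinkler) p(wet | rain, sprinkler)."""
    p_w = p_wet_given[(rain, sprinkler)]
    return p_rain[rain] * p_sprinkler[sprinkler] * (p_w if wet else 1 - p_w)

# Joint probabilities over all assignments sum to 1, as for any valid BN.
total = sum(joint(r, s, w) for r in (True, False)
            for s in (True, False) for w in (True, False))
print(joint(True, False, True), total)   # 0.2 * 0.6 * 0.9 = 0.108, total = 1.0
```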

Hidden Markov Model

  • Hidden Markov model (HMM): For each time step \(t = 1, \ldots, T\),
    • Generate object location \(H_t \sim p(H_t \mid H_{t-1})\) (for \(t = 1\), sample \(H_1 \sim p(H_1)\))
    • Generate sensor reading \(E_t \sim p(E_t \mid H_t)\)
[Figure source: Stanford CS221 spring 2025, lecture slides week 7]
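
A minimal sketch of this generative process in Python (the state space, transition, and emission probabilities are made up for illustration):

```python
import random

# Toy HMM: hidden location on {0, 1, 2} and a noisy sensor reading.
STATES = [0, 1, 2]
p_init = [1 / 3, 1 / 3, 1 / 3]                 # p(H_1)
def p_trans(h_prev):                           # p(H_t | H_{t-1}): mostly stay put
    return [0.8 if h == h_prev else 0.1 for h in STATES]
def p_emit(h):                                 # p(E_t | H_t): sensor is right 70% of the time
    return [0.7 if e == h else 0.15 for e in STATES]

def sample_hmm(T, seed=0):
    rng = random.Random(seed)
    hs, es = [], []
    for t in range(T):
        probs = p_init if t == 0 else p_trans(hs[-1])
        h = rng.choices(STATES, weights=probs)[0]      # generate object location
        e = rng.choices(STATES, weights=p_emit(h))[0]  # generate sensor reading
        hs.append(h)
        es.append(e)
    return hs, es

print(sample_hmm(5))
```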

Forward Backward: estimate \(H\)

  • Lattice representation
[Figure source: Stanford CS221 spring 2025, lecture slides week 7]
  • Edge start \(\rightarrow H_1 = h_1\) has weight \(p(h_1)p(e_1 \mid h_1)\)

  • Edge \(H_{i-1} = h_{i-1} \rightarrow H_i = h_i\) has weight \(p(h_i \mid h_{i-1}) p(e_i \mid h_i)\)

  • Each path from start to end is an assignment with weight equal to the product of edge weights

  • Forward: sum of weights of paths from start to \(H_i = h_i\) \[\begin{align*} F_i(h_i) & = \sum_{h_{i-1}} F_{i-1}(h_{i-1}) \cdot \text{Weight}\left(H_{i-1}=h_{i-1}, H_i = h_i\right)\\ & = \sum_{h_{i-1}} F_{i-1}(h_{i-1}) \cdot p(h_i \mid h_{i-1}) p(e_i \mid h_i) \end{align*}\]

  • Backward: sum of weights of paths from \(H_i = h_i\) to end \[\begin{align*} B_i(h_i) & = \sum_{h_{i+1}} B_{i+1}(h_{i+1}) \cdot \text{Weight}\left(H_{i}=h_{i}, H_{i+1} = h_{i+1}\right)\\ & =\sum_{h_{i+1}} B_{i+1}(h_{i+1}) \cdot p(h_{i+1} \mid h_{i}) p(e_{i+1} \mid h_{i+1}) \end{align*}\]

  • Define \(S_i(h_i)\), the sum of weights of paths from start to end through \(H_i = h_i\): \[ S_i(h_i) = F_i(h_i) B_i(h_i) \]

  • Base cases: \[F_1(h_1) = p(h_1)\, p(e_1 \mid h_1)\] \[B_{n}(h_n) = 1\]

  • Forward-backward algorithm for HMMs:

    • Compute \(F_1, F_2, \ldots, F_n\)
    • Compute \(B_n, B_{n-1}, \ldots, B_1\)
    • Compute \(S_i\) for each \(i\) and normalize \[ \mathbb{P}(H_i = h_i \mid E = e) = \frac{S_i(h_i)}{\sum_v S_i(v)} \]
  • Time complexity of forward-backward: \(O(n |\text{Domain}|^2)\)
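
The recurrences above translate directly into code. Below is a sketch of forward-backward (my own implementation, reusing the toy HMM from the previous sketch, with unnormalized \(F\) and \(B\)).

```python
# Toy HMM (same illustrative numbers as before).
STATES = [0, 1, 2]
p_init = [1 / 3, 1 / 3, 1 / 3]
def p_trans(h_prev):
    return [0.8 if h == h_prev else 0.1 for h in STATES]
def p_emit(h):
    return [0.7 if e == h else 0.15 for e in STATES]

def forward_backward(es):
    n = len(es)
    # Forward: F_1(h) = p(h) p(e_1 | h); F_i(h) = sum_{h'} F_{i-1}(h') p(h | h') p(e_i | h)
    F = [[0.0] * len(STATES) for _ in range(n)]
    for h in STATES:
        F[0][h] = p_init[h] * p_emit(h)[es[0]]
    for i in range(1, n):
        for h in STATES:
            F[i][h] = sum(F[i - 1][hp] * p_trans(hp)[h] for hp in STATES) * p_emit(h)[es[i]]
    # Backward: B_n(h) = 1; B_i(h) = sum_{h'} B_{i+1}(h') p(h' | h) p(e_{i+1} | h')
    B = [[1.0] * len(STATES) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for h in STATES:
            B[i][h] = sum(B[i + 1][hn] * p_trans(h)[hn] * p_emit(hn)[es[i + 1]] for hn in STATES)
    # S_i(h) = F_i(h) B_i(h); normalize to get P(H_i = h | E = e).
    posteriors = []
    for i in range(n):
        S = [F[i][h] * B[i][h] for h in STATES]
        posteriors.append([s / sum(S) for s in S])
    return posteriors

print(forward_backward([0, 0, 1, 2, 2]))
```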

Particle Filtering: estimate \(H\)

  • Particle filtering is kind of like beam search for HMMs: keep \(\leq K\) partial assignments (particles)

  • Particle filtering for HMMs: 3 steps

    1. Propose: For each old particle \((h_1, \ldots, h_{i-1})\), sample \(H_i \sim p(h_i \mid h_{i-1})\)
    2. Weight: For each old particle \((h_1, \ldots, h_{i-1}, h_i)\), weight it by \(p(e_i \mid h_i)\)
    3. Resample: Normalize weights and draw \(K\) samples to redistribute particles to more promising areas. The resulting distribution is approximately \(\mathbb{P}(H_1, \ldots, H_i \mid E_1, \ldots, E_i)\)
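
A sketch of the three steps for the same toy HMM (my own illustration; here each particle stores its full trajectory, and resampling draws with replacement in proportion to the weights):

```python
import random

# Toy HMM (same illustrative numbers as before).
STATES = [0, 1, 2]
p_init = [1 / 3, 1 / 3, 1 / 3]
def p_trans(h_prev):
    return [0.8 if h == h_prev else 0.1 for h in STATES]
def p_emit(h):
    return [0.7 if e == h else 0.15 for e in STATES]

def particle_filter(es, K=100, seed=0):
    rng = random.Random(seed)
    particles = [[] for _ in range(K)]   # each particle is a partial assignment (h_1, ..., h_i)
    for i, e in enumerate(es):
        # 1. Propose: extend each particle by sampling H_i ~ p(. | h_{i-1}).
        for p in particles:
            probs = p_init if i == 0 else p_trans(p[-1])
            p.append(rng.choices(STATES, weights=probs)[0])
        # 2. Weight: weight each particle by p(e_i | h_i).
        weights = [p_emit(p[-1])[e] for p in particles]
        # 3. Resample: draw K particles in proportion to their weights.
        particles = [list(p) for p in rng.choices(particles, weights=weights, k=K)]
    return particles

particles = particle_filter([0, 0, 1, 2, 2])
# Fraction of particles ending in each state approximates P(H_n | E_1, ..., E_n).
print([sum(p[-1] == h for p in particles) / len(particles) for h in STATES])
```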

Parameter Estimation

Smoothing

  • Laplace smoothing: for each distribution \(d\) and partial assignment \((x_{\text{Parents}(i)}, x_i)\), add \(\lambda\) to count\(_d(x_{\text{Parents}(i)}, x_i)\)
    • This is like adding a Dirichlet prior
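
A minimal sketch of Laplace smoothing for one conditional distribution (the function name and the data below are made up for illustration):

```python
from collections import Counter

def smoothed_cpt(observations, parent_values, child_values, lam=1.0):
    """observations: list of (parent, child) pairs; returns a smoothed p(child | parent)."""
    counts = Counter(observations)
    cpt = {}
    for pa in parent_values:
        # Add lambda to every (parent, child) count, then normalize.
        total = sum(counts[(pa, v)] + lam for v in child_values)
        cpt[pa] = {v: (counts[(pa, v)] + lam) / total for v in child_values}
    return cpt

# With lam = 1, an unseen (parent, child) pair still gets nonzero probability.
print(smoothed_cpt([("a", 1), ("a", 1), ("b", 0)], ["a", "b"], [0, 1], lam=1.0))
```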

Expectation Maximization (EM) Algorithm

  • Initialize \(\theta\) randomly
  • Repeat until convergence:
    • E step: fix \(\theta\), update \(H\)
      • For each \(h\) compute \[q(h) = \mathbb{P}(H=h \mid E=e, \theta)\]
      • Create fully-observed weighted examples: \((h, e)\) with weight \(q(h)\)
    • M step: fix \(H\), update \(\theta\)
      • Maximum likelihood (count and normalize) on weighted examples to get \(\theta\)
  • Properties of the EM algorithm
    • EM algorithm deals with hidden variables \(H\)
    • Intuition: EM is a generalization of the K-means algorithm:
      • Cluster centroids = parameter \(\theta\)
      • Cluster assignments = hidden variables \(H\)
    • EM algorithm converges to a local optimum
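
To make the E and M steps concrete, here is a sketch of EM on a small "mixture of two coins" model (my own example, not from the lecture): the hidden variable is which coin generated each sequence of flips, and the M step is count-and-normalize on the \(q\)-weighted examples.

```python
# Each example is a tuple of flips (1 = heads) from one hidden coin H in {0, 1};
# parameters are pi_h = P(H = h) and theta_h = P(heads | coin h).

def em(examples, num_iters=100):
    pi = [0.5, 0.5]
    theta = [0.4, 0.6]              # arbitrary asymmetric initialization
    for _ in range(num_iters):
        counts_h = [0.0, 0.0]       # expected number of examples from each coin
        heads_h = [0.0, 0.0]        # expected number of heads from each coin
        flips_h = [0.0, 0.0]        # expected number of flips from each coin
        for flips in examples:
            heads = sum(flips)
            tails = len(flips) - heads
            # E step: q(h) = P(H = h | flips, theta), by Bayes' rule.
            joint = [pi[h] * theta[h] ** heads * (1 - theta[h]) ** tails for h in (0, 1)]
            z = sum(joint)
            for h in (0, 1):
                q = joint[h] / z
                counts_h[h] += q            # weighted example with weight q(h)
                heads_h[h] += q * heads
                flips_h[h] += q * len(flips)
        # M step: maximum likelihood (count and normalize) on weighted examples.
        pi = [c / sum(counts_h) for c in counts_h]
        theta = [heads_h[h] / flips_h[h] for h in (0, 1)]
    return pi, theta

examples = [(1, 1, 1, 1, 0), (1, 1, 1, 0, 1), (0, 0, 1, 0, 0), (0, 1, 0, 0, 0)]
print(em(examples))   # theta should separate toward a "mostly heads" and a "mostly tails" coin
```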

A more general version of the EM algorithm

  1. Choose the initial parameters \(\boldsymbol\theta^{\text{old}}\)

  2. E step: since the conditional posterior \(p\left( \mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{\text{old}} \right)\) contains all of our knowledge about the latent variable \(\mathbf{Z}\), we compute the expected complete-data log likelihood under it. \[\begin{align*} \mathcal{Q}(\boldsymbol\theta, \boldsymbol\theta^{\text{old}}) & = E_{\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{\text{old}}} \left\{\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta) \right\} \\ & = \sum_{\mathbf{Z}} p\left( \mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{\text{old}} \right) \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta) \end{align*}\]

  3. M step: revise parameter estimate \[ \boldsymbol\theta^{\text{new}} = \arg\max_{\boldsymbol\theta} \mathcal{Q}(\boldsymbol\theta, \boldsymbol\theta^{\text{old}}) \]

    • Note that in the M step, the logarithm acts directly on the joint likelihood \(p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta)\), so the maximization will be tractable.
  4. Check for convergence of the log likelihood or the parameter values. If not converged, use \(\boldsymbol\theta^{\text{new}}\) to replace \(\boldsymbol\theta^{\text{old}}\), and return to step 2.

See my EM algorithm notes for more details