CS221 Artificial Intelligence, Notes on Variables (Weeks 6-8)

  • State-based vs variable-based search problems: order

    • There is usually an (obvious) order for state-based search problems, e.g., certain start and end points.

    • For variable-based search problems:

      • Ordering doesn’t affect correctness (e.g., in map coloring), so we can dynamically choose a better ordering of the variables (e.g., via lookahead).
      • Variables are interdependent in a local way.
    • Equivalent terminologies

      • Variable-based models = graphical models
      • Probabilistic graphical models = {Markov networks, Bayesian networks}
      • Markov networks = undirected graphical models
      • Bayesian networks = directed graphical models

Constraint Satisfaction Problems (CSPs)

Factor Graphs

Figure source: Stanford CS221 spring 2025, lecture slides week 6
  • Variables (and their domains): $X = (X_1, \ldots, X_n)$, where $X_i \in \text{Domain}_i$

  • Factors: constraints or preferences $f_1, \ldots, f_m$, where $f_j(X) \ge 0$

    • Factors measure how good an assignment is. We prefer assignments $X$ that achieve higher values of $f_j$.
    • If $f_j(X) \in \{0, 1\}$, it describes a constraint.
      • For example, a factor forcing $X_1$ and $X_2$ to be equal can be written as $\mathbb{1}[X_1 = X_2]$.
    • Scope of a factor: set of variables it depends on.
    • Arity of a factor: number of variables in the scope
      • Unary factor: arity 1
      • Binary factor: arity 2
  • Assignment weights: each assignment $x = (x_1, \ldots, x_n)$ has a weight $\text{Weight}(x) = \prod_{j=1}^m f_j(x)$

    • An assignment $x$ is consistent if $\text{Weight}(x) > 0$
  • Goal: find the best assignment of values to the variables, the one that maximizes the weight: $\arg\max_x \text{Weight}(x)$

    • A CSP is satisfiable if $\max_x \text{Weight}(x) > 0$.
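A minimal sketch of these definitions, using a hypothetical two-variable map-coloring-style CSP (the variable names, domains, and factor below are illustrative, not from the lecture):

```python
import itertools

# Toy factor graph: two adjacent map regions that must get different colors.
domains = {"X1": ["R", "G"], "X2": ["R", "G"]}
factors = [lambda x: 1.0 if x["X1"] != x["X2"] else 0.0]  # constraint 1[X1 != X2]

def weight(x):
    """Weight(x) = product over all factors f_j(x)."""
    w = 1.0
    for f in factors:
        w *= f(x)
    return w

# Brute-force enumeration of all complete assignments (exponential in general).
names = list(domains)
assignments = [dict(zip(names, vals))
               for vals in itertools.product(*(domains[n] for n in names))]
best = max(assignments, key=weight)        # argmax_x Weight(x)
print(best, weight(best))                  # consistent iff Weight > 0
```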

Markov Networks

  • A Markov network is a factor graph which defines a joint distribution over random variables $X = (X_1, \ldots, X_n)$: $P(X = x) = \frac{\text{Weight}(x)}{Z}$, where $Z = \sum_{x'} \text{Weight}(x')$ is the normalization constant.

    • Markov network = factor graph + probability
  • Marginal probability of $X_i = v$: $P(X_i = v) = \sum_{x : x_i = v} P(X = x)$
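Continuing the brute-force sketch from the CSP section, a Markov network only adds normalization on top of the factor graph (enumeration is feasible here only because the example is tiny):

```python
# Normalization constant: Z = sum over all assignments x of Weight(x).
Z = sum(weight(x) for x in assignments)

def prob(x):
    """P(X = x) = Weight(x) / Z."""
    return weight(x) / Z

def marginal(name, v):
    """P(X_i = v): sum P(X = x) over assignments x with x_i = v."""
    return sum(prob(x) for x in assignments if x[name] == v)

print(marginal("X1", "R"))  # 0.5 by symmetry in this toy example
```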

Gibbs Sampling

  • Gibbs sampling algorithm (a code sketch follows the table below):
    • Initialize $x$ to a random complete assignment
    • Loop through $i = 1, \ldots, n$ until convergence:
      • Set $x_i = v$ with probability $P(X_i = v \mid X_{-i} = x_{-i})$
      • Increment $\text{count}_i(x_i)$
    • Estimate $\hat{P}(X_i = x_i) = \frac{\text{count}_i(x_i)}{\sum_v \text{count}_i(v)}$
  • Search vs. sampling:

    | Iterated Conditional Modes | Gibbs Sampling |
    | --- | --- |
    | maximum weight assignment | marginal probabilities |
    | choose best value | sample a value |
    | converges to local optimum | marginals converge to correct answer |
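A minimal Gibbs sampling sketch for the same toy Markov network (it reuses `domains` and `weight()` from the CSP sketch; the iteration count is an arbitrary choice):

```python
import random
from collections import Counter

def gibbs_marginals(iters=10_000, seed=0):
    """Estimate the marginals of the toy Markov network by Gibbs sampling."""
    rng = random.Random(seed)
    names = list(domains)
    x = {n: rng.choice(domains[n]) for n in names}   # random complete assignment
    counts = {n: Counter() for n in names}
    for _ in range(iters):
        for n in names:                              # loop through i = 1, ..., n
            # P(X_i = v | X_{-i} = x_{-i}) is proportional to Weight(x with x_i = v)
            ws = []
            for v in domains[n]:
                x[n] = v
                ws.append(weight(x))
            x[n] = rng.choices(domains[n], weights=ws)[0]
            counts[n][x[n]] += 1                     # increment count_i(x_i)
    # Estimated marginal: count_i(v) / sum over v' of count_i(v')
    return {n: {v: c / iters for v, c in counts[n].items()} for n in names}

print(gibbs_marginals())  # close to the exact marginals computed above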

Bayesian Networks

  • Markov networks vs Bayesian networks
Figure source: Stanford CS221 spring 2025, lecture slides week 7
    | Markov networks | Bayesian networks |
    | --- | --- |
    | arbitrary factors | local conditional probabilities |
    | set of preferences | generative process |
    | undirected graphs | directed graphs |
  • Let $X = (X_1, \ldots, X_n)$ be random variables. A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over $X$ as a product of local conditional distributions, one for each node: $P(X_1 = x_1, \ldots, X_n = x_n) \stackrel{\text{def}}{=} \prod_{i=1}^n p(x_i \mid x_{\text{Parents}(i)})$ (a small code sketch follows this section)

  • Reducing Bayesian networks to Markov networks

    • Each local conditional distribution $p(x_i \mid x_{\text{Parents}(i)})$ becomes a single factor connecting node $i$ with all of its parents.
Figure source: Stanford CS221 spring 2025, lecture slides week 7
  • Leveraging additional structure
    • Throw away any unobserved leaves before running inference
    • Throw away any disconnected components before running inference (independence)
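A minimal sketch of the product-of-local-conditionals definition, with a hypothetical two-node network $A \to B$ and made-up probability tables:

```python
# Hypothetical Bayesian network A -> B (probabilities are made up).
p_a = {True: 0.3, False: 0.7}                     # p(a)
p_b_given_a = {True: {True: 0.9, False: 0.1},     # p(b | a)
               False: {True: 0.2, False: 0.8}}

def bn_joint(a, b):
    """P(A = a, B = b) = p(a) * p(b | a): one local conditional per node."""
    return p_a[a] * p_b_given_a[a][b]

# Sanity check: local conditionals are normalized, so the joint sums to 1.
total = sum(bn_joint(a, b) for a in (True, False) for b in (True, False))
assert abs(total - 1.0) < 1e-9
```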

Hidden Markov Model

  • Hidden Markov model (HMM): for each time step $t = 1, \ldots, T$ (a sampling sketch follows the figure below):
    • Generate object location $H_t \sim p(H_t \mid H_{t-1})$
    • Generate sensor reading $E_t \sim p(E_t \mid H_t)$
Figure source: Stanford CS221 spring 2025, lecture slides week 7
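A minimal sketch of this generative process, assuming a hypothetical two-state object (the transition and emission tables are made up):

```python
import random

rng = random.Random(0)
states = ["left", "right"]                       # possible hidden locations H_t
init = {"left": 0.5, "right": 0.5}               # p(h_1)
trans = {"left": {"left": 0.8, "right": 0.2},    # p(h_t | h_{t-1})
         "right": {"left": 0.2, "right": 0.8}}
emit = {"left": {0: 0.7, 1: 0.3},                # p(e_t | h_t), readings in {0, 1}
        "right": {0: 0.3, 1: 0.7}}

def sample(dist):
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def generate(T):
    """Sample (h_1..h_T, e_1..e_T): locations first, then sensor readings."""
    h = sample(init)                  # H_1 ~ p(h_1)
    hs, es = [h], [sample(emit[h])]
    for _ in range(T - 1):
        h = sample(trans[h])          # H_t ~ p(H_t | H_{t-1})
        hs.append(h)
        es.append(sample(emit[h]))    # E_t ~ p(E_t | H_t)
    return hs, es

print(generate(5))
```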

Forward Backward: estimate H

  • Lattice representation
drawing
Figure source: Stanford CS221 spring 2025, lecture slides week 7
  • Edge $\text{start} \to (H_1 = h_1)$ has weight $p(h_1) \, p(e_1 \mid h_1)$

  • Edge $(H_{i-1} = h_{i-1}) \to (H_i = h_i)$ has weight $p(h_i \mid h_{i-1}) \, p(e_i \mid h_i)$

  • Each path from start to end is an assignment with weight equal to the product of edge weights

  • Forward: sum of weights of paths from start to $H_i = h_i$: $F_i(h_i) = \sum_{h_{i-1}} F_{i-1}(h_{i-1}) \, \text{Weight}(H_{i-1} = h_{i-1}, H_i = h_i) = \sum_{h_{i-1}} F_{i-1}(h_{i-1}) \, p(h_i \mid h_{i-1}) \, p(e_i \mid h_i)$

  • Backward: sum of weights of paths from $H_i = h_i$ to end: $B_i(h_i) = \sum_{h_{i+1}} B_{i+1}(h_{i+1}) \, \text{Weight}(H_i = h_i, H_{i+1} = h_{i+1}) = \sum_{h_{i+1}} B_{i+1}(h_{i+1}) \, p(h_{i+1} \mid h_i) \, p(e_{i+1} \mid h_{i+1})$

  • Define the sum of weights of paths from start to end through $H_i = h_i$: $S_i(h_i) = F_i(h_i) \, B_i(h_i)$

  • Base cases: $F_1(h_1) = p(h_1) \, p(e_1 \mid h_1)$ and $B_n(h_n) = 1$

  • Forward-backward algorithm for HMMs:

    • Compute $F_1, F_2, \ldots, F_n$
    • Compute $B_n, B_{n-1}, \ldots, B_1$
    • Compute $S_i$ for each $i$ and normalize: $P(H_i = h_i \mid E = e) = \frac{S_i(h_i)}{\sum_v S_i(v)}$
  • Time complexity of forward-backward: $O(n \, |\text{Domain}|^2)$
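A minimal implementation sketch of these recurrences, reusing the toy HMM tables (`states`, `init`, `trans`, `emit`) defined above:

```python
def forward_backward(es):
    """Return P(H_i = h | E = e) for every position i, via F, B, and S."""
    n = len(es)
    # Forward: F_1(h) = p(h) p(e_1|h), then F_i(h) = sum_h' F_{i-1}(h') p(h|h') p(e_i|h).
    F = [{h: init[h] * emit[h][es[0]] for h in states}]
    for i in range(1, n):
        F.append({h: sum(F[i - 1][hp] * trans[hp][h] for hp in states) * emit[h][es[i]]
                  for h in states})
    # Backward: B_n(h) = 1, then B_i(h) = sum_h'' p(h''|h) p(e_{i+1}|h'') B_{i+1}(h'').
    B = [None] * n
    B[n - 1] = {h: 1.0 for h in states}
    for i in range(n - 2, -1, -1):
        B[i] = {h: sum(trans[h][hn] * emit[hn][es[i + 1]] * B[i + 1][hn] for hn in states)
                for h in states}
    # S_i(h) = F_i(h) B_i(h); normalize each position to get posterior marginals.
    posteriors = []
    for i in range(n):
        S = {h: F[i][h] * B[i][h] for h in states}
        Z = sum(S.values())
        posteriors.append({h: S[h] / Z for h in states})
    return posteriors

print(forward_backward([0, 0, 1]))  # posterior over the location at each step
```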

Particle Filtering: estimate H

  • Particle filtering is like beam search for HMMs: maintain $K$ partial assignments (particles)

  • Particle filtering for HMMs: 3 steps

    1. Propose: for each old particle $(h_1, \ldots, h_{i-1})$, sample $H_i \sim p(h_i \mid h_{i-1})$
    2. Weight: for each old particle $(h_1, \ldots, h_{i-1}, h_i)$, weight it by $p(e_i \mid h_i)$
    3. Resample: normalize the weights and draw $K$ samples to redistribute particles to more promising areas. The resulting distribution is approximately $P(H_1, \ldots, H_i \mid E_1, \ldots, E_i)$
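A minimal sketch of the three steps on the same toy HMM, tracking only the latest $h_i$ per particle for brevity (real implementations often keep the full particle histories):

```python
import random

def particle_filter(es, K=1000, seed=1):
    """Approximate P(H_i | E_1..E_i) after each observation with K particles."""
    rng = random.Random(seed)
    particles, estimates = [], []
    for t, e in enumerate(es):
        # 1. Propose: H_1 ~ p(h_1), or H_i ~ p(h_i | h_{i-1}) for each old particle.
        if t == 0:
            particles = rng.choices(states, weights=[init[h] for h in states], k=K)
        else:
            particles = [rng.choices(states, weights=[trans[h][hn] for hn in states])[0]
                         for h in particles]
        # 2. Weight: weight each particle by p(e_i | h_i).
        ws = [emit[h][e] for h in particles]
        # 3. Resample: draw K particles in proportion to the weights.
        particles = rng.choices(particles, weights=ws, k=K)
        estimates.append({h: particles.count(h) / K for h in states})
    return estimates

print(particle_filter([0, 0, 1]))  # approximate filtering estimates P(H_i | e_1..e_i)
```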

Parameter Estimation

Smoothing

  • Laplace smoothing: for each distribution $d$ and partial assignment $(x_{\text{Parents}(i)}, x_i)$, add $\lambda$ to $\text{count}_d(x_{\text{Parents}(i)}, x_i)$
    • This is like adding a Dirichlet prior
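A minimal sketch for a single local conditional distribution (the counts and domain are made up):

```python
from collections import Counter

# Hypothetical counts for p(x_i | x_Parents(i)) at one fixed parent assignment.
counts = Counter({"red": 3, "green": 1})    # "blue" was never observed
domain = ["red", "green", "blue"]
lam = 1.0                                   # smoothing strength lambda

# Add lambda to every count, then normalize (count-and-normalize with a prior).
total = sum(counts[v] + lam for v in domain)
p = {v: (counts[v] + lam) / total for v in domain}
print(p)  # "blue" now gets probability 1/7 instead of 0
```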

Expectation Maximization (EM) Algorithm

  • Initialize $\theta$ randomly
  • Repeat until convergence (a worked sketch follows this list):
    • E step: fix $\theta$, update $H$
      • For each $h$, compute $q(h) = P(H = h \mid E = e; \theta)$
      • Create fully observed weighted examples: $(h, e)$ with weight $q(h)$
    • M step: fix $H$, update $\theta$
      • Maximum likelihood (count and normalize) on the weighted examples to get $\theta$
  • Properties of the EM algorithm
    • The EM algorithm deals with hidden variables $H$
    • Intuition: a generalization of the K-means algorithm:
      • Cluster centroids = parameters $\theta$
      • Cluster assignments = hidden variables $H$
    • The EM algorithm converges to a local optimum
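A worked sketch of EM on a classic hidden-variable problem, a hypothetical two-coin mixture: each trial secretly picks one of two biased coins uniformly at random (the hidden $H$) and reports the number of heads in $n$ flips (the evidence $E$); the parameters are the coin biases $\theta = (\theta_A, \theta_B)$:

```python
# Hypothetical data: heads observed in n = 10 flips per trial.
heads = [9, 8, 2, 1, 7, 3]
n = 10

def em(heads, n, theta=(0.6, 0.4), iters=50):
    tA, tB = theta                    # initialize theta (fixed here, not random)
    for _ in range(iters):
        # E step: q(A) = P(H = A | e, theta) per trial; the uniform coin prior
        # and the binomial coefficient cancel in the normalization.
        qA = []
        for h in heads:
            lA = tA ** h * (1 - tA) ** (n - h)
            lB = tB ** h * (1 - tB) ** (n - h)
            qA.append(lA / (lA + lB))
        # M step: maximum likelihood on the weighted examples (count, normalize).
        tA = sum(q * h for q, h in zip(qA, heads)) / (n * sum(qA))
        tB = sum((1 - q) * h for q, h in zip(qA, heads)) / (n * sum(1 - q for q in qA))
    return tA, tB

print(em(heads, n))  # a local optimum: roughly (0.8, 0.2) for this data
```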

A more general version of the EM algorithm

  1. Choose the initial parameters $\theta^{\text{old}}$

  2. E step: since the conditional posterior $p(Z \mid X, \theta^{\text{old}})$ contains all of our knowledge about the latent variable $Z$, we compute the expected complete-data log likelihood under it: $Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{Z \mid X, \theta^{\text{old}}}\left[\log p(X, Z \mid \theta)\right] = \sum_Z p(Z \mid X, \theta^{\text{old}}) \log p(X, Z \mid \theta)$

  3. M step: revise the parameter estimate: $\theta^{\text{new}} = \arg\max_\theta Q(\theta, \theta^{\text{old}})$

    • Note that in the M step, the logarithm acts directly on the joint likelihood $p(X, Z \mid \theta)$, so the maximization will be tractable.
  4. Check for convergence of the log likelihood or the parameter values. If not converged, replace $\theta^{\text{old}}$ with $\theta^{\text{new}}$ and return to step 2.
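To connect this to the coin sketch above: dropping terms constant in $\theta$, the expected complete-data log likelihood for that (hypothetical) two-coin model is

$$Q(\theta, \theta^{\text{old}}) = \sum_{t=1}^{T} \sum_{z \in \{A, B\}} q_t(z) \left[ h_t \log \theta_z + (n - h_t) \log (1 - \theta_z) \right], \qquad q_t(z) = p(z \mid h_t, \theta^{\text{old}}),$$

and setting $\partial Q / \partial \theta_z = 0$ recovers the count-and-normalize update $\theta_z = \frac{\sum_t q_t(z) \, h_t}{n \sum_t q_t(z)}$ used in the code sketch.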

See my EM algorithm notes for more details