CS221 Artificial Intelligence, Notes on Variables (Weeks 6-8)

  • State-based vs variable-based search problems: order

    • There is usually an (obvious) order for state-based search problems, e.g., certain start and end points.

    • For variable-based search problems:

      • Ordering doesn’t affect correctness (e.g., in map coloring), so we can dynamically choose a better ordering of the variables (e.g., via lookahead).
      • Variables are interdependent in a local way.
    • Equivalent terminologies

      • Variable-based models = graphical models
      • Probabilistic graphical models = {Markov networks, Bayesian networks}
      • Markov networks = undirected graphical models
      • Bayesian networks = directed graphical models

Constraint Satisfaction Problems (CSPs)

Factor Graphs

Figure source: Stanford CS221 spring 2025, lecture slides week 6
  • Variables (and their domains): $X = (X_1, \ldots, X_n)$, where $X_i \in \text{Domain}_i$

  • Factors: constraints or preferences $f_1, \ldots, f_m$, where $f_j(X) \ge 0$

    • Factors measure how good an assignment is. We prefer assignments $X$ that achieve higher values of $f_j$.
    • If $f_j(X) \in \{0, 1\}$, it describes a constraint.
      • For example, a factor forcing $X_1$ and $X_2$ to be equal can be written as $\mathbb{1}[X_1 = X_2]$.
    • Scope of a factor: set of variables it depends on.
    • Arity of a factor: number of variables in the scope
      • Unary factor: arity 1
      • Binary factor: arity 2
  • Assignment weights: each assignment $x = (x_1, \ldots, x_n)$ has a weight $\text{Weight}(x) = \prod_{j=1}^m f_j(x)$

    • An assignment $x$ is consistent if $\text{Weight}(x) > 0$
  • Goal: find the best assignment of values to the variables, the one that maximizes the weight: $\arg\max_x \text{Weight}(x)$

    • A CSP is satisfiable if $\max_x \text{Weight}(x) > 0$.
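A minimal sketch of these definitions, using a hypothetical two-variable map-coloring-style CSP (the variable names, domains, and factor below are illustrative, not from the lecture):

```python
import itertools

# Toy factor graph: two adjacent map regions that must get different colors.
domains = {"X1": ["R", "G"], "X2": ["R", "G"]}
factors = [lambda x: 1.0 if x["X1"] != x["X2"] else 0.0]  # constraint 1[X1 != X2]

def weight(x):
    """Weight(x) = product over all factors f_j(x)."""
    w = 1.0
    for f in factors:
        w *= f(x)
    return w

# Brute-force enumeration of all complete assignments (exponential in general).
names = list(domains)
assignments = [dict(zip(names, vals))
               for vals in itertools.product(*(domains[n] for n in names))]
best = max(assignments, key=weight)        # argmax_x Weight(x)
print(best, weight(best))                  # consistent iff Weight > 0
```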

Markov Networks

  • A Markov network is a factor graph which defines a joint distribution over random variables $X = (X_1, \ldots, X_n)$: $P(X = x) = \frac{\text{Weight}(x)}{Z}$, where $Z = \sum_{x'} \text{Weight}(x')$ is the normalization constant.

    • Markov network = factor graph + probability
  • Marginal probability of $X_i = v$: $P(X_i = v) = \sum_{x : x_i = v} P(X = x)$
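Continuing the brute-force sketch from the CSP section, a Markov network only adds normalization on top of the factor graph (enumeration is feasible here only because the example is tiny):

```python
# Normalization constant: Z = sum over all assignments x of Weight(x).
Z = sum(weight(x) for x in assignments)

def prob(x):
    """P(X = x) = Weight(x) / Z."""
    return weight(x) / Z

def marginal(name, v):
    """P(X_i = v): sum P(X = x) over assignments x with x_i = v."""
    return sum(prob(x) for x in assignments if x[name] == v)

print(marginal("X1", "R"))  # 0.5 by symmetry in this toy example
```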

Gibbs Sampling

  • Gibbs sampling algorithm (a code sketch follows the table below):
    • Initialize $x$ to a random complete assignment
    • Loop through $i = 1, \ldots, n$ until convergence:
      • Set $x_i = v$ with probability $P(X_i = v \mid X_{-i} = x_{-i})$
      • Increment $\text{count}_i(x_i)$
    • Estimate $\hat{P}(X_i = x_i) = \frac{\text{count}_i(x_i)}{\sum_v \text{count}_i(v)}$
  • Search vs. sampling:

    | Iterated Conditional Modes | Gibbs Sampling |
    | --- | --- |
    | maximum weight assignment | marginal probabilities |
    | choose best value | sample a value |
    | converges to local optimum | marginals converge to correct answer |
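A minimal Gibbs sampling sketch for the same toy Markov network (it reuses `domains` and `weight()` from the CSP sketch; the iteration count is an arbitrary choice):

```python
import random
from collections import Counter

def gibbs_marginals(iters=10_000, seed=0):
    """Estimate the marginals of the toy Markov network by Gibbs sampling."""
    rng = random.Random(seed)
    names = list(domains)
    x = {n: rng.choice(domains[n]) for n in names}   # random complete assignment
    counts = {n: Counter() for n in names}
    for _ in range(iters):
        for n in names:                              # loop through i = 1, ..., n
            # P(X_i = v | X_{-i} = x_{-i}) is proportional to Weight(x with x_i = v)
            ws = []
            for v in domains[n]:
                x[n] = v
                ws.append(weight(x))
            x[n] = rng.choices(domains[n], weights=ws)[0]
            counts[n][x[n]] += 1                     # increment count_i(x_i)
    # Estimated marginal: count_i(v) / sum over v' of count_i(v')
    return {n: {v: c / iters for v, c in counts[n].items()} for n in names}

print(gibbs_marginals())  # close to the exact marginals computed above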

Bayesian Networks

  • Markov networks vs Bayesian networks
Figure source: Stanford CS221 spring 2025, lecture slides week 7
    | Markov networks | Bayesian networks |
    | --- | --- |
    | arbitrary factors | local conditional probabilities |
    | set of preferences | generative process |
    | undirected graphs | directed graphs |
  • Let $X = (X_1, \ldots, X_n)$ be random variables. A Bayesian network is a directed acyclic graph (DAG) that specifies a joint distribution over $X$ as a product of local conditional distributions, one for each node: $P(X_1 = x_1, \ldots, X_n = x_n) \stackrel{\text{def}}{=} \prod_{i=1}^n p(x_i \mid x_{\text{Parents}(i)})$ (a small code sketch follows this section)

  • Reducing Bayesian networks to Markov networks

    • Each local conditional distribution $p(x_i \mid x_{\text{Parents}(i)})$ becomes a single factor connecting node $i$ with all of its parents.
Figure source: Stanford CS221 spring 2025, lecture slides week 7
  • Leveraging additional structure
    • Throw away any unobserved leaves before running inference
    • Throw away any disconnected components before running inference (independence)
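A minimal sketch of the product-of-local-conditionals definition, with a hypothetical two-node network $A \to B$ and made-up probability tables:

```python
# Hypothetical Bayesian network A -> B (probabilities are made up).
p_a = {True: 0.3, False: 0.7}                     # p(a)
p_b_given_a = {True: {True: 0.9, False: 0.1},     # p(b | a)
               False: {True: 0.2, False: 0.8}}

def bn_joint(a, b):
    """P(A = a, B = b) = p(a) * p(b | a): one local conditional per node."""
    return p_a[a] * p_b_given_a[a][b]

# Sanity check: local conditionals are normalized, so the joint sums to 1.
total = sum(bn_joint(a, b) for a in (True, False) for b in (True, False))
assert abs(total - 1.0) < 1e-9
```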

Hidden Markov Model

  • Hidden Markov model (HMM): for each time step $t = 1, \ldots, T$ (a sampling sketch follows the figure below):
    • Generate object location $H_t \sim p(H_t \mid H_{t-1})$
    • Generate sensor reading $E_t \sim p(E_t \mid H_t)$
Figure source: Stanford CS221 spring 2025, lecture slides week 7
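A minimal sketch of this generative process, assuming a hypothetical two-state object (the transition and emission tables are made up):

```python
import random

rng = random.Random(0)
states = ["left", "right"]                       # possible hidden locations H_t
init = {"left": 0.5, "right": 0.5}               # p(h_1)
trans = {"left": {"left": 0.8, "right": 0.2},    # p(h_t | h_{t-1})
         "right": {"left": 0.2, "right": 0.8}}
emit = {"left": {0: 0.7, 1: 0.3},                # p(e_t | h_t), readings in {0, 1}
        "right": {0: 0.3, 1: 0.7}}

def sample(dist):
    return rng.choices(list(dist), weights=list(dist.values()))[0]

def generate(T):
    """Sample (h_1..h_T, e_1..e_T): locations first, then sensor readings."""
    h = sample(init)                  # H_1 ~ p(h_1)
    hs, es = [h], [sample(emit[h])]
    for _ in range(T - 1):
        h = sample(trans[h])          # H_t ~ p(H_t | H_{t-1})
        hs.append(h)
        es.append(sample(emit[h]))    # E_t ~ p(E_t | H_t)
    return hs, es

print(generate(5))
```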

Forward Backward: estimate H

  • Lattice representation
drawing
Figure source: Stanford CS221 spring 2025, lecture slides week 7
  • Edge $\text{start} \to (H_1 = h_1)$ has weight $p(h_1) \, p(e_1 \mid h_1)$

  • Edge $(H_{i-1} = h_{i-1}) \to (H_i = h_i)$ has weight $p(h_i \mid h_{i-1}) \, p(e_i \mid h_i)$

  • Each path from start to end is an assignment with weight equal to the product of edge weights

  • Forward: sum of weights of paths from start to $H_i = h_i$: $F_i(h_i) = \sum_{h_{i-1}} F_{i-1}(h_{i-1}) \, \text{Weight}(H_{i-1} = h_{i-1}, H_i = h_i) = \sum_{h_{i-1}} F_{i-1}(h_{i-1}) \, p(h_i \mid h_{i-1}) \, p(e_i \mid h_i)$

  • Backward: sum of weights of paths from $H_i = h_i$ to end: $B_i(h_i) = \sum_{h_{i+1}} B_{i+1}(h_{i+1}) \, \text{Weight}(H_i = h_i, H_{i+1} = h_{i+1}) = \sum_{h_{i+1}} B_{i+1}(h_{i+1}) \, p(h_{i+1} \mid h_i) \, p(e_{i+1} \mid h_{i+1})$

  • Define the sum of weights of paths from start to end through $H_i = h_i$: $S_i(h_i) = F_i(h_i) \, B_i(h_i)$

  • Base cases: $F_1(h_1) = p(h_1) \, p(e_1 \mid h_1)$ and $B_n(h_n) = 1$

  • Forward-backward algorithm for HMMs:

    • Compute $F_1, F_2, \ldots, F_n$
    • Compute $B_n, B_{n-1}, \ldots, B_1$
    • Compute $S_i$ for each $i$ and normalize: $P(H_i = h_i \mid E = e) = \frac{S_i(h_i)}{\sum_v S_i(v)}$
  • Time complexity of forward-backward: $O(n \, |\text{Domain}|^2)$
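A minimal implementation sketch of these recurrences, reusing the toy HMM tables (`states`, `init`, `trans`, `emit`) defined above:

```python
def forward_backward(es):
    """Return P(H_i = h | E = e) for every position i, via F, B, and S."""
    n = len(es)
    # Forward: F_1(h) = p(h) p(e_1|h), then F_i(h) = sum_h' F_{i-1}(h') p(h|h') p(e_i|h).
    F = [{h: init[h] * emit[h][es[0]] for h in states}]
    for i in range(1, n):
        F.append({h: sum(F[i - 1][hp] * trans[hp][h] for hp in states) * emit[h][es[i]]
                  for h in states})
    # Backward: B_n(h) = 1, then B_i(h) = sum_h'' p(h''|h) p(e_{i+1}|h'') B_{i+1}(h'').
    B = [None] * n
    B[n - 1] = {h: 1.0 for h in states}
    for i in range(n - 2, -1, -1):
        B[i] = {h: sum(trans[h][hn] * emit[hn][es[i + 1]] * B[i + 1][hn] for hn in states)
                for h in states}
    # S_i(h) = F_i(h) B_i(h); normalize each position to get posterior marginals.
    posteriors = []
    for i in range(n):
        S = {h: F[i][h] * B[i][h] for h in states}
        Z = sum(S.values())
        posteriors.append({h: S[h] / Z for h in states})
    return posteriors

print(forward_backward([0, 0, 1]))  # posterior over the location at each step
```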

Particle Filtering: estimate H

  • Particle filtering is like beam search for HMMs: maintain $K$ partial assignments (particles)

  • Particle filtering for HMMs: 3 steps

    1. Propose: for each old particle $(h_1, \ldots, h_{i-1})$, sample $H_i \sim p(h_i \mid h_{i-1})$
    2. Weight: for each old particle $(h_1, \ldots, h_{i-1}, h_i)$, weight it by $p(e_i \mid h_i)$
    3. Resample: normalize the weights and draw $K$ samples to redistribute particles to more promising areas. The resulting distribution is approximately $P(H_1, \ldots, H_i \mid E_1, \ldots, E_i)$
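A minimal sketch of the three steps on the same toy HMM, tracking only the latest $h_i$ per particle for brevity (real implementations often keep the full particle histories):

```python
import random

def particle_filter(es, K=1000, seed=1):
    """Approximate P(H_i | E_1..E_i) after each observation with K particles."""
    rng = random.Random(seed)
    particles, estimates = [], []
    for t, e in enumerate(es):
        # 1. Propose: H_1 ~ p(h_1), or H_i ~ p(h_i | h_{i-1}) for each old particle.
        if t == 0:
            particles = rng.choices(states, weights=[init[h] for h in states], k=K)
        else:
            particles = [rng.choices(states, weights=[trans[h][hn] for hn in states])[0]
                         for h in particles]
        # 2. Weight: weight each particle by p(e_i | h_i).
        ws = [emit[h][e] for h in particles]
        # 3. Resample: draw K particles in proportion to the weights.
        particles = rng.choices(particles, weights=ws, k=K)
        estimates.append({h: particles.count(h) / K for h in states})
    return estimates

print(particle_filter([0, 0, 1]))  # approximate filtering estimates P(H_i | e_1..e_i)
```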

Parameter Estimation

Smoothing

  • Laplace smoothing: for each distribution $d$ and partial assignment $(x_{\text{Parents}(i)}, x_i)$, add $\lambda$ to $\text{count}_d(x_{\text{Parents}(i)}, x_i)$
    • This is like adding a Dirichlet prior
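A minimal sketch for a single local conditional distribution (the counts and domain are made up):

```python
from collections import Counter

# Hypothetical counts for p(x_i | x_Parents(i)) at one fixed parent assignment.
counts = Counter({"red": 3, "green": 1})    # "blue" was never observed
domain = ["red", "green", "blue"]
lam = 1.0                                   # smoothing strength lambda

# Add lambda to every count, then normalize (count-and-normalize with a prior).
total = sum(counts[v] + lam for v in domain)
p = {v: (counts[v] + lam) / total for v in domain}
print(p)  # "blue" now gets probability 1/7 instead of 0
```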

Expectation Maximization (EM) Algorithm

  • Initialize $\theta$ randomly
  • Repeat until convergence (a worked sketch follows this list):
    • E step: fix $\theta$, update $H$
      • For each $h$, compute $q(h) = P(H = h \mid E = e; \theta)$
      • Create fully observed weighted examples: $(h, e)$ with weight $q(h)$
    • M step: fix $H$, update $\theta$
      • Maximum likelihood (count and normalize) on the weighted examples to get $\theta$
  • Properties of the EM algorithm
    • The EM algorithm deals with hidden variables $H$
    • Intuition: a generalization of the K-means algorithm:
      • Cluster centroids = parameters $\theta$
      • Cluster assignments = hidden variables $H$
    • The EM algorithm converges to a local optimum
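A worked sketch of EM on a classic hidden-variable problem, a hypothetical two-coin mixture: each trial secretly picks one of two biased coins uniformly at random (the hidden $H$) and reports the number of heads in $n$ flips (the evidence $E$); the parameters are the coin biases $\theta = (\theta_A, \theta_B)$:

```python
# Hypothetical data: heads observed in n = 10 flips per trial.
heads = [9, 8, 2, 1, 7, 3]
n = 10

def em(heads, n, theta=(0.6, 0.4), iters=50):
    tA, tB = theta                    # initialize theta (fixed here, not random)
    for _ in range(iters):
        # E step: q(A) = P(H = A | e, theta) per trial; the uniform coin prior
        # and the binomial coefficient cancel in the normalization.
        qA = []
        for h in heads:
            lA = tA ** h * (1 - tA) ** (n - h)
            lB = tB ** h * (1 - tB) ** (n - h)
            qA.append(lA / (lA + lB))
        # M step: maximum likelihood on the weighted examples (count, normalize).
        tA = sum(q * h for q, h in zip(qA, heads)) / (n * sum(qA))
        tB = sum((1 - q) * h for q, h in zip(qA, heads)) / (n * sum(1 - q for q in qA))
    return tA, tB

print(em(heads, n))  # a local optimum: roughly (0.8, 0.2) for this data
```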

A more general version of the EM algorithm

  1. Choose the initial parameters $\theta^{\text{old}}$

  2. E step: since the conditional posterior $p(Z \mid X, \theta^{\text{old}})$ contains all of our knowledge about the latent variable $Z$, we compute the expected complete-data log likelihood under it: $Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{Z \mid X, \theta^{\text{old}}}\left[\log p(X, Z \mid \theta)\right] = \sum_Z p(Z \mid X, \theta^{\text{old}}) \log p(X, Z \mid \theta)$

  3. M step: revise the parameter estimate: $\theta^{\text{new}} = \arg\max_\theta Q(\theta, \theta^{\text{old}})$

    • Note that in the M step, the logarithm acts directly on the joint likelihood $p(X, Z \mid \theta)$, so the maximization will be tractable.
  4. Check for convergence of the log likelihood or the parameter values. If not converged, replace $\theta^{\text{old}}$ with $\theta^{\text{new}}$ and return to step 2.
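To connect this to the coin sketch above: dropping terms constant in $\theta$, the expected complete-data log likelihood for that (hypothetical) two-coin model is

$$Q(\theta, \theta^{\text{old}}) = \sum_{t=1}^{T} \sum_{z \in \{A, B\}} q_t(z) \left[ h_t \log \theta_z + (n - h_t) \log (1 - \theta_z) \right], \qquad q_t(z) = p(z \mid h_t, \theta^{\text{old}}),$$

and setting $\partial Q / \partial \theta_z = 0$ recovers the count-and-normalize update $\theta_z = \frac{\sum_t q_t(z) \, h_t}{n \sum_t q_t(z)}$ used in the code sketch.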

See my EM algorithm notes for more details