CS221 Artificial Intelligence, Notes on States (Week 3-5)

Week 4. Markov Decision Processes

  • Uncertainty in the search process
    • Performing action a from a state s can lead to several states s'_1, s'_2, \ldots in a probabilistic manner.
    • The probability and reward of each (s, a, s') pair may be known or unknown.
      • If known: policy evaluation, value iteration
      • If unknown: reinforcement learning

Markov Decision Processes (offline)

Modeling an MDP

  • Markov decision process (MDP) definition:
    • States: the set of states
    • s_\text{start}: starting state
    • Actions(s): possible actions
    • T(s, a, s'): probability of transitioning to state s' when taking action a in state s
    • Reward(s, a, s'): reward for the transition (s, a, s')
    • IsEnd(s): reached end state?
    • \gamma \in (0, 1]: discount factor
  • A policy \pi: a mapping from each state s \in \text{States} to an action a \in \text{Actions}(s)

Policy Evaluation (on-policy)

  • Utility of a policy is the discounted sum of the rewards on the path

    • Since the policy yields a random path, its utility is a random variable
  • Following a policy yields a path s_0, a_1 r_1 s_1, a_2 r_2 s_2, \ldots (each step adds an (action, reward, new state) triple); the utility of the path with discount \gamma is u_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots

  • V_\pi(s): the expected utility (called value) of a policy \pi from state s

  • Q_\pi(s, a): Q-value, the expected utility of taking action a from state s, and then following policy \pi

  • Connections between V_\pi(s) and Q_\pi(s, a):

V_\pi(s) = \begin{cases} 0 & \text{ if IsEnd}(s)\\ Q_\pi(s, \pi(s)) & \text{ otherwise} \end{cases}

Q_\pi(s, a) = \sum_{s'} T(s, a, s')\left[\text{Reward}(s, a, s') + \gamma V_\pi(s')\right]

  • Iterative algorithm for policy evaluation

    • Initialize V_\pi^{(0)}(s) = 0 for all s
    • For iteration t = 1, \ldots, t_\text{PE}:
      • For each state s, V_\pi^{(t)}(s) = \sum_{s'} T\left(s, \pi(s), s'\right)\left[\text{Reward}\left(s, \pi(s), s'\right) + \gamma V_\pi^{(t-1)}(s')\right]
  • Choice of t_\text{PE}: stop when the values stop changing much, i.e., \max_{s} \left|V_\pi^{(t)}(s) - V_\pi^{(t-1)}(s)\right| \leq \epsilon

  • Policy evaluation time complexity is O(t_\text{PE} S S'), where S is the number of states and S' is the maximum number of states s' with T(s, a, s') > 0
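A minimal sketch of this iterative policy evaluation in Python (not from the notes; the tabular MDP interface with `states`, `T`, `pi`, and `is_end` is hypothetical, with `T[(s, a)]` listing `(s', prob, reward)` triples):

```python
def policy_evaluation(states, T, pi, gamma, is_end, eps=1e-6, max_iters=10000):
    """Iterative policy evaluation on a tabular MDP (hypothetical interface)."""
    V = {s: 0.0 for s in states}                         # V_pi^(0)(s) = 0
    for _ in range(max_iters):
        V_new = {}
        for s in states:
            if is_end(s):
                V_new[s] = 0.0
                continue
            # V_pi(s) = sum_s' T(s, pi(s), s') [Reward(s, pi(s), s') + gamma V_pi(s')]
            V_new[s] = sum(prob * (reward + gamma * V[s2])
                           for s2, prob, reward in T[(s, pi[s])])
        # Stop when values stop changing much (the choice of t_PE above).
        if max(abs(V_new[s] - V[s]) for s in states) <= eps:
            return V_new
        V = V_new
    return V
```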

Value Iteration (off-policy)

  • Optimal value V_\text{opt}(s): maximum value attained by any policy

  • Optimal Q-value Q_\text{opt}(s, a)

Q_\text{opt}(s, a) = \sum_{s'} T(s, a, s') \left[\text{Reward}(s, a, s') + \gamma V_\text{opt}(s') \right]

V_\text{opt}(s) = \begin{cases} 0 & \text{ if IsEnd}(s)\\ \max_{a\in \text{Actions}(s)} Q_\text{opt}(s, a) & \text{ otherwise} \end{cases}

  • Value iteration algorithm:
    • Initialize V_\text{opt}^{(0)}(s) = 0 for all s
    • For iteration t = 1, \ldots, t_\text{VI}:
      • For each state s, V_\text{opt}^{(t)}(s) = \max_{a\in \text{Actions}(s)} \sum_{s'} T\left(s, a, s'\right)\left[\text{Reward}\left(s, a, s'\right) + \gamma V_\text{opt}^{(t-1)}(s')\right]
  • VI time complexity is O(t_\text{VI} S A S'), where S is the number of states, A is the number of actions, and S' is the maximum number of states s' with T(s, a, s') > 0
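Under the same hypothetical tabular interface as the policy-evaluation sketch above (with `actions(s)` returning the available actions), value iteration differs only in the max over actions:

```python
def value_iteration(states, actions, T, gamma, is_end, eps=1e-6, max_iters=10000):
    """Value iteration on a tabular MDP (hypothetical interface)."""
    def q(V, s, a):
        # Q_opt(s, a) = sum_s' T(s, a, s') [Reward(s, a, s') + gamma V_opt(s')]
        return sum(prob * (reward + gamma * V[s2]) for s2, prob, reward in T[(s, a)])

    V = {s: 0.0 for s in states}                         # V_opt^(0)(s) = 0
    for _ in range(max_iters):
        V_new = {s: 0.0 if is_end(s) else max(q(V, s, a) for a in actions(s))
                 for s in states}
        delta = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if delta <= eps:
            break
    # Read off an optimal policy greedily from the optimal Q-values.
    pi = {s: max(actions(s), key=lambda a: q(V, s, a))
          for s in states if not is_end(s)}
    return V, pi
```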

Reinforcement Learning (online)

[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 4]
  • Summary of reinforcement learning algorithms
| Algorithm | Estimating | Based on |
| --- | --- | --- |
| Model-based Monte Carlo | \hat{T}, \hat{R} | s_0, a_1, r_1, s_1, \ldots |
| Model-free Monte Carlo | \hat{Q}_\pi | u |
| SARSA | \hat{Q}_\pi | r + \hat{Q}_\pi |
| Q-learning | \hat{Q}_\text{opt} | r + \hat{Q}_\text{opt} |

Model-based methods (off-policy)

\hat{Q}_\text{opt}(s, a) = \sum_{s'} \hat{T}(s, a, s') \left[\widehat{\text{Reward}}(s, a, s') + \gamma \hat{V}_\text{opt}(s') \right]

  • Based on data s_0; a_1, r_1, s_1; a_2, r_2, s_2; \ldots ; a_n, r_n, s_n, estimate the transition probabilities and rewards of the MDP

\begin{align*} \hat{T}(s, a, s') & = \frac{\# \text{times } (s, a, s') \text{ occurs}} {\# \text{times } (s, a) \text{ occurs}} \\ \widehat{\text{Reward}}(s, a, s') & = r \text{ in } (s, a, r, s') \end{align*}

  • If \pi is a non-deterministic policy that allows us to explore each (state, action) pair infinitely often, then the estimates of the transitions and rewards will converge.
    • The estimates \hat{T} and \widehat{\text{Reward}} do not depend on which policy generated the data, so model-based methods are off-policy estimations.
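A sketch of these count-based estimates, assuming a hypothetical episode format `[s0, (a1, r1, s1), (a2, r2, s2), ...]` and deterministic rewards:

```python
from collections import defaultdict

def estimate_mdp(episodes):
    """Estimate T-hat and Reward-hat from observed episodes (counts-based)."""
    count_sas = defaultdict(int)   # number of times (s, a, s') occurs
    count_sa = defaultdict(int)    # number of times (s, a) occurs
    reward_hat = {}                # r observed on transition (s, a, s')
    for episode in episodes:
        s = episode[0]
        for a, r, s2 in episode[1:]:
            count_sas[(s, a, s2)] += 1
            count_sa[(s, a)] += 1
            reward_hat[(s, a, s2)] = r   # assumes rewards are deterministic
            s = s2
    T_hat = {(s, a, s2): n / count_sa[(s, a)] for (s, a, s2), n in count_sas.items()}
    return T_hat, reward_hat
```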

Model-free Monte Carlo (on-policy)

  • The main idea is to estimate the Q-values directly

  • Original formula \hat{Q}_\pi (s, a) = \text{average of } u_t \text{ where } s_{t-1} = s, a_t = a

  • Equivalent formulation: for each (s, a, u), let \begin{align*} \eta & = \frac{1}{1 + \#\text{updates to }(s, a)}\\ \hat{Q}_\pi(s, a) & \leftarrow (1-\eta) \underbrace{\hat{Q}_\pi (s, a)}_{\text{prediction}} + \eta \underbrace{u}_{\text{data}}\\ & = \hat{Q}_\pi (s, a) - \eta\left(\hat{Q}_\pi (s, a) - u\right) \end{align*}

    • Implied objective: least squares \left(\hat{Q}_\pi (s, a) - u \right)^2
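A sketch of this incremental update (the `Q` and `num_updates` tables are hypothetical module-level state):

```python
from collections import defaultdict

Q = defaultdict(float)           # Q_pi-hat(s, a), initialized to 0
num_updates = defaultdict(int)   # number of updates to (s, a)

def mc_update(s, a, u):
    """Model-free Monte Carlo update on one (s, a, u) triple."""
    eta = 1.0 / (1 + num_updates[(s, a)])
    # Convex combination of prediction and data, i.e. Q <- Q - eta (Q - u).
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * u
    num_updates[(s, a)] += 1
```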

SARSA (on-policy)

  • SARSA algorithm

On each (s, a, r, s', a'): \hat{Q}_\pi(s, a) \leftarrow (1-\eta) \hat{Q}_\pi(s, a) + \eta \left[\underbrace{r}_{\text{data}} + \gamma \underbrace{\hat{Q}_\pi(s', a')}_{\text{estimate}}\right]

  • SARSA uses estimates \hat{Q}_\pi(s', a') instead of just raw data u
| Model-free Monte Carlo | SARSA |
| --- | --- |
| u | r + \hat{Q}_\pi(s', a') |
| based on one full path | based on the current estimate |
| unbiased | biased |
| large variance | small variance |
| must wait until the end of the episode to update | can update immediately |
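A sketch of the SARSA update on one (s, a, r, s', a') tuple, using a dict-backed Q table (hypothetical setup):

```python
def sarsa_update(Q, s, a, r, s2, a2, gamma, eta):
    """SARSA: bootstrap from the current estimate Q_pi-hat(s', a')."""
    target = r + gamma * Q[(s2, a2)]      # data (r) plus estimate (Q-hat)
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * target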

Q-learning (off-policy)

On each (s, a, r, s'): \begin{align*} \hat{Q}_\text{opt}(s, a) & \leftarrow (1-\eta) \underbrace{\hat{Q}_\text{opt}(s, a)}_{\text{prediction}} + \eta \underbrace{\left[r + \gamma \hat{V}_\text{opt}(s')\right]}_{\text{target}}\\ \hat{V}_\text{opt}(s') &= \max_{a' \in \text{Actions}(s')} \hat{Q}_\text{opt}(s', a') \end{align*}
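The corresponding Q-learning update differs only in the target, which maximizes over next actions (hypothetical `actions(s')` helper, returning an empty list at end states):

```python
def q_learning_update(Q, s, a, r, s2, actions, gamma, eta):
    """Q-learning: the max over a' makes the target estimate Q_opt (off-policy)."""
    v_opt = max((Q[(s2, a2)] for a2 in actions(s2)), default=0.0)  # V_opt-hat(s')
    target = r + gamma * v_opt
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * target
```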

Epsilon-greedy

  • Reinforcement learning algorithm template

    • For t = 1, 2, 3, \ldots
      • Choose action a_t = \pi_\text{act}(s_{t-1})
      • Receive reward r_t and observe new state s_t
      • Update parameters
  • What exploration policy \pi_\text{act} to use? Need to balance exploration and exploitation.

  • Epsilon-greedy policy \pi_\text{act}(s) = \begin{cases} \arg\max_{a \in \text{Actions}(s)} \hat{Q}_\text{opt}(s, a) & \text{ probability } 1- \epsilon\\ \text{random from Actions}(s) & \text{ probability } \epsilon \end{cases}
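A sketch of the epsilon-greedy exploration policy over the current Q-value estimates (same hypothetical `Q` table and `actions(s)` helper as above):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon explore; otherwise exploit the current Q-hat."""
    if random.random() < epsilon:
        return random.choice(actions(s))                 # explore: random action
    return max(actions(s), key=lambda a: Q[(s, a)])      # exploit: greedy action
```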

Function approximation

  • Large state spaces are hard to explore. For better generalization to unseen states and actions, we can use function approximation, which defines
    • features \phi(s, a), and
    • weights \mathbf{w}, such that \hat{Q}_\text{opt}(s, a; \mathbf{w}) = \mathbf{w} \cdot \phi(s, a)
  • Q-learning with function approximation: on each (s, a, r, s'),

\mathbf{w} \leftarrow \mathbf{w} - \eta\left[ \underbrace{\hat{Q}_\text{opt}(s, a; \mathbf{w})}_{\text{prediction}} - \underbrace{\left(r + \gamma \hat{V}_\text{opt}(s') \right)}_{\text{target}} \right] \phi(s, a)
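A sketch of this weight update with sparse feature dictionaries (`phi(s, a)` returning a feature-name-to-value dict is a hypothetical representation):

```python
def q_learning_fa_update(w, phi, s, a, r, s2, actions, gamma, eta):
    """Q-learning with linear function approximation: Q-hat(s, a; w) = w . phi(s, a)."""
    def q(state, action):
        return sum(w.get(f, 0.0) * v for f, v in phi(state, action).items())
    v_opt = max((q(s2, a2) for a2 in actions(s2)), default=0.0)  # V_opt-hat(s')
    residual = q(s, a) - (r + gamma * v_opt)                     # prediction - target
    for f, v in phi(s, a).items():
        w[f] = w.get(f, 0.0) - eta * residual * v
```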

Week 5. Games

Modeling Games

  • A simple example game (figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5)
  • Game tree
    • Each node is a decision point for a player
    • Each root-to-leaf path is a possible outcome of the game
    • Node shapes indicate which player moves: \bigtriangleup for a maximizing node (the agent maximizes its utility) and \bigtriangledown for a minimizing node (the opponent minimizes the agent’s utility)
[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5]
  • Two-player zero-sum game:

    • s_\text{start}
    • Actions(s)
    • Succ(s, a)
    • IsEnd(s)
    • Utility(s): agent’s utility for end state s
    • Player(s) \in \text{Players}: player who controls state s
    • Players = \{\text{agent}, \text{opp} \}
    • Zero-sum: the agent’s utility is the negative of the opponent’s utility
  • Properties of games

    • All the utility is received at the end state
      • For a game about win/lose at the end (e.g., chess), Utility(s) is \infty (if agent wins), -\infty (if opponent wins), or 0 (if draw).
    • Different players are in control at different states
  • Policies (for player p in state s)

    • Deterministic policy: \pi_p(s) \in \text{Actions}(s)
    • Stochastic policy: \pi_p(s, a) \in [0, 1], the probability of taking action a in state s

Game Algorithms

Game evaluation

  • Use a recurrence analogous to policy evaluation to estimate the value of the game (expected utility):

V_\text{eval}(s) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \sum_{a\in\text{Actions}(s)} \pi_\text{agent}(s, a) V_\text{eval}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{agent}\\ \sum_{a\in\text{Actions}(s)} \pi_\text{opp}(s, a) V_\text{eval}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{opp}\\ \end{cases}

Expectimax

  • Expectimax: given opponent’s policy, find the best policy for the agent

V_\text{exptmax}(s) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \max_{a\in\text{Actions}(s)} V_\text{exptmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{agent}\\ \sum_{a\in\text{Actions}(s)} \pi_\text{opp}(s, a) V_\text{exptmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{opp}\\ \end{cases}

Minimax

  • Minimax assumes the worst case for the opponent’s policy

V_\text{minmax}(s) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \max_{a\in\text{Actions}(s)} V_\text{minmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{agent}\\ \min_{a\in\text{Actions}(s)} V_\text{minmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{opp}\\ \end{cases}

[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5]
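A recursive sketch of the minimax recurrence (the `game` object with `is_end`, `utility`, `player`, `actions`, and `succ` is a hypothetical interface mirroring the definitions above):

```python
def minimax(game, s):
    """Compute V_minmax(s) by recursing over the game tree."""
    if game.is_end(s):
        return game.utility(s)
    values = [minimax(game, game.succ(s, a)) for a in game.actions(s)]
    # Agent nodes maximize; opponent nodes minimize.
    return max(values) if game.player(s) == 'agent' else min(values)
```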
  • Minimax properties

    • Best against minimax opponent V(\pi_\max, \pi_\min) \geq V(\pi_\text{agent}, \pi_\min) \text{ for all } \pi_\text{agent}
    • Lower bound against any opponent V(\pi_\max, \pi_\min) \leq V(\pi_\max, \pi_\text{opp}) \text{ for all } \pi_\text{opp}
    • Not optimal if the opponent’s policy is known! Here, \pi_7 stands for an arbitrary known policy, for example, random choice with equal probabilities: V(\pi_\max, \pi_7) \leq V(\pi_{\text{exptmax}(7)}, \pi_7), since \pi_{\text{exptmax}(7)} is the best response to \pi_7
  • Relationship between game values

[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5]

Expectiminimax

  • Modify the game to introduce randomness
[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5]
  • This is equivalent to having a third player, the coin V_\text{exptminmax}(s) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \max_{a\in\text{Actions}(s)} V_\text{exptminmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{agent}\\ \min_{a\in\text{Actions}(s)} V_\text{exptminmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{opp}\\ \sum_{a\in\text{Actions}(s)} \pi_\text{coin}(s, a) V_\text{exptminmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{coin}\\ \end{cases}

Evaluation functions

  • Computational complexity: with branching factor b and depth d (2d moves in total, since the two players alternate), tree search requires

    • O(d) space
    • O(b^{2d}) time
  • Limited depth tree search: stop at maximum depth d_\max V_\text{minmax}(s, d) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \text{Eval}(s) & d=0\\ \max_{a\in\text{Actions}(s)} V_\text{minmax}(\text{Succ}(s, a), d) & \text{ Player}(s) = \text{agent}\\ \min_{a\in\text{Actions}(s)} V_\text{minmax}(\text{Succ}(s, a), d-1) & \text{ Player}(s) = \text{opp}\\ \end{cases}

    • Use: at state s, call V_\text{minmax}(s, d_\max)
  • An evaluation function Eval(s) is a possibly very weak estimate of the value V_\text{minmax}(s), using domain knowledge

    • This is similar to FutureCost(s) in A* search, but unlike A*, there are no guarantees on the approximation error.

Alpha-Beta pruning

  • Branch and bound

    • Maintain lower and upper bounds on values.
    • If the intervals don’t overlap, then we can choose optimally without further work
  • Alpha-beta pruning for minimax:

    • a_s: lower bound on value of max node s
    • b_s: upper bound on value of min node s
    • Store \alpha_s = \max_{s'\leq s} a_{s'} and \beta_s = \min_{s'\leq s} b_{s'}
    • Prune a node if its interval doesn’t have non-trivial overlap with every ancestor
  • Pruning depends on the order

    • Worst ordering: O(b^{2\cdot d}) time
    • Best ordering: O(b^{2\cdot 0.5 d}) time
    • Random ordering: O(b^{2\cdot 0.75 d}) time, when b=2
    • In practice, we can order based on the evaluation function Eval(s):
      • Max nodes: order successors by decreasing Eval(s)
      • Min nodes: order successors by increasing Eval(s)
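A sketch combining the depth-limited recurrence, the evaluation function, and alpha-beta pruning with Eval-based move ordering (same hypothetical `game` interface as the minimax sketch, plus a `game.evaluate(s)` standing in for Eval(s)):

```python
def alphabeta(game, s, d, alpha=float('-inf'), beta=float('inf')):
    """Depth-limited minimax with alpha-beta pruning; depth decreases on opponent moves."""
    if game.is_end(s):
        return game.utility(s)
    if d == 0:
        return game.evaluate(s)                  # Eval(s) at the depth limit
    is_agent = game.player(s) == 'agent'
    # Order successors by Eval: decreasing for max nodes, increasing for min nodes.
    succs = sorted((game.succ(s, a) for a in game.actions(s)),
                   key=game.evaluate, reverse=is_agent)
    if is_agent:
        value = float('-inf')
        for s2 in succs:
            value = max(value, alphabeta(game, s2, d, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                    # bounds crossed: prune remaining moves
                break
    else:
        value = float('inf')
        for s2 in succs:
            value = min(value, alphabeta(game, s2, d - 1, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break
    return value
```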

More Topics

TD learning (on-policy)

  • General learning framework

    • Objective function \frac{1}{2}\left(\text{prediction}(\mathbf{w}) - \text{target} \right)^2
    • Update \mathbf{w} \leftarrow \mathbf{w} - \eta \left(\text{prediction}(\mathbf{w}) - \text{target} \right) \nabla_\mathbf{w} \text{prediction}(\mathbf{w})
  • Temporal difference (TD) learning: on each (s, a, r, s'),

    \mathbf{w} \leftarrow \mathbf{w} - \eta \left[V(s; \mathbf{w}) - \left(r + \gamma V(s'; \mathbf{w})\right) \right] \nabla_\mathbf{w} V(s; \mathbf{w})

    • For a linear function V(s; \mathbf{w}) = \mathbf{w} \cdot \phi(s), the gradient is \nabla_\mathbf{w} V(s; \mathbf{w}) = \phi(s)
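A sketch of the TD update for this linear case (sparse feature dict `phi(s)` is a hypothetical representation; the action a is not needed for the value update):

```python
def td_update(w, phi, s, r, s2, gamma, eta):
    """TD learning step on one transition, with V(s; w) = w . phi(s)."""
    def value(state):
        return sum(w.get(f, 0.0) * v for f, v in phi(state).items())
    residual = value(s) - (r + gamma * value(s2))   # prediction - target
    for f, v in phi(s).items():                     # nabla_w V(s; w) = phi(s)
        w[f] = w.get(f, 0.0) - eta * residual * v
```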

Simultaneous games

  • Payoff matrix: for two players, V(a, b) is A’s utility if A chooses action a and B chooses action b

  • Strategies (policies)

    • Pure strategy: a single action
    • Mixed strategy: a probability distribution over actions
  • Game evaluation: the value of the game if player A follows \pi_A and player B follows \pi_B is V(\pi_A, \pi_B) = \sum_{a, b} \pi_A(a) \pi_B(b) V(a, b)

  • For pure strategies, going second is no worse \max_a \min_b V(a, b) \leq \min_b \max_a V(a, b)

  • Against a fixed mixed strategy, the second player can play a pure strategy: for any fixed mixed strategy \pi_A, \min_{\pi_B} V(\pi_A, \pi_B) can be attained by a pure strategy

  • Minimax theorem: for every simultaneous two-player zero-sum game with a finite number of actions \max_{\pi_A} \min_{\pi_B} V(\pi_A, \pi_B) = \min_{\pi_B}\max_{\pi_A} V(\pi_A, \pi_B)

    • Revealing your optimal mixed strategy doesn’t hurt you.