Week 3. Search
Figure source: Stanford CS221 spring 2025, Problem Workout 3 Slides
Note: DP is not necessarily a tree search algorithm. It’s a technique used to solve problems efficiently by breaking them down into smaller sub-problems. It can be applied to both tree search and graph search.
A good reference: state-based models cheatsheet
Tree Search
Model a search problem
Defining a search problem
- $s_\text{start}$: starting state
- $\text{Actions}(s)$: possible actions
- $\text{Cost}(s,a)$: action cost
- $\text{Succ}(s,a)$: the successor state
- $\text{IsEnd}(s)$: reached an end state?
A search tree
- Node: a state
- Edge: an (action, cost) pair
- Root: starting state
- Leaf nodes: end states
Each root-to-leaf path represents a possible action sequence, and the sum of the costs of the edges is the cost of that path
Objective: find a minimal cost path to the end state
Coding a search problem: a class with the following methods:
- `startState() -> State`
- `isEnd(State) -> bool`
- `succAndCost(State) -> List[Tuple[str, State, float]]`: returns a list of `(action, state, cost)` tuples
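As a rough sketch of this interface (the toy walk/tram problem and its costs are made up for illustration):

```python
from typing import List, Tuple

State = int  # here a state is just an integer location; in general, any hashable object works

class TransportationProblem:
    """Toy search problem (made-up costs): move from block 1 to block N by
    walking (state + 1, cost 1) or taking a tram (state * 2, cost 2)."""

    def __init__(self, N: int):
        self.N = N

    def startState(self) -> State:
        return 1

    def isEnd(self, state: State) -> bool:
        return state == self.N

    def succAndCost(self, state: State) -> List[Tuple[str, State, float]]:
        # list of (action, successor state, cost) triples
        result = []
        if state + 1 <= self.N:
            result.append(('walk', state + 1, 1.0))
        if 2 * state <= self.N:
            result.append(('tram', 2 * state, 2.0))
        return result
```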
Tree search algorithms
- $b$: actions per state, $D$: maximum depth, $d$: depth of the solution path
| Algorithm | Idea | Action costs | Space | Time |
|---|---|---|---|---|
| Backtracking search | | Any cost | $O(D)$ | $O(b^D)$ |
| Depth-first search | Backtracking + stop at the first end state found | All costs $c = 0$ | $O(D)$ | $O(b^D)$ |
| Breadth-first search | Explore in order of increasing depth | Constant cost $c \geq 0$ | $O(b^d)$ | $O(b^d)$ |
| DFS with iterative deepening | Call DFS for maximum depths $1, 2, \dots$ | Constant cost $c \geq 0$ | $O(d)$ | $O(b^d)$ |
- Always exponential time
- Avoid exponential space with DFS-ID
Dynamic Programming
Dynamic programming (DP)
- A backtracking search algorithm with memoization (i.e., partial results are saved)
- May avoid the exponential running time of tree search
- Assumption: acyclic graph, so there is always a clear order in which to compute all the future (or past) costs.
For any given state $s$, the DP future cost is
$$\text{FutureCost}(s) = \begin{cases} 0 & \text{if IsEnd}(s) \\ \min_{a \in \text{Actions}(s)} \left[\text{Cost}(s,a) + \text{FutureCost}(\text{Succ}(s,a))\right] & \text{otherwise} \end{cases}$$
Note that for DP, we can analogously define PastCost(s).
A state is a summary of all the past actions sufficient to choose future actions optimally.
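A minimal DP sketch over the same assumed `succAndCost`-style interface, caching $\text{FutureCost}(s)$ in a dictionary (valid only for acyclic problems):

```python
def future_cost(problem, state, cache=None):
    """Backtracking search with memoization: FutureCost(s) for an acyclic search problem."""
    if cache is None:
        cache = {}
    if problem.isEnd(state):
        return 0.0
    if state not in cache:
        cache[state] = min(cost + future_cost(problem, succ, cache)
                           for action, succ, cost in problem.succAndCost(state))
    return cache[state]

# Example (with the toy problem sketched earlier): future_cost(TransportationProblem(10), 1)
```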
Graph Search
- $N$ total states, $n$ of which are closer than the end state
| Algorithm | Cycles? | Idea | Action costs | Time/space |
|---|---|---|---|---|
| Dynamic programming | No | Save partial results | Any | $O(N)$ |
| Uniform cost search | Yes | Enumerates states in order of increasing past cost | $\text{Cost}(s,a) \geq 0$ | $O(n \log n)$ |
- Exponential savings compared with tree search!
- Graph search: can have cycles
Uniform cost search (UCS)
Figure source: Stanford CS221 spring 2025, lecture slides week 3
Types of states
- Explored: optimal path found; will not update cost to these states in the future
- Frontier: states we’ve seen, but still can update the cost of how to get there cheaper (i.e., priority queue)
- Unexplored: states we haven’t seen
Uniform cost search (UCS) algorithm
- Add $s_\text{start}$ to frontier
- Repeat until frontier is empty:
- Remove from the frontier the state s with the smallest priority p (the minimum cost to s among the paths seen so far)
- If IsEnd(s): return solution
- Add s to explored
- For each action a∈Actions(s):
- Get successor s′←Succ(s,a)
- If s′ is already in explored: continue
- Update frontier with s′ and priority with p+Cost(s,a) (if it’s cheaper)
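A minimal UCS sketch using `heapq` as the frontier; instead of updating priorities in place, it pushes duplicates and skips stale entries when they are popped (states are assumed comparable so ties in cost can be broken):

```python
import heapq

def uniform_cost_search(problem):
    """Return (minimum past cost, end state), or None if no end state is reachable."""
    start = problem.startState()
    frontier = [(0.0, start)]          # priority queue of (past cost, state)
    best_cost = {start: 0.0}           # cheapest known cost to each frontier state
    explored = set()
    while frontier:
        past_cost, state = heapq.heappop(frontier)
        if state in explored:
            continue                    # stale entry; a cheaper copy was already processed
        if problem.isEnd(state):
            return past_cost, state
        explored.add(state)
        for action, succ, cost in problem.succAndCost(state):
            new_cost = past_cost + cost
            if succ not in explored and new_cost < best_cost.get(succ, float('inf')):
                best_cost[succ] = new_cost
                heapq.heappush(frontier, (new_cost, succ))
    return None
```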
Properties of UCS
Correctness: When a state s is popped from the frontier and moved to explored, its priority is PastCost(s), the minimum cost to s.
UCS potentially explores fewer states, but requires more overhead to maintain the priority queue
A*
A* algorithm: Run UCS with modified edge costs Cost′(s,a)=Cost(s,a)+h(Succ(s,a))−h(s)
- UCS: explore states in order of PastCost(s)
- A*: explore states in order of PastCost(s)+h(s)
- h(s) is a heuristic that estimates FutureCost(s)
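As a sketch, A* can reuse the UCS routine above on a wrapped problem with the modified edge costs (the wrapper class name and the note on recovering the true cost are illustrative):

```python
class AStarReduction:
    """Wrap a search problem so UCS on Cost'(s,a) = Cost(s,a) + h(Succ(s,a)) - h(s) behaves like A*."""

    def __init__(self, problem, h):
        self.problem, self.h = problem, h

    def startState(self):
        return self.problem.startState()

    def isEnd(self, state):
        return self.problem.isEnd(state)

    def succAndCost(self, state):
        return [(action, succ, cost + self.h(succ) - self.h(state))
                for action, succ, cost in self.problem.succAndCost(state)]

# UCS on the wrapped problem returns the cost in modified units;
# add h(startState) back to recover the true path cost.
```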
Heuristic h
Consistency:
- Triangle inequality: Cost′(s,a)=Cost(s,a)+h(Succ(s,a))−h(s)≥0, and
- h(send)=0

Figure source: Stanford CS221 spring 2025, lecture slides week 3
Admissibility: h(s)≤FutureCost(s)
If h(s) is consistent, then it is admissible.
A* properties
- Correctness: if h is consistent, then A* returns the minimum cost path
- Efficiency: A* explores all states s satisfying
PastCost(s)≤PastCost(send)−h(s)
- If h(s)=0, then A* is the same as UCS
- If h(s)=FutureCost(s), then A* only explores nodes on a minimum cost path
- Usually somewhere in between
Create h(s) via relaxation
- Remove constraints
- Reduce edge costs (from ∞ to some finite cost)
Week 4. Markov Decision Processes
- Uncertainty in the search process
- Performing action $a$ from a state $s$ can lead to several states $s'_1, s'_2, \dots$ in a probabilistic manner.
- The probability and reward of each (s,a,s′) pair may be known or unknown.
- If known: policy evaluation, value iteration
- If unknown: reinforcement learning
Markov Decision Processes (offline)
Model an MDP
- Markov decision process definition:
- States: the set of states
- $s_\text{start}$: starting state
- Actions(s): possible actions
- T(s,a,s′): transition probability of s′ if take action a in state s
- Reward(s,a,s′): reward for the transition (s,a,s′)
- IsEnd(s): reached end state?
- γ∈(0,1]: discount factor
- A policy π: a mapping from each state s∈States to an action a∈Actions(s)
Policy Evaluation (on-policy)
Utility of a policy is the discounted sum of the rewards on the path
- Since the policy yields a random path, its utility is a random variable
Following a policy yields a path $s_0, a_1 r_1 s_1, a_2 r_2 s_2, \dots$ (each step contributes an (action, reward, new state) tuple); the utility with discount $\gamma$ is $u_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$
$V_\pi(s)$: the expected utility (called the value) of a policy $\pi$ from state $s$
$Q_\pi(s,a)$: the Q-value, the expected utility of taking action $a$ from state $s$ and then following policy $\pi$
Connections between $V_\pi(s)$ and $Q_\pi(s,a)$:
$$V_\pi(s) = \begin{cases} 0 & \text{if IsEnd}(s) \\ Q_\pi(s, \pi(s)) & \text{otherwise} \end{cases}$$
$$Q_\pi(s,a) = \sum_{s'} T(s,a,s') \left[\text{Reward}(s,a,s') + \gamma V_\pi(s')\right]$$
Iterative algorithm for policy evaluation
- Initialize $V_\pi^{(0)}(s) = 0$ for all $s$
- For iteration $t = 1, \dots, t_{\text{PE}}$:
  - For each state $s$, $V_\pi^{(t)}(s) = \sum_{s'} T(s, \pi(s), s') \left[\text{Reward}(s, \pi(s), s') + \gamma V_\pi^{(t-1)}(s')\right]$
Choice of $t_{\text{PE}}$: stop when values don't change much, i.e., $\max_s |V_\pi^{(t)}(s) - V_\pi^{(t-1)}(s)| \leq \epsilon$
Policy evaluation time complexity is $O(t_{\text{PE}} S S')$, where $S$ is the number of states and $S'$ is the number of $s'$ with $T(s,a,s') > 0$
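A sketch of iterative policy evaluation for a tabular MDP. The MDP interface assumed here (`states`, `discount`, `isEnd`, and `succProbReward(s, a)` returning `(s', T(s,a,s'), Reward(s,a,s'))` triples) is illustrative, not a fixed API:

```python
def policy_evaluation(mdp, pi, epsilon=1e-6):
    """Iteratively compute V_pi(s) for every state until values stop changing much."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        newV = {}
        for s in mdp.states:
            if mdp.isEnd(s):
                newV[s] = 0.0
            else:
                newV[s] = sum(prob * (reward + mdp.discount * V[succ])
                              for succ, prob, reward in mdp.succProbReward(s, pi[s]))
        if max(abs(newV[s] - V[s]) for s in mdp.states) <= epsilon:
            return newV
        V = newV
```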
Value Iteration (off-policy)
Optimal value $V_\text{opt}(s)$: maximum value attained by any policy
Optimal Q-value $Q_\text{opt}(s,a)$:
$$Q_\text{opt}(s,a) = \sum_{s'} T(s,a,s') \left[\text{Reward}(s,a,s') + \gamma V_\text{opt}(s')\right]$$
$$V_\text{opt}(s) = \begin{cases} 0 & \text{if IsEnd}(s) \\ \max_{a \in \text{Actions}(s)} Q_\text{opt}(s,a) & \text{otherwise} \end{cases}$$
- Value iteration algorithm:
- Initialize $V_\text{opt}^{(0)}(s) = 0$ for all $s$
- For iteration $t = 1, \dots, t_{\text{VI}}$:
  - For each state $s$, $V_\text{opt}^{(t)}(s) = \max_{a \in \text{Actions}(s)} \sum_{s'} T(s,a,s') \left[\text{Reward}(s,a,s') + \gamma V_\text{opt}^{(t-1)}(s')\right]$
- VI time complexity is $O(t_{\text{VI}} S A S')$, where $S$ is the number of states, $A$ is the number of actions, and $S'$ is the number of $s'$ with $T(s,a,s') > 0$
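Value iteration under the same assumed tabular MDP interface (plus an `actions(s)` method); reading off a greedy policy at the end is a common add-on, not part of the recurrence itself:

```python
def value_iteration(mdp, epsilon=1e-6):
    """Compute V_opt(s) for every state, then read off a greedy (optimal) policy."""
    def Q(V, s, a):
        return sum(prob * (reward + mdp.discount * V[succ])
                   for succ, prob, reward in mdp.succProbReward(s, a))

    V = {s: 0.0 for s in mdp.states}
    while True:
        newV = {s: 0.0 if mdp.isEnd(s) else max(Q(V, s, a) for a in mdp.actions(s))
                for s in mdp.states}
        if max(abs(newV[s] - V[s]) for s in mdp.states) <= epsilon:
            break
        V = newV
    pi = {s: max(mdp.actions(s), key=lambda a: Q(V, s, a))
          for s in mdp.states if not mdp.isEnd(s)}
    return V, pi
```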
Reinforcement Learning (online)
Figure source: Stanford CS221 spring 2025, lecture slides week 4
- Summary of reinforcement learning algorithms
| Algorithm | Estimating | Based on |
|---|---|---|
| Model-based Monte Carlo | $\hat{T}, \widehat{\text{Reward}}$ | $s_0, a_1, r_1, s_1, \dots$ |
| Model-free Monte Carlo | $\hat{Q}_\pi$ | $u$ |
| SARSA | $\hat{Q}_\pi$ | $r + \hat{Q}_\pi$ |
| Q-learning | $\hat{Q}_\text{opt}$ | $r + \hat{Q}_\text{opt}$ |
Model-based methods (off-policy)
$$\hat{Q}_\text{opt}(s,a) = \sum_{s'} \hat{T}(s,a,s') \left[\widehat{\text{Reward}}(s,a,s') + \gamma \hat{V}_\text{opt}(s')\right]$$
- Based on data $s_0; a_1, r_1, s_1; a_2, r_2, s_2; \dots; a_n, r_n, s_n$, estimate the transition probabilities and rewards of the MDP:
$$\hat{T}(s,a,s') = \frac{\#\text{times } (s,a,s') \text{ occurs}}{\#\text{times } (s,a) \text{ occurs}}, \qquad \widehat{\text{Reward}}(s,a,s') = r \text{ in } (s,a,r,s')$$
- If π is a non-deterministic policy which allows us to explore each (state, action) pair infinitely often, then the estimates of the transitions and rewards will converge.
- Thus, the estimates $\hat{T}$ and $\widehat{\text{Reward}}$ are not policy dependent, so model-based methods give off-policy estimates.
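A sketch of these counting estimates, assuming the data is given as a flat list of `(s, a, r, s')` transitions:

```python
from collections import defaultdict

def model_based_estimates(transitions):
    """Estimate T_hat(s,a,s') by counting and Reward_hat(s,a,s') from observed rewards."""
    count_sas = defaultdict(int)   # number of times (s, a, s') occurs
    count_sa = defaultdict(int)    # number of times (s, a) occurs
    reward_hat = {}
    for s, a, r, succ in transitions:
        count_sas[(s, a, succ)] += 1
        count_sa[(s, a)] += 1
        reward_hat[(s, a, succ)] = r          # assumes a deterministic reward, as in the formula
    T_hat = {(s, a, succ): n / count_sa[(s, a)] for (s, a, succ), n in count_sas.items()}
    return T_hat, reward_hat
```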
Model-free Monte Carlo (on-policy)
The main idea is to estimate the Q-values directly
Original formula: $\hat{Q}_\pi(s,a) = \text{average of } u_t \text{ where } s_{t-1} = s, a_t = a$
Equivalent formulation: for each $(s,a,u)$, let $\eta = \frac{1}{1 + \#\text{updates to } (s,a)}$ and update
$$\hat{Q}_\pi(s,a) \leftarrow (1-\eta) \underbrace{\hat{Q}_\pi(s,a)}_{\text{prediction}} + \eta \underbrace{u}_{\text{data}} = \hat{Q}_\pi(s,a) - \eta \left(\hat{Q}_\pi(s,a) - u\right)$$
- Implied objective: least squares $\left(\hat{Q}_\pi(s,a) - u\right)^2$
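A sketch of the incremental form, interpreting "#updates to $(s,a)$" as the number of previous updates, which makes the update an exact running average of the observed utilities:

```python
from collections import defaultdict

Q_hat = defaultdict(float)     # tabular estimate of Q_pi(s, a), initialized to 0
num_updates = defaultdict(int)

def model_free_mc_update(s, a, u):
    """Move Q_hat(s,a) toward the observed utility u with step size 1 / (1 + #previous updates)."""
    eta = 1.0 / (1 + num_updates[(s, a)])
    Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * u
    num_updates[(s, a)] += 1
```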
SARSA (on-policy)
- SARSA algorithm
On each $(s,a,r,s',a')$:
$$\hat{Q}_\pi(s,a) \leftarrow (1-\eta)\hat{Q}_\pi(s,a) + \eta \left[\underbrace{r}_{\text{data}} + \gamma \underbrace{\hat{Q}_\pi(s',a')}_{\text{estimate}}\right]$$
- SARSA uses the estimate $\hat{Q}_\pi(s',a')$ instead of just the raw data $u$
| Model-free Monte Carlo | SARSA |
|---|---|
| $u$ | $r + \hat{Q}_\pi(s',a')$ |
| based on one path | based on an estimate |
| unbiased | biased |
| large variance | small variance |
| must wait until the end of the episode to update | can update immediately |
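A minimal sketch of the SARSA update on a tabular Q estimate stored in a dict (step size `eta` and discount `gamma` assumed given):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, eta, gamma):
    """SARSA: bootstrap from the current estimate Q[(s', a')] rather than a full-episode utility u."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
```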
Q-learning (off-policy)
On each $(s,a,r,s')$:
$$\hat{Q}_\text{opt}(s,a) \leftarrow (1-\eta)\underbrace{\hat{Q}_\text{opt}(s,a)}_{\text{prediction}} + \eta \underbrace{\left[r + \gamma \hat{V}_\text{opt}(s')\right]}_{\text{target}}, \qquad \hat{V}_\text{opt}(s') = \max_{a' \in \text{Actions}(s')} \hat{Q}_\text{opt}(s',a')$$
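A sketch of the tabular Q-learning update; `actions(s')` and `is_end(s')` are assumed helpers describing the environment:

```python
def q_learning_update(Q, s, a, r, s_next, actions, is_end, eta, gamma):
    """Q-learning: move Q_hat_opt(s,a) toward r + gamma * max_{a'} Q_hat_opt(s', a')."""
    if is_end(s_next):
        V_opt = 0.0
    else:
        V_opt = max(Q.get((s_next, a_next), 0.0) for a_next in actions(s_next))
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * (r + gamma * V_opt)
```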
Epsilon-greedy
Reinforcement learning algorithm template
- For $t = 1, 2, 3, \dots$:
  - Choose action $a_t = \pi_\text{act}(s_{t-1})$
  - Receive reward $r_t$ and observe new state $s_t$
  - Update parameters
What exploration policy πact to use? Need to balance exploration and exploitation.
Epsilon-greedy policy:
$$\pi_\text{act}(s) = \begin{cases} \arg\max_{a \in \text{Actions}(s)} \hat{Q}_\text{opt}(s,a) & \text{with probability } 1-\epsilon \\ \text{random from Actions}(s) & \text{with probability } \epsilon \end{cases}$$
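A sketch of the epsilon-greedy choice over a tabular Q estimate (same assumed `Q` dict and `actions` helper as in the sketches above):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Explore with probability epsilon; otherwise exploit the current Q_hat_opt estimate."""
    if random.random() < epsilon:
        return random.choice(actions(s))
    return max(actions(s), key=lambda a: Q.get((s, a), 0.0))
```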
Function approximation
- Large state spaces are hard to explore. For better generalization to unseen states/actions, we can use function approximation, defining
  - features $\phi(s,a)$, and
  - weights $w$, such that $\hat{Q}_\text{opt}(s,a;w) = w \cdot \phi(s,a)$
- Q-learning with function approximation: on each (s,a,r,s′),
$$w \leftarrow w - \eta \left[\underbrace{\hat{Q}_\text{opt}(s,a;w)}_{\text{prediction}} - \underbrace{\left(r + \gamma \hat{V}_\text{opt}(s')\right)}_{\text{target}}\right] \phi(s,a)$$
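A sketch of this update for a linear $\hat{Q}_\text{opt}(s,a;w) = w \cdot \phi(s,a)$, where `phi(s, a)` is assumed to return a sparse feature dictionary and `w` is a `defaultdict(float)` of weights:

```python
def q_learning_fa_update(w, phi, s, a, r, s_next, actions, is_end, eta, gamma):
    """Gradient-style Q-learning update for a linear function approximator."""
    def Q(state, action):
        return sum(w[f] * v for f, v in phi(state, action).items())

    prediction = Q(s, a)
    V_opt = 0.0 if is_end(s_next) else max(Q(s_next, a_next) for a_next in actions(s_next))
    target = r + gamma * V_opt
    for f, v in phi(s, a).items():
        w[f] -= eta * (prediction - target) * v   # gradient of w . phi(s,a) w.r.t. w_f is phi_f(s,a)
```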
Week 5. Games
Modeling Games
- A simple example game
Figure source: Stanford CS221 spring 2025, lecture slides week 5
- Game tree
- Each node is a decision point for a player
- Each root-to-leaf path is a possible outcome of the game
- Nodes to indicate certain policy: use △ for maximum node (agent maximize his utility) and ▽ for minimum node (opponent minimizes agent’s utility)
Figure source: Stanford CS221 spring 2025, lecture slides week 5
Two-player zero-sum game:
- sstart
- Actions(s)
- Succ(s,a)
- IsEnd(s)
- Utility(s): agent’s utility for end state s
- Player(s)∈Players: player who controls state s
- Players ={agent,opp}
- Zero-sum: the agent's utility is the negative of the opponent's utility
Property of games
- All the utility is at the end state
- For a game about win/lose at the end (e.g., chess), Utility(s) is ∞ (if agent wins), −∞ (if opponent wins), or 0 (if draw).
- Different players are in control at different states
Policies (for player p in state s)
- Deterministic policy: πp(s)∈Actions(s)
- Stochastic policy: πp(s,a)∈[0,1], a probability
Game Algorithms
Game evaluation
- Use recurrence for policy evaluation to estimate value of the game (expected utility):
$$V_\text{eval}(s) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \sum_{a \in \text{Actions}(s)} \pi_\text{agent}(s,a) V_\text{eval}(\text{Succ}(s,a)) & \text{Player}(s) = \text{agent} \\ \sum_{a \in \text{Actions}(s)} \pi_\text{opp}(s,a) V_\text{eval}(\text{Succ}(s,a)) & \text{Player}(s) = \text{opp} \end{cases}$$
Expectimax
- Expectimax: given opponent’s policy, find the best policy for the agent
$$V_\text{exptmax}(s) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \max_{a \in \text{Actions}(s)} V_\text{exptmax}(\text{Succ}(s,a)) & \text{Player}(s) = \text{agent} \\ \sum_{a \in \text{Actions}(s)} \pi_\text{opp}(s,a) V_\text{exptmax}(\text{Succ}(s,a)) & \text{Player}(s) = \text{opp} \end{cases}$$
Minimax
- Minimax assumes the worst case for the opponent’s policy
$$V_\text{minmax}(s) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \max_{a \in \text{Actions}(s)} V_\text{minmax}(\text{Succ}(s,a)) & \text{Player}(s) = \text{agent} \\ \min_{a \in \text{Actions}(s)} V_\text{minmax}(\text{Succ}(s,a)) & \text{Player}(s) = \text{opp} \end{cases}$$
Figure source: Stanford CS221 spring 2025, lecture slides week 5
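A recursive minimax sketch; the game object with `isEnd`, `utility`, `player`, `actions`, and `succ` methods is an assumed interface mirroring the definitions above:

```python
def minimax_value(game, s):
    """V_minmax(s): the agent maximizes, the opponent minimizes, utilities live at end states."""
    if game.isEnd(s):
        return game.utility(s)
    values = [minimax_value(game, game.succ(s, a)) for a in game.actions(s)]
    return max(values) if game.player(s) == 'agent' else min(values)
```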
Minimax properties
- Best against a minimax opponent: $V(\pi_\text{max}, \pi_\text{min}) \geq V(\pi_\text{agent}, \pi_\text{min})$ for all $\pi_\text{agent}$
- Lower bound against any opponent: $V(\pi_\text{max}, \pi_\text{min}) \leq V(\pi_\text{max}, \pi_\text{opp})$ for all $\pi_\text{opp}$
- Not optimal if the opponent is known! Here $\pi_7$ stands for an arbitrary known policy, for example random choice with equal probabilities: $V(\pi_\text{max}, \pi_7) \leq V(\pi_{\text{exptmax}(7)}, \pi_7)$ for opponent $\pi_7$
Relationship between game values
Figure source: Stanford CS221 spring 2025, lecture slides week 5
Expectiminimax
- Modify the game to introduce randomness
Figure source: Stanford CS221 spring 2025, lecture slides week 5
- This is equivalent to having a third player, the coin:
$$V_\text{exptminmax}(s) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \max_{a \in \text{Actions}(s)} V_\text{exptminmax}(\text{Succ}(s,a)) & \text{Player}(s) = \text{agent} \\ \min_{a \in \text{Actions}(s)} V_\text{exptminmax}(\text{Succ}(s,a)) & \text{Player}(s) = \text{opp} \\ \sum_{a \in \text{Actions}(s)} \pi_\text{coin}(s,a) V_\text{exptminmax}(\text{Succ}(s,a)) & \text{Player}(s) = \text{coin} \end{cases}$$
Evaluation functions
Computation complexity: with branching factor $b$ and depth $d$ ($2d$ turns, because there are two players), since this is a tree search, we have
- $O(d)$ space
- $O(b^{2d})$ time
Limited-depth tree search: stop at maximum depth $d_\text{max}$:
$$V_\text{minmax}(s,d) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \text{Eval}(s) & d = 0 \\ \max_{a \in \text{Actions}(s)} V_\text{minmax}(\text{Succ}(s,a), d) & \text{Player}(s) = \text{agent} \\ \min_{a \in \text{Actions}(s)} V_\text{minmax}(\text{Succ}(s,a), d-1) & \text{Player}(s) = \text{opp} \end{cases}$$
- Use: at state $s$, call $V_\text{minmax}(s, d_\text{max})$
An evaluation function $\text{Eval}(s)$ is a possibly very weak estimate of the value $V_\text{minmax}(s)$, using domain knowledge
- This is similar to the heuristic that estimates $\text{FutureCost}(s)$ in A* search. But unlike A*, there are no guarantees on the approximation error.
Alpha-Beta pruning
Branch and bound
- Maintain lower and upper bounds on values.
- If the intervals don't overlap, then we can choose optimally without further work
Alpha-beta pruning for minimax (a code sketch follows at the end of this section):
- $a_s$: lower bound on the value of max node $s$
- $b_s$: upper bound on the value of min node $s$
- Store $\alpha_s = \max_{s' \preceq s} a_{s'}$ and $\beta_s = \min_{s' \preceq s} b_{s'}$, taken over $s$ and its ancestors $s'$
- Prune a node if its interval doesn't have non-trivial overlap with the interval of every ancestor
Pruning depends on the order
- Worst ordering: $O(b^{2 \cdot d})$ time
- Best ordering: $O(b^{2 \cdot 0.5 d})$ time
- Random ordering: $O(b^{2 \cdot 0.75 d})$ time (when $b = 2$)
- In practice, we can order based on the evaluation function Eval(s):
- Max nodes: order successors by decreasing Eval(s)
- Min nodes: order successors by increasing Eval(s)
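A sketch of depth-limited minimax with alpha-beta pruning, following the depth convention of the limited-depth recurrence above (depth decreases after the opponent's move); `eval_fn` stands in for $\text{Eval}(s)$ and the game interface is the same assumed one as in the minimax sketch:

```python
def alphabeta(game, s, depth, eval_fn, alpha=float('-inf'), beta=float('inf')):
    """Depth-limited minimax value of s, pruning subtrees that cannot change the result."""
    if game.isEnd(s):
        return game.utility(s)
    if depth == 0:
        return eval_fn(s)
    if game.player(s) == 'agent':      # max node
        value = float('-inf')
        for a in game.actions(s):
            value = max(value, alphabeta(game, game.succ(s, a), depth, eval_fn, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:          # interval no longer overlaps an ancestor's: prune
                break
        return value
    else:                              # min node
        value = float('inf')
        for a in game.actions(s):
            value = min(value, alphabeta(game, game.succ(s, a), depth - 1, eval_fn, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value
```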
More Topics
TD learning (on-policy)
General learning framework
- Objective function: $\frac{1}{2}(\text{prediction}(w) - \text{target})^2$
- Update: $w \leftarrow w - \eta (\text{prediction}(w) - \text{target}) \nabla_w \text{prediction}(w)$
Temporal difference (TD) learning: on each $(s,a,r,s')$,
$$w \leftarrow w - \eta \left[V(s;w) - (r + \gamma V(s';w))\right] \nabla_w V(s;w)$$
- For a linear function $V(s;w) = w \cdot \phi(s)$, the gradient is $\nabla_w V(s;w) = \phi(s)$
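A sketch of the TD update for a linear value function (so the gradient is just $\phi(s)$); `phi(s)` is assumed to return a sparse feature dict and `w` a `defaultdict(float)` of weights:

```python
def td_update(w, phi, s, r, s_next, eta, gamma):
    """TD(0): move V(s;w) toward the target r + gamma * V(s';w)."""
    def V(state):
        return sum(w[f] * v for f, v in phi(state).items())

    error = V(s) - (r + gamma * V(s_next))
    for f, v in phi(s).items():
        w[f] -= eta * error * v        # nabla_w V(s;w) = phi(s) for a linear value function
```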
Simultaneous games
Payoff matrix: for two players, V(a,b) is A’s utility if A chooses action a and B chooses action b
Strategies (policies)
- Pure strategy: a single action
- Mixed strategy: a probability distribution over actions
Game evaluation: the value of the game if player A follows $\pi_A$ and player B follows $\pi_B$ is $V(\pi_A, \pi_B) = \sum_{a,b} \pi_A(a) \pi_B(b) V(a,b)$
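A tiny sketch of this evaluation, with the payoff matrix stored as a dict `V[(a, b)]` and mixed strategies as dicts mapping actions to probabilities (an illustrative representation with made-up payoff numbers):

```python
def game_value(V, pi_A, pi_B):
    """Expected utility to player A when A plays mixed strategy pi_A and B plays pi_B."""
    return sum(pA * pB * V[(a, b)]
               for a, pA in pi_A.items()
               for b, pB in pi_B.items())

# Example usage with a 2x2 payoff matrix and uniform mixed strategies:
# game_value({('1', '1'): 2, ('1', '2'): -3, ('2', '1'): -3, ('2', '2'): 4},
#            {'1': 0.5, '2': 0.5}, {'1': 0.5, '2': 0.5})
```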
For pure strategies, going second is no worse: $\max_a \min_b V(a,b) \leq \min_b \max_a V(a,b)$
Against a fixed mixed strategy, the second player can always respond with a pure strategy: for any fixed mixed strategy $\pi_A$, $\min_{\pi_B} V(\pi_A, \pi_B)$ is attained by a pure strategy
Minimax theorem: for every simultaneous two-player zero-sum game with a finite number of actions, $\max_{\pi_A} \min_{\pi_B} V(\pi_A, \pi_B) = \min_{\pi_B} \max_{\pi_A} V(\pi_A, \pi_B)$
- Revealing your optimal mixed strategy doesn’t hurt you.