CS221 Artificial Intelligence, Notes on States (Week 3-5)

Week 4. Markov Decision Processes

  • Uncertainty in the search process
    • Performing action $a$ from a state $s$ can lead to several states $s_1, s_2, \ldots$ in a probabilistic manner.
    • The probability and reward of each $(s, a, s')$ triple may be known or unknown.
      • If known: policy evaluation, value iteration
      • If unknown: reinforcement learning

Markov Decision Processes (offline)

Model an MDP

  • Markov decision process definition:
    • States: the set of states
    • $s_{\text{start}}$: starting state
    • $\text{Actions}(s)$: possible actions from state $s$
    • $T(s,a,s')$: probability of transitioning to $s'$ if action $a$ is taken in state $s$
    • $\text{Reward}(s,a,s')$: reward for the transition $(s,a,s')$
    • $\text{IsEnd}(s)$: reached an end state?
    • $\gamma \in (0,1]$: discount factor
  • A policy $\pi$: a mapping from each state $s \in \text{States}$ to an action $a \in \text{Actions}(s)$

Policy Evaluation (on-policy)

  • Utility of a policy is the discounted sum of the rewards on the path

    • Since the policy yields a random path, its utility is a random variable
  • For a policy, its path is $s_0, a_1 r_1 s_1, a_2 r_2 s_2, \ldots$ (one (action, reward, new state) triple per step); the utility with discount $\gamma$ is $u_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$

  • $V_\pi(s)$: the expected utility (called the value) of policy $\pi$ from state $s$

  • $Q_\pi(s,a)$: the Q-value, the expected utility of taking action $a$ from state $s$ and then following policy $\pi$

  • Connections between $V_\pi(s)$ and $Q_\pi(s,a)$:

$$V_\pi(s) = \begin{cases} 0 & \text{if IsEnd}(s) \\ Q_\pi(s, \pi(s)) & \text{otherwise} \end{cases}$$

$$Q_\pi(s,a) = \sum_{s'} T(s,a,s') \left[ \text{Reward}(s,a,s') + \gamma V_\pi(s') \right]$$

  • Iterative algorithm for policy evaluation (a minimal sketch follows below)

    • Initialize $V_\pi^{(0)}(s) = 0$ for all $s$
    • For iteration $t = 1, \ldots, t_{\text{PE}}$:
      • For each state $s$, $V_\pi^{(t)}(s) = \sum_{s'} T(s, \pi(s), s') \left[ \text{Reward}(s, \pi(s), s') + \gamma V_\pi^{(t-1)}(s') \right]$
  • Choice of $t_{\text{PE}}$: stop when the values don't change much, i.e., $\max_s |V_\pi^{(t)}(s) - V_\pi^{(t-1)}(s)| \le \epsilon$

  • Policy evaluation time complexity is $O(t_{\text{PE}} S S')$, where $S$ is the number of states and $S'$ is the number of states $s'$ with $T(s,a,s') \neq 0$
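
A minimal Python sketch of this policy-evaluation loop, assuming a small tabular MDP; the `states`, `policy`, `transitions`, and `reward` interfaces are illustrative assumptions, not course-provided code.

```python
def policy_evaluation(states, policy, transitions, reward, gamma=0.95,
                      is_end=lambda s: False, epsilon=1e-6, max_iters=1000):
    """Iterative policy evaluation for a known MDP (assumed interfaces).

    transitions[(s, a)] maps each successor s2 to T(s, a, s2);
    reward(s, a, s2) returns Reward(s, a, s2); policy[s] gives pi(s).
    """
    V = {s: 0.0 for s in states}                     # V_pi^(0)(s) = 0
    for _ in range(max_iters):
        V_new = {}
        for s in states:
            if is_end(s):
                V_new[s] = 0.0
                continue
            a = policy[s]
            V_new[s] = sum(p * (reward(s, a, s2) + gamma * V[s2])
                           for s2, p in transitions[(s, a)].items())
        # Stop when values don't change much (max absolute change <= epsilon)
        done = max(abs(V_new[s] - V[s]) for s in states) <= epsilon
        V = V_new
        if done:
            break
    return V
```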

Value Iteration (off-policy)

  • Optimal value $V_{\text{opt}}(s)$: the maximum value attained by any policy

  • Optimal Q-value $Q_{\text{opt}}(s,a)$:

$$Q_{\text{opt}}(s,a) = \sum_{s'} T(s,a,s') \left[ \text{Reward}(s,a,s') + \gamma V_{\text{opt}}(s') \right]$$

$$V_{\text{opt}}(s) = \begin{cases} 0 & \text{if IsEnd}(s) \\ \max_{a \in \text{Actions}(s)} Q_{\text{opt}}(s,a) & \text{otherwise} \end{cases}$$

  • Value iteration algorithm (a sketch follows below):
    • Initialize $V_{\text{opt}}^{(0)}(s) = 0$ for all $s$
    • For iteration $t = 1, \ldots, t_{\text{VI}}$:
      • For each state $s$, $V_{\text{opt}}^{(t)}(s) = \max_{a \in \text{Actions}(s)} \sum_{s'} T(s,a,s') \left[ \text{Reward}(s,a,s') + \gamma V_{\text{opt}}^{(t-1)}(s') \right]$
  • Value iteration time complexity is $O(t_{\text{VI}} S A S')$, where $S$ is the number of states, $A$ is the number of actions, and $S'$ is the number of states $s'$ with $T(s,a,s') \neq 0$
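
A corresponding sketch of value iteration under the same assumed tabular interfaces; `actions(s)` returning the legal actions is also an assumption.

```python
def value_iteration(states, actions, transitions, reward, gamma=0.95,
                    is_end=lambda s: False, num_iters=100):
    """Value iteration for a known MDP (assumed interfaces).

    actions(s) lists legal actions; transitions[(s, a)] maps s2 to T(s, a, s2);
    reward(s, a, s2) returns Reward(s, a, s2).
    """
    V = {s: 0.0 for s in states}                     # V_opt^(0)(s) = 0

    def Q(s, a):
        return sum(p * (reward(s, a, s2) + gamma * V[s2])
                   for s2, p in transitions[(s, a)].items())

    for _ in range(num_iters):
        # V_opt^(t) computed from V_opt^(t-1)
        V = {s: 0.0 if is_end(s) else max(Q(s, a) for a in actions(s))
             for s in states}
    # Read off a greedy policy from the final values
    policy = {s: max(actions(s), key=lambda a: Q(s, a))
              for s in states if not is_end(s)}
    return V, policy
```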

Reinforcement Learning (online)

[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 4]
  • Summary of reinforcement learning algorithms

| Algorithm               | Estimating                          | Based on                     |
|-------------------------|-------------------------------------|------------------------------|
| Model-based Monte Carlo | $\hat T$, $\widehat{\text{Reward}}$ | $s_0, a_1, r_1, s_1, \ldots$ |
| Model-free Monte Carlo  | $\hat Q_\pi$                        | $u$                          |
| SARSA                   | $\hat Q_\pi$                        | $r + \hat Q_\pi$             |
| Q-learning              | $\hat Q_{\text{opt}}$               | $r + \hat Q_{\text{opt}}$    |

Model-based methods (off-policy)

$$\hat Q_{\text{opt}}(s,a) = \sum_{s'} \hat T(s,a,s') \left[ \widehat{\text{Reward}}(s,a,s') + \gamma \hat V_{\text{opt}}(s') \right]$$

  • Based on data $s_0; a_1, r_1, s_1; a_2, r_2, s_2; \ldots; a_n, r_n, s_n$, estimate the transition probabilities and rewards of the MDP (a sketch follows below):

$$\hat T(s,a,s') = \frac{\#\text{times } (s,a,s') \text{ occurs}}{\#\text{times } (s,a) \text{ occurs}}, \qquad \widehat{\text{Reward}}(s,a,s') = r \text{ observed in } (s,a,r,s')$$

  • If $\pi$ is a non-deterministic policy which allows us to explore each (state, action) pair infinitely often, then the estimates of the transitions and rewards will converge to the true values.
    • Thus, the estimates $\hat T$ and $\widehat{\text{Reward}}$ do not depend on the policy used to generate the data, so model-based methods are off-policy estimators.
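
A sketch of the counting estimates above, assuming episodes are given as lists of `(s, a, r, s_next)` tuples and rewards are deterministic; this data format is an illustrative assumption.

```python
from collections import defaultdict

def estimate_mdp(episodes):
    """Model-based Monte Carlo: estimate T-hat and Reward-hat from data.

    episodes: list of trajectories, each a list of (s, a, r, s_next) tuples.
    """
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: count}
    rewards = {}                                     # (s, a, s_next) -> observed r
    for episode in episodes:
        for s, a, r, s2 in episode:
            counts[(s, a)][s2] += 1
            rewards[(s, a, s2)] = r                  # assumes deterministic rewards
    T_hat = {}
    for (s, a), next_counts in counts.items():
        total = sum(next_counts.values())
        T_hat[(s, a)] = {s2: c / total for s2, c in next_counts.items()}
    return T_hat, rewards
```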

Model-free Monte Carlo (on-policy)

  • The main idea is to estimate the Q-values directly

  • Original formula: $\hat Q_\pi(s,a) = \text{average of } u_t \text{ where } s_{t-1} = s, a_t = a$

  • Equivalent formulation: for each $(s, a, u)$, let $\eta = \frac{1}{1 + \#\text{updates to } (s,a)}$ and update

$$\hat Q_\pi(s,a) \leftarrow (1-\eta) \underbrace{\hat Q_\pi(s,a)}_{\text{prediction}} + \eta \underbrace{u}_{\text{data}} = \hat Q_\pi(s,a) - \eta \left( \hat Q_\pi(s,a) - u \right)$$

    • Implied objective: least squares $\left( \hat Q_\pi(s,a) - u \right)^2$
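
A sketch of the model-free Monte Carlo update with the $\eta = 1/(1 + \#\text{updates})$ step size; the class interface (calling `update(s, a, u)` for each (state, action, utility) after an episode ends) is an assumption for illustration.

```python
from collections import defaultdict

class ModelFreeMonteCarlo:
    """Model-free Monte Carlo: estimate Q_pi directly from observed utilities."""

    def __init__(self):
        self.Q = defaultdict(float)           # Q-hat_pi(s, a), initialized to 0
        self.num_updates = defaultdict(int)   # number of updates per (s, a)

    def update(self, s, a, u):
        """u is the discounted utility observed from (s, a) onward."""
        self.num_updates[(s, a)] += 1
        eta = 1.0 / (1 + self.num_updates[(s, a)])
        # Convex combination of old prediction and new data u
        self.Q[(s, a)] = (1 - eta) * self.Q[(s, a)] + eta * u
```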

SARSA (on-policy)

  • SARSA algorithm

On each $(s, a, r, s', a')$:

$$\hat Q_\pi(s,a) \leftarrow (1-\eta) \hat Q_\pi(s,a) + \eta \left[ \underbrace{r}_{\text{data}} + \gamma \underbrace{\hat Q_\pi(s',a')}_{\text{estimate}} \right]$$

  • SARSA uses the current estimate $\hat Q_\pi(s',a')$ instead of just the raw data $u$

| Model-free Monte Carlo                           | SARSA                          |
|--------------------------------------------------|--------------------------------|
| $u$                                              | $r + \gamma \hat Q_\pi(s',a')$ |
| based on one full path                           | based on an estimate           |
| unbiased                                         | biased                         |
| large variance                                   | small variance                 |
| must wait until the end of the episode to update | can update immediately         |
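
A sketch of the SARSA update, assuming a tabular Q estimate and a constant step size `eta`; the class interface is illustrative.

```python
from collections import defaultdict

class Sarsa:
    """SARSA: bootstrap the target from the estimate at the next (state, action)."""

    def __init__(self, gamma=0.95, eta=0.1):
        self.Q = defaultdict(float)
        self.gamma, self.eta = gamma, eta

    def update(self, s, a, r, s2, a2, is_end=False):
        # Target uses the estimate Q(s', a') rather than a full-path utility u
        target = r + (0.0 if is_end else self.gamma * self.Q[(s2, a2)])
        self.Q[(s, a)] = (1 - self.eta) * self.Q[(s, a)] + self.eta * target
```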

Q-learning (off-policy)

On each $(s, a, r, s')$:

$$\hat Q_{\text{opt}}(s,a) \leftarrow (1-\eta) \underbrace{\hat Q_{\text{opt}}(s,a)}_{\text{prediction}} + \eta \underbrace{\left[ r + \gamma \hat V_{\text{opt}}(s') \right]}_{\text{target}}, \qquad \hat V_{\text{opt}}(s') = \max_{a' \in \text{Actions}(s')} \hat Q_{\text{opt}}(s',a')$$
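
A sketch of the Q-learning update, assuming a tabular Q estimate, a constant step size, and an `actions(s)` helper; all are illustrative assumptions.

```python
from collections import defaultdict

class QLearning:
    """Q-learning: bootstrap the target from max_a' Q(s', a') (off-policy)."""

    def __init__(self, actions, gamma=0.95, eta=0.1):
        self.Q = defaultdict(float)
        self.actions, self.gamma, self.eta = actions, gamma, eta

    def update(self, s, a, r, s2, is_end=False):
        # V-hat_opt(s') = max over a' of Q-hat_opt(s', a')
        V_opt = 0.0 if is_end else max(self.Q[(s2, a2)] for a2 in self.actions(s2))
        target = r + self.gamma * V_opt
        self.Q[(s, a)] = (1 - self.eta) * self.Q[(s, a)] + self.eta * target
```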

Epsilon-greedy

  • Reinforcement learning algorithm template

    • For $t = 1, 2, 3, \ldots$
      • Choose action $a_t = \pi_{\text{act}}(s_{t-1})$
      • Receive reward $r_t$ and observe the new state $s_t$
      • Update parameters
  • What exploration policy $\pi_{\text{act}}$ to use? Need to balance exploration and exploitation.

  • Epsilon-greedy policy:

$$\pi_{\text{act}}(s) = \begin{cases} \arg\max_{a \in \text{Actions}(s)} \hat Q_{\text{opt}}(s,a) & \text{with probability } 1-\epsilon \\ \text{random from Actions}(s) & \text{with probability } \epsilon \end{cases}$$
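
A sketch of epsilon-greedy action selection, assuming a dictionary Q table and an `actions(s)` helper (both illustrative).

```python
import random

def epsilon_greedy(s, Q, actions, epsilon=0.1):
    """Epsilon-greedy exploration policy.

    Q: dict mapping (state, action) to an estimated Q-value;
    actions(s): callable returning the legal actions in state s.
    """
    if random.random() < epsilon:
        return random.choice(actions(s))                        # explore
    return max(actions(s), key=lambda a: Q.get((s, a), 0.0))    # exploit
```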

Function approximation

  • Large state spaces are hard to explore. For better generalization to unseen states and actions, we can use function approximation: define
    • features $\phi(s,a)$, and
    • weights $\mathbf w$, such that $\hat Q_{\text{opt}}(s,a;\mathbf w) = \mathbf w \cdot \phi(s,a)$
  • Q-learning with function approximation: on each $(s, a, r, s')$,

$$\mathbf w \leftarrow \mathbf w - \eta \left[ \underbrace{\hat Q_{\text{opt}}(s,a;\mathbf w)}_{\text{prediction}} - \underbrace{\left( r + \gamma \hat V_{\text{opt}}(s') \right)}_{\text{target}} \right] \phi(s,a)$$
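
A sketch of one Q-learning update with a linear function approximator, assuming a hypothetical `phi(s, a)` that returns a NumPy feature vector and an `actions(s)` helper.

```python
import numpy as np

def q_learning_fa_update(w, phi, s, a, r, s2, actions, gamma=0.95, eta=0.1,
                         is_end=False):
    """One Q-learning update with Q-hat_opt(s, a; w) = w . phi(s, a)."""
    prediction = np.dot(w, phi(s, a))
    # V-hat_opt(s') = max over a' of w . phi(s', a')
    V_opt = 0.0 if is_end else max(np.dot(w, phi(s2, a2)) for a2 in actions(s2))
    target = r + gamma * V_opt
    return w - eta * (prediction - target) * phi(s, a)
```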

Week 5. Games

Modeling Games

  • A simple example game: [figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5]
  • Game tree
    • Each node is a decision point for a player
    • Each root-to-leaf path is a possible outcome of the game
    • Node markers indicate who is in control: a max node (the agent maximizes his utility) and a min node (the opponent minimizes the agent's utility)
[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5]
  • Two-player zero-sum game:

    • $s_{\text{start}}$: starting state
    • $\text{Actions}(s)$: possible actions from state $s$
    • $\text{Succ}(s,a)$: successor state after taking action $a$ in state $s$
    • $\text{IsEnd}(s)$: reached an end state?
    • $\text{Utility}(s)$: agent's utility for end state $s$
    • $\text{Player}(s) \in \text{Players}$: the player who controls state $s$
    • $\text{Players} = \{\text{agent}, \text{opp}\}$
    • Zero-sum: the agent's utility is the negative of the opponent's utility
  • Properties of games

    • All the utility is at the end state
      • For a game about winning/losing at the end (e.g., chess), $\text{Utility}(s)$ is $+\infty$ (if the agent wins), $-\infty$ (if the opponent wins), or $0$ (if it is a draw).
    • Different players are in control at different states
  • Policies (for player $p$ in state $s$)

    • Deterministic policy: $\pi_p(s) \in \text{Actions}(s)$
    • Stochastic policy: $\pi_p(s,a) \in [0,1]$, the probability of player $p$ choosing action $a$ in state $s$

Game Algorithms

Game evaluation

  • Use a recurrence analogous to policy evaluation to compute the value of the game (expected utility):

$$V_{\text{eval}}(s) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \sum_{a \in \text{Actions}(s)} \pi_{\text{agent}}(s,a) V_{\text{eval}}(\text{Succ}(s,a)) & \text{Player}(s) = \text{agent} \\ \sum_{a \in \text{Actions}(s)} \pi_{\text{opp}}(s,a) V_{\text{eval}}(\text{Succ}(s,a)) & \text{Player}(s) = \text{opp} \end{cases}$$

Expectimax

  • Expectimax: given opponent’s policy, find the best policy for the agent

$$V_{\text{exptmax}}(s) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \max_{a \in \text{Actions}(s)} V_{\text{exptmax}}(\text{Succ}(s,a)) & \text{Player}(s) = \text{agent} \\ \sum_{a \in \text{Actions}(s)} \pi_{\text{opp}}(s,a) V_{\text{exptmax}}(\text{Succ}(s,a)) & \text{Player}(s) = \text{opp} \end{cases}$$

Minimax

  • Minimax assumes the worst case for the opponent’s policy

$$V_{\text{minmax}}(s) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \max_{a \in \text{Actions}(s)} V_{\text{minmax}}(\text{Succ}(s,a)) & \text{Player}(s) = \text{agent} \\ \min_{a \in \text{Actions}(s)} V_{\text{minmax}}(\text{Succ}(s,a)) & \text{Player}(s) = \text{opp} \end{cases}$$
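
A sketch of this recurrence as a recursive function, assuming hypothetical `utility`, `is_end`, `actions`, `succ`, and `player` callbacks that describe the game.

```python
def minimax_value(s, utility, is_end, actions, succ, player):
    """Minimax value of state s.

    player(s) returns 'agent' or 'opp'; succ(s, a) returns the next state.
    """
    if is_end(s):
        return utility(s)
    values = (minimax_value(succ(s, a), utility, is_end, actions, succ, player)
              for a in actions(s))
    # Agent maximizes its utility; the opponent minimizes it
    return max(values) if player(s) == 'agent' else min(values)
```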

[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5]
  • Minimax properties

    • Best against a minimax opponent: $V(\pi_{\max}, \pi_{\min}) \ge V(\pi_{\text{agent}}, \pi_{\min})$ for all $\pi_{\text{agent}}$
    • Lower bound against any opponent: $V(\pi_{\max}, \pi_{\min}) \le V(\pi_{\max}, \pi_{\text{opp}})$ for all $\pi_{\text{opp}}$
    • Not optimal if the opponent is known! Here, $\pi_7$ stands for an arbitrary known policy, for example, choosing actions uniformly at random: $V(\pi_{\max}, \pi_7) \le V(\pi_{\text{exptmax}(7)}, \pi_7)$ for opponent $\pi_7$
  • Relationship between game values

[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5]

Expectiminimax

  • Modify the game to introduce randomness
[Figure omitted; source: Stanford CS221 spring 2025, lecture slides week 5]
  • This is equivalent to having a third player, the coin:

$$V_{\text{exptminmax}}(s) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \max_{a \in \text{Actions}(s)} V_{\text{exptminmax}}(\text{Succ}(s,a)) & \text{Player}(s) = \text{agent} \\ \min_{a \in \text{Actions}(s)} V_{\text{exptminmax}}(\text{Succ}(s,a)) & \text{Player}(s) = \text{opp} \\ \sum_{a \in \text{Actions}(s)} \pi_{\text{coin}}(s,a) V_{\text{exptminmax}}(\text{Succ}(s,a)) & \text{Player}(s) = \text{coin} \end{cases}$$

Evaluation functions

  • Computational complexity: with branching factor $b$ and depth $d$ ($2d$ turns, because there are two players), since this is a tree search, we have

    • $O(d)$ space
    • $O(b^{2d})$ time
  • Limited-depth tree search: stop at a maximum depth $d_{\max}$ (a sketch follows below)

$$V_{\text{minmax}}(s,d) = \begin{cases} \text{Utility}(s) & \text{IsEnd}(s) \\ \text{Eval}(s) & d = 0 \\ \max_{a \in \text{Actions}(s)} V_{\text{minmax}}(\text{Succ}(s,a), d) & \text{Player}(s) = \text{agent} \\ \min_{a \in \text{Actions}(s)} V_{\text{minmax}}(\text{Succ}(s,a), d-1) & \text{Player}(s) = \text{opp} \end{cases}$$

    • Use: at state $s$, call $V_{\text{minmax}}(s, d_{\max})$
  • An evaluation function $\text{Eval}(s)$ is a (possibly very weak) estimate of the value $V_{\text{minmax}}(s)$, using domain knowledge

    • This is similar to $\text{FutureCost}(s)$ in A* search problems. But unlike A*, there are no guarantees on the approximation error.
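
A sketch of depth-limited minimax with an evaluation function, under the same assumed game interface as the minimax sketch above; the depth is decremented after the opponent's move, matching the recurrence.

```python
def minimax_limited(s, d, utility, is_end, actions, succ, player, eval_fn):
    """Depth-limited minimax; eval_fn(s) plays the role of Eval(s)."""
    if is_end(s):
        return utility(s)
    if d == 0:
        return eval_fn(s)                          # cut off the search here
    if player(s) == 'agent':
        return max(minimax_limited(succ(s, a), d, utility, is_end, actions,
                                   succ, player, eval_fn) for a in actions(s))
    return min(minimax_limited(succ(s, a), d - 1, utility, is_end, actions,
                               succ, player, eval_fn) for a in actions(s))
```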

Alpha-Beta pruning

  • Branch and bound

    • Maintain lower and upper bounds on values.
    • If intervals don't overlap, then we can choose optimally without further work
  • Alpha-beta pruning for minimax (a sketch follows below):

    • $a_s$: lower bound on the value of max node $s$
    • $b_s$: upper bound on the value of min node $s$
    • Store $\alpha_s = \max_{s' \preceq s} a_{s'}$ and $\beta_s = \min_{s' \preceq s} b_{s'}$, where $s'$ ranges over $s$ and its ancestors
    • Prune a node if its interval doesn't have non-trivial overlap with the interval of every ancestor
  • Pruning depends on the move ordering

    • Worst ordering: $O(b^{2d})$ time
    • Best ordering: $O(b^{2 \cdot 0.5 d})$ time
    • Random ordering: $O(b^{2 \cdot 0.75 d})$ time (when $b = 2$)
    • In practice, we can order successors based on the evaluation function $\text{Eval}(s)$:
      • Max nodes: order successors by decreasing $\text{Eval}$
      • Min nodes: order successors by increasing $\text{Eval}$
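
A sketch of minimax with alpha-beta pruning over the same assumed game interface; `alpha` and `beta` track the bounds described above, and a branch is cut as soon as the interval becomes empty.

```python
import math

def alpha_beta(s, utility, is_end, actions, succ, player,
               alpha=-math.inf, beta=math.inf):
    """Minimax value of s with alpha-beta pruning.

    alpha: best value the maximizing agent can already guarantee on this path;
    beta: best value the minimizing opponent can already guarantee.
    """
    if is_end(s):
        return utility(s)
    if player(s) == 'agent':                       # max node
        value = -math.inf
        for a in actions(s):
            value = max(value, alpha_beta(succ(s, a), utility, is_end,
                                          actions, succ, player, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                      # interval is empty: prune
                break
        return value
    value = math.inf                               # min node
    for a in actions(s):
        value = min(value, alpha_beta(succ(s, a), utility, is_end,
                                      actions, succ, player, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value
```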

More Topics

TD learning (on-policy)

  • General learning framework

    • Objective function: $\frac{1}{2} \left( \text{prediction}(\mathbf w) - \text{target} \right)^2$
    • Update: $\mathbf w \leftarrow \mathbf w - \eta \left( \text{prediction}(\mathbf w) - \text{target} \right) \nabla_{\mathbf w} \text{prediction}(\mathbf w)$
  • Temporal difference (TD) learning: on each $(s, a, r, s')$,

$$\mathbf w \leftarrow \mathbf w - \eta \left[ V(s; \mathbf w) - \left( r + \gamma V(s'; \mathbf w) \right) \right] \nabla_{\mathbf w} V(s; \mathbf w)$$

    • For a linear function $V(s; \mathbf w) = \mathbf w \cdot \phi(s)$, the gradient is $\nabla_{\mathbf w} V(s; \mathbf w) = \phi(s)$
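
A sketch of one TD update for a linear value function, assuming a hypothetical `phi(s)` that returns a NumPy feature vector.

```python
import numpy as np

def td_update(w, phi, s, r, s2, gamma=0.95, eta=0.1, is_end=False):
    """One TD update for V(s; w) = w . phi(s)."""
    prediction = np.dot(w, phi(s))
    target = r + (0.0 if is_end else gamma * np.dot(w, phi(s2)))
    # For a linear value function, the gradient of the prediction is phi(s)
    return w - eta * (prediction - target) * phi(s)
```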

Simultaneous games

  • Payoff matrix: for two players, $V(a,b)$ is player A's utility if A chooses action $a$ and B chooses action $b$

  • Strategies (policies)

    • Pure strategy: a single action
    • Mixed strategy: a probability distribution over actions
  • Game evaluation: the value of the game if player A follows $\pi_A$ and player B follows $\pi_B$ is $V(\pi_A, \pi_B) = \sum_{a,b} \pi_A(a) \pi_B(b) V(a,b)$ (a small worked sketch follows below)

  • For pure strategies, going second is no worse: $\max_a \min_b V(a,b) \le \min_b \max_a V(a,b)$

  • Mixed strategies: the second player can always play a pure strategy: for any fixed mixed strategy $\pi_A$, $\min_{\pi_B} V(\pi_A, \pi_B)$ can be attained by a pure strategy

  • Minimax theorem: for every simultaneous two-player zero-sum game with a finite number of actions, $\max_{\pi_A} \min_{\pi_B} V(\pi_A, \pi_B) = \min_{\pi_B} \max_{\pi_A} V(\pi_A, \pi_B)$

    • Revealing your optimal mixed strategy doesn't hurt you.
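
To make the game-evaluation formula concrete, a small sketch that computes $V(\pi_A, \pi_B)$ for a made-up 2x2 payoff matrix; the matrix and strategies are hypothetical examples.

```python
import numpy as np

def game_value(payoff, pi_A, pi_B):
    """Expected value V(pi_A, pi_B) = sum over a, b of pi_A(a) * pi_B(b) * V(a, b)."""
    return float(pi_A @ payoff @ pi_B)

# Hypothetical payoff matrix V(a, b) for player A (zero-sum: B receives -V)
payoff = np.array([[3.0, -1.0],
                   [-2.0, 4.0]])
pi_A = np.array([0.6, 0.4])      # mixed strategy for A
pi_B = np.array([0.5, 0.5])      # mixed strategy for B
print(game_value(payoff, pi_A, pi_B))
# 0.6*0.5*3 + 0.6*0.5*(-1) + 0.4*0.5*(-2) + 0.4*0.5*4 = 1.0
```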