CS221 Artificial Intelligence, Notes on States (Week 3-5)

Week 4. Markov Decision Processes

  • Uncertainty in the search process
    • Performing action \(a\) from a state \(s\) can lead to several states \(s'_1, s'_2, \ldots\) in a probabilistic manner.
    • The probability and reward of each \((s, a, s')\) pair may be known or unknown.
      • If known: policy evaluation, value iteration
      • If unknown: reinforcement learning

Markov Decision Processes (offline)

Model an MDP

  • Markov decision process definition:
    • States: the set of states
    • \(s_\text{start}\): starting state
    • Actions\((s)\): possible actions
    • \(T(s, a, s')\): probability of transitioning to \(s'\) if action \(a\) is taken in state \(s\)
    • Reward\((s, a, s')\): reward for the transition \((s, a, s')\)
    • IsEnd\((s)\): reached end state?
    • \(\gamma \in (0, 1]\): discount factor
  • A policy \(\pi\): a mapping from each state \(s \in \text{States}\) to an action \(a \in \text{Actions}(s)\)

Policy Evaluation (on-policy)

  • Utility of a policy is the discounted sum of the rewards on the path

    • Since the policy yields a random path, its utility is a random variable
  • Following a policy yields a path \(s_0, a_1 r_1 s_1, a_2 r_2 s_2, \ldots\) (each step is an (action, reward, new state) triple); the utility with discount \(\gamma\) is \[ u_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots \]

  • \(V_\pi(s)\): the expected utility (called value) of a policy \(\pi\) from state \(s\)

  • \(Q_\pi(s, a)\): Q-value, the expected utility of taking action \(a\) from state \(s\), and then following policy \(\pi\)

  • Connections between \(V_\pi(s)\) and \(Q_\pi(s, a)\):

\[ V_\pi(s) = \begin{cases} 0 & \text{ if IsEnd}(s)\\ Q_\pi(s, \pi(s)) & \text{ otherwise} \end{cases} \]

\[ Q_\pi(s, a) = \sum_{s'} T(s, a, s')\left[\text{Reward}(s, a, s') + \gamma V_\pi(s')\right] \]

  • Iterative algorithm for policy evaluation

    • Initialize \(V_\pi^{(0)}(s) = 0\) for all \(s\)
    • For iteration \(t = 1, \ldots, t_\text{PE}\):
      • For each state \(s\), \[V_\pi^{(t)}(s) = \sum_{s'} T\left(s, \pi(s), s'\right)\left[\text{Reward}\left(s, \pi(s), s'\right) + \gamma V_\pi^{(t-1)}(s')\right]\]
  • Choice of \(t_\text{PE}\): stop when values don’t change much \[ \max_{s} \left|V_\pi^{(t)}(s) - V_\pi^{(t-1)}(s)\right| \leq \epsilon \]

  • Policy evaluation time complexity is \(O(t_\text{PE} S S')\), where \(S\) is the number of states and \(S'\) is the number of states \(s'\) with \(T(s, a, s') > 0\)
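Below is a minimal sketch of iterative policy evaluation on a tiny made-up MDP; the state names, transition table `T`, `reward` dictionary, and policy `pi` are illustrative assumptions, not anything from the lecture.

```python
# Iterative policy evaluation on a toy two-state MDP (hypothetical example).
# T[(s, a)] lists (next_state, probability); reward[(s, a, s_next)] is the reward.
gamma = 0.9
states = ["in", "end"]
T = {("in", "stay"): [("in", 0.7), ("end", 0.3)]}
reward = {("in", "stay", "in"): 4.0, ("in", "stay", "end"): 10.0}
pi = {"in": "stay"}                              # fixed policy to evaluate

V = {s: 0.0 for s in states}                     # V_pi^(0) = 0
for _ in range(100):                             # at most t_PE iterations
    V_new = {}
    for s in states:
        if s == "end":                           # IsEnd(s): value is 0
            V_new[s] = 0.0
            continue
        a = pi[s]
        V_new[s] = sum(p * (reward[(s, a, sp)] + gamma * V[sp])
                       for sp, p in T[(s, a)])
    converged = max(abs(V_new[s] - V[s]) for s in states) < 1e-6
    V = V_new
    if converged:                                # values stopped changing much
        break
print(V)                                         # approximate V_pi
```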

Value Iteration (off-policy)

  • Optimal value \(V_\text{opt}(s)\): maximum value attained by any policy

  • Optimal Q-value \(Q_\text{opt}(s, a)\)

\[ Q_\text{opt}(s, a) = \sum_{s'} T(s, a, s') \left[\text{Reward}(s, a, s') + \gamma V_\text{opt}(s') \right] \]

\[ V_\text{opt}(s) = \begin{cases} 0 & \text{ if IsEnd}(s)\\ \max_{a\in \text{Actions}(s)} Q_\text{opt}(s, a) & \text{ otherwise} \end{cases} \]

  • Value iteration algorithm:
    • Initialize \(V_\text{opt}^{(0)}(s) = 0\) for all \(s\)
    • For iteration \(t = 1, \ldots, t_\text{VI}\):
      • For each state \(s\), \[V_\text{opt}^{(t)}(s) = \max_{a\in \text{Actions}(s)} \sum_{s'} T\left(s, a, s'\right)\left[\text{Reward}\left(s, a, s'\right) + \gamma V_\text{opt}^{(t-1)}(s')\right]\]
  • VI time complexity is \(O(t_\text{VI} S A S')\), where \(S\) is the number of states, \(A\) is the number of actions, and \(S'\) is the number of states \(s'\) with \(T(s, a, s') > 0\)
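The following sketch runs value iteration on a small made-up MDP and then reads off a greedy policy from the resulting Q-values; all state, action, and reward values here are illustrative assumptions.

```python
# Value iteration on a toy MDP (hypothetical numbers), then policy extraction.
gamma = 0.9
states = ["in", "end"]
actions = {"in": ["stay", "quit"], "end": []}    # no actions at the end state
T = {("in", "stay"): [("in", 0.7), ("end", 0.3)],
     ("in", "quit"): [("end", 1.0)]}
reward = {("in", "stay", "in"): 4.0, ("in", "stay", "end"): 10.0,
          ("in", "quit", "end"): 5.0}

def q(V, s, a):
    """Q_opt(s, a) given the current value estimates V."""
    return sum(p * (reward[(s, a, sp)] + gamma * V[sp]) for sp, p in T[(s, a)])

V = {s: 0.0 for s in states}                     # V_opt^(0) = 0
for _ in range(100):                             # t_VI iterations
    V = {s: max((q(V, s, a) for a in actions[s]), default=0.0) for s in states}

policy = {s: max(actions[s], key=lambda a: q(V, s, a))
          for s in states if actions[s]}
print(V, policy)
```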

Reinforcement Learning (online)

Figure source: Stanford CS221 spring 2025, lecture slides week 4
  • Summary of reinforcement learning algorithms
| Algorithm | Estimating | Based on |
|---|---|---|
| Model-based Monte Carlo | \(\hat{T}, \hat{R}\) | \(s_0, a_1, r_1, s_1, \ldots\) |
| Model-free Monte Carlo | \(\hat{Q}_\pi\) | \(u\) |
| SARSA | \(\hat{Q}_\pi\) | \(r + \hat{Q}_\pi\) |
| Q-learning | \(\hat{Q}_\text{opt}\) | \(r + \hat{Q}_\text{opt}\) |

Model-based methods (off-policy)

\[ \hat{Q}_\text{opt}(s, a) = \sum_{s'} \hat{T}(s, a, s') \left[\widehat{\text{Reward}}(s, a, s') + \gamma \hat{V}_\text{opt}(s') \right] \]

  • Based on data \(s_0; a_1, r_1, s_1; a_2, r_2, s_2; \ldots ; a_n, r_n, s_n\), estimate the transition probabilities and rewards of the MDP

\[\begin{align*} \hat{T}(s, a, s') & = \frac{\# \text{times } (s, a, s') \text{ occurs}} {\# \text{times } (s, a) \text{ occurs}} \\ \widehat{\text{Reward}}(s, a, s') & = r \text{ in } (s, a, r, s') \end{align*}\]

  • If the exploration policy \(\pi\) is non-deterministic and visits every (state, action) pair infinitely often, then the estimates of the transitions and rewards converge to the true values.
    • The estimates \(\hat{T}\) and \(\widehat{\text{Reward}}\) do not depend on which policy generated the data, so model-based methods are off-policy.
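A sketch of turning logged episodes into the count-based estimates above; the episode format (a start state followed by (action, reward, next state) triples) is an assumed convention, not something specified in the notes.

```python
from collections import defaultdict

# Model-based Monte Carlo: estimate T and Reward from observed transitions.
# Each episode is assumed to be [s0, (a1, r1, s1), (a2, r2, s2), ...].
episodes = [
    ["in", ("stay", 4, "in"), ("stay", 10, "end")],
    ["in", ("stay", 4, "in"), ("stay", 4, "in"), ("stay", 10, "end")],
]

count_sa = defaultdict(int)      # number of times (s, a) occurs
count_sas = defaultdict(int)     # number of times (s, a, s') occurs
reward_hat = {}                  # observed reward for (s, a, s') (assumed deterministic)

for ep in episodes:
    s = ep[0]
    for a, r, sp in ep[1:]:
        count_sa[(s, a)] += 1
        count_sas[(s, a, sp)] += 1
        reward_hat[(s, a, sp)] = r
        s = sp

T_hat = {(s, a, sp): count_sas[(s, a, sp)] / count_sa[(s, a)]
         for (s, a, sp) in count_sas}
print(T_hat)    # here: {('in','stay','in'): 0.6, ('in','stay','end'): 0.4}
```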

Model-free Monte Carlo (on-policy)

  • The main idea is to estimate the Q-values directly

  • Original formula \[ \hat{Q}_\pi (s, a) = \text{average of } u_t \text{ where } s_{t-1} = s, a_t = a \]

  • Equivalent formulation: for each \((s, a, u)\), let \[\begin{align*} \eta & = \frac{1}{1 + \#\text{updates to }(s, a)}\\ \hat{Q}_\pi(s, a) & \leftarrow (1-\eta) \underbrace{\hat{Q}_\pi (s, a)}_{\text{prediction}} + \eta \underbrace{u}_{\text{data}}\\ & = \hat{Q}_\pi (s, a) - \eta\left(\hat{Q}_\pi (s, a) - u\right) \end{align*}\]

    • Implied objective: least squares \[\left(\hat{Q}_\pi (s, a) - u \right)^2\]
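A sketch of the model-free Monte Carlo update on one episode; the episode format is an assumed convention, and the step size uses the running-average choice of \(\eta\) described above.

```python
from collections import defaultdict

# Model-free Monte Carlo: for each (s, a), average the utilities u observed after it.
gamma = 1.0
Q_hat = defaultdict(float)
num_updates = defaultdict(int)

def update_from_episode(steps):
    """steps: [(s0, a1, r1), (s1, a2, r2), ...] along one path under policy pi."""
    for i, (s, a, _) in enumerate(steps):
        # Discounted utility from this step onward.
        u = sum(gamma ** k * r for k, (_, _, r) in enumerate(steps[i:]))
        eta = 1.0 / (1 + num_updates[(s, a)])    # running-average step size
        Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * u
        num_updates[(s, a)] += 1

update_from_episode([("in", "stay", 4), ("in", "stay", 4), ("in", "stay", 10)])
print(dict(Q_hat))
```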

SARSA (on-policy)

  • SARSA algorithm

On each \((s, a, r, s', a')\): \[ \hat{Q}_\pi(s, a) \leftarrow (1-\eta) \hat{Q}_\pi(s, a) + \eta \left[\underbrace{r}_{\text{data}} + \gamma \underbrace{\hat{Q}_\pi(s', a')}_{\text{estimate}}\right] \]

  • SARSA uses estimates \(\hat{Q}_\pi(s', a')\) instead of just raw data \(u\)
| Model-free Monte Carlo | SARSA |
|---|---|
| \(u\) | \(r + \hat{Q}_\pi(s', a')\) |
| based on one full path | based on an estimate |
| unbiased | biased |
| large variance | small variance |
| must wait until the end of the episode to update | can update immediately |
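A minimal sketch of the SARSA update applied to a single observed \((s, a, r, s', a')\) tuple; the state and action names are hypothetical.

```python
from collections import defaultdict

# SARSA: bootstrap the target with the current estimate at (s', a').
gamma, eta = 0.9, 0.1
Q_hat = defaultdict(float)

def sarsa_update(s, a, r, s_next, a_next):
    """On each (s, a, r, s', a') generated by following the policy pi."""
    target = r + gamma * Q_hat[(s_next, a_next)]
    Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * target

# Example transition where the policy chose "stay" again in s'.
sarsa_update("in", "stay", 4, "in", "stay")
print(Q_hat[("in", "stay")])
```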

Q-learning (off-policy)

On each \((s, a, r, s')\): \[\begin{align*} \hat{Q}_\text{opt}(s, a) & \leftarrow (1-\eta) \underbrace{\hat{Q}_\text{opt}(s, a)}_{\text{prediction}} + \eta \underbrace{\left[r + \gamma \hat{V}_\text{opt}(s')\right]}_{\text{target}}\\ \hat{V}_\text{opt}(s') &= \max_{a' \in \text{Actions}(s')} \hat{Q}_\text{opt}(s', a') \end{align*}\]
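A minimal Q-learning sketch; the `actions` table and the transition fed to the update are made-up examples.

```python
from collections import defaultdict

# Q-learning: the target maximizes over next actions, regardless of which
# action the exploration policy actually takes next (off-policy).
gamma, eta = 0.9, 0.1
Q_hat = defaultdict(float)
actions = {"in": ["stay", "quit"], "end": []}    # hypothetical action sets

def q_learning_update(s, a, r, s_next):
    v_next = max((Q_hat[(s_next, ap)] for ap in actions[s_next]), default=0.0)
    Q_hat[(s, a)] = (1 - eta) * Q_hat[(s, a)] + eta * (r + gamma * v_next)

q_learning_update("in", "stay", 4, "in")
print(Q_hat[("in", "stay")])
```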

Epsilon-greedy

  • Reinforcement learning algorithm template

    • For \(t = 1, 2, 3, \ldots\)
      • Choose action \(a_t = \pi_\text{act}(s_{t-1})\)
      • Receive reward \(r_t\) and observe new state \(s_t\)
      • Update parameters
  • What exploration policy \(\pi_\text{act}\) to use? Need to balance exploration and exploitation.

  • Epsilon-greedy policy \[ \pi_\text{act}(s) = \begin{cases} \arg\max_{a \in \text{Actions}(s)} \hat{Q}_\text{opt}(s, a) & \text{ probability } 1- \epsilon\\ \text{random from Actions}(s) & \text{ probability } \epsilon \end{cases} \]
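A small sketch of the epsilon-greedy exploration policy over the current Q estimates; the action set is a made-up example.

```python
import random
from collections import defaultdict

# Epsilon-greedy: exploit the current Q estimates with probability 1 - epsilon,
# otherwise pick a random action.
epsilon = 0.1
Q_hat = defaultdict(float)
actions = {"in": ["stay", "quit"]}               # hypothetical action set

def pi_act(s):
    if random.random() < epsilon:
        return random.choice(actions[s])                     # explore
    return max(actions[s], key=lambda a: Q_hat[(s, a)])      # exploit

print(pi_act("in"))
```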

Function approximation

  • Large state spaces are hard to explore. For better generalization to unseen states and actions, we can use function approximation, which defines
    • features \(\phi(s, a)\), and
    • weights \(\mathbf{w}\), such that \[\hat{Q}_\text{opt}(s, a; \mathbf{w}) = \mathbf{w} \cdot \phi(s, a)\]
  • Q-learning with function approximation: on each \((s, a, r, s')\),

\[ \mathbf{w} \leftarrow \mathbf{w} - \eta\left[ \underbrace{\hat{Q}_\text{opt}(s, a; \mathbf{w})}_{\text{prediction}} - \underbrace{\left(r + \gamma \hat{V}_\text{opt}(s') \right)}_{\text{target}} \right] \phi(s, a) \]
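A sketch of Q-learning with linear function approximation; the feature map `phi` below is a made-up indicator featurizer, used only to illustrate the weight update.

```python
from collections import defaultdict

# Q-learning with linear function approximation: Q_opt(s, a; w) = w . phi(s, a).
gamma, eta = 0.9, 0.1
w = defaultdict(float)
actions = {"in": ["stay", "quit"], "end": []}    # hypothetical action sets

def phi(s, a):
    return {f"state={s}": 1.0, f"action={a}": 1.0}   # made-up sparse features

def q(s, a):
    return sum(w[f] * v for f, v in phi(s, a).items())

def update(s, a, r, s_next):
    v_next = max((q(s_next, ap) for ap in actions[s_next]), default=0.0)
    residual = q(s, a) - (r + gamma * v_next)        # prediction - target
    for f, v in phi(s, a).items():
        w[f] -= eta * residual * v

update("in", "stay", 4, "in")
print(dict(w))
```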

Week 5. Games

Modeling Games

  • A simple example game
    Figure source: Stanford CS221 spring 2025, lecture slides week 5
  • Game tree
    • Each node is a decision point for a player
    • Each root-to-leaf path is a possible outcome of the game
    • Nodes indicate which player’s policy applies: use \(\bigtriangleup\) for a maximizing node (the agent maximizes its utility) and \(\bigtriangledown\) for a minimizing node (the opponent minimizes the agent’s utility)
Figure source: Stanford CS221 spring 2025, lecture slides week 5
  • Two-player zero-sum game:

    • \(s_\text{start}\)
    • Actions\((s)\)
    • Succ\((s, a)\)
    • IsEnd\((s)\)
    • Utility\((s)\): agent’s utility for end state \(s\)
    • Player\((s) \in \text{Players}\): the player who controls state \(s\)
    • Players \(= \{\text{agent}, \text{opp} \}\)
    • Zero-sum: the agent’s utility is the negative of the opponent’s utility
  • Properties of games

    • All of the utility comes at the end state
      • For a game about winning or losing at the end (e.g., chess), Utility\((s)\) is \(\infty\) (if the agent wins), \(-\infty\) (if the opponent wins), or \(0\) (if it is a draw).
    • Different players are in control at different states
  • Policies (for player \(p\) in state \(s\))

    • Deterministic policy: \(\pi_p(s) \in \text{Actions}(s)\)
    • Stochastic policy: \(\pi_p(s, a) \in [0, 1]\), a probability

Game Algorithms

Game evaluation

  • Use a recurrence analogous to policy evaluation to estimate the value of the game (expected utility), given both players’ policies:

\[ V_\text{eval}(s) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \sum_{a\in\text{Actions}(s)} \pi_\text{agent}(s, a) V_\text{eval}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{agent}\\ \sum_{a\in\text{Actions}(s)} \pi_\text{opp}(s, a) V_\text{eval}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{opp}\\ \end{cases} \]

Expectimax

  • Expectimax: given opponent’s policy, find the best policy for the agent

\[ V_\text{exptmax}(s) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \max_{a\in\text{Actions}(s)} V_\text{exptmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{agent}\\ \sum_{a\in\text{Actions}(s)} \pi_\text{opp}(s, a) V_\text{exptmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{opp}\\ \end{cases} \]

Minimax

  • Minimax assumes the worst case for the opponent’s policy

\[ V_\text{minmax}(s) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \max_{a\in\text{Actions}(s)} V_\text{minmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{agent}\\ \min_{a\in\text{Actions}(s)} V_\text{minmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{opp}\\ \end{cases} \]
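A minimal recursive minimax sketch over an explicit game tree; the tree literal below is a made-up example (leaves are end-state utilities, internal nodes record which player moves).

```python
# Minimax over an explicit game tree. A node is either a number (utility at an
# end state) or a tuple ("agent" | "opp", [children...]).
tree = ("agent", [
    ("opp", [3, 12, 8]),
    ("opp", [2, 4, 6]),
    ("opp", [14, 5, 2]),
])

def minimax(node):
    if not isinstance(node, tuple):              # IsEnd: leaf utility
        return node
    player, children = node
    values = [minimax(c) for c in children]
    return max(values) if player == "agent" else min(values)

print(minimax(tree))    # 3 for this example tree
```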

Figure source: Stanford CS221 spring 2025, lecture slides week 5
  • Minimax properties

    • Best against minimax opponent \[ V(\pi_\max, \pi_\min) \geq V(\pi_\text{agent}, \pi_\min) \text{ for all } \pi_\text{agent} \]
    • Lower bound against any opponent \[ V(\pi_\max, \pi_\min) \leq V(\pi_\max, \pi_\text{opp}) \text{ for all } \pi_\text{opp} \]
    • Not optimal if the opponent’s policy is known! Here, \(\pi_7\) stands for an arbitrary known policy, for example, choosing uniformly at random. Playing expectimax against that known policy does at least as well as minimax: \[ V(\pi_\max, \pi_7) \leq V(\pi_{\text{exptmax}(7)}, \pi_7) \text{ for opponent } \pi_7 \]
  • Relationship between game values

Figure source: Stanford CS221 spring 2025, lecture slides week 5

Expectiminimax

  • Modify the game to introduce randomness
Figure source: Stanford CS221 spring 2025, lecture slides week 5
  • This is equivalent to having a third player, the coin \[ V_\text{exptminmax}(s) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \max_{a\in\text{Actions}(s)} V_\text{exptminmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{agent}\\ \min_{a\in\text{Actions}(s)} V_\text{exptminmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{opp}\\ \sum_{a\in\text{Actions}(s)} \pi_\text{coin}(s, a) V_\text{exptminmax}(\text{Succ}(s, a)) & \text{ Player}(s) = \text{coin}\\ \end{cases} \]

Evaluation functions

  • Computational complexity: with branching factor \(b\) and depth \(d\) (i.e., \(2d\) turns, since the two players alternate), tree search requires

    • \(O(d)\) space
    • \(O(b^{2d})\) time
  • Limited depth tree search: stop at maximum depth \(d_\max\) \[ V_\text{minmax}(s, d) = \begin{cases} \text{Utility}(s) & \text{ IsEnd}(s)\\ \text{Eval}(s) & d=0\\ \max_{a\in\text{Actions}(s)} V_\text{minmax}(\text{Succ}(s, a), d) & \text{ Player}(s) = \text{agent}\\ \min_{a\in\text{Actions}(s)} V_\text{minmax}(\text{Succ}(s, a), d-1) & \text{ Player}(s) = \text{opp}\\ \end{cases} \]

    • Use: at state \(s\), call \(V_\text{minmax}(s, d_\max)\)
  • An evaluation function Eval\((s)\) is a possibly very weak estimate of the value \(V_\text{minmax}(s)\), using domain knowledge

    • This is similar to FutureCost\((s)\) in A* search, but unlike A*, there are no guarantees on the approximation error.
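A depth-limited minimax sketch matching the recurrence above; the game interface used here (is_end, utility, evaluate, player, actions, succ) is an assumed one, not an API from the course code.

```python
# Depth-limited minimax with an evaluation function Eval(s) (sketch under an
# assumed game interface). At state s, call minimax_limited(game, s, d_max).
def minimax_limited(game, s, d):
    if game.is_end(s):
        return game.utility(s)
    if d == 0:
        return game.evaluate(s)                  # Eval(s): domain-knowledge estimate
    if game.player(s) == "agent":
        return max(minimax_limited(game, game.succ(s, a), d)
                   for a in game.actions(s))
    # The opponent's move ends the turn, so the depth decreases here.
    return min(minimax_limited(game, game.succ(s, a), d - 1)
               for a in game.actions(s))
```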

Alpha-Beta pruning

  • Branch and bound

    • Maintain lower and upper bounds on values.
    • If the intervals don’t overlap, then we can choose optimally without further work
  • Alpha-beta pruning for minimax:

    • \(a_s\): lower bound on value of max node \(s\)
    • \(b_s\): upper bound on value of min node \(s\)
    • Store \(\alpha_s = \max_{s'\leq s} a_{s'}\) and \(\beta_s = \min_{s'\leq s} b_{s'}\), where the max and min are taken over \(s\) and its ancestors \(s'\)
    • Prune a node if its interval doesn’t have non-trivial overlap with every ancestor
  • Pruning depends on the order

    • Worst ordering: \(O(b^{2\cdot d})\) time
    • Best ordering: \(O(b^{2\cdot 0.5 d})\) time
    • Random ordering: \(O(b^{2\cdot 0.75 d})\) time (when \(b=2\))
    • In practice, we can order based on the evaluation function Eval\((s)\):
      • Max nodes: order successors by decreasing Eval\((s)\)
      • Min nodes: order successors by increasing Eval\((s)\)
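A sketch of minimax with alpha-beta pruning, reusing the made-up explicit-tree node format from the minimax sketch above.

```python
import math

# Alpha-beta pruning: skip children once the [alpha, beta] interval becomes empty.
def alphabeta(node, alpha=-math.inf, beta=math.inf):
    if not isinstance(node, tuple):              # leaf utility
        return node
    player, children = node
    if player == "agent":                        # max node
        value = -math.inf
        for c in children:
            value = max(value, alphabeta(c, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                    # no non-trivial overlap: prune
                break
        return value
    value = math.inf                             # min node
    for c in children:
        value = min(value, alphabeta(c, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:
            break
    return value

tree = ("agent", [("opp", [3, 12, 8]), ("opp", [2, 4, 6]), ("opp", [14, 5, 2])])
print(alphabeta(tree))    # 3, while some leaves are never evaluated
```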

More Topics

TD learning (on-policy)

  • General learning framework

    • Objective function \[ \frac{1}{2}\left(\text{prediction}(\mathbf{w}) - \text{target} \right)^2 \]
    • Update \[ \mathbf{w} \leftarrow \mathbf{w} - \eta \left(\text{prediction}(\mathbf{w}) - \text{target} \right) \nabla_\mathbf{w} \text{prediction}(\mathbf{w}) \]
  • Temporal difference (TD) learning: on each \((s, a, r, s')\),

    \[ \mathbf{w} \leftarrow \mathbf{w} - \eta \left[V(s; \mathbf{w}) - \left(r + \gamma V(s'; \mathbf{w})\right) \right] \nabla_\mathbf{w} V(s; \mathbf{w}) \]

    • For linear function \[V(s; \mathbf{w}) = \mathbf{w} \cdot \phi(s)\]
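A sketch of the TD(0) update with a linear value function; the feature map `phi` is a made-up indicator featurizer.

```python
from collections import defaultdict

# TD learning of V(s; w) = w . phi(s) with linear features.
gamma, eta = 0.9, 0.1
w = defaultdict(float)

def phi(s):
    return {f"state={s}": 1.0}                   # made-up sparse features

def V(s):
    return sum(w[f] * v for f, v in phi(s).items())

def td_update(s, r, s_next):
    residual = V(s) - (r + gamma * V(s_next))    # prediction - target
    for f, v in phi(s).items():                  # grad_w V(s; w) = phi(s)
        w[f] -= eta * residual * v

td_update("in", 4, "end")
print(dict(w))
```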

Simultaneous games

  • Payoff matrix: for two players, \(V(a, b)\) is A’s utility if A chooses action \(a\) and B chooses action \(b\)

  • Strategies (policies)

    • Pure strategy: a single action
    • Mixed strategy: a probability distribution over actions
  • Game evaluation: the value of the game if player A follows \(\pi_A\) and player B follows \(\pi_B\) is \[ V(\pi_A, \pi_B) = \sum_{a, b} \pi_A(a) \pi_B(b) V(a, b) \]

  • For pure strategies, going second is no worse \[\max_a \min_b V(a, b) \leq \min_b \max_a V(a, b)\]

  • Mixed strategies: the second player can always respond with a pure strategy: for any fixed mixed strategy \(\pi_A\), \(\min_{\pi_B} V(\pi_A, \pi_B)\) can be attained by a pure strategy

  • Minimax theorem: for every simultaneous two-player zero-sum game with a finite number of actions \[ \max_{\pi_A} \min_{\pi_B} V(\pi_A, \pi_B) = \min_{\pi_B}\max_{\pi_A} V(\pi_A, \pi_B) \]

    • Revealing your optimal mixed strategy doesn’t hurt you.