TD (Temporal Difference): TD is a class of algorithms that update value estimates by bootstrapping: each update uses the reward just received plus the current estimate of the next state's value, instead of waiting for the final return of the episode.
SARSA (State-Action-Reward-State-Action): SARSA is a specific TD method that updates the Q-value using the action the agent actually takes in the next state. Because that next action comes from the policy the agent is currently following, SARSA is an on-policy method.
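As a minimal sketch of the TD idea described above, the following applies a single TD(0) update to a table of state values. The function name `td0_update`, the dictionary-based table, and the step size `alpha` and discount `gamma` values are assumptions made for illustration; none of them come from the original notes.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to the state-value table V (a dict) in place."""
    # Bootstrapped target: observed reward plus the discounted estimate
    # of the next state's value (zero if the next state is terminal).
    target = r if terminal else r + gamma * V[s_next]
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


# Hypothetical two-state example.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B")
```

The update moves V["A"] a fraction `alpha` of the way toward the target, so a single transition is enough to learn something without waiting for the episode to finish.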
On-Policy vs. Off-Policy Learning:
Q-learning (off-policy TD):
Off-policy: Q-learning is a TD method that learns the value of the best available action in the next state, regardless of the action the agent actually takes there. It therefore learns about the greedy policy while the agent may be behaving according to a different, exploratory policy.
Update rule: The target uses the maximum estimated Q-value over the actions in the next state, where α is the step size and γ is the discount factor:
$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_{a} Q(S', a) - Q(S, A) \right]$
Exploration vs. Exploitation: Because the update uses the maximum Q-value in the next state, Q-learning learns toward the optimal policy even while the agent is taking exploratory, suboptimal actions.
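The Q-learning update can be sketched in a few lines. This is an illustrative version that assumes a tabular Q stored as a list of rows (one row per state, one entry per action); the function name and the default `alpha` and `gamma` values are hypothetical.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """One tabular Q-learning update; Q[s][a] is modified in place."""
    # Off-policy target: bootstrap from the best action in the next state,
    # whatever action the agent ends up taking there.
    best_next = 0.0 if terminal else max(Q[s_next])
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical table with two states and two actions.
Q = [[0.0, 0.0], [1.0, 3.0]]
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
```

Note that the next action never appears as an argument: the update only needs the next state.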
SARSA (State-Action-Reward-State-Action) uses a Q-function (action-value function) rather than a state-value function because it is a control algorithm: it evaluates and improves a policy by learning the value of state-action pairs, which lets the agent compare actions directly without a model of the environment. Here's why the Q-function is necessary and how it differs from the value function:
1. Value Function vs. Q-Function
Value Function (V(s)): This measures the expected return (sum of future rewards) starting from a particular state s, following a policy. It doesn’t specify which action is taken, just the value of being in that state under a given policy.
Q-Function (Q(s, a)): This measures the expected return starting from a specific state s, taking a specific action a, and then following a policy from that point onward. The Q-function explicitly evaluates the quality of taking an action in a state.
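The two functions are linked: under a policy, the value of a state is the policy-weighted average of the Q-values of its actions. The sketch below computes that average for one state; the function name and the example numbers are made up for illustration.

```python
def state_value_from_q(q_row, policy_row):
    """V(s) as the average of Q(s, a) weighted by the policy's action probabilities."""
    return sum(p * q for p, q in zip(policy_row, q_row))


# Hypothetical state with two actions: Q(s, a0) = 1.0 and Q(s, a1) = 3.0.
q_row = [1.0, 3.0]
policy_row = [0.25, 0.75]   # probability of choosing each action in this state
v = state_value_from_q(q_row, policy_row)
```

Going from Q to V is a simple average, but going from V back to a choice of action requires knowing where each action leads, which is why a model-free control method keeps Q-values.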
2. Why Use Q-Function in SARSA?
SARSA is about learning from interactions: SARSA aims to learn the quality of the actions (how good a specific action is in a specific state). In each step, SARSA updates the Q-value based on the action chosen by the current policy, hence learning Q-values, not state values.
Action Selection: The Q-function allows the agent to choose the best action based on the learned values. The goal in SARSA is not just to evaluate the value of states but to decide which actions to take in each state, which is critical for making policy decisions.
On-Policy Nature: SARSA directly learns from the actions it takes. Since the algorithm follows its policy to select actions, it updates the Q-value for the specific state-action pair (s, a) based on the next state and action (s', a'), rather than only updating the state’s value.
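To make the action-selection point concrete, here is a minimal sketch of ε-greedy selection from one row of a Q-table. The ε-greedy rule is one common choice of policy and is assumed here for illustration; the function name and its parameters are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from the Q-values of a single state."""
    # Explore: with probability epsilon, choose uniformly at random.
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    # Exploit: otherwise choose the action with the highest Q-value.
    return max(range(len(q_row)), key=lambda a: q_row[a])


rng = random.Random(0)                                   # seeded for repeatability
greedy_action = epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.0, rng=rng)
```

With only state values V(s), this selection step would need a model of the environment to look ahead one step; with Q-values it is a lookup.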
SARSA:
On-policy: SARSA is an on-policy algorithm because it updates the Q-function using the action that the current policy actually chooses. Each update involves both the current state-action pair and the next state-action pair, with both actions selected by the current policy (which might not be optimal yet).
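Update rule:
$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$
Here, A’ is the action chosen in the next state S’ under the current policy, α is the step size, and γ is the discount factor.
Exploration vs. Exploitation: Since SARSA uses the next action A’ that was actually taken, its updates reflect the current exploration-exploitation trade-off during learning.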
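A minimal sketch of the SARSA update follows, assuming a tabular Q stored as a list of rows. The function name and the default `alpha` and `gamma` values are hypothetical.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, terminal=False):
    """One tabular SARSA update; Q[s][a] is modified in place."""
    # On-policy target: bootstrap from the action a_next that the agent
    # will actually take in s_next, even if it is an exploratory one.
    next_q = 0.0 if terminal else Q[s_next][a_next]
    td_error = r + gamma * next_q - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical table; the agent's next action (index 0) is not the best one.
Q = [[0.0, 0.0], [1.0, 3.0]]
sarsa_update(Q, s=0, a=0, r=1.0, s_next=1, a_next=0)
```

The extra `a_next` argument is the whole difference from Q-learning: the caller has to select the next action before the update can run.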
Risk and Behavior During Exploration:
Q-learning (TD Off-policy): Its update assumes the agent will act greedily from the next state onward, so it always bootstraps from the best available action. The learned values therefore ignore the cost of exploratory mistakes, which can lead to riskier behavior while the agent is still exploring.
SARSA (On-policy TD): Since SARSA updates based on the actual actions taken, its value estimates include the effect of the current policy’s exploratory actions (for example, under ε-greedy selection). This tends to result in safer behavior during exploration since the updates account for the potentially suboptimal actions the agent is currently taking.
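The difference in risk sensitivity can be seen by comparing the two targets for the same transition. The sketch below computes the Q-learning target and the average of the SARSA target over an ε-greedy choice of the next action. All numbers are invented for illustration, and the averaged target is used here only to summarize what SARSA sees over many visits; it is not an extra step of the SARSA algorithm as described in these notes.

```python
def q_learning_target(r, q_next, gamma):
    """Target that bootstraps from the best next action."""
    return r + gamma * max(q_next)


def average_sarsa_target(r, q_next, gamma, epsilon):
    """SARSA target averaged over an epsilon-greedy choice of the next action."""
    n = len(q_next)
    greedy = max(range(n), key=lambda a: q_next[a])
    # Epsilon-greedy probabilities: epsilon spread uniformly, rest on the greedy action.
    probs = [epsilon / n + (1.0 - epsilon if a == greedy else 0.0) for a in range(n)]
    return r + gamma * sum(p * q for p, q in zip(probs, q_next))


# Hypothetical next state beside a hazard: one safe action, one disastrous action.
q_next = [-10.0, -100.0]
t_q = q_learning_target(r=-1.0, q_next=q_next, gamma=1.0)
t_sarsa = average_sarsa_target(r=-1.0, q_next=q_next, gamma=1.0, epsilon=0.1)
```

Here the Q-learning target ignores the disastrous action entirely, while the averaged SARSA target is pulled down by the small chance of taking it, so states next to the hazard look worse to SARSA.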
Use Cases and Practical Differences:
Q-learning (TD Off-policy): It learns the values of the optimal policy directly, even while the agent is exploring, but it may behave riskily during the exploration phase. It's useful in environments where you care about learning the optimal policy and can tolerate some level of risk during learning.
SARSA (On-policy TD): It typically results in safer, more stable learning during exploration since it accounts for the agent’s exploration. This makes SARSA useful in environments where risky behavior can lead to severe penalties or where the agent needs to balance exploration and exploitation smoothly.
In a cliff-walking scenario, Q-learning may learn to take risky shortcuts near the cliff edge to maximize rewards since it updates based on the best possible future action. However, during exploration, it might fall off the cliff multiple times.
SARSA, on the other hand, accounts for the risk of falling off the cliff during exploration because it updates based on the actual actions taken (including suboptimal ones), resulting in more cautious behavior near the cliff edge.
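The cliff-walking comparison can be tried with a small tabular experiment. The sketch below is one possible setup, not the one the notes necessarily refer to: the 4x12 grid, the rewards of -1 per step and -100 for the cliff, and the values of `alpha`, `gamma`, and `epsilon` are assumptions chosen to resemble the common textbook version, and every name in the code is hypothetical.

```python
import random

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right


def step(state, action):
    """Return (next_state, reward, done); moves off the grid are clipped."""
    r = min(max(state[0] + MOVES[action][0], 0), ROWS - 1)
    c = min(max(state[1] + MOVES[action][1], 0), COLS - 1)
    if r == 3 and 0 < c < 11:                # stepped into the cliff
        return START, -100, False
    return (r, c), -1, (r, c) == GOAL


def choose(Q, state, epsilon, rng):
    """Epsilon-greedy action selection for one state."""
    if rng.random() < epsilon:
        return rng.randrange(4)
    return max(range(4), key=lambda a: Q[state][a])


def train(method, episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    """Run tabular SARSA or Q-learning; return the Q-table and per-episode returns."""
    rng = random.Random(seed)
    Q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}
    returns = []
    for _ in range(episodes):
        s, total, done = START, 0, False
        a = choose(Q, s, epsilon, rng)
        while not done:
            s2, reward, done = step(s, a)
            if method == "sarsa":
                a2 = choose(Q, s2, epsilon, rng)   # the action actually taken next
                bootstrap = Q[s2][a2]
            else:
                bootstrap = max(Q[s2])             # best action, taken or not
            if done:
                bootstrap = 0.0
            Q[s][a] += alpha * (reward + gamma * bootstrap - Q[s][a])
            if method != "sarsa":
                a2 = choose(Q, s2, epsilon, rng)   # Q-learning picks after updating
            s, a, total = s2, a2, total + reward
        returns.append(total)
    return Q, returns


for method in ("sarsa", "q_learning"):
    _, returns = train(method)
    print(method, sum(returns[-100:]) / 100)       # mean return, last 100 episodes
```

No expected output is shown because the numbers depend on the seed and parameters. Comparing the mean online returns and the greedy path read from each Q-table is one way to check the claim above that SARSA tends to keep away from the cliff edge while Q-learning walks along it.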