Building Foundations in Reinforcement Learning (RL)
A comprehensive introduction to the fundamentals of Reinforcement Learning, covering its concepts, components, formalization, and key strategies
June 25, 2025
What is Reinforcement Learning?
Reinforcement Learning is a type of machine learning where:
An agent interacts with an environment
It observes the state of the environment
Takes an action
Receives a reward
Learns a policy to maximize cumulative rewards over time
Think of a child learning to ride a bicycle: they try, fall, and adjust, learning by trial and error.
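The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a learning algorithm: the `Corridor` environment, the `run_episode` helper, and both policies are hypothetical names invented here, and the agent only acts and collects reward without improving its policy.

```python
import random


class Corridor:
    """Hypothetical toy environment: a 5-cell corridor with the goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the initial state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent chooses an action
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(state):
    # Ignores the state and moves left or right with equal probability
    return random.choice([-1, 1])


def always_right(state):
    # Always moves toward the goal
    return 1
```

Calling `run_episode(Corridor(), always_right)` reaches the goal in four steps and collects a total reward of 1.0. A learning agent would use the observed rewards to move its policy away from `random_policy` and toward something like `always_right`.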
Components of RL
Component | Description |
Agent | The learner or decision-maker |
Environment | The world with which the agent interacts |
State (s) | A representation of the current situation |
Action (a) | The move the agent can make |
Reward (r) | Feedback signal from the environment |
Policy (π) | Strategy used by the agent to choose actions |
Value Function (V) | Expected long-term return from a state |
Q-function (Q) | Expected return from taking an action in a state |
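The last two rows can be written more precisely. The following uses common textbook notation rather than anything defined in this article: the return $G_t$, the superscript $\pi$ (meaning "while following policy π"), and the discount factor $\gamma$ between 0 and 1 are assumptions introduced here for illustration.

```latex
% Return: the discounted sum of future rewards from time t onward
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

% Value function: expected return from state s when following policy \pi
V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]

% Q-function: expected return from taking action a in state s, then following \pi
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\ a_t = a \right]

% The two are linked by averaging Q over the actions the policy would choose
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \, Q^{\pi}(s, a)
```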
Markov Decision Process (MDP)
[Figure: diagram of example states S₁, S₂, S₃, S₄]
An MDP formalizes the RL problem:
States (S): The set of all possible situations in the environment
Actions (A): The set of all possible moves the agent can make
Transition probabilities (P): Likelihood of moving from one state to another, given the action taken
Rewards (R): Immediate feedback received after an action
Discount factor (γ): How much future reward is worth compared to immediate reward
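To make the five components concrete, here is a minimal sketch of a tiny MDP written as plain Python data. The two states, two actions, and all probabilities and rewards are made up for illustration, and the helper names `sample_next_state` and `discounted_return` are hypothetical.

```python
import random

# Hypothetical two-state MDP; every name and number below is illustrative.
STATES = ["cool", "warm"]
ACTIONS = ["slow", "fast"]
GAMMA = 0.9  # discount factor

# P[(state, action)] -> list of (next_state, probability) pairs
P = {
    ("cool", "slow"): [("cool", 1.0)],
    ("cool", "fast"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "slow"): [("cool", 0.5), ("warm", 0.5)],
    ("warm", "fast"): [("warm", 1.0)],
}

# R[(state, action)] -> immediate reward
R = {
    ("cool", "slow"): 1.0,
    ("cool", "fast"): 2.0,
    ("warm", "slow"): 1.0,
    ("warm", "fast"): -1.0,
}


def sample_next_state(state, action, rng):
    """Draw the next state according to the transition probabilities."""
    next_states, probs = zip(*P[(state, action)])
    return rng.choices(next_states, weights=probs, k=1)[0]


def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates to 1.75: the same reward counts for less the later it arrives. With γ close to 1 the agent values future reward almost as much as immediate reward, and with γ close to 0 it becomes short-sighted.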
Key Property (the Markov property):
In an MDP, the future state depends only on the current state and action, not on previous states or actions.
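Written as an equation, this property says that conditioning on the full history adds nothing beyond the current state and action. The random-variable notation ($S_t$, $A_t$, and $s'$ for the next state) is standard textbook usage and is an assumption here, not notation from this article.

```latex
% Markov property: the full history carries no more information than (s_t, a_t)
P\left( S_{t+1} = s' \mid S_t = s_t,\ A_t = a_t,\ S_{t-1} = s_{t-1},\ A_{t-1} = a_{t-1},\ \ldots,\ S_0 = s_0,\ A_0 = a_0 \right)
  = P\left( S_{t+1} = s' \mid S_t = s_t,\ A_t = a_t \right)
```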
Exploration vs Exploitation
Exploration
Trying new actions to learn more about the environment, even if they might not be immediately rewarding.
Example: Testing different routes to find a potentially faster path.
Exploitation
Choosing the best-known action based on current knowledge to maximize expected reward.
Example: Taking the route you know is reliable to ensure on-time arrival.
The Balance Challenge:
A successful RL agent must balance exploration and exploitation for optimal learning and performance:
Too much exploration: wastes resources on suboptimal actions
Too much exploitation: may miss better long-term strategies
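One common way to strike this balance is the epsilon-greedy rule: with a small probability ε the agent explores by picking a random action, and otherwise it exploits its current best estimate. The sketch below applies that rule to a multi-armed bandit, a simplified setting with a single state. The function name, the Gaussian reward noise, and the default parameter values are assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=2000, seed=0):
    """Run epsilon-greedy on a bandit whose arms pay Gaussian rewards.

    true_means holds each arm's hidden average reward (illustrative values).
    Returns the estimated mean reward and the pull count for every arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Noisy reward centered on the arm's true mean (standard deviation 1)
        reward = rng.gauss(true_means[arm], 1.0)

        # Incremental update of the running average for the chosen arm
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts
```

Calling `epsilon_greedy_bandit([0.0, 0.5, 2.0])` returns per-arm estimates and pull counts. Setting `epsilon=1.0` gives pure exploration, where every arm is pulled about equally often regardless of payoff. Setting `epsilon=0.0` gives pure exploitation, which can lock onto whichever arm looked best early on.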