Building Foundations in Reinforcement Learning (RL)

A comprehensive introduction to the fundamentals of Reinforcement Learning, covering its concepts, components, formalization, and key strategies

June 25, 2025

What is Reinforcement Learning?

Reinforcement Learning is a type of machine learning where:

An agent interacts with an environment

It observes the state of the environment

Takes an action

Receives a reward

Learns a policy to maximize cumulative rewards over time

Think of a child learning to ride a bicycle: they try, fall, and adjust, learning through trial and error.
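
The observe, act, receive-reward cycle described above can be sketched as a short loop. This is a minimal illustration, not a full learning algorithm: the `CoinFlipEnv` class, its `reset`/`step` methods, and the `always_heads` policy are all hypothetical names invented for this example, and the policy is fixed rather than learned.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.8, horizon=10, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        """Start a new episode and return the initial state."""
        self.t = 0
        return self.t  # the state is simply the step index

    def step(self, action):
        """Apply an action (0 = guess tails, 1 = guess heads)."""
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy):
    """Run one episode and return the cumulative (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # agent chooses an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward                   # rewards accumulate over time
    return total_reward


def always_heads(state):
    """A fixed policy: ignore the state and always guess heads."""
    return 1


episode_return = run_episode(CoinFlipEnv(), always_heads)
print(episode_return)
```

A learning agent would replace `always_heads` with a policy that is updated from the rewards it receives.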

Components of RL

Component	Description
Agent	The learner or decision-maker
Environment	The world with which the agent interacts
State (s)	A representation of the current situation
Action (a)	The move the agent can make
Reward (r)	Feedback signal from the environment
Policy (π)	Strategy used by the agent to choose actions
Value Function (V)	Expected long-term return from a state
Q-function (Q)	Expected return from taking an action in a state
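
The last two rows of the table can be written out using the standard textbook definitions. The notation below is an assumption, since the slides do not fix one: G_t is the return from time step t, γ is the discount factor from the MDP formulation, and the expectation is taken over trajectories generated by following policy π.

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s \right]

Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid s_t = s,\ a_t = a \right]
```

The only difference between V and Q is the conditioning: Q fixes the first action, then follows π.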

Markov Decision Process (MDP)

[Figure: diagram of four states S₁, S₂, S₃, S₄]

An MDP formalizes the RL problem:

States (S): The set of all possible situations in the environment

Actions (A): The set of all possible moves the agent can make

Transition probabilities (P): Likelihood of moving to each possible next state, given the current state and action

Rewards (R): Immediate feedback received after an action

Discount factor (γ): How much future reward is worth compared to immediate reward
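
The five components listed above map naturally onto a small data structure. The sketch below is one possible layout, not a standard API: the `MDP` class, its field names, the two-state example, and the `discounted_return` helper are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """Hypothetical container for the five MDP components."""
    states: list
    actions: list
    transitions: dict  # (state, action) -> {next_state: probability}
    rewards: dict      # (state, action) -> immediate reward
    gamma: float       # discount factor, between 0 and 1

    def validate(self):
        """Check that each transition distribution sums to 1."""
        for key, dist in self.transitions.items():
            total = sum(dist.values())
            assert abs(total - 1.0) < 1e-9, f"probabilities for {key} sum to {total}"


def discounted_return(rewards, gamma):
    """Sum a reward sequence, weighting the k-th reward by gamma**k."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


toy = MDP(
    states=["S1", "S2"],
    actions=["stay", "move"],
    transitions={
        ("S1", "stay"): {"S1": 1.0},
        ("S1", "move"): {"S2": 0.8, "S1": 0.2},
        ("S2", "stay"): {"S2": 1.0},
        ("S2", "move"): {"S1": 1.0},
    },
    rewards={
        ("S1", "stay"): 0.0,
        ("S1", "move"): 1.0,
        ("S2", "stay"): 2.0,
        ("S2", "move"): 0.0,
    },
    gamma=0.9,
)
toy.validate()
print(discounted_return([1.0, 1.0, 1.0], toy.gamma))
```

With γ = 0.9, three rewards of 1 are worth 1 + 0.9 + 0.81 = 2.71 (up to floating-point rounding) rather than 3, which shows how the discount factor shrinks the value of later rewards.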

Key Property:

In an MDP, the future state depends only on the current state and action, not on previous states or actions.
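
In code, this property means a transition function needs only the current state and action as inputs. The sketch below assumes a hypothetical transition table over four states named S1 to S4 with a single action `"go"`; the probabilities are made up for illustration.

```python
import random

# Hypothetical transition table: P[(state, action)] -> {next_state: probability}
P = {
    ("S1", "go"): {"S2": 0.9, "S1": 0.1},
    ("S2", "go"): {"S3": 0.9, "S2": 0.1},
    ("S3", "go"): {"S4": 1.0},
}


def sample_next_state(state, action, rng):
    """Sample the next state from P(. | state, action).

    Only the current state and action are consulted; earlier states
    and actions are not passed in at all.
    """
    dist = P[(state, action)]
    next_states = list(dist)
    weights = list(dist.values())
    return rng.choices(next_states, weights=weights, k=1)[0]


rng = random.Random(0)
trajectory = ["S1"]
while trajectory[-1] != "S4":  # S4 is treated as the terminal state
    trajectory.append(sample_next_state(trajectory[-1], "go", rng))
print(trajectory)
```

Because `sample_next_state` never sees the trajectory so far, the history cannot influence the next state except through the current state.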

Exploration vs Exploitation

Exploration

Trying new actions to learn more about the environment, even if they might not be immediately rewarding.

Example: Testing different routes to find a potentially faster path.

Exploitation

Choosing the best-known action based on current knowledge to maximize immediate rewards.

Example: Taking the route you know is reliable to ensure on-time arrival.

The Balance Challenge:

A successful RL agent must balance exploration and exploitation for optimal learning and performance:

Too much exploration: wastes resources on suboptimal actions

Too much exploitation: may miss better long-term strategies
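
One common way to strike this balance is epsilon-greedy action selection, a technique the slides do not name and which is used here only as an illustration. With probability epsilon the agent explores a random action; otherwise it exploits the action with the highest estimated value. The three-armed bandit, its reward means, and all function names below are assumptions made for this sketch.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


def run_bandit(true_means, epsilon, steps=2000, seed=0):
    """Play a hypothetical multi-armed bandit with Gaussian rewards."""
    rng = random.Random(seed)
    q = [0.0] * len(true_means)     # estimated value of each action
    counts = [0] * len(true_means)  # how often each action was tried
    total = 0.0
    for _ in range(steps):
        a = epsilon_greedy(q, epsilon, rng)
        reward = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]  # incremental sample mean
        total += reward
    return q, counts, total / steps


q, counts, avg_reward = run_bandit([0.2, 0.5, 0.8], epsilon=0.1)
print(counts, round(avg_reward, 3))
```

Setting `epsilon=0.0` gives pure exploitation and `epsilon=1.0` gives pure exploration; comparing the average reward across a few values of epsilon is a simple way to see the trade-off.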