Imagine playing Pac-Man – you control the yellow character, navigating through a maze, eating pellets while avoiding ghosts. As a human player, you're making decisions based on what you see and trying to maximize your score. Reinforcement Learning (RL) agents work similarly, but they're computer programs that learn to play games or solve problems through trial and error, just as you might get better at Pac-Man with practice.
An agent in reinforcement learning is essentially a decision-maker – it's the "brain" that chooses what actions to take in any given situation. The agent is constantly learning from its experiences, getting better at making decisions that lead to higher rewards over time.

Figure 1: Basic reinforcement learning agent-environment interaction diagram
The Core Components of Reinforcement Learning
The Agent-Environment Interaction Loop
At the heart of reinforcement learning lies a simple but powerful concept: the agent-environment interaction loop. This continuous cycle forms the foundation of how RL agents learn and make decisions.
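In code, this loop can be sketched with a toy environment and agent. Note that `SimpleEnv` and `RandomAgent` here are minimal stand-ins invented for illustration, not a real game or a standard library API – the point is just the shape of the cycle: observe state, choose action, receive reward and next state, repeat.

```python
import random

class SimpleEnv:
    """Toy environment: the state is just a step counter; episode ends after 5 steps."""
    def reset(self):
        self.t = 0
        return self.t  # initial state

    def step(self, action):
        self.t += 1
        reward = 1 if action == "right" else 0  # arbitrary toy reward
        done = self.t >= 5                      # episode termination
        return self.t, reward, done             # next state, reward, done flag

class RandomAgent:
    """An agent with no learning yet: it picks actions uniformly at random."""
    def act(self, state):
        return random.choice(["up", "down", "left", "right"])

env, agent = SimpleEnv(), RandomAgent()
state = env.reset()
total_reward = 0
done = False
while not done:
    action = agent.act(state)               # agent chooses an action
    state, reward, done = env.step(action)  # environment responds
    total_reward += reward                  # the learning signal accumulates
```

A learning agent would use the `(state, action, reward, next state)` experience inside this loop to improve its choices over time; the random agent above never does.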
Understanding the Environment
The environment is everything the agent interacts with. In Pac-Man, the environment includes the maze layout, the positions of pellets, the locations and movements of ghosts, and the current score. Think of the environment as the "world" in which the agent operates.

Figure 2: Pac-Man game environment showing agent, ghosts, and pellets
States: The Agent's Perception of the World
A state represents the current situation or configuration of the environment that the agent observes. In Pac-Man, a state might include:
- Pac-Man's position
- Ghost positions
- Pellet status
- Power pellet activity
- Score
Mathematically, we can represent the state space as S, which contains all possible states. The current state at time t is denoted as s_t.
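One simple way to encode such a state in code is an immutable record. The field names below are illustrative choices, not a standard representation – real implementations often use grids or feature vectors instead:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PacManState:
    """A simplified Pac-Man state s_t; fields are illustrative, not canonical."""
    pacman_pos: tuple        # (row, col) of Pac-Man
    ghost_positions: tuple   # tuple of (row, col) pairs, one per ghost
    pellets_remaining: int   # how many pellets are left
    powered_up: bool         # is a power pellet currently active?
    score: int               # current game score

# One concrete state s_t drawn from the state space S
s_t = PacManState(
    pacman_pos=(3, 2),
    ghost_positions=((1, 1), (5, 4)),
    pellets_remaining=42,
    powered_up=False,
    score=150,
)
```

Making the state immutable (`frozen=True`) is handy because states can then be hashed and used as dictionary keys, e.g. when storing Q-values per state.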
Actions: The Choices Available
Actions are the decisions or moves that an agent can make in any given state. In Pac-Man, the action space A is relatively simple:
- Move Up
- Move Down
- Move Left
- Move Right
- Stay Still (if supported)
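A small discrete action space like this maps naturally onto an enumeration. The `(row, col)` direction vectors below assume a grid where row increases downward – an illustrative convention, not something fixed by the game:

```python
from enum import Enum

class Action(Enum):
    """The Pac-Man action space A as (row delta, col delta) pairs."""
    UP = (-1, 0)
    DOWN = (1, 0)
    LEFT = (0, -1)
    RIGHT = (0, 1)
    STAY = (0, 0)  # only if the game variant supports standing still

def apply(pos, action):
    """Move a (row, col) position by the action's direction (ignores walls)."""
    dr, dc = action.value
    return (pos[0] + dr, pos[1] + dc)
```

A real environment would also check the maze layout and reject moves into walls; `apply` here only handles the geometry.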
Rewards: The Learning Signal
Rewards are numerical values that the environment gives to the agent after each action. In Pac-Man, the reward structure might look like:
- +10: Regular pellet
- +50: Power pellet
- +200: Eat ghost while powered
- -500: Caught by ghost
- +1000: Clear all pellets
- -1: Each move (efficiency penalty)
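The reward structure above can be written as a small lookup. One assumption made explicit here: the per-move penalty applies to every step, on top of any event bonus, so eating a regular pellet on a move nets +9 rather than +10:

```python
# Hypothetical reward function mirroring the list above.
EVENT_REWARDS = {
    "pellet": 10,
    "power_pellet": 50,
    "eat_ghost": 200,        # only possible while powered up
    "caught_by_ghost": -500,
    "level_clear": 1000,
}
MOVE_COST = -1  # efficiency penalty charged on every step

def reward(event=None):
    """Return the reward for one step; `event` is None for a plain move."""
    return MOVE_COST + EVENT_REWARDS.get(event, 0)
```

The exact numbers matter less than their relative scale: being caught must outweigh many pellets, or the agent will learn to trade its life for points.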
Policy: The Agent's Strategy
A policy is the strategy that determines what action the agent should take in any given state. It is denoted as π (pi) and can be written as π(s) for deterministic policies or π(a|s) for stochastic policies.
Deterministic vs. Stochastic Policies
Deterministic policies always choose the same action for a given state. For example: "If there's a pellet to the right, always move right."
Stochastic policies assign probabilities to different actions. For example: "If there's a pellet to the right, move right with 70% probability, but also consider other directions with lower probabilities to avoid getting trapped."
Mathematically, a stochastic policy gives the probability of taking action a in state s:

π(a|s) = P(Aₜ = a | Sₜ = s), where Σₐ π(a|s) = 1 for every state s.
Example Policy in Pac-Man
Policy π:
- If ghost is adjacent and no power pellet active: → Move away from ghost (probability split evenly among the safe directions)
- Else if power pellet available and ghost nearby: → Move toward power pellet (80% probability)
- Else if pellet visible: → Move toward nearest pellet (60% probability)
- Else: → Explore randomly (25% probability each direction)
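The stochastic case can be sketched as a function that returns a probability distribution over actions, from which the agent then samples. The simplified state dictionary and the 70/10/10/10 split below are illustrative assumptions, not the policy from the example above:

```python
import random

def stochastic_policy(state):
    """Illustrative π(a|s): returns {action: probability} for a simplified state dict."""
    if state.get("pellet_right"):
        # Favor moving right, but keep some probability on other moves
        # to avoid getting trapped in predictable patterns.
        return {"right": 0.7, "up": 0.1, "down": 0.1, "left": 0.1}
    # No information: explore uniformly at random.
    return {a: 0.25 for a in ("up", "down", "left", "right")}

def sample_action(probs):
    """Sample one action according to the policy's probabilities."""
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights)[0]
```

A deterministic policy would simply return a single action instead of a distribution – equivalently, a distribution that puts probability 1 on one action.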
The Q-Function: Estimating Action Values
The Q-function (Quality function) estimates how good it is to take a specific action in a specific state, considering all future rewards.
Mathematical Definition
The Q-function is denoted as Q(s, a) and represents the expected cumulative discounted reward from taking action a in state s and acting well thereafter:

Q(s, a) = E[Rₜ₊₁ + γRₜ₊₂ + γ²Rₜ₊₃ + ⋯ | Sₜ = s, Aₜ = a]

Where:
- γ (gamma) is the discount factor (0 ≤ γ < 1)
- Rₜ₊₁ is the reward at time t+1
- E[] denotes expected value
Simple Example: Q-Values in Pac-Man
Let's say Pac-Man is at position (3,2) with pellets in multiple directions. The Q-values might look like:
- Q((3,2), "Move Right") = 85 (leads to pellet + safe path)
- Q((3,2), "Move Left") = 30 (leads to dead end)
- Q((3,2), "Move Up") = -200 (ghost is there!)
- Q((3,2), "Move Down") = 60 (neutral space)
The agent would choose "Move Right" because it has the highest Q-value.
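Picking the action with the highest Q-value – greedy action selection – is a one-liner once the Q-values are stored, for example in a dictionary keyed by action:

```python
# Q-values for state (3, 2), copied from the example above.
q_values = {
    "Move Right": 85,   # leads to pellet + safe path
    "Move Left": 30,    # leads to dead end
    "Move Up": -200,    # ghost is there!
    "Move Down": 60,    # neutral space
}

# Greedy selection: the action whose Q-value is largest.
best_action = max(q_values, key=q_values.get)
```

In practice agents usually mix greedy selection with occasional random actions (e.g. ε-greedy) so they keep exploring; always acting greedily on early, inaccurate Q-values can lock in bad habits.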
Conclusion: The Power of Learning Through Interaction
Reinforcement learning agents represent a powerful paradigm where artificial intelligence systems learn optimal behavior through trial-and-error interaction with their environment. Using Pac-Man as our example, we've seen how agents can start with no knowledge and gradually develop sophisticated strategies.
The same principles that allow an agent to master Pac-Man can be applied to training robots, optimizing resource allocation, controlling autonomous vehicles, or even developing game-playing AIs that surpass human performance.
But we're just getting started! In the next blog, we'll dive into a real-world application—how we're implementing reinforcement learning in Pixie to respond dynamically to user feedback. Stay tuned.
