Simplified Reinforcement Learning: Q Learning

Q Learning, a model-free reinforcement learning algorithm, aims to learn the quality of actions and tell an agent which action to take under which circumstances. Over the course of this blog, we will learn more about Q Learning and its learning process with the help of an example.

Contributed by: Rahul Purohit

Richard S. Sutton and Andrew G. Barto, in their book “Reinforcement Learning – An Introduction”, widely considered the gold standard, give a very intuitive definition: “Reinforcement learning is learning what to do—how to map situations to actions—to maximize a numerical reward signal.” The field of reinforcement learning (RL from now on) is not new. It was initiated as early as the 1960s (earlier referred to as a “hedonistic” learning system). However, it failed to gain popularity at the time, while Supervised Learning (SL) attracted the interest of a large group of researchers. Only in the last decade or so have researchers come to realize the untapped potential of RL. DeepMind’s AlphaGo and AlphaZero are some brilliant examples of the power of RL, and this is just the beginning.

This would also be a good place to understand how RL differs from SL. Put simply, SL is learning from a set of examples provided by an external supervisor, each example being a description of a phenomenon (independent variables) with a result (dependent variable) associated with it. The objective of SL is to exploit the knowledge gained from the training examples and use it to determine the result for unseen data. This works well for many problems, but it fails for interactive problems (e.g. games, robotic manoeuvres etc.) where gathering a set of examples that is both representative and exhaustive is not feasible. This is where RL systems come to the rescue. RL systems can learn without a set of examples explicitly given by an external supervisor; instead, the agent itself interacts with the environment and figures out a combination of actions that leads to the desired outcome.

There are two popular learning approaches:

1. Policy Based-

In this learning approach, a policy i.e. a function mapping each state to the best action is optimized. Once we have a well-defined policy, the agent can determine the best action to take by giving the current state as an input to the policy.

We can further divide the policies into two types-

Deterministic – A policy at a given state returns a unique action

S=(s) ➡ A= (a)

Stochastic – Instead of returning a unique action, it returns a probability distribution of actions at a given state.

Policy ➡ p (A = a | S = s)
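The difference between the two policy types can be sketched in a few lines of Python. This is only an illustration: the three states, the action names and the probabilities below are made up for the example and do not come from the game described later.

```python
import random

# Hypothetical 3-state example; states, actions and probabilities are illustrative.
ACTIONS = ["up", "down", "left", "right"]

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {0: "right", 1: "down", 2: "right"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {
    0: {"up": 0.1, "down": 0.1, "left": 0.1, "right": 0.7},
    1: {"up": 0.05, "down": 0.8, "left": 0.05, "right": 0.1},
    2: {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},
}


def act_deterministic(state):
    """Always returns the same action for a given state."""
    return deterministic_policy[state]


def act_stochastic(state, rng=random):
    """Samples an action according to p(A = a | S = s)."""
    distribution = stochastic_policy[state]
    actions = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(actions, weights=weights, k=1)[0]
```

Calling act_deterministic(0) always gives the same answer, while repeated calls to act_stochastic(0) will mostly, but not always, return "right".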

2. Value Based-

In value-based RL, the objective is to optimize a value function, a function (which can be thought of as a simple lookup table) that maps each state to the maximum future reward expected from it. The value of each state is the total amount of reward an RL agent can expect to receive from that state until the goal is fulfilled.

Q Learning

Q Learning comes under value-based learning algorithms. The objective is to optimize a value function suited to a given problem/environment. The ‘Q’ stands for quality; it helps in finding the next action resulting in a state of the highest quality. This approach is rather simple and intuitive, and it is a very good place to start the RL journey. The values are stored in a table, called a Q Table.

Let us devise a simple 2D game environment of size 4 x 4 and understand how Q Learning can be used to arrive at the best solution.

Goal: Guide the kid to the Park

Reward System:
A. Get candy = +10 points
B. Encounter Dog = -50 points
C. Reach Park = +50 points

End of an Episode:
A. Encounter Dog
B. Reach Park

Now let us see how a typical Q Learning agent will play this game. First, let us create a Q Table where we will keep track of all values associated with each state. The Q Table will have as many rows as there are states in the problem, i.e. 16 in our case, and as many columns as there are actions an agent can take, which happens to be 4 (Up, Down, Left & Right).

STATES / ACTIONS	UP	DOWN	LEFT	RIGHT
1 (START)	0	0	0	0
2	0	0	0	0
……	0	0	0	0
16	0	0	0	0

Sample Q-Table for 4 x 4 2D game environment

Learning Process

Step 1: Initialization

When the agent plays the game for the first time, it has no prior knowledge so let’s initialize the table with zeroes.
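A minimal sketch of this initialization in plain Python is shown below. The names N_STATES, ACTIONS and make_q_table are chosen for this example only, and the rows are indexed 0 to 15 (so state 1 in the table above is row 0).

```python
N_STATES = 16                              # 4 x 4 grid
ACTIONS = ["UP", "DOWN", "LEFT", "RIGHT"]  # one column per action


def make_q_table(n_states=N_STATES, n_actions=len(ACTIONS)):
    """Return an n_states x n_actions table filled with zeroes."""
    # Build each row separately so that rows do not share the same list object.
    return [[0.0] * n_actions for _ in range(n_states)]


q_table = make_q_table()

# Example lookup: the value of taking RIGHT in state 1 (row index 0).
value = q_table[0][ACTIONS.index("RIGHT")]
```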

Step 2: Exploitation OR Exploration

Now the agent can interact with the environment in two ways: either it can use information already stored in the Q Table, i.e. exploit, or it can venture into uncharted territory, i.e. explore. Exploitation becomes very useful when the agent has played a high number of episodes and has information about the environment, whereas exploration is important when the agent is naïve and does not have much experience. This tradeoff between exploitation and exploration can be handled with a parameter called epsilon: with probability epsilon the agent explores by picking a random action, and otherwise it exploits by picking the action with the highest Q value. Ideally, in the initial stages we would like to give more preference to exploration (a high epsilon), while in the later stages exploitation is more useful (a low epsilon).

In Step 2, the agent takes an action (exploit or explore).
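One common way to implement this choice is an epsilon-greedy rule with a decaying epsilon. The sketch below assumes the Q Table is a list of rows (one row per state); the function names and the decay constants are illustrative, not values prescribed by this article.

```python
import random


def choose_action(q_row, epsilon, rng=random):
    """Epsilon-greedy choice over one row of the Q Table."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: pick the action with the highest Q value (ties broken at random).
    best = max(q_row)
    best_actions = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(best_actions)


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Start with mostly exploration and shrink epsilon as episodes go by."""
    return max(end, start * decay ** episode)
```

With epsilon = 0 the agent always exploits, and with epsilon = 1 it always explores; the decay schedule moves gradually from the second behaviour towards the first.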

Step 3: Measure Reward

After the agent performs the action decided in step 2, it reaches the next state, say s’. At state s’, any of the four actions can again be performed, each one leading to a different reward score.

For example, suppose the boy moves from 1 to 5; now either 6 or 9 can be selected. To find the reward value for state 5, we find the reward values of all the future states, i.e. 6 & 9, and select the maximum value.

At 5, there are two options (for simplicity, retracing steps is not considered)–

Go to 9 : End of Episode
Go to 6 : At state 6 there are again 3 options –

Go to 7 – End of Episode
Go to 2 – Continue until the end of the episode and find the reward
Go to 10 – Continue until the end of the episode and find the reward
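The options listed above follow from the grid geometry. The sketch below assumes the states are numbered 1 to 16 row by row (which is consistent with the move from 1 down to 5); the helper name neighbours is made up for this example.

```python
SIZE = 4  # 4 x 4 grid, states numbered 1..16 row by row (assumed numbering)


def neighbours(state, size=SIZE):
    """Return {action: next_state} for every move that stays on the grid."""
    row, col = divmod(state - 1, size)
    candidates = {
        "UP": (row - 1, col),
        "DOWN": (row + 1, col),
        "LEFT": (row, col - 1),
        "RIGHT": (row, col + 1),
    }
    return {
        action: r * size + c + 1
        for action, (r, c) in candidates.items()
        if 0 <= r < size and 0 <= c < size
    }
```

For state 5 this gives states 1, 9 and 6; leaving out 1 (retracing) leaves the two options 6 and 9. For state 6 it gives 2, 10, 5 and 7; leaving out 5 leaves the three options 7, 2 and 10.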

Sample Calculation

Path A reward = 10 + 50 = 60

Path B reward = 50

Max Reward = 60 (Path A)

Total Rewards at State 5: -50 (Faced dog at 9), 10 + 60 (Max reward from State 6 onwards)

Value of reward at 5 = Max (-50 , 10+60 ) = 70
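The same arithmetic can be written out in Python. The reward constants come from the reward system above; reading the extra 10 points for stepping onto state 6 as a candy is an assumption made here, since the sample calculation only gives the number.

```python
REWARD_CANDY = 10
REWARD_DOG = -50
REWARD_PARK = 50

# Rewards collected from state 6 onwards along the two paths.
path_a = REWARD_CANDY + REWARD_PARK    # 10 + 50
path_b = REWARD_PARK                   # 50
best_from_6 = max(path_a, path_b)

# Options at state 5.
via_9 = REWARD_DOG                     # meets the dog, episode ends
via_6 = REWARD_CANDY + best_from_6     # 10 for reaching 6 (assumed candy) + best path after it
value_at_5 = max(via_9, via_6)

print(best_from_6, value_at_5)
```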

Step 4: Update the Q table

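The reward value calculated in step 3 is then used to update the Q value at state 5 using the Bellman equation:

New Q(s, a) = Q(s, a) + α × [R(s, a) + γ × max Q(s’, a’) − Q(s, a)]

Here, s and a are the current state and action, R(s, a) is the reward received, s’ is the next state, and the maximum is taken over all actions a’ available at s’.

Learning rate (α) = A constant which determines how much weight you want to give to the new value vs the old value.

Discount rate (γ) = A constant that discounts the effect of future rewards (typically 0.8 to 0.99), i.e., it balances the effect of future rewards in the new values.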
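The update in Step 4 can be sketched as a small function that applies the standard Q Learning rule to one (state, action) entry of the table. The function name q_update, the done flag and the default values of the two constants are assumptions made for this example.

```python
def q_update(q_table, state, action, reward, next_state, done,
             learning_rate=0.1, discount_rate=0.9):
    """Apply one Q Learning update to q_table[state][action] and return the new value."""
    # At the end of an episode there is no future reward to look ahead to.
    best_next = 0.0 if done else max(q_table[next_state])
    # Target = immediate reward + discounted best value of the next state.
    target = reward + discount_rate * best_next
    # Move the old value a fraction (the learning rate) of the way to the target.
    q_table[state][action] += learning_rate * (target - q_table[state][action])
    return q_table[state][action]
```

For example, if the old value is 0, the reward is 10, the best value in the next state is 20, the learning rate is 0.5 and the discount rate is 0.9, the target is 10 + 0.9 × 20 = 28 and the new value is 0 + 0.5 × 28 = 14.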

The agent will iterate over these steps and arrive at a Q Table with updated values. Using this Q Table is then as simple as using a map: for each state, select the action with the maximum Q value.
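Putting the four steps together, a complete training loop might look like the sketch below. The grid layout is hypothetical: only the dog at state 9 (index 8 here, since the code counts from 0) is taken from the example above, while the other dog, the candy cells and the park position are assumptions. The hyperparameters are illustrative too.

```python
import random

# Hypothetical layout, 0-indexed and row by row (the article numbers states 1-16).
SIZE = 4
START, PARK = 0, 15
DOGS = {6, 8}       # index 8 is state 9 in the article; index 6 is assumed
CANDY = {5, 10}     # assumed candy positions
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # UP, DOWN, LEFT, RIGHT


def step(state, action, candy_left):
    """Apply one move and return (next_state, reward, done)."""
    row, col = divmod(state, SIZE)
    d_row, d_col = MOVES[action]
    # A move that would leave the grid keeps the agent in place.
    row = min(max(row + d_row, 0), SIZE - 1)
    col = min(max(col + d_col, 0), SIZE - 1)
    nxt = row * SIZE + col
    if nxt in DOGS:
        return nxt, -50, True
    if nxt == PARK:
        return nxt, 50, True
    if nxt in candy_left:
        candy_left.remove(nxt)  # each candy pays out once per episode
        return nxt, 10, False
    return nxt, 0, False


def train(episodes=2000, alpha=0.1, gamma=0.9, max_steps=100, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the Q Table with zeroes.
    q = [[0.0] * len(MOVES) for _ in range(SIZE * SIZE)]
    for episode in range(episodes):
        epsilon = max(0.05, 0.995 ** episode)  # explore early, exploit later
        state, candy_left = START, set(CANDY)
        for _ in range(max_steps):
            # Step 2: explore or exploit.
            if rng.random() < epsilon:
                action = rng.randrange(len(MOVES))
            else:
                action = max(range(len(MOVES)), key=lambda a: q[state][a])
            # Step 3: act and measure the reward.
            nxt, reward, done = step(state, action, candy_left)
            # Step 4: update the Q Table.
            best_next = 0.0 if done else max(q[nxt])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = nxt
            if done:
                break
    return q


def greedy_action(q, state):
    """Use the trained table as a map: pick the action with the highest Q value."""
    return max(range(len(MOVES)), key=lambda a: q[state][a])


q_table = train()
```

After training, greedy_action(q_table, state) returns the index into MOVES of the preferred move for any state. Moves that walk into a dog end up with negative Q values, so the greedy choice avoids them.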
