RL: Reinforcement Learning

Reinforcement learning is useful when you have no training data and no sufficiently specific expertise about the problem. On a high level, you know WHAT you want, but not really HOW to get there. Luckily, all you need is a reward mechanism, and the reinforcement learning model will figure out how to maximize the reward if you just let it “play” long enough. This is analogous to teaching a dog to sit using treats. At first the dog is clueless and tries random things on your command. At some point, it accidentally lands on its butt and gets a sudden reward. As time goes by, and given enough iterations, it’ll figure out the expert strategy of sitting down on cue.

Introduction

The idea behind Reinforcement Learning is that an agent will learn from the environment by interacting with it and receiving rewards for performing actions.

The Reinforcement Learning Process

Let’s imagine an agent learning to play Super Mario Bros as a working example. The Reinforcement Learning (RL) process can be modeled as a loop that works like this:

  • Our Agent receives state S0 from the Environment (In our case we receive the first frame of our game (state) from Super Mario Bros (environment))
  • Based on that state S0, agent takes an action A0 (our agent will move right)
  • Environment transitions to a new state S1 (new frame)
  • Environment gives some reward R1 to the agent (not dead: +1)

This RL loop outputs a sequence of state, action and reward.

The goal of the agent is to maximize the expected cumulative reward.
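To make this loop concrete, here is a minimal sketch in Python. The tiny ToyEnv and the random agent are hypothetical stand-ins, not from the original article (a real Gym environment appears later); the point is only the shape of the interaction:

import random

# A toy one-dimensional environment, just to show the loop shape.
class ToyEnv:
    def reset(self):
        self.pos = 0
        return self.pos  # initial state S0

    def step(self, action):
        self.pos += action                  # move left (-1) or right (+1)
        reward = 1 if self.pos == 3 else 0  # reward for reaching the goal
        done = self.pos == 3
        return self.pos, reward, done, {}

env = ToyEnv()
state = env.reset()  # agent receives state S0
done = False
total_reward = 0

while not done:
    action = random.choice([-1, 1])               # a clueless agent acts randomly
    state, reward, done, info = env.step(action)  # environment returns S(t+1), R(t+1)
    total_reward += reward                        # accumulate the cumulative reward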

The central idea of the Reward Hypothesis

Why is the goal of the agent to maximize the expected cumulative reward?

Well, Reinforcement Learning is based on the idea of the reward hypothesis: all goals can be described by the maximization of the expected cumulative reward.

That’s why in Reinforcement Learning, to have the best behavior, we need to maximize the expected cumulative reward.

The cumulative reward at each time step t can be written as:

$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$

which is equivalent to:

$G_t = \sum_{k=0}^{T} R_{t+k+1}$

However, in reality, we can’t just add the rewards like that. The rewards that come sooner (at the beginning of the game) are more probable to happen, since they are more predictable than the long-term future rewards.

(Diagram: a mouse agent in a maze, collecting cheese while avoiding a cat.)

Let’s say your agent is this small mouse and your opponent is the cat. Your goal is to eat the maximum amount of cheese before being eaten by the cat.

As we can see in the diagram, it’s more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).

As a consequence, the reward near the cat, even if it is bigger (more cheese), will be discounted. We’re not really sure we’ll be able to eat it.

To discount the rewards, we proceed like this:

We define a discount rate called gamma. It must be between 0 and 1.

  • The larger the gamma, the smaller the discount. This means the learning agent cares more about the long term reward.
  • On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short term reward (the nearest cheese).

Our discounted expected cumulative reward is:

$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

Put simply, each reward is discounted by gamma raised to the power of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less probable to happen.
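As a quick illustration (not from the original article), here is how the discounted cumulative reward could be computed for a list of rewards, with gamma as the discount rate:

# Illustrative sketch: discounted return G_t for a reward sequence.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r  # each reward is discounted by gamma^k
    return g

# The same +10 reward is worth less the further away it is:
print(discounted_return([0, 0, 10]))  # 8.1
print(discounted_return([10, 0, 0]))  # 10.0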

Episodic or Continuing tasks

A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing.

Episodic task

In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and New States.

For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario and ends when you’re killed or you reach the end of the level.

Continuing task

These are tasks that continue forever (no terminal state). In this case, the agent has to learn how to choose the best actions while simultaneously interacting with the environment.

For instance, consider an agent that does automated stock trading. For this task, there is no starting point or terminal state; the agent keeps running until we decide to stop it.

Monte Carlo vs TD Learning methods

We have two ways of learning:

  • Collecting the rewards at the end of the episode and then calculating the maximum expected future reward: Monte Carlo Approach
  • Estimate the rewards at each step: Temporal Difference Learning

Monte Carlo

When the episode ends (the agent reaches a “terminal state”), the agent looks at the total cumulative reward to see how well it did. In the Monte Carlo approach, rewards are only received at the end of the game.

Then, we start a new game with the added knowledge. The agent makes better decisions with each iteration.

$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$, where $\alpha$ is the learning rate.

Let’s take an example:


If we take the maze environment:

  • We always start at the same starting point.
  • We terminate the episode if the cat eats us or if we take more than 20 steps.
  • At the end of the episode, we have a list of States, Actions, Rewards, and New States.
  • The agent will sum the total rewards $G_t$ (to see how well it did).
  • It will then update $V(S_t)$ based on the formula above.
  • Then start a new game with this new knowledge.

By running more and more episodes, the agent will learn to play better and better.
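As a minimal sketch (the names V and alpha are illustrative; this is not the article’s code), the Monte Carlo update after one finished episode could look like:

from collections import defaultdict

V = defaultdict(float)  # value estimate per state, initialized to 0
alpha = 0.1             # learning rate

def mc_update(episode):
    # episode: list of (state, G) pairs, where G is the return observed from that state
    for state, G in episode:
        V[state] += alpha * (G - V[state])  # move V(St) towards the observed return Gt

# after playing one episode in the maze:
mc_update([('start', 2.0), ('corridor', 1.0)])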

Temporal Difference Learning: learning at each time step

TD Learning, on the other hand, does not wait until the end of the episode to update its estimate of the maximum expected future reward: it updates its value estimate V for the non-terminal state St encountered at each step of the experience.

This method is called TD(0) or one step TD (update the value function after any individual step).

$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$

TD methods only wait until the next time step to update the value estimates. At time t+1 they immediately form a TD target using the observed reward $R_{t+1}$ and the current estimate $V(S_{t+1})$.

The TD target is an estimate: in fact, you update the previous estimate $V(S_t)$ towards a one-step target.
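A minimal sketch of the TD(0) update (again with illustrative names, assuming V is a dict of value estimates):

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9):
    td_target = reward + gamma * V[next_state]  # one-step estimate of the return
    td_error = td_target - V[state]             # how far off the current estimate is
    V[state] += alpha * td_error                # nudge V(St) towards the TD target

# update immediately after a single transition; no need to finish the episode:
V = {'s0': 0.0, 's1': 0.5}
td0_update(V, 's0', reward=1, next_state='s1')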

Exploration/Exploitation trade-off

Before looking at the different strategies to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.

  • Exploration is finding more information about the environment.
  • Exploitation is exploiting known information to maximize the reward.

Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.


In this game, our mouse can eat an infinite amount of small cheese (+1 each). But at the top of the maze there is a gigantic sum of cheese (+1000).

However, if we only focus on reward, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small (exploitation).

But if our agent does a little bit of exploration, it can find the big reward.

This is what we call the exploration/exploitation trade-off. We must define a rule that helps handle it; one simple rule, epsilon-greedy, is sketched below, and we’ll see different ways to handle the trade-off in future articles.
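With probability epsilon we pick a random action (explore); otherwise we pick the best-known action (exploit). A minimal illustrative sketch, not the article’s code:

import random

def epsilon_greedy(q_values, epsilon=0.1):
    if random.uniform(0, 1) < epsilon:
        return random.randrange(len(q_values))  # explore: any action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: best action

print(epsilon_greedy([0.1, 0.5, -0.2]))  # usually 1, sometimes a random action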

Three approaches to RL

Now that we defined the main elements of Reinforcement Learning, let’s move on to the three approaches to solve a Reinforcement Learning problem. These are value-based, policy-based, and model-based.

Value Based

In value-based RL, the goal is to optimize the value function V(s).

The value function is a function that tells us the maximum expected future reward the agent will get at each state.

The value of each state is the total amount of the reward an agent can expect to accumulate over the future, starting at that state.

$v_{\pi}(s) = \mathbb{E}_{\pi}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^{2} R_{t+3} + \dots \mid S_t = s \right]$

The agent will use this value function to select which state to move to at each step: it takes the state with the biggest value.

In the maze example, at each step we will take the biggest value: -7, then -6, then -5 (and so on) to attain the goal.

Policy Based

In policy-based RL, we want to directly optimize the policy function $\pi(s)$ without using a value function. The policy is what defines the agent’s behavior at a given time:

$a = \pi(s)$

We learn a policy function. This lets us map each state to the best corresponding action.

We have two types of policy (a small sketch follows):

  • Deterministic: a policy at a given state will always return the same action.
  • Stochastic: outputs a probability distribution over actions.

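As a small illustration (the states and actions are made up), the two policy types can be sketched like this:

import random

# Deterministic: the same state always maps to the same action.
deterministic_policy = {'s0': 'right', 's1': 'jump'}
action = deterministic_policy['s0']  # always 'right'

# Stochastic: a probability distribution over actions for each state.
stochastic_policy = {'s0': {'right': 0.8, 'jump': 0.2}}
actions, probs = zip(*stochastic_policy['s0'].items())
action = random.choices(actions, weights=probs)[0]  # 'right' about 80% of the time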

Model Based

In model-based RL, we model the environment. This means we create a model of the environment’s behavior. The problem is that each environment will need a different model representation.

The Q-learning algorithm

$Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left( r + \lambda \max_{a'} Q(s',a') \right)$

$\lambda$ (the discount factor) determines how much importance we want to give to future rewards. A high value (close to 1) captures the long-term effective reward, whereas a discount factor of 0 makes our agent consider only the immediate reward, hence making it greedy.

It looks a bit intimidating, but what it does is quite simple. We can summarize it as:

Update the value estimation of an action based on the reward we got and the reward we expect next.

This is the fundamental thing we are doing. The learning rate and discount, while required, are just there to tweak the behavior. The discount will define how much we weigh future expected action values over the one we just experienced. The learning rate is sort of an overall gas pedal. Go too fast and you’ll drive past the optimal, go too slow and you’ll never get there.

Why do we need to gamble and take random actions? Since our default strategy is still greedy, that is, we take the most lucrative option by default, we need to introduce some stochasticity to ensure all possible state-action pairs are explored.
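To see how small each step is, here is one hand-worked update with made-up numbers matching the hyperparameters used later in this article:

alpha, lambda_ = 0.1, 0.6  # learning rate and discount factor
old_q = 0.0                # current estimate Q(s, a)
reward = -1                # reward just observed
next_max = 0.0             # best Q-value available from the next state

new_q = (1 - alpha) * old_q + alpha * (reward + lambda_ * next_max)
print(new_q)  # -0.1: the estimate moves one tenth of the way towards the target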

Example Design: Self-Driving Cab

Problem illustration

In this section, we solve a problem with Q-Learning in Python using OpenAI Gym. Let’s design a simulation of a self-driving cab. The major goal is to demonstrate, in a simplified environment, how you can use RL techniques to develop an efficient and safe approach for tackling this problem. The Smartcab’s job is to pick up the passenger at one location and drop them off at another. Here are a few things that we’d love our Smartcab to take care of:

  • Drop off the passenger at the right location.
  • Save the passenger’s time by taking the minimum time possible to drop off.
  • Take care of the passenger’s safety and traffic rules.

There are different aspects that need to be considered here while modeling an RL solution to this problem: rewards, states, and actions.

Rewards

  • The agent should receive a high positive reward for a successful dropoff because this behavior is highly desired
  • The agent should be penalized if it tries to drop off a passenger in wrong locations
  • The agent should get a slight negative reward after every time-step for not making it to the destination. “Slight” negative because we would prefer our agent to arrive late rather than make wrong moves trying to reach the destination as fast as possible.

State Space

In Reinforcement Learning, the agent encounters a state, and then takes action according to the state it’s in.

The State Space is the set of all possible situations our taxi could inhabit. The state should contain useful information the agent needs to make the right action.

Let’s say we have a training area for our Smartcab where we are teaching it to transport people in a parking lot to four different locations (R, G, Y, B):

(Illustration: the 5x5 Taxi environment grid, with pickup/dropoff locations R, G, Y, B.)

Let’s assume Smartcab is the only vehicle in this parking lot. We can break up the parking lot into a 5x5 grid, which gives us 25 possible taxi locations. These 25 locations are one part of our state space. Notice the current location state of our taxi is coordinate (3, 1).

You’ll also notice there are four locations that we can pick up and drop off a passenger: R, G, Y, B or [(0,0), (0,4), (4,0), (4,3)] in (row, col) coordinates. Our illustrated passenger is in location Y and they wish to go to location R.

When we also account for one additional passenger state of being inside the taxi, we can take all combinations of passenger locations and destination locations to arrive at a total number of states for our taxi environment; there are four destinations and five passenger locations (the four letters plus inside the taxi).

So, our taxi environment has $5\times 5 \times 5 \times 4=500$ total possible states.

Action Space

The agent encounters one of the 500 states and it takes an action. The action in our case can be to move in a direction or decide to pickup/dropoff a passenger.

In other words, we have six possible actions:

  1. south
  2. north
  3. east
  4. west
  5. pickup
  6. dropoff

This is the action space: the set of all the actions that our agent can take in a given state.

You’ll notice in the illustration above that the taxi cannot perform certain actions in certain states due to walls. In the environment’s code, we will simply provide a -1 penalty for every wall hit, and the taxi won’t move anywhere. This will just rack up penalties, causing the taxi to consider going around the wall.

Implementation with Python

Fortunately, OpenAI Gym has this exact environment already built for us.

Gym provides different game environments which we can plug into our code to test an agent. The library takes care of the API for providing all the information that our agent requires, like possible actions, score, and current state. We just need to focus on the algorithm part for our agent.

We’ll be using the Gym environment called Taxi-v2, from which all of the details explained above were pulled. The objectives, rewards, and actions are all the same.

Gym’s interface

Firstly, we need to load the game environment and render what it looks like:

import gym

env = gym.make("Taxi-v2").env
env.render()

(Output: the rendered Taxi environment grid.)

The core Gym interface is env, which is the unified environment interface. The following are the env methods that will be quite helpful to us:

  • env.reset: Resets the environment and returns a random initial state.
  • env.step(action): Step the environment by one timestep. Returns
    • observation: Observations of the environment
    • reward: If your action was beneficial or not
    • done: Indicates if we have successfully picked up and dropped off a passenger, also called one episode
    • info: Additional info such as performance and latency for debugging purposes
  • env.render: Renders one frame of the environment (helpful in visualizing the environment)

Note: We are using the .env on the end of make to avoid training stopping at 200 iterations, which is the default for the new version of Gym.

Reminder of the problem

Here’s our restructured problem statement (from Gym docs):

“There are 4 locations (labeled by different letters), and our job is to pick up the passenger at one location and drop him off at another. We receive +20 points for a successful drop-off and lose 1 point for every time-step it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.”

env.reset()  # reset environment to a new, random state
env.render()

print("Action Space {}".format(env.action_space))
print("State Space {}".format(env.observation_space))

(Output: the rendered grid, with Action Space Discrete(6) and State Space Discrete(500).)

  • The filled square represents the taxi, which is yellow without a passenger and green with a passenger.
  • The pipe (“|”) represents a wall which the taxi cannot cross.
  • R, G, Y, B are the possible pickup and destination locations. The blue letter represents the current passenger pick-up location, and the purple letter is the current destination.

As verified by the prints, we have an Action Space of size 6 and a State Space of size 500. As you’ll see, our RL algorithm won’t need any more information than these two things. All we need is a way to identify a state uniquely by assigning a unique number to every possible state, and RL learns to choose an action number from 0-5 where:

  • 0 = south
  • 1 = north
  • 2 = east
  • 3 = west
  • 4 = pickup
  • 5 = dropoff

Recall that the 500 states correspond to an encoding of the taxi’s location, the passenger’s location, and the destination location.

Reinforcement Learning will learn a mapping of states to the optimal action to perform in that state by exploration, i.e. the agent explores the environment and takes actions based off rewards defined in the environment.

The optimal action for each state is the action that has the highest cumulative long-term reward.

We can actually take our illustration above, encode its state, and give it to the environment to render in Gym. Recall that we have the taxi at row 3, column 1, our passenger is at location 2, and our destination is location 0. Using the Taxi-v2 state encoding method, we can do the following:

state = env.encode(3, 1, 2, 0)  # (taxi_row, taxi_column, passenger_position, destination_index)
print("State: ", state)

env.s = state
env.render()

(Output: State: 328, plus the rendered grid for that state.)

We are using our illustration’s coordinates to generate a number corresponding to a state between 0 and 499, which turns out to be 328 for our illustration’s state.

Then we can set the environment’s state manually with env.s using that encoded number. You can play around with the numbers and you’ll see the taxi, passenger, and destination move around.
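Under the hood, the encoding is just mixed-radix arithmetic over (row, column, passenger, destination). The sketch below is an illustrative re-implementation, not the library’s actual source:

# Mixed-radix encoding: 5 rows x 5 cols x 5 passenger positions x 4 destinations = 500 states.
def encode(taxi_row, taxi_col, passenger_loc, destination):
    i = taxi_row
    i = i * 5 + taxi_col
    i = i * 5 + passenger_loc
    i = i * 4 + destination
    return i

print(encode(3, 1, 2, 0))  # 328, matching env.encode(3, 1, 2, 0) above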

The reward table

When the Taxi environment is created, an initial reward table is also created, called P. We can think of it as a matrix that has the number of states as rows and the number of actions as columns, i.e. a $states \times actions$ matrix.

Since every state is in this matrix, we can see the default reward values assigned to our illustration’s state:

env.P[328]

{0: [(1.0, 428, -1, False)],
 1: [(1.0, 228, -1, False)],
 2: [(1.0, 348, -1, False)],
 3: [(1.0, 328, -1, False)],
 4: [(1.0, 328, -10, False)],
 5: [(1.0, 328, -10, False)]}

This dictionary has the structure {action: [(probability, nextstate, reward, done)]}.

A few things to note:

  • The 0-5 corresponds to the actions (south, north, east, west, pickup, dropoff) the taxi can perform at our current state in the illustration.
  • In this env, probability is always 1.0.
  • The nextstate is the state we would be in if we take the action at this index of the dict.
  • All the movement actions have a -1 reward and the pickup/dropoff actions have a -10 reward in this particular state. If we are in a state where the taxi has a passenger and is on top of the right destination, we would see a reward of 20 at the dropoff action (5).
  • done is used to tell us when we have successfully dropped off a passenger in the right location. Each successful dropoff is the end of an episode.

Note that if our agent chose to explore action two in this state it would be going East into a wall. The source code has made it impossible to actually move the taxi across a wall, so if the taxi chooses that action, it will just keep accruing -1 penalties, which affects the long-term reward.

Without Reinforcement Learning

Let’s see what would happen if we try to brute-force our way to solving the problem without RL.

Since we have our P table for default rewards in each state, we can try to have our taxi navigate just using that.

We’ll create an infinite loop which runs until one passenger reaches one destination (one episode), or in other words, until the received reward is 20. The env.action_space.sample() method automatically selects one random action from the set of all possible actions.

Let’s see what happens:

env.s = 328  # set environment to illustration's state

epochs = 0
penalties, reward = 0, 0

frames = []  # store each step for visualization

done = False

while not done:
    action = env.action_space.sample()  # choose next step randomly
    state, reward, done, info = env.step(action)

    if reward == -10:  # illegal pick-up and drop-off actions
        penalties += 1

    # put each rendered frame into dict for visualization
    frames.append({
        'frame': env.render(mode='ansi'),
        'state': state,
        'action': action,
        'reward': reward,
    })

    epochs += 1

print("Timesteps taken: {}".format(epochs))
print("Penalties incurred: {}".format(penalties))

Timesteps taken: 1348
Penalties incurred: 431
from IPython.display import clear_output
from time import sleep

def print_frames(frames):
    for i, frame in enumerate(frames):
        clear_output(wait=True)
        print(frame['frame'])
        print(f"Timestep: {i + 1}")
        print(f"State: {frame['state']}")
        print(f"Action: {frame['action']}")
        print(f"Reward: {frame['reward']}")
        sleep(.1)

print_frames(frames)

(Animation: the taxi wandering around the grid under the random policy.)

Not good. Our agent takes thousands of timesteps and makes lots of wrong dropoffs to deliver just one passenger to the right destination.

This is because we aren’t learning from past experience. We can run this over and over, and it will never optimize. The agent has no memory of which action was best for each state, which is exactly what Reinforcement Learning will provide.

Enter Reinforcement Learning

Summing up the Q-Learning Process

Breaking it down into steps, we get:

  • Initialize the Q-table with all zeros.
  • Start exploring actions: for the current state (S), select any one among all possible actions.
  • Travel to the next state (S’) as a result of that action (a).
  • For all possible actions from the state (S’), select the one with the highest Q-value.
  • Update the Q-table values using the equation.
  • Set the next state as the current state.
  • If the goal state is reached, end the episode and repeat the process.

After enough random exploration of actions, the Q-values tend to converge, serving our agent as an action-value function which it can exploit to pick the most optimal action from a given state.

There’s a tradeoff between exploration (choosing a random action) and exploitation (choosing actions based on already learned Q-values). We want to prevent the agent from always taking the same route, and possibly overfitting, so we’ll be introducing another parameter called $\epsilon$ “epsilon” to cater to this during training.

Instead of just selecting the best learned Q-value action, we’ll sometimes favor exploring the action space further. A larger epsilon value results in episodes with more penalties (on average), which is obvious because we are exploring and making random decisions.

Training the agent

First, we’ll initialize the Q-table to a 500×6 matrix of zeros:

import numpy as np

q_table = np.zeros([env.observation_space.n, env.action_space.n])

In the first part of while not done, we decide whether to pick a random action or to exploit the already computed Q-values. This is done simply by comparing the epsilon value to the result of random.uniform(0, 1), which returns a uniform random number between 0 and 1.

We execute the chosen action in the environment to obtain the next_state and the reward from performing the action. After that, we calculate the maximum Q-value for the actions corresponding to the next_state, and with that, we can easily update our Q-value to the new_q_value:

import random

alpha = 0.1    # learning rate
lambda_ = 0.6  # discount factor
epsilon = 0.1

all_epochs = []
all_penalties = []

for i in range(1, 100001):
    state = env.reset()

    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()  # explore action space
        else:
            action = np.argmax(q_table[state])  # exploit learned values

        next_state, reward, done, info = env.step(action)

        old_value = q_table[state, action]
        next_max = np.max(q_table[next_state])

        new_value = (1 - alpha) * old_value + alpha * (reward + lambda_ * next_max)
        q_table[state, action] = new_value

        if reward == -10:
            penalties += 1

        state = next_state
        epochs += 1

    if i % 100 == 0:
        clear_output(wait=True)
        print(f"Episode: {i}")

print("Training finished.\n")

Now that the Q-table has been established over 100,000 episodes, let’s see what the Q-values are at our illustration’s state:

q_table[328]

array([ -2.41338735,  -2.27325184,  -2.41222668,  -2.36038871,
       -11.15090102, -10.99517141])

The max Q-value is “north” (-2.27325184), so it looks like Q-learning has effectively learned the best action to take in our illustration’s state!

Evaluating the agent

Let’s evaluate the performance of our agent. We don’t need to explore actions any further, so now the next action is always selected using the best Q-value:

total_epochs, total_penalties = 0, 0
episodes = 100

for _ in range(episodes):
    state = env.reset()
    epochs, penalties, reward = 0, 0, 0
    done = False

    while not done:
        action = np.argmax(q_table[state])
        state, reward, done, info = env.step(action)

        if reward == -10:
            penalties += 1

        epochs += 1

    total_penalties += penalties
    total_epochs += epochs

print(f"Results after {episodes} episodes:")
print(f"Average timesteps per episode: {total_epochs / episodes}")
print(f"Average penalties per episode: {total_penalties / episodes}")

Results after 100 episodes:
Average timesteps per episode: 12.55
Average penalties per episode: 0.0

We can see from the evaluation that the agent’s performance improved significantly and it incurred no penalties, which means it performed the correct pickup/dropoff actions with 100 different passengers.

Comparison: learning with and without RL

With Q-learning, the agent commits errors initially during exploration, but once it has explored enough (seen most of the states), it can act wisely, maximizing the rewards by making smart moves. Let’s see how much better our Q-learning solution is when compared to the agent making just random moves.

We evaluate our agents according to the following metrics:

  • Average number of penalties per episode: The smaller the number, the better the performance of our agent. Ideally, we would like this metric to be zero or very close to zero.
  • Average number of timesteps per trip: We want a small number of timesteps per episode as well, since we want our agent to take the minimum number of steps (i.e. the shortest path) to reach the destination.
  • Average rewards per move: The larger the reward, the better the agent is doing. That’s why deciding rewards is a crucial part of Reinforcement Learning. In our case, as both timesteps and penalties are negatively rewarded, a higher average reward would mean that the agent reaches the destination as fast as possible with the least penalties.

Measure                                   Random agent’s performance   Q-learning agent’s performance
Average rewards per move                  -3.9012092102214075          0.6962843295638126
Average number of penalties per episode   920.45                       0.0
Average number of timesteps per trip      2848.14                      12.38

Hyperparameters and optimizations

The values of alpha, lambda, and epsilon were mostly based on intuition and some trial and error, but there are better ways to come up with good values.

Ideally, all three should decrease over time, because as the agent continues to learn, it actually builds up more resilient priors (a simple decay schedule is sketched after this list):

  • $\alpha$ (the learning rate): should decrease as you continue to gain a larger and larger knowledge base.
  • $\lambda$ (the discount factor): as you get closer and closer to the deadline, your preference for near-term reward should increase, as you won’t be around long enough to get the long-term reward; this means your discount factor should decrease.
  • $\epsilon$ (the exploration rate): as we develop our strategy, we have less need of exploration and more exploitation to get more utility from our policy, so as trials increase, epsilon should decrease.
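The schedule and constants below are illustrative, not tuned values:

# Illustrative decay schedule: the parameter shrinks as episodes accumulate.
def decayed(initial, episode, min_value=0.01, decay=1e-4):
    # simple inverse decay, floored so the parameter never reaches zero
    return max(min_value, initial / (1 + decay * episode))

for episode in (1, 10000, 100000):
    print(episode, round(decayed(0.1, episode), 4))  # e.g. epsilon shrinking over training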

OpenAI Gym

Gym is a toolkit for developing and comparing reinforcement learning algorithms. The gym library is a collection of test problems — environments — that you can use to work out your reinforcement learning algorithms. These environments have a shared interface, allowing you to write general algorithms.

Environments

import gym

env = gym.make("Taxi-v2")
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action

  • gym.make(): creates an environment.
  • reset(): resets the environment and returns a random initial state.
  • render(): prints out the current environment.
  • env.step(action): steps the environment by one timestep. Returns:
    • observation (object): an environment-specific object representing your observation of the environment.
    • reward (float): amount of reward achieved by the previous action.
    • done (boolean): whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated.
    • info (dict): additional info such as performance and latency for debugging purposes.

This is just an implementation of the classic “agent-environment loop”. Each timestep, the agent chooses an action, and the environment returns an observation and a reward. The process gets started by calling reset(), which returns an initial observation. So a more proper way of writing the previous code would be to respect the done flag:

import gym

env = gym.make("Taxi-v2")
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break

Spaces

In the examples above, we’ve been sampling random actions from the environment’s action space. But what actually are those actions? Every environment comes with an action_space and an observation_space. These attributes are of type Space, and they describe the format of valid actions and observations:

import gym

env = gym.make("Taxi-v2")
print(env.action_space)
#> Discrete(6)
print(env.observation_space)
#> Discrete(500)

The Discrete space allows a fixed range of non-negative numbers.
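Continuing the snippet above, we can sample from a Discrete space and check membership (sample and contains are standard Space methods):

print(env.action_space.sample())     # a random valid action, e.g. 3
print(env.action_space.contains(5))  # True: 5 (dropoff) is inside Discrete(6)
print(env.action_space.contains(6))  # False: outside the 0..5 range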

References

Reinforcement Learning Tutorial Part 1: Q-Learning

An introduction to Q-Learning: reinforcement learning

Reinforcement Q-Learning from Scratch in Python with OpenAI Gym