Last time, we established some theory for our DQN. Now it’s time to implement it.

We won’t build a full DQN from the DeepMind paper, not yet. Instead we’ll build a simplified version. I call it the basic DQN.

The basic DQN is the same as the full DQN, but missing a target network and reward clipping. We’ll get to that in the next post.

We use the OpenAI gym, the CartPole-v1 environment, and Python 3.6. Let’s get started.

## Setup

As usual, first thing is to import libraries:

```
# General
import gym
import pandas as pd
import numpy as np
# Neural network
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

We then set up the environment with the `gym.make()`

command.

`env = gym.make("CartPole-v1")`

## Approximating *q(s,a)* with a neural network

The DQN is for problems that have a continuous state, not a discrete state. That rules out the use of a q-table.

Instead we build a neural network to represent *q*. There are many ways to build a neural network. I choose keras.

This neural network will map the state, **s** (usually a vector) to the possible actions, **a**.

*One input neuron for every component of the state vector. One output neuron for every possible action.*

Say the above neural network represented the q-function. Our state would be a vector of length three. Examples? X, Y and Z position for a game, or card sum, dealer faceup card and useable ace for blackjack.

We don’t need a complicated neural network structure for our problem. A network with one hidden layer will be fine. The environment dictates the number of input and output neurons in our network. Activation functions are relu, except for a linear output layer. Using the Adam optimiser lets us avoid choosing a learning rate.

```
## Building the nnet that approximates q
n_actions = env.action_space.n # dim of output layer
input_dim = env.observation_space.shape[0] # dim of input layer
model = Sequential()
model.add(Dense(64, input_dim = input_dim , activation = 'relu'))
model.add(Dense(32, activation = 'relu'))
model.add(Dense(n_actions, activation = 'linear'))
model.compile(optimizer=Adam(), loss = 'mse')
```

## Parameters

Next we initiate parameters:

`n_episodes`

: how many episodes to run`gamma`

: the discount factor that applies to rewards`epsilon`

: the exploration parameter: will decline over time.`minibatch_size`

: how manyexperiences to include each time we update the*q*neural network`r_sums`

: list to store the reward from each episode`replay_memory`

: list to hold theexperiences`mem_max_size`

: once the`replay_memory`

list gets longer than this, overwrite old samples with new ones

Here’s some values for these parameters. Many different parameter values may work. Use hyperparameter tuning if you want to find the best ones.

```
n_episodes = 1000
gamma = 0.99
epsilon = 1
minibatch_size = 32
r_sums = [] # stores rewards of each epsiode
replay_memory = [] # replay memory holds s, a, r, s'
mem_max_size = 100000
```

## Interaction

```
for n in range(n_episodes):
s = env.reset()
done=False
r_sum = 0
while not done:
# Uncomment this to see the agent learning
# env.render()
# Feedforward pass for current state to get predicted q-values for all actions
qvals_s = model.predict(s.reshape(1,4))
# Choose action to be epsilon-greedy
if np.random.random() < epsilon: a = env.action_space.sample()
else: a = np.argmax(qvals_s);
# Take step, store results
sprime, r, done, info = env.step(a)
r_sum += r
# add to memory, respecting memory buffer limit
if len(replay_memory) > mem_max_size:
replay_memory.pop(0)
replay_memory.append({"s":s,"a":a,"r":r,"sprime":sprime,"done":done})
# Update state
s=sprime
# Train the nnet that approximates q(s,a), using the replay memory
model=replay(replay_memory, minibatch_size = minibatch_size)
# Decrease epsilon until we hit a target threshold
if epsilon > 0.01: epsilon -= 0.001
print("Total reward:", r_sum)
r_sums.append(r_sum)
```

We need to reset the environment each time we reach the end of an episode. That’s what `env.reset()`

does.

Let’s go inside the inner loop. Say the agent is in a state *s*.

First the agent needs to know what the *q*-value is for each action in state *s*. Get this with the neural network’s predict method.

The agent then has to select an action. It could take the action with the highest *q*-value, or it could move randomly. It’s up to the epsilon parameter to determine how often each should happen.

The agent then executes its action. It ends up in *s’* (sprime) and gets a reward *r*.
Now the agent has a experience. We’ll also add the variable done to know if the episode has finished, making

*. Our experience replay function will use these experiences.*

Time to add it to the memory. The memory can only hold mem_max_size elements. If our memory is full, we’ll ditch the oldest entry and replace it with the latest experience.

After, we train the q neural network using a `minibatch_size`

amount of experiences. We’ll write a function called

`replay()`

to do this.Finally we decrease epsilon a little. As the agent learns more about the environment it should explore less, until it hits a threshold.

## The experience replay function

Time to write the `replay()`

function used earlier. This function trains our neural network that approximates *q*.

```
def replay(replay_memory, minibatch_size=32):
# choose <s,a,r,s',done> experiences randomly from the memory
minibatch = np.random.choice(replay_memory, minibatch_size, replace=True)
# create one list containing s, one list containing a, etc
s_l = np.array(list(map(lambda x: x['s'], minibatch)))
a_l = np.array(list(map(lambda x: x['a'], minibatch)))
r_l = np.array(list(map(lambda x: x['r'], minibatch)))
sprime_l = np.array(list(map(lambda x: x['sprime'], minibatch)))
done_l = np.array(list(map(lambda x: x['done'], minibatch)))
# Find q(s', a') for all possible actions a'. Store in list
# We'll use the maximum of these values for q-update
qvals_sprime_l = model.predict(sprime_l)
# Find q(s,a) for all possible actions a. Store in list
target_f = model.predict(s_l)
# q-update target
# For the action we took, use the q-update value
# For other actions, use the current nnet predicted value
for i,(s,a,r,qvals_sprime, done) in enumerate(zip(s_l,a_l,r_l,qvals_sprime_l, done_l)):
if not done: target = r + gamma * np.max(qvals_sprime)
else: target = r
target_f[i][a] = target
# Update weights of neural network with fit()
# Loss function is 0 for actions we didn't take
model.fit(s_l, target_f, epochs=1, verbose=0)
return model
```

Start off by selecting some entries randomly from our memory to form our minibatch. Select the samples without replacement.

Next we create one list to hold each s element, one list to hold each a element, one list to hold each *r* element, and so on. We’ll use these lists later on.

Now we find the action *a’* that maximises *q(s’,a’)*. We can use the predict() function of our neural network. Remember, by inputting the state s’ on one end, we get the q-estimate for every action. Now we know the best q-value for the state s’.

We eventually will need to train our neural network with some labelled pairs. Our input to the neural network needs to be the same dimensions as s, and we need one value for every action. Remember this diagram?

We don’t know how good the other actions were. We didn’t take them – but we need a *q*-value for these actions anyway. What do we do?
Solution: use the current predicted values of the q-network for these actions. We can get these with `target_f = model.predict(s_l)`

. The loss function will be 0 for these since we have a perfect fit.

Now we need to add the q-estimate for the action the agent took. We do this with the q-learning update. We also need to split between the cases where the episode is already finished or not. If `done=True`

then there is no next state to go to – the target is *r*. Else the q-update gives the target. Use the target to overwrite the predicted values for the action the agent took. Do this with the `target_f`

vector.

Finally train the model using the minibatch. We don’t loop through the elements one by one and train on each of them. We train on all the elements at once and get the benefits of minibatch training.

## Performance

This environment is considered solved by getting an average reward of 475 over 100 episodes. The basic DQN can do this in xx episodes. How do we go? Here’s the results for 2500 episodes:

The basic DQN works pretty well. It learns how to balance the pole. It even gets quite good at it. But then it forgets what it knows. The performance becomes terrible. Eventually it will relearn how to balance the pole and the performance will return to high levels. We can see evidence of this towards the final episodes.

Why does this happen? The agent becomes successful at the task. The pole is balanced and the agents memory only has these experiences. This agent then uses this data for training. It then loses skills for when the pole isn’t already balanced. The memory starts to have more diverse experiences again. The agent starts to perform better.

This forgetting and relearning seems to be a constant feature for the DQN algorithm. Both the basic and full DQNs have this problem, called catastrophic forgetting. See this thread or this paper for more details.