Deep Reinforcement Learning
This notebook explains the first applications of and attempts to use deep learning (DL) in reinforcement learning (RL), and then works through a practical example of DQN.
Deep Reinforcement Learning (DRL) combines Deep Learning (DL) and Reinforcement Learning. In recent years, DL has created a new way to think about machine learning projects. Image processing, NLP, and autonomous driving are successful examples of DL applications, and the idea of using DL in RL emerged immediately after the remarkable performance of neural networks in supervised learning. As a result, several novel algorithms were developed; simpler versions of many of them had existed before, but the new ones are armed with neural networks.
However, the important point is to fully understand why DL is useful in RL. It is necessary to identify problems that are not feasible to solve without DL, and then discuss the effect and result of adding it. Below are some cases where DL helps us solve RL problems efficiently and simply.
One of the pioneering papers in DRL is "Playing Atari with Deep Reinforcement Learning", published in 2013. This paper shows that it is possible to learn to play Atari games without any prior experience with the game. In addition, the paper uses one model architecture for different games and explains how this can be done. Here is a summary of the paper.
Let's assume that each frame of an Atari game is represented by a 210×160 RGB image, and there are 60 frames per second. The goal is to create a single neural network agent that is able to successfully learn to play as many of the games as possible. The network was not provided with any game-specific information or hand-designed visual features, and was not privy to the internal state of the emulator.
We consider tasks in which an agent interacts with an environment $E$, in this case the Atari emulator, in a sequence of actions, observations and rewards. At each time-step the agent selects an action $a_t$ from the set of legal game actions, $A = \{1, \ldots, K\}$. The action is passed to the emulator and modifies its internal state and the game score. In general $E$ may be stochastic. The emulator's internal state is not observed by the agent; instead it observes an image $x_t \in \mathbb{R}^d$ from the emulator, which is a vector of raw pixel values representing the current screen. In addition it receives a reward $r_t$ representing the change in game score. Note that in general the game score may depend on the whole prior sequence of actions and observations; feedback about an action may only be received after many thousands of time-steps have elapsed.
After defining states and actions, the agent starts interacting with $E$. The goal of the agent is to interact with the emulator by selecting actions in a way that maximises future rewards. We define the optimal action-value function $Q^*(s, a)$ as the maximum expected return achievable by following any strategy, after seeing some sequence $s$ and then taking some action $a$, $Q^*(s, a) = \max_{\pi} \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, where $\pi$ is a policy mapping sequences to actions (or distributions over actions). The optimal action-value function obeys an important identity known as the Bellman equation.
$$Q^*(s, a) = \mathbb{E}_{s' \sim E}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right]$$
The basic idea behind many reinforcement learning algorithms is to estimate the action-value function by using the Bellman equation as an iterative update. In practice, this basic approach is totally impractical, because the action-value function is estimated separately for each sequence, without any generalisation. Instead, it is common to use a function approximator to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$. We refer to a neural network function approximator with weights $\theta$ as a Q-network. We use an architecture in which there is a separate output unit for each possible action, and only the state representation is an input to the neural network. This model outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
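Concretely, the paper trains the Q-network by minimising a sequence of squared-error losses between the network's prediction and a bootstrapped target (this is essentially the loss that the notebook's `train()` function implements later):
$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\Big[\big(y_i - Q(s, a; \theta_i)\big)^2\Big], \qquad y_i = \mathbb{E}_{s' \sim E}\Big[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\Big|\, s, a\Big]$$
where $\rho$ is the behaviour distribution over sequences and actions, and the target $y_i$ is computed with the parameters from the previous iteration and held fixed when differentiating the loss.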
RL projects are sometimes safety-critical, and the algorithms must not be vulnerable to adversarial policies. Although attack and defense mechanisms in RL have been studied extensively, DRL algorithms can still be misled by adversarial policies. A paper called "Adversarial Policies: Attacking Deep Reinforcement Learning" discusses this problem with DRL algorithms. Here is a summary of the paper.
Is it possible to attack an RL agent simply by choosing an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial? We demonstrate the existence of adversarial policies in zero-sum games between simulated humanoid robots with proprioceptive observations, against state-of-the-art victims trained via self-play to
be robust to opponents. The adversarial policies reliably win against the victims but generate seemingly random and uncoordinated behavior. We find that these policies are more successful in high-dimensional environments, and induce substantially different activations in the victim policy network than when the victim plays against a normal opponent.
We model the victim as playing against an opponent in a two-player Markov game (Shapley, 1953). Our threat model assumes the attacker can control the opponent, in which case we call the opponent an adversary. We denote the adversary and victim by subscript $\alpha$ and $\nu$ respectively. The adversary is allowed unlimited black-box access to actions sampled from $\pi_\nu$, but is not given any white-box information such as weights or activations. We further assume the victim agent follows a fixed stochastic policy $\pi_\nu$, corresponding to the common case of a pre-trained model deployed with static weights.
Since the victim policy $\pi_\nu$ is held fixed, the two-player Markov game $M$ reduces to a single-player MDP $M_\alpha = (S, A_\alpha, T_\alpha, R_\alpha)$ that the attacker must solve. The goal of the attacker is to find an adversarial policy $\pi_\alpha$ maximizing the sum of discounted rewards: $$\sum_{t=0}^{\infty}\gamma^t R_\alpha\big(s^{(t)}, a_\alpha^{(t)}, s^{(t+1)}\big), \quad \text{where} \quad s^{(t+1)} \sim T_\alpha\big(s^{(t)}, a_\alpha^{(t)}\big) \quad \text{and} \quad a_\alpha^{(t)} \sim \pi_\alpha\big(\cdot \mid s^{(t)}\big)$$ Note that the MDP's dynamics $T_\alpha$ will be unknown even if the Markov game's dynamics $T$ are known, since the victim policy $\pi_\nu$ is a black box. Consequently, the attacker must solve an RL problem.
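To make the reduction concrete, here is a minimal sketch of how a frozen victim can be folded into the environment so that the attacker faces an ordinary single-agent interface. It is illustrative only: `multi_env`, its paired action/observation spaces, and `victim_pi.sample` are hypothetical names, not the paper's actual code.
import gym

class FixedVictimEnv(gym.Env):
    # Illustrative sketch: wrap a two-player environment (hypothetical API)
    # around a frozen, black-box victim policy so the adversary sees a
    # single-agent MDP M_alpha = (S, A_alpha, T_alpha, R_alpha).
    def __init__(self, multi_env, victim_pi):
        self.multi_env = multi_env                            # hypothetical two-player env
        self.victim_pi = victim_pi                            # black box: we only sample actions
        self.action_space = multi_env.action_spaces[0]        # adversary's action space
        self.observation_space = multi_env.observation_spaces[0]

    def reset(self):
        obs_adv, self._obs_victim = self.multi_env.reset()
        return obs_adv

    def step(self, adv_action):
        # Query the frozen victim, then step the underlying two-player game
        victim_action = self.victim_pi.sample(self._obs_victim)
        (obs_adv, self._obs_victim), (r_adv, _), done, info = \
            self.multi_env.step((adv_action, victim_action))
        # The adversary's dynamics T_alpha are unknown because they absorb the victim policy
        return obs_adv, r_adv, done, info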
The adversarial policies beat the victim not by performing the intended task (e.g. blocking a goal), but rather by exploiting weaknesses in the victim’s policy. This effect is best seen by watching the videos at https://adversarialpolicies.github.io/.
One might wonder if the adversarial policies win because they are outside the training distribution of the victim. To test this, we evaluate victims against two simple off-distribution baselines: a random policy Rand (green) and a lifeless policy Zero (red). These baselines win as often as 30% to 50% in Kick and Defend, but less than 1% of the time in Sumo and You Shall Not Pass. This is well below the performance of our adversarial policies. We conclude that most victim policies are robust to off-distribution observations that are not adversarially optimized.
Run these cells on Google Colab (they don't work on Windows!)
!pip install torch
!pip install gym
!pip install atari_py
Because the main point of this project is implementing the DQN algorithm, we won't build the game from scratch; we just use the environment provided by OpenAI Gym. The deep learning library used here is PyTorch, because it is easy to learn and debug.
# Load toolkit
import logging
import gym
import math
import random
import torch
import torch.nn as nn
import torch.optim
import torch.autograd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from collections import deque
from src.env_wrapper import make_atari, wrap_deepmind, wrap_pytorch
USE_CUDA = torch.cuda.is_available() # if we have a gpu to make the compute faster
import urllib.request
urllib.request.urlretrieve('http://www.atarimania.com/roms/Roms.rar','Roms.rar')
!pip install unrar
!unrar x Roms.rar
!mkdir rars
!mv HC\ ROMS.zip rars
!mv ROMS.zip rars
!python -m atari_py.import_roms rars
# Load game environment
env = gym.make('PongNoFrameskip-v4')
# Reset environment
observation = env.reset()
# Pixel shape
observation.shape
The observation is an image of size 210×160 with 3 channels (RGB).
plt.imshow(observation)
Then we take a look at our action space.
env.unwrapped.get_action_meanings()
Although we have 6 discrete action values, there are only three real actions: 0 and 1 stand for doing nothing, 2 and 4 move the paddle up, and 3 and 5 move the paddle down.
# let's take some steps to see a mid-game scene in Pong
# 0 is the numeric action to stay in the same place
for i in range(20):
    observation, reward, done, info = env.step(0)  # 0 means stay in the same place (do nothing)
plt.imshow(observation)
# 2 is the numeric action to move the paddle up
for i in range(20):
    observation2, reward, done, info = env.step(2)  # 2 means go up
plt.imshow(observation2)
# 3 is the numeric action to move the paddle down
for i in range(20):
    observation3, reward, done, info = env.step(3)  # 3 means go down
plt.imshow(observation3)
In the game, we control the green paddle on the right side. When the ball passes our paddle and reaches the right edge, we get a reward of -1 for losing the point. If the ball crosses the opponent's paddle and reaches the left edge, we get a reward of +1 for winning the point. The game finishes when one of the players reaches 21 points.
So the definition of the system in the reinforcement learning framework is clear.
State: the game screen.
Action: go up, go down, or stay still.
Goal: maximize the total reward.
We warp frames to 84x84 images, as done in the later work, and rescale the pixel values.
Next we define a policy network whose architecture takes an 84x84 image as input and chooses an action.
We use three convolutional layers and two fully connected layers to construct the DQN model.
class Pong(torch.nn.Module):
    def __init__(self, action_num):
        super(Pong, self).__init__()
        # Three convolutional layers
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU()
        )
        # Two fully connected layers
        self.fc = nn.Sequential(
            nn.Linear(self.feature_size(), 512),
            nn.ReLU(),
            nn.Linear(512, action_num)
        )

    def forward(self, x):
        # Evaluate the network and return the action values
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

    def feature_size(self):
        # Return the flattened size of the last convolutional layer
        with torch.no_grad():
            feature_num = self.conv(torch.zeros(1, 1, 84, 84)).view(1, -1).size(1)
        return feature_num

    def action(self, state):
        # Choose the action with the maximum Q-value
        state = to_tensor(state)
        if USE_CUDA:
            state = state.cuda()
        q_values = self.forward(state)
        return q_values.max(1)[1].data[0]
You can define any network model you want. Depending on the architecture and complexity of your model, it may train slowly or quickly, or even not train at all (too hard to train).
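As a rough, optional sanity check on model complexity (not part of the original training pipeline), you can count the trainable parameters of the network:
# Optional: count trainable parameters to gauge model size
_tmp_model = Pong(6)
print(sum(p.numel() for p in _tmp_model.parameters() if p.requires_grad))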
Pong(6) # show the structure of the network
The loss is the squared error ${(Q_{\text{value}} - Q_{\text{expected}})}^{2}$, where $Q_{\text{expected}}$ is the bootstrapped target, as in the code below. We use this loss to compute gradients and optimize our DQN model.
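In symbols, and matching the variable names `q_value`, `expected_q_value`, `reward`, and `done` used in `train()` below:
$$Q_{\text{expected}} = r + \gamma \,(1 - \text{done}) \max_{a'} Q(s', a'), \qquad \mathcal{L} = \frac{1}{B}\sum_{i=1}^{B}\big(Q(s_i, a_i) - Q_{\text{expected},i}\big)^2$$
where $B$ is the batch size and the $(1 - \text{done})$ factor zeroes out the bootstrap term at terminal states.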
def train():
    # Train begin
    # Sample a batch of transitions from the replay buffer
    state, action, reward, next_state, done = zip(*random.sample(replay_buffer, batch_size))
    state = np.concatenate(state)
    next_state = np.concatenate(next_state)
    # Calculate loss
    # Convert the data to tensors
    state = torch.Tensor(state)
    next_state = torch.Tensor(next_state)
    action = torch.LongTensor(action)
    reward = torch.Tensor(reward)
    done = torch.Tensor(done)
    # Use CUDA
    if USE_CUDA:
        state = state.cuda()
        next_state = next_state.cuda()
        action = action.cuda()
        reward = reward.cuda()
        done = done.cuda()
    # Calculate Q values with the model
    q_values = model(state)
    next_q_values = model(next_state)
    q_value = q_values.gather(1, action.unsqueeze(1)).squeeze(1)
    next_q_value = next_q_values.max(1)[0]
    # Set expected_q_value (the bootstrap target); detach it so gradients
    # do not flow through the target
    expected_q_value = (reward + gamma * next_q_value * (1 - done)).detach()
    # Set loss
    loss = (q_value - expected_q_value).pow(2).mean()
    # Clear grad
    optimizer.zero_grad()
    # Backward
    loss.backward()
    # Grad step
    optimizer.step()
    # Train end
Below are some hyperparameters and basic parameters that will be used in the training process.
Batch_size: the number of transitions sampled from the replay buffer for each gradient update of the network.
Gamma: the discount factor that controls how strongly future rewards affect the value of the current action.
Epsilon_begin: the exploration rate at the beginning of training.
Epsilon_end: the exploration rate at the end of training.
Epsilon_decay: how quickly the exploration rate decreases (the decay schedule is illustrated right after this list).
Replay_buffer_size: the maximum number of transitions our buffer can hold. If the buffer grows beyond this size, the oldest transitions are dropped to make room for new ones.
Replay_initial: the minimum number of transitions that must be in the buffer before we start training from it.
Learning_rate: the rate at which we update the weights from each batch of results. A higher rate means we react more strongly to each result, and a lower rate means we react less strongly.
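As an optional illustration (not part of the original notebook), you can plot the epsilon schedule that `epsilon_by_frame` will define below, using the same constants (epsilon_begin = 1, epsilon_end = 0.02, epsilon_decay = 30000), to see how exploration decays over frames:
# Optional: visualize the exponential epsilon decay used during training
import math
import matplotlib.pyplot as plt

eps_begin, eps_end, eps_decay = 1.0, 0.02, 30000
frames = range(0, 200000, 100)
eps_curve = [eps_end + (eps_begin - eps_end) * math.exp(-1. * f / eps_decay) for f in frames]
plt.plot(frames, eps_curve)
plt.xlabel('frame')
plt.ylabel('epsilon')
plt.title('Exploration rate decay')
plt.show()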
logger = logging.getLogger()
# game environment id
env_id = 'PongNoFrameskip-v4'
# env_id = 'Pong-v0'
round = 0
save_every_n_round = 100
render = False
one_round_max_frame = 10000
# Adam optimizer learning rate
learning_rate = 0.00001
# exploration rates at beginning and at the end
epsilon_begin = 1
epsilon_end = 0.02
# Set experience replay buffer size
replay_buffer_size = 100000
replay_initial = 10000
# Set batch size
batch_size = 32
# Set discount
gamma = 0.99
# Output the model
model_file = "batch_"+str(batch_size)+"_pong.model"
file_handler = logging.FileHandler(str(batch_size)+".log")
file_handler.setLevel(logging.INFO)
logger.setLevel(logging.INFO)
logger.addHandler(file_handler)
def to_tensor(I):
    # Add a batch dimension: (1, 84, 84) -> (1, 1, 84, 84)
    return torch.Tensor(I).unsqueeze(0)
The crux of our algorithm lives in a loop where we continually make a move and then learn from the result of that move. Here we loop over a fixed number of rounds (2000); in practice you might use a `while True:` loop with a break condition to stop the process.
The first step to our algorithm is preprocessing the image of the game that OpenAI Gym passed to us.
In the env_wrapper code, we convert the frames from RGB to grayscale, because color is not important to us and grayscale frames are easier to train on. In addition, we resize each frame from 210x160x3 to 1x84x84.
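For intuition, the preprocessing inside the wrappers roughly amounts to the following sketch (the actual wrap_deepmind implementation may differ in details such as frame max-pooling and reward clipping; cv2 is assumed to be available):
# Rough sketch of the per-frame preprocessing performed by the wrappers
import cv2
import numpy as np

def preprocess(frame):
    # RGB frame of shape (210, 160, 3) -> grayscale -> 84x84 -> channel-first (1, 84, 84)
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)
    return small[np.newaxis, :, :].astype(np.float32) / 255.0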
The following cell may take a very long time to run, so be patient, or use the pre-trained model for testing instead. Uncommenting the "break" skips the training loop.
# Replay buffer: stores (s, a, r, s', done) tuples and lets us sample random batches as needed
replay_buffer = deque(maxlen=replay_buffer_size)
# Create the Atari environment (frames are skipped 4 at a time at every step)
env = make_atari(env_id)
# Configure environment for DeepMind-style Atari
env = wrap_deepmind(env)
# Set the environment that can be used in pytorch
env = wrap_pytorch(env)
model = Pong(env.action_space.n)
if USE_CUDA:
model = model.cuda()
# Train step
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# Set the initial exploration rate
epsilon = epsilon_begin
epsilon_decay = 30000
# Decay exploration from 1 to 0.02
import math
epsilon_by_frame = lambda frame_idx: epsilon_end + (epsilon_begin - epsilon_end) * math.exp(-1. * frame_idx / epsilon_decay)
frame_seq = 0
round_reward = 0
# while True:
for _ in range(2000):
    # break
    # reset environment
    state = env.reset()
    # state = pre_process(state)
    # state = np.concatenate([state, state])
    round_over = False
    # Run until the game round is over
    while not round_over:
        # Set exploration rate
        epsilon = epsilon_by_frame(frame_seq)
        # behave according to an epsilon-greedy policy
        action = env.action_space.sample()
        if random.random() > epsilon:
            action = model.action(state)
        # agent takes the action, and the environment responds
        next_state, reward, done, _ = env.step(action)
        round_over = done
        # next_state = pre_process(next_state)
        # next_state = np.concatenate([state[80:], next_state])
        # push the transition to the replay buffer
        s = np.expand_dims(state, 0)
        ns = np.expand_dims(next_state, 0)
        replay_buffer.append((s, action, reward, ns, done))
        # update the state
        state = next_state
        frame_seq += 1
        # print(frame_seq)
        # Gather the rewards
        round_reward += reward
        # Start training once there is enough experience in the buffer
        if len(replay_buffer) > replay_initial:
            train()
        if frame_seq % 10000 == 0:
            # print("saving model", frame_seq)
            logger.info("current round: " + str(round) + ", saving model")
            # from datetime import datetime
            # start = datetime.now()
            torch.save(model, model_file)
            # print(datetime.now() - start)
    # Calculate round number
    round += 1
    # print("round over", round_reward)
    logger.info("round reward: " + str(round_reward))
    round_reward = 0
During training we save the model to a file and log the round rewards from time to time. The script below reads those rewards back from the log file.
# Read the log file
def read_log(filename):
    results = []
    with open(filename) as f:
        for line in f:
            if line.startswith("round reward"):
                reward = line.replace("\n", "").replace("\r", "").replace("round reward: ", "")
                results.append(float(reward))
    return results
# Plot the result
plt.plot(read_log("32.log")) # reward of the train process
plt.xlabel('Round')
plt.ylabel('total reward per round')
plt.title('DQN Pong game q-learning (training)')
plt.show()
We can see that the agent has learned some tricks to defend and even win. From the figure above, we can see that at the beginning the total reward per episode is around -20, which means the agent is losing the game. The reason is that with a relatively large epsilon, the agent is likely to explore many possibilities. However, as the number of game rounds increases, the rewards grow. After around 1750 rounds with batch size 32, the rewards turn positive and even reach 20, which means the agent begins to win. This can be explained by the decay of epsilon: the agent becomes prone to exploit rather than explore, leading to higher rewards.
Let's check out the model we trained to see whether it can beat the hard coded AI or not.
# Load the model
pong = torch.load("batch_32_pong.model")
# Set test rounds
test_rounds = 100
win_num = 0
score = []
current = 0
for round in range(test_rounds):
    # reset the environment
    state = env.reset()
    current = 0
    round_over = False
    # Run until the round is over
    while not round_over:
        # env.render()
        # Act greedily
        action = pong.action(state)
        # agent takes the action, and the environment responds
        next_state, reward, round_over, _ = env.step(action)
        # Accumulate the round reward
        current += reward
        # Update state
        state = next_state
    # Count a victory (the final point decides who reached 21)
    if reward > 0:
        win_num += 1
    score.append(current)
# Plot the result
plt.plot(score)
plt.xlabel('Round')
plt.ylabel('total reward per round')
plt.title('DQN Pong game q-learning (test)')
plt.ylim((-21,21))
plt.show()
# Print the winning rate
print("we win at rate ", win_num / test_rounds)
Now we can see that the total reward per round is always above 0, which means we win the game every time. However, the mean reward is around 15, which means we cannot hit the ball back every time.
# Let us view the game
# Play the mp4 file
import io
import base64
from IPython.display import HTML
video = io.open('test.mp4', 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<video alt="test" controls>
<source src="data:video/mp4;base64,{0}" type="video/mp4" />
</video>'''.format(encoded.decode('ascii')))
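The test.mp4 file is assumed to have been recorded beforehand. One possible way to produce such a recording (a sketch using the old gym Monitor wrapper, which matches the gym API used in this notebook) is:
# Sketch: record one greedy episode to an mp4 with gym's Monitor wrapper
from gym import wrappers

record_env = wrappers.Monitor(env, "./video", force=True)
state = record_env.reset()
round_over = False
while not round_over:
    action = pong.action(state)
    state, reward, round_over, _ = record_env.step(action)
record_env.close()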
There are some limitations:
1. The training process is very long. Even with a GPU, it takes a whole day to train the model.
2. The score is not as high as we expected. The mean score of our model does not reach 21, which means the trained agent does not hit the ball back every time.
Here are some ideas to improve the training process and agents:
1. Generate batches of simulations in parallel. I tried using multithreading to generate and play batches of games at the same time to speed up training. The problem was that the backend library for the game had issues creating environments in multiple threads. You may be able to overcome this with a multiprocess approach instead.
2. Giving the agent a small reward for catching the ball (for example +0.5) could help it avoid losing; see the sketch after this list. Right now we only give the agent a positive reward for scoring, so if it catches the ball and then receives a -1 reward, it may conclude that catching the ball is bad.
3. Use a different gamma to train the model. In our model we use 0.99 as gamma; trying other values, for example 0.9, may improve the training speed.
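As a sketch of idea 2, a gym reward-shaping wrapper could add the bonus on top of the environment's reward. The detect_catch check below is a hypothetical placeholder; detecting a paddle-ball contact from raw pixels (or from RAM) would still have to be implemented.
import gym

class CatchBonusWrapper(gym.Wrapper):
    # Sketch: add a small bonus reward whenever the agent "catches" the ball
    def __init__(self, env, bonus=0.5):
        super(CatchBonusWrapper, self).__init__(env)
        self.bonus = bonus

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if self.detect_catch(obs):        # hypothetical detector, not implemented here
            reward += self.bonus
        return obs, reward, done, info

    def detect_catch(self, obs):
        # Placeholder: would need to track ball and paddle positions in the frame
        return False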