Deep Reinforcement Learning

Tuesday November 14, 2017

This is a presentation given for Data Science DC on Tuesday November 14, 2017.

Further resources up front:

The following goes through all the content of the talk:

Aaron Schumacher

Hi! I'm Aaron. All these slides, and a corresponding write-up (you're reading it) are on my blog (which you're on).

Deep Learning Analytics

I work at Deep Learning Analytics (DLA). We do deep learning work for government and commercial customers. DLA is a great place to work, and one of the ways it's great is that it sends us to conferences and such things.

Deep RL Bootcamp

DLA sent me to the first UC Berkeley Deep RL Bootcamp organized by Abbeel, Duan, Chen, and Karpathy. It was a great experience and it largely inspired this talk.

I have a separate summary write-up about my experience at the bootcamp, and they've since put up all the videos, slides, and labs, so you can see everything that was covered there.

Sutton and Barto's textbook

The other major source for this talk is Sutton and Barto's textbook, which I like a lot.

The picture shows the first edition, which is not what you want. The second edition is available free online, and was last updated about a week ago (November 5, 2017).

Sutton and Barto are major figures in reinforcement learning, and they do not follow any no original research rules, making their book really fairly exciting, if you're not put off by the length (over 400 pages).

(The diagrams on the cover are not neural nets, but backup diagrams.)


The plan for today is to first mention four successful applications of reinforcement learning. Then we'll go through a core of theory. This will let us then understand pretty completely how each of those applications is achieved. Finally, we'll wrap up, looking at a few other applications and thoughts about how things are going.

applications: what

The applications here are all going to be games, not because reinforcement learning is only applicable to games, but because games are fun, and these examples are well known and cover a good range of techniques.


First up, backgammon.

Atari Breakout

Next, Atari. A lot of Atari games are well played by RL now. The ones shown (Video Pinball, Boxing, Breakout) are some of the ones that RL does the best on.


I'm also including Tetris, mostly because it's a chance to talk about an interesting technique.


And in the last two years, Go has been pretty much conquered by RL, so we'll talk about that.


Let's start to build up the theory of reinforcement learning.

This is going to start very gradually, but I promise that by the end we'll be moving fast.

Yann LeCun's cake


Yann LeCun introduced this cake idea for relating three main varieties of machine learning. It's largely based on one view of how much information is used at each training step.

I'm going to use it to build up and relate these three kinds of learning, while introducing reinforcement learning notation.

unsupervised learning

In unsupervised learning, we have a collection of states, where each individual state can be referred to with \( s \).

I'm using "state" without distinguishing "state" from "observation". You could also call these "examples" or "covariates" or "data points" or whatever you like.

The symbol "x" is commonly used.

state \( s \)

States can be anything as long as it can be expressed numerically. So that includes text, images, and sound. Really anything.

unsupervised (?) learning

So say we have a set of a thousand images. Each image is an \( s \).

We want to learn something, and that tends to mean unsupervised learning starts to resemble supervised learning.

At two ends of a spectrum we have clustering and autoencoders, and all kinds of dimensionality reduction in between.

Unsupervised learning is sort of the dark matter of machine learning. Even Yann LeCun says "We are missing the principles for unsupervised learning."

deep unsupervised learning

Deep unsupervised learning is whenever we do unsupervised learning and somewhere there's a deep neural net.

supervised learning

Up next is supervised learning. We're introducing a new entity \( a \), which I'll call an "action". It's common to call it a "label" or a "target" and to use the symbol "y". Same thing.

action a

Whatever you call it, the action is again a numeric thing. It could be anything that \( s \) could be, but it tends to be lower-dimensional.

The cat/dog classifier is a popular example, and a left/right classifier is just the same, but those might feel more like actions.

supervised learning

In supervised learning you have a training set of state-action pairs, and you try to learn a function to produce the correct action based on the state alone.

Supervised learning can blur into imitation learning, which can be taken as a kind of reinforcement learning. For example, NVIDIA's end-to-end self-driving car is based on learning human driving behaviors. (Sergey Levine explains in some depth.) But I'm not going to talk more about imitation learning, and supervised learning will stand alone.

You can learn this function with linear regression or support vector machines or whatever you like.

deep supervised learning

Deep supervised learning is whenever we do supervised learning and somewhere there's a deep neural net.

There's also semi-supervised learning, when you have some labeled data and some unlabeled data, which can connect to active learning, which has some relation to reinforcement learning, but that's all I'll say about that.

reinforcement learning

Finally, we reach reinforcement learning.

We're adding a new thing \( r \), which is reward.

reward \( r \)

Reward is a scalar, and we like positive rewards.

optimal control / reinforcement learning

I'll mention that optimal control is closely related to reinforcement learning. It has its own parallel notation and conventions, and I'm going to ignore all that.

\( r, s \rightarrow a \)

So here's the reinforcement learning setting.

We get a reward and a state, and the agent chooses an action.


Then, time passes. We're using discrete time, so this is a "tick".

\( r', s' \rightarrow a' \)

Then we get a new reward and state, which depend on the previous state and action, and the agent chooses a new action.


And so on.

standard diagram

This is the standard reinforcement learning diagram, showing the agent and environment. My notation is similar.

reinforcement learning

So here's reinforcement learning.

"Given" is in quotes because the rewards and states that we see depend on the actions we choose.

"Good" is in quotes because we haven't defined "good" beyond that we like positive rewards.


Can you formulate supervised learning as reinforcement learning?

Here's a question to think about.

supervised as reinforcement?

You can re-cast supervised learning as reinforcement learning, but it isn't necessarily as efficient: it's better to be told what the right answer is than to just be told that you're wrong. This is one sense in which Yann LeCun means that reinforcement learning is using fewer bits of information per example.


Why is the Sarsa algorithm called Sarsa?

Here's a question that you can already figure out!

(Sarsa is an on-policy TD control algorithm due to Sutton, 1996.)

reinforcement learning

Sarsa: State, Action, Reward, State, Action.

We'll see the closely related Q-learning algorithm in some detail later on.

deep reinforcement learning

Deep reinforcement learning is whenever we do reinforcement learning and somewhere there's a deep neural net.

The one twist is that in reinforcement learning, the network(s) may not be in the obvious place. We'll need to develop a few more ideas to see more places to put a neural net.

Markov property

We're going to assume the Markov property, which means that the state we observe tells us as much as we can know about what will happen next.

This is frequently not true, but we'll assume it anyway.

                   Make choices?
                   no                        yes
Completely    no   Markov chain              MDP: Markov Decision Process
observable?   yes  HMM: Hidden Markov Model  POMDP: Partially Observable MDP

This chart relates a bunch of Markov things, to hopefully help you orient yourself.

We're working with Markov Decision Processes (MDPs). Even though things are frequently not really completely observable, we'll usually just ignore that. Often this is fine (think about blackjack, for example).

It's often easiest to think about finite spaces, but this isn't always necessary.

(The chart is from

usual RL problem elaboration

We'll introduce some more notation and a few ideas that help us work on reinforcement learning problems.

policy \( \pi \)

This thing that takes a state and produces an action we'll call a "policy" \( \pi \). This effectively is our reinforcement learning agent, and the name of the game is figuring out how to get a good policy.

But how do we even know whether a policy is good?

(If you don't like spelling in Greek, you can imagine it's a little walker!)


We want a policy that maximizes "return" \( \sum{r} \). Return is just the sum of all the rewards we'll get into the future. The symbol "G" is also used, but I'll keep writing the sum.

Reinforcement learning can consider finite-length episodes that always come to an end, or the ongoing setting in which the agent keeps on going. (You can see that there could be a problem with infinity in the ongoing setting.)


Here's a typical toy example for understanding reinforcement learning: a gridworld.

There are sixteen states, each uniquely identifiable to the agent.

There are four actions: go up, down, left, right.

We want the agent to learn to go from the start to the goal.

We'll say this is episodic, with each episode ending after the agent makes it to the goal.

So how do we set up the rewards?

gridworld reward

Here's a natural way to set up reward. You get a reward of one when you complete the task. What does the return at each state look like, then?

gridworld return

The return, unfortunately, is just one everywhere. And this means that there isn't any reason to go directly to the goal. We have all the time in the world, so we could just as well meander randomly until we happen to hit the goal. This isn't very good.


How do we keep our agent from dawdling?

How can we set things up so that going directly to the goal position is incentivized?

Consider changing rewards, or changing how return is calculated, or anything else.

There are at least three approaches that are commonly used.

negative reward at each time step

If we get a negative reward at each time step, the only way to maximize return is to minimize episode length, so we're motivated to end the episode as soon as possible. This certainly makes sense for the episodic setting.

(It's like existing is pain; the agent is like Mr. Meeseeks!)

discounted rewards

It's very common to use a "discount factor" \( \gamma \), which might be 0.9, for example. Then a reward soon is worth more than a reward later, and again the agent is motivated to do things more directly.

average rate of reward

Average rate of reward is also used in some settings, where it can make things more stable.

value functions

I already snuck in a value function when we saw green values for every state in the gridworld.

The state value function \( v \) tells us the return from any state, and the state-action value function \( q \) gives the return from any state-action combination. It might not yet be clear why the two are importantly different, so let's get into that.


Does a value function specify a policy (if you want to maximize return)?

Here's the relevant question to consider.

trick question?

Is it a trick question? Really, values depend on what you do, on your policy. But let's assume I just give you a value function and you trust it. Can you use it to induce a policy?

value functions

The state-action value function \( q \) certainly implies a policy. For whatever state we're in, we check the return for all available actions, and choose the action that's best.

For the state value function \( v \), we can get a policy in a similar way, but only if we know what state we'll be in after taking an action. That may or may not be true.

environment dynamics

The next state and reward depend on the previous state and action, and that's determined by the environment. We don't necessarily know what's going on inside the environment.


When you, a smart human with lots of ideas about physics, look at this picture of a gridworld, you assume that going "right" will do what you expect.

But in general, we don't know in advance how the environment behaves. Maybe going "right" always takes you to the upper left state, for example.

Sometimes you do know the environment dynamics, like with games with well-defined rules. But sometimes you don't. You could try to learn the environment dynamics with a model.


Reinforcement learning uses "model" to refer to a model that you learn for the environment dynamics. (Often, just the next state is predicted.)

This model is something that the agent has to learn, and depending on whether the agent is learning a model determines whether you're said to be doing model-based vs. model-free reinforcement learning.

(In adaptive control this "model learning" is called "system identification".)

learning and acting

Everything to this point has been elaborating the problem setup for reinforcement learning. Now we get into how we actually learn and go forth with RL.

learn model of dynamics

As mentioned, you could try to learn your environment's dynamics. If you can do this well, it's great, but it may not be easy. Model-based RL is an exciting area of research.

learn value function(s) with planning

To introduce value function learning, let's consider the situation where you know nothing yet but you have the environment dynamics, so you can make decisions by looking ahead. This is a familiar thought process, conceptually.

planning 0

You're in state \( s_1 \) and you have to choose between \( a_1 \) and \( a_2 \).

Because you have the environment dynamics, you can see what would happen next.

planning 1

Looking one time step ahead, you can already start to evaluate your action choices by the rewards you'd get immediately. This is the state-action value view, \( q \): you're evaluating the value of the actions directly.

planning 2

Looking two time steps ahead, you can similarly start to evaluate how good the states you would get to are. This is the state value view, \( v \).

This example is simple and assumes states and actions never recur. We also haven't introduced a couple more wrinkles. You can think about differences between thinking in terms of \( v \) and \( q \) later, for example in the case of backgammon.

This kind of planning can happen continuously in the background, or it can be done on demand at decision time. The search space can get quite large.


The graph structure on the previous slide might make you think of a range of algorithms that you could already be familiar with.

You can think of these as smart ways of exploring the possibly very large branching structures that can spring up. Monte Carlo tree search is particularly clever, and it will appear again later.

rollout diagram

One way of dealing with all the possible tree paths is just to sample from it and try to make estimates on the basis of these.

backgammon roll

I thought it was interesting to learn that the term "roll-out" comes from work on backgammon, where you literally roll dice to move a game along.

everything is stochastic

Backgammon has randomness in the environment, from the dice, but randomness can enter all over the place.

stochastic s'

The next state can be random.

stochastic r'

And the next reward can be random, as in the case of multi-armed bandits.

exploration vs. exploitation

This can always be an issue in unknown environments, but especially with randomness, we encounter the issue of exploration vs. exploitation. The \( \epsilon \)-greedy approach is one well-known way to ensure that an agent keeps exploring.


How does randomness affect \( \pi \), \( v \), and \( q \)?

Now acknowledging that there's randomness all over the place, I can show another way that my notation is shorthand.


How does randomness affect \( \pi \), \( v \), and \( q \)?

I haven't been hiding too much. A policy is a probability distribution over possible actions, and value functions give expectations. No problem.

Monte Carlo returns

Here's our first pass at a model-free learning algorithm. We interact with the environment, and use our experiences in the past just like model-based roll-outs.


The environment may change, and we want our methods to be able to deal with this.

moving average

This kind of moving average is used all over the place.

The parameter \( \alpha \) is a learning rate, and the whole thing can be seen to be a case of stochastic gradient descent, which is a nice early connection to neural nets which start turning up.

(Sutton and Barto use \( \alpha \) for learning rate, not the otherwise popular \( \eta \). This is a sensible choice for them especially since they sometimes have multiple learning rates in the same algorithm, and the second learning rate can then be \( \beta \).)

MDP diagram

We can think about this question, which is not a trick: the difference is \( r' \).

Bellman equation

That relation gives rise to a bunch of Bellman equations, which are quite useful.

Bellman came up with a couple neat things. Dynamic programming (despite being poorly named) is great for lots of problems, including those involving Bellman equations.

Bellman also introduced the phrase "curse of dimensionality".

temporal difference (TD)

Bellman equations give rise to temporal difference (TD) learning (Sutton, 1988).

It's sometimes described as "bootstrap" learning.


Q-learning is a temporal difference method that combines a lot of the things we've just seen.

You can see in the equation that Sarsa is closely related.

on-policy / off-policy

Reinforcement learning differentiates between on-policy and off-policy learning. On-policy learning is when you're learning about the policy that you're currently following, and it's generally easier than off-policy learning.

Importance sampling is one way to achieve off-policy learning.

Q-learning is off-policy, which is nice. You can look at the equation and see why.

estimate \( v \), \( q \)

We've sort of implicitly been doing what's called tabular learning, where each state or state-action has its own estimated return. But we can plug in any kind of supervised learning algorithm, including deep learning. We'll see this in applications.

back to the \( \pi \)

Stepping back from value functions, we can work with a parameterized policy directly.

REINFORCE is not necessarily a great acronym.

policy gradient

This is the gradient that you use.

You may see the connection to score functions and the so-called log-derivative trick, but we can interpret without all that.

If an action contributes to high return, you can encourage it, and if it contributes to low return, you can discourage it.


Actor-critic algorithms combine policy gradient and value function methods to reduce variance, and are pretty popular.

applications: how

We now know enough to understand a lot of the details of big reinforcement learning applications!

TD-gammon (1992)

Backgammon was dealt with pretty completely about 25 years ago already.

We know what state the board will be in after any move, and the state of the board is what matters, so it makes sense to use a state value function \( v \) here.

DQN (2015)

Deep Q-Networks (DQN) is a well-known algorithm that works well for many Atari games. See:

evolution (2006)

Surprise! Tetris is well-solved by evolutionary methods, which I didn't mention at all. These methods are sneaky that way. They can work better than you'd think. See:

AlphaGo (2016)

AlphaGo surprised a lot of people by reaching super-human levels of Go play. It was a bit complicated. See:

AlphaGo Zero (2017)

AlphaGo Zero improved on AlphaGo in performance and by requiring no human training data. It reaches incredibly good performance with only its self-play learning method. See:

(AlphaGo Zero can only play legal moves, but stupid moves (like filling eyes) aren't forbidden, so it is less constrained than the original AlphaGo.)


Let's look at some more applications to paint out more of the area that reinforcement learning can influence.

Neural Architecture Search (NAS)

Neural Architecture Search (NAS) uses policy gradient where the actions design a neural net. The reward is validation set performance. See:

Amazon Echo

A lot of people are interested in natural language processing. A lot of chatbots are still big nests of if statements, but there is interest in using RL for language. See:

(In particular in the multi-agent setting, there might be some connection to evolutionary game theory and the work of Peyton Young.)


Lots of people think using RL for robot control is pretty neat.

(left to right: Chelsea Finn, Pieter Abbeel, PR2, Trevor Darrell, Sergey Levine)

Pieter Abbeel, Peter Chen, Rocky Duan, and Tianhao Zhang founded Embodied Intelligence to work on robot stuff, it seems.

self-driving cars

People seem to want to use RL for self-driving cars.

OpenAI DotA 2

OpenAI has beaten top human players in one-on-one DotA 2. This is pretty neat. They haven't released details of their methods as they say they're working on the full five-on-five game. We'll see!


Wrapping up!

bootcamp diagram

There are a lot of RL algorithms with horrible acronyms.

reinforcement learning

But the core ideas are pretty concise, and lots of work in the field can be quickly understood to a first approximation by relating it back to these core ideas.

network architectures for deep RL

Lots of exciting developments in RL are coming from the use of deep neural nets, and I wanted to say a few things specific to doing these applications of deep learning.

the lure and limits of RL

There's progress in RL now. The limits of that progress are not yet known.

One hot take from Denny Britz: "Ironically, the major advances in RL over the past few years all boil down to making RL look less like RL and more like supervised learning."


It's for you to decide!

Thank you!

Thank you for coming!

Thanks again to DLA for supporting me in learning more about reinforcement learning, and in being the audience for the earliest version of this talk, providing lots of valuable feedback.

Thanks to the DC Deep Learning Working Group for working through the second iteration of this talk with me, providing even more good feedback.

Thanks to Eric Haengel for help with Go and in thinking about the AlphaGo Zero paper.

further resources

Here are my top resources for learning more about deep reinforcement learning (and having a little fun).

The content of A Brief Survey of Deep Reinforcement Learning is similar to this talk. It does a quick intro to a lot of deep reinforcement learning.

Karpathy's Pong from Pixels is a readable introduction as well, focused on policy gradient, with readable code.

If you can afford the time, I recommend all of Sutton and Barto's Reinforcement Learning: An Introduction. The authors are two major reinforcement learning pioneers, to the extent that they sometimes introduce new concepts in the pages of their textbook.

David Silver's course is quite good. It largely follows Sutton and Barto's text, and also references Csaba Szepesvári's Algorithms for Reinforcement Learning.

The Deep Reinforcement Learning Bootcamp that inspired this talk has lots of materials online, so you can get the full treatment if you like!

If you want to get into code, OpenAI's gym and baselines could be nice places to get started.

If you want to have fun, you can play Go with humans at the National Go Center, and I always recommend the Hack and Tell meetup for a good time nerding out with people over a range of fun projects.

Juergen Schmidhuber's Deep Learning in Neural Networks: An Overview didn't make my top list, but it has an interesting section specific to using LSTMs in RL.