>> The final session is again for speakers. The first speaker is Shane Gu, who will be talking about Temporal Difference Models: Deep Model-free RL for Model-based RL.
>> Okay. My name is Shane. I am a PhD student of Rich, but this work was done during my internship at Google, so it doesn't involve collaborators from Cambridge. But hopefully it's also new material for people from the lab. Today I'll talk about this "Model-free RL for Model-based RL" and I'll explain what that means.
I think it's good to first celebrate the success of model-free deep reinforcement learning. Model-free reinforcement learning is a very general framework for solving sequential decision-making problems, and thanks to EY, who already introduced the notation, I am not going to go through it again. It's a very general framework we can use to try to learn various kinds of tasks, and it has been shown to be very successful in solving difficult tasks. The thing that's really nice about it is that it's very general and universal compared to, for example, model-based algorithms.
Let me briefly explain model-based versus model-free. Model-based is basically: you learn the dynamics of the environment, and then you can use that model to roll out, and the optimization at that point basically becomes something like backpropagation through the model to improve the policy. Model-free methods basically don't learn this kind of dynamics information and just directly try to maximize the reward, assuming only the ability to sample from the dynamics.
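As a rough way to write down the two settings (my own notation, not from the slides): model-based methods fit a dynamics model with supervised learning and then optimize actions through it, while model-free methods optimize the expected return directly from sampled rollouts.

```latex
% Model-based (sketch): supervised dynamics learning, then planning through the model.
\hat{f} = \arg\min_{f}\; \mathbb{E}_{(s_t, a_t, s_{t+1})}\big[\, \| f(s_t, a_t) - s_{t+1} \|^2 \,\big],
\qquad
\max_{a_{1:T}} \sum_{t=1}^{T} r(\hat{s}_t, a_t), \quad \hat{s}_{t+1} = \hat{f}(\hat{s}_t, a_t).

% Model-free (sketch): maximize expected return using only sampled transitions.
\max_{\theta}\; \mathbb{E}_{\pi_\theta}\Big[ \sum_{t=1}^{T} \gamma^{\,t-1} r(s_t, a_t) \Big].
```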
So that's the success, but there are also limitations to that success. The first is that a lot of those model-free algorithms are very sample intensive. They can solve problems in simulation, like these amazing parkour movements, but if you really want to train a real robot to do that, it's almost impossible. Also, for some of the continuous control benchmarks we thought were difficult, it turns out we can actually solve them reasonably well without deep networks, with linear policies or just some kind of random features. And probably the worst part is that maybe the deep RL machinery is unnecessary: there's a bunch of recent work showing that evolutionary methods or some kind of random search can also solve a lot of the previously difficult continuous control problems.
So what is the potential cause of these kinds of problems? I think it's really that the classic way model-free RL is defined lacks rich learning signals that it can leverage. So I'm going to start by showing, I guess, four ways of solving a reinforcement learning problem.
The graphic is laid out like this. You can now see... Okay, sorry, there are some boxes, but never mind. Basically, one axis shows how much data each of the methods takes, and the other axis shows how general or complex the tasks are that each branch of methods can solve. First we have the model-based methods. Because they use supervised learning to learn the model, when they can solve a task they are generally very sample efficient, but there's a limit to how difficult the task can be; so, for example, they can solve up to here, some level of difficulty. Off-policy stuff is of course more general, so it can solve up to here, on-policy stuff can solve even harder tasks, and the evolutionary stuff can solve even more. But each time a method moves to the right, it takes, say, 10 or 100 times more samples to solve a problem of similar difficulty. So ideally a good direction to explore is to try to reach the top left: we want to solve real, difficult problems, tasks as hard as possible, but also be very sample efficient.
My prior work has also focused on bridging the gaps between those branches of reinforcement learning. Often what we can do is improve the stability and sample efficiency of those methods, and this talk also falls into that corner, bridging the gap between model-based and off-policy model-free methods.
To give a bit more motivation, I'm going to call this the curse of "cherries", because it's been haunting me for the past two years, ever since Yann LeCun introduced his slide with the cake and the cherry. He basically said that model-free RL is just the cherry, and the model-based part, which he called unsupervised learning, is the cake. What that means is that the more learning signal we can leverage, the better we can make use of deep neural networks.
To show this more concretely, consider the data you collect. You have different policies, which hopefully improve over time, and for each policy you collect a number of trajectories by sampling from that policy. Evolutionary methods basically don't use any of the previous experience; they collapse all the trajectories from a single policy into just one number, and I guess you can add some kind of random local perturbations as well. But basically very little signal gets out of all that rich experience. On-policy methods are a bit better: they ditch the older previous policies' experience, but they can use the rewards coming from each of the rollout trajectories. Off-policy model-free is better in that it can use the rewards also from older previous policies. But what's really nice is the model-based approach, which, by learning the dynamics, basically leverages all the information from any of the previous policies and all the information in the state, action, reward transitions. And this often shows in the results: the general trend, when those methods work, is that the sample efficiency is very good.
So in this talk, I'll mostly talk about how we can do something similar in a model-free way, so that we can solve harder tasks by leveraging as much or even more learning signal than model-based methods. To summarize the problem statement: how can we increase the amount of learning signal in model-free RL? There is some prior work on this, say auxiliary tasks, where on top of the main task you also predict things like pixels or rewards. There is also the universal value function and Hindsight Experience Replay (HER), which I'll explain next; they connect to the work I present now. And there is also the question of whether we can leverage as much or more learning signal than model-based methods while still keeping the usual convergence properties of the model-free algorithm.
Let me briefly explain Hindsight Experience Replay. It's basically an off-policy Q-learning algorithm where they define a parameterized sparse reward function. So now the reward isn't just a function of state, action, and next state; it also takes in a parameter, which is the goal you're trying to reach, and the reward is given only if you reach the goal within some threshold. If you learn the Q-function for this, then the Q-function is basically something like the probability of reaching a given goal state from your current state. What they do is this: often in a sparse problem, if you just run some policy, it doesn't reach the goal you wanted to reach, but it does go along some trajectory. What HER does is take all the future states the trajectory arrived at and treat those as if they were the goals you were trying to reach from the state at time t. So basically, in hindsight, you're relabeling the goal parameters and doing off-policy learning. And this has been shown to solve some of the difficult sparse reward problems, as in this video.
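Concretely, the goal-parameterized sparse reward and the hindsight relabeling look roughly like this (a reconstruction of the HER setup as I understand it, not the exact slide):

```latex
% Goal-parameterized sparse reward: 1 only if the next state is within epsilon of the goal g.
r(s_t, a_t, s_{t+1}, g) = \mathbb{1}\big[\, \| s_{t+1} - g \| \le \epsilon \,\big]

% Hindsight relabeling: for a stored transition at time t, replace the original goal g
% with a state s_{t+k} actually visited later in the same trajectory, recompute the
% reward, and run off-policy Q-learning on the relabeled tuple (s_t, a_t, s_{t+1}, s_{t+k}).
```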
But if you look closely at the x-axis, this method is actually still very sample intensive. I guess one reason is that they focus on sparse reward problems, so by nature, even if for the same state, action, and next state you sample many goals, each goal basically only gives a zero-or-one reward, so it's still a very sparse reward signal that you get. Another thing is that the way they relabel is also limited to just taking the future successes, the positive goals the trajectory was able to achieve.
So, starting from here, the question is: can we take this [inaudible] relabeling and improve the sample efficiency much more drastically?
This is a very simple example, but I like it. The task is basically to teach the HalfCheetah to run at different velocities, sampled from minus 10 to 10. Typically when you do reinforcement learning, in each episode you sample a different target velocity, and you get curves like this; these correspond to increasing the number of parameter updates you do per sample. Often you saturate; for example, 50 updates per sample is basically about the same as 10 updates per sample. But if you use the relabeling trick, which means that when you do off-policy Q-learning you don't just use the goal stored in the replay buffer but instead use goals that you regenerate from another distribution, then you can get further improvements. So basically this says that if you have a reinforcement learning task that can be framed as a multi-goal task with the same dynamics, then re-sampling the goal or task parameters during off-policy learning can often improve the sample efficiency quite a bit more, and this relabeling uses both positive and negative goals.
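A minimal sketch of that relabeling step, assuming a replay buffer of (state, action, next_state, goal) tuples and a known goal-conditioned reward function; the names and helper functions here are hypothetical, not from the paper:

```python
import numpy as np

def relabeled_batch(replay_buffer, batch_size, reward_fn, goal_sampler):
    """Sample a minibatch and overwrite the stored goals with freshly sampled ones.

    reward_fn(next_states, goals) recomputes the goal-conditioned reward, so the
    same (s, a, s') transitions provide learning signal for many goals, both
    'positive' and 'negative' ones.
    """
    idx = np.random.randint(len(replay_buffer), size=batch_size)
    batch = [replay_buffer[i] for i in idx]
    states = np.array([b[0] for b in batch])
    actions = np.array([b[1] for b in batch])
    next_states = np.array([b[2] for b in batch])

    # Ignore the goals stored at collection time; resample from another
    # distribution (e.g. future states in the buffer or the task's goal distribution).
    goals = goal_sampler(batch_size)
    rewards = reward_fn(next_states, goals)
    return states, actions, next_states, goals, rewards

# The relabeled batch then feeds a standard off-policy update on Q(s, a, g),
# exactly as if these goals had been the ones pursued at collection time.
```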
So what I'd say from this is that for off-policy reinforcement learning, the benefit previously claimed is that you can use a state, action, reward, next state tuple from any policy, but the true benefit is that you can use state, action, next state from any policy and basically learn the Q-function against any reward function. And this now starts to feel more like model-based reinforcement learning, in that it's not just trying to learn a Q-function for a single scalar reward, but trying to handle infinitely many reward functions, and through those infinitely many reward functions it tries to learn many statistics about the state, action, next state transitions, and some kind of planning can be extracted out of it.
So, is there a more exact connection? That's where I'll discuss some of the basics of model-based reinforcement learning. This is one of the simplest ways to do model-based RL: you first learn the transition dynamics, so given a state and an action, what is the next state, using supervised learning. Then with MPC, what you do at test time, given a state s_t, is roll out a sequence of actions using this model, try to maximize the expected reward over that sequence, take the first action, and repeat. You can write this as a constrained optimization problem, where you now optimize over a sequence of actions as well as the corresponding sequence of future states, while ensuring that the sequence of future states satisfies the constraints imposed by the dynamics.
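Roughly, the constrained form being described looks like this (my reconstruction from the description, with f denoting the learned dynamics model):

```latex
\max_{a_{t:t+T},\; s_{t+1:t+T+1}} \;\; \sum_{i=t}^{t+T} r(s_i, a_i)
\quad \text{subject to} \quad s_{i+1} = f(s_i, a_i) \;\; \text{for all } i
```

where only the first action a_t is executed before re-planning.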
This is exactly the same thing written a different way, but by doing this we can make a connection to Q-functions, these parameterized functions. For example, if you take the parameterized reward to be some kind of negative distance between the next state and the goal state, and let's assume the discount factor gamma is zero for now, then if you learn a Q-function for that reward, that Q-function is basically exactly the same as the constraint you see here. What this shows is that the Q-function captures information about the model of the dynamics, and you can use the Q-function to do model-based planning.
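A sketch of that connection, under the assumptions just stated (reward equal to a negative distance to a goal state, discount gamma set to zero):

```latex
% Parameterized reward: negative distance between the next state and a goal state s_g.
r(s_t, a_t, s_{t+1}, s_g) = -\, \| s_{t+1} - s_g \|

% With gamma = 0 the Q-function is just the immediate reward, so for near-deterministic
% dynamics f it encodes the one-step model:
Q(s, a, s_g) = \mathbb{E}_{s'}\big[ -\, \| s' - s_g \| \big] \approx -\, \| f(s, a) - s_g \|

% and requiring Q(s, a, s_g) = 0 recovers the dynamics constraint s' = s_g from the MPC form.
```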
The nice thing about this is that when you learn the Q-function, the algorithm you use is model-free, so you're not limited by some of the problems you see in classic model-based training. So now there's a connection to classic model-based RL, but this alone is still not that useful, because it's only single-step and a constrained optimization, so just like this it has limited applicability.
So the idea is: right now, when we talk about a model, the only model we know is the single-step model. Can we use TD learning to learn a new family of models that describe the MDP? That's why we introduce Temporal Difference Models. So how can we generalize the model of the MDP? It's a simple trick: you basically add a planning horizon tau, so you only get the reward at the end of the planning horizon, and you can write down a standard Q-learning objective like this, where at each time step tau decrements and you get the reward only at the end.
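Written out, the horizon-augmented backup is roughly the following (a reconstruction consistent with the TDM paper, not copied from the slide):

```latex
Q(s_t, a_t, s_g, \tau) \;=\; \mathbb{E}_{s_{t+1}}\Big[
    -\,\| s_{t+1} - s_g \| \cdot \mathbb{1}[\tau = 0]
    \;+\; \max_{a'} Q(s_{t+1}, a', s_g, \tau - 1) \cdot \mathbb{1}[\tau > 0]
\Big]
```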
The nice thing about this is that it learns a kind of multi-step model of the dynamics. What this Q-function means, after learning, is basically: if your goal was to reach s_g in, say, tau time steps, what is the distance between the state you end up at and the goal s_g? If you learn this Q-function perfectly, then you can use it to do a multi-step version of MPC. In standard MPC, which is single-step, you plan at every single step, but now with this you can just substitute different values of tau and plan every K steps or so. In the extreme case, you plan just once, in that you directly plug in the whole planning horizon, and all you do is optimize for the goal you're trying to reach at the end, in order to maximize your reward at the end.
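In that extreme case the planning problem becomes, roughly (again my paraphrase of the formulation in the paper):

```latex
% Plan only every K steps: pick the current action and the state to be reached K steps
% ahead, require that the TDM says it is reachable (Q at its maximum, i.e. 0 for a
% negative-distance reward), and score the plan by the task reward at that state.
\max_{a_t,\; s_{t+K}} \;\; r_{\text{task}}(s_{t+K})
\quad \text{subject to} \quad Q(s_t, a_t, s_{t+K}, K) = 0
```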
I may have rushed this a bit, but the details are in the paper. The idea is basically that now we just need to plan every K time steps, so effectively what we're doing is some instance of hierarchical reinforcement learning: you're using this Q-function, which can reach any state within a given number of time steps, as a kind of sub-policy, and the higher-level policy basically finds which goals you should plan for every K steps to get the maximum reward on the task.
There are some limitations: you miss intermediate rewards if you only plan every so often, but often in reinforcement learning you actually just care about the reward at the end, so for the tasks we tried this wasn't a problem. A bigger problem is probably that you need to solve a constrained optimization if you only have a Q-function, and if the Q-function is a neural network, that's quite unstable and difficult.
So let me address this in a bit more detail. The nice thing about this goal-conditioned Q-learning setup is that it's actually very straightforward to vectorize: if the reward function is, say, a distance metric vectorized over the state dimensions, then you can also vectorize the Q-function. And you actually know that the Q-function basically represents some kind of distance between the goal state and the future state you reach. So you can include that as a building block of the Q-function, and if you do that, you can basically learn something that resembles a model. There are more details that I'll skip for the sake of time, but basically what you can do is explicit MPC: you can directly use this F, which is sort of like a goal-conditioned multi-step predictive model of the dynamics, to predict what state you end up in directly, instead of framing it in this constrained optimization form. There are some more differences, but if you're interested, please ask me later.
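A minimal sketch of the parameterization being described, assuming an L1 distance reward vectorized over state dimensions; the class and helper names here are mine, hypothetical, and only meant to illustrate the idea:

```python
import torch
import torch.nn as nn

class TDMQFunction(nn.Module):
    """Q(s, a, g, tau) built around an explicit state predictor f.

    f(s, a, g, tau) predicts the state reached after tau steps when trying to
    reach goal g; the vectorized Q-value is the negative elementwise L1
    distance between that prediction and the goal.
    """

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(state_dim + action_dim + state_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, state_dim),  # predicted future state
        )

    def predict_state(self, s, a, g, tau):
        # tau is a (batch, 1) tensor holding the remaining horizon.
        return self.f(torch.cat([s, a, g, tau], dim=-1))

    def forward(self, s, a, g, tau):
        # Per-dimension Q-values; sum over the last axis for a scalar Q.
        return -(self.predict_state(s, a, g, tau) - g).abs()

# At test time, predict_state acts as a goal-conditioned multi-step model and
# can be plugged into MPC directly, avoiding the constrained-optimization form.
```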
For the results, I'll describe the basic experiments we ran on benchmark domains for continuous control. The orange one is DDPG; those model-free methods all converge very nicely to the best performance, but often much more slowly than, say, model-based, which is the purple one, or the TDM. The model-based curve converges very fast, and also accurately for a simple domain like the reacher, but for something more difficult, say where the locomotion dynamics are trickier, there is a gap: it converges very fast, but then there's a bias in the asymptotic performance of the final policy. So overall, TDM gives significantly better sample efficiency in all the cases we tested, and specifically it converges to the same final performance as the model-free methods.
So, in discussion, I hope I've conveyed my growing confidence that enlarging this cherry, model-free reinforcement learning, can sometimes be as good as or even better than the model-based approach. I think the important question to ask in reinforcement learning is how we can reframe model-free RL such that you can get much more learning signal, because only then does the deep learning aspect of it make a lot of sense. There are also a lot of exciting open problems and future potential. Thank you for listening to the talk.
>> We have time for questions.
>> [inaudible]
>> So, in the end I avoid the constrained optimization, yeah, by framing it this way. Here, this is the final form I use, and this F is basically just a part of the Q-function's parameterization. Yeah. And you can see this as basically predicting, from the state and action, if your goal was to reach, say, s_{t+T}, and you try to reach it in T minus one steps, what is the state you actually reach, if you learn it perfectly. Yes.
>> So the intuition with the model kind of depends on your choice of the reward being a distance metric, right?
>> Yes, exactly. For the experiments we tried the L1 norm, because you can vectorize it easily, but it definitely depends a lot on that; so if, for example, the state is images, then you probably need to learn some kind of metric rather than just using the standard loss.
>> All right, then let's thank Shane again for a great talk.