- It's my pleasure to introduce Mike Jordan as the
invited speaker for the 2018 Taskar Memorial Lecture.
This memorial lecture recognizes the late Ben Taskar,
who was an exceptional machine learning
researcher, colleague, and friend.
Ben made contributions in many aspects of the field,
including probabilistic modeling, computer vision,
and natural language processing.
Today, we also celebrate the Taskar Center for
Accessible Technology that's led by Anat Caspi.
One of the focuses of this center is on the translation
and deployment of open source technology to improve
the lives of people with various abilities, or disabilities.
On the academic side, the center focuses on
the integration of universal design principles
into the engineering curriculum.
And as an example of something that the center has done,
they've created something called AccessMap,
which has become a really popular way of automating
pedestrian routing, based on people's abilities.
I also wanna acknowledge Ben's family,
who's come here today from the Bay area.
So it's exciting to have Mike giving today's lecture.
Mike was Ben's post-doc advisor at Berkeley,
and Mike is one of the most distinguished researchers
in machine learning and statistics.
Mike's won tons of awards that I won't attempt to list,
but these include being inducted into the National Academy
of Sciences, National Academy of Engineering,
and the American Academy of Arts and Sciences.
And today, Mike is gonna be talking on aspects of
optimization, which have connections to some of the work
that Ben did with Mike on structured prediction.
(applause)
- Thank you very kindly, it's always a pleasure to be here.
I've been in this room several times and I love coming here,
it's such an amazing crowd and you're all so big.
Thanks to Emily, she was really a post-doc with me.
She came unofficially, but she'd come from MIT
to visit us at Berkeley, and did some really great work
that helped spur the field forward.
So I wanna thank, obviously, Anat and family for inviting me
but there's really a deeper thanks I wanna give to them
which is, thank you so much for doing all the activities
you've done in honor of Ben, and keeping him with us.
Because we miss him,
and the fact that you have the center,
that you have this event, it gives us a place
and a grounding that we can come back to and remember Ben,
and feel like he is still with us.
So I wanna thank you profoundly that you didn't do
just something for your family, but you did something
really for all of us, so I really appreciate that.
So if you get to my age in life, you have lost many people,
and so in fact my parents are gone, and some great friends
like Ben are gone, and you get used to that feeling.
It's not a great feeling, but it is part of life.
But I particularly notice it when something happens
that I wanna tell somebody about.
So with my mother, it's like if something happens
with my children, I wanna tell somebody, and I think,
ah, my mom's not there, that's too bad.
Well, with Ben, he was probably the number one person
when there was an academic idea that either I'd thought of
or someone else had thought of, I wanted to tell Ben.
And so in fact, this talk is really for Ben,
this is material that I've been working on for the
last few years that I would wanna tell Ben about.
And so in my own mind I'll be giving the talk to him.
It's somewhat technical, and I normally don't give
quite as technical a talk, it's kinda math-y.
But that's the kind of things
that Ben and I would talk about.
We both loved math, both the
beauty of it and the fun of it.
So I certainly remember being around Ben and having this
little playfulness, you know, twinkle in the eyes, little
smile kind of always on the face when we'd talk about math.
He understood the playfulness of being an academic,
and the beauty of that, and the fun of it.
And the humor associated with it.
So anyway, I wanna thank all of you for bringing me here
to participate in this really, celebration of
all of us, but particularly Ben.
So in fact, Ben taught me a lot about optimization,
and so optimization is the theme of this talk.
A lot of people are talking about
AI and machine learning these days.
Frankly, I don't think we're there yet, I don't think those
fields have done anything really that serious yet.
So it's amazing how much hype we have now about it.
Most of the ideas have been discovered in statistics,
optimization, operations research,
areas of computer science, economics and so on.
And machine learning is basically an integrative discipline,
it brings these ideas together to solve some problems.
It hasn't really created a whole lot of
new ones itself, per se, frankly.
And the fact that it's being rebranded now as AI,
as if we've solved the big problem of intelligence,
which we haven't even begun to solve.
Maybe in my lifetime we'll start to perceive it,
but we're not perceiving it yet.
And so the fact that we're talking in this kind of
exalted language is alarming to me.
And so, as the kind of wave has pushed in that direction,
of all the hype and all the anxiety,
I've headed in exactly the opposite direction,
which I typically do in my career,
I go back to the nuts and bolts and try to do some
more theory, and try to figure out what won't work,
and get more guarantees, and so on and so forth.
So for the younger people in the audience, I wanna
encourage you, this is not the big era of AI.
It is just not, all right, it's not gonna work
in the way that people are talking about it.
We're just kinda learning some mappings from some data
and there are some problems associated,
there was a nice panel this morning on the issues
that are now rightfully coming up.
And it's not because we thought we were gonna solve
the big problems and suddenly we see there are problems.
It's rather, this is an interesting discipline,
it's gonna take decades to roll out, just like
chemical engineering, just like civil engineering.
And I wasn't around in those eras, but in the 40s and 50s,
when those were rolling out, I bet they didn't have
the level of hype that we have now.
So I have a lot of respect for what they did,
they changed the world, so I think that this,
we're building in some sense the first
engineering discipline for humans.
Chemical engineering and civil engineering weren't
really about humans, I mean, we've built buildings
that people lived in, but this is really something
that's really all about humans, and their
interaction with science and technology.
And so that's an exciting but
extremely challenging thing to do.
So we need to be sober and take the decades
it needs to take, to do this.
So I think I've said enough about that, but just to give you
a little of the context to what I'm thinking about.
So optimization is a wonderful playground,
but it's a lingua franca, it's a place where
many people come to and talk together.
So it came out of many many fields and many have
contributed, but I think not that many people
went into optimization as their first field.
They kinda end up there later in life when they realize
how useful it is, and how communicative
it becomes to work in that field.
Okay, so, let me start then with
some of the actual content here.
Okay, so it's certainly played an increasingly
central role in statistical machine learning.
It can't really be the whole story, because machine learning
is also about uncertainty, and so you need sampling ideas
and all that, just optimizing can't be the whole story.
But it has really been very, very central.
It supplies algorithms, that's great,
but it also supplies lower bounds
and therefore fundamental understanding.
So let me say something about that.
When I learned optimization, it was
in the 1980s, I was a student.
And I took some classes and read some books,
and it was presented to me as a field that was
dead and done, and a big success.
You had simplex method, you had conjugate gradients,
you had Newton's method and so on.
And they really worked, and there were rates of convergence
and there was good software, and it was just
a tool to be used by everybody.
And there was this famous book by Press et al
that you just picked up and used.
And that was in the 1980s.
So great.
That was false, it was not dead and done at all.
So there were a couple of revolutions happening around that
time, and mostly associated with the former Soviet Union.
And that's another connection to Ben.
I think that may be why Ben was so interested in
optimization, he'd been trying to pick that up.
And one of the revolutions was
called the interior point method.
It was developed also in the US, but the Russians
really, really took it much, much further.
And it was just an older, but a revitalized perspective
on how to do optimization, and it's had a huge impact.
But the other one was gradient based optimization.
And stochastic gradients and stochastic approximation
and these ideas that went back to the 50s.
But they were picked up in that era, and taken
to, really, a beautiful level.
And in particular, lower bounds were discovered.
And that's a conceptual leap,
most fields don't have lower bounds.
Statistics has lower bounds, computer science has very few.
And I think it's a sign of the maturity of the field
that it has lower bounds, and optimization finally got
lower bounds circa 1970.
And so this was all done in the Russian school,
and once you get lower bounds you start
to get fundamental understanding.
You see the best you can possibly do.
And so that strips away a lot of the details
of specific algorithms, it tells you about
what's the best you could possibly do?
All right?
Now that's an interesting concept,
because we're doing optimization, here,
so we're developing optimization algorithms.
So we're asking for the best optimization algorithm.
The optimal optimization algorithm.
So already, that's kinda conceptually neat,
what could that possibly mean?
So that kind of thought inspires a lot of my work here,
and the work of others in this field,
and that's the kind of thing that,
again, I'd say Ben would have loved.
This kind of recursivity and so on.
So, let's get into that.
Now, once that happened in the 80s,
as is often the case, brand new issues came out
and it was realized that this field is not done at all,
there's a lot of open problems.
And now as people with statistical interest have
moved into this field and started to use it,
it's even more clear how many problems are open.
So I hope by the end of this talk you see that
I'm not actually trying to show you any real results,
I'm just trying to show you some
cool problems and some open areas.
Okay, so let's talk about saddle points a little bit.
So this is gonna be some work that's actually joint
with Sham Kakade in the audience; you have a lot of
good people in optimization here at UW,
and Sham is a particularly noteworthy person
from whom I've learned quite a bit, from his papers
and from talking with him.
So we used to think, at least in the 1980s,
that bad local minima were the main problem,
kind of learning about these from the complexity
theorists, who told us that that was
the source of all the complexity,
and from physicists with spin glasses and so on.
And as time has gone on, it's sort of not been
the case that it's the local minima.
In a lot of problems there are no local minima,
but there are a lot of saddle points.
Saddle points, in two or three dimensions, don't look
like a big deal, there's kind of like a little saddle,
and you just kind of move your way around it.
But in a hundred thousand dimensions,
saddle points can be a very, very big deal.
It might be that almost all the dimensions are bad ones.
You stay stuck, and it's hard to find
the dimension that gets you out.
And if you have lots of saddle points, you'll get stuck
for a long time, then you'll move to the next one,
you'll get stuck for a long time, I mean you'll get
kinda stuck for way too long, for months.
In fact, if you train a neural net, and I won't say the word
neural net ever again in this talk, but (laughter)
if you train one, the error goes
down, and then it flattens out.
And then after a while it goes
down again, and it flattens out.
Those are saddle points.
And it's probably a ring of saddle points and then
another ring, and it's kind of going past those rings.
Again, hundred thousand dimensions, or million dimensions.
So understanding the behavior at saddle points is actually,
really, probably a good chunk of the problem here.
Okay, so what ingredients are we gonna use?
What's gonna be in our toolbox.
Well, again, we're gonna be mathematicians
in this talk, we're gonna simplify.
So we're gonna use a few ingredients.
Gradient descent, critically,
stochastics, critically, partly for algorithmic speed
reasons, but also 'cause we are statisticians, too,
and we wanna always have probability in mind.
And then we'll talk a bit about acceleration,
'cause acceleration, you wanna move as fast as possible.
We're trying to move as fast as possible so we have
to have some notion that you're accelerating.
That's one of the reasons why we have those ingredients,
the other is just that experience has taught us that the
simpler the methods, the better they tend to work and scale.
And so as computer scientists, we're interested critically
in scaling and in robustness.
And the more complex methods with second order and Hessians
and so on, are a little bit deprecated right now because
they don't scale, and they don't tend to be very robust.
Now that will end, the pendulum will swing back,
but these three ingredients are already
getting us quite far in our thinking.
So here's a saddle point.
Look at the saddle shape.
We're gonna be talking about strict saddle points,
where the Hessian has an eigenvalue which is
strictly negative, and our question is gonna be
how to escape these things efficiently.
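To pin the terminology down, in symbols a strict saddle point is a stationary point whose Hessian has at least one strictly negative eigenvalue,

$$
\nabla f(x) = 0, \qquad \lambda_{\min}\!\big(\nabla^2 f(x)\big) < 0,
$$

so there is always a direction of strict local descent; the question is how cheaply you can find it.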
Now here's a computer science word.
We don't want it to take exponential time to escape
a saddle point, if exponential is exponential
in the number of dimensions, for example.
That would be a disaster in a hundred thousand dimensions.
So can we do some theory and develop some algorithms
that assure that this doesn't happen?
Okay, so here is this first little line of work,
I'm gonna put together a whole bunch of little lines of work
and this one'll probably be one of the longer ones,
and again, Sham is on the list there, but also
Chi Jin, a student at Berkeley, Rong Ge, and
Praneeth Netrapalli, so it's a very geographically
distributed collection of people.
All right, so a few facts from some recent work that
several people have been doing. Start with gradient descent.
Well, this is an old algorithm, it goes back to at least Newton,
and some historian can tell me how much further back it
goes, but you're going down the steepest descent direction.
Now, in discrete time, it's kind of
hard to analyze some of these things.
So it's often useful to go into continuous time
and develop a differential equation.
When you go into continuous time with gradients,
you get something called gradient flow.
And that's been studied by the
mathematicians for a long, long time.
It's just a differential equation
that flows along the gradient.
And there's something called Morse theory,
an area of mathematics that tells you that with
probability one, you will not end up at a saddle point,
if you're doing gradient flow.
But it had been an open problem for the discrete version,
gradient descent: if I start gradient descent
at an arbitrary place, could I get sucked in,
sucked in, sucked in, and just kinda asymptotically
end up at a saddle point?
And we did some Morse-theory-style arguments
to show that that's not the case.
But that is an asymptotic argument.
So no efficiency statement.
Can you say something about efficiency?
Well, there's a negative result.
We had a follow up paper which showed that gradient descent,
even though asymptotically it'll never end at a saddle point
it can take exponential time to escape saddle points.
So it does get sucked in for kind of a long time,
and then sucked in for a long time and sucked in
for a long time, and then you set that up in a
not too weird way, such that it'll take exponential time
to actually finally get past all the saddle points.
So that was a negative result.
But now there's an important paper, Ge et al.,
which showed that it turns out if you just do
stochastic gradient descent, which we all love for other reasons,
it can actually escape saddle points in polynomial time.
So we went from exponential to polynomial,
the thing that computer scientists love to do.
Now their polynomial was not necessarily good.
Polynomial is not good enough for us, we want fast.
We want linear or sub-linear.
Logarithmic is where we are typically trying to get.
But nonetheless, that, as usual, opened up
the question: oh, maybe one could go faster.
Okay, so let's set up a little structure here,
let's do convex optimization at first.
So we have a minimum of a convex
function in a d-dimensional space.
Here is gradient descent, which
Isaac Newton would have recognized.
And I should say the date on this talk was January 15th,
'cause I gave this talk at the Newton Institute in
Cambridge, and as I was giving the talk, there was this
huge portrait of Newton sitting right there, and I realized,
as you can see, there's several other ingredients
that Newton would recognize in this talk.
And I kept kinda turning to him and saying, see?
(laughter)
So I left that date on here just
'cause it's kinda stamped into me.
So anyway, there's Newton's algorithm.
So it converges, provably, to a global minimum,
and here's the critical part: the number of
iterations that it takes to get to the global minimum
is independent of dimensionality.
And I don't think that's widely enough appreciated.
Everyone thinks gradient descent is good,
because the steepest descent step is cheap and all that,
but its number of iterations is independent of dimension;
if you do it in infinite dimensions, then it's just as fast
in terms of number of iterations, which is amazing.
Now, but what about the world that we mostly
really care about, these non-convex problems?
So now we can have local minima, saddle points,
and local maxima, and all that.
Now here's a list of problems that are actually non-convex.
But they don't have any local minima, provably.
So they have a bunch of saddle points,
but it turns out moreover, all those saddle points
are strict saddle points, so they don't have
these kind of flat, bad regions that could get you
really stuck, and you couldn't say much.
So these are really interesting problems.
There's not neural nets up here, but these are other areas
if you're a machine learning person, that you need to know
about, they're really working in real world applications.
So a lot of them, phase retrieval, matrix sensing,
there are a lot of practical applications of these.
So notice, no local minima in all of these (mumbling).
So if you can avoid these things efficiently,
you're done, you solved the problem,
you get to the global optimum.
Okay, so let's set up some structure
to talk about convergence.
So we need a bit of smoothness
so we can see anything at all.
So here's the standard assumption: we assume that we have
a gradient which is Lipschitz, so that if you move between
two points, x and y, the gradient doesn't change much
relative to the distance.
And then we now talk about a first order stationary point.
If the gradient were zero, that would be the classical
notion of a stationary point from calculus.
So we put a little epsilon ball around that, and we say
how long does it take to hit that epsilon ball?
And we do that for various values of
epsilon, so now we get a rate.
All right, and here's the theorem.
Let's say it's due to Nesterov.
Which is that gradient descent, with a certain choice
of step size, the one over the Lipschitz constant,
finds one of these guys in a number of iterations
which is proportional to one over epsilon squared.
So this is somewhat of a slow rate, but not too bad.
And that's true for any convex function.
Lipschitz, smooth, convex function.
We have the distance from where you start in there;
if you start really far away, you're just gonna
take a long time to get there, fine.
And then you have a very simple
two times L sort of expression, here.
But then again, here is the key,
which is that this is independent of dimension,
number of iterations.
So that's the best one could kind of possibly hope for
for a gradient-style algorithm.
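Spelled out, the standard form of that guarantee, roughly, for gradient descent with step size $1/L$ on an $L$-smooth function $f$, is

$$
x_{k+1} = x_k - \tfrac{1}{L}\nabla f(x_k)
\;\;\Longrightarrow\;\;
\min_{k \le K}\,\|\nabla f(x_k)\| \le \epsilon
\quad\text{once}\quad
K \ge \frac{2L\big(f(x_0) - f^\star\big)}{\epsilon^2},
$$

and nothing in that bound depends on the dimension.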
All right, so now what are we gonna try to do?
Well we're gonna try to now go to beyond that,
we're gonna try to avoid saddle points,
that algorithm will get stuck at saddle points, potentially.
So now we're gonna define a second order stationary point.
So first of all, we need a little bit more smoothness
assumption, we need a Hessian to exist,
and we need it to be Lipschitz.
We don't want the Hessian to change a lot as we
move around; then we wouldn't know how to get
out of the saddle point, really.
So we're gonna make that assumption,
we'll have now a parameter rho.
The second order stationary point is again that we have
first order stationarity, plus the minimum eigenvalue of the
Hessian has gotta be greater than or equal to a bound; if that
were zero, that would be your classical notion of not being a
saddle point, but we got ourselves a little slop again, a
little ball around the solution, so we can get a rate.
And that ball happens to be parametrized by
the square root of rho times epsilon,
it's just a standard parametrization, and that's it.
But it's proportional to rho, the Hessian Lipschitz bound.
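In symbols, the definition we're using, with $\rho$ the Hessian Lipschitz constant, is that $x$ is an $\epsilon$-second-order stationary point if

$$
\|\nabla f(x)\| \le \epsilon
\qquad\text{and}\qquad
\lambda_{\min}\!\big(\nabla^2 f(x)\big) \ge -\sqrt{\rho\,\epsilon}.
$$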
All right, now we're gonna analyze it, extremely simple
algorithm, and it's a mathematician's algorithm,
one that you can analyze and say something about.
But it also is something you can implement and practice.
So it is just gradient descent.
But from time to time we're gonna add noise
from a ball of radius R, selected uniformly.
Now, time to time means that we're
gonna not let this happen too much.
And so this perturbation condition is that
if the gradient happens to be really small
where you're currently at, and you haven't
added perturbations too recently.
So a little bit of hysteresis, if you will.
If that's true, now you add some noise.
And the noise is isotropic.
So that's a dumb little bit of stochasticity
that we've added to gradient descent.
We needed something like that 'cause gradient descent
itself will take exponential time, that would be bad.
So we've done this little bit of extra noise.
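Here's a minimal sketch of that perturbed gradient descent loop written out in code; the function name, the thresholds, and the exact perturbation bookkeeping are my own schematic choices, not the precise pseudocode from the paper.

```python
import numpy as np

def perturbed_gradient_descent(grad_f, x0, eta, r, g_thresh, t_thresh, n_iters, seed=0):
    """Gradient descent with occasional isotropic perturbations (schematic).

    grad_f   : callable returning the gradient at a point
    eta      : step size (e.g., 1/L for an L-smooth function)
    r        : radius of the perturbation ball
    g_thresh : perturb only when the gradient norm is below this
    t_thresh : and only if no perturbation happened in the last t_thresh steps
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    last_perturb = -t_thresh            # allow an immediate perturbation if needed
    for t in range(n_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= t_thresh:
            # add noise drawn uniformly from a ball of radius r
            d = x.size
            u = rng.standard_normal(d)
            u *= r * rng.uniform() ** (1.0 / d) / np.linalg.norm(u)
            x = x + u
            last_perturb = t
            g = grad_f(x)
        x = x - eta * g                 # the ordinary gradient step
    return x
```

The only stochasticity is that uniform-ball draw; everything else is plain gradient descent.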
And here's the results, so here is the convergence rate now,
now we're looking for second order stationary points,
Nesterov did first order stationarity.
So we have again these Hessian Lipschitz assumptions.
We set the learning rate to a certain choice,
and now we're gonna find a second order
stationary point and a number of iterations,
well that looks exactly the same as before.
And it is, except for that little tilde sign right there.
But before we get to that, it is
one over epsilon squared, still.
So we're able to escape saddle points just as fast
as gradient descent converges in the convex case,
and this is the non-convex case.
That's pretty striking fact about gradient descent, which
is it doesn't seem to be too troubled by saddle points.
Moreover, we have the Lipschitz constant there, and
everything is the same, and so what does that tilde mean?
That's where we're hiding the dimension dependence.
But the dimension dependence
is logarithmic, poly-logarithmic.
It's the logarithm of d, to the fourth power.
And D could be a hundred thousand, that's not
a bad penalty to pay to avoid saddle points.
So if you wanna pump this result up to the most you can,
you would say that perturbed form of gradient descent goes
through the saddle points as if they're almost not there.
And so this, to me, is one of the first kind of proofs that
establishes why gradient descent is so effective in things
like neural nets, oops, I said it twice.
In high dimensions.
All right, so here just to summarize what we got,
so Nesterov for convex problems got this rate,
we get exactly the same rate up to
this penalty of log to the fourth.
Now, is that fourth power real?
Probably not, probably that's an artifact of the proof
technique, and I suspect log is a lower bound, but I don't
think we know a lower bound, currently, for this problem.
Now, here's a little bit of the history
just briefly, I already alluded to this.
But here are a bunch of different methods
that do find second order stationary points.
We've been talking about the ones based on gradients,
but you could also go to Hessians, and this is
a little more of a classical literature.
If you give me a Hessian, maybe you can get a Hessian
cheaply, but in a lot of problems you can't,
but if you did, then I can compute the eigen structure
of it, and I could find the negative eigenvalue
direction, and I could just hop out.
And so if you do the analysis there, you get a faster rate,
it's one over epsilon to the one point five.
But these algorithms are not being used,
at least in most of the problems I'm aware of.
There is an intermediate set of things where you allow
yourself a Hessian vector product, and this is interesting.
Because that is something you can do realistically
in lots of problems, and you get a faster rate here, 1.75,
and you get this nice log D dependence.
So this is a bunch of people, mostly at Stanford,
and I think the comparison is very interesting.
This will kinda play out over the years to come
as to which of these approaches wins out.
Anyway, we're up in gradient land here,
there was an original Ge et al work and it was
just polynomial, so not nice, but
not that strong of an assertion.
There was then a paper by Levy,
who was able to prove D to the third,
and still polynomial in whatever epsilon.
And then there's this work, which goes all the way down to
one over epsilon squared, a good polynomial,
and then the logarithm to the fourth,
starting to get to the lowest it could possibly be.
Just a little word about proof technique here,
so really this is differential geometry.
This is just, we're trying to prove something about
the geometry of saddle points and how they interact with
dynamics, a particular kind of gradient dynamics.
So you'd like to analyze around a saddle point,
where is the kind of the bad region, where if you're
in that, it's gonna take you a long time to get out.
And this is a probabilistic statement because
I'm gonna be adding noise from time to time,
and the noise will maybe get me out of that region.
And if I'm out of the region then kind of just the
power method will get me out, I have a negative eigenvalue,
boom, it'll just shoot me out.
So I need to know how thick that region is.
And the argument that's used in this proof
is a coupling kind of argument.
You don't know what that region looks like,
you could probably do some differential
geometry to figure out more about it.
But what you do is you start a coupled pair of processes,
that's the probability argument: you start a coupled pair
of processes on both sides of the region,
R apart, where you don't know what R is.
And now you adjust R until these two things
start to flow quickly apart from each other,
when a phase transition occurs.
Now you know you're at the right value of R,
and you choose that as your R.
And that R is capturing now a pancake-like
shaped object that's pretty thin.
And that gives you this rate of log d.
The previous work had replaced this whole argument
with a flat slab, and the slab was too thick.
And so in high dimensions especially,
that's where you get the D cubed.
Okay, so I always like it, again, I think
Ben would love the probability coming together,
the differential geometry coming together, the algorithms.
All right.
Okay, so that was the first line of
work that I wanted to talk about.
Right now let's talk about, let's go further,
we're gonna talk about saddle points.
Can we get past the saddle points even faster?
So there's this notion of Nesterov acceleration,
which, if you know about optimization, is one of
the big discoveries of the last 40 years
about what's the fastest you can go.
It kinda gives you insight into that.
And so do those techniques allow us
to move past saddle points quickly?
What other kinds of stochastic models
can we use to escape saddle points?
There's just this brown ball of
noise; are there other, better ways?
And how do these two kind of ideas interact?
So this is kinda the last year of research.
And to get there, we're gonna need to have
a deeper understanding of acceleration
than has been available in literature today.
So we just started to say what it is.
The deeper understanding is based on going into continuous
time, and talking about differential equations
and stochastic differential equations.
And this'll be a shock to the computer scientists
in the room, they thought they could develop their
whole career outside of differential equations.
And I'm gonna tell you that's wrong,
and it's because these optimizations,
these iterates that hop around in a space,
they don't have any real topology.
So it's hard to talk about notions like acceleration
and going faster, 'cause you're just hopping along
a sequence space, you need some embedding into a continuum.
And that embedding leads to insights,
there's gonna be phase transitions in continuous world,
which you don't see in the discrete world.
And so it's phase transitions which are
giving us the difference in speeds.
So I'll get there in a minute.
All right, so this work is one of my favorite lines of work
in the last few years, and it's principally been led by
two of my favorite grad students, Andre, well they're
all my favorite, that's not fair, but (laughter),
I'm pumping their work right now, so Andre Wibisono
and Ashia Wilson have really mightily contributed to this
in the last several years, now, and then Mike Betancourt
from Columbia's joined us in the last little phase
of this research, to help us along a bit.
Okay, so let me be now really provocative
and grandiose sounding, which I normally am not.
I'm provocative but not grandiose.
In mature fields, ones that mathematically really seem deep
and mature, at least in my experience, there's a really
deep interplay between integration and differentiation.
You can call that interplay the
fundamental theorem of calculus,
also Talagrand's work, and...
Quantum field theory does a lot of this.
Really mature fields you sometimes work
with the integrals and sometimes work
with derivatives, and you go back and forth.
And these can be very generalized notions of these objects.
So physics certainly is the poster child of this,
there was Newton who wrote down F equals MA,
it's based on derivatives, it's a differential equation.
And wrote down the laws of nature using derivatives.
Then Lagrange came around about 100 years later,
and said, I can rewrite all of that using integrals.
So I'm gonna integrate an energy, and moreover
I'm gonna optimize over paths under those integrals,
so he introduced also optimization, not just integrals.
And then he gets out the same physics that Newton got out
using this other formalism; it's called variational.
And as you probably know, that led to then Hamilton's
perspectives, and that led to a lot of the 20th century
physics was based on that, but a good physicist
goes back and forth all the time.
A good probabilist goes back and forth all the time.
Statistics is full of these kind of arguments.
If you're doing Laplace expansions or saddle point
expansions, that's a derivative in the complex plane
for the purpose of doing integrals.
The numerical disciplines: finite elements takes a PDE
and turns it into a bunch of little integrals, patches,
that you put together in a variational way.
Monte Carlo, hybrid Monte Carlo, the best existing
Monte Carlo algorithms are based on taking a derivative
to get fast speed as you move along in some direction.
So I could go on, but I think this is true that fields that
kinda go on both sides of this are really somewhat deep.
I would argue that a lot of our computational fields
of this century are not so deep in that regard.
In particular, I'd argue that optimization is not.
Which might be surprising.
It's all full of derivatives.
Everywhere you see, in every paper, it's tons of derivatives
and Hessians and higher-order derivatives or whatever.
But you almost never see an integral sign.
And there's not a variational perspective
as fundamental in that field.
So I'm not gonna argue this, but I'm gonna put it out there,
so I think there's an opportunity to take
a variational perspective on optimization.
To ask, what is the best optimization algorithm.
That's, in essence, a variational question,
and it's gonna require the other side of the coin.
All right so let's go back now to gradient descent again.
So here's gradient descent, I'm now
using K as my variable and not N.
So again, known for hundreds of years.
What's maybe not so known is it has a convergence rate.
So now we're starting to talk
about computational complexity.
So we're doing convex optimization in this part of the talk,
and what this convergence rate is, is that after you've
gone through K steps, no matter what the convex function is,
I can guarantee you'll be within a
ball of size one over K of the optimum.
So that is not an old idea,
that's maybe in the last few decades.
It's kind of two or three pages of proof
using Lipschitz, convexity, Jensen's inequalities and so on.
But anyway, it's a nice fact that
we can analyze the rate of this.
So already we're starting to do something
a little more computer science-y to have an actual rate
over a class of functions, not just for a specific function.
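In symbols, the standard statement, roughly, for an $L$-smooth convex $f$ with step size $1/L$, is

$$
f(x_k) - f(x^\star) \;\le\; \frac{L\,\|x_0 - x^\star\|^2}{2k}.
$$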
Okay, now at some point, various of the Russian school,
Nemirovski is the most well-known name having done this
in the 70s, started asking, can I get a complexity theory
of optimization?
And I think that was a really important moment.
He didn't just go to Turing and say, well,
the complexity theory already exists,
the computer science gave it to me.
'Cause that complexity model's just too strong.
You can do anything with a Turing computer.
So you can't say much.
And I think that's, with all due apologies to the complexity
theorists in the audience, you can't say much, right?
So Nemirovski said, no, I'm gonna have a different
computational model, it's gonna be an oracle model.
And so for example, if I'm looking at optimization of
convex functions, what I'm gonna do is, the computer
is allowed to know the function value, the gradient
at any point that it wants, and it's allowed to take its
steps within the linear span of all previous gradients.
So that's a computer, that's a computational model.
It's a pretty strong one, but it's not
everything, it's not Turing complete.
But you can prove things with it now.
And in fact, you see nowadays in complexity theory like,
fine grained complexity and kind of relative complexity
starting to emerge, it has this flavor.
So he said, well, given that computation model,
what's the fastest you can go?
And this framework is rich enough to actually give an
answer, and the answer turned out to be one over K squared.
So it's faster than gradient descent.
Which is maybe a little surprising, because gradient
descent is steepest descent, if you're only allowed
to look at gradients, you better go down the
steepest direction and go as far as you can,
and we're giving you the optimal step size.
So I think probably, I wasn't around in that era, they
were a little surprised that there could be a better rate.
So probably people said, well
that lower bound's just too low.
We need to get a better lower bound that's a little higher.
And it was maybe a shock to everybody when Nesterov,
a student of Nemirovski's, wrote down this pair of equations
that's called Nesterov acceleration.
There's now two variables, not just one.
One of them does gradient descent in the x variable,
but that's now the y variable, and then the x variable
is kind of a combination, an extrapolation,
a rotation, if you will, of the last two x variables.
So it has a little bit of a momentum flavor,
or a heavy ball flavor, it has a lot of intuitions.
But the amazing thing is that then Nesterov not just
invented this, but he proved that it has a rate of
one over K squared, beating gradient descent.
Which is pretty astonishing given that
it's only using gradients, and it's not
going down the steepest descent direction.
Moreover, this algorithm will occasionally go back uphill.
Which is amazing, if you're trying to get down this hill
as fast as possible, how can that possibly be?
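One common way to write that pair of equations down, roughly the form Su, Boyd, and Candès use (indexing conventions differ across references), is

$$
x_k = y_{k-1} - \tfrac{1}{L}\,\nabla f(y_{k-1}),
\qquad
y_k = x_k + \tfrac{k-1}{k+2}\,\big(x_k - x_{k-1}\big),
$$

a gradient step followed by an extrapolation, with the guarantee $f(x_k) - f(x^\star) \le O\!\big(L\|x_0 - x^\star\|^2 / k^2\big)$.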
So I think it's safe to say that a lot of people have worked
on this, that there's a lot of math, but there's not a huge
amount of understanding to this day about the phenomenon.
Where's it coming from, why is this the optimal algorithm,
why is it reaching the lower bound, and so on.
So I'm gonna argue that the best way, at least I think,
to understand it is to go into continuous time.
That just in discrete time you're
not gonna see the phenomenon.
So yes, in the last 40 years there's been all kinds of
accelerations of classical methods, they're all really
interesting, they're based on kind of the algebra
associated with Nesterov, but not really,
there's not an underlying generative principle,
if you will, for this class of methods.
Okay, so let's now go into continuous time.
So there is gradient flow at the top, it's just
a differential equation, it just takes off and
goes downhill, you don't need a step size.
If you discretize that equation,
you get out steepest descent.
Well it depends, there's two different ways,
several different ways to discretize it.
If you discretize it with the explicit Euler method,
then you get out gradient descent.
If you discretize it with implicit Euler,
you get out a prox method.
Now, that's cool, but also you can analyze the
convergence rate of this in continuous time.
Using the same kind of tools we did in
discrete time, actually a little easier.
And the convergence rate turns out to be one over T.
Maybe not a big surprise, 'cause gradient descent
was one over K, this is now one over T.
Sort of similar rate.
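Written out, the point is, with step size $\eta$:

$$
\dot X_t = -\nabla f(X_t)
\;\Rightarrow\;
\underbrace{x_{k+1} = x_k - \eta\,\nabla f(x_k)}_{\text{explicit Euler: gradient descent}}
\quad\text{or}\quad
\underbrace{x_{k+1} = x_k - \eta\,\nabla f(x_{k+1})}_{\text{implicit Euler: proximal point}},
$$

and the implicit step is the same as $x_{k+1} = \arg\min_x \{\, f(x) + \tfrac{1}{2\eta}\|x - x_k\|^2 \,\}$.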
Okay, in a very nice paper, Su, Boyd, and Candes at Stanford
said, what if we take the Nesterov pair of equations
and do the same thing, take the step size to zero,
out will pop some differential equation,
and there is the differential equation that pops out,
so not surprisingly it's second order
because we had two equations.
It's non-homogeneous, and there's a mysterious three
sitting up there, and then you have the gradient.
So it looks kinda like a little bit of an
oscillatory thing driven by a gradient.
You can't solve it, but you can analyze it
using Bessel functions and understand
a little bit about how it moves around.
And so that was very nice, very interesting.
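For reference, the ODE that pops out of their limit is

$$
\ddot X_t + \frac{3}{t}\,\dot X_t + \nabla f(X_t) = 0,
$$

a damped oscillator, with that mysterious $3/t$ damping, driven by the gradient.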
We looked at that and were inspired by this,
and asked: where did that differential equation come from?
Why that differential equation, why is it the best one?
Is there an underlying generative principle?
So I think of this as much like Lagrange looking at
Newton's equations, F equals MA and says, okay,
that's a nice equation and it seems to work,
but where's it come from, why that equation?
Is there an underlying principle?
So here's our attempt to deliver such a principle.
So I've just written it down here, and I'll
take a couple minutes trying to unpack this.
So we call this a Bregman Lagrangian.
This is a function of the state space, x where you are,
we also have a derivative, x dot,
and it's also a function of time,
this will be all time varying.
It's got two main pieces to it,
is this part right here, and this part right here.
So let's focus first of all on this.
First of all, all these alphas, betas and gammas,
are just time scaling functions that are
degrees of freedom of the space of algorithms.
We don't just want one algorithm, we want a whole space,
and so we give ourselves this reasonable freedom.
They're not the important part.
The important part is these two pieces.
This piece right here is the function you're trying to go
downhill on, so the F, scaled in some way.
So that's kind of like a potential
energy, you wanna go down.
We also have a term here which is gonna be kinda like a
kinetic energy, which is now known as a Bregman divergence.
So this is a Bregman divergence
between a point x and an extrapolated point,
so what is a Bregman divergence?
It's this little equation here, if you want,
it's based on an auxiliary function, H,
and it measures distance between x and y using H.
So this picture might be better
than the equation for some of you.
If I wanna measure distance between x and y on this axis,
what I do is I put down a function H,
and I take the linear approximation to H at x,
and go out there and get a discrepancy between
the linear approximation and the actual function H,
and that red distance is the distance between
x and y using this auxiliary function H,
it's called the Bregman divergence between x and y.
If H is just quadratic, this is a fancy way of writing down
the Euclidean distance between x and y,
it's a variational way of doing that.
But if H is logarithmic, you'd get a Kullback-Leibler
divergence and H can be all kinds of other things,
and has been studied widely in convex optimization.
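In symbols, with auxiliary convex function $h$:

$$
D_h(y, x) \;=\; h(y) - h(x) - \big\langle \nabla h(x),\, y - x \big\rangle,
$$

which is $\tfrac{1}{2}\|y - x\|^2$ when $h$ is the squared Euclidean norm, and the Kullback-Leibler divergence when $h$ is the negative entropy.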
Bregman, by the way, I think is Russian-Israeli,
I don't remember, so anyone happen to know?
Again, I like to make connections, the Russians
were behind a lot of this work that came out
in the 80s, for obvious kind of reasons.
So that's the Bregman divergence,
now how are we using the Bregman divergence?
In an interesting way, I think.
We're measuring, the Bregman divergence is measuring,
the distance between the point x and x plus a time-scaled derivative.
So it's a little bit of a, like elasticity, if you will.
How much energy are you paying if you move in the
direction that the momentum seems to be carrying you?
So we measure that, and let's call that a kinetic energy.
And in fact, if H is chosen to be quadratic,
this whole thing reduces to just one half x dot squared,
which is kinetic energy.
So this becomes literally a Lagrangian
that happens to be time varying.
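For the record, the Bregman Lagrangian as we wrote it down is, up to notation,

$$
\mathcal{L}(x, \dot x, t) \;=\;
e^{\alpha_t + \gamma_t}\Big( D_h\big(x + e^{-\alpha_t}\dot x,\; x\big) \;-\; e^{\beta_t} f(x) \Big),
$$

with the time-scaling functions $\alpha_t, \beta_t, \gamma_t$ as the degrees of freedom.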
So, now why do I think this is an
interesting object to be studying?
Well first of all, what do you do with Lagrangians?
You put them inside of an integral,
and then you optimize over the paths.
And that will give you the particular path
that we argue you should be following
if you're trying to optimize in the optimal way.
So let's go to that.
Here's what Lagrange told us how to do, we take our
Lagrangian and we put it inside of an integral,
and then we optimize over paths and maybe
we picked out this particular path, here.
And then the mathematics is called calculus of variations.
You just, in essence, set all partials as equal to zero,
and that's, in functional analysis,
called the Euler Lagrange equations.
And so it just takes a bunch of partial derivatives
of this thing in a total time derivative.
And once you do that, you'll get out a differential equation
you'll get some xs and some x dots and some x double dots,
and so let's do that.
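The Euler-Lagrange equation itself is just

$$
\frac{d}{dt}\left\{ \frac{\partial \mathcal{L}}{\partial \dot x}\big(x(t), \dot x(t), t\big) \right\}
\;=\;
\frac{\partial \mathcal{L}}{\partial x}\big(x(t), \dot x(t), t\big),
$$

applied to the Bregman Lagrangian above.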
It's a two line little derivation, and so here you get out
this master differential equation.
It looks awful, maybe, to you, but I love it,
and so it looks beautiful to me.
It's got second order, it's got a non-homogeneous
component, and then here's a whole bunch of geometry.
Just, that's geometry to me.
And then there's the gradient.
If H is quadratic, if we pick alpha and beta
in a particular way as P log T,
and if, what else do we do...
Yeah, if we just do all that, then that differential
equation reduces to the Su, Boyd and Candes thing.
But moreover, this differential equation,
under particular settings of alpha, beta, gamma, and H,
recovers all the known acceleration algorithms,
at least the ones that I know of in the literature,
in their continuous time versions.
So it seems to be a generator of accelerated algorithms.
So that's already kind of cool that we were able
to unify them in one differential equation.
But now, what could you get more out of this?
What mathematical other facts can you learn
from having written all this stuff down?
Well, first thing you might wanna do is get a rate.
You'd like to know, can I get a general rate
once and for all, not having to do with algorithm
specific rates, which the literature has been doing
for all the discrete time algorithms.
And so those rates are (mumbles).
It's not that painful, but it's two or three
pages, typically, of calculation.
The amazing thing here is you get a rate
with just one line proof, for this entire family of things,
and I'm gonna show you that proof on the next page.
So here is the rate.
The rate is a big O of E to the minus beta T.
So beta T is that degree of freedom
that you have as a designer.
So here it's already kind of interesting,
you get to pick your rate.
So we're gonna return to that in a minute.
You pick the rate.
So anyway, it's this, and here's the proof.
You just write down what's called a Lyapunov function,
here's the Lyapunov function, involving the
Bregman divergence, and involving the optimum.
You take its first time derivative,
you ask that that be negative, so you go downhill
in this particular Lyapunov function, and this happens
if and only if this occurs, so it's a one line proof.
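If I remember the paper's form right, the Lyapunov function is, roughly,

$$
\mathcal{E}_t \;=\; D_h\!\big(x^\star,\; X_t + e^{-\alpha_t}\dot X_t\big)
\;+\; e^{\beta_t}\big(f(X_t) - f(x^\star)\big),
$$

and showing $\dot{\mathcal{E}}_t \le 0$ under the scaling conditions gives $f(X_t) - f(x^\star) \le O(e^{-\beta_t})$ immediately.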
Ashia Wilson, who I mentioned earlier, has done another
couple of papers that are quite beautiful,
exploring the whole space of
Lyapunov functions, both in discrete and continuous time
and linking them, so that's a very nice little magnum opus.
Anyway, that's already kind of interesting,
looks like we've learned something about this.
But, can we learn anything more?
And then something really interesting happened,
and this is where I think it finally starts to get
interesting, maybe it's been tedious til now, but anyway.
So, as I said, I get to pick the rate,
and that seems odd.
Because aren't rates just kind of
given to us by the algorithm?
So let's suppose I pick a rate:
I set beta T to be the logarithm of T.
And so I take the exponential of that, I get one over T,
which is a rate I know I can achieve with gradient flow,
for example.
I know I can do that.
So anyway, I set beta T equal to that value, and then
there's a particular way to set alpha and gamma,
they're actually determined by that,
I'm not getting into that.
And then I set H to whatever.
And then I plug this in, I get actually gradient flow,
it just reduces to gradient flow.
Yes.
- [Male] Quick question.
Why not just set it to infinity, and converge instantly?
- Good, you're asking the right question.
I'm gonna bring you into the conversation here in a moment.
So suppose I set it to logarithm
and I get this known algorithm, great.
Suppose that Sham, he's a little smarter than I, he says,
I'm gonna set it to two times the logarithm of T.
Now when I take the exponential,
he gets out a rate of one over T squared.
And we know there exists such an algorithm;
it's the accelerated method of Nesterov, which does that.
Now here's the interesting part, in this space, this x,
x dot space, he and I will follow exactly the same path.
That's surprising.
What'll happen is that he'll move
faster along the path than I am.
And all that means is that he's using
a different clock than I'm using.
That's not interesting.
The path is what's interesting,
and we're following the same path.
So this Lagrangian has a mathematical
property, it's called covariant.
And so that means that for any of these choices
of beta, we're all following the same path.
So now let's suppose that Ed says, no, I'm gonna move
exponentially, guys, I'm gonna set beta T equal to T.
And so I'm moving exponentially fast.
And he will now design, with his machinery,
his Euler-Lagrange equations, he'll get an algorithm,
and it'll move exponentially fast.
He will, but it'll move along our same path.
So he's really not doing anything new,
he's just using a different clock.
And like Einstein said, the clock doesn't matter,
it's not the interesting part of this problem.
And our young man here in front, Pedro says,
I'm gonna use a super infinite fast clock,
and I'm gonna go as fast as...
And yes, you can, but you will move along
our same path, just using your weird clock, so.
So in continuous time, in some sense,
the phenomenon of acceleration evaporates.
But that's not what really happened.
What happened is that acceleration is not really
all about acceleration, it's about the path.
There is an optimal path to follow for optimization.
And you wouldn't have seen that in discrete time.
All right, now, what's then interesting, and is
the whole story at this point, is how you discretize.
How do you go back into computer science land?
Because we know in computer science land, we can't
discretize at, Ed can't discretize his differential
equation, something's gonna break
with his exponentially fast thing.
Similarly for you, you won't be able to do it.
But we will be able to, 'cause we
picked these less ambitious rates.
Now why do we know that,
well, we just have these oracle bounds.
But, there must be some more mathematics
that actually tells us that.
So let me now move on to the next part of the story,
which to me is the most interesting.
So these are the mysteries about a year ago.
Why can't we just discretize when we are using
exponentially fast clocks, or super fast clocks?
What happens when we arrive at clock
speeds where we can discretize?
What happens in some mathematical sense?
Is there a signal, a flag that's been raised,
of something happening? And then once we
arrive at that point where we can
discretize, how do we discretize?
And so I'm gonna claim these are all solvable,
these problems, and the secret is a technique,
or not technique, a beautiful area of mathematics
called symplectic integration.
How many of you know what symplectic integration is?
Okay.
The MIT person over here knows.
Yeah, it's not widely appreciated, it's one of the major
areas of mathematics, I'm sure your mathematics department
has several symplectic geometers here.
When I put it in my abstracts nowadays,
I go give talks, a few of these people show up.
And they usually have gray beards, and they're sitting there
in the back looking bored, but then they lighten up
when I get to this part of the talk (laughter).
All right, so what is symplectic integration,
I think I have a slide that says something.
Yes.
All right so, this was developed in the late 1800s
by some people that you will recognize, Hamilton,
Poincare, Jacobi, a bunch of these kind of people.
What were they trying to do?
Well, at that era, they had a whole bunch of
interesting differential equations.
They had Newton's equations, but they had Maxwell's
equations, and a whole bunch of others, fluid equations.
And they were saying, they were very prescient,
they were saying, how could we discretize
these differential equations?
They didn't have computers yet, but somehow
they're already thinking ahead.
And they asked the following question, well,
the true flow of these differential equations
gives us certain paths that have
certain mathematical properties.
They, for example, conserve energy.
They conserve momentum.
There's physical laws, there's certain
invariances along these paths.
If I discretize, I'm gonna hop along a discrete collection
of points, and I'll have an approximation or that's fine,
we have to tolerate that.
But if I hop in a certain way that I'm losing momentum
or losing energy, I'm not doing the real physics anymore,
and that would be just wrong.
And they were very prescient, because now, if you were to
do that nowadays, with, say, simulating the galaxies,
the cosmos, or simulating the climate, or protein folding,
and you did it in the wrong way, you bleed away energy
and momentum and you just get bad answers, just bad physics.
So people don't do that.
Why, because these guys figured out
a way to do it the right way.
And the right way to do it is, is that there is a class
of methods that will exactly preserve energy momentum
even as they're doing a discrete approximation
of the differential equation.
Those are called symplectic integrators.
Roughly speaking, well...
What is the idea, so I knew about this back at MIT,
I sat in some differential equations courses
when I was a young assistant professor.
And I heard about this a little bit
and I just kind of remembered it.
I remember the following idea, which is that
if you think about a classic differential equation,
it's a vector field, and you're
flowing along the vector field.
So at any given point, there's an arrow leaving that point,
and you follow that arrow for a little bit over time.
The symplectic perspective is different: at every given point
there's a little triple of vectors.
And that defines the little volume element, and if you take
the determinant of all the elements then you get a volume.
And so now, as that volume element
flows forward in the flow,
it's gonna shear in some way, and
I want its volume to stay constant.
And the volume is just a determinant, so
that's just a little piece of mathematics.
And so you work out an integrator, which as it moves along
discretely, preserves exactly these volumes.
And then you can generalize this to volumes in phase space,
and then to exterior forms in differential geometry.
In fact, this drove a lot of the early
development of differential geometry.
Okay, now the actual algorithms turn out to be
kind of interesting, they're not that surprising.
So if you'll remember, when you discretize differential
equations, you'll take like maybe the explicit method,
which looks at where you currently are, and the
gradient where you are, and then it moves forward.
Another, the implicit approach, takes where you currently
are, looks at the gradient of where you're going to move to,
and uses that gradient.
And mathematically, this is
just looking one way or the other.
The latter, you have to solve a little
implicit equation, but you can often do it.
Neither one of those are symplectic,
they will bleed away energy or momentum.
But if you interlace them in a particular way,
some terms will cancel, and you will exactly
conserve these things.
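Here's a minimal sketch of that interlacing, the classic leapfrog (Störmer-Verlet) step for a separable Hamiltonian $H(q,p)=\tfrac12\|p\|^2 + f(q)$; this is just the textbook integrator, not the particular one we used for the time-varying Hamiltonian.

```python
import numpy as np

def leapfrog(grad_f, q, p, step, n_steps):
    """Stormer-Verlet (leapfrog) integration for H(q, p) = 0.5*||p||^2 + f(q).

    Interlaces a half-step in momentum, a full step in position, and another
    half-step in momentum; the resulting discrete map is symplectic, which is
    why energy doesn't bleed away over long simulations.
    """
    q = np.asarray(q, dtype=float).copy()
    p = np.asarray(p, dtype=float).copy()
    for _ in range(n_steps):
        p = p - 0.5 * step * grad_f(q)   # half kick
        q = q + step * p                 # full drift
        p = p - 0.5 * step * grad_f(q)   # half kick
    return q, p

# Tiny check on a quadratic potential f(q) = 0.5*||q||^2, whose exact flow is a
# rotation in (q, p); leapfrog tracks it while approximately conserving H.
q, p = leapfrog(lambda q: q, q=[1.0, 0.0], p=[0.0, 1.0], step=0.1, n_steps=100)
```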
So I remember learning about this, and then I learned the
other interesting fact, which is that because these symmetries
are being preserved, these little triples,
this has much better error control properties;
much less error is accrued as you move along,
'cause there's a kind of stability,
a lot of things are canceling.
And so you can move much farther, step size
is gonna be turned up by a large factor.
And that's critical for these large scale applications.
So no one who does serious chemistry or physics nowadays
uses anything other than symplectic integrators,
at least to my understanding.
So, why did I think about looking into this a little bit?
Well, because, if you can move as fast as possible with
these integrators, that's what acceleration somehow is.
So it must be relevant.
I'm gonna now kinda cut to the chase a little bit.
There is a standard recipe for doing this:
you start with a Lagrangian, and the Lagrangian's time
varying, so we need to move into the Hamiltonian framework;
there's something called a Legendre transform,
which you just do, and you
move into Hamiltonian variables,
that's a bit of conjugate duality, if you will.
I guess this is Rockafellar's university,
so you all know about that.
And so we get a new set of variables that include time
as an explicit variable, you now have a Hamiltonian,
which is time invariant, and there's a standard approach
to applying symplectic integrators to that,
there's not just one symplectic integrator,
there's a whole family of them.
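The generic recipe, for reference: introduce the conjugate momentum and swap variables,

$$
p \;=\; \frac{\partial \mathcal{L}}{\partial \dot x},
\qquad
H(x, p, t) \;=\; \langle p, \dot x\rangle - \mathcal{L}(x, \dot x, t),
$$

with $\dot x$ eliminated in favor of $p$; that's the Legendre transform, the conjugate duality I just mentioned.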
So we picked a particular one,
here's a slide of actually doing this.
So this is a particular high dimensional quadratic,
and so we write down the quadratic, that's the function F,
and we make these other choices that I mentioned, and
turn that into the Euler-Lagrange equations, or sorry,
turn that into a Hamiltonian, which turns into a symplectic
integrator, and we put that on the computer.
So we did no kind of algorithmic choices of any kind,
the mathematics just drove us along,
and here is the actual convergence.
And so that is the oracle rate for this particular function,
and so it is doing as advertised.
Moreover, it's oscillating.
It's not just going downhill, it's
going up and down and up and down.
All right.
So that's really cool.
Okay, so let me spend the rest of my talk sort of,
I could go on about that, but we just wrote a paper
last week called 'On Symplectic Optimization'
which is on my homepage, so if you're curious
you can see some more details.
I also wanted to put a slide in here, which I forgot to,
which is that if you compare to Nesterov,
Nesterov on this problem also achieves the oracle rate,
and then near the bottom it actually
turns down a little bit faster.
And that's an interesting phenomenon
that's not part of the dynamics.
There are a couple of interesting mysteries there.
So in some sense, Nesterov's a little bit faster
than us if you set the step sizes equal.
But, this is a symplectic integrator,
so you can turn up the step size that much more.
So when we do that, we really beat Nesterov.
So this symplectic algorithm actually beats Nesterov,
which is I think a real achievement.
Okay, so real briefly, a followup project, here.
Acceleration and saddle points.
So this has been an open problem.
So this puts two ideas together now.
Acceleration is this kind of momentum I start to build up
that allows me to get downhill as fast as possible.
If I've got a lot of momentum and there's a saddle point
sitting there, presumably that might help me:
as I'm moving near the saddle point,
I might be able to get past it,
blow by it, I don't see it as much.
But that's the intuition; as usual,
you have to try to prove something here.
It could be false, that intuition.
Anyway, Chi Jin and Praneeth Netrapalli
and I have worked together on this.
And so just really briefly, there has been some
existing work showing that, for example, accelerated
gradient descent finds, in this non-convex setting,
epsilon-second-order stationary points at a rate of one over epsilon squared.
There are some nested loop algorithms where you do
something like acceleration inside of another loop.
So if you know about interior point methods,
they kind of have a Newton inside of a nested loop,
and there are algorithms where that's reasonable.
In our field, these nested loops, I just don't
really believe in them, I think we really need
to have kind of just a simple algorithm.
So anyway, faster rates have been achieved,
not only one over epsilon squared,
but one over epsilon to the 1.75, if
you allow yourself the nesting.
So we asked: can we do this with basically just
accelerated gradient descent, without being more fancy?
Same assumptions as before: Hessian
Lipschitz and gradient Lipschitz.
And the goal is the same as before:
we wanna find a second-order stationary point.
Here is our algorithm and the runtime,
I'm not gonna go through it, but basically
this is just accelerated gradient descent,
that's what steps four and five are right there.
This is now the same perturbation condition as before,
and this is something called negative curvature exploitation,
which I will mention in a minute.
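(A schematic of the shape of such an algorithm, in Python; this is a paraphrase, not the exact pseudocode from the paper, and the parameter names are mine. The negative-curvature-exploitation step is only indicated in a comment.)

import numpy as np

def perturbed_agd(x0, grad_f, eta, theta, r, g_thresh, n_iters, seed=0):
    # eta: step size, theta: momentum parameter, r: perturbation radius,
    # g_thresh: gradient-norm threshold that triggers a perturbation
    # when we might be near a saddle point.
    rng = np.random.default_rng(seed)
    x = x0.copy()
    v = np.zeros_like(x0)
    for _ in range(n_iters):
        if np.linalg.norm(grad_f(x)) <= g_thresh:
            x = x + r * rng.standard_normal(x.shape)   # small random kick near a saddle
        y = x + (1 - theta) * v          # accelerated extrapolation (steps four and five)
        x_next = y - eta * grad_f(y)     # gradient step at the extrapolated point
        v = x_next - x
        x = x_next
        # The full algorithm also has a negative-curvature-exploitation step
        # that resets the momentum when the Hamiltonian increases; omitted here.
    return x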
So anyway, here is the result.
Now hopefully you're getting used to being theoreticians
and just looking at and understanding these results.
This is the convergence rate: it's not one over epsilon squared,
it's one over epsilon to the seven fourths.
So, da-da, proof that acceleration can
accelerate you past saddle points.
Moreover, we just have the Lipschitz constant and a rho up in
the top and the usual thing, and then we have that tilde
again, and that tilde here is logarithmic in D, again.
It's the logarithm to the sixth power now, and not the fourth,
but again, I think that's just the proof technique.
Okay, I'll skip over that.
So how does this proof go?
Well, you have to use, it turns out, a Hamiltonian.
The Hamiltonian allows you to handle the fact that
in this non-convex setting,
you're not always doing a descent.
If you just look at the function value itself,
it's not descending; you're going up and down.
If you have the Hamiltonian, you're mostly going down.
It takes out a little bit of that, but there's
occasional times where you can still go back up.
And that's what that NCE step is designed to handle.
So it's a bit of a patch, if you will.
There may be a generic Hamiltonian here,
but this patch is what allows us to prove this theorem.
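(Roughly, the quantity being tracked is a Hamiltonian of the form E_k = f(x_k) + (1 / (2 * eta)) * ||v_k||^2, the function value plus a kinetic-energy term in the momentum; it mostly decreases even when f itself goes up and down.)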
So I'm gonna skip over the details here.
All right, then two more minutes on follow-on work.
A lot of that work is from several months ago;
what we're mostly doing nowadays is asking, what if
we put all this sort of stuff together with stochastics?
Instead of just doing a perturbation condition,
what if we do real stochastic gradients, or real diffusions?
So if you work in discrete time,
those things are called stochastic gradient methods,
when you take them into continuous time, they're called
diffusions or, more generally, Markov processes.
And they have rates of convergence,
they probably have optimality principles,
we don't know what they are yet.
And so there's a huge number of open problems about,
what's the fastest way to diffuse?
What's the fastest way to sample and diffuse?
And how does the dimension dependence go?
What about the non-convex case?
So I've got a large group now
and everybody's working on this.
And I know that people here are working on it, too,
there's a lot of people working on it.
This project was with one of the students, Xiang Cheng,
and another student, Niladri Chatterji, and then
Peter Bartlett joined us for this project.
So, can we accelerate a diffusion?
What is a diffusion? Again, it's a
continuous-time version of stochastic gradient descent.
So think Brownian motion with a gradient.
There have been some negative results.
But, they've focused on what's called overdamped diffusions.
So just briefly, the classical diffusion that people have
been analyzing here is called Langevin diffusion.
Langevin diffusion is just what you might expect:
it's the way you descend, it takes a
gradient and it adds Brownian motion.
So if you're a gradient-loving person,
it's the first thing you would start with.
Okay, but, from all the work we've been talking about today,
we know that acceleration and getting momentum into these
things lets you move faster, so probably
we can diffuse faster if we allow momentum.
So instead of doing overdamped diffusion,
which makes you sluggish, allow yourself some underdamping,
so you can kinda hop around a little bit more.
And so you now get two equations,
which make an underdamped Langevin diffusion,
and you can do that analysis.
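(A minimal sketch of both diffusions via Euler-Maruyama, in Python; the parameterization of the underdamped dynamics here is one common convention and may differ from the exact one in the paper.)

import numpy as np

def overdamped_step(x, grad_f, h, rng):
    # One Euler-Maruyama step of dx = -grad f(x) dt + sqrt(2) dB_t
    return x - h * grad_f(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)

def underdamped_step(x, v, grad_f, h, gamma, u, rng):
    # One crude Euler-Maruyama step of
    #   dv = -gamma * v dt - u * grad f(x) dt + sqrt(2 * gamma * u) dB_t
    #   dx = v dt
    # The momentum variable v is what lets the process hop around
    # instead of being sluggish.
    v_new = v - h * (gamma * v + u * grad_f(x)) \
            + np.sqrt(2.0 * gamma * u * h) * rng.standard_normal(x.shape)
    x_new = x + h * v
    return x_new, v_new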
So can we accelerate that, and I think
this is now actually my last slide.
Here are the results for overdamped Langevin, and these are
very recent results; it's mostly people in France
who have been working on this.
The convergence rate is one over epsilon squared,
and it has a factor of D in the numerator,
and our new result shows that with underdamped Langevin,
more of a momentum kind of method, you can go
all the way down to one over epsilon, which is much faster,
and the dimension dependence improves to the square root of D.
Okay, so I am done with everything I really wanted to say,
let me just now bump back up to the top level.
I hope you appreciate that there
are tons of open problems, here.
And if you have some math background, especially some
physics, but not just, there are just all kinds of things
you can do, and many many problems that are open.
And these are ones that I think need to be answered
if we're gonna have a serious engineering science
that gives some guarantees, that says this will happen
under this situation, the error bars I'm getting are
calibrated, and so on.
We really kind of got stuck because
computer science and statistics,
and I think of myself as both,
were very, very separate.
And the computer scientists gave
really strong guarantees with no noise.
And the statisticians gave guarantees but with
no kind of serious, computational thinking.
No speed of convergence in any serious way.
And when you put the two together,
you can start to solve real world problems.
Thank you.
(applause)
- Thank you very much.
We have time for a couple of questions from the audience.
- All the way back in the back.
- What are your thoughts on the continuous-time approach
for non-linear constraints, like a
problem with non-linear constraints?
- So, a very good question: he asked about
constraints, and specifically, non-linear constraints.
So I talked almost entirely about unconstrained
problems here, as you may have noticed.
And it's another one on my list of many open problems,
to do a lot of these things with constrained optimization.
There has been a growth of kind of conditional gradient
ideas in the last few years, noticing that there are
really, really simple ways to impose constraints
that scale really well in dimension,
and so on my list of things to do is kinda
redo some of this literature, but with
conditional gradient instead of gradient.
And do it in continuous time.
In fact, in continuous time, the constraint doesn't have to be
just a hard wall, a 'you can't go past this';
you can have the constraint be a force
on your second-order differential equation.
So you'll start to get there and the force pushes you back.
I think computationally that's gonna be an
actually better way to go for constraints
than the way we're used to in discrete time.
And non-linear versus linear is not really the big issue here.
These tools really are a bit
agnostic to linear versus non-linear.
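(As a pointer to what "conditional gradient" means here, a minimal Frank-Wolfe sketch in Python; the L1-ball oracle is just one illustrative choice of constraint set, not one mentioned in the talk.)

import numpy as np

def frank_wolfe(x0, grad_f, linear_oracle, n_iters):
    # Each step only needs a linear minimization over the constraint set,
    # so constraints are imposed without projections and it scales well.
    x = x0.copy()
    for k in range(n_iters):
        s = linear_oracle(grad_f(x))       # argmin over the set of <grad f(x), s>
        gamma = 2.0 / (k + 2.0)            # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s  # convex combination stays feasible
    return x

def l1_ball_oracle(g, radius=1.0):
    # Linear minimization over the L1 ball: a signed vertex at the
    # largest-magnitude gradient coordinate.
    s = np.zeros_like(g)
    i = int(np.argmax(np.abs(g)))
    s[i] = -radius * np.sign(g[i])
    return s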
- So ideally, you'd like to be neither overdamped
nor underdamped, right, because if you're underdamped,
you oscillate, and that slows it down.
Is there an optimum?
- No, you don't...
So, no.
You wanna oscillate.
The proof shows that the oscillation
is what gets you the fast speed.
It's not a perfectly...
- But does it give it the maximum speed?
(muffled speech)
- Critically damped is slower
than the oscillatory.
Yeah, it's not quite the same.
Your physics intuition helps, but it's not exactly right.
Anything else?
- Not to be basic, but have you compared this to
things like Adam, that optimize neural networks?
- Yeah, so he's asking about Adam,
a particular method applied to neural nets.
And maybe Sham could kinda help me, here,
but we're in the very early days of
optimizing really, really large problems
that are non-convex and stochastic, and all that.
And I think that Adam is just kinda one little step
along the way, it's a somewhat ad-hoc algorithm
for which there's not a lot of theory,
and there are theoretically well-grounded methods which are
roughly as good, or maybe even better,
and I think, at least in my experience,
the theoretically-motivated stuff wins out.
And either Adam will be shown to be
the descent on some Hamiltonian...
- [Offscreen] I think you should just merge them together.
- You can...
Merge them together, Mike, okay.
There's a nice... (murmuring)
Merge them together, Republicans and Democrats
let's all come together. (laughter)
But no, it's a good point; there's not gonna be
one answer to everything, there's gonna be
various ingredients that all come together.
This particular algorithm has one particular insight,
this has another insight.
And for a good user of these, it won't be
the computer just delivering answers.
It'll always be the human sitting in there thinking,
how do I put that ingredient together with this,
as a good engineer should be.
A good chemical engineer will build a chemical factory,
which didn't develop all by itself.
In fact, I should say I've written an op-ed on AI
over the last few days, and I'm gonna put it out there
at some point, and it's a little polemical.
I am unhappy with all the hype over AI,
and it says something like, you know,
when chemical engineering and civil engineering
were developed, people didn't say, the best way
to do this is to build an artificial chemist,
and then that chemist will know how to build the factory.
(laughter)
And for some reason that's how we're
thinking about what we're doing here.
Let's build an artificial general intelligence,
and it'll solve all our problems,
or kill us, one of the two.
Instead of just thinking about, what are the ingredients,
how do we as engineers put them together,
how do we use them for useful purposes, and so on.
Is that a good place to stop?
Well, one more.
- To your point about presenting mathematical
formulas and these equations for computer scientists:
how and what kind of work are you doing in the area
of knowledge representation, where you take a problem
and you represent it in a way that computer scientists
can sort of begin to understand it better?
- I love your question, it was about knowledge
representation, computer science and math.
Some part of it I love, and some part I don't.
Computer scientists should be just as aware
of uncertainty and dynamics and
optimization as everybody else.
And don't call yourself computer scientist
and wall yourself off from that, please.
But also, all the other people out there
should be aware of knowledge representation.
And ontologies, and relational things,
and first order things, and reasoning.
And, I think so many people in the audience
know this, I love natural language processing, which I
think is a field that brings all these things together.
And actually someone asked me, if I had a lot of money, what
I would do with it; I'd spend it on solving NLP problems.
'Cause it does require all of this; you have to think about
representation, and don't get me started, but here they are,
they're loving me again, so. (laughter)
All right, get me a beer afterwards.
So just to say this even more provocatively,
even though it's being filmed,
I do not believe that neural nets
just with huge amounts of data
will solve serious problems like
machine language translation or dialog.
Period. Ever.
They will make useful artifacts that
we can use for some simple purposes.
But they won't solve them; it's gonna be humans in the loop,
thinking about the human sides of these things,
including knowledge representation.
But, you know, it's been 40, 60 years of thinking about
knowledge representation, and it
didn't lead that far that fast.
So it takes time, it's hard.
And it requires annotations; it
requires knowing why you're annotating what.
But don't read the media too much about a lot of this stuff,
this could take decades to do this field well.
And for people thinking that, oh, five years have gone by
and we don't have knowledge representation solving
all the NLP problems, well,
that is such a hopelessly bad way to think.
- Great.
And with that, let's please (laughter)
thank our speaker. - Thank you.
(applause)