>> Without further ado,
we are happy to start with our own Ilya Razenshteyn,
who will talk to us about Beyond John's Ellipsoid.
>> Thanks for the introduction.
So I'll talk about joint work with Alexandr Andoni,
Assaf Naor, Aleksandar Nikolov and Erik Waingarten.
I don't plan this talk to be too formal,
so if you have like questions,
feel free to interrupt me in
any meaningful way. That would be great.
So, the starting point is John's ellipsoid.
Many of you have probably seen it and basically,
the statement is very simple.
So if you have a centrally symmetric convex body,
you can sandwich it between two ellipsoids,
which are square root d apart.
So, for example something like this.
And this is extremely useful statement.
So it has, I don't even know,
maybe thousands of applications or something.
Obviously, it has lots of
applications in convex geometry,
so many proofs go,
let's take our convex body
and move it to the John's position,
which means that we perform
a linear transform so that
the ellipsoid is actually a unit ball.
But it also has some other fun applications.
So these are just a few that I personally love a lot.
So one is a beautiful result on how to solve
integer programs in constant
dimension in polynomial time.
Another is how to approximate
submodular functions
using John's ellipsoid as a subroutine.
And finally, it has been relatively recently used in
a very beautiful way to improve
bounds on the log-rank conjecture.
So this is a great result.
So the question is, can we improve it?
And the answer is no.
And basically, this bound square root d is tight
for many examples including the hypercube,
the cross-polytope,
and pretty much lots of other things.
And in this talk,
I will tell you how to get improved bounds
if we are willing to relax the guarantee a little bit,
which is still useful for applications.
But before I do that, let me
give you two equivalent reformulations of
the John's theorem because the result I'm going to state,
it will be related to these reformulations rather
than the geometric statement I told you.
So the first one is the analytic form.
And basically, here we have
a pretty trivial observation that
if you have a centrally symmetric convex body,
then that gives you a norm.
And how do you find this norm?
So pictorially, if your body is like
this and your vector is like this,
then the norm of it is around
two because it's around twice as
long as the distance to the intersection with the boundary.
So the formula would be like this,
and this is called the Minkowski functional
of the body K. So this is a norm.
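For reference, the formula on the slide is presumably the standard Minkowski functional: for a centrally symmetric convex body K,

    \|x\|_K = \inf \{ t > 0 : x/t \in K \},

so K is exactly the unit ball of this norm.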
And John's theorem says
that if you have some norm on R_d,
then there exists a linear map,
which is Lipschitz as a map from my norm to L2,
and the inverse is also Lipschitz.
And the Lipschitz constant from X to
L2 is square root d and the Lipschitz
constant from L2 to X is one.
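In symbols (my paraphrase, with the constants as just stated): there is a linear map T on R^d such that for every x,

    \|x\|_X \le \|Tx\|_2 \le \sqrt{d}\, \|x\|_X,

that is, \|T\|_{X \to \ell_2} \le \sqrt{d} and \|T^{-1}\|_{\ell_2 \to X} \le 1.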
Okay? So this is completely equivalent reformulation,
but it will be convenient to
think about this statement in this form.
And the other reformulation
is essentially just a repackaging of
the previous slide using the Banach-Mazur distance.
So Banach-Mazur distance between two norms or between
two convex bodies is the following expression.
So it's the minimum over all linear transformations T
of the operator norm of T
as an operator from X to
Y times the operator norm of
the inverse as an operator from Y to X.
So it's basically exactly the
same as on the previous slide,
but rewritten using more succinct notation.
So basically, John's Theorem using this notation
says that the distance from any d-dimensional norm
to d-dimensional L2 is at most square root d. So,
that's it for the John's theorem.
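Written out, the definition from the slide is presumably

    d_{BM}(X, Y) = \inf_T \|T\|_{X \to Y} \cdot \|T^{-1}\|_{Y \to X},

with the infimum over invertible linear maps T, and John's theorem in this notation reads d_{BM}(X, \ell_2^d) \le \sqrt{d} for every d-dimensional normed space X.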
So the question is, if we can do better in some way.
And the answer is yes if we are
willing to sacrifice enough.
And this is actually not a new result in this work.
It's more like we uncovered
it from previous work and used it algorithmically.
So basically, this is a relatively obscure paper
of Daher from 1993.
It has like maybe five citations
total or something like this. So it's really obscure.
And let me introduce
the main result in the quantitative form from it.
So this slide will be a little bit dense,
but yes, bear with me.
So basically, the statement goes like this.
So we start with a norm on R_d as before
and we have some parameter alpha
between zero and one-tenth,
then I claim that there exist two spaces,
X prime and H prime.
So X prime is sufficiently close to X.
It's within distance d to the alpha from it.
H prime is sufficiently close to L2.
It's within constant distance from L2.
And for these spaces,
there exists a nice map between them,
namely between their unit spheres.
And this map has the following properties.
First of all, as a map from X prime to H prime,
basically, it's a Holder map
with constant log D and exponent alpha,
and the inverse is also Holder with exponent alpha and
constant log D. So schematically,
we have two spaces, X and Euclidean space.
Then we perturb both of them.
So we perturb X by d to the
alpha and we perturb L2 by a constant factor.
And then between these spaces,
there exists a very nice map with very good parameters.
So Holder map with exponent alpha and constant log D.
So that's pretty much the statement.
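To summarize the statement in symbols (my paraphrase; C denotes unspecified absolute constants): given a d-dimensional normed space X and 0 < alpha < 1/10, there exist spaces X' and H' with

    d_{BM}(X, X') \le d^{\alpha}, \qquad d_{BM}(H', \ell_2^d) \le C,

and a bijection f between the unit spheres of X' and H' such that for all unit vectors x, y and u, v,

    \|f(x) - f(y)\|_{H'} \le C (\log d)\, \|x - y\|_{X'}^{\alpha}, \qquad \|f^{-1}(u) - f^{-1}(v)\|_{X'} \le C (\log d)\, \|u - v\|_{H'}^{\alpha}.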
So let me compare it with the John's theorem.
So the biggest advantage
is that parameters are very good,
namely the constant appearing here is log d instead of
square root d. And
we also need to sustain some perturbation,
but the perturbation is also pretty small.
It's like d to some small power. So it's very good.
It's much better than square root d. However,
it comes at a cost and there are like many disadvantages.
First of all, it's only defined for unit vectors.
That's probably the biggest annoying thing about it.
And second, it's not a Lipschitz map.
It's alpha Holder, where alpha is pretty small.
So the smaller the alpha is,
the worse the quality of the map is.
And another pretty unfortunate property
of it is that it's actually non-linear.
It can't be a linear map.
So in particular, computing
it becomes a non-trivial issue.
Any questions about the statement?
>> Are all of those things
necessary, or can you avoid some of them?
>> Some of them are definitely necessary.
So for example, non-linearity
is necessary because if it were linear,
it would hold for the whole space.
Let me put it this way:
I think that you can improve the upper bound.
So you can make it work for alpha being constant and here,
maybe have like polylog d.
But you can't improve both the upper and the lower bound beyond that.
There are lower bounds, even for L infinity actually.
>> [inaudible]
>> So basically, there is a tradeoff
between how much you
perturb and what Holder exponent you are getting.
And I think for L infinity,
it's sort of necessary or close to being necessary.
And because I actually have quite a bit of time,
maybe let me show you how this map
works for L infinity actually.
>> What is alpha Holder?
>> Alpha Holder, basically, it's just this.
So the norm of the difference of the images is at most
a constant times the norm of the difference to the power alpha.
So Lipschitz is when alpha equals one.
So it's a weaker property when alpha is small.
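In symbols: f is C-Holder with exponent alpha if \|f(x) - f(y)\| \le C \|x - y\|^{\alpha} for all x, y; taking alpha = 1 recovers the Lipschitz condition.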
>> I guess a way to think
about it is that smaller distance.
>> Small distances get blown up quite a bit, yes.
So for example, because we talk about unit spheres,
the statement trivializes for alpha equal to zero.
So anyway, anything is zero Holder, but ideally,
we would like alpha to be as large as possible,
but if we are only willing to perturb X a little,
then we can't quite achieve that.
So let me tell you how this works for L infinity.
Let me use the white board. So for L infinity,
the construction is actually very simple.
So if my starting space is L infinity to
the power d and my target is L2 to the power d,
then my perturbation is just to say that okay,
instead of looking at L infinity,
let's look at L_p with p around square root of log d,
to the power d. So
these two spaces are within two to
the square root of log d of each other.
And between these spaces,
there is a very nice map and that works as follows.
So we take a vector x and
we'll pretty much raise every coordinate to some power.
So it's basically x_i to the power, if we call this p,
this would be p over two, keeping the sign, and that's the map.
So that's the whole thing. I mean,
why it has this property,
that's a separate question.
It's not entirely trivial, but
the construction at least is very simple.
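As a sanity check, here is a small Python sketch of this coordinate-powering map (a Mazur-type map); the function name is mine, and the only thing it illustrates is that the map sends the unit sphere of l_p^d to the unit sphere of l_2^d, which is the easy part of the claim.

    import numpy as np

    def mazur_map(x, p):
        # Raise each coordinate's magnitude to the power p/2, keeping its sign.
        return np.sign(x) * np.abs(x) ** (p / 2.0)

    d = 1000
    p = np.sqrt(np.log2(d))          # the perturbed exponent, about sqrt(log d)
    x = np.random.randn(d)
    x /= np.linalg.norm(x, ord=p)    # normalize to the unit sphere of l_p^d

    y = mazur_map(x, p)
    print(np.linalg.norm(y))         # ~1.0: sum |x_i|^p = 1 implies sum |y_i|^2 = 1

The Holder estimates for this map and its inverse are the non-trivial part and are not checked here.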
Okay. So, let's see. Any more questions?
So, why this could potentially be useful?
It's not obvious.
It feels like it almost gives nothing.
The fact that it is useful for something,
it's a separate story.
But at least in terms of properties,
you can meaningfully compare it with the John's Theorem.
>> So far, the [inaudible]
>> Precisely. Yes, definitely.
In fact, beyond the way I formulated
John's theorem, you can
give an even stronger formulation that
gives you more nuanced information about these ellipsoids,
how they touch your body,
and so on, and so on.
Actually, it's useful in some applications. But, yes.
>> Like the bandits.
It's useful in bandits.
>> It's useful anywhere.
>> Yes. That's what I'm saying.
>> But in bandits, fine. Yes, sure.
>> You mean specifically the information
about contact points, the decomposition of the identity that you get.
>> Yes [inaudible] .
>> So, now let me say a little
bit about making it algorithmic.
So, it's nontrivial, and the main reason why it is so,
is because it builds on
a certain construction which takes
two normed spaces and computes
what's called the complex interpolation between them. So,
this was introduced by Calderón in '64,
and I'm not going to define it,
but basically the intuitive idea is that
you have two norms and you show how to
evolve one into the other in a smooth way.
The definition of this interpolation
is basically the minimum of a certain functional over
a certain infinite-dimensional space
of holomorphic functions.
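For reference, Calderón's definition is roughly the following (this is the standard textbook form, not something the speaker wrote out): for a compatible pair of norms (X_0, X_1) on the same space and a parameter \theta \in (0, 1),

    \|x\|_{[X_0, X_1]_\theta} = \inf \Big\{ \max\Big( \sup_t \|f(it)\|_{X_0},\ \sup_t \|f(1 + it)\|_{X_1} \Big) : f \text{ holomorphic on the strip } 0 < \mathrm{Re}\, z < 1,\ f(\theta) = x \Big\},

with f ranging over suitably bounded vector-valued holomorphic functions on the strip; this is exactly the minimum of a functional over an infinite-dimensional space of holomorphic functions mentioned above.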
So a priori it's not clear you can compute it even in
finite time, let alone polynomial time,
but we'll show how to properly discretize
this optimization problem and then
solve it using Ellipsoid method, so,
it's not very difficult conceptually,
but it requires quite a bit of technical work actually.
So, basically we can efficiently compute
norms in the interpolated spaces,
and then use this as a subroutine to compute
this embedding from the previous slide approximately.
So, now let me tell you about applications.
So, we use this embedding to actually
give a better algorithm for finding
nearest neighbors, so let me briefly introduce the problem.
Again, I'm sure many of you have seen it in some way,
and the specifics are not too important here.
So, basically, the starting point is
a metric space and we
would like to execute
approximate nearest neighbor queries efficiently.
So basically, the query is a point,
and we would like to find an approximately closest point.
So, for example, we know that the
closest point is at distance R,
but then we are happy with any point within this bigger ball.
So, the parameters that we care about are
how much space the data structure occupies,
and how much time it takes to answer a query, and so
there is a natural trade-off between
three parameters: space,
query time and approximation.
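Just to pin down the guarantee (a hypothetical baseline, not the data structure from the talk): a plain linear scan already solves the c-approximate near-neighbor problem, and the whole game is to beat its linear query time with sublinear time and polynomial space.

    import numpy as np

    def ann_linear_scan(points, norm, query, r, c):
        # Return any point within distance c*r of the query, assuming the
        # nearest point is within distance r; None if nothing qualifies.
        for p in points:
            if norm(query - p) <= c * r:
                return p
        return None

    # Example usage with the l_infinity norm on R^50.
    pts = [np.random.randn(50) for _ in range(1000)]
    q = np.random.randn(50)
    print(ann_linear_scan(pts, lambda v: np.linalg.norm(v, ord=np.inf), q, r=3.0, c=1.1))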
So, an important case, to which we apply
the embedding from two slides ago,
is when your metric is actually some norm on R_d,
and the data structures that we would like
have space that is
polynomial in the number of points and the dimension,
and query time that is sub-linear in the number
of points and polynomial in the dimension.
Most of the work on this problem was done
for the case when the distance is L1 or L2,
and this is for a good reason:
first of all, L1 and L2 are probably the
most important distances in applications,
and second, there are actually very
efficient algorithms for these cases.
So, basically, again there are some
formulas for how well you can do,
but let me just say that
as soon as your approximation is slightly more than one,
so basically, as soon as you allow
any nontrivial approximation,
you can get sub-linear query time
and a very good bound on the space.
So basically, the problem for L1 and
L2 is kind of solved,
again modulo some remaining open problems.
So, now let's see what is known about distances beyond
L1 and L2. So, not much.
So, for norms, if your distance is a norm,
then until this work the best result was to just say:
let's take our norm,
disregard it completely, approximate
it using John's ellipsoid,
just run a data structure for L2,
and get approximation square root [inaudible].
So, this is unsatisfactory because we don't really
use the potentially beautiful structure of our problem;
we kind of forget about the norm
and just work with L2.
So, in this work we show how to get
an improved approximation for any norm.
So, if your metric is given by a norm on R_d,
then the new approximation we can
get is pretty much two to
the square root of log d. So, it's subpolynomial in d.
>> I mean, just to give more perspective,
can you say whether for specific norms
[inaudible] not very useful for applications.
>> So, for less abstract norms, fortunately,
you can always do much better than this; you can get like
log d. So, basically,
there are several questions here:
what are the norms beyond L1 and L2 which are useful,
and for which norms this bound is good.
So, the norms that are useful, at least potentially,
I would say that I know like two examples:
first is basically optimal transport,
so W_p norms, and the second class of norms is matrix norms,
so basically Schatten-p and
related things. Actually,
now thinking about it, for
W_p norms a log d bound is not known,
so for them maybe this is the best we have.
That's a good point. So, for matrix norms,
we actually know how to do much better than this;
we can get like a log d approximation for those.
So, this two to the square root of log d,
I think the main strength of this
result is that it's universal,
so it's like for all norms at the same time.
Yes. I mean, if
we talk about how useful this is for practice,
the way it looks now, it's not quite
practical, for example,
because it runs the ellipsoid method at its heart.
Okay. So, I'm not going to tell you how this is proved,
I'll just give you like a very short overview. So,
the short overview is that
we use the embedding from
this talk to show a certain structural result
that I will tell you in a second.
Then after we have the structural result,
then through a somewhat standard pipeline,
we get certain random space partitions,
and then in a completely standard way we use them to
get approximate nearest neighbor data structures. So,
let me tell you this structural result,
how to find sparse cuts in graphs that are
embedded in your space. I think it's really nice.
>> [inaudible] In the entire statement
that you gave, is the d to the alpha
subpolynomial in d?
>> So, well, you need to set alpha to be sub-constant;
actually, you need to set alpha to be
one over square root of log d.
>> [inaudible].
>> Yes, it's alpha-Holder
with exponent one over square root of log d,
so it balances out exactly.
So, yes, it's Holder with sub-constant alpha,
but it's fine for us.
So, let me tell you this result
about Sparse cuts in embedded graphs.
It's actually a purely geometric result,
which has nothing to do with data
structures, and it's self-contained.
So, if you fell asleep, now you can wake up.
So, basically, the game is like this.
So, I have some norm on R_d and now,
I want a statement of the following sort.
So, I want a statement that sounds like this.
So, for any graph embedded in
my norm such that edges have length at most one,
I can guarantee one of two things:
either there exists a ball
of relatively small radius that
contains lots of points,
or the graph has an epsilon-sparse cut.
So, let me draw some pictures.
So, here is a graph in
my geometric space with
the property that all edges have length at most one,
and by length I mean just the distance between endpoints.
So, either I should be able to find a ball of
radius R of epsilon with a linear number of points.
And by ball, I mean a ball in the space X.
So, it's a ball in the normed space.
Or, if there is no such ball,
I must be able to guarantee that my graph has a cut
which has conductance at most epsilon.
So, basically, I want this ball-or-cut property,
and of course the smaller R of
epsilon is, the stronger these properties are.
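In symbols (my paraphrase of the slide): for every graph G = (V, E) with V \subset \mathbb{R}^d and \|u - v\|_X \le 1 for every edge (u, v), at least one of the following holds:

    (1) \exists z : |B_X(z, R(\epsilon)) \cap V| \ge \delta |V| for some constant fraction \delta (the talk just says "a linear number of points"), or
    (2) \exists S \subset V whose cut has conductance at most \epsilon.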
So, I want to ask, what's
the smallest R of epsilon I can always guarantee?
>> [inaudible] To prevent edges from going back to?
>> All edges [inaudible] right?
>> Okay.
>> So you're claiming
this is trivially true if the radius is very large?
>> So, it's not trivially true,
but we'll see why it's true.
So even the finite bound,
it's not entirely obvious, but we will see.
So first of all, for Euclidean space,
the bound is at most one over epsilon.
And again, this is not obvious,
but it's a corollary of Cheeger's inequality.
So pretty much, for example,
if you embed your graph on the line with short edges,
then if it's an expander,
the typical distance is small, so there is a dense ball.
And you can generalize it to Euclidean space
just because the squared Euclidean distance
decomposes over the coordinates.
And in 2017, Assaf Naor proved that for general norms,
this parameter is at most log d over epsilon squared.
So it's a very strong bound actually,
and this bound is tight up to the dependence on epsilon.
So the log d bound is tight.
So we get a very useful primitive actually.
So basically, if I have any graph in my normed space,
either I can find a dense cluster in it,
or I can find a very sparse cut.
And so, it shouldn't be surprising
to you that you can actually utilize it to
design good random space partitions
and get good data structures out of it.
So I think this is really
the interesting core of the algorithm.
However, by itself,
this bound is not very helpful, and let's see why.
So the problem is, we don't have
any control over what this sparse cut looks like.
We just know existentially that it
exists, but it doesn't have to
be induced by any nice subset of my ambient space.
So a priori it can be pretty crazy,
and actually we don't know otherwise.
So as a result, this very nice log d over
Epsilon square bound is actually not
so useful for the algorithms.
But the bound for
L2, for example, comes with a benefit:
the sparse cut can be assumed to be just a coordinate cut.
So basically, we know that either there exists a ball,
or there exists a coordinate and a threshold
such that if you cut there, you'll get a sparse cut.
So the question is, can you do
something similar for general norms?
And the answer is yes, with worse parameters.
But if you just use John's theorem
and disregard the norm completely again,
you can get the bound square root d over
epsilon with coordinate cuts.
So that kind of answers your question.
So that's the simplest way
to see that it's fine for like every norm.
But again, this is bad because it
again gives us an approximation of square root d,
and that was precisely what we were trying to avoid.
And the main technical result of
our work is to show that actually you can get
a subpolynomial bound on R of
epsilon with coordinate cuts,
but not in the original space;
rather, after you apply as
a transformation the embedding
that I've told you about, basically.
So the statement becomes like this.
Either there exists a ball with lots of points,
or, after transforming your graph using this embedding,
you can always find a coordinate cut which is sparse.
So basically now, this is a dichotomy.
And then we use it again to get the data structures,
but this is really the core statement.
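Schematically, the algorithmic dichotomy is then (again my paraphrase): either some ball B_X(z, R), with R subpolynomial in d for fixed \epsilon, contains a constant fraction of the vertices, or, writing \phi for the Daher-type embedding suitably extended beyond the unit sphere, there exist a coordinate i and a threshold t such that the cut \{ v \in V : \phi(v)_i \le t \} has conductance at most \epsilon.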
>> I'm just confused by the statement of
the coordinate because I can always rotate my embedding.
I mean, what's the meaning of the coordinate?
>> Oh, it's still true. For any orthonormal basis,
it would be true. Yeah.
>> Weak.
>> Think about half-spaces
if you want a coordinate-free definition,
but the coordinate version is a bit stronger, right,
because in any basis there
is a good coordinate cut.
Yeah. So that's it for the technical stuff and let me...
>> So you don't have to rotate the Daher map.
So you have one Daher map,
and then from that one Daher map,
you could choose any orthonormal basis.
You don't need to change the map according to that?
>> No we don't. Actually, we need to shift things.
So the precise statement is there
exists a shift vector after which
you apply a radial extension of Daher's map,
and then you'll guarantee a coordinate cut.
We don't need to rotate anything.
We need to shift, and
the existence of a shift is a separate question,
but I don't have time for that.
Yes. So basically, for conclusions,
what I've shown you is that we can
approximate a general norm with
the Euclidean norm beyond John's bound, at
the cost of weaker guarantees, arguably much weaker.
However, it's still useful.
And the main open problem I think,
at least for the way I
presented this is to find more applications.
So because John's ellipsoid
has so many applications,
I would be shocked if all of them really,
really need all the information
that John's theorem gives you.
So I definitely am willing to
bet anything that like there
is at least one more interesting application.
So it's really a question of crowdsourcing.
So if people try,
they will find something, I'm sure.
And of course, there are lots
of interesting questions, for example what you asked:
are these parameters tight?
So they kind of are.
But for example, we don't know that this bound is tight.
So potentially, we can get maybe like
poly log d here with
coordinate cuts after some transformation and that
would have implications for nearest neighbor search.
Because, funnily enough, when
you reduce the embedding to this statement,
you really only care about the upper modulus.
You somehow don't really care about the lower bound.
So potentially, this can be improved.
And yeah, I think that's it. If you
have any questions, ask me.
>> In terms of finding this Daher map?
>> So what exactly do you mean? The partition?
>> Right. Finding the partition,
you need to find this Daher map?
>> Yes, you need to be able to compute that.
>> And then, is that part of it?
>> Yes. So basically,
if you can compute the complex interpolation, [inaudible].
>> So that's the complex interpolation?
>> Yeah.
>> Okay.
>> Any other questions?
>> I have a bit of a philosophical question.
Do you know why this work
of Daher wasn't picked up by the...
>> Well, first of all, it wasn't in English.
>> I thought it was in English.
>> So another interesting thing,
actually, because it's a funny story.
So basically, independently, apparently,
it was discovered by Kalton.
And actually, most of
the citations to it mention this, and in
particular in the MathSciNet review,
whoever reviewed it, there is a name,
I just don't remember who it was,
it was written that the same result was independently
discovered by Kalton and unpublished.
So I think it's like a big discouragement overall;
it felt like the community really disregarded this result.
I don't know why. I think,
to me, it's obvious that it's very important.
Maybe another thing is
that it wasn't formulated quantitatively.
It was like a soft statement
about infinite-dimensional spaces.
And the way it's formulated,
it doesn't sound too
strong in the infinite-dimensional world,
but in finite dimensions, when you work out
all the dependencies, it starts being...
>> So you had to work out the dependencies?
>> Yeah, it's not hard. You just go over the proof.
>> Sure, but still okay.
>> Yeah, it wasn't,
it was an infinite-dimensional statement.
And it was also maybe formulated not for
all spaces but for
uniformly convex spaces.
But again, in the finite-dimensional world,
everything is close to a uniformly convex space,
so yeah. Okay?
>> So, we're going to resume in eight minutes.
>> Okay.
>> Thanks again.