This video includes lyrics on the screen
-------------------------------------------
Google Pixel 3 may finally get this major missing feature on the Pixel 2 ● Tech News ● #TECH - Duration: 2:18. GOOGLE'S Pixel 3 continues to generate excitement, and a new leak could reveal that wireless charging
is finally coming to the search giant's flagship phone.
The Google Pixel 3 looks set to bring a swathe of new features to this hugely popular smartphone
brand.
Rumours are rife that Google is likely to include a faster Qualcomm Snapdragon 845 processor,
improved camera and better battery life.
It's also thought that the larger Pixel 3 XL will get a full edge-to-edge screen which
will cover almost the entire front of the device.
In fact, the only part of the Pixel that won't be filled with this display is a small notch
at the top and slight chin at the bottom of the phone.
Now another new leak may reveal that the latest Pixel is getting something many were hoping
would appear on the Pixel 2 - wireless charging.
9to5Google is reporting that alongside the updated phone will be a new accessory
called the Pixel Stand.
This dock is thought to be able to add power to the Pixel 3 without the need for wires
and that's not all.
Once placed on the Stand, users may also get seamless Google Assistant integration even
when their phone is locked.
The Pixel 2 doesn't include wireless charging and that puts it behind some key rivals including
Samsung, LG and Apple.
Adding this new way of refilling the device would clearly be a popular move and would
certainly make sense as it will future-proof the phone.
This latest leak comes as a new picture was recently released on the web which claims
to show the larger Pixel 3 XL in all its glory.
Gizmochina says it has managed to see images of a set of CAD drawings detailing how the
phone will look inside a case.
Although Google seems to be going for a single camera on the back, there could be the inclusion
of a double front-facing system which may add some DSLR-style depth of field to selfies.
Two final features revealed in the image are the space for a stereo speaker at the base
of the phone and a rear-mounted fingerprint scanner in the middle of the device.
-------------------------------------------
A 1905 postcard: the Copernicus Monument with the Karaś Palace in the background - Duration: 0:58. In our varsaviana series, we present a postcard on a subject
seemingly captured in tens of thousands of identical,
well-worn shots,
that is, a postcard with the Copernicus Monument.
But not all postcards are created equal.
Of course, the presented card is nicely framed, but that's not the point.
In the foreground, the eye lingers on figures from the early twentieth century;
in the middle ground it catches our outstanding scholar;
but what is interesting
sits only in the background,
counting by planes, in the third plane.
Well, in the background behind the monument stands the Karaś Palace, which no longer exists:
after many protests, discussions and adventures, it was finally demolished in 1912.
Thus, the postcard is, however, extraordinary -
Quod erat demonstrandum.
Please visit www.atticus.pl
and browse the varsaviana and old postcards sections.
-------------------------------------------
Live in the D: The stray python in Ferndale finds a home - Duration: 4:49.
-------------------------------------------
Audi A5 Sportback 2.0 TDI quattro Pro Line S Schuifdak, Leer, Navi Xenon Led, DVD, full - Duration: 1:14.
-------------------------------------------
FIRST concert Ever! TAYLOR SWIFT Reputation Tour! - Duration: 4:49. So tonight mommy is going to a concert. I'm really excited, I'm going to Taylor Swift.
I'm gonna get ready while Kenzie plays with her new princess castle. Daddy's
gonna watch the kids while mommy goes.
here's what I'm wearing tonight
Kenzie likes getting ready with me. And what's this, Kenzie? It's a spray, and
it's sparkly. Okay, you're ready? Yeah, thank you.
Shake it up, it's like fairy dust, huh? Yeah, really good. It's called A Thousand
Wishes. You ready for the fairy spray? Close your eyes,
spin around. Now you're all sparkly. Kenzie's helping me pick out some shoes.
Which ones do you think I should wear? She grabbed these ones and said these
ones are the best. They're like sparkly and glittery, spiky.
Okay, so... are they hard to walk in?
So Kenzie, what are you wearing today?
You have a headband, you have these bracelets. Tell me about these bracelets.
Birthday? You know who you remind me of?
All three burgers for Heather? Yes, I got three burgers.
I've got a big appetite. Okay, so we're walking into the concert. It was kind of
a crazy journey here because I'm really bad at driving directions, but we're
here. Let me show you the view.
-------------------------------------------
My SECRET to Shutting Down Haters - Duration: 6:04.
-------------------------------------------
Steve Rogers Meets Natasha Romanoff And Bruce Banner | Marvel's The Avengers (2012) - Duration: 4:21. Stow the captain's gear.
Yes, sir.
Agent Romanoff, Captain Rogers.
- Ma'am. - Hi.
They need you on the bridge.
They're starting the face-trace.
See you there.
It was quite the buzz around here, finding you in the ice.
I thought Coulson was gonna swoon.
Did he ask you to sign his Captain America trading cards yet?
Trading cards?
They're vintage. He's very proud.
Dr Banner.
Yeah, hi.
They told me you would be coming.
Word is, you can find the Cube.
Is that the only word on me?
Only word I care about.
It must be strange for you, all of this.
Well, this is actually kind of familiar.
Gentlemen, you might want to step inside in a minute.
It's going to get a little hard to breathe.
Flight crew, secure the deck.
Is this a submarine?
Really?
They want me in a submerged, pressurised, metal container?
No, no, this is much worse.
Hover power check complete. Position cyclic.
Increase collective to 8.0%.
Preparing for maximum performance takeoff.
Increase output to capacity.
Power plant performing at capacity.
We are clear.
All engines operating.
S.H.I.E.L.D. Emergency Protocol 193.6 in effect.
- We are at level, sir. - Good.
Let's vanish.
Engage retro-reflection panels.
Reflection panels engaged.
Gentlemen.
Doctor, thank you for coming.
Thanks for asking nicely.
So, how long am I staying?
Once we get our hands on the Tesseract,
you're in the wind.
Where are you with that?
We're sweeping every wirelessly accessible
camera on the planet.
Cell phones, laptops...
If it's connected to a satellite, it's eyes and ears for us.
That's still not gonna find them in time.
You have to narrow your field.
How many spectrometers do you have access to?
- How many are there? - Call every lab you know.
Tell them to put the spectrometers on the roof
and calibrate them for gamma rays.
I'll rough out a tracking algorithm, basic cluster recognition.
At least we could rule out a few places.
Do you have somewhere for me to work?
Agent Romanoff,
could you show Dr Banner to his laboratory, please?
You're gonna love it, Doc. We got all the toys.
-------------------------------------------
Do You know? #July31st - Duration: 1:42. Do you know that on July 31st, 2012, Michael Phelps beat Larisa Latynina's record number
of Olympic medals?
Michael Fred Phelps was born on June 30, 1985 in Baltimore, Maryland.
He is an American swimmer and he holds world records in several events.
Phelps won eight medals at the 2004 Summer Olympics in Athens.
Six of those were gold.
These medals made him tie the record for most medals at a single Olympics,
which Alexander Dityatin had held since 1980.
In 2008, Phelps won eight swimming gold medals at the Summer Olympics in Beijing.
This broke Mark Spitz's record for most gold medals in a single Olympics.
Spitz had won seven gold medals at the 1972 Summer Olympics.
At the 2012 Summer Olympics in London, Phelps won another four gold and two silver medals.
At the 2016 Summer Olympics in Rio de Janeiro, Phelps won five gold medals and one silver.
In total he has won 28 Olympic medals, a record.
23 of these are gold medals, more than twice as many as the previous record.
-------------------------------------------
Learning Maths-Graphs - Duration: 0:46.
-------------------------------------------
'Segundo Sol': Valentim confronts Karola after discovering Beto's million-dollar embezzlement - Duration: 7:44.
-------------------------------------------
Awesome First Time Offroad Adventure - Barnwell Mountain - Duration: 3:29. Yee Haw!!
this is kinda scary, agh
(screaming)
oh my gosh
she wants you to do it
NO! NO NO NO!
Oh, that wasn't too bad.
Alright. We made it to Barnwell Mountain.
It's Olivia's first ride really doing off-roading in a RZR.
what did ya think?
FUN
it was fun?
what was your favorite part?
(driving sounds) going up and down
and muddy! and you got muddy
YEAH!!!
It splashed us everywhere.
I know.
I want to do it again. You want to do it again?
Well, I think we're gonna eat lunch.
You mean dinner?
Dinner.
We're gonna eat dinner, then we will go back out.
Ok.
Then we will do that again.
WooHoo!
Can I take this off?
WooHoo.
Yeah, take it off.
Don't look down.
(laughter and screams)
GiGi is gonna get even muddier.
I wonder what they're gonna think about that?
Was that a long ride?
Did you fall asleep?
Yeah, a lot.
It was rough, yeah.
Nobody's gonna let me drive!?
Can I go take a bath?
-------------------------------------------
Tom Hardy's Transformation in the Venom Trailer Will Terrify You - Duration: 1:18.
-------------------------------------------
Audi A3 1.6 TDI 110 PK S-Tronic Sportback Attraction (BNS) - Duration: 1:05.
-------------------------------------------
INCREDIBLE HIDDEN TREASURES IN THE MONTE DEL PIRENÉE ORIENTALE | Babylon Metal Detector - Duration: 12:17. Here it is
ole !!
a medal !!
let's see what this is that sounds so good
everything is stones
what joy ...
here it is
a tin
ancient
Is it copper?
it's like copper ... it's green, you see?
to know what is
okay...
trash
it could be a coin
here ... a coin
yes
it's a "ARDITE"
here we have you
well ... let's go well
beautiful!!
very high signal and it sounds erratic
or an iron horseshoe ...
or a can
I wish it were something good ... but
I have my doubts, here is
it's an iron
thin iron like ...
a knife folded or ...
yes, it has a knife shape
knife blade, but it's undone
so...
we'll bury it again so it ends up rotting
and that's it
I think it's another coin
lets go see it
seal...
it's a lead seal
Seal
It is similar to a coin
let's see it ... to believe it
here it is...
Ole!!
a medal and this ...
another Roman medal
this is the 7 ...
I found another one like this
look how pretty
the 7 mysteries ... or something like that
has the hitch still ...
beautiful, no?
a trash
foil
very big this will be a can
next to these skewers
here is an ox shoe
an ox shoe ... today will be a good day
A day is not good if I don't find a horseshoe
my recipe ... one a day at least
It's a 68
theoretically it is copper
A zipper
modern
I ran out of battery ...
there is a signal ...
similar to copper
it's lead
it's a lead weight from a wheel
the kind used to balance the wheels
pity!!
can
a horseshoe in the sun
to the bag goes
wow !!
It has a decoration
has here like a flower
I was above
It's a piece ...
ah, it looks like a bullet
a bullet fired
look ... a coin!
Ole...Ole!!...a "SEISENO"
It's missing a corner, but ...
You see ... pretty
coin ... pretty
here is ... a bullet
a musket bullet
here ... a coin
Ole! "ARDITE"
ardite in sight
all right!!
it resisted
I already said that it was a good signal
after this ARDITE I found another SEISENO
and this bronze PREMONEDA
THANK YOU FOR WATCHING OUR VIDEOS ... WE WOULD APPRECIATE A LIKE AND A SUBSCRIPTION
-------------------------------------------
Duchess Meghan: Now the family situation is escalating completely! - Duration: 3:52.
-------------------------------------------
DS DS 4 PURETECH 130PK S&S CHIC DAB+/18''/NAVI - Duration: 1:07.
-------------------------------------------
"I'm getting my ex back" - what is the method? - Duration: 12:27.
-------------------------------------------
Live in the D: Uniquely Detroit - Drew & Mike Podcast - Duration: 3:01.
-------------------------------------------
Live in the D: The stray python in Ferndale finds a home - Duration: 4:49.
-------------------------------------------
Live in the D: Shop at People's Records in the Eastern Market District - Duration: 4:33.
-------------------------------------------
Live in the D: JLF Paddle Boards are almost like walking on water - Duration: 4:56.
-------------------------------------------
Trump Voices Skepticism on Sale of 3-D Printed Guns - Duration: 4:36.
-------------------------------------------
A mother killed her children because of a silly mistake while making their breakfast - Duration: 6:23.
-------------------------------------------
Honda CR-V 2.2D ELEGANCE Leather interior, Panoramic roof - Duration: 1:11.
-------------------------------------------
Toyota Avensis Wagon 2.0 D-4D Executive Business Leather+Navigation - Duration: 1:08.
-------------------------------------------
Medina: Ethnic Studies Key to Students' Success - Duration: 2:59. What do we want? Ethnic studies! When do we want it? Now!
What do we want? Ethnic studies! When do we want it? Now!
For people like Jose Lara, ethnic studies represents more than just a class. He's a social studies
teacher in the Los Angeles unified school district. For Jose ethnic studies are the
keys to the hearts and souls of the young people he has worked with.
Ethnic studies helps students: every study that has been done
has shown increased graduation rates, increased academic achievement,
increased attendance rates, and a decreased dropout rate for students who take ethnic studies.
As a school board member in the Los Angeles area, Lara got the LAUSD to make
ethnic studies a requirement for graduation. I was doing really poorly in school, I didn't
really care about school.
Karla Gomez Pelayo, a UC Berkeley graduate in ethnic studies, credits it with giving her direction in life.
Ethnic studies saved me, and I mean that literally and figuratively;
it re-engaged me in my education, it helped me understand my cultural roots and identity.
It is a very empowering experience for students. It is something that students crave, and
it's almost like a light goes on.
It is testimony like this and his personal experience of many
years in the classroom as a teacher that motivated Assemblymember Jose Medina to push
forward legislation to make ethnic studies classes mandatory in California high schools.
We're not talking about something that's just a benefit to certain ethnic groups to Latinos or
African Americans but ethnic studies has demonstrated that it can raise
academic achievement across all groups.
Assembly Bill 2772 would make ethnic studies a high school graduation requirement starting in 2023.
What you do can make a difference - that was a recurring theme in ethnic studies.
Danielle is a student at Hiram Johnson high school in Sacramento enrolled in ethnic studies.
I was born in Sacramento. I also am Native American, Puerto Rican, Portuguese,
White, Mexican.
And..
Yeah and I have family from all over the world basically. So like people hearing my stories
I would hope that they understand what I've been through and what I'm going through and
then kind of make that connection with me.
As people ask about the cost, I would say: what is the cost of not knowing,
you know, what is the cost of having a society that is divided?
Ethnic studies is good for all students even students whose groups are not being studied.
White students do well when they take ethnic studies classes, just as African American students
do well when they take ethnic studies classes. There's no study out there
that's shown so far that ethnic studies actually does harm to students.
Ethnic studies builds empathy for the students, it builds a better community, a better society,
and improves academic grades. It's a winner for everybody all around.
-------------------------------------------
Eduardo Giannetti – The best books for learning psychology - Duration: 1:03. I'm working to overcome this dichotomy between fiction and non-fiction.
I find much more true knowledge
about deep human psychology
in good novels written by Dostoyevsky than in psychology handbooks.
I believe that it contains more knowledge,
especially when it comes to subjectivity.
I believe that creators working with literature
have been much more careful in their work,
in their attempt to make evident what happens
inside our minds, with all its drives, emotions and expectations,
and also to analyze each person's dreams.
And I find this knowledge to be extremely valuable.
I think that it has to be mobilized. And the literary genre doesn't matter that much.
The quality of workmanship, the final result, that's what matters.
-------------------------------------------
Speaker Of The House - Ceasefire (Lyrics) feat. HICARI - Duration: 3:35. Ceasefire
Ceasefire, Ceasefire
Isn't it enough
Ceasefire
Can we just call it a ceasefire, Yeah yeah, ceasefire
Isn't it enough that we hate each other, Other, So
Ceasefire
Ceasefire
Can we just call it a ceasefire, Yeah yeah, ceasefire
Isn't it enough that we hate each other, Other, So
Can we put what we've done behind us, in silence
Can we just call it a ceasefire, Ceasefire
Now, even mama's asking if it's true or false
You're the one behind it all
When you were the one who had got my back, but
It's easy for you to say shit like this
Why you so afraid of talking face to face, When
You had to have the final say
Was it not enough to break
Me into tiny pieces when you walked away
Ceasefire
Ceasefire
Ceasefire
Can we just call it a ceasefire, Yeah yeah, ceasefire
Isn't it enough that we hate each other, Other, So
Can we put what we've done behind us, in silence
Can we just call it a ceasefire, Ceasefire
Since I lost you I got nothing left to lose
Fighting's all we seem to do
When I'm not around to reject, dismiss it
It's easy for you to say shit like this
But something tells me you've been spreading
half the truth, well
They're telling me they heard the news
These people like to stare at me across the room
I'm not gonna lie to you
-------------------------------------------
My SECRET to Shutting Down Haters - Duration: 6:04.
-------------------------------------------
On momentum methods and acceleration in stochastic optimization - Duration: 51:33. >> Yeah. So, I'll be talking about momentum methods
and acceleration in stochastic optimization.
This is joint work with
Prateek Jain back in Microsoft Research in India,
Sham Kakade who's going to be at
the University of Washington and
Aaron Sidford from Stanford University.
Okay. Let me start by
giving a brief overview and context of our work.
As many of us might already be aware of,
optimization is a really crucial component
in large-scale machine learning these days.
Stochastic methods, such as Stochastic Gradient Descent,
are some of the workhorses
that power these optimization routines.
There are three key aspects with which
Stochastic Gradient Descent is used in practice
in these optimization routines
and which are crucial for its good performance.
So, the first one is minibatching where we compute
the Stochastic Gradient not over
a single example but rather over a batch of examples.
The second one is what is known as model averaging,
where you run independent copies
of Stochastic Gradient Descent on
multiple machines and try to somehow combine
the results to get a much
better model than each of them alone.
Finally, there is a technique from
deterministic optimization called
acceleration or momentum,
which is also used on top of Stochastic Gradient methods.
All of these three things seem to be important
for the good performance of
Stochastic Gradient Descent in practice.
However, from an understanding point
of view or a theory point of view,
we still don't have a very good understanding
of what are the effects of these methods
on Stochastic Gradient Descent and do
they really help and if they help,
how are they helping and
all these various aspects with which these are used.
Our work here is
a thorough investigation of all of these three aspects,
for the specific case of stochastic linear regression.
The reason that we consider
stochastic linear regression is because
the very special structure of the problem lets
us gain a very fine understanding
of each of these three aspects.
At the same time,
the results and intuitions
that we gained from this seem to have
some relevance to even more complicated problems,
such as training deep neural networks,
and I'll touch upon that topic
towards the end of my talk.
In this talk in particular,
I'll be talking only about the last aspect,
which is acceleration and
momentum algorithms on top
of Stochastic Gradient Descent.
So, this is basically the outline
or the high level subject of my talk.
Let me now start by giving
a brief introduction to deterministic optimization.
A lot of you might already know this but I
thought if I just spend a few minutes on this,
it might make the rest of the talk
easier to follow.
So, bear with me if you already know a lot of this stuff.
So, gradient descent is of course one of
the most fundamental optimization algorithms
and given a function f,
we start with a point w0 and at every iteration,
we move in the negative gradient direction.
As I mentioned, the problem that we'll be
considering in the stop is that of linear regression.
Here, we are given a matrix X and
a vector Y and we wish to find the vector w,
which minimizes the square of X transpose w minus Y.
This is the linear regression problem
and as you can imagine,
this is a very basic problem and
arises in several applications
and people have done a lot of
work in understanding how to solve this problem well.
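To make this concrete, here is a minimal sketch of plain gradient descent on this least-squares objective, writing it as f(w) = ||X^T w - y||^2 with X a d-by-n matrix whose columns are the data points. The step size and iteration count below are illustrative placeholders rather than anything taken from the talk.

```python
import numpy as np

def gradient_descent(X, y, steps=1000):
    """Minimize f(w) = ||X.T @ w - y||^2 by plain gradient descent.

    X: d x n matrix whose columns are data points; y: length-n targets.
    """
    d, n = X.shape
    H = X @ X.T                        # the matrix X X^T from the talk
    L = 2 * np.linalg.eigvalsh(H)[-1]  # smoothness constant of f
    eta = 1.0 / L                      # a standard safe step size
    w = np.zeros(d)                    # start at w0 = 0
    for _ in range(steps):
        grad = 2 * X @ (X.T @ w - y)   # gradient of ||X^T w - y||^2
        w = w - eta * grad             # move in the negative gradient direction
    return w
```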
In particular, the first question that comes to our mind
is what is the performance of
gradient descent when you
apply it to the linear regression problem?
In order to answer this question quantitatively,
we need to introduce this notion of condition number,
which is just the ratio of the largest to
smallest eigenvalues of the matrix X, X transpose.
With this notation, it turns out
that gradient descent finds an epsilon suboptimal point.
What I mean by that is that f of
w minus the optimum is at most epsilon.
So, gradient descent finds
an epsilon suboptimal point in about
condition number times log of
initial suboptimality over target
suboptimality iterations.
So, this is the number of
iterations that gradient descent
takes to find an epsilon suboptimal point.
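Written out (my reading of the rate stated verbally above), the iteration count is roughly

```latex
T_{\mathrm{GD}} \;\approx\; \kappa \,\log\!\frac{f(w_0)-f(w^*)}{\epsilon},
\qquad
\kappa \;=\; \frac{\lambda_{\max}(XX^{\top})}{\lambda_{\min}(XX^{\top})}.
```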
The next question that comes to mind
is whether it's possible to do any better
than this gradient descent rate
and there is hope because gradient descent,
even though it has seen gradients
of all the past iterates,
it only uses the current
gradient to make the next update.
So, the hope would be if there is
a more intelligent way to
utilize all the past gradients that we
have seen to make a better step then
maybe we can get a better rate and this intuition
turns out to be true and there are
several famous algorithms which
actually achieve this kind of improvement.
Some of the famous examples include
conjugate gradient and heavy ball method,
as well as Nesterov's celebrated
accelerated gradient method.
The last two methods in particular are
also known as momentum methods,
and we'll see why they're called momentum methods.
So, as a representative example,
let's look at what Nesterov's
accelerated gradient descent looks like.
While gradient descent had just one iterate,
Nesterov's accelerated gradient can
be thought of as having two different iterates.
We denote them with Wt and Vt and there are
two kinds of steps: gradient steps and momentum steps.
So, from Vt, we take a gradient step to
get Wt plus 1 and from Wt plus 1,
we take a momentum step to get Vt plus 1.
Then we again take a gradient step.
So, it alternates between
these gradient and momentum steps.
This is Nesterov's accelerated gradient algorithm
with appropriate parameters of course.
This is how it looks in the equation form.
The exact form is not important
but one thing that I want to point out here is
that the amount of time taken for a single iteration of
Nesterov's accelerated gradient is pretty much the
same as that taken by one iteration of gradient descent.
So, in terms of time per iteration,
it's exactly the same up to constants.
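As a schematic illustration of the alternating gradient and momentum steps just described, here is one standard two-iterate form of Nesterov's method for this quadratic. The parameter choices below (a 1/L step size and a (sqrt(kappa)-1)/(sqrt(kappa)+1) momentum coefficient) are textbook defaults, not necessarily the exact constants on the speaker's slide.

```python
import numpy as np

def nesterov_agd(X, y, steps=1000):
    """Schematic Nesterov accelerated gradient for f(w) = ||X.T @ w - y||^2.

    Assumes X @ X.T is full rank so the strong convexity constant is positive.
    """
    d, _ = X.shape
    eigs = np.linalg.eigvalsh(X @ X.T)
    L, mu = 2 * eigs[-1], 2 * eigs[0]                     # smoothness / strong convexity
    beta = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)  # momentum coefficient
    w = np.zeros(d)
    v = np.zeros(d)
    for _ in range(steps):
        grad_v = 2 * X @ (X.T @ v - y)
        w_next = v - grad_v / L             # gradient step from v_t to w_{t+1}
        v = w_next + beta * (w_next - w)    # momentum step from w_{t+1} to v_{t+1}
        w = w_next
    return w
```

Each iteration costs one gradient evaluation plus a few vector operations, which is the "same cost as gradient descent up to constants" point made above.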
However, the convergence rate
of Nesterov's accelerated gradient turns out to
be square root of condition number times log
of initial over target suboptimality.
So, if you compare it to the rate of gradient descent,
we see that we get an improvement of
square root of condition number.
Condition number is defined as a max
over min so it's always greater than or equal to 1.
So, this is always
an improvement to our gradient descent.
This is actually not just a theoretical improvement.
In fact, even on moderately conditioned problems,
here we're looking at
a linear regression problem where
the condition number is about 100.
You see that Nesterov's accelerated gradient, which is in red,
is an order of magnitude
faster than gradient descent which is in the blue.
So, this actually really has
practical advantages in getting
this better convergence rate.
So, this is all I wanted to
say about deterministic optimization.
If we come to the kind of
optimization problems that we
encounter in machine learning,
we usually have a bunch of training data X1,
Y1 up to Xn, Yn.
Let's say we get all of this data
from some underlying distribution on Rd cross R. So,
d is the underlying dimension and
we get this training data.
We can use whatever we saw so far to minimize
the training loss, which is just 1 over
n times the sum over i of (Xi transpose w minus yi) squared.
So, the f hat of W that we have
here is the training loss and we
could use either gradient descent or
Nesterov's accelerated gradient
to solve the training loss.
But in machine learning,
we're not really interested in optimizing
the training loss per se,
what we are really interested in is
optimizing the test loss or test error,
which means that if you are given
a new data point from the underlying distribution,
we want to minimize X transpose w minus y whole square,
where X and Y is sampled
uniformly at random from
the same underlying distribution.
In order to optimize this function however,
we cannot directly use the gradient or
accelerated gradient method from before because
the gradients here can be written as
expectations and we don't really
have access to exact expectations.
All we have access to is
these samples that we have seen from the distribution.
This setting has also been well-studied
and goes by the name of stochastic approximation.
In a seminal paper, Robbins and
Monro introduced what is known
as Stochastic Gradient algorithm to
solve these kinds of problems.
The main idea is extremely simple,
which is that in any algorithm that you take,
gradient descent for instance,
wherever you are using gradient for the update,
you replace it with a Stochastic Gradient.
Then the Stochastic Gradient is
calculated on just using a single point.
In expectation, because the Stochastic Gradient
is equal to the gradient,
we're in expectation doing gradient descent and
we would hope that this also
has good convergence properties.
Moreover, if you look at the per-iteration cost
of Stochastic Gradient Descent, it's extremely small.
So, the stochastic gradient computation just
requires us to compute the term in red
for the linear regression problem,
which requires us to just take
one look at the data point and that's about it.
So, making each of these Stochastic Gradient updates is
extremely fast and because of this efficiency,
it's widely used in practice.
While I give you this example on top of
gradient descent and how to
convert it into Stochastic Gradient Descent,
you could apply the same framework
to any deterministic optimization algorithm.
If you take heavy ball, you
will get a stochastic heavy ball.
If you take Nesterov's accelerated gradient,
you will get stochastic Nesterov accelerated gradient.
The recipe is just simple.
You will have an updated equation,
just replace gradients to Stochastic Gradients
and you get these stochastic methods.
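Following that recipe, a minimal streaming version for least squares might look like the sketch below; `sample` is a hypothetical callback returning one fresh (x, y) pair, and the constant step size is an illustrative placeholder. Setting momentum to zero gives plain SGD, while a positive value gives a stochastic heavy-ball variant; a stochastic Nesterov variant is obtained the same way from the accelerated update equations.

```python
import numpy as np

def streaming_sgd(sample, d, steps=100_000, eta=0.01, momentum=0.0):
    """One-sample SGD (optionally with heavy-ball momentum) for least squares."""
    w = np.zeros(d)
    v = np.zeros(d)                    # velocity buffer, used only if momentum > 0
    for _ in range(steps):
        x, y = sample()                # one fresh data point from the distribution
        g = 2 * x * (x @ w - y)        # stochastic gradient of (x^T w - y)^2
        v = momentum * v - eta * g
        w = w + v                      # with momentum = 0 this is w - eta * g
    return w
```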
Okay. So, now let's again come back to
Stochastic Gradient Descent and
try to understand the convergence rate of
Stochastic Gradient Descent in
the stochastic approximation setting.
For illustration, if you just
consider the noiseless setting
where y is exactly equal to x transpose w star.
So, there is some underlying truth w star and
our observations y are
exactly equal to x transpose w star.
Then the convergence rate of
Stochastic Gradient Descent again turns out
to be very similar to that of
gradient descent, which is condition number
times log of initial over target suboptimality.
But the definition of condition number here is different
compared to what we had in the gradient descent case.
But with this new definition,
the convergence rate looks essentially the same.
If we consider the noisy case where y is equal to
x transpose w star plus zero-mean additive noise,
then there will be an extra term in
the convergence rate which depends
on the variance of the noise,
sigma squared, which we denote here by sigma squared.
So, let me try to summarize
all of the discussion so far in a table.
>> Theta is in the data?
>> What is theta?
>> Yes. It has some additional log factors
in whatever is inside the brackets.
So, summarizing all our discussion so far,
in the deterministic setting,
we saw that gradient descent has
the convergence rate of condition number times log
of initial over target suboptimality,
so it depends linearly on the condition number.
We saw that there are
these acceleration techniques which can
improve by a factor of square root condition number, right?
And in the stochastic case,
Stochastic Gradient Descent has
again this kind of convergence rate,
which is the sum of the noiseless part and then
one term that comes from the noise,
and the broad question that we ask here is,
whether accelerating Stochastic Gradient Descent
in the stochastic setting is possible,
just like what we were
able to do in a deterministic setting.
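As a rough reconstruction of the summary table the speaker refers to (up to constants and logarithmic factors, with 1/epsilon standing for initial over target suboptimality):

```latex
\begin{aligned}
\text{Gradient descent (deterministic):}\quad & \kappa \,\log\tfrac{1}{\epsilon} \\
\text{Accelerated gradient (deterministic):}\quad & \sqrt{\kappa}\,\log\tfrac{1}{\epsilon} \\
\text{SGD (noiseless):}\quad & \kappa \,\log\tfrac{1}{\epsilon} \\
\text{SGD (with noise):}\quad & \kappa \,\log\tfrac{1}{\epsilon} + \frac{\sigma^2 d}{\epsilon}
\end{aligned}
```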
I need to clarify the question a little bit
more because it's well known that
the second term that we have here which is sigma square D
over epsilon is actually statistically optimal.
So there is no way any algorithm
can improve upon the statistically optimal rate.
There are information-theoretic lower bounds;
that's the best you can do.
So when we talk about accelerating
Stochastic Gradient Descent what we mean is,
can we improve this first term where,
for instance, we have a linear dependence on kappa?
Can we perhaps improve it to
a square root kappa, for instance?
Okay, yeah.
>>But here I could just use the direct methods
to jump directly to the solutions,
you mean optimal in the class of gradient first order?
>>Yeah, yeah optimal in the class of first-order methods.
>> So, what about second-order methods?
>> Second-order methods- even within
first-order methods if you're not
interested in a streaming algorithm,
you could just take the entire co-variance matrix
and try to invert it again
using maybe a first-order method.
But we are looking at streaming
algorithms which are actually
what we used in practice. Yeah.
>> You could also use conjugate gradient [inaudible].
>> Again, you can use that on
the empirical loss for
the streaming problem conjugate
gradient methods I don't know,
they have not been studied for
the streaming kind of thing.
For the empirical loss we could do
either second-order methods or conjugate gradient,
any of the things, yeah, okay?
So, before I try to answer this question,
let me try to convince you why this question is worth
studying and the reason is
that in the deterministic case firstly,
we saw that acceleration can actually
give orders of magnitude improvement
over an accelerated methods and there is
reason to hope that the same thing
might be true even in the stochastic case.
In fact, even though we don't really know of
an accelerated method in the stochastic setting,
in practice people do use the stochastic heavy ball and
stochastic nesterov that I just
mentioned in training deep neural networks.
So much so that in fact if you look at any of
the standard deep learning packages
like PyTorch, or TensorFlow,
or CNTK, we'll see that
SGD actually means SGD with momentum.
So, there is this default parameter of momentum and
as it is usually run with
this additional momentum parameter.
Even though we don't really understand what
exactly it's doing and how it's helping,
if at all it's helping and so on, okay?
So, in the practical context
people are using these things but at
the same time we don't really
have a full understanding of
whether this even makes sense in the stochastic setting.
Finally, in terms of related work on
understanding acceleration in the stochastic setting,
there are some works by Ghadimi and Lan in
particular which try to
understand whether acceleration is
possible in the stochastic setting.
But they consider a different model compared to
what I introduce here as the machine learning model.
At a high level what it means is that,
they assume additive bounded noise,
rather than sampling one element at a time or
minibatches at a time, and these
are completely different models.
I'll be happy to go over it in
more detail if you're interested offline about this,
about the differences in the settings here.
So, let me now jump to an outline of our results
and I'll present it as the questions that we
asked and what we get as an answer.
So, the first question that we asked
is whether it's possible to improve
this linear dependence on the condition number
to a square root condition number,
for all problems in the stochastic setting.
It turns out that the answer to this is no.
There are explicit cases where
it's not possible information theoretically,
to do any better than the condition number.
So, the second question would then be
whether improvement is ever possible.
Let's say for some easier problems,
maybe improvement is possible and the answer to
this is actually subtle
and it turns out that it's perhaps possible,
but it has to depend on other problem parameters,
it cannot just depend on the condition number.
Then the third question that we asked is whether
existing algorithms like stochastic heavy ball
and stochastic Nesterov's accelerated gradient.
These are actually being used in practice,
do they achieve this improvement whenever it is
possible on problems where it is possible?
The surprising answer to this is that they do not
and in fact there are
problem instances where they are no better than SGD,
even though improvement might be possible.
Finally, the question we ask
is whether we can design an algorithm which
improves over Stochastic Gradient Descent
whenever it is possible?
And we do design such an algorithm which we call
accelerated Stochastic Gradient Descent, or ASGD for short.
So, let me try to go
a bit more in detail into each of these things,
are there any questions at this point?
So, let's first try to
answer the first question whether acceleration is always
possible in the stochastic setting
and we'll try to do this with examples and
the example that we consider will be
the noiseless case where Y is exactly equal to
X transpose W star for some vector W star.
The question we are asking is
whether we can improve the convergence rate of
Stochastic Gradient Descent from a linear dependence on
condition number to maybe
a square root on the condition number?
For this, let's consider a discrete distribution.
So, here we said Y is equal to X
transpose W star and W star is some fixed vector,
so all I need to specify is
the distribution of X, and here I'm just
specifying the distribution of X. So let's say
X is a two-dimensional vector:
(1, 0) with very high probability
0.9999 and (0, 1) with very low probability 0.0001.
In this case, an easy computation tells
us that the condition number is one
over the minimum probability which is ten to the four.
The question we are asking is if you can get
a convergence rate that looks
like square root condition number,
or equivalently, if you can start decreasing
the initial error using
about square root condition number samples,
or about 100 samples, right?
In this case, it turns out that
it's actually not possible and this is
not too difficult to see because if we
sample a bunch of samples from this distribution,
unless we see about condition number many samples,
we will not see the second direction,
which has probability ten to the minus four.
With fewer than ten to the four samples,
then we don't see anything in this zero one direction,
any X in the zero one direction.
So, there is no way we can
estimate the second coordinate of
W star without looking at the zero one vector.
So, with fewer than condition number of samples,
we cannot actually estimate anything about that second coordinate of W star.
So, acceleration is not possible in this setting.
So, this answers our first question in the negative,
that there are some distributions where
acceleration may just not be possible,
information-theoretically. Yeah.
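To make the counting argument concrete, here is a small simulation of the discrete example (my own sketch, not from the slides): with far fewer than kappa = 10^4 samples, the rare (0, 1) direction is typically never observed, so nothing about the corresponding coordinate of w star can be learned.

```python
import numpy as np

rng = np.random.default_rng(0)
p_rare = 1e-4               # probability of drawing the (0, 1) direction
kappa = 1 / p_rare          # condition number of this discrete distribution

for n in [100, 1_000, 10_000, 100_000]:
    trials = 200
    # fraction of size-n samples that contain the rare direction at least once
    hits = sum((rng.random(n) < p_rare).any() for _ in range(trials))
    print(f"n = {n:6d} (n/kappa = {n / kappa:5.2f}): "
          f"rare direction seen in {hits / trials:.0%} of trials")
```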
>> So, your second coordinate of W
star is one over kappa anyway.
>> Yeah. So, it depends on-
>> [inaudible] less than one over kappa?
>> Yeah.
>> You can always estimate it with one sample?
>> Yeah, yeah. So, it
depends on what is your noise level.
So, if you are above noise level,
then you don't care about that coordinate.
If you're below that noise level when you do want to get
that right, you care about it.
>> It's usually [inaudible] epsilon level on your desired accuracy.
>> Yeah. So, it does depend on epsilon.
However, when I'm talking about the rates here,
this rate- if I give you a certain rate,
either you have to write it in terms of,
till this error I get this rate,
till this error I get a different rate.
Or if you're trying to write a universal rate,
you do want to understand what is
the real behavior of
the algorithm, yeah that's all I'm saying.
>> Dependence on epsilon is not one
[inaudible] which is not of dependency here.
>> Can you, can I-
>> You do have dependency on epsilon in your formula?
>> Yeah.
>> [inaudible] epsilon.
>> Yeah, yeah.
If you're allowed to depend on epsilon as well,
then the dependency will be something
good till you want to get one over kappa accuracy
and something different after
you- if you want to go below
on all kappa. Yeah.
>> Did you say the second dimension
matters a lot less than the first dimension?
>> Yeah.
>> Because it's still rare to ever see that direction.
>> Yeah.
>> But why isn't observing the first dimension enough to
get very high or a very low [inaudible].
>> Yeah. So, before I was saying so,
if you only care about error
which does not carry you- okay,
so one thing you can think of is you
could be so far away from
your optimal in the second direction
that even with the low probability when you
take an expectation, it could actually matter.
>> Yes. If the loss is super large.
>> If the loss is super large.
But the point I'm trying to make here is that,
if you only care about a certain level of accuracy,
then you don't care about the second coordinate.
If you really care about the entire spectrum of rate,
then you do care about the second one as well
because eventually, you want to get everything right.
This example seems cooked up,
but the point of this example is to illustrate the issues
that might arise when you think of acceleration
when you go from
a deterministic setting to a stochastic setting.
So when I go to maybe empirical evaluations,
things may be much clearer.
But at this point, this example is more
to illustrate what are the things that
would just be completely different between
deterministic and stochastic settings. Yeah.
>> If for example the loss is
bounded, as in classification-
>>Yeah.
>> -then this will not be an issue.
>> So, this may not be an issue. I mean
you could also think of for
instance where one dimension is actually super large.
But there are a lot of other dimensions
which together matters a lot.
But then you're still restricted in
your step size because of the single larger direction,
where this might still be an issue, right.
Okay, so this basically
demonstrates that you cannot
expect this acceleration phenomenon
to always magically work out in
the stochastic case and
the second question then we would ask is
whether it's ever possible for
this acceleration phenomenon to still
help you in the stochastic setting right.
And for this we again take a two-dimensional example
where the vectors come from a Gaussian distribution
and the covariance matrix is diagonal
with 0.999990 and 0.0001
same values we had earlier and even in this case
the condition number turns out to be
about 10 to the fourth, Right.
However because this is a Gaussian distribution,
if you just observe two samples,
with probability one they are
going to be linearly independent.
Right? And once you have
linearly independent samples you can just
invert them to exactly find W star.
So no matter how large the condition number is,
just after two samples,
you are able to exactly find W star.
And in this setting, this suggests that
acceleration might be possible,
at least from an information theoretic point of view
or statistical point of view.
So, for the second question we see
that it might be possible but it
has to depend on something else
and not just the condition number.
Okay. So, what is
this quantity that actually
distinguishes the previous two cases
of the discrete and Gaussian distributions?
And the answer is basically the number of
samples required for the empirical covariance matrix,
which is 1 over n times the sum of x i x i
transpose, to be close to the actual covariance matrix H.
So as long as
our empirical covariance matrix is close to
the true covariance matrix we are okay and
the important quantity here
is what is the number of samples that
you need for this to be the case.
For scalars, this is very well
understood and it's just the
variance of the random variables
that you are considering and for
matrices there is a similar notion called matrix
variance, which in this context
we denote by kappa tilde and
call it Statistical condition number,
this is the quantity which determines how
quickly the empirical covariance matrix
converges to the true covariance matrix right.
And this quantity is what really
distinguishes the previous two cases.
So how does this actually relate
to the actual condition number?
So if we recall
the computational condition number or
just the condition number which is here on the right,
is just the, let's say
the maximum norm of x squared over the support.
We can relax this but for simplicity let's
say the maximum norm squared over the support of
the distribution divided by
the smallest eigenvalue of
the covariance matrix so
that's the definition of condition number.
Whereas the statistical condition number is here on
the left for an appropriate orthonormal basis
e this is basically
a weighted average of the mass on the direction e divided
by the expected value of e transpose x i whole square.
So the expressions may look a little
complicated but once you have these
it's actually not too difficult to see
that the statistical condition
number is always less than or
equal to condition number because on the left you have
some weighted average whereas on
the right you have a minimum in the denominator.
That's basically the intuition
and this can be formalized and acceleration
might be possible whenever
the statistical condition number
is much smaller than condition number.
Okay? Because statistical condition number
we're thinking of it as an
information theoretic lower bound.
So, if kappa tilde is pretty much the same as kappa.
Then there is no acceleration to be had.
Whereas if kappa tilde is actually
much smaller than kappa there may be
scope for accelerating the rate
of Stochastic Gradient Descent.
>> What's the sum of E.
>>Yeah so, this was
a little complicated so I tried to simplify it.
So E is basically
the eigen vectors of the covariance matrix,
of the covariance matrix H, population
covariance matrix which.
>> So there is at most D.
>>Yeah this is this is like
D Gaussian random variables
here whereas there could be arbitrarily large.
Okay so, If you compare what
happens for the discrete and Gaussian cases.
So for the discrete case it turns out that
both statistical and computational condition number
are about 10 to the four which
are essentially the same and so that's why
we see that acceleration is not possible in that setting.
Whereas in a Gaussian setting, with some constants,
so the statistical condition number is of the order of
about 100 whereas the actual condition number
is about 10 to the four.
So there is a significant difference between
the statistical and computational condition numbers
and this was the reason
why in the discrete case acceleration
is not possible, whereas in
the Gaussian case there is still
hope that acceleration might be possible.
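One way to see the gap operationally (a rough numerical sketch based on the "empirical covariance close to H" criterion above, not the paper's formal definition) is to check how fast (1/n) times the sum of x_i x_i transpose approaches H, measured after whitening by H, for the two example distributions: the Gaussian concentrates after a few hundred samples, while the discrete one needs on the order of kappa samples.

```python
import numpy as np

rng = np.random.default_rng(0)
H = np.diag([0.9999, 0.0001])           # true covariance in both examples

def sample_discrete(n):
    rare = rng.random(n) < 1e-4         # (0, 1) w.p. 1e-4, else (1, 0)
    X = np.zeros((n, 2))
    X[~rare, 0] = 1.0
    X[rare, 1] = 1.0
    return X

def sample_gaussian(n):
    return rng.normal(size=(n, 2)) * np.sqrt(np.diag(H))

def whitened_error(X):
    emp = X.T @ X / len(X)              # empirical covariance
    W = np.diag(1.0 / np.sqrt(np.diag(H)))
    return np.linalg.norm(W @ (emp - H) @ W, 2)

for name, sampler in [("discrete", sample_discrete), ("gaussian", sample_gaussian)]:
    for n in [100, 1_000, 10_000, 100_000]:
        print(name, n, round(whitened_error(sampler(n)), 3))
```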
Okay and if we plot this,
plot the performance of stochastic gradient descent for both of
these distributions so the left
is discrete and the right one is Gaussian.
The green curve corresponds to the error of
Stochastic Gradient Descent over
iterations, and we see
that the point where the error starts to
decay is pretty much the same for both of
them and it's about the condition number.
So after the condition number of samples it
really starts to decay geometrically.
Whereas the statistical condition number in
both these cases is very different
for the discrete distribution there is
not much difference between kappa tilde and kappa,
whereas for the Gaussian distribution there is a lot of
difference between the statistical condition number
which is the information theoretic lower bound to
the place where Stochastic gradient descent
actually starts doing very well.
Right so there is a large gap between the pink line and
where the green line starts to
do very well for the Gaussian distribution.
Okay. So this next question would then be whether
existing stochastic algorithms such as
stochastic heavy ball or
stochastic Nesterov accelerated gradient,
do they achieve any improvement
in settings like this Gaussian setting
that we saw before where there seems to
be scope for acceleration,
and the surprising answer to this is
that actually we construct
explicit distributions and fairly natural distributions.
Not cooked up distributions where there is a large gap
between statistical and actual condition number
but the rate of heavy ball,
the stochastic heavy ball method is no better
than that of Stochastic Gradient Descent.
So in the stochastic setting,
even though there is scope for
improvement on certain problems,
stochastic heavy ball does not achieve this improvement.
We can show this rigorously only for
the heavy ball method but
the same intuition seems to be true,
the same result seems to hold empirically even
for stochastic Nesterov accelerated gradient as well.
So what I mean by that is that if you look at,
so this is the lower bound,
the green
one is Stochastic Gradient Descent,
I would hope that an accelerated method can do much
better than Stochastic Gradient Descent but it turns out
that the stochastic heavy ball
which is in blue and stochastic Nesterov which is in black,
they basically are on top of each other
and they don't do much better
than Stochastic Gradient Descent.
So, these momentum techniques do not really help in
the stochastic setting here. Yeah?
>> So if you did this deterministically
we would expect to have the
dip along this purple line, or?
>>Yeah. So you can use any of the offline methods right,
then using these number of samples is
sufficient to start decaying the errors. Yeah.
>> Is this with or without?
>> This is without
noise, but we have
similar plots with noise, but then there
is an error floor which is like 1 over n.
Yeah, so this is without noise.
>>So the dip should happen at two samples?
>>Okay. Yeah, so I
haven't plotted the direct methods here.
>>Yeah.
>>Direct methods, these two things are decoupled, right?
So there is how many samples do you take to construct
your empirical loss function and how
many iterations of your direct method that you run.
So, there are like both of these are
decoupled for direct methods.
>> Just to check, the answer for the last question is,
if you did a batch method,
direct method or whatever, on two samples, will it work?
>>Yeah. On two samples for the Gaussian distribution
it should just work. Yeah, exactly.
But number of iterations it will
take might still be larger.
Because the empirical covariance matrix
might still be very ill conditioned.
>> [inaudible]
>>Then if you completely
if you use a matrix inversion then it will be this quick.
If you again use a first-order method to solve
empirical loss function that could
take some while because
the empirical function is not very well conditioned.
Yeah, that's all I meant. Okay? So the point
of this plot is that
the stochastic momentum methods
in the stochastic setting do not really
provide the improvement that we expect them to
provide based on
our intuitions from the deterministic setting.
Okay? So finally, so the question
is whether we can actually design
an algorithm which actually gives these kinds of
improvements in the stochastic setting and
whenever such improvements are possible.
Right? And the answer to
this actually turns out to be yes and we design
an accelerated Stochastic Gradient Descent method.
We get the following convergence rate.
So, the second term which is sigma squared
d over epsilon is the part that comes due to noise,
and the first part was what
we were looking to improve for
compared to Stochastic Gradient Descent
which has a linear dependence on kappa,
our method has a dependence on
square root of kappa, kappa tilde.
And we already saw that
this kappa tilde is always less than equal to kappa.
So, this result is
always better than the condition number.
Okay? And that could be much better
when there is a large gap between kappa tilde
and kappa.
So, what does this mean in terms of plot?
Note that the X-axis is in the log scale.
So, the green one is Stochastic Gradient Descent,
the blue and black which overlap are
the Stochastic Momentum Methods,
and the red curve is our algorithm.
At least in this case, we see that there's
about an order of magnitude
improvement in the performance
of our method compared to that of
Stochastic Gradient Descent or even
the Stochastic Momentum Methods.
We should also note
that we are still far away from
the lower bound which is about 100 here,
and we believe that while
information theoretically 100 samples might be sufficient
for computational methods or streaming
computational methods which are
based on these first-order information,
we believe that it may not be possible to
do better than whatever our result is,
but we don't have any proof of the statement.
This is just a conjecture at this point.
>> Is this still the 2D Gaussian example?
>> It's actually a 100-dimensional Gaussian example.
>> So coming back to your question
two samples not enough, right?
>> Yeah, in this case. So that's why
I put it at 100. Yeah.
>> [inaudible] is 100, right?
>> 100, yeah.
>> That's it.
>> This is the same graph as before so previously-
>> Yeah, all the graphs are the same thing. Yeah, yeah.
>> [inaudible] so in your methods,
you showed that the convergence rate is actually
a square root Kappa Tilde.
>> Yeah.
>> So that doesn't correspond to
the lower bounds that you would achieve.
>> Yeah. So there's still a gap
between the Kappa Tilde here.
Both theoretically and empirically,
we do see that there is this gap.
We don't know if it is possible
to achieve the lower bound
with a streaming algorithm like the one that we're using.
You can definitely achieve that by
using a batch kind of method,
but with the streaming method, it's not clear.
>> Sorry, my [inaudible].
What is the value of square root Kappa Kappa Tilde here?
Is it [inaudible]?
>> Thank you. About 1,000.
>> It's 1,000?
>> Yeah. So, let
me now introduce briefly the algorithm that we use.
Unfortunately, we ourselves don't have
too much intuition about what exactly it is,
but at a high-level, what I want to point out
is that in the Deterministic Setting,
you could write Nesterov's Accelerated Gradient
in several different ways.
The most popular one is
a Two-Iterate Version which has momentum interpretation,
and there are also
various other interpretations of
understanding what Nesterov's acceleration is doing.
There is another interpretation
which uses, say, 4 iterates and has the
interpretation of simultaneously optimizing upper bounds
and lower bounds for the function
that you are trying to minimize.
While all of these are
equivalent in the Deterministic Setting,
meaning that when you use exact gradients,
they actually turn out to be quite different if you
use Stochastic Gradients
instead of Deterministic Gradients.
The algorithm that we analyze here is
basically Nesterov's Accelerated Gradient
in the four iterate Version,
and we again basically replace
the gradient with Stochastic Gradients.
In our paper, we explain this more
as how this relates
to a weighted average
of all the past gradients and so on,
and I'll be happy to talk more about it
offline if you're more interested in
knowing what exactly the algorithm is doing.
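For readers who want something concrete to look at, here is a very rough skeleton of a multi-iterate, Nesterov-style update driven by one-sample stochastic gradients, in the spirit of the description above. The coupling constants alpha, beta, gamma, delta are placeholders: the paper chooses them as specific functions of kappa and kappa tilde, so this sketch should not be read as the authors' exact algorithm.

```python
import numpy as np

def accelerated_sgd_sketch(sample, d, steps, alpha, beta, gamma, delta):
    """Skeleton of a coupled-iterate stochastic accelerated method.

    sample() returns one fresh (x, y) pair; alpha, beta, gamma, delta are
    placeholder coupling constants, not the tuned values from the paper.
    """
    w = np.zeros(d)
    v = np.zeros(d)
    for _ in range(steps):
        u = alpha * w + (1 - alpha) * v              # couple the two running iterates
        x, y = sample()
        g = 2 * x * (x @ u - y)                      # stochastic gradient at u
        w = u - delta * g                            # short (gradient-descent-like) step
        v = beta * u + (1 - beta) * v - gamma * g    # long (momentum-like) step
    return w
```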
So for the next couple of slides,
let me try to give a high-level proof overview of
the result that we have here. Yeah?
>> [inaudible] computationally, is it comparable to-
>> Yeah, up to constant, it's the same.
Instead of two rows it'll be four rows.
Yeah. Okay. So, if
you recall our result- this is our result, right?
So there are two parts to the result,
one is the rate because of the noiseless part,
the other one is due to the noise term.
The proof also naturally decomposes into two parts,
one to bond the first term,
and the other one to bond the second term.
For the first term, we can just assume that we have
a noiseless setting and just analyze
what is the convergence rate in the noiseless setting.
For the second term, you assume that you
start at the right point and just understand
how this noise drives the algorithmic process.
So, these are the two different components of the proof.
The first part actually is fairly
straightforward once we figure
out the right potential function to use.
The main innovation that we had to do here was
that for the standard Nesterov's Accelerated Gradient,
there is one potential function that's actually used,
and then we realize that there is actually
a family of potential functions that you could use,
all of them work for the
Deterministic setting but only one
of them seems more suitable for the Stochastic setting.
So, if we shift the potential function
by the H inverse norm,
it turns out that this is
the right potential function and
analysis is actually fairly standard,
in that it follows the kind of analysis that
Deterministic Nesterov's Accelerated Gradient follows.
And then you get
this Geometric Decay Rate of one
minus one over square root Kappa Kappa Tilde,
and this is where the Geometric Decay of
square root Kappa Kappa Tilde comes from.
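To make the difference in decay rates concrete, here is a tiny Python calculation comparing how many iterations a geometric rate of the form (1 - 1/D)^t needs to reach a target accuracy, with D equal to kappa versus D equal to sqrt(kappa * kappa tilde). The kappa and kappa tilde values below are purely hypothetical, and treating 1 - 1/kappa as the unaccelerated baseline rate is an assumption made only for this illustration.

    import math

    def iters_to_reach(D, eps=1e-6):
        # Iterations needed for a geometric decay (1 - 1/D)^t to fall below eps.
        return math.ceil(math.log(eps) / math.log(1.0 - 1.0 / D))

    kappa = 1e4        # hypothetical condition number
    kappa_tilde = 1e2  # hypothetical statistical condition number
    print("unaccelerated, D = kappa:              ", iters_to_reach(kappa))
    print("accelerated,   D = sqrt(kappa*kappa~): ", iters_to_reach(math.sqrt(kappa * kappa_tilde)))

With these hypothetical numbers the accelerated rate needs roughly sqrt(kappa / kappa_tilde) = 10 times fewer iterations, which is the kind of gap the noiseless part of the bound is capturing.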
For the second term,
it turns out that such a simple analysis
actually does not work out,
in the sense that if
we try to use any potential function,
it seems to blow up our results by a factor
of dimension or a factor of condition number.
So, we had to do an extremely tight analysis of how
the Stochastic process evolves
when it's driven by the noise.
So, you can think of the algorithm as a process,
something that describes a process,
and noise as something which
comes in as an input at every time-step,
and you have to understand how
the algorithm behaves under this noise input.
Basically, if you think of two iterates of the algorithm,
which is say, WT and UT,
then we write Theta T to be
the covariance matrix of these iterates,
and then we can write a linear system
that describes the evolution
of these covariance matrices,
Theta T. If you do all of this,
you get this inverse formula where B is an operator
which takes positive semi-definite matrices
to positive semi-definite matrices.
And you want to understand what happens to
(I minus B) inverse applied to
this noise-noise-transpose matrix,
and this is still quite challenging
because B has singular values which are larger than one.
So, if we use any crude bounds,
we cannot even get a good bound on this one,
so we had to really
understand the eigenvalues of this operator B,
and then solve this inversion problem in one dimension.
And then combine different dimensions
together using bounds on
statistical condition number and condition number.
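The "operator on covariance matrices" viewpoint can be illustrated with a much simpler toy example in Python: for a plain linear recursion x_{t+1} = A x_t + noise with noise covariance Sigma, the covariance Theta_t evolves as Theta_{t+1} = A Theta_t A^T + Sigma, and its stationary value solves a discrete Lyapunov equation that can be written with a Kronecker-product inverse. The actual operator B in the talk is more involved (the stochastic gradients are coupled with the iterates, and B has singular values larger than one, which is exactly why crude norm bounds fail there); the sketch below only shows the general mechanics, with an arbitrary stable matrix A.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 3
    # An arbitrary stable matrix (spectral radius 0.9); unlike the operator in
    # the talk, it is easy to control because all its singular values are below one.
    A = 0.9 * np.linalg.qr(rng.standard_normal((d, d)))[0]
    Sigma = 0.01 * np.eye(d)   # covariance of the per-step noise

    # Iterate the covariance recursion Theta <- A Theta A^T + Sigma.
    Theta = np.zeros((d, d))
    for _ in range(2000):
        Theta = A @ Theta @ A.T + Sigma

    # Stationary covariance in closed form: vec(Theta) = (I - A (x) A)^{-1} vec(Sigma).
    vec_stationary = np.linalg.solve(np.eye(d * d) - np.kron(A, A), Sigma.reshape(-1))
    Theta_stationary = vec_stationary.reshape(d, d)

    print(np.max(np.abs(Theta - Theta_stationary)))   # ~0: both give the same fixed point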
So, this is all I wanted to
say about the proof of the result.
So if we recap whatever we have seen so far,
we saw that for the special case
of Stochastic Linear Regression,
we can completely and precisely understand
the behavior of Stochastic Gradient Descent,
Stochastic Momentum Methods,
and this new accelerated
Stochastic Gradient Descent Method.
We saw that while
conventional methods don't really
provide any improvement in this setting,
this new method seems to provide
significant improvement even in the Stochastic Setting.
So, the next question that we were trying
to tackle is whether this
has any relation to problems beyond
linear regression, or whether this
applies only to linear regression.
So, we had a bunch
of conjectures or results
here and we tried to evaluate all of
them in the context of training neural networks.
So, the first question that comes to mind when we go into
this new setting is that
people are actually using Stochastic Heavy Ball,
Stochastic Nesterov in practice,
and there also have been very influential papers
which argue that they actually
give improvements over Stochastic Gradient Descent.
And this is why people started
using these methods in the first place.
Whereas what we're saying here is that
these Momentum Methods don't really give
any benefit in the Stochastic setting.
So, both of these things are contradicting each
other, so what is
the reason for this apparent contradiction?
The reason here turns out to be Minibatching.
So, if you think of Minibatching,
it's actually a continuum
between a completely Stochastic setting
and a completely Deterministic setting, based on
the size of the Minibatch that you actually choose.
What we are saying here is that in the extreme,
where the Minibatch size is
one, the completely Stochastic case,
these momentum methods are not helping.
Whereas in the Deterministic case,
we know for a fact that
these momentum methods actually help.
When you are using Minibatching somewhere in the middle,
it's conceivable that you are going to get
benefits purely because you are moving
somewhat closer to Deterministic Gradient Descent.
The point of our algorithm here, however, is that ASGD
improves over SGD irrespective
of what the Minibatch size is,
and for whatever Minibatch size, if there
is any acceleration that's possible,
it gives that kind of acceleration.
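The minibatching continuum itself is easy to check numerically: the variance of a size-b minibatch gradient shrinks roughly like 1/b, so a large minibatch behaves more and more like the deterministic (full) gradient. The Python sketch below uses an arbitrary synthetic least-squares instance; it has nothing to do with the speaker's neural-network experiments and is only meant to show why a larger batch moves you toward the deterministic regime.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10_000, 20
    X = rng.standard_normal((n, d))
    w_star = rng.standard_normal(d)
    y = X @ w_star + 0.5 * rng.standard_normal(n)
    w = np.zeros(d)                         # evaluate gradients at an arbitrary point

    full_grad = X.T @ (X @ w - y) / n       # deterministic (full-batch) gradient

    for b in [1, 8, 128, 1024]:
        devs = []
        for _ in range(500):
            idx = rng.integers(0, n, size=b)
            g = X[idx].T @ (X[idx] @ w - y[idx]) / b   # minibatch gradient estimate
            devs.append(np.sum((g - full_grad) ** 2))
        print(f"batch size {b:5d}: mean squared deviation from full gradient {np.mean(devs):.4f}")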
So in order to test this hypothesis,
we first trained a deep auto-encoder for
MNIST with the smallest batch size of one,
so Stochastic Gradient on single examples.
It turns out that the performance of Stochastic Gradient,
Heavy Ball, and Nesterov are essentially similar.
There is really nothing to distinguish
between these methods.
Whereas if we run
our Accelerated Stochastic Gradient Method,
it runs reasonably faster,
at least in the initial part,
compared to that of
these other algorithms even
with a small Minibatch size of one.
Going next to a classification task
on CIFAR-10 using a ResNet,
we again used a small Minibatch size of
eight, which is much smaller than what's used in practice.
If you use this Minibatch size of eight,
we again see that there is not
much to distinguish the performance
of Stochastic Gradient from
the Stochastic Momentum Methods.
So all of them perform reasonably similarly.
Whereas if you compare our method with,
for instance, the Nesterov method here,
at least in the beginning phases it converges much
faster compared to the Nesterov method.
We only really think about
the initial phases because there
is this additional
Sigma squared term which I'll not talk about.
So eventually, you will reach the noise level
and then there'll be no acceleration in that part.
So, the acceleration that you can hope for is only in
the initial phases where you can
hope to get faster convergence.
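The two-phase behavior being described (a fast initial decrease followed by a plateau governed by the noise term, where no method can accelerate) can be seen even for plain SGD on a noisy least-squares stream. The sketch below is purely illustrative, with arbitrary dimensions, step size, and noise level, and makes no claim about the constants in the talk's bound.

    import numpy as np

    rng = np.random.default_rng(1)
    d, eta, noise_std = 10, 0.01, 0.1
    w_star = rng.standard_normal(d)
    w = np.zeros(d)
    for t in range(1, 20_001):
        x = rng.standard_normal(d)
        y = x @ w_star + noise_std * rng.standard_normal()
        w -= eta * (w @ x - y) * x          # single-sample SGD step on 0.5 * (<w, x> - y)^2
        if t in (100, 1_000, 10_000, 20_000):
            # Error falls quickly at first, then flattens at a noise-dependent level.
            print(t, np.linalg.norm(w - w_star) ** 2)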
>> Are you using learning rate decay?
>> Yeah. So, that's why the drops are correlated,
but the learning rates themselves have been
searched using hyperparameter search. Yeah.
>> How about the other processes?
>> So, in the previous thing, it was
just a constant learning rate, so there was no decay.
Here, the things that we search for are the learning rate,
when to decay, and how much to decay.
All of this was based on
a validation set and then we use this,
and these are zero-one test errors.
Yeah?
>> You just answered my question.
>> Okay. Yeah?
>> What is the error when it converges, after convergence?
>> Yeah, so here you can see it's about 90-something percent accuracy,
so say eight percent, about
eight percent is the error that you get here.
It's not maybe fully state of the art,
but then we are also not using
state of the art network to do this.
>> There is no main difference between both at the end?
>> Yeah, so the final
error there's not that much of a difference.
The final error seems similar.
>> So, you're saying, if you only have time for 10 classes,
you should use ASGD?
>> Yeah. So, I mean that's actually my next thing.
So, this was for a small Minibatch size of eight, right?
So, if you use something like 128,
which is more reasonable to use in practice,
we again see that ASGD reaches
pretty much the same accuracy,
but converges faster compared to the Nesterov method.
Of course, I mean you're again seeing
these drops at similar places because that's
exactly where we decay the learning rate,
and, for instance,
if you only care about getting
say to 90 percent accuracy,
you can get there using ASGD about
1.5 times faster as compared to Nesterov's method.
This could potentially be useful if
you are doing hyperparameter search and you don't
care about getting the optimal error
for every hyperparameter,
but only want to figure out what is
the rough ballpark the error is going to be. Okay?
>> There's a jump there, right?
It can't really predict how [inaudible] it's going to be after
a hundred epochs and the process initially [inaudible]?
>> Yeah, I mean if you only want to
see whether something is
extremely bad or it's reasonable,
this could help you, that's all I'm saying.
Yeah. So, okay this
brings me to the conclusion of my talk,
so to recap what we have seen in the deterministic case,
we saw that acceleration improves
the convergence speed of
gradient descent by a factor
of square root condition number,
and we asked the same question
in the stochastic approximation setting,
where we saw that
stochastic momentum methods, which are
the conventional and classical methods
that are being used,
do not achieve any improvement
over Stochastic Gradient Descent,
whereas the algorithm that
we proposed, which is accelerated
Stochastic Gradient Descent,
can achieve acceleration in
the stochastic setting whenever it's possible,
and while our theoretical results have been
established only for
the stochastic linear regression problem,
we also empirically verified
these observations in the context
of training deep neural networks,
and we also released the code for this algorithm in
PyTorch, and if any of you play with neural networks,
we would encourage you to try out our algorithm,
and let us know what you observe.
So, just as a high-level point
about optimization in neural networks and so on,
I would like to point out that while
optimization methods are heavily
used when training neural networks,
a lot of the methods that are actually
used are not well understood
even in very benign settings such as convex optimization.
I spoke about momentum methods;
stochastic momentum methods, even for
the special case of linear regression,
were not really well understood,
and algorithms such as Adam,
RMSProp, there's been
some recent work on Adam, for instance.
So, there are a bunch of
these methods which are widely used in practice,
but even in very benign settings,
the performance is not really well understood.
It's important to do this because, as
we saw, just because people are using something,
it may not be the best thing to do.
As practitioners, we may not
have all the time in the world to try out
all possible combinations to figure out
what is the best thing to do, right?
In this context, stochastic
approximation from a theory point of
view provides a good framework
to understand these algorithms,
to see what makes sense,
and what doesn't make sense.
While classical stochastic approximation results
focus on asymptotic convergence rates
and so on, we would be really interested
here in obtaining strong non-asymptotic results,
which give us finite-time
convergence guarantees and so on.
So, that's all I had to say. Thank you for being here.
>> Any questions?
>> Yeah?
>> [inaudible] graphs, what was the y-axis again?
>> Error.
>> What error?
>> Test error on the underlying distribution.
>> So, [inaudible] f of w zero minus f?
>> F star.
>> F star.
>> F minus F star.
>> [inaudible] F minus F star. On some
of the draws from the same distribution.
>> In this case, we can exactly compute
the function value because you
know that it's a Gaussian distribution.
So, you know the covariance matrix, right?
So, it only depends on the covariance matrix. Yeah?
>> So, your new algorithm, the accelerated [inaudible],
if you derive it from the deterministic one,
doesn't it have a correspondence
to some other known algorithms?
How did you come up with that?
>> Yeah, so as I said,
Nesterov's accelerated gradient can be written in
multiple ways in the deterministic world.
The most popular one is
the one that has this momentum interpretation,
which is what people use.
But there are various other interpretations and
various other ways you could write the same algorithm in
the deterministic world, and
our algorithm is a stochastic version of one of them.
Yeah. So, in stochastic world,
they're completely different even though
in the deterministic world, they are exactly the same.
>> So, your bound [inaudible] square root of [inaudible],
was that an upper bound?
>> That's an upper bound, yeah.
>> In the plot, the error wasn't going down around [inaudible]
>> So, that may be some other constraints.
>> So, you would expect it
to go down [inaudible].
>> [inaudible].
>> Yeah, so it hasn't hit the bottom,
so if you consider like 10 to the four as
the baseline for these methods.
So, right after 10 to the four, they curve down.
>> Is there a more quantitative way to [inaudible].
>> Yeah, so there was a more quantitative way
which is that ideally
you want to run this for different values of
condition number and see,
and then plot the line,
and see the slope of that line,
which we also have; it's
a little more difficult to explain in a talk.
>> I see.
>> Yeah.
>> Well, if there are no more questions,
let's thank Amit again.
-------------------------------------------
Here Is Why Trump Is To Blame For Terrorists Potentially Getting 3-D Plastic Guns - Duration: 1:30.>> Soon, no conditions for guns,
apparently.
This morning, the plans for ghost
guns are up online.
These are firearm components for
rifles and guns that can be
created by anyone at home who
has a 3-d printer.
There.
>> The president weighed in
saying I'm looking into 3-D guns
being sold to the public.
Already spoke to the NRA.
It just doesn't seem to make much
sense.
It was the Trump State
Department that dropped the
lawsuit to stop the blueprints.
The Obama administration started
it, saying the plans could be
downloaded by terrorists, and
beyond dropping the lawsuit, the
Trump administration is looking
at changing the very rule the
designer was originally sued
under.
And to the second part of the
president's tweet, there's no
word why he would check in with
the gun lobby or why the
conversation would have taken
place.
Let's slow it down.
The NRA is a special interest
group.
He didn't say he spoke to the
department of justice or the
state department.
He's not speaking to any other
gun control groups, but the
president took to Twitter to say
I have reached out to the NRA.
So those of you who have
chanted over and over, drain
the swamp, walk us through --
>> Why you would go to the
industry group.
>> Why would you contact the
largest lobbying group ever, the
NRA, on what to do about the
ghost gun.
-------------------------------------------
Kershaw Natrix Knife Review- New for 2018 Carbon Fiber and G10. Based on ZT 0770 - Duration: 10:21.Last week you may remember a video where I strapped on the old proverbial water skis
and prepared for a well supervised stunt that involves a large ocean dwelling creature made
entirely of teeth.
And if you don't get that joke, then congratulations you never watched Happy Days.
And I'm not even being sarcastic about that.
But we're back to knives and not the more expensive brand name melamine foam which I
should have bought way cheaper off Amazon.
And this one specifically is the updated Kershaw Natrix which you have been seeing show up
on all the internet knife reviews table tops for the past two months- including Austria's
finest gear reviewing Youtuber A Little Older, whose video about this Natrix I'll link
at the end.
But before we get into the real knife demonstration video art let's look at the dimensions of
this update to a knife I think they released last year.
A statement I did not research well enough.
Like the overall length and weight.
I don't research anything.
I mean I researched new computers today.
I can't afford any of those.
Blade size and cutting edge.
What do you research or do that is a waste of your time.
Before you answer remember if you don't have anything nice to say.
Handle size and grip area.
Then you might be a student loan servicer!
Spine thickness and handle thickness.
You know I've been out of college for about 16 years now.
Tallness and flipper tabs.
I think in 2024 I'll have those paid off.
I heard Youtube was lucrative so that's why I'm here.
The good news is though that the Kershaw Natrix with carbon fiber overlays is a good knife
for under $50- so if you also wasted money on an industry specialized for profit college
you too can own this knife.
Of course if you made the smart choice and went into say computer science at a traditional
college you might be able to afford the knife this is based off of- the Zero Tolerance 0770,
which is based off the 0777- which won an award of some kind.
According to the website.
PBR won an award.
Let's look at the blade shall we.
It features a drop point, that resembles a sheeps foot in a way depending on how you've
been drinking.
"that sheep has real nice feet" ok not that much.
It's covered in a titanium carbo-nitride coating over a blade made from 8Cr13MoV, sometimes known
as D2 if you buy a cheap knife directly from China.
It has a decent sized cutting edge, and some fine spine jimping on the top- not quite like
a spyderco but close.
Deployment of the blade is handled by a flipper tab that is not assisted, but has a strong
detent, like the Kershaw Fraxion I reviewed last year.
These knives are kind of similar in spirit- the Fraxion being smaller but having a light
mostly G10 frame.
This one can be deployed consistently with minimal effort.
Most is the initial press, and pop it rockets out.
It's actually almost harder to keep it from deploying fully.
Lockup is handled by a sub-frame lock... which is a patented Kershaw feature, but looking
at it in idiot terms it's a metal piece attached to the G10 with screws... so the knife doesn't
have to have a full liner- it reduces carry weight and keeps it a nice size blade and
handle.
One thing I noticed after a few hundred deployments is that the blade centering was off... the
pivot had loosened a bit, and after tightening it re-centered itself.
The pivot is easy enough to tighten- and apparently has no Loctite, which is a bonus.
Detent is strong so it's not possible to fling it open downward gravity style.
Handle is as mentioned made primarily from G10 with no liners.
It has a back spacer so it's only partially open backed.
It has thin carbon fiber overlays to match the hood of your 4-door Accord with the car
seat in the back.
For a light weight every day knife I don't see any reason to make a handle out of steel
or even titanium.
Pocket clip.
Deep carry short, tip up in right or left pocket because it's swappable.
However, no tip-down carry is possible- which relatively few people should complain
about.
Note the word relatively.
And I just got this question recently from a new subscriber... my preferred carry when
I'm wearing pants is tip up, blade backward, in my right pocket; I am right handed.
Pocket clip has the right amount of tension; I'd like the tip to be a little less aggressive
of an angle but that's a minor nitpick.
Comparisons.
We'll keep this short this video doesn't need to be 10 minutes.
I dunno maybe it will be.
First the Natrix.
Blue and black, I think the knife is nice looking, and the handle is kinda comfortable.
I like a grip area of about 3.5 inches so I can move my hand around a bit.
A little smaller isn't bad I guess, not everyone needs large knives.
I own a few Kershaws and this one would be most likely to show up in a pocket rotation.
Now the Fraxion- this one is also very nice.
A coworker asked me once what a good small knife was, and I recommended this.
It's a fast deployer and you barely know it's in your pocket.
Not everyone needs handles and blades over 3 inches.
How about the Cryo 2, Hinderer designed?
A little too heavy for me, and not quite big enough. I don't need an assisted knife when
a non-assisted one like the Natrix is nearly as good.
My rule of thumb is, if the blade and handle are not the ideal size for me, and it's over
4 ounces I won't carry it.
Let's look at the Para Military 2... a little bit larger handle and blade, the ideal every
day carry size for me, but a lot more expensive, probably by $100.
Sometimes I have a hard time recommending a good light modern snappy deploying knife
that's affordable that's not a spyderco.
The Natrix fits that bill... it's sort of comparable to the Spyderco Tenacious in price
point and similar steel, but a little lighter and slightly smaller.
And I think nicer looking.
The Tenacious is kind of plain in black G10 personally- although nicer after drinking.
Wrap it up.
The Natrix is a fine every day carry blade if you're on a budget.
Sure, knife snobs don't like 8Cr13MoV steel, but for the person who doesn't mind sharpening
a bit more often, it's a good budget choice.
I own many knives in a lot of different steels, but that's never been the main reason I choose
to carry a specific knife.
My last two larger knife purchases were both Spydercos and the looks, and handle materials
were my first considerations when spending money I didn't have.
Many of the knife designs I like, from the looks, to the ergonomics tend to have good
steel already.
And as a note of caution on the batoning of light weight knife designs.
They are not designed for this.
And it voids your warranty.
Plus you might injure yourself or a nearby drunk loved one.
The blade stop pin popped out because of the flex of the G10 and the repeated hard whacking.
Luckily my knife was easily fixed by partially disassembling the knife, and praying to Lynn
Thompson that I could find the stop pin to finish my review.
The Lord himself smiled down- he was wearing that black tie and dress shirt, surrounded
by 1000 flaming dismembered pig heads.. and I found it.
This of course is not a reflection on the construction of the knife- but a reflection
of my poor character and terrible sense of humor.
The testament is, I was able to fix the knife and return its operation to normal boring
every day carry things.
Like cutting.
So if you like this sort of review, subscribe to my channel, give the video a thumbs up,
leave a comment.
I started my Patreon recently- you're like yeah I know I saw you groveling on Instagram-
I'll link it in the video description.
Signing up for a donation there helps me afford stuff to review for the channel when companies
don't answer my emails.
I feel like there's a chance for self reflection here to take some personal responsibility.
You know that thing you think everyone else should take.
Yes the problem is other people don't take personal responsibility.
That's it.
That's why I started a Patreon.
Although Kershaw did send me this knife to review- so that's cool.
Patreon also helps pay to maintain my camera and video making equipment.
If you're watching this video that means I was able to successfully upgrade my failing
hard drive in my 2011 iMac I bought in 2012 as a refurbished unit.
If you don't see this video in the last week of July- well some stuff went down and it
took longer than expected.
Also, if you like occasional giveaways and photos of knives, follow me on Instagram.
However, if you don't want to join Instagram for a crappy giveaway, it's not my problem.
Thanks for watching!
-------------------------------------------
Cas Client : Dysfonctionnement de l'afficheur du lave-vaisselle - Duration: 1:36. For more infomation >> Cas Client : Dysfonctionnement de l'afficheur du lave-vaisselle - Duration: 1:36.-------------------------------------------
"je récupère mon ex" quelle est la méthode? - Duration: 12:27. For more infomation >> "je récupère mon ex" quelle est la méthode? - Duration: 12:27.-------------------------------------------
Le cancer n'est qu'un champignon qui peut être traité simplement avec du bicarbonate de soude - Duration: 11:10. For more infomation >> Le cancer n'est qu'un champignon qui peut être traité simplement avec du bicarbonate de soude - Duration: 11:10.-------------------------------------------
IFL East Highlights 2018 (long version) - Duration: 4:24.thank you to all the people who contributed to make this superb day
a reality
primary schools, secondary schools and colleges have sent their teachers from all around the
country to come to the International Festival of Learning and it's been such
a diverse mix of people it's just so great getting so many brilliant minds
together at the same place all talking about education it really raises the bar
on bringing like-minded people together it kind of puts the East of England
really on the map in terms of being a good forum to talk about education and
bring best practice together we've heard Amanda Spielman, Geoff Barton, it's just
been sort of a high profile level of people and thought leaders and it's just
been absolutely fascinating and informative
I've been really impressed with the number of breakout rooms and all the
different subjects being talked about so the last session I was in was about STEM
but we had a music teacher who was getting involved with the discussion I
think that really says a lot about the event you've attracted a lot of
different people here I've come here today and as well as very targeted very
focused sort of learning and knowledge acquisition from going into the seminars
there's been a great opportunity and a great amount of networking going on it's
a festival in the whole sense of the word so we had bunting out, we had
mountain climbing, we had archery, we had people eating food on the grass, it
was brilliant it was just like so relaxed. I think in the eastern region an
event like this can have a big impact for teachers who are dealing with
difficult issues around STEM; they can come together and meet others who are
dealing with the same issues, and talk to employers, particularly those who are in
industries that are employing in the region
it is giving the teachers and education leaders confidence, confidence in what
they do, and raising the aspirations of the community on what can be changed
it's about health and well-being and having positive health and well-being
that enables pupils to learn. A school's biggest resource is its staff, so it's really
important they look after their staff's well-being and make sure that their
stress levels are kept to the minimum really
I'm a great believer in education that stimulates children just to be lifelong
learners for the future not just passing some temporary assessment framework that
we have in place in our schools at the moment but giving them those core skills
that make them curious people who want to learn for their whole life. We're
ScottishPower Renewables, developing wind farms off the coast, and as
an organisation we're committed to ensuring that the future generations of
scientists and engineers are ready to work on those wind farms. I want to fix
the one-size-fits-all delivery of education using innovative technology I
want to reduce teacher workload and allow our teachers to focus on what they came to do
which is teach. And the number one problem that I want to solve in terms of
education is how we see the profession. I no longer want to see teachers saying I'm
just a math teacher, I'm just an English teacher; that person would have had a
profound impact on 10,000 pupils, which is absolutely incredible. And so I have lots
of thoughts on education or what we can do to improve, but I really think that we
need to focus on the front line, on teachers, on their delivery, on their
well-being, on how they're performing, and on helping them, and also
obviously directly on the students. It's been really great to kind of
make new contacts in business that, you know, two or three years down the line I
could maybe go back to and say, you know, I had a conversation with you at the
International Festival and here I am, have you got anything you can offer me.
For me it's been about making contacts and networking.
-------------------------------------------
Warbringers: Sylvanas (Türkçe Altyazılı) - Duration: 4:00. For more infomation >> Warbringers: Sylvanas (Türkçe Altyazılı) - Duration: 4:00.-------------------------------------------
Max Verstappen defends his F-BOMB rant live on Sky Sports: 'Shame they bleeped it' - Duration: 3:03.Max Verstappen defends his F-BOMB rant live on Sky Sports: 'Shame they bleeped it' Verstappen lasted just six laps before he began to lose power and steering in his Red Bull car. The Dutchman was told to pull over by his team and he replied over the team radio with a series of expletives. "Mate, really? Can I not just keep going? I don't care if this f***ing engine blows up," he fumed. "What a f***ing joke, all the f***ing time.
Honestly. Argh." Verstappen's rage was censored before reaching the viewers but he wishes they would have been able to hear the full extent of his anger. "It is just, at the moment, difficult to accept," Verstappen added after the race. "I was very angry on the radio, I think there was a lot of bleeping out there, which was a shame that they bleeped it away because it would have been better if they would have allowed it but that is how it is." Red Bull have been seriously let down by engine providers Renault this year and they have already agreed to switch to Honda from 2019.
And he is delighted that at least there will be some light at the end of the very frustrating tunnel - although he will likely have to suffer an engine penalty first. "Yeah I think from both sides, both Daniel [Ricciardo] and me. It is honestly, not at all, how it should be," he said. "You pay millions as a team for, you hope, a decent engine but it keeps breaking down. Not only that but we are also the slowest out there.
"I felt good with the car and had a strong start but the race was then over within six laps. "It is really frustrating after putting all the effort in and being in a promising position, but then having to stop due to reliability. "As I was happy with the car I think we could have had a good battle with the front group, it's a shame to have missed out on that and some valuable points.
"It's such a shame for not just myself and the team but also the fans that travel all the way here supporting me. "It's not fun to watch me complete a few laps and then retire. I'm not sure if this will mean engine penalties for Spa, we will look into it as a team and discuss the best way to come back strong after the summer break. "I don't really feel like going away on holiday now as this isn't the way I wanted to finish the first part of the season. "I would like to get back in the car to race again and finish on a strong result, unfortunately I can't.".
-------------------------------------------
4'7 M3T3R D0WN*F;u;1;1^M;0;v;i;3*2o!7" - Duration: 1:22:27.
-------------------------------------------
MG F 1.8I 120pk, 127.dkm! Lmv, windscherm, Elek pakket, Radio/cd! - Duration: 1:11. For more infomation >> MG F 1.8I 120pk, 127.dkm! Lmv, windscherm, Elek pakket, Radio/cd! - Duration: 1:11.-------------------------------------------
Montana Made: Holter Heart Monitor - Duration: 2:35. For more infomation >> Montana Made: Holter Heart Monitor - Duration: 2:35.-------------------------------------------
How to care for Keyboards - Duration: 1:35.(piano music)
- Hi, this is Tanya here.
I'm so glad you've taken the opportunity to borrow
one of our portable keyboards from the library,
but I just want to share some tips
on how you can take care of it before you start.
If the keyboard didn't come with a cover,
make sure you put a light cloth over it
when it's not in use because this will prevent
any dust layers from forming on its surface.
Typically your keyboard will be pretty clean,
but if you need to wipe it down,
use a soft cloth not a paper towel.
Slightly dampen it with a mild mixture
of one to two drops of dishwashing soap and warm water.
Don't use any cleaning chemicals or solutions.
Wipe each key individually towards you starting
with the white keys and then the black ones,
then use a second clean cloth to wipe them dry.
One of the biggest and most common mistakes
people make with keyboards is spilling things
all over them.
So make sure you don't leave your drinks or your food
where it could accidentally spill onto the keys
and damage the electronics.
And needless to say, make sure your hands are clean
before you play.
Lastly, avoid exposing the keyboard
to extreme temperatures and if the keyboard is
on a keyboard stand, make sure it's properly assembled.
That's it.
You can find out more about proper care practices
and playing tips by going online
or contacting a music instructor.
Take care.
(piano music)
-------------------------------------------
Oven Roasted Broccoli and Cauliflower || Keto AND Kid Friendly - Duration: 1:10.Grilled Vegetables
Broccoli and Cauliflower
Cut into florets
Olive oil
Salt
Pepper
Garlic powder
Toss lightly
400 °F for 20 minutes
-------------------------------------------
Curious Quail - Impostor Syndrome - Duration: 3:58.[Tasty 5/4 drums and Gameboys bring us in]
[Gameboy bleeps and bloops dancing with guitar notes]
🎶The day was off to a good start🎶
🎶A lead, a sale, a raise🎶
🎶but brain fills with intrusive thoughts🎶
🎶And all good things erased🎶
🎶Someone compliments your work🎶
🎶It makes you hesitate🎶
🎶You know somehow that shoe will drop🎶
🎶And kill you while you wait🎶
🎶But Hold on🎶
🎶This isn't you talkin'🎶
🎶For so long🎶
🎶You were raised to believe that your brain can't be wrong🎶
🎶but it's so wrong🎶
🎶You gotta stop listenin'🎶
🎶And find your worth🎶
[Ascending series of noises on Gameboy and Guitar]
🎶Every win comes with a shock🎶
🎶Time and time again🎶
🎶Ticking borrowed time till you're🎶
🎶abandoned by your friends🎶
🎶Cuz someday they will find you out🎶
🎶Someday they'll see you🎶
🎶For this bed of trash you are🎶
🎶Who cannot follow through🎶
🎶You cannot follow through🎶
🎶But hold on🎶
🎶This isn't you talkin'🎶
🎶For so long🎶
🎶You were raised to believe that your brain can't be wrong🎶
🎶but it's so wrong🎶
🎶You gotta stop listenin'🎶
🎶And find your worth🎶
[The guitar is really just ornamental at this point]
[Chillout Gameboy breakdown time]
[Joey turned off his snare all smooth af]
[Wurlitzer; better late than never]
🎶Block it out🎶
[Let's accent that Wurlitzer part with some delay]
🎶You need to somehow🎶
🎶And when you're down🎶
🎶Know that there's only you and you're good enough🎶
🎶And you've earned all these things that you're good at🎶
🎶There's only you and you're good enough🎶
🎶You'll be the best you, you're good enough🎶
🎶Wait and see🎶
[Ahh yes, a rest]
🎶Wait and see🎶
[HIT IT JOEY]
[This is Mike's second guitar solo of the year, it's getting out of hand]
[Unlike in previous songs, we actually DID need this many guitars]
[Closing it out with some final guita...]
[Just kidding, Gamebo...]
[DAMNIT JOEY]
[Thank you for watching!!]