>> Does anyone want to have a try?
>> [inaudible]
>> Yeah.
>> Is there any tracking in this or is it just?
>> No. Just the discriminative part, it's frame-based.
>> It sounds good.
>> It runs at about 25 to 33 frames per second.
Yeah. I think we can.
>> I'd like to introduce you.
>> Yeah.
>> Yes. So, hello. We're very pleased to have Chie here
who's studying for a PhD at Imperial with TK Kim,
and was previously at Tsinghua in Beijing,
and as you can see she is an expert on
hand tracking and has also been very involved in
the hand benchmarking projects
at recent CVPRs and ECCVs and so on.
So, sorry ICCV,
and she's going to tell us about hands. Welcome.
>> Thank you.
Yeah. Okay. I'm very
glad to have this opportunity to present my PhD work
about 3D Hand Pose Estimation
using Convolutional Neural Networks.
So, recently 3D hand pose estimation
has gained a lot of
interest with commercial depth cameras.
We can see many companies have
incorporated or are planning to
incorporate hand interaction into their products.
3D hand pose estimation can be formulated as inferring
the pose given the depth image.
The pose can be represented by
3D joint locations or the angles between the bones.
So, assume we have a dataset that
consists of pairs of images,
input images and their labels.
What can we do if we want
to make a prediction for a new unseen image,
like if we want to estimate
the joint locations for new images?
Yes. We can simply learn
a mapping function f from x to y.
For this function we can
choose a CNN or a random forest,
or another supervised learning method,
but we will have this question:
will this function learn a good mapping?
If not what are the challenges?
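Before going into the challenges, as a rough illustration only, and not the exact network from the talk, such a mapping f from a depth image x to joint locations y could be a small CNN regressor trained with mean squared error:

```python
# Minimal sketch (made-up architecture and sizes) of learning f: depth image -> 3D joints.
import torch
import torch.nn as nn

class JointRegressor(nn.Module):
    def __init__(self, num_joints=21):
        super().__init__()
        self.num_joints = num_joints
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64 * 16 * 16, num_joints * 3)

    def forward(self, x):                           # x: (N, 1, 128, 128) depth crops
        h = self.features(x).flatten(1)
        return self.head(h).view(-1, self.num_joints, 3)

model = JointRegressor()
x = torch.rand(4, 1, 128, 128)                      # normalized depth patches
y = torch.rand(4, 21, 3)                            # ground-truth joint locations
loss = nn.functional.mse_loss(model(x), y)          # simple supervised regression loss
loss.backward()
```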
So, these are the challenges.
First, hands can appear in
any viewpoint and with many complex articulations;
in some viewpoints there are self-occlusions,
and there are many self-similar parts;
also, hands come in different shapes.
Finally, acquiring a labeled dataset is itself a problem.
So, there are typically
three approaches to solve this problem.
The first one is the learning-based method.
It is what we had discussed,
learn a mapping from x to y.
It is efficient, there is no need to do tracking,
and there's no need to do initialization.
It's like the little demo I showed just now,
but you still need to learn the function.
The second method is the generative method.
We assume we have a hand model,
and we search its pose parameter space to
find the configuration that best
aligns the hand model to the input image.
So, if we use
a very good hand mesh model for example,
this method can achieve very accurate results,
but as the hand pose lies in a very high-dimensional space,
it is very hard to converge without a good initialization.
So, the third method
is a hybrid method combining the above two.
The discriminative method can provide an initialization for
the generative method, like Sharp et al. and Taylor et al.,
and also the prediction of
the learning-based method can act as a
part of the energy for the generative method.
So, my PhD work tackles
these challenges with learning-based methods.
The first works deal
with viewpoints, articulation,
and self-similar parts, and the final work focuses on occlusion.
At the same time we collected a large dataset by
designing an automatic labeling system.
So, my first work is motivated by two observations.
The first one is, we have many hand pose images
that are different just
because the viewpoint changes.
The second observation is that
although the images demonstrate complex articulations,
many images show self-similar parts, like the parts marked by the squares.
So, we can use a part-based method to help
our learned model generalize to unseen images.
However, due to the many self-similar parts,
a model trained with the part-based method can
hardly discriminate different parts.
So, I can challenge you to
identify which finger these parts belong to; it is quite hard.
But if we move our attention along the hand, estimating the hand pose
by following the hand structure:
if we know, let's say, the middle finger,
and if we follow the hand structure,
we can identify which parts they belong to.
So, for each part,
we can learn offsets from
the center to the following joint locations.
As long as we can find some
similar parts in our dataset,
we can estimate the parts for new unseen images.
Then, when we concatenate all the patches,
we can get the whole pose.
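As a hedged sketch of this per-part idea (the names and sizes are my own, and the real pipeline differs): a small network takes a patch centered on an already-estimated joint and regresses offsets to the next joints along the hand structure, which are then added back to the center.

```python
# Hypothetical sketch: regress offsets from a part center to its child joints,
# then recover absolute locations by adding the center back.
import torch
import torch.nn as nn

class PartOffsetRegressor(nn.Module):
    def __init__(self, num_children=2):
        super().__init__()
        self.num_children = num_children
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, num_children * 3),
        )

    def forward(self, patch, center):
        # patch: (N, 1, 64, 64) crop around the estimated parent joint
        # center: (N, 3) 3D location of that parent joint
        offsets = self.net(patch).view(-1, self.num_children, 3)
        return center.unsqueeze(1) + offsets        # (N, num_children, 3) child joints

model = PartOffsetRegressor()
patch, center = torch.rand(4, 1, 64, 64), torch.rand(4, 3)
print(model(patch, center).shape)                   # torch.Size([4, 2, 3])
```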
Also, as we estimate
the hand in a sequential way by following this hierarchy,
we are able to reduce the viewpoint variation at the same time.
We assume the palm is rigid, so we can
calculate the global rotation
by estimating the palm joints.
Then, with these estimated joint locations,
we can align all viewpoints to one direction.
So, this reduces the viewpoint variation.
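A possible sketch of this alignment step (the exact procedure in the talk may differ): with the palm assumed rigid, the global rotation can be recovered by rigidly aligning the estimated palm joints to a canonical palm, for example with the Kabsch (SVD) algorithm.

```python
# Sketch: estimate the global rotation by aligning estimated palm joints
# to a canonical palm template with the Kabsch algorithm.
import numpy as np

def palm_rotation(canonical_palm, estimated_palm):
    """Both: (K, 3) palm joint locations. Returns R with estimated ~= R @ canonical
    (up to translation), so applying R on the estimates cancels the global rotation."""
    a = canonical_palm - canonical_palm.mean(axis=0)
    b = estimated_palm - estimated_palm.mean(axis=0)
    u, _, vt = np.linalg.svd(b.T @ a)
    d = np.sign(np.linalg.det(u @ vt))              # guard against reflections
    return u @ np.diag([1.0, 1.0, d]) @ vt

# Hypothetical palm joints (mm): a canonical template and a rotated, translated copy.
canonical = np.array([[0, 0, 0], [30, 0, 0], [60, 5, 0], [15, 40, 0], [45, 40, 0]], float)
theta = np.deg2rad(30)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
estimated = canonical @ R_true.T + np.array([5.0, -3.0, 200.0])
R = palm_rotation(canonical, estimated)             # recovers R_true
aligned = (estimated - estimated.mean(0)) @ R       # viewpoint-normalized palm joints
```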
So, we are not the first to
use hierarchy or hand structure.
These two works decompose
the pose into joint locations by following the hierarchy.
The difference of our work
from these works is that we borrow
the spatial attention idea from image recognition to transform
the input and the output at
the same time to reduce
the input and output space variation.
By attending to local parts and aligning their viewpoints,
the variation of the input space is reduced.
You can see the whole images
demonstrate many variations, but for the local patches,
the variation is much smaller.
At the same time, as we
organize the parts by following
the hand structure, we avoid
having to discriminate the different hand parts,
and also make use of
these self-similar parts, because for the self-similar parts
we just want to know the offsets:
as long as we find some similar parts in the dataset,
we can make an estimation for the new patches.
The output space is also largely reduced.
For example, the black region is
the pose space for the middle fingertip.
If we use a part-based method and learn offsets,
the space shrinks to the brown one, and if we
further align the viewpoint into one direction,
the space for the fingertip is confined to the red arc.
So, to attend to
these local patches along the hierarchy,
we apply a spatial transformation to
the input image and to the output joint locations.
The parameters of the transformation
are acquired from the estimation results
of the previous layers.
For example, if we want to estimate the yellow joint,
we focus on a patch
centered on the estimated red joint location.
The tx and ty in
the transformation are the location of that joint,
theta is the rotation of the whole hand,
and b specifies the cropped size of the patch.
So, the spatial attention module actually sits
between two successive layers in our hierarchy.
It transforms the input
and the estimated joint locations
into a patch and aligns them.
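A minimal sketch of such a crop-and-align step, assuming a 2D similarity transform with parameters (tx, ty, theta, s); the same transform would also be applied to the joint labels, which is omitted here:

```python
# Sketch: build a 2x3 affine matrix from (tx, ty, theta, s) and sample an aligned
# local patch from the depth image (differentiable, so it can sit between CNNs).
import torch
import torch.nn.functional as F

def crop_and_align(depth, tx, ty, theta, s, out_size=64):
    """depth: (N, 1, H, W); tx, ty: patch center in normalized [-1, 1] coords;
    theta: in-plane rotation in radians; s: relative size of the crop."""
    cos, sin = torch.cos(theta), torch.sin(theta)
    affine = torch.stack([
        torch.stack([s * cos, -s * sin, tx], dim=-1),
        torch.stack([s * sin,  s * cos, ty], dim=-1),
    ], dim=-2)                                                   # (N, 2, 3)
    grid = F.affine_grid(affine, (depth.size(0), 1, out_size, out_size),
                         align_corners=False)
    return F.grid_sample(depth, grid, align_corners=False)

depth = torch.rand(2, 1, 128, 128)
patch = crop_and_align(depth,
                       tx=torch.tensor([0.1, -0.2]), ty=torch.tensor([0.0, 0.3]),
                       theta=torch.tensor([0.5, -0.1]), s=torch.tensor([0.5, 0.5]))
print(patch.shape)                                               # torch.Size([2, 1, 64, 64])
```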
So, if we put all the hierarchy layers together,
we can get the whole pose.
However, like in this slide,
if one of the estimations is wrong,
the following estimations will be wrong.
So, with these hierarchy layers,
error accumulation exists in our method.
We propose two solutions.
The first one is, we use
cascaded stages to refine the estimation.
The second one is, we use some criteria to
evaluate the estimations and remove bad hypotheses.
So, for our second solution,
we extend the silver energy of
Tang et al.'s work, which we
mentioned, by introducing the palm structure
and the bone lengths.
We generate multiple samples
around the estimation of, let's say, layer 1,
and then use PSO to search for
the minimum of the energy we use.
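As a rough sketch of this refinement step, with a placeholder energy rather than the extended energy from the talk: particles are pose hypotheses sampled around the CNN estimate, and PSO searches for the minimum.

```python
# Sketch: refine a (partial) pose estimate with particle swarm optimization.
import numpy as np

def pso_refine(energy, init, n_particles=32, n_iters=30, sigma=5.0,
               w=0.7, c1=1.5, c2=1.5, rng=np.random.default_rng(0)):
    """init: (D,) initial pose vector; energy: maps a (D,) pose to a scalar cost."""
    x = init + sigma * rng.standard_normal((n_particles, init.size))
    v = np.zeros_like(x)
    pbest, pbest_e = x.copy(), np.array([energy(p) for p in x])
    gbest = pbest[pbest_e.argmin()].copy()
    for _ in range(n_iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        e = np.array([energy(p) for p in x])
        better = e < pbest_e
        pbest[better], pbest_e[better] = x[better], e[better]
        gbest = pbest[pbest_e.argmin()].copy()
    return gbest

# Placeholder energy: squared distance to the CNN estimate (a real energy would
# also score the depth image and penalize bone-length violations).
cnn_estimate = np.zeros(15)                     # e.g. 5 palm joints x 3D, made up
print(pso_refine(lambda p: np.sum((p - cnn_estimate) ** 2), cnn_estimate))
```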
So, the whole pipeline consists of four layers.
Each layer consists of cascaded stages and
a partial PSO to estimate
the joint locations in that layer.
A spatial attention module is added between two
successive CNNs to focus on
a local patch and align the input and output.
So, we can see after refinement,
the error accumulation is largely reduced.
So, to show the efficacy of
each component we use in our pipeline,
we provide multiple baselines.
The first one is
a holistic CNN regressing the 21 joints together.
For the second one, we have a separate
CNN to estimate the global rotation,
then rotate all images to one direction,
and then estimate the 21 joints.
For the third baseline, we first estimate all joints together
and then apply the spatial attention
on some local patches to refine the estimation.
The fourth and the fifth are
our proposed method without the refinement,
so with the error accumulation, and the one with the refinement.
>> So, are the [inaudible] sizes comparable between your CNNs and the baselines?
>> In our model, we have multiple CNNs,
so we set the baseline CNN's memory size
to be the same as our proposed method's.
So the size, or
the capacity of the regressor, is roughly the same.
So, we can show each component we add
improves the performance on the dataset.
And here, we show the viewpoint distribution by using
the estimated rotations from the palm joints,
at the different cascaded stages.
So, the blue line is the original distribution,
and after the refinements,
the rotation variation becomes largely reduced.
So, then we compared our method to
many [inaudible] works, and HSO is Tang et al.'s work,
which is quite similar to our method.
Our work can be treated as an extension of this work,
but we can see, on this dataset and on our dataset,
we improve on it by a large margin.
So, the next work is about datasets.
So, the [inaudible] method can
help generalization when there is not enough data,
but there are still many cases where we cannot
even find similar parts in the dataset.
So, we want to collect a larger dataset.
Sure, we can try using
synthetic images to get larger datasets,
like MSRC's synthetic dataset.
But the synthetic images appear
quite different from the real images,
and it is also not easy to design very natural poses.
So, before BigHand,
the largest real dataset was HandNet,
but it only labels six joints.
So, these two factors motivated us to design an
automatic labeling system with
full hand pose annotation.
So here, I will not go into the details,
but show the space coverage of the dataset.
For the BigHand dataset, this one is
the viewpoint coverage, and this one is the articulation coverage.
We can see BigHand demonstrates
a broader and more even coverage
compared with earlier datasets like NYU.
So, using this dataset,
we held the 2017 hand pose challenges,
and the top three achieved errors of about 10 millimeters.
And for the demo I showed just now it is similar;
the top three are at about 13 millimeters for unseen objects.
So, we have already
got an error as low as 10 millimeters,
and all of these methods are based
on the discriminative approach.
So, we have this question:
does learning a mapping function
solve all the challenges we mentioned at the start?
When we collected the dataset,
we considered this in the design:
we considered the viewpoints and articulations,
but we still have to consider the occlusions.
So, can the mapping
function work well under severe occlusions?
So, in this work,
I want to model hand pose under occlusion.
So, before we start, let's
figure out a little problem first.
So, when we collected the dataset for BigHand,
we had this observation:
for many images with occlusion,
we have multiple ground truths to choose from for the occluded joints.
You can see, when the camera is over there,
we can get many different articulations.
So, the blue skeleton is the
ground truth for the visible joints, and the red
and the yellow are for the occluded ones.
So, our problem becomes,
even when using a learning-based method,
that we are mapping one image
to multiple joint locations.
So what will happen if we use
a CNN trained with mean squared error?
So, we first look at an example image with occlusion.
The red skeleton is
the prediction of a CNN trained with mean squared error.
For visible joints, the estimation is more or less good,
but for occluded joints,
the prediction is not close to any ground truth.
So, to further clarify our problem,
we assume the dataset used to train
the CNN contains two exactly identical images,
with the two different ground truths shown. After convergence,
the estimation we are given, the red one,
is the average of these two labels.
So apparently, the average is
a wrong estimation for an articulated object.
So, what should we do?
For the visible joints,
we have only one location;
for the occluded ones, we have multiple locations.
So, we can think of modeling
these two cases by different models or different losses.
So, we propose to solve
this problem with a two-level hierarchy.
At the top level, we have a binary variable
to represent whether a joint is visible or not.
If it is visible, we have
a unimodal distribution to model it,
and for the occluded one,
we have a multimodal distribution.
Then, depending on the visibility at the top level,
we can switch between these two cases.
So, the binary variable
representing whether a joint is visible or not
follows a Bernoulli distribution.
So, when a joint is visible,
it is actually deterministic,
but if we consider the label noise,
we can see the joint as if it is
drawn from a single Gaussian distribution,
so we model it with a single Gaussian distribution.
When V is zero,
the joint is occluded.
So, we use a mixture of Gaussians with J components
to model these multimodal labels.
So, with all the components defined,
the distribution of the joint location Y conditioned on
the visibility is this,
and the joint distribution of Y
and the visibility is like this.
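A hedged reconstruction of these two distributions, in my own notation rather than the slide's:

```latex
% Conditional of Y given the visibility V, and joint distribution of Y and V,
% with CNN-predicted parameters w(x), mu(x), sigma(x), pi_j(x), mu_j(x), sigma_j(x).
\begin{align}
p(y \mid v, x) &= \mathcal{N}\big(y \mid \mu(x), \sigma^2(x)\big)^{v}
  \left[\sum_{j=1}^{J} \pi_j(x)\,
  \mathcal{N}\big(y \mid \mu_j(x), \sigma_j^2(x)\big)\right]^{1-v} \\
p(y, v \mid x) &= \mathrm{Bern}\big(v \mid w(x)\big)\; p(y \mid v, x)
\end{align}
```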
So, the two-level hierarchy is
shown in the conditional distribution.
First, a sample V is drawn
from the Bernoulli distribution,
and then depending on V,
the joint location Y is
drawn from the single Gaussian distribution
or the mixture of Gaussians.
So our proposed method can switch between these two cases,
and it provides a full
description of hand poses under occlusion.
So, now we have defined our model,
and the next problem is,
how do we set the parameters?
Note that the hierarchical mixture density,
and also the joint distribution,
are conditioned on the input image X.
So, we can see these parameters
actually depend on X,
or they can be represented by a function of X.
As the joint distribution is differentiable,
we choose a CNN to learn these parameters.
So, given a dataset of images with
their joint locations and visibilities,
the likelihood of the whole dataset is
the product over all the individual joints
and over all images.
Here, we assume all the joints
are independent of each other.
So, our goal is to learn
a neural network whose
parameters maximize the likelihood.
We use the negative log-likelihood
as the loss function.
So, this loss function consists of three terms.
The first term is about the visibility,
the second term is for the visible joints,
and the last one is for the multimodal case,
for the joints under occlusion.
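A minimal sketch of such a three-term loss for a single joint, with my own variable names and shapes rather than the talk's exact implementation:

```python
# Per-joint negative log-likelihood: a Bernoulli term for the visibility, a single
# Gaussian term for visible joints, and a Gaussian-mixture term for occluded ones.
import math
import torch

LOG_2PI = math.log(2 * math.pi)

def hmdn_nll(v, y, w, mu, log_sigma, pi, mu_j, log_sigma_j, eps=1e-6):
    """v: (N,) 0/1 visibility labels; y: (N, 3) joint locations; w: (N,) predicted
    visibility; mu, log_sigma: (N, 3); pi: (N, J); mu_j, log_sigma_j: (N, J, 3)."""
    # Term 1: visibility (Bernoulli log-likelihood).
    nll_vis = -(v * torch.log(w + eps) + (1 - v) * torch.log(1 - w + eps))
    # Term 2: visible joints, single Gaussian log-likelihood.
    log_gauss = (-0.5 * ((y - mu) / log_sigma.exp()) ** 2
                 - log_sigma - 0.5 * LOG_2PI).sum(-1)
    # Term 3: occluded joints, mixture-of-Gaussians log-likelihood.
    diff = y.unsqueeze(1) - mu_j                                      # (N, J, 3)
    log_comp = (-0.5 * (diff / log_sigma_j.exp()) ** 2
                - log_sigma_j - 0.5 * LOG_2PI).sum(-1)                # (N, J)
    log_mix = torch.logsumexp(torch.log(pi + eps) + log_comp, dim=1)  # (N,)
    # (The "soft" variant mentioned later would replace v below with the predicted w.)
    return (nll_vis - v * log_gauss - (1 - v) * log_mix).mean()
```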
So, during testing, when an image X is fed into the network,
the prediction for one joint
is diverted to different branches depending
on the prediction of the visibility
w. If w is larger than 0.5,
the prediction or sampling for the joint
uses the first branch, the single Gaussian.
Otherwise, it uses the second branch.
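Continuing with the same assumed outputs as the sketch above, the test-time branching for one joint could look like this (taking the mean of the chosen branch instead of sampling, for simplicity):

```python
# Sketch of the prediction branch: single Gaussian mean if the joint is predicted
# visible, otherwise the mean of the most probable mixture component.
import torch

def predict_joint(w, mu, pi, mu_j):
    """w: (N,); mu: (N, 3); pi: (N, J); mu_j: (N, J, 3)."""
    best = pi.argmax(dim=1)                                     # most likely component
    occluded_pred = mu_j[torch.arange(mu_j.size(0)), best]      # (N, 3)
    visible = (w > 0.5).float().unsqueeze(-1)
    return visible * mu + (1 - visible) * occluded_pred
```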
But the problem is that when
the prediction for the visibility is wrong,
the sample or the prediction for the joint
will be wrong, because we go into the wrong branch.
So, to alleviate this problem,
instead of using the binary visibility label v
to compute the likelihood,
or the loss function, during training,
we use samples drawn from the estimated distribution.
So, when the number of samples for
the visibility is large enough,
the average of these samples becomes w,
so we replace v by w during training.
This we call the soft
version of our proposed method.
So, to demonstrate the superiority of our work,
we construct two variations.
The first one is a method that models
all the joint locations with a single Gaussian.
The second one models
all the joints with a mixture of Gaussians.
So, in this figure,
we draw 100 samples for
each fingertip from the distributions of all these methods.
So, for the visible joints,
the SGN and the HMDN actually produce compact
samples around the ground truths,
while for the MDN,
we can see its samples have
a broader range compared to these ones.
For the occluded joints,
the samples produced by SGN
are in quite a broad range,
and they overlap with other joints.
The samples produced by HMDN,
our proposed method, scatter
within the movement range of the joints.
So, we can see that HMDN combines
the advantages of SGN and MDN,
and is able to produce
compact samples for the visible joints
and interpretable samples for the joints under occlusion.
So, to further demonstrate
the distribution predicted by our method,
we represent the distribution by spheres.
The spheres are the components of the Gaussian mixture:
the center is the mean,
the radius is the standard deviation,
and the transparency is in proportion to
the mixture weight of the different components.
So, we can see HMDN actually
combines the strengths of SGN and MDN.
So, to compare these variations,
we draw one sample and
compare the sample to the pose label.
The average errors for
the visible joints and
the occluded joints are listed here.
We can see, for different numbers of Gaussian components,
HMDN consistently improves over SGN and MDN.
Our motivation is actually modeling
the distribution under occlusion.
We evaluate these distributions
by drawing samples from the predicted distribution.
We compare the set of ground truth labels and the set
of samples we draw, and then
calculate their minimum distances
and see how well they align with each other.
If they are well aligned,
the distances should be small.
So, we can see the method we propose
improves a lot over SGN for the occluded joints.
And compared to MDN,
it achieves a similar accuracy.
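A rough sketch of such a set-to-set evaluation (essentially a Chamfer-style minimum distance; the talk's exact metric may differ):

```python
# Compare the set of ground-truth labels for one occluded joint with the set of
# samples drawn from the predicted distribution, via directed minimum distances.
import numpy as np

def min_distance_alignment(gt, samples):
    """gt: (G, 3) ground-truth locations; samples: (S, 3) drawn samples."""
    d = np.linalg.norm(gt[:, None, :] - samples[None, :, :], axis=-1)  # (G, S)
    gt_to_samples = d.min(axis=1).mean()    # do samples cover every ground truth?
    samples_to_gt = d.min(axis=0).mean()    # does every sample stay near a ground truth?
    return 0.5 * (gt_to_samples + samples_to_gt)

gt = np.random.rand(4, 3)                   # hypothetical multiple ground truths
samples = np.random.rand(100, 3)            # hypothetical drawn samples
print(min_distance_alignment(gt, samples))
```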
Yeah. So, previously, we proposed
to mitigate the wrong-branch problem during testing
by sampling from
the visibility distribution during training.
So, in this table, we show
the soft version converges
to better results than the hard version.
When we compare with prior work,
we choose a dataset that contains
a considerable number of occluded joints,
both in testing and training.
We compare with three methods;
the first figure shows our comparison with these methods
using the commonly used measurement,
the proportion of joints under a certain error threshold.
The first one is the ICVL work
and the second one is Tang's work.
We compare with them in the last two figures by sampling,
because these two methods can
provide multiple pose hypotheses.
For the ICVL work, by jittering the
[inaudible] estimation, it can be treated as a single Gaussian model.
In Tang's work, they have a GMM fitted in the leaf nodes,
so it can be treated as a mixture of Gaussians.
So when we compare with these two methods,
we draw multiple samples and choose
the closest distance to
the ground truth among the samples.
So we can see our method actually
achieves a lower error when we
draw more samples, and the variance of
our method is smaller than these two methods'.
So here is all my work. Thank you.
>> Thank you very much. Time for questions.
>> Yeah. So in one of the tables you show that
your hierarchical HMDN method actually improves
also on the visible joints, not
only on the occluded joints.
Any intuition why that might be the case? So, yeah.
>> So, you mean-.
>> You also have improvement
in visible joints not only in occluded.
>> Yeah, yeah, yeah, yeah, yeah.
For the visible joints, actually,
you mean why it should improve on the visible joints.
Yeah, it actually should, because also in reference,
in Bishop's 1995 paper,
they also showed that when we model
multimodal problems using such a distribution,
it helps the learning for the unimodal cases.
So I think this is because during learning,
we have errors from these two parts,
and the error from the multimodal part
is learned easily, it converges easily.
So it helps the network to
allocate its capacity to the visible,
the single-model, part.
If we treat all these cases
by using a single model, or using mean
squared error, it actually adds a
burden or maybe misleads
the network to converge to a bad minimum.
>> Is that something you are trying to
extend, because of the occlusions,
to the situation with two hands and
[inaudible]? Or yeah, you haven't tried that.
Do you plan to work on that?
>> Yeah, I think so.
I haven't tried it with two hands,
but one part of the difficulty is
the detection part: how to discriminate the two hands.
But once we have
got that information, like the cropped one,
I think it will work.
But I'm not sure, because it is quite complicated even
for a single hand, especially when
you have some occlusions.
I think Taylor's, like,
the recent work,
they deal with slight interactions
but also two hands.
But this work itself
handles two hands only if there is no interaction.
>> How was the training data labeled?
I mean, for an occluded joint,
did you supply multiple labels
or use just one per example?
>> Yeah, actually, in training,
we didn't do anything manual,
like supplying extra labels, because if we look at the pose part,
starting from the label part:
if we look at this label,
it actually has a corresponding input image;
each label
has a corresponding input image.
Therefore, we have different labels.
So during training, we didn't group any of them,
we just fed them into the training.
We don't need to provide them,
because they actually exist in
the dataset. But for testing-
>> Yeah, that is how the dataset was collected.
>> Yeah, yeah, so in the collection,
I actually performed these poses as many times as possible.
>> Right. Yeah.
>> Yeah. So for testing,
we need to group some poses together,
similar ones, like in a sequence;
if they are very
close to each other for the visible joints,
that means they are multiple ground truths.
Yeah, yeah, thank you.
>> How was it annotated?
So, 2.2 million depth images are recorded;
I'm just not familiar with how the ground truth was attached.
>> The ground truth is like, we have
a trakSTAR system,
where each joint is attached with
a sensor, and we have six sensors.
>> Okay.
>> Yeah, one is attached to
the wrist and the other ones are on the fingertips,
because the sensors provide
both the locations and the rotations,
the global rotations. So if we assume the palm is rigid,
we actually can calculate the other joints,
their locations and their rotations.
>> Right. So in terms of the data recorded, it's actually
the same as the ones that you compared before.
>> Yeah, yeah, yeah.
>> It's just that then you fit
a hand model to the six recording sensors.
>> Yeah, yeah. Then with
some inverse kinematics, like,
if we get this joint location and
this joint location, and we have the rotation,
we actually can calculate these parts.
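A tiny sketch of that rigid-palm idea with made-up numbers: a 6D sensor gives a position and a rotation, and a joint at a known local offset from the sensor can be recovered by transforming that offset into the camera frame.

```python
# Hypothetical example: recover a joint location from a sensor's 6D pose plus a
# calibrated local offset (the real system also uses inverse kinematics).
import numpy as np

def joint_from_sensor(sensor_pos, sensor_rot, local_offset):
    """sensor_pos: (3,); sensor_rot: (3, 3) rotation matrix; local_offset: (3,)
    offset of the joint in the sensor's local frame."""
    return sensor_pos + sensor_rot @ local_offset

sensor_pos = np.array([10.0, 0.0, 300.0])      # mm, made-up values
sensor_rot = np.eye(3)                         # identity rotation for the example
local_offset = np.array([0.0, 25.0, 0.0])      # assumed calibrated offset
print(joint_from_sensor(sensor_pos, sensor_rot, local_offset))
```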
>> So how worried are you about
your ground truth not modeling
individual finger flexions that
might look very similar in your sensor data?
>> For the sensor data,
because when you do this,
the rotation actually changes, because
it also records the rotation.
The sensor is very sensitive,
it can give you the different rotations,
and then you can calculate the other joints.
>> I'm not familiar, does the sensor interfere with
the depth data at all?
>> We have the wires, but the thing is that
the wire is
quite thin and the sensor itself is thin.
With a depth camera,
in the depth image,
you can hardly see these wires,
but in RGB, you would not use this data.
>> In RGB, you would easily see them.
>> Yeah, yeah, yeah. So my current work
is about transferring from depth to
RGB, generating something
with a GAN or something. Yeah.
>> Thank you very much.
>> Thank you very much.