- Good afternoon, welcome to the first talk
in the Paul G. Allen School of Computer Science
and Engineering Colloquium Series for 2018.
I am delighted to introduce to you
Hannah Hajishirzi who is an assistant
research professor in the electrical engineering
department here at UW so she's no stranger
to most of the people in this room.
Hannah has been doing exciting work
at the junction of natural language processing
computer vision, machine learning,
and artificial intelligence.
She does really exciting and ground-breaking work
in building on these foundations
to enable artificial intelligence
to start to do more than what any
of these single areas can do by itself.
So I think we're in for a real treat
with her talk in which she'll be telling us
about learning to reason and answer
questions in multiple modalities.
She's worked with a wide range of data.
She brings together the amazing capabilities
of end-to-end deep learning systems
with symbolic methods that are designed
to support reasoning and interpretability.
So without further ado, welcome, Hannah
and thanks everyone for coming.
- [Hanna Hajishirzi] Thanks Noah.
So thanks a lot for the introduction.
In this talk I'm going to present my work
on question answering and reasoning
about multi-modal data.
This is joint work with my amazing students
and my colleagues at the University of Washington
and AI2.
Recently we have witnessed great progress
in the field of artificial intelligence,
especially in natural language processing
and question answering.
For example, we have seen IBM's Watson
beating humans in Jeopardy.
We see Google search engine being able
to answer a lot of interesting questions
about entities and events, and it's mainly built
on the Google Knowledge Graph.
Also, question answering and interactive system
capabilities have been deployed
into today's cell phones and home automation devices,
like the Amazon Echo and Google Home.
These systems are great mainly because
they are doing a really good job
in pattern matching but we really need
to answer two important challenges in order
for these systems to be fully applicable.
The first challenge is to have rich
understanding of the input.
The second challenge is the ability
to do complex reasoning.
Let's look at this example:
What percentage of the Washington state budget
has been spent on education in the past 20 years?
If you ask Google this question, you probably
see a list of webpages that are relevant
to the Washington state budget
and it's the user's job to go
over these web pages, connect all of them,
finally find the answers to the question.
But what we want is a question answering system
to be able to understand that we are actually
looking to find and solve that equation.
And then it's the AI system's job to go
over different web pages to understand
exactly what is going on inside those web pages,
looking at different sources of data,
like graphs, like diagrams, tables, and so on,
and then connect all of them together,
do complex reasoning that practically requires
multiple steps to finally answer this question.
Or let's look at this problem.
What will happen to the population of rabbits
if the population of foxes increases?
So this is a type of question that probably
a ten-year-old would be able to answer
by looking at this diagram, knowing that foxes
and rabbits are connected to each other.
This is the food web and foxes are consuming
and eating rabbits.
But for current AI systems this is actually
very difficult to answer.
In order to answer those questions
the system not only needs to understand
what is going on inside this diagram
but also needs to know what it means
for rabbits to be consumed by foxes.
Basically, it requires some sort
of complex reasoning to go over
a large collection of textbooks
and also maybe look at some other types
of structured data like encyclopedia,
or, for example, Wikipedia pages
to finally answer this question.
In my research I have been focused
on designing AI systems that can
address these two challenges.
One is understanding the input
and also being able to do reasoning.
I have started my research career
with designing logical formalisms
on how to represent data such that we can
do more efficient reasoning.
Then I extended those formalisms
to NLP and cognitive vision applications
by learning those formalisms from data.
In particular I have introduced new challenges
in NLP and computer vision.
Some challenges like automatically solving
algebra word problems or automatically solving
geometry word problems.
Basically these challenges are the types of tests
that 10-year-olds or 12-year-olds
would be able to handle but current AI systems
can't solve those problems.
In order for an AI system to address
these questions I have made contributions
to the NLP area to basically have better
and richer understanding of the textual input
and also to computer vision literature
to have a better understanding of visual input
and also multi-modal data.
For the purpose of this talk I'm going
to focus on a task that is mainly
question answering.
The idea is we want to have a good and rich
understanding of the input, which can be
in the form of multi-modal data,
usually a question and a context,
and an algorithm that can do reasoning
to find the answers to the question.
Here is the outline of my talk.
In the first part I will show how
we can represent data mainly using
symbolic representation or neural representation.
Then I'm going to show my work
on designing end-to-end and deep neural models
for question answering about multi-modal data.
And then in the next part I will show how
to use symbolic representation
to solve some AI challenges
and then finally I show my future directions.
When we want to design AI systems
an important challenge to address is
how we represent data such that
we can learn the representation
from data, but at the same time
we want these representations
to facilitate reasoning for us.
These type of representations can range
from symbolic representations,
like logical formulas, to neural representations.
Let's look at this problem
We want to design a system that can automatically
solve geometry word problems.
And I want to show that if we can map
the input to some sort of logical formula,
like what you see on the screen,
then if we can leverage axioms and theorems
from the geometry domain and do reasoning
we would be able to solve these problems.
This representation is great because it allows us
to do complex reasoning and then solve
these geometry problems.
But at the same time directly learning it
from data is very hard.
Basically this representation
is too rigid to learn.
What we want is to make
these logical representations a little softer.
For example, we can use different formalisms
that are available, like Markov logic networks,
probabilistic relational models,
or some of my PhD work
on representing sequential data.
But for the purpose of this work
we have focused on using probabilistic relations,
assigning a probabilistic score
to each of the relations that we extract
from the geometry problem.
So here I just told you about how we represent
the problems and then later in the talk
I will show you how we can use
these probabilistic relations
to finally solve the problem.
On the other end of the spectrum
lies neural representations.
One very popular way and technique
that has been used in the deep neural model
literature is to use word embedding.
And the idea is that words
that occur very frequently with each other
appear very close to each other
in a high-dimensional space.
This is called embedding: mapping words
into some high-dimensional space.
This has been very popular these days
and people are achieving a lot of great results
using deep neural models.
But in order to achieve meaningful representations
we usually require lots of training data.
And also, because they are not that intuitive,
they are very hard to interpret and explain,
and it is much harder to do complex reasoning
using these types of representations.
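The embedding idea above can be sketched with toy vectors. The numbers below are made up for illustration; real embeddings are learned from co-occurrence statistics over large corpora.

```python
import numpy as np

# Toy embedding table: real embeddings are learned from co-occurrence
# statistics over large corpora (e.g. word2vec or GloVe vectors).
embeddings = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.88, 0.82, 0.12]),
    "apple": np.array([0.10, 0.20, 0.95]),
}

def cosine(u, v):
    # Words that co-occur frequently end up close in the embedding
    # space, i.e. with high cosine similarity.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much lower
```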
But what if we can add some structure
to these neural representations?
For that let's look at the domain
of visual illustrations or in particular
let's focus on diagrams.
A lot of different types of visual data
can be found in a diagram form.
So for example we have diagrams in textbooks,
we have diagrams showing us how
to assemble furniture,
we have work flow diagrams, and so on.
These types of images are inherently different
from natural images, and they usually try
to depict some complex phenomenon
that is very hard to show
with just a single image or multiple sentences.
But if we use an out-of-the-box
neural embedding model and then try
to represent these diagrams into
some neural representations
that probably wouldn't work.
Why? Because we're going
to lose a lot of information
that is hidden in the structure of these diagrams.
Also, there are a lot of ambiguities
involving these diagrams.
Like arrows might mean different things
in different types of diagrams.
In one diagram it might mean consuming,
in some other diagram it might mean
water transformation, for example
in a water-cycle diagram.
The way we tackle this is to design
a representation that works for a wide range
of diagrams and that respects
the structure hidden inside the diagram.
In particular we introduce diagram parse graphs
where every node in the graph actually shows
a constituent in the diagram.
For the diagram at the bottom
you can see there are nodes
for blobs, for text, or for arrows.
And then the edges show how these constituents
are related to each other.
For example, there are intra-object relationships,
like a text describing a blob,
or inter-object relationships,
where two objects are related to each other.
And then later we can take different components
from the diagram and then basically encode those
into some neural representation.
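The diagram parse graph just described can be sketched as a small data structure. The class and relation names below are illustrative, not the actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Constituent:
    id: str
    kind: str        # "blob", "text", or "arrow"
    content: str = ""

@dataclass
class DiagramParseGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src, dst, relation)

    def add(self, c: Constituent):
        self.nodes[c.id] = c

    def relate(self, src: str, dst: str, relation: str):
        # relation is e.g. "intra-object label" (text describing a
        # blob) or "inter-object" (two objects linked by an arrow).
        self.edges.append((src, dst, relation))

# Food-web fragment: "fox" labels one blob, "rabbit" another, and an
# arrow links the fox blob to the rabbit blob (foxes consume rabbits).
g = DiagramParseGraph()
g.add(Constituent("b1", "blob"))
g.add(Constituent("t1", "text", "fox"))
g.add(Constituent("b2", "blob"))
g.add(Constituent("t2", "text", "rabbit"))
g.relate("t1", "b1", "intra-object label")
g.relate("t2", "b2", "intra-object label")
g.relate("b1", "b2", "inter-object")  # the arrow: fox -> rabbit
```

Each component can then be encoded into a neural representation while the edges preserve the structure the raw pixels would lose.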
Later in the talk I will show how
we can leverage these representations
and do a good job
in question answering about diagrams.
So I showed you two different ends
of the spectrum for representing input.
One are symbolic representations
which are great because they allow us
to do more complex reasoning but they usually work
for a specific domain.
On the other end of the spectrum
are neural representations which are great
because they can cover a wide range of inputs
and they are easier to learn
if you have enough training data.
But on the other hand,
they are much harder for reasoning.
So most of my work has been focused on
making the logical formalisms a little bit softer
or neural representations more structured
and in future I'd really like to be able
to combine these directions.
Let's move on to the second part of my talk
which is on designing neural models
for question answering about multi-modal data.
Let's first look at this task
in the textual domain.
This has also been called reading comprehension
in natural language processing
and it's a well-studied problem.
The idea is we have a question
like which NFL team represented the NFC
at Superbowl 48 and there is a context paragraph
given to us as an input.
Then the goal is to find the answer
to the question.
Which is going to be Seattle Seahawks
and it usually can be found inside the paragraph.
A conventional approach to solve these problems
is a pipeline approach.
It involves feature engineering to map
the question and the context
to some features, like the words
that appear in the question and context,
their frequency, their similarities, and so on,
and then training a classifier that tells us
whether a phrase is the correct answer
to the question or not.
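Such a feature-engineered pipeline can be sketched in a few lines. The two features and the fixed weights below are invented for illustration; the real pipeline trains a classifier over many richer features.

```python
# A toy feature-engineered scorer for candidate answer phrases.
WEIGHTS = {"cand_in_q": -1.0, "sent_overlap": 2.0}

def features(question, candidate, sentence):
    q = set(question.lower().split())
    return {
        # Penalize candidates that just repeat question words.
        "cand_in_q": float(any(w in q for w in candidate.lower().split())),
        # Reward candidates whose sentence overlaps the question.
        "sent_overlap": len(q & set(sentence.lower().split())) / len(q),
    }

def score(question, candidate, sentence):
    f = features(question, candidate, sentence)
    return sum(WEIGHTS[k] * v for k, v in f.items())

question = "Which NFL team represented the NFC at Super Bowl 48"
sentence = ("The Seattle Seahawks represented the NFC at Super Bowl 48 "
            "and defeated the Denver Broncos")
print(score(question, "Seattle Seahawks", sentence))
print(score(question, "NFC", sentence))  # lower: repeats a question word
```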
When we apply this method to a very popular
recent question answering data set,
it achieves only about 52-53% accuracy
on these problems.
And as you see there is a large gap
between this pipeline approach
and human performance.
And we believe that the reason is
there is a disconnect between the feature engineering
or the representation and also
how we design the classifier.
Also, we think that these features do not do a good job
of representing the text or the interaction
between the question and the text.
So what we have done is to introduce
a neural approach to address this problem.
What we want is to map the question and context
into some neural representation
and then at the same time learn a function
that assigns a really high score
to the correct answer of the question.
Basically this function
has a domain and a range.
The domain would be the neural representation
from the question and context
and the output is the distribution
over the words appearing in context,
such that the correct phrase like the Seattle Seahawks
gets the highest score.
But how do we learn such a function and representation?
Let's look at the similarity information
between the question and the context paragraph.
What we want is to find what word or phrases
in the context are really important
in answering the question.
Let's look at this phrase:
National Football Conference
This is probably an important phrase.
It is relevant to NFL, to NFC, and to Super Bowl 48
in the question.
But then let's look at another word,
defeated, in the context.
This is probably a less important word
to understand, because it is only relevant
to something like 'represented' in the question.
This is called an attention mechanism
and it has been very popular these days
in both NLP and computer vision.
I can very loosely define an attention mechanism
by using human visual attention.
For example, if I want to focus
on the stop sign in this image,
I basically look at the part of this image
with high resolution where the stop sign is located
and the other parts of the image with lower resolution.
So I'm going to look at the most important
parts of the image.
This has been used in NLP
for tasks like machine translation as well.
Most of the time the attention is looked at
in one direction, like the attention
from the context paragraph to the question.
But what we observe in this work is that
it is important to look at the attention
from the other direction as well.
Let me give you some insight and then
I will dig into the details.
For this direction we want to see
for every word in the question
or every phrase in the question
what are the most important parts of the context,
or what is the critical information from the context.
For example, NFL teams would be related
to Seattle Seahawks and Denver Broncos
because both of them are teams in the NFL.
But then let's look at another phrase like NFC.
It is most likely relevant
to National Football Conference, but the thing is
it is more relevant to Seattle Seahawks
compared to Denver Broncos.
Why, because Seattle Seahawks is part of NFC
while Denver Broncos is not.
So this actually helps us assign a higher score
to the correct answer to the question.
So here are the details of how
we implement the attention.
We first compute a similarity matrix using all the words
that are appearing in my context paragraph
and all the words in the question
then for every word in the context
I compute the distribution of how it is similar
to the words in the question.
And then I take a weighted combination of all the words
in the question for each context word.
Now I have a new distribution over the context words.
For the second direction what we do is again
to build on this similarity information
between the question and context words.
But this time we're going to look
at every word in the question
and compute its distribution to how much
it is similar to the words in the context.
So now we have one distribution with respect
to each of the question words.
We aggregate all of those,
and now we have a new distribution that represents
how important the context words are
for us to answer the question.
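The two attention directions can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the exact BiDAF formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(C, Q):
    """A sketch of both attention directions, assuming C (T x d) holds
    context word vectors and Q (J x d) question word vectors."""
    S = C @ Q.T                      # similarity matrix, T x J

    # Context-to-question: for each context word, a distribution over
    # question words, then a weighted combination of question vectors.
    a = softmax(S, axis=1)           # T x J, rows sum to 1
    c2q = a @ Q                      # T x d

    # Question-to-context: for each question word, a distribution over
    # context words; aggregating these says which context words matter
    # most for answering the question.
    b = softmax(S, axis=0)           # T x J, columns sum to 1
    importance = b.mean(axis=1)      # length T, aggregated
    q2c = importance[:, None] * C    # T x d, re-weighted context

    return c2q, q2c

rng = np.random.default_rng(0)
C, Q = rng.normal(size=(5, 4)), rng.normal(size=(3, 4))
c2q, q2c = bidirectional_attention(C, Q)
```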
So this is great, we have been able
to bring in information from the question side
and have a better representation of the context paragraph.
But what is missing is how we can incorporate information
from the structure and the sequential nature
of the sentences.
In particular, we added a new function
that encodes the structure
and sequential information
from the context and sees how these
interact with each other.
And finally this will be the output
of our system in scoring the phrases.
More particularly, we have introduced
a deep neural model called bi-directional
attention flow.
This is a hierarchical architecture
that has different layers.
These layers are designed such that
they add richer understanding of the input.
And basically we have the representations of the input
at different levels of granularity
according to these different layers.
And here is the detailed architecture
of our system.
Don't get scared of this diagram
but what it mainly shows is that
each of these nodes tries to represent
a word into some neural representation.
We have different layers, that each layer
is responsible to capture some information
about the context and the question.
For example we have character embedding layer
that tries to deal with unknown words
in the vocabulary.
We have attention flow layer that tries
to bring in information from the question,
And we have modeling layer that tries to capture
the structure of the sentence
into building the representation.
Now that we have a representation
we pass all of these to an output layer
that can change according to different applications.
But for this particular task
we wanted to compute distributions
over the start index and the end index
of the answer phrase.
Basically we predict a p-start and a p-end distribution.
Then at the training stage we bring in training data
and we optimize this objective function
which maximizes the log probabilities
that these predicted distributions,
p-start and p-end, assign
to the ground-truth start index
and the ground-truth end index,
y-start and y-end.
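The training objective can be sketched as follows; the toy distributions are made up for illustration.

```python
import numpy as np

def span_loss(p_start, p_end, y_start, y_end):
    """Negative log likelihood of the gold start and end indices."""
    return -(np.log(p_start[y_start]) + np.log(p_end[y_end]))

# Toy distributions over a 5-word context; the gold answer phrase
# spans words 2..3, so the loss pushes mass toward those indices.
p_start = np.array([0.05, 0.10, 0.70, 0.10, 0.05])
p_end   = np.array([0.05, 0.05, 0.10, 0.75, 0.05])
loss = span_loss(p_start, p_end, y_start=2, y_end=3)
print(loss)
```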
And then once we do training we basically
learn the parameters and then use the model in action.
At test time the input to the system
are, again, the question and the context paragraph.
We run them through our neural model,
through all the layers with all the learned parameters,
and now we find out what is the most likely phrase
that is the answer to the question.
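Decoding the most likely phrase can be sketched as a search over candidate spans; this is one common way to do it, not necessarily the exact procedure used here.

```python
import numpy as np

def best_span(p_start, p_end, max_len=10):
    """Pick the (start, end) pair with start <= end that maximizes
    p_start[start] * p_end[end]."""
    best, best_score = (0, 0), -1.0
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_len, len(p_end))):
            score = ps * p_end[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

p_start = np.array([0.05, 0.10, 0.70, 0.10, 0.05])
p_end   = np.array([0.05, 0.05, 0.10, 0.75, 0.05])
print(best_span(p_start, p_end))  # (2, 3)
```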
Let's see how our model works in practice.
We evaluated this model in a very popular
question answering data set
that includes about 100k questions and paragraphs,
all drawn from the most popular articles on Wikipedia.
And we evaluated on how well it could answer the questions.
As of January 1 last year we were state of the art
and we were the first on this leaderboard,
this question answering leaderboard.
And our system was able to achieve about 81% accuracy.
And the reason we score higher than
other teams, we believe, is that we leverage
this bi-directional attention.
Also this hierarchical nature, this modular nature
of our representation and our model
is helping a lot with capturing
more insight about the input.
So since then a lot of teams are competing
in this domain.
And now, these days, there are about 60-70 teams
on the leaderboard; some of them
are built on our models,
some are completely different systems,
but it is interesting that the best model now,
at least as of January 1, 2018,
actually builds on our BiDAF by adding
the new word embedding ELMo, and it gets
about 85% accuracy.
We have also evaluated BiDAF on other data sets.
Some of this has been done by my group
and some by other researchers elsewhere.
Basically we have achieved the state of the art
on a set of articles from CNN where
the question answering is in the form
of cloze style tests,
state of the art on some other Wikipedia
question answering data sets, on zero-shot
relation extraction data sets, and on a new data set
that requires multi-hop reasoning.
We also tried to incorporate such similar ideas
into another modality.
In particular we showed that if we add
a little bit more structure to these neural representations
we are able to leverage similar ideas
in a diagram question answering task.
In particular, we introduced this challenge
of answering questions about diagrams
that are taken from textbooks.
We have collected about 15k questions and diagrams,
and we have questions like: in this food web,
how many consumers consume kelp?
Or: what happens to the water
in the sea on a sunny day?
So there are different varieties of questions.
There are a lot of ambiguities
so this is obviously more challenging
than just question answering about only a language modality.
So here is the architecture of our system.
We basically applied a similar setup
to understand questions and map them
to some neural representation.
Then we build the diagram parse graph,
which adds some structure to the diagram representation,
take different components of the diagram,
and encode them into some neural representation.
And then we compute attention over how similar
they are to each other, and answer the questions.
Our results are promising.
Compared to another method
that uses only deep neural models,
without these structured representations,
our method achieves almost 15% better results,
which is a significant gain.
And our system is able to answer these types of questions
Like the diagram depicts the life cycle of what?
Or how many stages of growth does the diagram depict?
The second one is more difficult.
It requires to have a better understanding
of the diagram.
Let me show you a demo of my system
on how we answer questions about textual data.
So I hope you all see the demo.
So the input to the system are a paragraph
and a question and then we submit
and we want to see how we answer the question.
Let me show you some examples.
The first paragraph is about Nikola Tesla.
If I ask this question, in what year was Nikola Tesla born,
if I submit it will give me 1856.
And it's pretty interesting because there is no
explicit mention anywhere but you see
that the first number in the parentheses is 1856
and our system is able to learn
that usually the first number is associated
with the birth year.
Let's look at another question that requires
richer understanding of the question.
The article is about
the Intergovernmental Panel on Climate Change.
I can ask this question.
What organization is the IPCC part of?
If I submit the question it will give me United Nations,
which is right, and as you can see here is a hint.
"IPCC is a scientific inter-governmental body
"under the auspices of the United Nations."
So it shows that it can do kind of complex paraphrasing.
Or let's look at another question that requires
a little bit of reasoning.
This article is about the Rhine River in Europe.
And the question is asking what is the longest river
in Central and Western Europe.
If I submit this question it gives me Danube,
but there is no explicit hint, no explicit mention,
that the Danube is the longest river.
You can find it here: "It is the 2nd largest river
in Central and Western Europe,"
where "it" refers to the Rhine, after the Danube.
So basically the system is able to do
some single-step reasoning to understand
that the Danube is actually the longest.
Let me show you some mistakes that our system is making
because probably that's more interesting
to show how we can make improvements.
Let's look at this article, Oxygen.
I want to write my own question.
What does the element with atomic number eight belong to?
We expect the system to tell us it's a member
of the chalcogen group,
understanding that the element with atomic number eight
is actually oxygen. But the system makes a mistake:
it does understand that this is oxygen,
but another step is required to understand
that it's a member of the chalcogen group.
Let me push on this reasoning side a little further.
Let me write down my own story.
Liz has nine black kittens.
She gave some of her kittens, or let me give a number.
She gave three kittens to John.
John had five kittens.
Then I'm going to ask my question
which is how many kittens...
does Liz have.
So the system is not able to answer this question.
It finds that 'okay Liz initially had nine black kittens.'
But it's not able to do reasoning to understand
that some number of kittens are decreased
from the initial number of kittens.
So this is basically the focus
of the next part of my talk.
So just to summarize, so far,
I have talked about designing a deep, modular
neural model that can do question answering
on wide coverage input that includes text
and also diagrams.
The remaining challenges are what can we do
when the questions require more complex reasoning,
especially when the training data is limited.
And that's the focus of next part of my talk.
I am interested in introducing new challenges
that humans can solve but current AI systems
cannot address.
In particular I have looked at the domain
of geometry and algebra word problems,
trying to design algorithms
that can automatically solve them.
Addressing those problems requires rich understanding
of the input and also the ability
to do complex reasoning while training data is limited.
An interesting test bed for all of these problems
that I'm introducing is algebra word problems.
We have stories like the one I just entered in my demo:
Liz had nine black kittens and something happened
to the number of kittens; now how many kittens
does she have, or did John get?
Designing algorithms that can automatically
solve these problems has been an AI challenge
for a long time, even since 1963,
but the approaches that earlier AI researchers
were taking was basically using some rules
to map questions into equations.
But that does not generalize to new domains.
Especially because these algebra word problems
are designed on a child's world knowledge
and they can vary a lot.
There can be questions on daily life,
on shopping,
or on science experiments.
There are no prior constraints
on the syntax or semantics that have been used
in these domains.
And we sometimes require knowledge
in order to solve some of these problems.
For example, in order to compute
the number of people who began living
in a country, we need to know
that we have to add the number of people who were born
in that country and the number of people who immigrated
to that country.
There are some words in those stories
that don't matter much, like the word kitten
can be replaced with many different things
like book, toy, balloons.
As long as we do it consistently we should be fine.
But some words like this verb give,
in this story it plays an important role.
If we replace it with receive the whole story
and the final equation would be completely different.
There are some irrelevant or missing information
in these stories.
For example, the story tells me
that Mary cuts some more roses
from her flower garden.
It never explicitly tells us that these roses
are actually being put in the vase.
It is very easy for us humans
to understand that,
but this is not so obvious for machines.
There are ambiguities involved
like, for example for the first story
we need to add the number of games that are lost
and the number of games that are won
in order to find the total number of games.
But in the second story we need
to subtract the number of balloons that are lost
from the total number of balloons
to find the remaining number of balloons.
So to really understand the stories
we need to combine all these sentences
and understand these sequences
of sentences all together.
We have started this challenge
in 2014 with a few colleagues and since then
it has attracted a lot of interest
in the AI and NLP communities.
And a lot of people are looking at this problem.
So one idea for learning to solve algebra word problems
would be something like this.
What if we directly learn equations from text,
and map text to the equation.
But when we apply this approach to a data set
of 5th-grade math questions, it fails,
getting only about 62% accuracy.
Our solution is to get closer
to how humans try to solve these problems:
look at all the quantities
that appear in these problems
as sets of entities, and track
how those entities change
across different world states.
So let's look at this example.
Liz gave some of her kittens to John.
We started with some number of quantities
about kittens that was our initial set
and it had one container which was Liz.
And according to this sentence
these quantities are transferred
between two different sets or two different containers.
Now Liz has fewer kittens
and John has more kittens.
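The transfer idea can be sketched as a state update over containers; this is a toy illustration of the world-state view, not the actual system.

```python
# Quantities live in containers; a verb like "give" is read as a
# transfer between containers.

def transfer(state, src, dst, amount):
    new_state = dict(state)
    new_state[src] = new_state.get(src, 0) - amount
    new_state[dst] = new_state.get(dst, 0) + amount
    return new_state

state = {"Liz": 9}                          # Liz has nine kittens.
state = transfer(state, "Liz", "John", 3)   # She gave three to John.
print(state)  # {'Liz': 6, 'John': 3}
```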
But not all the numbers that are appearing in equations
strictly follow the order that those numbers
appear in those stories.
Let's look at this example.
On Monday, 375 students went on a trip to the zoo.
All seven buses were filled and four students
had to travel in cars.
How many students were in each bus?
In order to solve this problem
we probably need to first compute this part,
which is multiplying the number of buses
which is seven, to the unknown number
of students that are inside each bus
to find out what is the total number
of students in buses.
And basically using this idea we're going
to represent math word problems
using semantically tied equation trees.
And the idea is that every leaf in those trees
shows the quantities and the sets
that appear in the problem,
and the intermediate nodes show math operators:
how these sets are being combined
with each other.
All those intermediate nodes
are also typed, meaning that they are
of a type, for example students, or money.
Then our problem is reduced
to finding the best tree that represents this word problem.
The space of equation trees
for a given problem is huge.
In particular, for a problem that includes
about six quantities, the search space
is 1.7 million trees.
But the good news is we can compute the score
of these trees in a bottom-up approach.
For example, we can learn
some local scoring functions
that score all these sub-trees,
multiply all of these scores together,
and then see how
they should be combined with each other
according to the global information
that we are getting from the problem.
Basically, we reduce the problem
of scoring these equation trees
to learning a local function
that tries to score sub-trees
with respect to some parameters,
and also a global function
that tries to score whole trees,
to see how these sub-trees should be combined
with each other.
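The bottom-up scoring idea can be sketched as follows. The uniform `local_score` below is a stand-in for the learned multi-class operator classifier, and the enumeration is a naive illustration of the search space, not the actual system.

```python
from itertools import product

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b if b else None}

def local_score(op, left, right):
    return 0.25  # placeholder for the learned classifier's score

def enumerate_trees(quantities):
    """Yield (value, score) for every binary equation tree over the
    quantities, scored bottom-up as a product of local scores."""
    if len(quantities) == 1:
        yield quantities[0], 1.0
        return
    for i in range(1, len(quantities)):
        for (lv, ls), (rv, rs) in product(
                list(enumerate_trees(quantities[:i])),
                list(enumerate_trees(quantities[i:]))):
            for op, fn in OPS.items():
                value = fn(lv, rv)
                if value is not None:
                    yield value, ls * rs * local_score(op, lv, rv)

# "375 students; all seven buses were filled and four travelled in
# cars": the correct tree computes (375 - 4) / 7 = 53 students per bus.
values = [v for v, s in enumerate_trees([375, 4, 7])]
print(53.0 in values)  # True
```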
Then we learn those functions.
To learn the local function,
we train a multi-class classifier
that takes a pair of entities as input
and returns one of the four
math operators: addition, subtraction,
multiplication, or division.
The features that we use
to train this classifier are the textual relationships
between the two entities that we have extracted
from text, and we also incorporate the semantics
we have extracted for all of those entities.
Then in order to compute the global scoring function
we have a discriminative classifier
that tries to score a good tree versus bad trees.
And again, the features we take advantage of
are global features
extracted from text.
For inference, we leverage integer linear programming
to generate candidate trees for us
that are consistent according to the types
we are getting from the problem.
For example, we have type constraints
like this: the type of the left-hand side
of the equality should match
the type of the right-hand side of the equality.
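The actual system generates type-consistent candidates with integer linear programming; as a toy stand-in, the same kind of constraint can be illustrated with a recursive unit check over a candidate tree. The unit rules below are simplified assumptions, not the system's constraint set.

```python
# Each quantity carries a type (unit). A minimal consistency check
# under simplified rules: +/- require matching types; * of a bare
# count and a typed quantity yields the typed quantity's type.

def infer_type(node):
    """node is either ('q', value, unit) or (op, left, right).
    Returns the unit of the subtree, or None if inconsistent."""
    if node[0] == "q":
        return node[2]
    op, left, right = node
    lt, rt = infer_type(left), infer_type(right)
    if lt is None or rt is None:
        return None
    if op in ("+", "-"):
        return lt if lt == rt else None
    if op == "*":
        if lt == "count":
            return rt
        if rt == "count":
            return lt
        return None
    if op == "/":
        # e.g. balls / balls -> a bare count (simplified rule)
        return "count" if lt == rt else lt

# (4 packs + 4 packs) is consistent; packs + balls is rejected.
ok = ("+", ("q", 4, "packs"), ("q", 4, "packs"))
bad = ("+", ("q", 4, "packs"), ("q", 8, "balls"))
print(infer_type(ok), infer_type(bad))
```

A candidate tree whose root type comes back `None` would never be proposed by the ILP.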
Our results are promising compared to an approach
that does not use this representation.
We get about 72% accuracy, about 10% gain.
And even more recently some other researchers
have built on our approach,
using deep reinforcement learning
to do a better job of combining
these sub-trees, and they achieved
about 78% accuracy on this test.
These are some problems
that our system is able to solve.
Our system can combine, via set difference,
the numbers of packs, like four, eight, and four,
and then multiply them by the number
of bouncy balls in each pack.
Or we can form a long range of additions
and subtractions mainly informed
by the verbs that are appearing in the question.
We are still not able to solve problems like this one,
where Sara, Keith, Benny, and Alyssa are mentioned;
it's hard to infer that it's talking
about four people.
So in this part I talked about algebra word problems
which is a new challenge in the NLP and AI literature.
I showed how we can reduce learning to solve
algebra word problems to learning
to map text to math operators.
And if we can solve this problem,
it's actually a step toward having
an understanding of multiple sentences together,
a precise understanding of this type of text,
and doing a better job in question answering.
Let's push a little further on the reasoning side.
And also let's bring in another modality.
For that we have focused on automatically solving
geometry word problems.
This is much more challenging than the algebra domain
because not only do we need to understand the text
of the question, where most of the challenges
I described also hold,
we also need to understand the diagram part
of the question and be able to align the two.
Understand that for example secant AB
is actually referring to that A B line
in the diagram.
So I'm not going to go into the details
how we exactly solve the vision part
or the language part, but just give you an intuition.
I would like to go from the text and the diagram
into some logical representation
that allows me to do complex reasoning.
And obviously learning these representations
directly from data would be very difficult
so what we do is make the representations
a little softer and then extract
all geometry concepts that exist in the problem,
like ABC, line DE, line AC and so on,
and then try to form how they can be related
to each other or basically find what are the possible
geometry relations that exist
between the geometry concepts.
So we have something like ABC the triangle
or line AC and DE are parallel with each other.
You might even have a wrong relation
like AC and AD are parallel with each other.
Then what we like to do is to be able
to score these relations according
to the text that we are observing from the question
and also according to the diagram
that we are observing from the question.
For scoring them according to the text
we follow an idea very similar
to what I just described for the algebra domain.
We would like to form a classifier
or different classifiers, such that
they learn what is the best relation
between two geometry concepts.
And to compute the diagram scores
we use standard vision techniques
to get a rough estimate
of what this diagram looks like,
refine that into an accurate representation,
and then score these relations
according to the diagram.
And then once I have all of these scores, I would like
to align my knowledge across the text and diagram scores,
solve an optimization task, and find the best subset
of those relations.
But this is also an important challenge,
how we align textual and visual data here.
Let me give you some intuition.
You want to find a set of relations
that actually have high score according to the text
and according to the diagram.
Also, we want to cover most of the important facts
that are mentioned in the text and the diagram.
Also, we want it to be coherent
meaning that the relationships
shouldn't conflict or shouldn't contradict each other.
The search space is huge,
combinatorial in fact,
but the good news is we could formulate
a submodular optimization function
that allows us to gradually select important relations,
so our optimization is efficient
but at the same time gets us something
that is close to optimal.
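The paper formulates this as submodular maximization; the toy sketch below only illustrates the greedy flavor of that selection, with made-up scores, a hypothetical conflict rule, and a coverage bonus, none of which are the system's actual objective.

```python
# Candidate geometry relations with hypothetical text+diagram scores.
candidates = {
    ("parallel", "AC", "DE"): 0.9,
    ("parallel", "AC", "AD"): 0.3,   # the wrong relation from the slide
    ("triangle", "A", "B", "C"): 0.8,
}

def conflicts(r1, r2):
    # Toy coherence rule: two 'parallel' relations sharing a line
    # label are treated as contradictory here.
    return (r1[0] == r2[0] == "parallel"
            and set(r1[1:]) & set(r2[1:]))

def coverage_gain(rel, covered):
    # Reward relations that mention concepts not yet covered.
    return len(set(rel[1:]) - covered)

def greedy_select(candidates, k=2, cover_weight=0.2):
    chosen, covered = [], set()
    while len(chosen) < k:
        best, best_gain = None, 0.0
        for rel, score in candidates.items():
            if rel in chosen or any(conflicts(rel, c) for c in chosen):
                continue
            gain = score + cover_weight * coverage_gain(rel, covered)
            if gain > best_gain:
                best, best_gain = rel, gain
        if best is None:
            break
        chosen.append(best)
        covered |= set(best[1:])
    return chosen

print(greedy_select(candidates))
```

Greedy selection on a monotone submodular objective carries the classic (1 - 1/e) approximation guarantee, which is what makes this efficient yet close to optimal.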
Then we have mapped the question
in the form of text and diagram
to some logical representation.
Now I will bring in my knowledge
about geometry which are some theorems and axioms
that appear in the geometry domain.
I will do reasoning and then finally
answer the question.
Our results are promising.
Basically, we show that we can achieve
about 52% accuracy on automatically solving
geometry word problems and we have achieved
significant gains compared to just a rule-based system
or when we only look at the text or the diagram.
And again it's very interesting
to see that our system is able to beat
a student average in automatically solving
these SAT word problems.
That was kind of exciting
and there was a New York Times article
featuring this work.
So in this part of the talk
I mainly focus on symbolic representation
for complex reasoning and I introduce
two new challenges,
one on automatically solving algebra word problems,
the other on automatically solving geometry word problems.
I showed that these intermediate representations matter,
but the main idea was how we can relate concepts
in the math domain or in the geometry domain.
In order to form these relationships
and classify them, we actually require knowledge
about those basic operators,
either in the geometry domain or in the math domain.
An important question to ask is how can we generalize it
to more complex domains.
And that's actually the focus of my future directions.
I would like to design AI systems
that can have rich understanding
and can do complex reasoning on a wide range
of multi-modal inputs including textual or visual data.
There are a few components that I need to build
in order to make these AI systems achievable.
One is trying to collect and acquire knowledge.
Some parts of knowledge are given to us explicitly
but a lot of pieces of knowledge are hidden.
How can we acquire that hidden knowledge
so we can do a better job in reasoning?
Another important direction I would like to pursue
is how we can leverage the benefits
of symbolic representations and neural representations,
covering a wide range of inputs
while we can still do complex reasoning.
And also, in order to make these systems
really applicable, I want to design scalable algorithms.
And finally I would like to take these
into new applications for example
in tutoring applications.
Here are some ideas for how we can collect knowledge.
A lot of important information is hidden,
for example about entity attributes.
We want to collect information, for example,
about object sizes, and we might be able
to capture those just by looking
at how objects co-occur with each other,
usually dogs are bigger than cats and so on,
by looking at multi-modal data.
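As a toy illustration of this idea, one could mine comparative patterns from text and vote over them. The corpus snippets and the extraction pattern here are invented; a real system would aggregate over much larger multi-modal data.

```python
import re
from collections import Counter

# Hypothetical corpus snippets; the idea is to aggregate noisy
# comparative mentions into a size ordering.
corpus = [
    "the dog is bigger than the cat",
    "a cat is larger than a mouse",
    "my dog is bigger than that cat",
]

pattern = re.compile(r"(\w+) is (?:bigger|larger) than (?:the |a |that )?(\w+)")

votes = Counter()
for sentence in corpus:
    for big, small in pattern.findall(sentence):
        votes[(big, small)] += 1

def probably_bigger(a, b):
    # Majority vote over the mined comparisons.
    return votes[(a, b)] > votes[(b, a)]

print(probably_bigger("dog", "cat"))
```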
Or how can we collect knowledge
about events and their structure
by looking at their temporal information
that we get from, again, a large collection
of multi-modal data.
Then, if we have these types of knowledge extracted,
how can we incorporate them into our system?
For example, we can add them to our modular representations
and have algorithms for aggregation and reasoning
as new representations
and new knowledge resources come in.
It is also very important to have a scalable algorithm
to understand different types of inputs
because the input to the problem can be a paragraph
and it can go all the way to world wide web, right?
How can we design algorithms that can understand
a wide range of inputs and be scalable.
One important direction I want to pursue is
to borrow ideas from information retrieval
and try to hash candidate answers
that might be the answer to the question,
and then use those in, for example, search engines and so on.
But at the same time we want to have
a deeper and richer understanding.
Another important direction is how we can read text faster.
We already have some preliminary results
on incorporating ideas
from human speed reading to design
neural speed reading approaches.
So far, our preliminary results
show that we can get almost the same accuracy
while reading about three times faster.
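Real neural speed readers learn when to skip; the sketch below illustrates only the control flow, using a stopword list as a stand-in "gate" and a fake cell in place of the expensive encoder update. None of this is the actual model.

```python
# Toy sketch of the skim-reading control flow: a cheap per-token
# gate decides whether to run the expensive encoder update or
# skip it, so most tokens cost almost nothing.

CHEAP_SKIP = {"the", "a", "an", "of", "to", "in", "and", "is"}

def skim_read(tokens, full_update, state=0.0):
    read = 0
    for tok in tokens:
        if tok in CHEAP_SKIP:            # gate says: skip, keep state
            continue
        state = full_update(state, tok)  # expensive path
        read += 1
    return state, read / len(tokens)

# Stand-in for an RNN cell: just accumulates token lengths.
def fake_rnn(state, tok):
    return state + len(tok)

tokens = "the dog in the garden is bigger than the cat".split()
state, frac_read = skim_read(tokens, fake_rnn)
print(frac_read)
```

Here only half the tokens take the expensive path, which is where the speedup at near-constant accuracy comes from.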
I would like to incorporate all of these
in, for example, tutoring and education applications.
Such a tutoring system needs
two important components:
automatically solving or generating problems,
and interacting
with the students.
On one side, the system needs
to know the mistakes that the student is making
and explain to them how to solve those.
In the other direction, the system can work
as a study buddy with the student,
trying to actually acquire knowledge
from the student and then helping them
to understand the problems better.
We have already done some preliminary work
on generating word problems and collecting knowledge
from students.
And then I'm going to get to this part
of my future work which I think is very important.
The idea is how we can leverage
both of these representations and bring them
closer together such that we have
more complex reasoning but at the same time
can cover a wide range of inputs.
One potential idea that I had is
to design a network that can learn
different reasoning operators, in particular
we can have something like this
that can take the world state,
the current fact that we are observing,
and a question, update the world state
and then reduce the query.
We basically want to borrow ideas
from logical reasoning literature
reduce the query to something
that is simpler to answer.
Let me show you an example.
My current state is Daniel is holding the apple.
Then the observation is that Daniel journeyed
to the garden.
The question is asking 'where is the apple?'
I want to update the world state.
Now I know Daniel is holding the apple
and Daniel is in the garden, but the important part
of the world state that I'd like to focus on
is actually where Daniel is.
That is the key point:
I want to simplify the query.
Given, for example, a story and a question
like 'where is the apple?', where the apple
passes between different people,
I would be able to apply my network step by step
and each time try to answer a simpler query.
The first one would be 'where is Daniel?'.
The second one is still 'where is Daniel?', and so on.
And we can even stack these different layers together
and have more complex reasoning.
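A purely symbolic sketch of this query-reduction idea on the example above: each observation updates a small world state, and "where is the apple?" is reduced to "where is its holder?". The update rules are hand-written here, whereas the proposed network would learn them.

```python
def answer(story, obj):
    """story is a list of (who, verb, where/what) observations."""
    holder, location = {}, {}
    for who, verb, arg in story:
        if verb == "grabbed":
            holder[arg] = who        # who is holding the object
        elif verb == "journeyed":
            location[who] = arg      # where the person is now
    # Reduce the query: where is the object -> where is its holder.
    person = holder.get(obj)
    return location.get(person)

story = [
    ("Daniel", "grabbed", "apple"),
    ("Daniel", "journeyed", "garden"),
]
print(answer(story, "apple"))
```

Stacking more reduction steps (apple -> Daniel -> garden, through several people) is exactly the multi-hop behavior the learned layers would need to reproduce.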
To summarize, I introduced two different methods
using neural and symbolic approaches.
Both of them try to go beyond pattern matching
and to achieve a rich understanding of the input
while being able to do complex reasoning.
I showed that the neural model worked well
with different types of input.
The symbolic representations can do
great complex reasoning.
And in future I would like to integrate
both directions to leverage the benefits
of both systems.
Thank you. (applause)
- [Noah Smith] We have time for questions.
- [Dan] This is an outsider's curiosity
question, but on the SATs sort of the algorithm
versus the human, do you have a sense
of which style of question answering
more benefits from there being multiple choice,
or if they even leverage multiple choice
in the same way or a totally different way?
- So actually we didn't leverage multiple choices
in our setup.
And I think if we did, we could even get better numbers.
Basically that's how we handled that.
Sorry, I forgot the rest of the question.
So I think if we did leverage the multiple choices
we could even get better numbers,
because when we did the reasoning
we translated all of the logical formulas
into numerical equations and then solved
those equations.
If we couldn't solve them, we didn't answer,
because we wanted to avoid the penalty for wrong answers.
But if we had used the different
multiple choices, we could have made sure
that some of them definitely are not working,
and therefore removed some choices
and guessed fifty-fifty, but we didn't do that.
We didn't use any human test-taking trick for answering...
- For example, probably less common
in the geometry domain but if you happen
to make a natural language mistake
that a human is very unlikely to make
you may get an answer that isn't
one of the choices and you should just try again.
- Sure, so no, we didn't leverage
the multiple choice, that's a good idea.
And actually like about 30% of our errors
are natural language errors, very good observation.
And some of it is not that
we don't understand the sentence.
The hard part is how to resolve co-references
across different sentences.
For example, if the problem is talking
about different lines and it says 'each other',
we didn't know which of those two lines it meant...
So that is one category of questions.
And there were 30% of questions
that were really complex and required
external knowledge; for example, one said
a polygon is hidden under a piece of paper.
It is an obvious thing for a child
to know what is happening there,
but our system didn't have any idea.
- [Magda] Can you say more
about the scalability trade-off? Because one thing is,
if I have this SAT problem,
there are so many practice problems
and they're all going to be so similar,
all applying the same rules
and the same patterns, but in the knowledge
there are also things
that are less common, less frequent,
that can be applied. On the one hand, scaling up
may stop being sound, but on the other hand
it can capture some of the sense
of popular information.
- Sure, absolutely, so Magda is asking my view
about the scalability and the trade off.
So that's absolutely right.
The first thing that I showed you
was how we can do faster reading of the input
while getting similar accuracy
in question answering.
Right now our number
is about 85%, and with speed reading we got around 83%.
We thought it is fine to make 2% more mistakes
if at the same time we can be faster,
able to read the text faster.
So I completely agree there are definitely trade offs.
The same kind of trade-off exists
between complex, more specific reasoning
and having wide coverage,
being very generic and general.
So I think it really depends
on the application domain.
But again, one direction that I really like
to pursue is the following.
So right now, when you search something in Google
it will give you
a really quick response, because they have probably hashed
a lot of indexes and can easily find
the relevant document.
But if they really want to have
a good understanding of the meaning,
that will be hard just by looking
at document-level hashes;
document-level hashing
is probably very high level, right?
What if we could go a little bit further
inside the document and hash different words
in context?
So for example, if I have something like
'Barack Obama was the 44th president in 2009',
I want to hash information about Barack Obama
once with respect to '44th' and once
with respect to '2009'.
Now if the question asks me
who was the president in 2009, I can easily take
a dot product between the vectors
that I hashed and what I've got
from the question.
So I would be able to be very accurate
but at the same time make it much faster,
going from linear time,
reading the whole question
and the text, to something like log-linear time,
if I can be smart in hashing this stuff.
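A toy sketch of this contextual-indexing idea, with hand-made vectors standing in for learned contextual embeddings; in a real system a nearest-neighbor index over these keys would make the lookup sublinear.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Index: contextualized keys -> answer strings. The vectors are
# hypothetical stand-ins for learned contextual embeddings of
# 'Barack Obama' in its '44th' and '2009' contexts.
index = [
    ([1.0, 0.0, 0.3], "Barack Obama"),   # "president ... 2009" context
    ([0.0, 1.0, 0.1], "44th"),           # ordinal context
]

def query(question_vec, index):
    # Answer by dot product against precomputed keys, instead of
    # re-reading the documents at question time.
    scores = [(dot(question_vec, key), value) for key, value in index]
    return max(scores)[1]

# "Who was the president in 2009?" encoded (hypothetically) close
# to the first key:
print(query([0.9, 0.1, 0.2], index))
```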
- [Sonja] How much interpretability is
important in this domain?
I mean when you give an answer would you
want the users to understand the reasoning?
Are you working on that part?
- So Sonja is asking how important
interpretability is, especially for these neural
representations.
Absolutely it is very important.
I think interpretability and explainability
are two topics very relevant to each other
but not exactly the same thing.
I might even be able to explain
some of my rationales for how I decided
to choose this answer, but still
my models are not that interpretable.
I highly agree that if they are interpretable
it's much easier for me to explain
but I might be able to get around it.
Without interpretability I can still
explain this stuff.
But I agree this is a very important direction.
I am focused more on explainability
than interpretability, but I agree
those are both very important.
For example, one common practice in language
is to visualize where the attention
is going, or, for example,
to project to a lower dimension
and see whether the words that are similar
to each other are close together,
and whether that makes sense.
- [Noah] One quick question if she'll let me.
So you may have already answered this
when you answered Dan's question but imagine
that you could get the best NLP group
in the world to work on one problem
and really move the needle on it
would it be co-reference resolution,
would it be something else?
What would help you move the numbers
on one of these tasks?
- I would say understanding multiple sentences together.
Co-reference resolution is part of it,
but this sequential understanding
I think it is important, like - Some version of discourse.
- Some version of discourse, right.
So for example, nowadays we are doing
a really good job on sentence-level understanding,
but understanding the whole story together,
or how things are connected with each other,
that's actually going to be the first step
towards being able to do multi-hop reasoning
and complex reasoning, connecting different things together.
- [Noah Smith] Okay, I think we are out of time
let's thank the speaker again.
(applause) - Thank you.