Hello!
In this video, we'll be covering the k-nearest neighbors algorithm.
The K-Nearest Neighbors algorithm (K-NN or KNN) is a supervised learning method used
for classification and regression.
* For classification, the output of the K-NN algorithm is the classification of an unknown
data point based on the k 'nearest' neighbors in the training data.
* For regression, the output is an average of the values of a target variable based on
the k 'nearest' neighbors in the training data.
In a classification problem, the K-NN algorithm works as follows:
1. Pick a value for K.
2. Search for the K observations in the training data that are 'nearest' to the measurements of the unknown data point.
3. Predict the response of the unknown data point using the most popular response value among the K nearest neighbors.
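To make those three steps concrete, here's a minimal sketch in Python; the points, labels, and the choice of K below are made up purely for illustration.

```python
from collections import Counter
import math

def knn_classify(training_points, training_labels, unknown_point, k):
    """Predict a label for unknown_point using the three K-NN steps above."""
    # Step 2: measure the distance from the unknown point to every training point...
    distances = [
        (math.dist(point, unknown_point), label)
        for point, label in zip(training_points, training_labels)
    ]
    # ...and keep the K observations that are 'nearest'
    nearest = sorted(distances)[:k]
    # Step 3: predict using the most popular response among those K neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Step 1: pick a value for K (toy data, purely for illustration)
points = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0), (6.0, 9.0), (1.2, 0.9)]
labels = ["Red", "Red", "Blue", "Blue", "Red"]
print(knn_classify(points, labels, (1.4, 1.5), k=3))  # -> "Red"
```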
What happens if we choose a very low value of K, let's say, K=1?
If we're trying to classify our data points into two categories, Red and Blue for instance, then we may get a diagram that looks like this.
The black dotted line represents the curve that separates the Red and Blue classification
regions.
We want to choose a K that will create a separation of the regions that best matches this line.
As you can see, there are very jagged edges following the data points. Now if we have an
out-of-sample data point, let's say, here.
The nearest point (since K=1) would be Blue, the point it lands right on top of.
This would be a bad prediction, since most of the points around it are Red, so the point should really be considered Red.
But since the prediction is Blue, we can say that we captured the noise in the data, or that we chose a point that was an anomaly in the data.
Another way to visualize this would be something like the diagram on the right.
It looks very similar to the diagram on the left, but I want you to think about how
the division line between these two categories would look.
Pause the video now and think about it for a bit!
Alright, keeping that in mind, it could look something like this
A low value of K produces a highly complex model, which leads to overfitting of the dataset.
Overfitting is bad, as we want a general model that works for any data, not just the training
data.
Now on the opposite side of the spectrum, if we choose a very high value of K, such
as K=100, then the model becomes overly generalized.
This makes it difficult to classify a point that falls in between the two regions.
The white region is a spot where neither Blue nor Red can be decided.
So let's say we have an out-of-sample data point here, can you tell which region that
belongs on?
Probably not.
And if we look at the black dotted line that we want to follow, we can see that the model
doesn't follow it at all.
And finally, this is what a decent value of K might look like.
Here we've used K=16 and we can see a divide in the data that mimics the black dotted line,
producing a fairly accurate model for this dataset.
Also, looking at the regions, there are no areas of noise like we saw in the diagram
with a low K.
The lighter shades in each region simply mean that the model is less certain about how
to classify a data point in that area.
Bringing back that model we had from before, how do you think the line would look
in this case?
I think that it might look something like this.
This line is a lot more general than the one we saw before.
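The diagrams in the video were produced separately, but as a rough sketch of how you might fit a K=16 model and inspect its certainty in a given area, here is one way to do it with scikit-learn's KNeighborsClassifier; the synthetic data below is just a stand-in for the red and blue points from the diagrams.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the red/blue points from the diagrams
rng = np.random.default_rng(0)
X_red = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X_blue = rng.normal(loc=[2, 2], scale=1.0, size=(100, 2))
X = np.vstack([X_red, X_blue])
y = np.array(["Red"] * 100 + ["Blue"] * 100)

# K=16, as in the 'decent value of K' diagram
model = KNeighborsClassifier(n_neighbors=16)
model.fit(X, y)

# The class probabilities correspond to the shading: values near 0.5
# are the lighter, less certain areas between the two regions.
query = np.array([[1.0, 1.0]])
print(model.predict(query), model.predict_proba(query))
```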
Another key point to consider is that there are some difficulties when dealing with out-of-sample
data.
Out-of-sample data means data that is outside of the dataset used to train the model.
Since we are dealing with out-of-sample data in the real world, it is hard to determine
whether the predictions using the out-of-sample data will produce an accurate result.
This is because we don't know what the result of the out-of-sample data should be, thus
making it hard to determine the accuracy.
Later on we'll discuss how to evaluate a model's ability to classify out-of-sample data.
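We'll cover evaluation properly later on, but as a quick preview, one common approach is to hold back part of the labelled data and treat it as out-of-sample, so the true answers are known and the predictions can be scored. This sketch uses scikit-learn's built-in iris dataset purely as a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold back 25% of the labelled data to stand in for out-of-sample points
# whose true answers we *do* know, so the predictions can be scored.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```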
A good example of where Machine Learning can be applied is email.
Think about all the emails that you receive every day.
Do you get 10, 20, 100, maybe even a thousand a day?
How much of that is ham or spam?
Your email service likely uses machine learning to help categorize this, based on specific words
that may indicate that an email is spam, and it uses that information to categorize
and relate emails that are similar.
* K-Nearest Neighbors regression problems are concerned with predicting the outcome
of a dependent variable given a set of independent variables.
To begin, let's consider the diagram shown above, where a set of points are drawn from
the relationship between the dependent variable y and the independent variable x.
Given the set of training data (the unshaded circles which lie on the curve), we use the
k-nearest neighbors algorithm to predict the outcome Y of the query point X = 5.6.
* First, let's consider the 1-nearest neighbor method as an example.
In this case, we search the training set and locate the training point closest to the query point
X.
For this particular case, this happens to be x6.
The outcome of x6 (i.e., y6) is then taken to be the answer for the outcome of X (i.e.,
Y).
Thus for 1-nearest neighbor we can write: Y = y6
* This point (X,Y) is represented by the red circle on the diagram.
* Next, let's consider the 2-nearest neighbor method.
In this case, we locate the two closest points to X, which happen to be x5 and x6.
Taking the average of their outcomes (y5 and y6), the solution for Y is then given by:
Y = (y5 + y6) / 2
* This point (X,Y) is represented by the green circle on the diagram.
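The actual x and y values on the diagram aren't listed here, so the numbers in this sketch are made up, but the 1-nearest and 2-nearest neighbor logic is the same.

```python
def knn_regress(xs, ys, query_x, k):
    """Average the outcomes of the k training points closest to query_x."""
    # Sort training points by distance to the query point and keep the k nearest
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query_x))[:k]
    return sum(y for _, y in nearest) / k

# Toy training data standing in for (x1, y1) ... (x6, y6) on the diagram
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.8, 0.9, 0.1, -0.7, -1.0, -0.3]

X_query = 5.6
print(knn_regress(xs, ys, X_query, k=1))  # 1-nearest neighbor: Y = y6
print(knn_regress(xs, ys, X_query, k=2))  # 2-nearest neighbors: Y = (y5 + y6) / 2
```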
Alternatively, we can also approach regression problems using the weighted k-nearest neighbors
algorithm.
This algorithm works by assigning weights to the k-nearest neighbors depending on the
distance of the points to the query point.
The closer a neighboring point is to the query point, the larger its assigned weight
will be.
The sum of the weights must be equal to 1.
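The video doesn't name a specific weighting scheme, so this sketch assumes one common choice: inverse-distance weights, normalized so that they sum to 1.

```python
def weighted_knn_regress(xs, ys, query_x, k):
    """Weight each of the k nearest neighbors by inverse distance, normalized to sum to 1."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query_x))[:k]
    # Closer points get larger raw weights (assumes no neighbor sits exactly on the query point)
    raw = [1.0 / abs(x - query_x) for x, _ in nearest]
    total = sum(raw)
    weights = [w / total for w in raw]  # weights now sum to 1
    return sum(w * y for w, (_, y) in zip(weights, nearest))

# Same toy data as the sketch above
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.8, 0.9, 0.1, -0.7, -1.0, -0.3]
print(weighted_knn_regress(xs, ys, 5.6, k=2))
```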
Thanks for watching!