The K-Nearest Neighbors (KNN) algorithm is one of the most widely used learning algorithms because of its simplicity. So what is it?
KNN is a lazy-learning, non-parametric algorithm. It uses labelled data from several classes to predict the class of a new sample point. KNN is non-parametric because it makes no assumptions about the distribution of the data being studied; that is, the model structure is determined from the data.
What does it mean to say KNN is a lazy algorithm? It means it does not use the training data points to build a generalised model ahead of time. This implies:
- You expect little to no explicit training phase,
- The training phase is pretty fast,
- KNN keeps all the training data, since it is needed during the testing phase.
Most real-world data does not obey the theoretical assumptions that models such as linear regression rely on, which makes KNN valuable when studying data with little or no prior knowledge.
Where was KNN born?
KNN was born out of research done for the armed forces. Evelyn Fix and Joseph Hodges, two statisticians, wrote a technical report for the USAF School of Aviation Medicine in 1951 introducing the KNN algorithm.
KNN is a Supervised Learning Algorithm
A supervised machine learning algorithm is one that relies on labelled input data to learn a function that produces an appropriate output when given unlabeled data.
Machine learning algorithms are commonly divided into two broad categories: supervised and unsupervised learning.
In supervised learning, you train your model on a labelled set of data and ask it to predict the label for an unlabeled point. For example, a tumour prediction model is trained on many clinical test results that are classified as either positive or negative. The trained model can then predict whether an unlabeled test is positive or negative.
It works much like the way a teacher or a parent teaches a child new things. If a teacher wants the child to learn what an elephant looks like, the teacher shows the child pictures of elephants, and then pictures of animals that are not elephants, such as zebras and monkeys.
When the picture shows an elephant, the teacher says, “elephant!”; when it does not, the teacher says, “no, not an elephant!” After doing this for a while, the teacher shows the child a picture and asks “elephant?”, and the child will (mostly) correctly say “elephant!” or “no, not an elephant!” depending on the picture. That is supervised learning. When we substitute the child with a computer, it becomes supervised machine learning.
We train the computer using the labelled data already available to us. Given a dataset of observations (x, y), we want to learn a function g: X → Y so that, for a new input x, g(x) predicts the corresponding output y.
Where to use KNN
KNN can be used in both regression and classification predictive problems. In industrial problems, however, it is mostly used for classification, since it fares well across all the parameters evaluated when determining the usability of a technique:
- Prediction Power
- Calculation Time
- Ease to Interpret the Output
The KNN algorithm fares well across all of these considerations, but it is mostly used for its ease of interpretation and low calculation time.
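Since KNN covers regression as well as classification, a minimal sketch of the regression case may help: instead of voting on a class, the prediction is the average of the target values of the K nearest neighbors. The function name `knn_regress` and the toy data are assumptions made for illustration only.

```python
import math

def knn_regress(train, query, k):
    """Predict a number as the mean target of the k nearest training points."""
    # Sort training rows by Euclidean distance to the query and keep the first k.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(target for _, target in nearest) / k

# Hypothetical data: (house size in square metres,) paired with price in thousands.
train = [((50.0,), 150.0), ((60.0,), 180.0), ((80.0,), 240.0), ((120.0,), 360.0)]

# The two nearest sizes to 70 are 60 and 80, so the estimate is their mean price.
estimate = knn_regress(train, (70.0,), k=2)
print(estimate)  # 210.0
```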
The primary step in Machine Learning
KNN is very simple and is often used as a benchmark for more complex classifiers such as support vector machines (SVM) and artificial neural networks (ANN).
How is it employed in daily problems?
Despite its simplicity, KNN can outperform more powerful classifiers on some problems and is used in areas such as genetics, data compression, and economic forecasting.
- Political science – classifying a potential voter as “vote Republican” or “vote Democrat”, or as “will vote” or “will not vote”.
- Banking – KNN can be used to predict whether a person is fit for loan approval, or whether he or she has traits similar to those of a defaulter.
- Credit ratings – KNN can help estimate an individual’s credit score by comparing that individual with people who have similar traits.
Other areas that use the KNN algorithm include Video Recognition, Image Recognition, Handwriting Detection, and Speech Recognition.
Companies Using KNN
Companies like Amazon or Netflix use KNN when recommending books to buy or movies to watch. Netflix even offered a $1 million prize to the team that could come up with the most accurate recommendation algorithm!
How do these companies make recommendations? They gather data on the books you have read or the movies you have watched on their website and apply KNN. They take your available customer data and compare it with that of other customers who have purchased similar books or watched similar movies.
The books and movies recommended depend on how the algorithm classifies that data point.
How does KNN work?
Contributed by: Augustine Joseph
The k-nearest neighbor algorithm stores all the available data and classifies a new data point based on a similarity measure (e.g., a distance function). This means that when new data appears, it can easily be classified into a well-suited category using the KNN algorithm.
Suppose there are two classes, Class A and Class B, and we have a new unknown data point “?”. Which of these classes does this data point belong to? To answer this, we can use the KNN algorithm, which makes it easy to identify the class of a particular data point. The data point is classified by a majority vote of its neighbors: it is assigned to the class most common among its K nearest neighbors, as measured by a distance function.
Consider the diagram below:
Here, we can see that if k = 3, the three nearest neighbors of the data point are found using the chosen distance function, and the data point is assigned to a class by the majority vote of those neighbors. For k = 3 in the diagram above, that is Class B. Similarly, for k = 7, the majority vote of the neighbors assigns the data point to Class A.
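The effect of K on the vote can be reproduced with a small sketch. The neighbor labels below are made up so that they behave like the scenario just described (Class B wins for k = 3, Class A wins for k = 7); they are not read from the diagram, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical labels of the query point's neighbors, already sorted from
# nearest to farthest (illustrative data, not taken from the diagram).
sorted_neighbor_labels = ["B", "B", "A", "A", "A", "B", "A"]

def majority_vote(labels, k):
    """Return the most common label among the first k (nearest) neighbors."""
    return Counter(labels[:k]).most_common(1)[0][0]

vote_k3 = majority_vote(sorted_neighbor_labels, 3)  # neighbors B, B, A
vote_k7 = majority_vote(sorted_neighbor_labels, 7)  # four A votes, three B votes
print(vote_k3, vote_k7)  # B A
```

The same point receives a different class purely because more distant neighbors are allowed to vote, which is why the choice of K matters.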
The KNN algorithm applies the “birds of a feather flock together” principle: it assumes that similar things are near each other.
KNN relies on the idea of similarity (sometimes called closeness, proximity, or distance), using mathematics you may have learned as a child when calculating the distance between points on a graph.
Euclidean distance, or straight-line distance, is a popular and familiar choice for calculating distance.
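As a minimal sketch, Euclidean distance is the square root of the sum of squared coordinate differences. The function name `euclidean_distance` is chosen here for illustration; Python's standard library offers the same calculation as `math.dist`.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
d = euclidean_distance((0, 0), (3, 4))
print(d)  # 5.0
```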
Choosing the right value for K
To get the right K, run the KNN algorithm several times with different values of K and select the one that produces the fewest errors.
The right K must be able to accurately predict data that it hasn’t seen before, so the errors should be counted on data that was held out from training.
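One way to sketch this search is to hold out a small validation set, count the misclassifications for each candidate K, and keep the K with the fewest. The data, the candidate values (1, 3, 5), and the helper names `knn_predict` and `count_errors` are all assumptions for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def count_errors(train, validation, k):
    """Number of validation points that KNN with this k labels incorrectly."""
    return sum(knn_predict(train, x, k) != y for x, y in validation)

# Hypothetical toy data: (features, label) pairs. The last training row is a
# deliberately noisy label sitting inside the "A" cluster.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.1), "A"),
         ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.9, 3.3), "B"),
         ((1.1, 1.3), "B")]
validation = [((1.0, 1.2), "A"), ((0.9, 0.9), "A"),
              ((3.1, 3.1), "B"), ((2.8, 3.0), "B")]

# Try each candidate K and keep the one with the fewest validation errors.
errors_by_k = {k: count_errors(train, validation, k) for k in (1, 3, 5)}
best_k = min(errors_by_k, key=errors_by_k.get)
print(errors_by_k, best_k)
```

With this toy data, k = 1 is misled by the single noisy neighbor, while larger values vote it down. On real data the error count usually rises again once K grows too large.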
Things to guide you as you choose the value of K
- As K approaches 1, your prediction becomes less stable.
- As K increases, your predictions become more stable because of majority voting, up to a point.
- When you start seeing an increasing number of errors, you know you are pushing K too far.
- When taking a majority vote between two classes, K is usually chosen to be an odd number so that ties cannot occur.
Working of KNN Algorithm in Machine Learning
To better understand how it works, here are the steps the KNN algorithm follows:
Step 1 – Every implementation needs a data set, so start by loading the training and the test data.
Step 2 – Choose K, the number of nearest data points to consider. K can be any positive integer.
Step 3 – For each point in the test data, do the following:
3.1 – Calculate the distance between the test point and each row of the training data using Euclidean, Manhattan, or Hamming distance. Euclidean distance is the most commonly used.
3.2 – Sort the training rows in ascending order of distance.
3.3 – From the sorted array, choose the top K rows.
3.4 – Assign the test point the class that appears most often among these rows.
Step 4 – End
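The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the data is hard-coded instead of loaded from a file, Euclidean distance is assumed, and the function name `classify` is hypothetical.

```python
import math
from collections import Counter

def classify(train_rows, test_point, k):
    """Assign a class to one test point by following steps 3.1 to 3.4."""
    # Step 3.1: distance from the test point to every training row.
    distances = [(math.dist(features, test_point), label)
                 for features, label in train_rows]
    # Step 3.2: sort in ascending order of distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3.3: keep the top K rows.
    top_k = distances[:k]
    # Step 3.4: assign the most frequent class among those rows.
    return Counter(label for _, label in top_k).most_common(1)[0][0]

# Step 1: "load" the training and test data (hypothetical, hard-coded here).
train_rows = [((1.0, 2.0), "A"), ((2.0, 1.0), "A"), ((1.5, 1.5), "A"),
              ((6.0, 5.0), "B"), ((5.0, 6.0), "B"), ((5.5, 5.5), "B")]
test_points = [(1.2, 1.8), (5.8, 5.2)]

# Step 2: choose K.
k = 3

# Step 3: classify each test point in turn.
predictions = [classify(train_rows, point, k) for point in test_points]
print(predictions)  # ['A', 'B']
```

Note that every prediction scans the whole training set, which is why the prediction stage slows down as the data grows.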
Some KNN Advantages and Disadvantages
Some Advantages of KNN
- Quick calculation time – there is little or no training to do
- Simple algorithm – easy to interpret
- Versatile – useful for regression and classification
- High accuracy – often accurate enough for practical use, although not guaranteed to match more sophisticated supervised learning models
- No assumptions about data – no need to make additional assumptions, tune several parameters, or build a model. This makes it especially useful for nonlinear data.
Some Disadvantages of KNN
- Accuracy depends on the quality of the data
- With large data, the prediction stage might be slow
- Sensitive to the scale of the data and irrelevant features
- Requires a lot of memory – all of the training data must be stored
- Because it stores all of the training data, it can be computationally expensive
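The sensitivity to scale is easy to demonstrate. In the sketch below, income is measured in far larger numbers than age, so it dominates the raw Euclidean distance. Rescaling every feature to the range [0, 1] (min-max scaling, one common remedy and an assumption here, not something the article prescribes) changes which neighbor is nearest. The data and helper names are hypothetical.

```python
import math

def min_max_scale(rows):
    """Rescale every feature column to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    # A constant column has no spread, so map it to 0.0 to avoid dividing by zero.
    return [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                  for v, lo, hi in zip(row, lows, highs))
            for row in rows]

def nearest_to_first(points):
    """Index of the point (other than the first) closest to the first point."""
    return min(range(1, len(points)),
               key=lambda i: math.dist(points[0], points[i]))

# Hypothetical rows: (age in years, income in dollars).
rows = [(25, 50_000), (60, 51_000), (27, 54_000), (45, 90_000)]

raw_nearest = nearest_to_first(rows)                    # income dominates
scaled_nearest = nearest_to_first(min_max_scale(rows))  # both features count
print(raw_nearest, scaled_nearest)  # 1 2
```

On the raw values, the 60-year-old with a similar income looks closest to the 25-year-old. After scaling, the 27-year-old does.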
A Quick Summary of KNN Algorithm
- K is a positive integer
- For a new sample, you have to specify K
- The K entries in the database closest to the new sample are selected
- KNN doesn’t learn any model
- KNN makes predictions using the similarity between an input sample and each training instance.
This blog has given you the fundamentals of one of the most basic machine learning algorithms.
KNN is a great place to start when first learning to build models based on different data sets.
A data set with a lot of different points and accurate information is the best place to begin with KNN.
Keep these three points in mind:
- A data set with lots of different points and labelled data is ideal.
- R and Python are popular languages for working with KNN.
- To find the most accurate results from your data set, you need to learn the correct practices for using this algorithm.