
Getting to Grips with kNN - the k-Nearest Neighbours Algorithm

In this blog entry I am going to explain a little about the kNN (k-Nearest Neighbours) algorithm and walk you through the example given in the book ‘Machine Learning with R’ by Brett Lantz (2013).

kNN is often referred to as a lazy learner in machine learning terms, as it does not really learn anything; instead, it merely stores the training data verbatim. This allows the training phase to occur very rapidly, with the downside that making predictions tends to be relatively slow. Technically speaking, no abstraction occurs.

According to Lantz, (2013), "As instance-based learners do not build a model, the method is said to be in a class of non-parametric learning methods—no parameters are learned about the data. Without generating theories about the underlying data, non-parametric methods limit our ability to understand how the classifier is using the data. On the other hand, this allows the learner to find natural patterns rather than trying to fit the data into a preconceived form".


The kNN algorithm begins with a training dataset made up of examples that are classified into several categories, as labeled by a nominal variable. Assume that we have a test dataset containing unlabeled examples that otherwise have the same features as the training data. For each record in the test dataset, kNN identifies k records in the training data that are the "nearest" in similarity, where k is an integer specified in advance. The unlabeled test instance is assigned the class of the majority of the k nearest neighbours (Lantz, 2013).
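
To make the idea concrete, here is a minimal R sketch of that majority-vote logic. It is for illustration only; the book's walkthrough uses the ready-made knn() function from the class package, and the names here (knn_predict, train, labels) are my own.

  # Minimal kNN sketch: classify one unlabeled point by majority vote
  # among its k nearest training examples, using Euclidean distance.
  knn_predict <- function(train, labels, new_point, k = 3) {
    # Euclidean distance from new_point to every row of the training matrix
    dists <- sqrt(rowSums(sweep(train, 2, new_point)^2))
    # Take the labels of the k closest training examples
    nearest <- labels[order(dists)[1:k]]
    # Return the most common label among those neighbours
    names(which.max(table(nearest)))
  }

  # Toy usage: two features, two classes
  train  <- matrix(c(1, 1, 1.2, 0.9, 5, 5, 5.1, 4.8), ncol = 2, byrow = TRUE)
  labels <- c("B", "B", "M", "M")
  knn_predict(train, labels, new_point = c(1.1, 1.0), k = 3)  # returns "B"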

In general, nearest neighbour classifiers are well-suited for classification tasks where relationships among the features and the target classes are numerous, complicated, or otherwise extremely difficult to understand, yet the items of similar class type tend to be fairly homogeneous (Lantz, 2013).


The following example walks through the approach step by step. Some detail is omitted out of respect for the book’s copyright. If you wish to read about this subject in more detail, the book is available through Packt Publishing, [Birmingham and Mumbai].


Step 1 – Collect the Data

The dataset used was the “Breast Cancer Wisconsin (Diagnostic)” dataset from the UCI Machine Learning Repository, which is available at http://archive.ics.uci.edu/ml. The data was donated by researchers at the University of Wisconsin and includes measurements from digitized images of fine-needle aspirate of a breast mass. The values represent characteristics of the cell nuclei present in the digital image.


The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis, and 30 are numeric-valued laboratory measurements. The diagnosis is coded as M to indicate malignant or B to indicate benign.


The 30 numeric measurements comprise the mean, standard error, and worst (that is, largest) value for 10 different characteristics of the digitized cell nuclei.

These include:

  • Radius

  • Texture

  • Perimeter

  • Area

  • Smoothness

  • Compactness

  • Concavity

  • Concave points

  • Symmetry

  • Fractal dimension

Based on their names, all of the features seem to relate to the shape and size of the cell nuclei. Unless you are an oncologist, you are unlikely to know how each relates to benign or malignant masses. These patterns will be revealed as we continue in the machine learning process.


Step 2 – Explore and Prepare the Data

  1. Read in the CSV file from your working directory: wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

  2. Visualise the data structure.

  3. Then drop the first column, which is the patient’s unique identifier and of no use to us here (see the code sketch below).
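
In code, those three steps might look like the following (a sketch assuming, as in the book, that wisc_bc_data.csv is in your working directory):

  # Read in the CSV file, keeping character columns as plain strings
  wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

  # Inspect the structure: 569 observations of 32 variables
  str(wbcd)

  # Drop the first column (the patient ID), which has no predictive value
  wbcd <- wbcd[-1]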

The screen shots at each stage will show my progress using RStudio.

First Visualise the Data

  1. After looking at the structure of the data and removing the first column, let us see what values the diagnosis variable (our target variable) contains so far.

  2. Then change the diagnosis variable to a factor and re-label its levels.

  3. Output a summary of just three of the independent variables (see the code sketch below).
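
A sketch of those three steps follows; the column names (radius_mean and so on) match the naming used in the book’s copy of the dataset:

  # 1. See what values the diagnosis variable holds
  table(wbcd$diagnosis)

  # 2. Recode diagnosis as a factor with more informative labels
  wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                           labels = c("Benign", "Malignant"))

  # 3. Summarise three of the thirty numeric features
  summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])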

The output of these steps reveals a problem with the measurement scales: smoothness ranges from 0.05 to 0.16, while area ranges from 143.5 to 2501.0.
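
These scale differences matter for kNN because the distance calculation would otherwise be dominated by features with large ranges, such as area. The usual remedy, which the book applies a little later in the walkthrough, is min-max normalisation; a minimal sketch, assuming the ID column has already been dropped so that columns 2 to 31 hold the 30 numeric features:

  # Min-max normalisation: rescale a numeric vector to the range [0, 1]
  normalize <- function(x) {
    (x - min(x)) / (max(x) - min(x))
  }

  # Apply to the 30 numeric features (columns 2 to 31 after the ID was dropped)
  wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

  # area_mean now runs from 0 to 1 instead of 143.5 to 2501.0
  summary(wbcd_n$area_mean)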

Change the Diagnosis Variable to a Factor, Re-label, and Produce a Summary
