
The kNN (k-Nearest Neighbours) Algorithm for Classification - Explained with an Example

In this blog entry I am going to explain a little about the kNN (k-Nearest Neighbours) algorithm and walk you through the example given in the book 'Machine Learning with R' by Brett Lantz (2013).


kNN is often referred to as lazy learning in machine learning terms because no model is actually built during training; the algorithm simply stores the training examples in memory and defers all the work until a new instance needs to be classified.


The kNN algorithm begins with a training dataset made up of examples that are classified into several categories, as labeled by a nominal variable. Assume that we have a test dataset containing unlabeled examples that otherwise have the same features as the training data. For each record in the test dataset, kNN identifies k records in the training data that are the "nearest" in similarity, where k is an integer specified in advance. The unlabeled test instance is assigned the class of the majority of the k nearest neighbors (Lantz, 2013).
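To make that concrete, here is a minimal sketch of the idea in R. This is not the book's code (the book uses the ready-made knn() function from the class package); the function knn_predict and the toy data below are my own illustrative names.

# Classify one unlabeled point by majority vote among its k nearest
# training points, using Euclidean distance.
knn_predict <- function(train_x, train_labels, new_x, k = 3) {
  # distance from the new point to every training example
  dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, as.numeric(new_x))^2))
  # indices of the k closest training examples
  nearest <- order(dists)[1:k]
  # majority vote among their class labels
  names(which.max(table(train_labels[nearest])))
}

# Toy usage: two features, two classes
train_x <- data.frame(x1 = c(1, 2, 8, 9), x2 = c(1, 2, 8, 9))
train_labels <- c("A", "A", "B", "B")
knn_predict(train_x, train_labels, c(1.5, 1.5), k = 3)   # "A"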


In general, nearest neighbor classifiers are well-suited for classification tasks where relationships among the features and the target classes are numerous, complicated, or otherwise extremely difficult to understand, yet the items of similar class type tend to be fairly homogeneous (Lantz, 2013).


The following step-by-step approach does not go into minute detail, partly because I wish to avoid a complete copy of the book, which is available through Packt Publishing, Birmingham and Mumbai.


Step 1 – Collect the Data

The dataset used was the "Breast Cancer Wisconsin Diagnostic" dataset from the UCI Machine Learning Repository, which is available at http://archive.ics.uci.edu/ml. This data was donated by researchers at the University of Wisconsin and includes measurements from digitized images of fine-needle aspirate of a breast mass. The values represent characteristics of the cell nuclei present in the digital image.


The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis, and 30 are numeric-valued laboratory measurements. The diagnosis is coded as M to indicate malignant or B to indicate benign.


The 30 numeric measurements comprise the mean, standard error, and worst (that is, largest) value for 10 different characteristics of the digitized cell nuclei.

These include:

• Radius

• Texture

• Perimeter

• Area

• Smoothness

• Compactness

• Concavity

• Concave points

• Symmetry

• Fractal dimension

Based on their names, all of the features seem to relate to the shape and size of the cell nuclei. Unless you are an oncologist, you are unlikely to know how each relates to benign or malignant masses. These patterns will be revealed as we continue in the machine learning process.


Step 2 – Explore and Prepare the Data

1. Set your working directory

2. Read in the CSV file (> wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE))

3. First visualise the data structure

4. Then remove the first column, the patient's unique identifier, which is of no use to us here.

The screenshots at each stage will show my progress using RStudio; a minimal R sketch of these steps also follows below.
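Here is what those four steps might look like in R - a sketch assuming the file layout of the book's wisc_bc_data.csv, with the patient ID in the first column; the path in setwd() is a placeholder.

# 1. Set the working directory (placeholder path)
setwd("~/path/to/your/data")

# 2. Read in the CSV file, keeping strings as characters for now
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

# 3. Visualise the data structure: 569 observations of 32 variables
str(wbcd)

# 4. Drop the first column, the patient's unique identifier
wbcd <- wbcd[-1]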


1. After looking at the structure of the data and removing the first column, let us see what values the diagnosis variable, our target variable, contains so far.

2. Then change the diagnosis variable to a factor and re-label its levels.

3. Output a summary of just three of the independent variables (the corresponding R commands are sketched below).
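A sketch of those three steps in R, assuming the column names used in the book (radius_mean, area_mean and smoothness_mean; adjust them if your CSV differs):

# 1. Tabulate the target variable before recoding
table(wbcd$diagnosis)

# 2. Recode diagnosis as a factor with informative labels
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))

# 3. Summarise just three of the independent variables
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])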