k-Nearest Neighbour (kNN) Algorithm for Classification: Explained with an Example

In this blog entry I am going to explain a little about the k-Nearest Neighbour (kNN) algorithm and walk you through the example given in the book ‘Machine Learning with R’ by Brett Lantz (2013).


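kNN is often referred to as a lazy learning method in machine learning terms because it does not build a model from the training data. It simply stores the training examples and defers all the work until a prediction is requested.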


The kNN algorithm begins with a training dataset made up of examples that are classified into several categories, as labeled by a nominal variable. Assume that we have a test dataset containing unlabeled examples that otherwise have the same features as the training data. For each record in the test dataset, kNN identifies k records in the training data that are the "nearest" in similarity, where k is an integer specified in advance. The unlabeled test instance is assigned the class of the majority of the k nearest neighbors (Lantz, 2013).
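To make the description above concrete, here is a minimal sketch of the idea in base R. The helper name knn_predict and the tiny data set are made up for illustration, and Euclidean distance is assumed as the measure of "nearest"; this is not the implementation used later in the post.

```r
# Minimal kNN classifier in base R (illustrative only; knn_predict is a
# hypothetical helper, not a function from any package).
knn_predict <- function(train, test, labels, k) {
  train <- as.matrix(train)
  test <- as.matrix(test)
  apply(test, 1, function(row) {
    # Euclidean distance from this test record to every training record
    dists <- sqrt(rowSums(sweep(train, 2, row)^2))
    # Labels of the k closest training records
    nearest <- labels[order(dists)[1:k]]
    # Majority vote (a tie goes to whichever class table() lists first)
    names(which.max(table(nearest)))
  })
}

# Made-up training data: two features, two classes
train <- data.frame(x = c(1.0, 1.2, 0.8, 5.0, 5.2, 4.8),
                    y = c(1.0, 0.9, 1.1, 5.0, 5.1, 4.9))
labels <- c("B", "B", "B", "M", "M", "M")

# Two unlabeled test records, one near each group
test <- data.frame(x = c(1.1, 4.9), y = c(1.0, 5.0))

pred <- knn_predict(train, test, labels, k = 3)
print(pred)
```

Each test record is compared against every training record at prediction time, which is why the method is called lazy.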


In general, nearest neighbor classifiers are well-suited for classification tasks where relationships among the features and the target classes are numerous, complicated, or otherwise extremely difficult to understand, yet the items of similar class type tend to be fairly homogeneous (Lantz, 2013).


The following step-by-step approach does not go into minute detail, partly because I wish to avoid making a complete copy of the book, which is available through Packt Publishing, Birmingham and Mumbai.


Step 1 – Collect the Data

The data set used was the “Breast Cancer Wisconsin Diagnostic” dataset from the UCI Machine Learning Repository, which is available at http://archive.ics.uci.edu/ml. This data was donated by researchers at the University of Wisconsin and includes measurements from digitized images of fine-needle aspirate of a breast mass. The values represent characteristics of the cell nuclei present in the digital image.


The breast cancer data includes 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis, and 30 are numeric-valued laboratory measurements. The diagnosis is coded as M to indicate malignant or B to indicate benign.


The 30 numeric measurements comprise the mean, standard error, and worst (that is, largest) value for 10 different characteristics of the digitized cell nuclei.

These include:

• Radius

• Texture

• Perimeter

• Area

• Smoothness

• Compactness

• Concavity

• Concave points

• Symmetry

• Fractal dimension

Based on their names, all of the features seem to relate to the shape and size of the cell nuclei. Unless you are an oncologist, you are unlikely to know how each relates to benign or malignant masses. These patterns will be revealed as we continue in the machine learning process.


Step 2 – Explore and Prepare the Data

1. Set your working directory

2. Read in the CSV file (> wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE))

3. First visualise the data structure

4. Then drop the first column, which is the patient’s unique identifier and is of no use to us here.

The screenshots at each stage will show my progress using RStudio.
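As a rough sketch of steps 2 to 4, the snippet below uses a tiny made-up data frame in place of the real file so that it runs on its own. The column values are invented; only the file name and the wbcd variable name come from the steps above.

```r
# In the real analysis the data would come from the CSV file:
# wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
# A tiny made-up stand-in with a similar layout, so the sketch is runnable:
wbcd <- data.frame(
  id = c(101, 102, 103, 104),
  diagnosis = c("B", "M", "B", "M"),
  radius_mean = c(12.3, 20.1, 11.8, 18.7),
  area_mean = c(464.1, 1260.0, 432.0, 1094.0),
  smoothness_mean = c(0.10, 0.09, 0.08, 0.11),
  stringsAsFactors = FALSE
)

str(wbcd)         # inspect the structure of the data
wbcd <- wbcd[-1]  # drop the first column (the patient identifier)
names(wbcd)       # diagnosis is now the first column
```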


1. After looking at the structure of the data and removing the first column, let us see how the values of the diagnosis variable, which is our target variable, are distributed.

2. Then change the diagnosis variable to a factor and relabel its levels.

3. Output a summary of just three of the independent variables.
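These three steps might look roughly like the following. The data frame is again a small made-up stand-in, the labels "Benign" and "Malignant" are an assumed choice of wording, and the three summarised columns are picked for illustration.

```r
# Made-up stand-in for the data after the identifier column has been dropped
wbcd <- data.frame(
  diagnosis = c("B", "M", "B", "B", "M"),
  radius_mean = c(12.3, 20.1, 11.8, 13.0, 18.7),
  area_mean = c(464.1, 1260.0, 432.0, 520.0, 1094.0),
  smoothness_mean = c(0.10, 0.09, 0.08, 0.12, 0.11),
  stringsAsFactors = FALSE
)

# 1. Count how many records fall into each diagnosis
table(wbcd$diagnosis)

# 2. Recode the target as a factor with more readable labels
wbcd$diagnosis <- factor(wbcd$diagnosis,
                         levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))

# 3. Summarise three of the numeric features
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
```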


The above shows the output of these steps. The data reveal a problem with the measurement scales: smoothness ranges from 0.05 to 0.16 while area ranges from 143.5 to 2501.0. This often happens, but because kNN relies on distances between records, a feature with a much larger range would dominate the result, so it is important that we compare like with like. We therefore ‘normalize’ the value ranges, in this case by creating a normalize() function in R.
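A min-max rescaling is one common way to write such a function, and it matches the 0 to 1 ranges seen in the summaries further down. The sketch below assumes that form:

```r
# Min-max normalization: rescale a numeric vector to the range 0 to 1
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Vectors on very different scales end up on the same 0 to 1 scale
small <- normalize(c(1, 2, 3, 4, 5))
large <- normalize(c(10, 20, 30, 40, 50))
print(small)
print(large)
```

Both calls give the same result, which is the point: after rescaling, the original units no longer matter.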

Next, normalize the data and assign it to a new data frame.
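One way to apply the function to every numeric column at once is lapply(), with the result converted back to a data frame. The sketch below uses a small made-up data frame with three numeric columns; on the full data set the column range would presumably be wbcd[2:31], covering all 30 measurements.

```r
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Made-up stand-in: diagnosis in column 1, numeric features after it
wbcd <- data.frame(
  diagnosis = c("B", "M", "B", "M"),
  radius_mean = c(12.3, 20.1, 11.8, 18.7),
  area_mean = c(143.5, 2501.0, 432.0, 1094.0),
  smoothness_mean = c(0.05, 0.16, 0.08, 0.11)
)

# Normalize each numeric column and collect the results in a new data frame
wbcd_n <- as.data.frame(lapply(wbcd[2:4], normalize))

summary(wbcd_n$area_mean)  # minimum should be 0 and maximum 1
```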

Check the normalized data before adding the diagnosis labels for the training and test data sets.

Output as follows:

> summary(wbcd_n$smoothness_mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.3046  0.3904  0.3948  0.4755  1.0000

> summary(wbcd_n$area_mean)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 0.0000  0.1174  0.1729  0.2169  0.2711  1.0000


From the above we can see that the values now range from zero (0) to one (1) for all features.

Next, in order to make a model, we need to create a training data frame and a test data frame, along with the matching labels. The index in square brackets selects rows and columns: rows 1 to 469 with all columns go into the training data frame, and rows 470 to 569 with all columns go into the test data frame. The labels are taken from the same rows of the first column of wbcd, which holds the diagnosis.

> wbcd_train <- wbcd_n[1:469, ]

> wbcd_test <- wbcd_n[470:569, ]

> wbcd_train_labels <- wbcd[1:469, 1]

> wbcd_test_labels <- wbcd[470:569, 1]


Step 3 – Training a model on the data

We can use the knn() function from the class package, so if you haven’t done so already, install it with install.packages("class") and load it with library(class).

Now we can build the classifier and make predictions with the following syntax:

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test, cl=wbcd_train_labels, k=21)
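To show the shape of the call on something small enough to check by eye, here is a sketch with made-up data. The feature names f1 and f2 and all the values are invented, and k = 3 is used only because the toy set is tiny.

```r
library(class)  # provides knn()

# Made-up, already-normalized training data with two clear groups
train <- data.frame(f1 = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.90),
                    f2 = c(0.20, 0.10, 0.15, 0.90, 0.80, 0.85))
train_labels <- factor(c("Benign", "Benign", "Benign",
                         "Malignant", "Malignant", "Malignant"))

# Two unlabeled test records, one close to each group
test <- data.frame(f1 = c(0.12, 0.88), f2 = c(0.18, 0.82))

# knn() returns a factor with one predicted label per test row
test_pred <- knn(train = train, test = test, cl = train_labels, k = 3)
print(test_pred)
```

Note that knn() does the "training" and the prediction in one call, which fits the lazy nature of the method.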


Step 4 – Evaluating Model Performance

If you haven’t already done so, install the gmodels package with install.packages("gmodels") and load it with library(gmodels).


Now create a cross tabulation of the two vectors using the CrossTable() function in the gmodels package. By default the output includes chi-square values, which we don’t need here, so we switch them off with prop.chisq = FALSE.


So input: CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)


TN means True Negative, while TP means True Positive. Correspondingly, FN means False Negative and FP means False Positive. In this output 61 patients were correctly predicted as having benign tumours (TN) and 37 were correctly predicted as having malignant tumours (TP). Two patients were predicted as benign when their tumours were in fact malignant (FN), while no one was predicted as malignant when the tumour was in fact benign (FP). A false negative could cost the patient their life and the hospital a hefty litigation fee, whereas a false positive could cause the patient worry and illness and cost the hospital money in wasted treatments.
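The four counts quoted above can be put into a small table to work out the accuracy by hand. The sketch below uses only those counts; the object names conf_mat, accuracy and fn_rate are my own.

```r
# Confusion matrix from the counts above (rows = actual, columns = predicted)
conf_mat <- matrix(c(61, 2, 0, 37), nrow = 2,
                   dimnames = list(actual = c("Benign", "Malignant"),
                                   predicted = c("Benign", "Malignant")))
print(conf_mat)

# Accuracy: correct predictions (TN + TP) over all 100 test cases
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Share of truly malignant cases that were missed (FN out of FN + TP)
fn_rate <- conf_mat["Malignant", "Benign"] / sum(conf_mat["Malignant", ])

print(accuracy)
print(fn_rate)
```

So 98 of the 100 test cases were classified correctly, and the two errors were both the more dangerous kind.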

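The model was then run again on a differently rescaled version of the data, wbcd_z (in the book this is the z-score standardized version of the features), using the following input: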

> wbcd_train <- wbcd_z[1:469, ]

> wbcd_test <- wbcd_z[470:569, ]

> wbcd_train_labels <- wbcd[1:469, 1]

> wbcd_test_labels <- wbcd[470:569, 1]

> wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                        cl = wbcd_train_labels, k = 21)

> CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
             prop.chisq = FALSE)


The output was as follows:


Unfortunately we have more false negatives (FN) here, which means the accuracy drops to 95%, below our previous 98%.
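The construction of wbcd_z is not shown in this post. Assuming it is a z-score standardization done with base R's scale() function, as I understand the book to do, it could be sketched like this on a small made-up data frame:

```r
# Made-up stand-in: diagnosis in column 1, numeric features after it
wbcd <- data.frame(
  diagnosis = c("B", "M", "B", "M", "B"),
  area_mean = c(464.1, 1260.0, 432.0, 1094.0, 520.0),
  smoothness_mean = c(0.10, 0.09, 0.08, 0.11, 0.12)
)

# z-score standardization: subtract each column's mean, divide by its
# standard deviation; scale() handles every column in one call
wbcd_z <- as.data.frame(scale(wbcd[-1]))

summary(wbcd_z$area_mean)  # mean should be (very close to) 0
```

Unlike min-max normalization, the standardized values are not confined to the range 0 to 1, so extreme values keep more of their weight in the distance calculation.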

To improve the result you could try different values of k, but ultimately the more often you repeat the test on different sets of 100 patients, the more you can trust the result.
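Trying several values of k can be done in a short loop. The sketch below runs on randomly generated data rather than the breast cancer set, so the error counts it prints say nothing about the real results; the list of k values is just an example choice.

```r
library(class)  # provides knn()

set.seed(1)  # make the synthetic data reproducible
n <- 100

# Synthetic two-class data: two overlapping clouds of points
make_data <- function(n) {
  data.frame(f1 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)),
             f2 = c(rnorm(n, mean = 0), rnorm(n, mean = 2)))
}
train <- make_data(n)
test <- make_data(n)
train_labels <- factor(rep(c("Benign", "Malignant"), each = n))
test_labels <- factor(rep(c("Benign", "Malignant"), each = n))

# Count misclassified test records for each candidate k
k_values <- c(1, 5, 11, 15, 21, 27)
errors <- sapply(k_values, function(k) {
  pred <- knn(train = train, test = test, cl = train_labels, k = k)
  sum(as.character(pred) != as.character(test_labels))
})

print(data.frame(k = k_values, misclassified = errors))
```

On real data the same loop would use the training and test data frames built earlier, and the k with the fewest false negatives would be the natural one to prefer here.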
