
Getting to Grips With Datasets - Working with Data for K-Means Clusters & K-NN Nearest Neighbours

Introduction / Abstract

The point of this exercise was to look at two algorithms, namely k-means clustering and k-nearest neighbors (k-NN). The ideal was to find a single dataset, chosen more or less at random, on which a model could be developed for both algorithms. Normal practice would be to choose an algorithm appropriate to a dataset rather than the other way around.

Nevertheless, given this challenge, one set out to fit an algorithm to a dataset that appeared to be a reasonable candidate. In the end, although useful lessons were learned along the way, this approach proved frustrating. Attempts to fit the CrowdFlower dataset to k-means clustering hit problems. It was then decided to attempt an exercise in k-NN as laid out in Usuelli (2014), but even this tried and tested example produced errors. Finally, a dataset from Lantz (2013), designed for a lesson on k-means clustering, was tried with k-NN; this also produced errors for which no solutions could be found in the remaining time. Although some of the datasets and exercises were not original, the experience still showed how difficult working with datasets and algorithms can be, and suggested that constant practice is what is needed.

Although this paper was originally intended to explain machine learning clustering models, it demonstrates instead the difficulties that early learners can encounter when choosing and working with datasets. Nevertheless, it was still deemed a useful learning experience given the limited resources at one’s disposal.

First Test – Predicting Gender Using K-means Clusters


This study is about the analysis of gender. In keeping with the exercise remit, only the first 250 rows of the original 20,000 were used.

The dataset was first examined in detail, and decisions were then made on how best to clean it by removing user responses such as ‘NA’ and blank or ‘Null’ entries. The initial examination in Excel and RStudio revealed that some columns had few or no values entered. These columns were removed from the original dataset altogether, including one that held only 5 values, since a variable with so few data could make no useful contribution to any model. Other missing values were classified as NA and then omitted, variable by variable, wherever they occurred.

Two columns, ‘fav_numbers’ and ‘tweet_counts’, seemed ideal candidates for grouping into gender clusters, but their value ranges were very wide. Normalizing these two numeric variables was considered vital, and various attempts were made to do so and to build a separate dataframe from a few variables. This took up a great deal of time. With so many character variables, the lesson drawn from the exercise was that k-means clustering was not the ideal candidate, especially as what was really being asked was a prediction of social network (Twitter) users’ gender based on profile content and tweets. It was therefore deemed futile to try to generate a model, as the data would have been highly skewed and unreliable. Given more time, one would have liked to try a Naïve Bayes classifier as a predictor of gender.
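To make the normalization step concrete, the following is a minimal sketch of min-max rescaling, one common way to bring wide-ranging numeric columns onto a shared 0 to 1 scale before clustering. The text does not say which method was attempted, and the original work was done in R; the sketch uses plain Python purely for illustration, and the sample values are invented rather than taken from the dataset.

```python
def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]; missing (None) entries stay None."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    lo, hi = min(present), max(present)
    span = hi - lo
    if span == 0:
        # Every value is identical, so there is no spread to rescale.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - lo) / span for v in values]


# Invented values standing in for the two wide-ranging columns.
fav_numbers = [0, 12, 250, 48000]
tweet_counts = [15, 1200, 98000, 2500000]

fav_scaled = min_max_normalize(fav_numbers)
tweet_scaled = min_max_normalize(tweet_counts)
```

After rescaling, both columns contribute on the same scale to a distance calculation. Min-max rescaling does not remove skew, however: a single very large value still compresses the rest towards zero, which is consistent with the concern about skewed data raised above.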

The Data Set

Description of the Dataset.

The dataset was provided by CrowdFlower. The following is a brief description of the original dataset.

“Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.”

Although the exact algorithm used is not mentioned on the CrowdFlower website, in the original task a group of people were asked to look at the profiles of Twitter users and guess each user’s gender based on certain criteria and the words used in a tweet. The algorithm searched each row of the data to match words, phrases and clusters of data.

The original dataset consisted of 20,000 rows and 26 variables, but for the purposes of this exercise the rows were reduced to 250, following the recommendation set out in the exercise. The dataset was downloaded and saved both as a .csv file for use in RStudio and as an Excel file, so that the columns and rows could be examined more easily in a spreadsheet. The variables were as follows:

Data Gender Variables

However, after examining the dataframe structure and a table of each categorical variable, it was found that two columns had missing values in all 250 rows and one column had 245 missing values. It is possible in R to omit designated columns, or rows that contain missing values, but eliminating every row in which a missing value occurs would have left nothing to examine. It was felt to be much quicker, and less prone to syntax errors, to delete these three columns in an Excel worksheet. The variable columns highlighted in red and orange were deleted in Excel and the dataset was reloaded into RStudio for further cleaning.
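The order of operations described here (remove the near-empty columns first, and only then drop incomplete rows) can be sketched in code. This is an illustrative Python sketch, not the procedure actually used, which was carried out by hand in Excel. The column names, the 0.9 threshold and the sample rows are all hypothetical.

```python
# Markers treated as missing, following the 'NA', blank and 'Null' entries described above.
MISSING = {"", "NA", "Null", None}


def missing_fraction(rows, column):
    """Fraction of rows whose value in `column` is a missing marker."""
    return sum(1 for r in rows if r.get(column) in MISSING) / len(rows)


def drop_sparse_columns(rows, threshold=0.9):
    """Keep only columns whose missing fraction is below `threshold`."""
    keep = [c for c in rows[0].keys() if missing_fraction(rows, c) < threshold]
    return [{c: r.get(c) for c in keep} for r in rows]


def drop_incomplete_rows(rows):
    """Keep only rows that have no missing values at all."""
    return [r for r in rows if all(v not in MISSING for v in r.values())]


# Hypothetical sample: 'empty_col' is missing in every row.
rows = [
    {"gender": "male", "tweet_count": "120", "empty_col": ""},
    {"gender": "female", "tweet_count": "4500", "empty_col": "NA"},
    {"gender": "brand", "tweet_count": "NA", "empty_col": ""},
    {"gender": "female", "tweet_count": "88", "empty_col": None},
]

rows_first = drop_incomplete_rows(rows)                      # nothing survives
cleaned = drop_incomplete_rows(drop_sparse_columns(rows))    # three rows survive
```

Dropping rows first discards everything, because every row is missing a value in the empty column. Removing that column first leaves three usable rows.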


If we are trying to make sense of how people relate to one another, clustering is a good algorithmic way of figuring out what is going on. The goal is to look for details that reveal an insight into how people, as a general rule, relate to each other and their environment. Used in aggregate, labels may reflect some underlying pattern of similarity among the individuals falling within a group.

Clustering involves finding natural groupings of data. Clustering works in a process very similar to the observational research described just now. We have attempted here to show how clustering tasks differ from classification tasks and how clustering defines groups.

Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groupings of similar items. It does this without having been told what the groups should look like ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction.

At first glance the CrowdFlower AI gender dataset, with its classification of gender into male, female and brand, seemed an ideal candidate for the k-means clustering algorithm, as k could be set to 3 (in order to group by gender).

Clustering is guided by the principle that records inside a cluster should be very similar to each other, but very different from those outside.
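The principle can be illustrated with a minimal sketch of the standard k-means procedure (Lloyd's algorithm) with k set to 3. It is written in plain Python for illustration only and is not the R code used in the exercise. The two-dimensional points are invented stand-ins for normalized numeric features, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Assign each point to one of k clusters; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assignments, centroids


# Invented, already-normalized points for illustration.
points = [
    (0.05, 0.10), (0.10, 0.05), (0.08, 0.12),
    (0.50, 0.55), (0.55, 0.50), (0.52, 0.48),
    (0.90, 0.95), (0.95, 0.90), (0.92, 0.88),
]
labels, centroids = k_means(points, k=3)
```

The result depends on the randomly chosen starting centroids, so in practice the algorithm is usually run several times and the best grouping kept. The cluster numbers it returns are arbitrary and carry no meaning of their own.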

“The result is a model that relates features to an outcome or features to other features; the model identifies patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label and inferred entirely from the relationships within the data. For this reason, you will sometimes see the clustering task referred to as unsupervised classification because, in a sense, this is classifying unlabeled examples” (Lantz, 2013).

As mentioned earlier, the algorithm is primarily a means of discovery rather than prediction.

Lantz (2013) goes on to explain: “The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related—for instance, it might return groups A, B, and C—but it's up to you to apply an actionable and meaningful label.”
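One way to attach meaning to otherwise anonymous clusters, where a judged label such as gender is available for comparison, is to cross-tabulate cluster membership against that label. The sketch below is a Python illustration of that idea under stated assumptions: the cluster numbers and labels are invented, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def cross_tabulate(cluster_ids, known_labels):
    """Count how often each known label appears within each cluster."""
    table = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, known_labels):
        table[cluster][label] += 1
    return table


# Invented cluster assignments and judged labels for nine profiles.
cluster_ids = [0, 0, 1, 1, 1, 2, 2, 0, 2]
known_labels = ["male", "male", "female", "female", "brand",
                "brand", "brand", "female", "brand"]

table = cross_tabulate(cluster_ids, known_labels)
for cluster in sorted(table):
    print(cluster, dict(table[cluster]))
```

A cluster dominated by one label can reasonably be named after it, while a cluster with a mixed composition suggests that the chosen features do not separate the groups well.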


Description of Predicting Task

The prediction task is to predict what gender t