
Getting to Grips With Datasets - Working with Data for k-Means Clustering & k-NN (k-Nearest Neighbours)

Introduction / Abstract

The point of this exercise was to look at two algorithms, namely k-means clustering and k-nearest neighbours (k-NN). The ideal situation was to find a dataset that would enable one to develop a model that would work for both of these algorithms. The normal practice would be to find an appropriate algorithm to fit a dataset rather than the other way around.


Nevertheless, given this challenge, one set out to try to fit an algorithm to a dataset that appeared to be a reasonable candidate. In the end, although useful lessons were learned along the way, this approach proved frustrating. Attempts to fit the CrowdFlower dataset to k-means clustering hit problems. It was then decided to attempt an exercise in k-NN as laid down in Usuelli (2014). Even this tried and tested example produced errors. Finally, it was decided to take a dataset from Lantz (2013) designed for a lesson on k-means clustering and try to use it for k-NN, but this also produced errors for which solutions could not be found in the remaining time. Although some datasets and exercises were not original, the work still showed how difficult working with datasets and algorithms can be, suggesting that constant practice is what is needed.


Although this paper was originally intended to explain machine learning clustering models, it demonstrates instead the difficulties that early learners can encounter when choosing and working with datasets. Nevertheless, it was still deemed a useful learning experience given the limited resources at one's disposal.


First Test – Predicting Gender Using k-Means Clustering


Abstract

This study is about the analysis of gender. Due to the exercise remit, only the first 250 rows out of the original 20,000 were used.


Some comprehensive analysis of the dataset was done initially. Decisions were then made on how best to clean the data, removing user responses such as 'NA' and blank or 'Null' entries. The initial examination of the data in Excel and RStudio revealed that some columns had no values entered into their row cells, or very few. As a result, some columns were removed altogether from the original dataset. One column had only five values and so it too was removed; it was felt that a variable with so little data could make no useful contribution to any model. Other missing data were classified as NA and then omitted, variable by variable, wherever NAs occurred.

Two columns, 'fav_number' and 'tweet_count', seemed ideal candidates for grouping into gender clusters, but their value ranges were very wide. Various attempts were made to normalize these two numeric variables and create a separate dataframe consisting of a few variables; it was considered vital to normalize them. Since this took up a great deal of time, and since the dataset contained so many character variables, the lesson learnt from this exercise is that k-means clustering was not the ideal candidate to use, especially as what was really being asked was a prediction of social network (Twitter) users' gender based on profile content and tweets. It was therefore deemed futile to try to generate a model, as the data would have been highly skewed and unreliable. Given the time, one would have liked to have tried using the Naïve Bayes classifier as a predictor of gender.


The Data Set

Description of the Dataset.

The dataset researched was provided by Crowdflower.com. The following is a brief description of the original dataset.

“Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.”


Although the exact algorithm used is not mentioned on the CrowdFlower website, in the original task a group of people were asked to look at the profiles of Twitter users and guess the user's gender based on certain criteria and the words used in a tweet. The algorithm searched each row of the data to match words, phrases and clusters of data.


The original dataset consisted of 20,000 rows and 26 variables, but for the purposes of this exercise the rows were reduced to 250, following the recommendation set out in the exercise. The dataset was downloaded and saved as a .csv file for use in RStudio, and as an Excel file for examining the columns and rows more easily in a spreadsheet. The variables were as follows:


Data Gender Variables


However, after examining the dataframe structure and a table of each categorical variable, it was found that two columns had missing values in all 250 rows and one column had 245 missing values. It is possible to omit designated columns in R, and rows that contain missing values, but if all rows containing a missing value had been eliminated there would have been nothing left to examine. It was felt to be much quicker to eliminate these three columns altogether in an Excel worksheet and reload the dataset, to avoid any syntax errors that might occur. The variable columns highlighted in red and orange were deleted in Excel and the dataset was reloaded into RStudio for further cleaning.
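For completeness, the same columns could also have been dropped directly in R. The following is a minimal sketch, assuming the dataframe is called genderCl; the three column names used here are purely hypothetical stand-ins for the columns highlighted in the figure:

# Drop the mostly-empty columns by name (column names below are hypothetical examples)
colsToDrop <- c("gender_gold", "profile_yn_gold", "tweet_coord")
genderCl <- genderCl[, !(names(genderCl) %in% colsToDrop)]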

Methodology

If we are trying to make sense of how people relate to one another, clustering is a good algorithmic way of figuring out what is going on. The goal is to look for details that reveal an insight into how people, as a general rule, relate to each other and their environment. Used in aggregate, the labels may reflect some underlying pattern of similarity among individuals falling within a group.

Clustering involves finding natural groupings of data, in a process very similar to the observational research described above. We have attempted here to show how clustering tasks differ from classification tasks and how clustering defines groups.

Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groupings of similar items. It does this without having been told what the groups should look like ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction.

The initial look at the CrowdFlower AI gender dataset, especially with its classification of gender into male, female and brand, made it seem an ideal candidate for the k-means clustering algorithm, as k could be set to 3 (in order to group by gender).

Clustering is guided by the principle that records inside a cluster should be very similar to each other, but very different from those outside.

“The result is a model that relates features to an outcome or features to other features; the model identifies patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label and inferred entirely from the relationships within the data. For this reason, you will sometimes see the clustering task referred to as unsupervised classification because, in a sense, this is classifying unlabeled examples” (Lantz, 2013).

As mentioned earlier the algorithm primarily is a means of discovery more so than prediction.

Lantz (2013) goes on to explain: “The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related—for instance, it might return groups A, B, and C—but it's up to you to apply an actionable and meaningful label.”

Prediction Task

Description of the Prediction Task

The prediction task is to predict the gender of a Twitter profile, given details such as sidebar colour, number of tweets, name, and the gender confidence assigned by people who were asked to view the profiles and guess the gender of the profile's owner.

Exploration of the Data Set

  • We installed the ggvis package (install.packages("ggvis")) for graphics and loaded it with library(ggvis).

Since we are aiming to predict gender, we chose to produce a scatterplot with sidebar colour on the y axis and names on the x axis:

genderCl %>% ggvis(~name, ~sidebar_color, fill = ~gender) %>% layer_points()


Sidebar Colour and Name


The user names are naturally unreadable, but they are a unique identifier. The users' sidebar colour was also scraped from the web profiles and is in hexadecimal form. We used a hex-to-colour website to glean what the colours were. The points were coloured blue for male, orange for female, and green for brand (not NA); since distance is important for clustering, gender was converted to numerical values of 1 and 0, as explained later in this paper. There does appear to be a line of best fit, but unless there is some order to the names this is unlikely to be significant. The initial suggestion is that there are no meaningful clusters, as each gender is well scattered.

  • We also looked at gender.confidence and sidebar_color against gender


Gender Confidence and sidebar colour

We can see from this that there was a high level of confidence (1.0 indicates a high level of confidence) in predicting that pink, beige, gray and yellow sidebars belonged to females, while blue, black, green and bright blue belonged to males. Brands were thought to be black, gray or brown. In contrast, there was a very low level of confidence in predicting pink and blue for men. FFFFFF is white and C0DEED is light blue. All hex colour codes were checked using http://www.color-hex.com/.
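As a quick sanity check, a hex code can also be previewed directly in R rather than via the website. A minimal sketch using base graphics, with the two codes mentioned above:

# Preview two of the sidebar hex codes side by side using base graphics
barplot(rep(1, 2), col = c("#C0DEED", "#FFFFFF"), names.arg = c("C0DEED", "FFFFFF"), axes = FALSE)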

  • Finally, we looked at tweet_count and sidebar_color against gender.

genderCl %>% ggvis(~tweet_count, ~sidebar_color, fill = ~gender) %>% layer_points()


Sidebar Colour and Tweet Count with Gender Points


The above chart looks promising, but unless the hex colours are ordinal in some way, for example running from pink through to supposedly more masculine colours like green or dark blue, it is difficult to determine whether there could be real meaning in the clustering. There are quite a few blue points clustered around the bottom left of the graph, against pink, light blue and black on the x-axis, where the tweet counts are low. Perhaps these are men who like feminine colours but do not wish to tweet much and bring attention to themselves?


Initial Data Structure


Gender database Comparison


With the three columns removed, the data was re-entered into RStudio; the resulting structure can be seen in the picture above.

Cleaning of the Data Set

In order to view missing data for the character variables and some integer variables, tables were generated as follows:


Gender Missing Value Explorations


There were 22 missing values for the variable 'description', and 48 colour codes for sidebar_color were listed as '0', which is nearly 20% of the entries. For the moment we left these in.

It was worth having a quick look at some of the numeric and integer variables just to see if there was anything worth noting. When a summary of the variable 'tweet_count' was examined, the difference between the minimum and maximum counts was very large, suggesting it might require normalisation. Other inputs represented confidence measures between 0 and 1, so these were fine.


Missing Values on gender Variables

It was decided at this point to convert all missing values in the dataframe, ‘genderCl’ to NA values:

genderCl[genderCl == ""] <- NA


Remove NA Values from Gender Database


And then to update the dataframe, omitting the rows containing NA values:

genderCl <- na.omit(genderCl)


Remove NA Values Checked


Unfortunately this reduced our numbers to 119, which may cause problems later, but for now we continued. Not wanting to reduce the dataset any more, we can dummy code a third variable for gender. Normally this is the unknown category, but here we have 'brand' as the third gender category. Lantz (2013) has the following to say:


“An alternative solution for categorical data like gender is to treat a missing value as a separate category. For instance, rather than limiting to female and male, we can add an additional level for "unknown." At the same time, we should also utilize dummy coding to transform the nominal gender variable into a numeric form that can be used for distance calculations. Dummy coding involves creating a separate binary 1 or 0 valued dummy variable for each level of a nominal feature except one, which is held out to serve as the reference group. The reason one category can be excluded is because it can be inferred from the other categories. For instance, if someone is not female and not unknown gender, they must be male. Therefore, we need to only create dummy variables for female and unknown gender”.


The syntax for this is as follows:


Create a Dummy Variable for Gender
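For readers without access to the screenshot, the following is a minimal sketch of the dummy coding described in the next two paragraphs, assuming the gender column holds the values 'female', 'male' and 'brand' (the actual code used is shown in the figures):

# Treat 'brand' as the missing/unknown category, then dummy code; 'male' is the implied reference group
genderCl$gender[genderCl$gender %in% "brand"] <- NA
genderCl$female <- ifelse(genderCl$gender == "female" & !is.na(genderCl$gender), 1, 0)
genderCl$no_gender <- ifelse(is.na(genderCl$gender), 1, 0)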


We had a lot of trouble trying to get this to work. In the end it was primarily because the function did not recognize the shortened "F", which had to be replaced with "female". The is.na() function would not take the word "brand", so first we had to convert all 'brand' entries for gender to NA. The output eventually came right, as seen below:

Gender Dummy Variable was Tested, Errors Fixed And Tested Again


The first statement assigns genderCl$female the value 1 if gender is equal to 'female' and gender is not NA; otherwise it assigns the value 0. The is.na() function tests whether gender is NA. If is.na() returns TRUE, then the genderCl$no_gender variable is assigned 1; otherwise it is assigned 0. To confirm that we did the work correctly, we compared our constructed dummy variables to the original gender variable, as can be seen above and below. The name 'no_gender' was later replaced with 'brand' to avoid confusion.


We thought we should remove 'unit_id' as it did not seem to serve any purpose as an identifier. We wanted to normalize two columns, 'tweet_count' and 'fav_number', as the range between the minimum and maximum for each of these was vast. However, the normalization process did not work for these variables, which instead appear to have been changed into character variables, as seen below.


Normalization Failed on our First Attempt


The columns were counted in Excel and RStudio as being columns 11 and 18, yet the operation converted each into a variable summarised only by Length, Class and Mode.
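For reference, the following is a minimal sketch of the min-max normalisation that was being attempted, assuming the two columns are numeric and named tweet_count and fav_number; this is the standard normalize function described by Lantz (2013), not the exact code that failed above:

# Min-max normalisation of the two wide-ranging numeric columns (assumes they are numeric)
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
genderCl$tweet_count <- normalize(genderCl$tweet_count)
genderCl$fav_number <- normalize(genderCl$fav_number)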


It was thought that the sweep function might cure this problem. According to Carlo Fenara (2015) on the website https://www.datacamp.com/community/tutorials/r-tutorial-apply-family:


“Sweep is probably the closest to the apply family. Its use is suggested whenever we wish to replicate different actions on the MARGIN elements we have chosen (limiting here to the matrix case). A typical scenario occurs in clustering, where you may need to repetitively produce normalized and centered data (‘standardised’ data).” (Fenara, 2015)
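By way of illustration, here is a generic sketch of how sweep is typically used to centre and scale a numeric matrix. This is not the exact code attempted below, and the column names are assumed:

# Centre and scale two numeric columns using sweep
m <- as.matrix(genderCl[, c("tweet_count", "fav_number")])
centred <- sweep(m, 2, colMeans(m), "-")            # subtract column means
scaled <- sweep(centred, 2, apply(m, 2, sd), "/")   # divide by column standard deviations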


Second Attempt at Data Normalization Using the Sweep Function


Unfortunately this attempt at normalizing the data for clustering purposes also failed but not for want of trying.

The input and output are listed below, showing errors whose solutions we were not able to interpret.


Sweep in Action


And the output with errors:


Sweep Normalization Attempt with Errors


Pressing 'Rerun with Debug' gave the following advice in a new RStudio window:


Normalization Debug Recommendation - Not Understood


The failure to normalise important variables within a wider dataframe using gender as the cluster identifier seriously put the attempt at a model into question. Nevertheless, the following section explains the intentions and the methodology that would have been used for k-means clustering, given a dataset more suitable for this algorithm.

Building The Model

Since the key variables could not be normalised, it was decided not to proceed with the exercise.

k-Means Clustering Conclusions

For this academic exercise the task was to find a dataset and apply the k-means clustering algorithm and/or k-NN to it, building a model that would discover clusters or groupings of data based on what was similar and dissimilar between a designated number, k, of groups. At first glance, a dataset of Twitter users for predicting gender seemed a good enough candidate for this task. Even after the initial graphical examination of the data it still seemed a good candidate, but being unable to normalize the data meant abandoning any meaningful chance of model building. For the second half of this paper I have taken a known dataset and a previously tried and tested method of solving a problem. I do not claim it as wholly original work, except in so far as it is my version of the lesson as laid out by Usuelli (2014).

k-NN (k-Nearest Neighbours) – Predicting Language Based on National Flags

Abstract

The Data Set and Task

In this exercise we are required to work out what language is spoken in a nation given the make-up of its national flag: colour, shape, emblems, and so on. The question posed is whether we can produce a model which will allow us to work out the language spoken in the nation of the flag in question.

Exploration of the Data Set

The dataset consists of ten levels of language for the variable language, and a yes or no value for all other variables, which represent different aspects of a national flag.

For example:

  • The colors feature (for example, red) has a yes level if the flag contains the colour

  • The patterns feature (for example, circle) has a yes level if the flag contains the pattern

  • The nBars/nStrp/nCol features followed by a number (for example, nBars3) have a yes level if the flag has that number of bars, stripes or colours

  • The topleft/botright/mainhue features followed by a colour (for example, topleftblue) have a yes level if the corresponding part of the flag is that colour (for example, if the top-left part is blue)

Methodology

The k-means target is to identify k (for example, eight) homogeneous clusters of flags. Imagine dividing all the flags into eight clusters. One of them includes 10 flags, of which seven contain the colour red. Let's suppose that we have a red attribute that is 1 if the flag contains red and 0 otherwise. We can say that the average flag of this cluster contains red with a probability of 70 percent, so its red attribute is 0.7. Doing the same with every other attribute, we can define the average flag, whose attributes are the averages within the group. Each cluster has an average flag that we can determine using the same approach.
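A toy illustration of this arithmetic, using a made-up 0/1 red attribute for a hypothetical cluster of ten flags:

# 7 of the 10 flags in this hypothetical cluster contain red
red <- c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0)
mean(red)  # 0.7 - the red attribute of the cluster's "average flag"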

The k-means algorithm is based on an average object called the cluster centre. At the beginning, the algorithm divides the flags into eight random groups and determines their eight centres. Then, k-means reassigns each flag to the group whose centre is most similar. In this way, the clusters become more homogeneous and the algorithm can recompute their centres. After a few iterations, we have eight groups containing homogeneous flags.

Preparing the data

We need two inputs:

  • x: A numeric data matrix

  • centers: The number of clusters (or the cluster centers to start with)

Starting from dtFeatures, we need to build a numeric feature matrix, dtFeaturesKm. First, we can put the feature names into arrayFeatures and generate the dtFeaturesKm data table containing all the features. Perform the following steps (a consolidated sketch of the whole sequence appears after the list):

  1. Define arrayFeatures, a vector containing the feature names. The dtFeatures data table contains the attribute in the first column and the features in the others, so we extract all the column names apart from the first.

  2. Define dtFeaturesKm containing the features:

  3. Convert a generic column (for example, red) into the numeric format. We can use as.numeric to convert the column format from factor into numeric: dtFeaturesKm[, as.numeric(red)]

  4. The new vector contains 1 if the value is no and 2 if the value is yes. In order to use the same standards as our k-means description, we prefer to have 0 if the attribute is no and 1 if the attribute is yes. In this way, when we compute the average attribute within a group, it will be a number between 0 and 1 that can be seen as the proportion of flags whose attribute is yes. Then, in order to have 0 and 1, we can use as.numeric(red) - 1:

dtFeaturesKm[, as.numeric(red) - 1]

Alternatively, we could have done the same using the ifelse function.

  1. We need to convert each column format into 0-1. The arrayFeatures vector contains the names of all the features, and we can process each of them using a for loop. If we want to transform a column whose name is contained in nameCol, we need to use the eval-get notation. With eval(nameCol) := we redefine the column, and with get(nameCol) we use the current value of the column, as shown: for(nameCol in arrayFeatures) dtFeaturesKm[, eval(nameCol) := as.numeric(get(nameCol)) - 1]

  2. Now all the features are in the 0-1 format. Let's visualize them: View(dtFeaturesKm)

  3. The kmeans function requires the data to be in matrix form. In order to convert dtFeaturesKm into a matrix, we can use as.matrix: matrixFeatures <- as.matrix(dtFeaturesKm)
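As promised above, here is a consolidated sketch of the preparation steps, closely following Usuelli (2014); it assumes dtFeatures is a data.table with the language attribute in its first column and yes/no factor features in the remaining columns:

library(data.table)

# 1. Feature names: every column except the first (the language attribute)
arrayFeatures <- names(dtFeatures)[-1]

# 2. A data table containing only the features
dtFeaturesKm <- dtFeatures[, arrayFeatures, with = FALSE]

# 3-4. Convert each yes/no factor into 0/1 using the eval-get notation
for (nameCol in arrayFeatures)
  dtFeaturesKm[, eval(nameCol) := as.numeric(get(nameCol)) - 1]

# 5. kmeans needs a numeric matrix
matrixFeatures <- as.matrix(dtFeaturesKm)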

The figure below shows what the matrix looks like so far.


Matrix Flag Features


The matrixFeatures matrix contains the data used to build the k-means model, and the other kmeans inputs are its parameters. The k-means algorithm doesn't automatically detect the number of clusters, so we need to specify it through the centers input. Given the set of objects, we could identify any number of clusters out of them, so we just define a reasonable number of centres, for example, 8:

nCenters <- 8

modelKm <- kmeans(x = matrixFeatures, centers = nCenters)

At this point we obtained an error suggesting that there were "more cluster centres than distinct data points", as seen below:

nCenters Assigned, Showing Errors in kmeans

A solution on Stackoverflow was as follows:


Stack Overflow Solution


However, this did not fix the problem.

The following is what should happen:

The modelKm object is a list containing different model components. The help for kmeans provides a detailed description of the output, and we can use names to get the element names. Let's see the components:

names(modelKm)

[1] "cluster" "centers" "totss" "withinss"

[5] "tot.withinss" "betweenss" "size" "iter"

[9] "ifault"

We can visualize the cluster centres, which are contained in centers, as shown: View(modelKm$centers). Each row defines a centre and each column shows an attribute. All the attributes are between 0 and 1, and they represent the percentage of flags in the cluster with that attribute equal to 1. For instance, if red is 0.5, it means that half of the flags in the cluster contain the colour red.

The element that we will use is cluster, and it contains a label specifying the cluster of each flag. For instance, if the first element of cluster is 3, this means that the first flag in matrixFeatures (and also in dtFeatures) belongs to the third cluster.
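Had the model been built successfully, these labels could have been attached back to the data for inspection. A minimal sketch in the same data.table style, assuming modelKm exists and dtFeatures still holds the language column:

# Attach each flag's cluster label to the feature table and peek at the result
dtFeatures[, clusterKm := modelKm$cluster]
head(dtFeatures[, .(language, clusterKm)])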

All inputs after this did not work so it was deemed futile to continue.

Caveat

Once again we ran into problems we didn’t know how to fix.

We even tried using the 'teens' dataset used to explain k-means clustering in Chapter 9 of Lantz (2013), with the intention of using it for k-NN, but this started producing very strange and totally unexpected problems which stopped us from exploring this avenue too.

Cleaning the data

Could not proceed.

Building the model

Could not proceed.

Conclusions

Nothing to conclude.

Overall Conclusions

What these exercises show is that one cannot take an algorithm and try to fit a dataset to it. Rather, one takes a dataset and, after close examination and plenty of time, especially as a beginner, looks for the most appropriate algorithm to solve the problem. Even this is fraught with difficulty, since one needs to 'get to know', or get a feel for, each algorithm first with a few dummy runs. I would suggest that fast-tracked learning does not make it easy to quickly choose correct algorithms for a newly acquired dataset; only time and practice will allow for this to happen. But even going through a point-by-point exercise from a textbook can go wrong, as was demonstrated by first following an exercise in R Machine Learning Essentials by Usuelli (2014), where this supposedly tested formula did not work either. We then took the teens dataset from Chapter 9 of Lantz (2013) and tried to use it with the Chapter 3 instructions on k-NN, and this did not work as planned either.

This suggests that only with much practice, and mutual help from other learners and teachers, can one hope to master these skills. Nevertheless, this has been a useful exercise in taking a step in that direction, as one has had to confront and overcome many coding errors and other obstacles. Onwards and upwards.

References

Websites

Jonge, E., and van der Loo, M., (2013), ‘An Introduction to Data Cleaning with R’, a discussion paper, https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf, cited on 19th March 2016 at 10.35 am.

http://www.color-hex.com/, cited on 20th March 2016 at 10.00 am (for converting hexadecimal codes to colours).

http://www.crowdflower.com/blog/using-machine-learning-to-predict-gender, cited on Friday 18th March 2016 at 22.30 (for the dataset on gender prediction).

https://www.datacamp.com/community/tutorials/r-tutorial-apply-family, cited on Saturday 19th March 2016 at 18.30 (the sweep function in R).

http://www.statmethods.net, cited on 19th March 2016 at 17.30.

Redmond, D., (2016), Lecture Notes, DBS, Higher Diploma in Science in Data Analytics, Dublin 2.

Bibliography

Lantz, B., (2013), Machine Learning with R, published by Packt Publishing, Birmingham, Mumbai.

Usuelli, M., (2014), R Machine Learning Essentials, published by Packt Publishing, Birmingham, Mumbai.
