Getting to Grips with Datasets - Working with Data for k-Means Clustering and k-Nearest Neighbours (k-NN)

Introduction / Abstract

The point of this exercise was to look at two algorithms, namely k-means clustering and k-nearest neighbours (k-NN). The ideal outcome was to find a random dataset that would enable one to develop a model that worked for both algorithms. Normal practice, however, is to find an appropriate algorithm to fit a dataset rather than the other way around.


Nevertheless, given this challenge, one set out to fit an algorithm to a dataset that appeared to be a reasonable candidate. In the end, although useful lessons were learned along the way, this approach proved frustrating. Attempts to fit the dataset from Crowdflower to k-means clustering hit problems. It was then decided to attempt an exercise in k-NN as laid down in Usuelli (2014); even this tried and tested example produced errors. Finally, it was decided to take a dataset from Lantz (2013) designed for a lesson on k-means clustering and try to use it for k-NN, but this also produced errors for which solutions could not be found in the remaining time. Although some of the datasets and exercises were not original, the experience still showed how difficult working with datasets and algorithms can be, suggesting that constant practice is what is needed.


Although this paper was originally intended to explain machine learning clustering models, it instead demonstrates the difficulties that early learners can encounter when choosing and working with datasets. It was nevertheless deemed a useful learning experience given the limited resources at one’s disposal.


First Test – Predicting Gender Using k-Means Clustering


Abstract

This study is about the analysis of gender. Due to the exercise remit, only the first 250 rows out of the original 20,000 were used.


Some comprehensive analysis of the dataset was done initially. Decisions were then made on how best to clean the data, removing user responses such as ‘NA’ and blank or ‘Null’ entries. The initial examination of the data in Excel and RStudio revealed that some columns had few or no values entered in their rows, so several columns were removed altogether from the original dataset. One column had only 5 values and was likewise removed; with so few data points, the variable could make no useful contribution to a model. Other missing data were classified as NA and then omitted, variable by variable, as subsets.

Two columns, ‘fav_number’ and ‘tweet_count’, seemed ideal candidates for grouping into gender clusters, but their value ranges were very wide. Normalising these two numeric variables was considered vital, and various attempts were made to do so and to create a separate dataframe consisting of a few variables. Since this took up a great deal of time, and the dataset contains so many character variables, the lesson drawn from this exercise is that k-means clustering was not the ideal candidate, especially as what was really being asked was a prediction of social network (Twitter) users’ gender based on profile content and tweets. It was therefore deemed futile to try to generate a model, as the data would have been highly skewed and unreliable. Given the time, one would have liked to try the Naïve Bayes classifier as a predictor of gender.


The Data Set

Description of the Dataset.

The dataset researched was provided by Crowdflower.com. The following is a brief description of the original dataset.

“Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.”


Although the exact algorithm used is not mentioned in the CrowdFlower website, in the original task a group of people were asked to look at profiles of Twitter users and guess what gender the user was based on certain criteria and words used in a Tweet. The algorithm searched each row of the data to match words, phrases and clusters of data.


The original dataset consisted of 20,000 rows and 26 variables, but for the purposes of this exercise the rows were reduced to 250, following the recommendation set out in the exercise. The dataset was downloaded and saved as a .csv file for use in RStudio, and as an Excel file so the columns and rows could be examined more easily in a spreadsheet. The variables were as follows:
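The loading and row reduction described above can be sketched in a few lines of R. The file name `gender-classifier.csv` is an assumption; substitute whatever name the download was saved under.

```r
# Sketch: load the saved .csv and keep the first 250 rows as the
# exercise recommends. The file name is an assumed placeholder.
genderCl <- read.csv("gender-classifier.csv", stringsAsFactors = FALSE)
genderCl <- head(genderCl, 250)   # head() is safe even if fewer rows exist

str(genderCl)   # check the variable types against the spreadsheet view
```

`stringsAsFactors = FALSE` keeps the many text columns as character vectors, which makes the later cleaning steps (blank-to-NA conversion, dummy coding) more predictable.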


Data Gender Variables


However, after examining the dataframe structure and each categorical variable’s table, it was found that two columns had missing values in all 250 rows and one column had 245 missing values. It is possible in R to omit designated columns and rows that contain missing values, but eliminating every row in which a missing value occurs would have left nothing to examine. It was felt to be much quicker to delete these three columns altogether in an Excel worksheet and reload the dataset, avoiding any syntax errors that might occur. The variable columns highlighted in red and orange were deleted in Excel and the dataset was reloaded into RStudio for further cleaning.

Methodology

If we are trying to make sense of how people relate to one another, clustering is a good algorithmic way of figuring out what is going on. The goal is to look for details that reveal an insight into how people, as a general rule, relate to each other and their environment. Used in aggregate, cluster labels may reflect some underlying pattern of similarity among the individuals falling within a group.

Clustering involves finding natural groupings of data, in a process very similar to the observational research described above. We have attempted here to show how clustering tasks differ from classification tasks and how clustering defines groups.

Clustering is an unsupervised machine learning task that automatically divides the data into clusters, or groupings of similar items. It does this without having been told what the groups should look like ahead of time. As we may not even know what we're looking for, clustering is used for knowledge discovery rather than prediction.

The initial look at the Crowdflower gender dataset, especially its classification of gender into male, female and brand, made it seem an ideal candidate for the k-means clustering algorithm, as k could be set to 3 (in order to group by gender).
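As a minimal sketch of that intended approach, base R’s `kmeans()` can be run with `centers = 3` on scaled numeric features. The two columns and their values below are toy stand-ins for the `tweet_count` and `fav_number` variables discussed later, not the real data:

```r
set.seed(42)  # k-means starts from random centres, so fix the seed

# Toy stand-ins for the two numeric features, scaled so neither
# dominates the distance calculation.
features <- scale(data.frame(
  tweet_count = c(120, 90, 15000, 14000, 500, 480),
  fav_number  = c(10, 25, 9000, 8500, 300, 310)
))

# k = 3 to mirror the three gender labels: male, female, brand.
# nstart = 25 reruns the algorithm from 25 random starts and keeps
# the best solution.
fit <- kmeans(features, centers = 3, nstart = 25)

fit$cluster  # cluster assignment for each row
fit$size     # how many rows fell into each cluster
</imports>
```

Note that, as Lantz points out later in this paper, the cluster labels 1–3 carry no intrinsic meaning; mapping them onto male/female/brand is a separate, manual step.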

Clustering is guided by the principle that records inside a cluster should be very similar to each other, but very different from those outside.

“The result is a model that relates features to an outcome or features to other features; the model identifies patterns within data. In contrast, clustering creates new data. Unlabeled examples are given a cluster label and inferred entirely from the relationships within the data. For this reason, you will sometimes see the clustering task referred to as unsupervised classification because, in a sense, this is classifying unlabeled examples” (Lantz, 2013).

As mentioned earlier the algorithm primarily is a means of discovery more so than prediction.

Lantz, (2013) further goes on to explain; “The catch is that the class labels obtained from an unsupervised classifier are without intrinsic meaning. Clustering will tell you which groups of examples are closely related—for instance, it might return groups A, B, and C—but it's up to you to apply an actionable and meaningful label.”

Prediction Task

Description of the Prediction Task

The prediction task is to predict the gender behind a Twitter profile, given details such as sidebar colour, number of tweets, name, and the gender-confidence scores assigned by people who were asked to view the profiles and guess the gender of the profile’s creator.

Exploration of the Data Set

  • We installed the ggvis package for graphics with install.packages("ggvis") and loaded it with library(ggvis).

Since we are aiming to predict gender, we chose to draw a scatterplot with sidebar_color on the y-axis and name on the x-axis:

genderCl %>% ggvis(~name, ~sidebar_color, fill = ~gender) %>% layer_points()


Sidebar Colour and Name


The user names are naturally unreadable, but they serve as unique identifiers. The users’ sidebar colours were also scraped from the web profiles and are in hexadecimal form. We used a hex-to-colour website to identify the colours; the points were coloured blue for male, orange for female, and green for brand (not NA). Since distance is important for clustering, gender was converted to numerical values of 1 and 0, as explained later in this paper. There does appear to be a line of best fit, but unless there is some ordering to names this is unlikely to be significant. The initial suggestion is that there are no meaningful clusters, as each gender is well scattered.

  • We also plotted gender.confidence against sidebar_color, coloured by gender.


Gender Confidence and sidebar colour

We can see from this that there was a high level of confidence (1.0 indicates full confidence) that pink, beige, grey and yellow sidebars belonged to females, while blue, black, green and bright blue belonged to males. Brands were thought to use black, grey and brown. In contrast, there was very low confidence in predicting pink and blue for men. Note that FFFFFF is white and C0DEED is Twitter’s default light blue. All hex colour codes were checked using http://www.color-hex.com/.
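Where hex codes need checking programmatically rather than via the website, base R’s `col2rgb()` can decode them; this is a convenience sketch, not part of the original workflow. One caveat: the scraped sidebar_color values lack the leading `#` that `col2rgb()` expects, so they would need `paste0("#", sidebar_color)` first.

```r
# Decode sidebar colours from hex into red/green/blue channels.
# "#C0DEED" is Twitter's default light-blue sidebar; "#FFFFFF" is white.
sidebar_hex <- c(default_blue = "#C0DEED", white = "#FFFFFF", pink = "#FFC0CB")
col2rgb(sidebar_hex)   # returns a 3-row matrix: red, green, blue
```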

  • Finally, we looked at tweet_count against sidebar_color, coloured by gender.

genderCl %>% ggvis(~tweet_count, ~sidebar_color, fill = ~gender) %>% layer_points()


Sidebar Colour and Tweet Count with Gender Points


The above chart looks promising, but unless the hex colours are ordinal in some way, say from pink through to supposedly more masculine colours like green or dark blue, it is difficult to determine whether there is real meaning in the clustering. Quite a few blue points cluster around the bottom left of the graph, where ‘tweet_count’ values are low and the x-axis shows pink, light blue and black. Perhaps these are men who like feminine colours but do not wish to tweet much and draw attention to themselves?


Initial Data Structure


Gender database Comparison


With the three columns removed, the data was reloaded into RStudio; the resulting structure can be seen in the picture above.

Cleaning of the Data Set

In order to view missing data for the character variables and some of the integer variables, tables were generated as follows:


Gender Missing Value Explorations


There were 22 missing values for the variable ‘description’, and 48 colour codes for sidebar_color were listed as ‘0’, which is nearly 20% of the entries. For the moment we left these in.

It was worth having a quick look at some of the numeric and integer variables to see if there was anything worth noting. When the variable ‘tweet_count’ was summarised, the difference between the minimum and maximum count was very large, suggesting it may require normalisation. Other inputs represent confidence measures between 0 and 1, so these were fine.
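The normalisation in question can be written as a small min-max function, which rescales a numeric vector to the range [0, 1] so that a variable with a huge spread (like tweet counts) does not dominate distance calculations. A minimal sketch:

```r
# Min-max normalisation: rescale a numeric vector to [0, 1].
# na.rm = TRUE lets it cope with the NA values discussed in the text.
normalize <- function(x) {
  rng <- max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
  (x - min(x, na.rm = TRUE)) / rng
}

normalize(c(1, 50, 10000))  # smallest value maps to 0, largest to 1
```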


Missing Values on gender Variables

It was decided at this point to convert all missing values in the dataframe, ‘genderCl’ to NA values:

genderCl[genderCl == ""] <- NA


Remove NA Values from Gender Database


And then to update the dataframe where NA values need to be omitted.

genderCl <- na.omit(genderCl)


Remove NA Values Checked


Unfortunately this reduced our numbers to 119, which may cause problems later, but for now we continued. Not wanting to reduce the dataset any further, we can dummy-code a third category for gender. Normally this would be an ‘unknown’ category, but here ‘brand’ serves as the third gender category. Lantz (2013) has the following to say:


“An alternative solution for categorical data like gender is to treat a missing value as a separate category. For instance, rather than limiting to female and male, we can add an additional level for "unknown." At the same time, we should also utilize dummy coding to transform the nominal gender variable into a numeric form that can be used for distance calculations. Dummy coding involves creating a separate binary 1 or 0 valued dummy variable for each level of a nominal feature except one, which is held out to serve as the reference group. The reason one category can be excluded is because it can be inferred from the other categories. For instance, if someone is not female and not unknown gender, they must be male. Therefore, we need to only create dummy variables for female and unknown gender”.


The syntax for this is as follows:


Create a Dummy Variable for Gender


We had a lot of trouble getting this to work. In the end it was primarily because the function did not recognise the shortened “F”, which was replaced with “female”. The is.na() function would not take the word “brand”, so first we had to convert all ‘brand’ entries for gender to NA. The output eventually came right, as seen below:

Gender Dummy Variable was Tested, Errors Fixed And Tested Again


The first statement assigns genderCl$female the value 1 if gender is equal to ‘female’ and is not NA; otherwise it assigns the value 0. The is.na() function tests whether gender is NA: if it returns TRUE, the genderCl$no_gender variable is assigned 1, otherwise 0. To confirm that the work was done correctly, we compared the constructed dummy variables to the original gender variable, as can be seen above and below. The name ‘no_gender’ was then replaced with ‘brand’ to avoid confusion.
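The two assignments described above can be sketched with `ifelse()` on a toy stand-in for the gender column (the ‘brand’ entries have already been converted to NA, as the text describes; the vector values here are illustrative only):

```r
# Toy stand-in for genderCl$gender after 'brand' was converted to NA.
gender <- c("female", "male", NA, "female", NA)

# 1 if female and not NA, else 0. In R, NA & FALSE evaluates to FALSE,
# so the !is.na() guard keeps NA rows out of the female dummy.
female <- ifelse(gender == "female" & !is.na(gender), 1, 0)

# 1 where gender is NA, i.e. the rows that were originally 'brand'.
brand <- ifelse(is.na(gender), 1, 0)

# Male is the held-out reference group: not female and not brand.
```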


We thought we should remove ‘unit_id’, as it did not seem to serve any purpose as an identifier. We wanted to normalise two columns, ‘tweet_count’ and ‘fav_number’, as the range between the minimum and maximum for each was vast. However, the normalisation process did not work for these variables, which instead appear to have been converted into character variables, as seen below.


Normalization Failed on our First Attempt


The columns were counted in Excel and RStudio as columns 11 and 18; however, the operation returned variables summarised by Length, Class and Mode, which is the summary pattern R prints for list or character columns rather than numeric vectors.


The sweep function might cure this problem. According to Carlo Fenara (2015) on the website https://www.datacamp.com/community/tutorials/r-tutorial-apply-family:


“Sweep is probably the closest to the apply family. Its use is suggested whenever we wish to replicate different actions on the MARGIN elements we have chosen (limiting here to the matrix case). A typical scenario occurs in clustering, where you may need to repetitively produce normalized and centered data (“standardised” data)” (Fenara, 2015).


Second Attempt at Data Normalization Using the Sweep Function


Unfortunately this attempt at normalising the data for clustering purposes also failed, though not for want of trying.
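For reference, a version of the sweep() approach that does run is sketched below: the first sweep() subtracts each column’s minimum, the second divides by each column’s range, producing the same min-max normalisation discussed earlier. The column names follow the text; the values are toy stand-ins, and this would only work once the columns are genuinely numeric.

```r
# Toy numeric stand-ins for the two wide-ranging columns.
counts <- data.frame(
  tweet_count = c(12, 5400, 98000),
  fav_number  = c(3, 800, 41000)
)

# Column-wise statistics: MARGIN = 2 means "per column".
mins   <- apply(counts, 2, min)
ranges <- apply(counts, 2, max) - mins

# Min-max normalisation via nested sweep(): subtract the minima,
# then divide by the ranges, column by column.
normed <- sweep(sweep(counts, 2, mins, "-"), 2, ranges, "/")

normed  # every column now runs from 0 to 1
```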

The input and output are listed below, showing errors for which we were not able to find or interpret solutions.