Getting to Grips with Analysing Datasets with Categorical Variables and ANOVA

Abstract

Analysis of Variance

Analysis of variance is a collection of statistical models used to find whether there is a significant difference between group means. It allows us to see the effects of different independent variables upon the dependent variable.

This study analyses the Airbnb dataset with the intention of building a model for a prediction task. The dataset comes from an ongoing Kaggle competition supported by Airbnb. The task, as originally set out by the competition, was to predict which country new Airbnb users from the United States were most likely to visit first.


An exploratory analysis of the dataset was done first. Decisions were then made on how best to clean the data, removing Airbnb user responses such as ‘unknown’, ‘other’, ‘NA’ and blank or ‘Null’ entries. Some residual values were also removed where it was felt that they would not make any useful contribution to a model.


The cleaned data frame was then tested using one-way and two-way ANOVA. The dependent variable was country_destination, and age and gender were examined as the independent variables to see whether they had any bearing on country destination. Age was found to have a highly significant relationship with country destination, and gender a somewhat less significant one. However, when age and gender were examined together, both were significantly related to the choice of country destination. After further data cleaning, omitting the ‘other’ country choice and NDF (no destination found) from country_destination, the significance level increased further, suggesting that males and females may make different destination decisions depending on how old they are.

To clarify this further, and after converting character variables to factors, it was decided that logistic regression would be the best approach, since the response variable is discrete and the error terms do not follow a normal distribution. Since NDF and US represented 90% of the country destinations, retaining NDF and working with some of the other data might have provided a more detailed predictor of country destination. Instead, we were interested in the probability that a man will make a given choice of country destination, given his age.


A logistic regression model was attempted, but we were not able to get beyond an error given the restrictions encountered in producing this report. However, the intended working template for this exercise can be seen in Appendix 1. Given more time and resources, we would endeavor to complete this exercise.


The Data Set

Description of the Dataset.

The dataset was provided by Airbnb and contains a list of users along with their demographics, web session records, and some summary statistics. The whole dataset originally contained five csv files, namely:

train-users, test-users, sessions, countries and age/gender-bkts (buckets).

  • The train-user and test-user files.

The train-user file contains 171239 training examples with 16 properties:

  • id

  • date-account-created

  • date-first-booking

  • gender

  • age

  • signup-method

  • signup-flow

  • language

  • affiliate-channel

  • affiliate-provider

  • first-affiliate-tracked

  • signup-app

  • first-device-type

  • first-browser

  • country-destination

  • time-stamp-first-active

The test-users file has 43673 items and 15 properties. The values of country-destination are missing, as this is the value that the competition originally asked participants to predict. The training and test sets were split by date. In the test set, one was expected to predict the country destination of all new users with first activities after 4/1/2014.

Sessions file: The sessions file is the web sessions log records for users. The sessions file contained 5600850 examples and 6 properties namely: user-id, action, action-type, action-detail, device-type, secs-elapsed.


There were actually 74610 different users in the file.


Countries: The countries file contained statistics of the destination countries in this dataset and their geographic information. It had information on ten countries and seven different properties for each, such as longitude and latitude.


Age-Gender-bkts: This file contained statistics of users’ age group, gender, country of destination. It consisted of four hundred and twenty examples and five properties.


Exploratory analysis of the data set

1) Users’ language: Not surprisingly, most users speak English, since Airbnb is a company located in the United States (US) and its customers are mostly Americans.

2) Users’ age: The age distribution shows that users’ age is mostly between 24 and 36.

3) Users’ gender: The original file showed that almost half of the users did not record their gender information. For convenience, rows with missing gender values were removed from the data.

4) Users’ country destination: Before the data was cleaned of missing values, the distribution showed that most people ended up booking nothing which was indicated as NDF. Among the users who did book with Airbnb, the United States was the most popular choice.


PREDICTING TASK

Description of Predicting Task

The prediction task is to predict in which country a new user will make his or her first booking. However, after initial trials it was decided to concentrate on the significant relationship found between country_destination and age and gender.

Data Pre-Processing

1) date-first-booking: In order to aid prediction of the country destination, this was cleaned of missing values and converted to a factor.


Methodology

According to Newsom (2013), “Choice of the statistical analyses in the social sciences typically rests on a more general or cruder classification of measures into what I will call ‘continuous’ and ‘discrete.’ Continuous refers to a variable with many possible values. By ‘discrete’ I mean few categories. I, as well as others, often use the terms ‘dichotomous’, ‘binary,’ ‘categorical,’ or ‘qualitative’ synonymously with ‘discrete.’”

Normal theory plays an important role in statistical tests with continuous dependent variables, such as t-tests, ANOVA, correlation, and regression, and binomial theory plays an important role in statistical tests with discrete dependent variables, such as chi-square and logistic regression (Newsom, 2013). Ordinal scales with many categories (5 or more), interval, and ratio, are usually analyzed with the normal theory class of statistical tests. For this task it was decided that a generalized linear model was a better choice than a normal linear model, given the mix of categorical and numeric data. We took ‘country_destination’ as our dependent variable and, after conducting initial ANOVA tests, converted ‘country_destination’ and gender to factors and attempted logistic regression on these three variables (country_destination, gender and age) with a different hypothesis.
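As a rough illustration of the generalized linear model idea described above, here is a minimal sketch of a binomial (logistic) regression in R. It is not the template from Appendix 1. The data frame `toy`, the binary outcome `booked_us` and the simulated coefficients are all assumptions made up for this example. Note that a binomial GLM needs a two-level response, so the sketch dichotomises destination into US versus elsewhere, which is also an assumption.

```r
# Synthetic stand-in for the cleaned data; names and values are illustrative only.
set.seed(42)
n <- 500
toy <- data.frame(
  age    = sample(18:70, n, replace = TRUE),
  gender = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE))
)

# Simulate a binary outcome: booked in the US (1) or elsewhere (0).
lin_pred <- -1 + 0.03 * toy$age + 0.4 * (toy$gender == "MALE")
toy$booked_us <- rbinom(n, size = 1, prob = plogis(lin_pred))

# A binomial GLM with the default logit link is a logistic regression.
fit <- glm(booked_us ~ age + gender, data = toy, family = binomial)
summary(fit)

# Predicted probability for a hypothetical 30-year-old male user.
new_user <- data.frame(
  age    = 30,
  gender = factor("MALE", levels = levels(toy$gender))
)
p_hat <- predict(fit, newdata = new_user, type = "response")
p_hat
```

With the full set of destination countries as the response, a multinomial model would be needed instead of this two-level simplification.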


Exploration of the Data Set

The train_data2 file was explored, as this was the only file which listed countries. The file was read into RStudio and assigned to a variable named ‘airnbnT’ (for Train). To maintain the integrity of the original data, it was immediately copied to a new variable named ‘airbnbC’ (for Clean), as the intention was to explore and clean the dataset to remove unknown, missing, Null and some residual values that might distort the data.
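The read-then-copy step described above can be sketched as follows. The actual file path is not given in the text, so this example writes a tiny made-up CSV to a temporary file and reads that instead. The rows and the reduced set of columns are assumptions for illustration only.

```r
# Write a small stand-in CSV so the sketch runs anywhere (contents are made up).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,age,gender,country_destination",
             "a1,34,FEMALE,US",
             "a2,,-unknown-,NDF",
             "a3,2014,MALE,FR"), tmp)

# Raw copy: read once and never modified afterwards.
airnbnT <- read.csv(tmp, stringsAsFactors = FALSE)

# Working copy: all cleaning happens on this object.
airbnbC <- airnbnT

str(airbnbC)
```

Because R copies an object when it is modified, later changes to airbnbC leave airnbnT untouched.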

Exploring the feature or column ‘age’ provided the following output.

plot(airbnbC$age)

[Figure: Airbnb Plot Exploration]

Boxplot(airbnbC$age)

[Figure: Airbnb Explored Boxplot]

hist(airbnbC$age)

[Figure: Airbnb Histogram on Age]

None of the above is helpful to the reader.


When we run summary(airbnbC$age) we begin to get a more useful idea of the data.


However, there are a lot of Null values and some implausible ages. We began by removing these outliers, limiting the age range to what we expect most of the time in real life, namely that people booking Airbnb destinations are between 18 and 70 years old. All ages outside this range were set to NA and removed.
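The age rule above (keep 18 to 70, set everything else to NA, then drop the NAs) can be sketched on a small made-up data frame. The name `toy` and its values are assumptions for illustration; the same two steps would apply to airbnbC.

```r
# Made-up ages, including implausible values and an existing NA.
toy <- data.frame(age = c(5, 18, 34, 70, 105, 2014, NA))

# Step 1: mark ages outside the 18-70 range as NA (existing NAs are left alone).
out_of_range <- !is.na(toy$age) & (toy$age < 18 | toy$age > 70)
toy$age[out_of_range] <- NA

# Summary after recoding but before removing NAs.
summary(toy$age)

# Step 2: remove the rows where age is NA.
toy_clean <- toy[!is.na(toy$age), , drop = FALSE]
summary(toy_clean$age)
```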

The summary of the data frame after age was cleaned, but before removal of the NAs, was as follows:


And after the NAs were removed:

[Figure: Summary on Age after NA and Nulls Removed]

And let’s try looking at the histogram on age again.

The histogram on age shows that the distribution is skewed to the right.

Next, the gender ‘unknown’ and ‘other’ responses were removed. An attempt to remove both in one line caused an error, so they were removed one after the other. Reproducing the summary confirms that the ‘unknown’ and then the ‘other’ categories were removed.
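One way to drop several categories in a single step is the %in% operator, sketched below on made-up data. The exact labels in the real file ("-unknown-" and "OTHER" here) are assumptions, and this is not necessarily the line that produced the error mentioned above.

```r
# Made-up data with two unwanted gender categories.
toy <- data.frame(
  gender = c("MALE", "-unknown-", "FEMALE", "OTHER", "FEMALE"),
  age    = c(30, 41, 27, 35, 52),
  stringsAsFactors = TRUE
)

# Remove both categories at once; %in% tests membership in a set of values.
drop_values <- c("-unknown-", "OTHER")
toy_clean <- subset(toy, !(gender %in% drop_values))

# Drop the now-empty factor levels so they do not appear in summaries.
toy_clean$gender <- droplevels(toy_clean$gender)
table(toy_clean$gender)
```

Without droplevels(), the removed categories would still be listed with a count of zero.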

A summary of the ‘first_booking’ column was examined for missing and residual data:

After the 50383 empty (Null) values were removed we see the results:

We should perhaps remove ‘Other’ but we left it in for the present as it is a large figure reflective of the fact that many site visitors did not book accommodation. Ideally one would break this feature into two subsets, those who booked and those who did not book.

Opening our original data file in Excel, we could see a lot of missing (blank) or Null values for age, so we went back to remove these: airbnbC <- subset(airbnbC, age != "").

We noticed the same problem with ‘first_affiliate_tracked’ and so removed Null values from this category too.

Finally, we removed the ‘-unknown-’ values from first_browser and ensured any further NAs were removed.
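The column-by-column removals above could also be expressed as one filter. The following is a minimal sketch on made-up data. The helper `is_missing` is hypothetical, and it assumes that a blank string, the literal "-unknown-" and NA all count as missing.

```r
# Made-up rows; only the first row is complete.
toy <- data.frame(
  age                     = c("34", "", "27", "45"),
  first_affiliate_tracked = c("untracked", "linked", "", "omg"),
  first_browser           = c("Chrome", "Safari", "Firefox", "-unknown-"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value is NA, blank or "-unknown-".
is_missing <- function(x) is.na(x) | x %in% c("", "-unknown-")

# A row is kept only if none of its columns is missing.
any_missing <- Reduce(`|`, lapply(toy, is_missing))
toy_clean <- toy[!any_missing, , drop = FALSE]

# Age was read as text because of the blanks, so convert it back to numeric.
toy_clean$age <- as.numeric(toy_clean$age)
toy_clean
```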

ANOVA Test for significance

We first ran one-way ANOVAs on ‘age’ and then on ‘gender’. The test on ‘age’ produced the following output:

Our analysis shows that the dependent variable ‘country_destination’ is very significantly related to the ‘age’ of those booking Airbnb accommodation. The p-value of 1.59e-08 (three stars) is far below the 0.05 significance level.

If we look at gender on its own we get the following result:


The analysis suggests that gender plays a smaller part in country destination choice. The p-value of 0.017 is below the 0.05 significance level, but possibly not by enough to draw any definite conclusions.
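For readers who want to see how a one-way ANOVA is run and read in R, here is a minimal sketch on simulated data. It uses the conventional orientation, with a numeric response (age) compared across the levels of a factor (country_destination). This differs from the orientation described in the text, where country_destination is the response. The data, group means and object names are assumptions for illustration.

```r
# Simulated ages for three destination groups (values are made up).
set.seed(1)
toy <- data.frame(
  country_destination = factor(rep(c("US", "FR", "IT"), each = 50)),
  age = c(rnorm(50, mean = 33, sd = 8),
          rnorm(50, mean = 38, sd = 8),
          rnorm(50, mean = 36, sd = 8))
)

# One-way ANOVA: does mean age differ between destination groups?
fit_age <- aov(age ~ country_destination, data = toy)
summary(fit_age)

# The ANOVA table is a data frame; the p-value sits in the "Pr(>F)" column.
anova_tab <- summary(fit_age)[[1]]
p_value <- anova_tab[["Pr(>F)"]][1]
p_value < 0.05
```

The stars printed by summary() are shorthand for the p-value bands: three stars for p below 0.001, two for p below 0.01 and one for p below 0.05.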


Finally, we analysed ‘gender’ plus ‘age’ together, not expecting any significant change in our results for ‘country_destination’ choice. However, the results showed an increase in the significance level for gender when both independent variables were included in the same model, as seen below.

We coded country destinations as numeric in the calculations as the mistake below shows:

Examining gender and age side by side in the same model has nevertheless lowered the p-value for gender, and its significance code has increased from one star to two stars.
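A two-predictor ANOVA with a numerically coded destination, as described above, can be sketched as follows. The data are simulated and the column `dest_code` is a hypothetical name. The sketch also shows why numeric coding of a factor needs care: as.numeric() returns the level index, which follows the sorted order of the labels and carries no real meaning.

```r
# Simulated data (values are made up; destinations cycle through four labels).
set.seed(2)
n <- 300
toy <- data.frame(
  country_destination = factor(rep(c("US", "FR", "IT", "GB"), length.out = n)),
  gender = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE)),
  age    = sample(18:70, n, replace = TRUE)
)

# Numeric coding: each destination becomes its level index (1 to 4).
toy$dest_code <- as.numeric(toy$country_destination)
levels(toy$country_destination)

# Additive model: gender and age side by side.
fit_add <- aov(dest_code ~ gender + age, data = toy)
summary(fit_add)

# Interaction model: also asks whether the age effect differs by gender.
fit_int <- aov(dest_code ~ gender * age, data = toy)
summary(fit_int)
```

Because the codes are arbitrary, results from such a model are best treated as exploratory.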


So our next ANOVA test