
Getting to Grips with Analysing Datasets with Categorical Variables and ANOVA

Abstract

Analysis of Variance

Analysis of variance is a collection of statistical models used to find whether there is a significant difference between group means. It allows us to see the effects of different independent variables upon the dependent variable.
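As a rough illustration of the idea, the sketch below computes the one-way ANOVA F statistic by hand: the ratio of the variation between group means to the variation within groups. The function name and the ages are made up for illustration and are not taken from the Airbnb data.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations

    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)   # mean square between groups
    ms_within = ss_within / (n - k)     # mean square within groups
    return ms_between / ms_within

# Hypothetical ages for three groups (illustrative values only).
ages_by_group = [
    [25, 30, 28, 35],
    [40, 45, 38, 50],
    [29, 33, 31, 27],
]
f_stat = one_way_anova_f(ages_by_group)
print(round(f_stat, 2))
```

A large F value means the group means differ by more than the within-group scatter would suggest. Whether it is significant is then judged against the F distribution with (k - 1, n - k) degrees of freedom.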

This study analyses the Airbnb dataset with the intention of building a model to carry out a prediction task on it. The dataset comes from an ongoing Kaggle competition supported by Airbnb. The task, as originally set out by the competition, was to predict which country new Airbnb users from the United States were most likely to visit first.


A comprehensive initial analysis of the dataset was carried out. Decisions were then made on how best to clean the data, removing Airbnb user responses such as ‘unknown’, ‘other’, ‘NA’ and blank or ‘Null’ entries. Some remaining entries were also removed where it was felt that they would not make any useful contribution to a model.
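The cleaning step described above might look something like the following minimal sketch. It assumes the user records have been read into a list of dictionaries. The helper names and the sample rows are hypothetical, and the set of placeholder strings simply mirrors the responses listed in the text.

```python
# Placeholder responses treated as missing, following the description in the text.
MISSING_MARKERS = {"", "unknown", "other", "na", "null"}

def is_missing(value):
    """True if a field is None or one of the placeholder strings (case-insensitive)."""
    return value is None or str(value).strip().lower() in MISSING_MARKERS

def drop_incomplete(rows, columns):
    """Keep only the rows whose listed columns all hold real values."""
    return [row for row in rows
            if not any(is_missing(row.get(col)) for col in columns)]

# Made-up rows for illustration only.
users = [
    {"id": "a1", "gender": "FEMALE", "age": "34"},
    {"id": "b2", "gender": "unknown", "age": "29"},
    {"id": "c3", "gender": "MALE", "age": ""},
    {"id": "d4", "gender": "OTHER", "age": "41"},
    {"id": "e5", "gender": "MALE", "age": "NA"},
    {"id": "f6", "gender": "MALE", "age": "52"},
]
cleaned = drop_incomplete(users, ["gender", "age"])
print([u["id"] for u in cleaned])
```

Only the rows with a usable value in every listed column survive, so the cleaned data can be much smaller than the original.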


Tests were then conducted on the cleaned data frame for variance about the mean using one-way and two-way ANOVA. The dependent variable was country_destination, and age and gender were examined as the independent variables to see whether they had any bearing on country destination. A significant relationship was found between country destination and age, and a somewhat less significant one between country destination and gender. However, when age and gender were combined, the two variables together were highly significant in relation to the choice of country destination. After further data cleaning, omitting the ‘other’ country choice and ‘Not defined’ (NDF) from country_destination, the significance level increased further, suggesting that males and females may make different destination decisions depending on how old they are.

To clarify this further, and after converting character variables to factors, it was decided that logistic regression would be the most appropriate method, since the response variable is discrete and the error terms do not follow a normal distribution. Since NDF and US represented 90% of the country destinations, retaining NDF and working with some of the other data might have provided a more detailed predictor of country destination. Instead, we were interested in the probability that a man will choose a given country destination, given his age.
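To make the idea concrete, here is a minimal sketch of a binary logistic regression fitted by plain gradient descent. It is not the template from Appendix 1. It assumes a simplified yes/no outcome (for example, chose the US or chose another country), whereas the real response has several classes. The ages, outcomes, scaling constants and function names are all made up for illustration.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(y = 1 | x) = sigmoid(b0 + b1 * x) by gradient descent on the log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        b0 -= lr * sum(errors) / n
        b1 -= lr * sum(e * x for e, x in zip(errors, xs)) / n
    return b0, b1

def scale_age(age):
    """Centre and scale age so that gradient descent behaves well (assumed constants)."""
    return (age - 40) / 10

# Hypothetical data: age and a made-up binary outcome (1 = chose the US, 0 = did not).
ages = [22, 25, 28, 31, 35, 40, 45, 50, 55, 60]
chose_us = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

b0, b1 = fit_logistic([scale_age(a) for a in ages], chose_us)

def predict_probability(age):
    """Estimated probability of the outcome for a given age under the fitted model."""
    return sigmoid(b0 + b1 * scale_age(age))

print(round(predict_probability(25), 3), round(predict_probability(55), 3))
```

In this toy data the fitted slope is negative, so the estimated probability falls as age rises. With the real data, one model per gender (or a gender term in the model) would be needed to answer the question posed above.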


A logistic regression model was attempted, but we were not able to get beyond an error given the restrictions encountered in producing this report. The intended working template for this exercise can be seen in Appendix 1. Given more time and resources, we would endeavor to complete this exercise.


The Data Set

Description of the Dataset.

The dataset was provided by Airbnb and contains a list of users along with their demographics, web session records, and some summary statistics. The whole dataset originally comprised five CSV files, namely:

train-users, test-users, sessions, countries and age/gender-bkts (buckets).

  • The train-users and test-users files.

The train-users file contains 171239 training examples with 16 properties:

  • id

  • date-account-created

  • date-first-booking

  • gender

  • age

  • signup-method

  • signup-flow

  • language

  • affiliate-channel

  • affiliate-provider

  • first-affiliate-tracked

  • signup-app

  • first-device-type

  • first-browser

  • country-destination

  • time-stamp-first-active

The test-users file has 43673 items and 15 properties. The values of country-destination are missing, since that is the value the competition originally asked participants to predict. The training and test sets were split by date: in the test set, participants were expected to predict the country destination of all new users whose first activity came after 4/1/2014.

Sessions file: The sessions file holds the web session log records for users. It contained 5600850 examples and 6 properties, namely: user-id, action, action-type, action-detail, device-type, secs-elapsed.


There were 74610 different users in the file.


Countries: The countries file contained statistics on the destination countries in this dataset and their geographic information. It had information on ten countries, with seven properties for each, such as longitude and latitude.


Age-Gender-bkts: This file contained statistics on users’ age group, gender and country of destination. It consisted of 420 examples and five properties.


Exploratory analysis of the data set

1) Users’ language: Not surprisingly, most users speak English, since Airbnb is a company located in the United States (US) and its customers are mostly Americans.

2) Users’ age: The age distribution shows that users’ age is mostly between 24 and 36.

3) Users’ gender: The original file showed that almost half of the users did not record their gender information. For convenience, records with missing gender values were removed from the data.

4) Users’ country destination: Before the data was cleaned of missing values, the distribution showed that most people ended up booking nothing, which was indicated as NDF. Among the users who did book with Airbnb, the United States was the most popular choice.
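Distributions like the ones summarised in this list can be tabulated with a few lines of code. The sketch below is only an illustration: the helper name is hypothetical and the destination values are made up, not drawn from the Airbnb files.

```python
from collections import Counter

def value_shares(values):
    """Return each distinct value's share of the total, most common first."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.most_common()}

# Made-up destination labels for illustration only.
destinations = ["NDF"] * 6 + ["US"] * 3 + ["FR"]
shares = value_shares(destinations)

for country, share in shares.items():
    print(f"{country}: {share:.0%}")
```

The same helper can be applied to any categorical column, such as language or gender, to see which categories dominate before deciding how to clean them.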


PREDICTING TASK

Description of Predicting Task

The prediction task is to predict the country in which a new user will make his or her first booking. However, after initial trials it was decided to focus on the significant relationship found between country_destination and age and gender.

Data Pre-Processing

1) date-first-booking: To aid prediction of the country destination, this column was cleaned of missing values and converted to a factor.