Abstract

Analysis of Variance

Analysis of variance is a collection of statistical models used to find whether there is a significant difference between group means. It allows us to see the effects of different independent variables upon the dependent variable.

This study analyses the Airbnb dataset with the intention of building a model for a prediction task. The dataset comes from an ongoing Kaggle competition supported by Airbnb. The task, as originally set out by the competition, was to predict which country new Airbnb users from the United States were most likely to visit first.

A comprehensive exploratory analysis of the dataset was done first. Decisions were then made on how best to clean the data, removing Airbnb user responses such as ‘unknown’, ‘other’, ‘NA’ and blank or ‘Null’ entries. Some residual values were also removed where it was felt that they would not make any useful contribution to a model.

Tests were then conducted on the cleaned data frame using one-way and two-way ANOVA. The dependent variable was country_destination, and age and gender were examined as the independent variables to see whether they had any bearing on country destination. There was a highly significant relationship between country destination and age, and a weaker one between country destination and gender. However, when age and gender were examined together, the significance for gender increased. After further data cleaning, omitting the ‘other’ country choice and ‘NDF’ (no destination found) from country destination, the significance level increased further, suggesting that males and females may make different destination decisions depending on how old they are.

To clarify this further, and after converting character variables to factors, it was decided that logistic regression would be the most suitable method, since the response variable is discrete and the error terms do not follow a normal distribution. Since NDF and US represented 90% of the country destination values, retaining NDF and working with some of the other data might have provided a more detailed predictor of country destination. Instead, we asked: what is the probability that a man will make a given choice of country destination, given his age?

A logistic regression model was attempted, but we were not able to get beyond an error, given the restrictions encountered in producing this report. The intended working template for this exercise can be seen in Appendix 1. Given more time and resources we would endeavor to complete this exercise.

The Data Set

Description of the Dataset.

The dataset was provided by Airbnb and contains a list of users along with their demographics, web session records, and some summary statistics. It originally comprised five csv files, namely:

train-users, test-users, sessions, countries and age/gender-bkts (buckets).

The train-user and test-user files.

The train-user file contains 171239 training examples with 16 properties:

id

date-account-created

date-first-booking

gender

age

signup-method

signup-flow

language

affiliate-channel

affiliate-provider

first-affiliate-tracked

signup-app

first-device-type

first-browser

country-destination

time-stamp-first-active

The test-users file has 43673 items and 15 properties. The values of country-destination are missing, and that is the value the competition originally asked participants to predict. The training and test sets were split by date. In the test set, one was expected to predict the country destination of all new users with first activities after 4/1/2014.

Sessions file: The sessions file is the web sessions log records for users. The sessions file contained 5600850 examples and 6 properties namely: user-id, action, action-type, action-detail, device-type, secs-elapsed.

There were 74610 different users in the file.

Countries: The countries file contained statistics of destination countries in this dataset and their geographic information. It had information on ten countries and seven properties for each, such as longitude and latitude.

Age-Gender-bkts: This file contained statistics of users’ age group, gender, country of destination. It consisted of four hundred and twenty examples and five properties.

Exploratory analysis of the data set

1) Users’ language: Not surprisingly, most users speak English, since Airbnb is a company located in the United States (US) and its customers are mostly Americans.

2) Users’ age: The age distribution shows that users’ age is mostly between 24 and 36.

3) Users’ gender: The original file showed that almost half of the users did not record their gender information. For convenience, rows with missing gender values were removed from the data.

4) Users’ country destination: Before the data was cleaned of missing values, the distribution showed that most people ended up booking nothing which was indicated as NDF. Among the users who did book with Airbnb, the United States was the most popular choice.

PREDICTING TASK

Description of Predicting Task

The prediction task is to predict in which country a new user will make his or her first booking. However, after initial trials it was decided to concentrate on the significant relationships found between country_destination and age and gender.

Data Pre-Processing

1) date-first-booking: In order to aid prediction of the country destination, this was cleaned of missing values and converted to a factor.

Methodology

According to Newsom (2013), “Choice of the statistical analyses in the social sciences typically rests on a more general or cruder classification of measures into what I will call “continuous” and “discrete.” Continuous refers to a variable with many possible values. By "discrete" I mean few categories. I, as well as others, often use the terms “dichotomous”, “binary,” “categorical,” or “qualitative” synonymously with “discrete.”

Normal theory plays an important role in statistical tests with continuous dependent variables, such as t-tests, ANOVA, correlation, and regression, and binomial theory plays an important role in statistical tests with discrete dependent variables, such as chi-square and logistic regression (Newsom, 2013). Ordinal scales with many categories (5 or more), interval, and ratio, are usually analyzed with the normal theory class of statistical tests. For this task it was decided that a generalized linear model was a better choice than a normal linear model, given the mix of categorical and numeric data. We considered our dependent variable, ‘country_destination’, and after conducting initial ANOVA tests, converted ‘country_destination’ and gender to factors and attempted logistic regression on these three variables with a different hypothesis.

Exploration of the Data Set

The train_data2 file was explored, as this was the only file which listed countries. The file was read into RStudio and assigned to the variable named ‘airnbnT’ (for Train). To maintain the integrity of the original data, it was immediately copied to a new variable named ‘airbnbC’ (for Clean), as the intention was to explore and clean the dataset by removing unknown, missing, Null and some residual values that might distort the data.

Exploring the feature or column ‘age’ provided the following output.

plot(airbnbC$age)

Figure: Airbnb Plot Exploration

boxplot(airbnbC$age)

Figure: Airbnb Explored Boxplot

hist(airbnbC$age)

Figure: Airbnb Histogram on Age

None of the above is helpful to the reader.

When we perform a summary of airbnbC$age we begin to get a useful idea of the data


However, there are a lot of Null values and some implausible ages. We began by removing these outliers, limiting the upper and lower age limits to reflect what we expect to happen most of the time in real life, which is that people will book holidays in Airbnb destinations between the ages of 18 and 70. All ages above or below this range were set to NA and removed.
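The age filter described above can be sketched roughly as follows. This is a minimal illustration on a small made-up data frame, not the original code: the name `airbnb_demo` and its values are assumptions chosen only to show out-of-range ages being set to NA and then dropped.

```r
# Hypothetical stand-in for the age column (made-up values, not the Airbnb data)
airbnb_demo <- data.frame(
  id  = 1:8,
  age = c(25, 34, NA, 105, 2014, 17, 70, 18)
)

# Inspect the raw column: note the NA and the implausible values
summary(airbnb_demo$age)

# Mark ages outside the 18-70 window as NA ...
out_of_range <- !is.na(airbnb_demo$age) &
  (airbnb_demo$age < 18 | airbnb_demo$age > 70)
airbnb_demo$age[out_of_range] <- NA

# ... then drop every row whose age is NA
airbnb_clean <- airbnb_demo[!is.na(airbnb_demo$age), ]

summary(airbnb_clean$age)
```

Keeping the two steps separate makes it possible to print a summary between them, which matches the before and after summaries shown in the text.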

So the summary of the dataframe after age was cleaned but before removal of NA’s was as follows:


And after NAs were removed:

Figure: Summary on Age after NA and Nulls Removed

And let’s try looking at the histogram on age again.

The histogram on age shows that the distribution is skewed to the right.

Next, the gender ‘unknown’ and ‘other’ responses were removed. An attempt to remove both in one line caused an error, so they were removed one after the other; the summary confirms that both categories are gone.
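Removing two categories in a single step is usually possible with `%in%`. The sketch below is one way it might be done, assuming the labels are spelled `-unknown-` and `OTHER`; both the labels and the data frame `gender_demo` are illustrative assumptions, not taken from the original code.

```r
# Hypothetical example data (labels are assumed, not read from the real file)
gender_demo <- data.frame(
  gender = c("MALE", "FEMALE", "-unknown-", "OTHER", "FEMALE", "MALE"),
  age    = c(30, 41, 28, 35, 52, 24),
  stringsAsFactors = FALSE
)

# Drop both unwanted categories in one call
drop_labels  <- c("-unknown-", "OTHER")
gender_clean <- subset(gender_demo, !(gender %in% drop_labels))

# Convert to a factor and discard any empty levels left behind
gender_clean$gender <- droplevels(factor(gender_clean$gender))

table(gender_clean$gender)
```

The `droplevels()` call matters when the column is already a factor, because removed categories otherwise still appear in summaries with a count of zero.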

A summary of ‘first_booking’ column was examined for missing and residual data:

After the 50383 empty (Null) values were removed we see the results:

We should perhaps remove ‘Other’ but we left it in for the present as it is a large figure reflective of the fact that many site visitors did not book accommodation. Ideally one would break this feature into two subsets, those who booked and those who did not book.

Opening our original data file in Excel, we could see a lot of missing (blank) or Null values for age, so we went back to remove these: airbnbC <- subset(airbnbC, age != "").

We noticed the same problem with ‘first_affiliate_tracked’ and so removed Null values from this category too.

Finally we removed ‘-unknown-’ values from first_browser and ensured any further NAs were removed.

ANOVA Test for significance

We first ran one-way ANOVAs on ‘age’ and then on ‘gender’:

Which produced the following output:

Our analysis shows that the dependent variable ‘country_destination’ is significantly related to the ‘age’ of those booking Airbnb accommodation. The p-value of 1.59e-08 is far below the 0.05 significance level (three stars).

If we look at gender on its own we get the following result:

The analysis suggests that gender plays a smaller part in country destination choice. The p-value of 0.017 is below the 0.05 significance level, but possibly not low enough to draw any definite conclusions.
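A one-way ANOVA of this kind can be sketched with `aov()`. The example below uses simulated data, so the numbers it prints are not the study's results. The data frame `sim` and the numeric column `dest_code` are assumptions; `dest_code` stands in for a destination coded as a number, which is how the text describes the original calculation.

```r
set.seed(1)
n <- 300

# Simulated users (hypothetical values, not the Airbnb data)
sim <- data.frame(
  age    = round(runif(n, 18, 70)),
  gender = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE))
)

# Hypothetical numeric destination code that drifts upward with age
sim$dest_code <- 1 + 0.05 * sim$age + rnorm(n)

# One model per explanatory variable
fit_age    <- aov(dest_code ~ age, data = sim)
fit_gender <- aov(dest_code ~ gender, data = sim)

summary(fit_age)     # the Pr(>F) column holds the p-value
summary(fit_gender)

# Pull the p-value for the age term out of the summary table
p_age <- summary(fit_age)[[1]][["Pr(>F)"]][1]
p_age
```

A p-value is always between 0 and 1, so a printed value such as 1.59e-08 is a very small positive number written in scientific notation.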

Finally we analysed ‘gender’ plus ‘age’ together, not expecting any significant change in our results for ‘country_destination’. However, the results showed an increase in significance for gender when both independent variables were included in the same model, as seen below.

We coded country destinations as numeric in these calculations, a mistake that shows in the output below:

Including gender and age side by side nevertheless lowered the p-value for gender, and its significance code rose from one star to two stars.

So to our next ANOVA test:

This suggests that there may be gender differences in country destination preference among different ages of accommodation bookers and so it was decided that these two variables should be examined together in a further test.
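Whether the effect of age differs between genders is the kind of question an interaction term addresses. The sketch below is an illustration on simulated data, not the original analysis: `sim2` and `dest_code` are assumed names, and the data are built so that the age slope differs by gender.

```r
set.seed(2)
n <- 400

# Simulated users (hypothetical values, not the Airbnb data)
sim2 <- data.frame(
  age    = round(runif(n, 18, 70)),
  gender = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE))
)

# Build a response whose age slope depends on gender
slope <- ifelse(sim2$gender == "MALE", 0.08, 0.02)
sim2$dest_code <- 1 + slope * sim2$age + rnorm(n)

# Main effects only, then main effects plus the age:gender interaction
fit_add <- aov(dest_code ~ age + gender, data = sim2)
fit_int <- aov(dest_code ~ age * gender, data = sim2)

summary(fit_add)
summary(fit_int)

# p-value for the interaction row (third row of the table)
p_int <- summary(fit_int)[[1]][["Pr(>F)"]][3]
p_int
```

Note that `aov()` uses sequential sums of squares, so with unbalanced data the order of the terms in the formula can change the p-values of the main effects.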

Prepare the data for performing multiple regression

Before examining the variables more closely character variables needed to be converted to factors and numeric variables checked in order to make the examination of the data more efficient.

According to Lantz (2013):

“An advantage of using factors is that they are generally more efficient than character vectors because the category labels are stored only once. Rather than storing MALE, MALE, FEMALE, the computer may store 1, 1, 2. This can save memory. Additionally, certain machine learning algorithms use special routines to handle categorical variables. Coding categorical variables as factors ensures that the model will treat this data appropriately”.

The syntax below shows the checks and conversions where appropriate.

The output for these conversions is as follows:

We mistakenly converted ‘country_destination’ to numeric before changing it to a factored variable.
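The checks and conversions might look something like the sketch below. The data frame `conv_demo` is a made-up example, and the last line illustrates why converting a factor to numeric is easy to get wrong: `as.numeric()` on a factor returns the internal level codes, not anything meaningful about the countries.

```r
# Hypothetical example data with everything read in as character
conv_demo <- data.frame(
  country_destination = c("US", "FR", "NDF", "US", "ES"),
  gender              = c("MALE", "FEMALE", "FEMALE", "MALE", "FEMALE"),
  age                 = c("34", "27", "45", "52", "23"),
  stringsAsFactors    = FALSE
)

str(conv_demo)   # all three columns are character at this point

# Categorical columns become factors, age becomes numeric
conv_demo$country_destination <- as.factor(conv_demo$country_destination)
conv_demo$gender              <- as.factor(conv_demo$gender)
conv_demo$age                 <- as.numeric(conv_demo$age)

str(conv_demo)
levels(conv_demo$country_destination)

# Level codes follow the alphabetical order of the labels
dest_codes <- as.numeric(conv_demo$country_destination)
dest_codes
```

Because the codes are assigned alphabetically, treating them as a numeric response imposes an arbitrary ordering on the destinations.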

A quick glance at the data in graph form so far.

A boxplot on ‘country_destination’ and age:

This showed that users choosing Spain as a destination tend to be younger, while those going to Great Britain tend to be older.
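A grouped boxplot of this kind can be produced with the formula interface of `boxplot()`. The sketch below uses simulated ages for three destinations, so it only illustrates the call. The data frame `sim3` and its group means are assumptions, chosen to echo the pattern described above.

```r
set.seed(3)

# Simulated ages for three destinations (hypothetical values)
sim3 <- data.frame(
  country_destination = factor(rep(c("ES", "GB", "US"), each = 50)),
  age = c(rnorm(50, 30, 5), rnorm(50, 45, 8), rnorm(50, 36, 9))
)

# One box per destination
boxplot(age ~ country_destination, data = sim3,
        xlab = "Country destination", ylab = "Age")

# The same comparison as numbers: median age per destination
med_age <- tapply(sim3$age, sim3$country_destination, median)
med_age
```

Printing the group medians alongside the plot makes the comparison easier to report in text than reading values off the boxes.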

A plot on country destination measured against gender, both as factors, revealed the following:

The plot suggests that slightly more females than males make the booking when the US is the destination. More females than males make the booking for France and Italy. However the graph is not quite so easy to read for all the destinations.

Early examination of the data showed that US Airbnb users predominantly choose the United States as their first destination. Rather than eliminate the US data to find the next destination of choice, we decided, given the strongly significant values from our ANOVA tests earlier, to ask the following question:

What is the probability that a man will choose, and purchase, a given country destination, given his age?

We will use a generalized linear model (glm) for this, which suits the combination of factor and numeric variables.

First we created the model ‘airbnbm’ and called it:

And the output was as follows:

According to the output, the model is logit(π_i) = 5.18 - 0.01*income - 0.03*age. After fitting the model, we can test the overall model fit and hypotheses regarding a subset of regression parameters using a likelihood ratio test (LRT). “Likelihood ratio tests are similar to partial F-tests in the sense that they compare the full model with a restricted model where the explanatory variables of interest are omitted. The p-values of the tests are calculated using the χ² distribution. To test the hypothesis H0: β1 = β2 = 0 we can compare our model with a reduced model that only contains an intercept term. A likelihood ratio test comparing the full and reduced models can be performed using the anova() function with the additional option test=’Chisq’.” (www.stat.columbia.edu).

> airbnbm.reduced = glm(country_destination ~ 1, family=binomial)

> anova(airbnbm.reduced, results, test="Chisq")

This produced the following error:

Figure: glm Anova Output Error

At this point we were not able to proceed with the test, as a solution could not be found for the error. It was intended to use the template listed in Appendix 1 to come to conclusions about the null hypothesis and the level of significance.
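The error message itself is not reproduced here, so its cause cannot be confirmed. For reference, a likelihood ratio test of this kind can be sketched as below on simulated data. Everything in the sketch is an assumption for illustration: the data frame `sim4`, the binary outcome `booked_us`, and the coefficients used to simulate it. Two details may be worth checking against the calls in the text: both models are fitted with an explicit `data` argument, and the second argument to `anova()` is the fitted full model object (the text names the model ‘airbnbm’ but passes `results`).

```r
set.seed(4)
n <- 500

# Simulated users (hypothetical values, not the Airbnb data)
sim4 <- data.frame(
  age    = round(runif(n, 18, 70)),
  gender = factor(sample(c("FEMALE", "MALE"), n, replace = TRUE))
)

# Hypothetical binary outcome: 1 if the first booking is in the US
eta <- -1 + 0.03 * sim4$age + 0.4 * (sim4$gender == "MALE")
sim4$booked_us <- rbinom(n, 1, plogis(eta))

# Full model and intercept-only model, fitted on the same data
full    <- glm(booked_us ~ age + gender, family = binomial, data = sim4)
reduced <- glm(booked_us ~ 1, family = binomial, data = sim4)

summary(full)

# Likelihood ratio test: reduced model first, full model second
lrt <- anova(reduced, full, test = "Chisq")
lrt
```

With `family = binomial`, `glm()` expects a two-outcome response. If a factor with more than two levels is supplied, the first level is treated as failure and all other levels as success, which may not be the comparison intended for a multi-country destination variable.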

Conclusions

The Airbnb Kaggle data set presents tremendous challenges for accurately predicting the first choice of country destination by new Airbnb users from America. Given the resource constraints, it was not possible in the end to do justice to the original task as proposed by Airbnb. However, the whole data set was cleaned, mainly by omitting NA and Null entries and excluding entries such as ‘unknown’ and ‘other’ unless they made up significant numbers. It was then decided to focus on the significance levels found by ANOVA testing on country destination, age and gender, since ANOVA is suited to comparing groups defined by categorical variables. Since logistic regression is better suited to cases where the response variable is discrete and the error terms do not follow a normal distribution (due to their limited size), a logistic regression generalized linear model (glm) was attempted on the cleaned and renamed airbnbC1 dataframe. The results were inconclusive purely as a result of a technical error which we were unable to overcome in the available timeframe. However, the p-values in our final coefficient printouts for the independent variables, age and gender, were well below the 0.05 level, suggesting significance.

References

Lindquist, Martin A, Generalised Linear Models, (R11.pdf), cited on 17/03/16, 16.05 hrs at http://www.stat.columbia.edu/~martin/W2024/R11.pdf

Newsom, USP 634 Data Analysis, Spring 2013, Course Syllabus.

Zhang et al, The Prediction of Booking Destination On Airbnb Dataset, cited on 16/03/16, 09.30 hrs at cseweb.ucsd.edu/~jmcauley/cse255/reports/fa15/038.pdf

DBS Lecture notes.
