An Exercise in Regression Modelling Using R and R Studio

Research Question / Problem to be addressed:

Let us imagine that the United States Department of Employment and Department of Education would likely to know if employers are bringing in foreign workers for jobs where skills are in short supply solely because employers can’t fill these positions with US persons or because they can pay less in wages to these workers. US employers too would like to know if they are seeking skilled workers from the right geographic areas of the world. What carries most weighting in terms of final paid yearly wage to Data Analysts, is it ‘visa class’, country of origin, the prevailing local wage for that particular job or what?


The stakeholders are the United States Departments of Immigration and Homeland Security and Departments of Education at national and local level. Other stakeholders include employers, in particular those seeking to fill positions as data scientists and data analysts, employment agencies and others involved in processing through immigrants.

Analytics Solution

It is assumed, anecdotally, that there is a shortage of data scientists and data analysts in America and those employers have a legitimate demand in seeking persons to fill these positions from outside the United States. However it is often an added advantage to a US employer to employ persons from abroad as their wage demands may be less. This may bring down the expected yearly wage for American workers which may have positive or negative repercussions. Some explorative statistical analysis and predictive modelling of immigration data for this subgroup of workers would be a useful start in building a picture of what is actually happening on the ground.

A Refined Statement of the Problem with Business Benefits

Building predictive models around the subject of immigration by data analysts and data scientists can help both employers and immigration official’s budget better for managing this task. This may benefit US employers in identifying what geographic part of the world they should concentrate their efforts most on to get ideal candidates, or what kind of visa prompts a higher or lower wage demand or gets the better candidate irrespective of the wage or how much the local prevailing wage per year impacts on attracting candidates from abroad. On the other hand it might inform the US administration that there is a need to increase spending on education that leads to more indigenous candidates in the future.

Analytics Problem Framing

Various Hypotheses for this might be on the lines of; ‘The more successful Greencard applications granted for data analysts, the higher the overall yearly wage paid to data analysts in the US’ or ‘The more successful Greencard applications there are granted for data analysts the more impact it has on lowering the overall yearly wage paid to data analysts in the US.’ We wish to investigate the hypothesis that there is a difference in paid wage, depending on type of visa card that the applicant has.

Problem Assumptions

It is assumed the population data follows a normal distribution and that the relationship between the independent variables and dependent variables will be somewhat linear. We assume a confidence level of 0.05 to denote a statistically significant variable.

Key Metrics of Success

We are looking for three star significant levels in the regression analysis to indicate the predictive power of each feature in the model. We are looking for a large negative or positive P value as a good predictor of a difference between our dependent variable and independent variables.

The Data

Data needs

In order to conduct this study it was necessary to obtain a reasonably up-to-date large subset of US Immigration data that would include the major visa categories and a subgroup of job types that are in short supply in the United States, such as data scientists and data analysts.

Data acquisition

We used a subset of the 2014 / 15 US Immigration Database consisting of 167,000 values and 26 variables downloaded from the US Government Department of Labour’s open data.

Visual exploration of the data

Excel and R Exploratory Analysis:

For both the exploration of the data in Excel and the model preparation in R it was first useful to view the columns in an Excel spreadsheet to get a clear idea of the columns / variables in the dataset.

Deep Analysis with R

We read the file called salary.csv into R and named the first dataframe, ‘salary’ with Factors as StringsAsFactors.

We looked at the structure which consisted of 167,278 objects (or Rows) and 27 variables (or Columns).

We then looked at a summary of the dependent variable ‘PAID WAGE PER YEAR’ on salary.

The minimum salary is ten and a half thousand while the maximum is two and a half million, (possibly a couple of outliers or mistakes in the data here).

Median is 78,600 and mean, 85,530. Because the mean value is greater than the median value we can predict that the distribution is right skewed. We can confirm this using a histogram.

It is indeed highly skewed to the right, suggesting that there are quite a few outliers in the data.

The large majority of yearly salaries are in the $10,500 to, circa, $100,000. Because linear regression assumes a normal distribution for the dependent variable, this distribution is not ideal. We will likely need to correct this later on.

Another problem at hand is that regression models require that every feature is numeric, yet we have twenty one factor types in our data frame. We are not interested in looking at all of these however. We can take a closer look at some of these factorized variables though using the table ( ) function.

Before fitting a regression model to data, it can be useful to determine how the independent variables are related to the dependent variable and each other. A correlation matrix provides a quick overview of these relationships. Given a set of variables, it provides a correlation for each pair-wise relationship. The syntax on the numeric variables that we are particularly interested in were highlighted below in blue with the output following:

The results of this matrix shows that ‘Prevailing Wage Submitted’ is correlated to Paid Wage per year at 0.44 and to Paid Wage Submitted at 0.58. Paid Wage Submitted is correlated very strongly to Paid Wage Per Year at 0.88.

We downloaded the ‘Psych’ Package and used the ‘Panels Scatterplot Matrix’ which gives a side by side visual view of correlations.

However this still wasn’t really very clear so we generated a more robust visual readout of the panels.

There is definitely a strong correlation between Paid Wage Submitted and Paid Wage Per Year and fairly strong correlations between the other pairings.

Harmonizing, rescaling and cleaning of the data

The initial visuals only looked at numeric variables but since we needed to examine some of the categorical variables alongside these strongly correlated numeric variables we converted our key categorical variables to factors before re-taking a look at the data using a Panels Scatterplot Matrix. Job title subgroup, country of citizenship and visa class were key categorical variables that needed to be converted to factors. The following figures show the input and readouts.

It was difficult to make out what one was seeing so we generated the Scatterplot matrix again as a visual as well as written reference.

Above the diagonal, the scatterplots have been replaced with a correlation matrix. On the diagonal, a histogram depicting the distribution of values for each feature is shown. Finally, the scatterplots below the diagonal now are presented with additional visual information.

The oval-shaped object on each scatterplot is a correlation ellipse. It provides a visualization of how strongly correlated the variables are. The dot at the center of the ellipse indicates the point of the mean value for the x axis variable and y axis variable. The correlation between the two variables is indicated by the shape of the ellipse; the more it is stretched, the stronger the correlation.

  • · Job Title Sub Group and Paid Wage Submitted showed no linear relationship at all (0.00) with perfectly round ellipse.

  • · Paid Wage per Year and Paid Wage Submitted showed the strongest correlation but the ellipse appeared round, but it’s linear suggesting a strong correlation? (0.88)

  • · Paid Wage Submitted and Prevailing Wage Submitted were fairly strongly positively correlated but with a round ellipse (0.58)

  • · Paid Wage Per Year and Prevailing Wage Submitted are the next highest positively correlated but with round ellipse (0.44)

  • · Job Title Sub Group and Prevailing Wage Submitted is next / positively correlated but with a round ellipse (0.17)

  • · Paid Wage Per Year is negatively correlated with Country of Citizenship, with a very oval ellipse (-0.15)

  • · Country of Citizenship and Paid Wage Submitted are negatively correlated but ellipse looks round (-0.15)

There was an unexplained discrepancy between R and RapidMiner in the first instance and things did not seem to tally very much with the other two either? Further matching should probably be conducted before any conclusions can be made.

It was obvious that the huge ranges in much of the variable data were not helping matters and there were a great deal of NA or noise, to the data set. So we removed NA values from one or two of the variables that had few NA’s and proceeded to normalise the other categorical variables of interest to us. And then conducted another; pairs.panels Scatterplot Matrix.

See below:

This has produced better, if not quite perfect, results.


Just a synopsis of the methodology.

Identification of potential problem solving approaches

We set out to conduct exploratory research using Microsoft Excel in the first instance to examine differences in paid wage per year among different visa class applications and country of citizenship of applicants. We separated the different visa classes from the rest of the data in Excel and later in R and performed an exploratory analysis.

Following on from this we cleaned and eventually constructed regression models to try to predict the impact certain criteria would have on wages paid to data analysts work visa applicants coming to America from abroad.

Selection of software tools

Excel Methodology

The research methods used with Excel were basic analysis with descriptive statistics followed by use of the ‘Corel’ function to establish correlations between variables.

R and R-Studio Methodology

We conducted a preliminary examination of the US Immigration dataset using R and R-Studio. After cleaning the data we examined quite a number of variables pertaining to our research question, to see if any were correlated with each other. Picking out those with strong correlations we conducted multiple regression analysis. We used the results from this to build [visa class] models for measuring the impact of local (or prevailing) wage and visa class on the paid wage per year for a ‘data analyst’ job in America.

Test approaches

We produced a generalized linear regression with Prevailing_Wage_Per_Year and Paid_Wage_Per_Year after normalising the categorical data and later compared this to our subset group for ‘Data Analysts’ with Australian and then Greencard (in R) for applicants respectively.

The plot below shows the output for this.

The points are all crowding around the bottom left corner with points above and below the line of best fit with many outliers and noise but despite this there is a definite visual suggestion of correlation here nonetheless.

Select approaches

We considered different algorithms for prediction in R but we felt regression was the best starting point.

Further cleaning of the data was attempted to try and improve the model in R but it could not be completed in the available time. However further recommendations were put forward to try and improve upon the model if further time allowed, see below.

Model Building

Before building any models we created a new dataframe called ‘salarynn’ to preserve the original dataframe. The purpose of building these models is to aid prediction of wage fluctuations for specific job types and to identify from which geographic area of the world specific job types are most likely to originate from and why.

Model structures

First we felt it prudent to look at all the original variables in one model to see what influence the original subgroup of job types and visa classes had on the dependent variable before pulling out ‘data analyst’ and specific visa class applicants to build specific models for these criteria.

Running and evaluating the model

The output of the first general model looked at Job_Title_Subgroup, Prevailing_Wage_Submitted and Prevailing_Wage_Per_Year on the dependent variable:

As a group, 'data analysts’ (job type) have a weighting of 75% on changes in Paid Wage Per Year for this subgroup when taken together.

We next looked at ‘data analyst’ in relation to VISA_CLASS applicants.

The following table sets out the interpreted results of this model:

However the goal of the analysis was to build a model that could predict for ‘data analysts’ and the different visa classes as this would aid in the identification of the origin of visa applicants. To this end a subset of the dataframe was created to isolate “data analysts” using the following syntax:

salary_an3 <- subset(salarynn, salarynn$JOB_TITLE_SUBGROUP == "data analyst")

We then created a subset of the dataframe (salary_an3) to isolate each visa class individually to be assessed in different models with “data analyst”.

However for H-1B1 Singapore Visas and H-1B1 Chile Visas there were missing values so the models would not work. A possible solution might be to isolate Singapore immigrants from Country_of_Citizenship and attempt a model on this and ‘Greencard’ to see if any ‘data analysts’ from Singapore come in on a ‘Greencard’.

For the E-3 Australian and Greencard Visa Type there were no NA (missing value) issues so models did work for these. The syntax below shows how prevailing wage per year impacts on paid wage per year on the isolated ‘data analyst’ / Visa Class criteria.