David Nunan
- Jan 27, 2022
- 12 min read

An Exercise in Regression Modelling Using R and R Studio

Research Question / Problem to be addressed:

Let us imagine that the United States Department of Employment and Department of Education would likely to know if employers are bringing in foreign workers for jobs where skills are in short supply solely because employers can’t fill these positions with US persons or because they can pay less in wages to these workers. US employers too would like to know if they are seeking skilled workers from the right geographic areas of the world. What carries most weighting in terms of final paid yearly wage to Data Analysts, is it ‘visa class’, country of origin, the prevailing local wage for that particular job or what?

Stakeholders

The stakeholders are the United States Departments of Immigration and Homeland Security and Departments of Education at national and local level. Other stakeholders include employers, in particular those seeking to fill positions as data scientists and data analysts, employment agencies and others involved in processing through immigrants.

Analytics Solution

It is assumed, anecdotally, that there is a shortage of data scientists and data analysts in America and those employers have a legitimate demand in seeking persons to fill these positions from outside the United States. However it is often an added advantage to a US employer to employ persons from abroad as their wage demands may be less. This may bring down the expected yearly wage for American workers which may have positive or negative repercussions. Some explorative statistical analysis and predictive modelling of immigration data for this subgroup of workers would be a useful start in building a picture of what is actually happening on the ground.

A Refined Statement of the Problem with Business Benefits

Building predictive models around the subject of immigration by data analysts and data scientists can help both employers and immigration official’s budget better for managing this task. This may benefit US employers in identifying what geographic part of the world they should concentrate their efforts most on to get ideal candidates, or what kind of visa prompts a higher or lower wage demand or gets the better candidate irrespective of the wage or how much the local prevailing wage per year impacts on attracting candidates from abroad. On the other hand it might inform the US administration that there is a need to increase spending on education that leads to more indigenous candidates in the future.

Analytics Problem Framing

Various Hypotheses for this might be on the lines of; ‘The more successful Greencard applications granted for data analysts, the higher the overall yearly wage paid to data analysts in the US’ or ‘The more successful Greencard applications there are granted for data analysts the more impact it has on lowering the overall yearly wage paid to data analysts in the US.’ We wish to investigate the hypothesis that there is a difference in paid wage, depending on type of visa card that the applicant has.

Problem Assumptions

It is assumed the population data follows a normal distribution and that the relationship between the independent variables and dependent variables will be somewhat linear. We assume a confidence level of 0.05 to denote a statistically significant variable.

Key Metrics of Success

We are looking for three star significant levels in the regression analysis to indicate the predictive power of each feature in the model. We are looking for a large negative or positive P value as a good predictor of a difference between our dependent variable and independent variables.

The Data

Data needs

In order to conduct this study it was necessary to obtain a reasonably up-to-date large subset of US Immigration data that would include the major visa categories and a subgroup of job types that are in short supply in the United States, such as data scientists and data analysts.

Data acquisition

We used a subset of the 2014 / 15 US Immigration Database consisting of 167,000 values and 26 variables downloaded from the US Government Department of Labour’s open data.

Visual exploration of the data

Excel and R Exploratory Analysis:

For both the exploration of the data in Excel and the model preparation in R it was first useful to view the columns in an Excel spreadsheet to get a clear idea of the columns / variables in the dataset.

Deep Analysis with R

We read the file called salary.csv into R and named the first dataframe, ‘salary’ with Factors as StringsAsFactors.

We looked at the structure which consisted of 167,278 objects (or Rows) and 27 variables (or Columns).

We then looked at a summary of the dependent variable ‘PAID WAGE PER YEAR’ on salary.

The minimum salary is ten and a half thousand while the maximum is two and a half million, (possibly a couple of outliers or mistakes in the data here).

Median is 78,600 and mean, 85,530. Because the mean value is greater than the median value we can predict that the distribution is right skewed. We can confirm this using a histogram.

It is indeed highly skewed to the right, suggesting that there are quite a few outliers in the data.

The large majority of yearly salaries are in the $10,500 to, circa, $100,000. Because linear regression assumes a normal distribution for the dependent variable, this distribution is not ideal. We will likely need to correct this later on.

Another problem at hand is that regression models require that every feature is numeric, yet we have twenty one factor types in our data frame. We are not interested in looking at all of these however. We can take a closer look at some of these factorized variables though using the table ( ) function.

Before fitting a regression model to data, it can be useful to determine how the independent variables are related to the dependent variable and each other. A correlation matrix provides a quick overview of these relationships. Given a set of variables, it provides a correlation for each pair-wise relationship. The syntax on the numeric variables that we are particularly interested in were highlighted below in blue with the output following:

The results of this matrix shows that ‘Prevailing Wage Submitted’ is correlated to Paid Wage per year at 0.44 and to Paid Wage Submitted at 0.58. Paid Wage Submitted is correlated very strongly to Paid Wage Per Year at 0.88.

We downloaded the ‘Psych’ Package and used the ‘Panels Scatterplot Matrix’ which gives a side by side visual view of correlations.

However this still wasn’t really very clear so we generated a more robust visual readout of the panels.

There is definitely a strong correlation between Paid Wage Submitted and Paid Wage Per Year and fairly strong correlations between the other pairings.

Harmonizing, rescaling and cleaning of the data

The initial visuals only looked at numeric variables but since we needed to examine some of the categorical variables alongside these strongly correlated numeric variables we converted our key categorical variables to factors before re-taking a look at the data using a Panels Scatterplot Matrix. Job title subgroup, country of citizenship and visa class were key categorical variables that needed to be converted to factors. The following figures show the input and readouts.

It was difficult to make out what one was seeing so we generated the Scatterplot matrix again as a visual as well as written reference.

Above the diagonal, the scatterplots have been replaced with a correlation matrix. On the diagonal, a histogram depicting the distribution of values for each feature is shown. Finally, the scatterplots below the diagonal now are presented with additional visual information.

The oval-shaped object on each scatterplot is a correlation ellipse. It provides a visualization of how strongly correlated the variables are. The dot at the center of the ellipse indicates the point of the mean value for the x axis variable and y axis variable. The correlation between the two variables is indicated by the shape of the ellipse; the more it is stretched, the stronger the correlation.

Interpretation:

· Job Title Sub Group and Paid Wage Submitted showed no linear relationship at all (0.00) with perfectly round ellipse.
· Paid Wage per Year and Paid Wage Submitted showed the strongest correlation but the ellipse appeared round, but it’s linear suggesting a strong correlation? (0.88)
· Paid Wage Submitted and Prevailing Wage Submitted were fairly strongly positively correlated but with a round ellipse (0.58)
· Paid Wage Per Year and Prevailing Wage Submitted are the next highest positively correlated but with round ellipse (0.44)
· Job Title Sub Group and Prevailing Wage Submitted is next / positively correlated but with a round ellipse (0.17)
· Paid Wage Per Year is negatively correlated with Country of Citizenship, with a very oval ellipse (-0.15)
· Country of Citizenship and Paid Wage Submitted are negatively correlated but ellipse looks round (-0.15)

There was an unexplained discrepancy between R and RapidMiner in the first instance and things did not seem to tally very much with the other two either? Further matching should probably be conducted before any conclusions can be made.

It was obvious that the huge ranges in much of the variable data were not helping matters and there were a great deal of NA or noise, to the data set. So we removed NA values from one or two of the variables that had few NA’s and proceeded to normalise the other categorical variables of interest to us. And then conducted another; pairs.panels Scatterplot Matrix.

See below:

This has produced better, if not quite perfect, results.

Methodology

Just a synopsis of the methodology.

Identification of potential problem solving approaches

We set out to conduct exploratory research using Microsoft Excel in the first instance to examine differences in paid wage per year among different visa class applications and country of citizenship of applicants. We separated the different visa classes from the rest of the data in Excel and later in R and performed an exploratory analysis.

Following on from this we cleaned and eventually constructed regression models to try to predict the impact certain criteria would have on wages paid to data analysts work visa applicants coming to America from abroad.

Selection of software tools

Excel Methodology

The research methods used with Excel were basic analysis with descriptive statistics followed by use of the ‘Corel’ function to establish correlations between variables.

R and R-Studio Methodology

We conducted a preliminary examination of the US Immigration dataset using R and R-Studio. After cleaning the data we examined quite a number of variables pertaining to our research question, to see if any were correlated with each other. Picking out those with strong correlations we conducted multiple regression analysis. We used the results from this to build [visa class] models for measuring the impact of local (or prevailing) wage and visa class on the paid wage per year for a ‘data analyst’ job in America.

Test approaches

We produced a generalized linear regression with Prevailing_Wage_Per_Year and Paid_Wage_Per_Year after normalising the categorical data and later compared this to our subset group for ‘Data Analysts’ with Australian and then Greencard (in R) for applicants respectively.

The plot below shows the output for this.

The points are all crowding around the bottom left corner with points above and below the line of best fit with many outliers and noise but despite this there is a definite visual suggestion of correlation here nonetheless.

Select approaches

We considered different algorithms for prediction in R but we felt regression was the best starting point.

Further cleaning of the data was attempted to try and improve the model in R but it could not be completed in the available time. However further recommendations were put forward to try and improve upon the model if further time allowed, see below.

Model Building

Before building any models we created a new dataframe called ‘salarynn’ to preserve the original dataframe. The purpose of building these models is to aid prediction of wage fluctuations for specific job types and to identify from which geographic area of the world specific job types are most likely to originate from and why.

Model structures

First we felt it prudent to look at all the original variables in one model to see what influence the original subgroup of job types and visa classes had on the dependent variable before pulling out ‘data analyst’ and specific visa class applicants to build specific models for these criteria.

Running and evaluating the model

The output of the first general model looked at Job_Title_Subgroup, Prevailing_Wage_Submitted and Prevailing_Wage_Per_Year on the dependent variable:

As a group, 'data analysts’ (job type) have a weighting of 75% on changes in Paid Wage Per Year for this subgroup when taken together.

We next looked at ‘data analyst’ in relation to VISA_CLASS applicants.

The following table sets out the interpreted results of this model:

However the goal of the analysis was to build a model that could predict for ‘data analysts’ and the different visa classes as this would aid in the identification of the origin of visa applicants. To this end a subset of the dataframe was created to isolate “data analysts” using the following syntax:

salary_an3 <- subset(salarynn, salarynn$JOB_TITLE_SUBGROUP == "data analyst")

We then created a subset of the dataframe (salary_an3) to isolate each visa class individually to be assessed in different models with “data analyst”.

However for H-1B1 Singapore Visas and H-1B1 Chile Visas there were missing values so the models would not work. A possible solution might be to isolate Singapore immigrants from Country_of_Citizenship and attempt a model on this and ‘Greencard’ to see if any ‘data analysts’ from Singapore come in on a ‘Greencard’.

For the E-3 Australian and Greencard Visa Type there were no NA (missing value) issues so models did work for these. The syntax below shows how prevailing wage per year impacts on paid wage per year on the isolated ‘data analyst’ / Visa Class criteria.

And the output for each of those was as follows. For Australian ‘Data Analyst’ applicants the results were as follows:

There was a local wage per year weighting of 66% for Australian visa class, data analyst applicants on the actual Paid Wage Per Year.

For Greencard holder data analyst applicants, the results were as follows:

Sixty one percent (61%) of prevailing (local) wage per year contributed to the ‘Paid Wage Per Year.

Calibrating models and data

The parameter estimates we obtained by typing ‘salarynn_an_oz’ tell us about how the independent variables are related to the dependent variable, but they tell us nothing about how well the model fits our data. To evaluate the model performance, we used the summary() command on the stored model:

The summary() output may seem confusing at first, but the basics are easy to pick up. As indicated by the numbered labels in the preceding output, the output provides three key ways to evaluate the performance (that is, fit) of our model:

1. The Residuals section provides summary statistics for the errors in our predictions, of which they are apparently quite small. Since a residual is equal to the true value minus the predicted value, the maximum error of 0.91158 for Australian Visa suggests that the model under-predicted paid wage per year by merely $1.00 for at least one observation. On the other hand, errors falling within the 1Q and 3Q values (the first and third quartile), were between $0.15 over the true value and $0.08 under the true value for Australian. For Greencard the maximum error of $1.67 for Greencard Visa for at least one observation is still quite small. Errors falling within the 1Q and 3Q values (the first and third quartile), were between $0.16 over the true value and $0.09 under the true value for Greencard.

2. The stars (for example, ***) indicate the predictive power of each feature in the model. The significance level (as listed by the Signif. codes in the footer) provides a measure of how likely the true coefficient is zero given the value of the estimate. The presence of three stars indicates a significance level of 0, which means that the feature is extremely unlikely to be unrelated to the dependent variable. A common practice is to use a significance level of 0.05 to denote a statistically significant variable. Here, our model has a significant intercept and significant variable, and they seem to be related to the outcome in a logical way.

3. The Multiple R-squared value (also called the coefficient of determination) provides a measure of how well our model as a whole explains the values of the dependent variable. It is similar to the correlation coefficient in that the closer the value is to 1.0, the better the model perfectly explains the data. Since the R-squared value is 0.8577, for Australian and 0.6176 for Greencard we can say that nearly 86 percent of the variation in the dependent variable for Australian and 62 percent of Greencard is explained by our model. Because models with more features always explain more variation, the Adjusted R-squared value corrects R-squared by penalizing models with a large number of independent variables. It is useful for comparing the performance of models with different numbers of explanatory variables.

Finally we plotted the same variable x and y data that we had done earlier as part of our test on the variables; PAID_WAGE_PER_YEAR as dependent upon PREVAILING_WAGE_PER_YEAR and obtained the following output from Australian Visa applicants and then Greencard applicants for Data Analysts.

The figure shows Australian Data Analysts Applicants first:

The second figure below shows these two variables plotted for Greencard Data Analysts Applicants:

The line of best fit has improved for both but it does highlight how the outliers are skewing the data at present. These would have to be dealt with any attempt at further improvement.

Documented findings with assumptions, limitations and constraints

Recommendations for improving the model.

It should be noted that there were over 156,000 missing values from many of the factorized categorical variables that we wished to build a model such as Country_of Citizenship, however when these were taken out after isolating for ‘data analyst’ and different Visa_Classes there were no ‘row values’ left to analyse. A possible solution might be to create a new column / variable for Singapore immigrants and combine Greencard and Singapore Visa types together and analyse ‘data analysts’ in this way or failing that a variation of this. The same might be done for Chile Visa Type with or without Greencard and Country of Citizenship with or without other South American countries until we have a reasonable subgroup to infer meaningful results.

Deployment

We would need to reconfigure the data so that R can process the data correctly. Unfortunately due to time constraints it is not possible at the moment.

Business validation of the model

The two to three models would suggest that they are a good predictor but much more needs to be done before final conclusions can be made. Other algorithms with the use of tools like Rapidminer might give a clearer picture.

Report with findings and recommendations for deployment

Given the preceding three performance indicators, our model is performing well. It is not uncommon for regression models of real-world data to have fairly low R-squared values; a value of 0.86 is actually quite good. However, we acknowledge that we have not paid enough attention to outliers, it has been suggested, for example that some of the Paid Wage Per Year values are so large that these are mistakes and could reasonably be removed from the data. With more time the models could and should be examined with careful concern for minimizing outliers. A visual graph of the model should be looked at before removing outliers

A key difference between regression modeling and other machine learning approaches is that regression typically leaves feature selection and model specification to the user. Consequently, if we have subject matter knowledge about how a feature is related to the outcome, we can use this information to inform the model specification and potentially improve the model's performance.

The model is being impaired by noise in the data. The data would need to be thoroughly cleaned however this would introduce imbalances within the data and skewed results as was tried with the variable Country_of_Citizenship so compromises need to be found.