An Exercise in Regression Modelling Using R and RStudio
Research Question / Problem to be addressed:
Let us imagine that the United States Department of Employment and Department of Education would like to know whether employers bring in foreign workers for jobs where skills are in short supply solely because they cannot fill these positions with US persons, or because they can pay these workers lower wages. US employers would also like to know whether they are seeking skilled workers from the right geographic areas of the world. Which factor carries the most weight in the final yearly wage paid to data analysts: ‘visa class’, country of origin, the prevailing local wage for that particular job, or something else?
The stakeholders are the United States Departments of Immigration and Homeland Security and Departments of Education at national and local level. Other stakeholders include employers, in particular those seeking to fill positions as data scientists and data analysts, employment agencies and others involved in processing immigrants.
It is assumed, anecdotally, that there is a shortage of data scientists and data analysts in America and that employers have a legitimate need to fill these positions with persons from outside the United States. However, employing persons from abroad is often an added advantage to a US employer, as their wage demands may be lower. This may bring down the expected yearly wage for American workers, which may have positive or negative repercussions. Some exploratory statistical analysis and predictive modelling of immigration data for this subgroup of workers would be a useful start in building a picture of what is actually happening on the ground.
A Refined Statement of the Problem with Business Benefits
Building predictive models around the immigration of data analysts and data scientists can help both employers and immigration officials budget better for managing this task. It may help US employers identify which geographic parts of the world they should concentrate their efforts on to find ideal candidates, which kind of visa prompts a higher or lower wage demand or attracts the better candidate irrespective of wage, and how much the local prevailing yearly wage affects the ability to attract candidates from abroad. On the other hand, it might inform the US administration that there is a need to increase spending on education so that more home-grown candidates are available in the future.
Analytics Problem Framing
Hypotheses for this problem might be along the lines of: ‘The more green card applications granted for data analysts, the higher the overall yearly wage paid to data analysts in the US’ or ‘The more green card applications granted for data analysts, the lower the overall yearly wage paid to data analysts in the US.’ We wish to investigate the hypothesis that paid wage differs depending on the visa class that the applicant holds.
It is assumed that the population data follows a normal distribution and that the relationship between the independent variables and the dependent variable will be approximately linear. We use a significance level of 0.05 (a 95% confidence level) to denote a statistically significant variable.
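The hypothesis of a wage difference between visa classes could be tested along the following lines. This is a minimal sketch on simulated data, not the analysis carried out on the real file: the visa class labels, the group effects and the variable names `visa_class` and `paid_wage` are all invented for illustration.

```r
# Simulated data only: visa classes, effects and wages below are invented.
set.seed(42)
n <- 300
visa_class <- factor(sample(c("H-1B", "greencard", "E-3"), n, replace = TRUE))

# Hypothetical shift in yearly wage (US dollars) for each visa class
effect <- c("E-3" = 0, "greencard" = 5000, "H-1B" = -3000)
paid_wage <- 80000 + unname(effect[as.character(visa_class)]) +
  rnorm(n, sd = 10000)
wages <- data.frame(visa_class, paid_wage)

# Null hypothesis: mean paid wage is the same for every visa class.
# A linear model with a factor predictor is equivalent to a one-way ANOVA.
fit <- lm(paid_wage ~ visa_class, data = wages)
anova_table <- anova(fit)
print(anova_table)

# p-value of the overall F test for visa class
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value < 0.05)  # TRUE would mean rejecting the null at the 0.05 level
```

A small p-value here only says that at least one visa class differs in mean wage; the individual coefficients in `summary(fit)` show which classes differ from the baseline level.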
Key Metrics of Success
We are looking for three-star significance levels (p < 0.001) in the regression output as evidence that a feature contributes to the model. We are looking for a large negative or positive t value, and a correspondingly small p-value, as an indication of a relationship between the dependent variable and an independent variable.
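To make the success metric concrete, the sketch below shows where these quantities appear in R's regression output. It uses simulated data; `prevailing_wage`, `noise_feature` and the coefficients used to generate `paid_wage` are assumptions made up for the example.

```r
# Simulated data only: predictors and coefficients are invented.
set.seed(1)
n <- 200
prevailing_wage <- rnorm(n, mean = 75000, sd = 12000)  # hypothetical predictor
noise_feature   <- rnorm(n)                            # unrelated to the wage
paid_wage <- 5000 + 1.05 * prevailing_wage + rnorm(n, sd = 8000)

fit <- lm(paid_wage ~ prevailing_wage + noise_feature)

# Coefficient table: Estimate, Std. Error, t value and Pr(>|t|)
coefs <- coef(summary(fit))
print(coefs)

# print(summary(fit)) marks p < 0.001 with ***, p < 0.01 with ** and
# p < 0.05 with *. The same decision can be made by hand:
alpha <- 0.05
significant <- coefs[, "Pr(>|t|)"] < alpha
print(significant)
```

The stars are a shorthand for the p-value column, so a feature with a large absolute t value will also be the one that earns three stars.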
In order to conduct this study it was necessary to obtain a reasonably up-to-date large subset of US Immigration data that would include the major visa categories and a subgroup of job types that are in short supply in the United States, such as data scientists and data analysts.
We used a subset of the 2014/15 US Immigration Database consisting of 167,000 records and 26 variables, downloaded from the US Government Department of Labor’s open data.
Visual exploration of the data
Excel and R Exploratory Analysis:
Before exploring the data in Excel and preparing the model in R, it was useful to view the dataset in an Excel spreadsheet to get a clear idea of its columns (variables).
Deep Analysis with R
We read the file salary.csv into R and named the first data frame ‘salary’, using the stringsAsFactors option so that text columns are read in as factors.
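The import step might look like the sketch below. Because salary.csv is not reproduced here, the example first writes a tiny stand-in file; its rows and its column names (`VISA_CLASS`, `COUNTRY_OF_CITIZENSHIP`, `PAID_WAGE_PER_YEAR`) are assumptions for illustration, not the real file's contents.

```r
# Stand-in for salary.csv: invented rows and hypothetical column names.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "VISA_CLASS,COUNTRY_OF_CITIZENSHIP,PAID_WAGE_PER_YEAR",
  "H-1B,INDIA,78600",
  "greencard,CHINA,91000",
  "H-1B,IRELAND,66000"
), csv_path)

# stringsAsFactors = TRUE converts character columns to factors on import
salary <- read.csv(csv_path, stringsAsFactors = TRUE)
print(sapply(salary, class))
```

Setting the argument explicitly is worthwhile because its default changed to FALSE in R 4.0.0, so the same call can give different column types on different R versions.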
We looked at the structure, which consisted of 167,278 observations (rows) and 27 variables (columns).
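Inspecting the structure can be sketched as follows, assuming a small invented data frame in place of the real `salary` data (the column names are hypothetical).

```r
# Small invented data frame standing in for the real salary data
salary <- data.frame(
  VISA_CLASS = factor(c("H-1B", "greencard", "H-1B")),
  PAID_WAGE_PER_YEAR = c(78600, 91000, 66000)
)

# str() lists each column with its type and first few values
str(salary)

# dim() returns the number of rows followed by the number of columns
dims <- dim(salary)
print(dims)
```

On the real file, `str(salary)` is where the counts of observations and variables quoted above would be reported.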
We then looked at a summary of the dependent variable ‘PAID WAGE PER YEAR’ in the salary data frame.
The minimum salary is ten and a half thousand, while the maximum is two and a half million (possibly a couple of outliers or mistakes in the data).
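The summary step, and one way of flagging suspiciously large wages, could be sketched as below. The wages are simulated, one extreme value is planted on purpose, and the 1.5 x IQR fence (Tukey's rule) is an illustrative choice rather than the method used in this study.

```r
# Simulated yearly wages (US dollars) with one extreme value planted on purpose
set.seed(7)
paid_wage <- round(rlnorm(1000, meanlog = log(78000), sdlog = 0.35))
paid_wage[1] <- 2500000

# Minimum, quartiles, median, mean and maximum in one call
wage_summary <- summary(paid_wage)
print(wage_summary)

# Tukey's rule: flag values more than 1.5 * IQR above the third quartile
q <- quantile(paid_wage, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * IQR(paid_wage)
outliers <- paid_wage[paid_wage > upper_fence]
print(sort(outliers, decreasing = TRUE)[1:5])
```

Flagged values still need a manual look, since a very high wage may be genuine rather than a data entry mistake.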
The median is 78,600 and the mean is 85,530. Because the mean is greater than the median, we can expect the distribution to be right skewed. We can confirm this using a histogram.
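The comparison of mean and median, followed by the histogram, can be sketched as follows. The wages are simulated from a log-normal distribution chosen only because it is right skewed; the real data may have a different shape.

```r
# Simulated right-skewed yearly wages (US dollars); parameters are invented
set.seed(123)
paid_wage <- rlnorm(5000, meanlog = log(78000), sdlog = 0.4)

wage_mean   <- mean(paid_wage)
wage_median <- median(paid_wage)

# A mean above the median suggests a long right tail
right_skew_suggested <- wage_mean > wage_median
print(c(mean = wage_mean, median = wage_median))

# Histogram to confirm the shape visually
hist(paid_wage,
     breaks = 50,
     main = "Simulated paid wage per year",
     xlab = "Paid wage per year (USD)")
abline(v = c(wage_median, wage_mean), lty = c(1, 2))  # median solid, mean dashed
```

With a few extreme wages the bulk of the histogram is squeezed to the left, so plotting `log10(paid_wage)` is a common alternative view.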