Introduction

The employee attrition rate is an important factor for companies to monitor and manage, as high attrition can lead to significant financial losses and organizational disruption. To better understand the factors that contribute to employee attrition, IBM data scientists have created a fictional dataset that includes various employee attributes such as age, education, job role, monthly income, job satisfaction, and others. This dataset provides an opportunity for machine learning practitioners to develop models that can predict whether an employee is likely to leave the company (binary classification) and estimate their monthly income (regression).

In this project, I plan to develop two separate models to predict attrition and monthly income. I will explore techniques such as logistic regression, decision trees, random forests, and gradient boosting, and evaluate the models' performance using appropriate metrics. My goal is to provide insights into the factors most strongly associated with employee attrition, and to estimate typical salary levels based on employees' individual characteristics to see whether they are overpaid or underpaid.

Dataset

The dataset used in this project is a fictional dataset created by IBM data scientists. It contains 35 features that provide information about the employees, such as age, gender, education, job role, monthly income, job satisfaction, and performance rating, among others. This dataset is useful for exploring the factors that contribute to employee behavior. It also allows us to explore important questions, such as attrition rates and monthly incomes across different categories such as education, working years, etc.

Find dataset source here:

IBM HR Analytics Employee Attrition & Performance

Loading Libraries

library(tidyverse)    # data manipulation and plotting
library(patchwork)    # combining ggplot2 plots
library(ggplot2)      # loaded with tidyverse; kept for clarity
library(caret)        # model training, tuning and evaluation
library(ggcorrplot)   # correlation matrix plots
library(ggeffects)    # marginal-effect plots for regression models
library(MASS)         # modeling utilities such as stepAIC (masks dplyr::select)
library(pROC)         # ROC curves and AUC
library(rpart)        # decision trees
library(rpart.plot)   # decision tree plots
library(pdp)          # partial dependence plots

set.seed(9438)

Reading the data

setwd("C:/Users/usuario/Desktop/Master/Advanced Modelling/Final_project/")
HR = read_csv("Database/WA_Fn-UseC_-HR-Employee-Attrition.csv")

Variables:

  1. AGE: Age of Employee
  2. ATTRITION: Employee leaving the company (0=no, 1=yes)
  3. BUSINESS TRAVEL: (1=No Travel, 2=Travel Frequently, 3=Travel Rarely)
  4. DAILY RATE: Salary Level
  5. DEPARTMENT: (1=HR, 2=R&D, 3=Sales)
  6. DISTANCE FROM HOME: Distance from home to work in km
  7. EDUCATION: (1=Below College, 2=College, 3=Bachelor, 4=Master, 5=Doctor)
  8. EDUCATION FIELD: (1=HR, 2=Life Sciences, 3=Marketing, 4=Medical Sciences, 5=Others, 6=Technical)
  9. EMPLOYEE COUNT: Employee count
  10. EMPLOYEE NUMBER: Employee ID
  11. ENVIRONMENT SATISFACTION: Satisfaction with the environment (1=lowest, 4=highest)
  12. GENDER: (1=Female, 2=Male)
  13. HOURLY RATE: Hourly salary
  14. JOB INVOLVEMENT: Job Involvement (1=lowest, 4=highest)
  15. JOB LEVEL: Level of job (1=lowest, 4=highest)
  16. JOB ROLE: (1=HC Rep, 2=HR, 3=Lab Technician, 4=Manager, 5=Manufacturing Director, 6=Research Director, 7=Research Scientist, 8=Sales Executive, 9=Sales Representative)
  17. JOB SATISFACTION: Satisfaction with the job (1=lowest, 4=highest)
  18. MARITAL STATUS: (1=Divorced, 2=Married, 3=Single)
  19. MONTHLY INCOME: Monthly Salary
  20. MONTHLY RATE: Monthly salary
  21. NUMCOMPANIESWORKED: No. of companies worked at
  22. OVER 18: (1=yes, 2=no)
  23. OVERTIME: (1=no, 2=yes)
  24. PERCENT SALARY HIKE: Percentage increase in salary over the last 2 years
  25. PERFORMANCE RATING: Performance rating (1=lowest, 4=highest)
  26. RELATIONSHIP SATISFACTION: Relationship satisfaction (1=lowest, 4=highest)
  27. STANDARD HOURS: Standard hours worked
  28. STOCK OPTIONS LEVEL: Level of company stock options held
  29. TOTAL WORKING YEARS: Total years worked
  30. TRAINING TIMES LAST YEAR: Hours spent training last year
  31. WORK LIFE BALANCE: Balance between time spent at work and outside work (1=lowest, 4=highest)
  32. YEARS AT COMPANY: Total number of years at the company
  33. YEARS IN CURRENT ROLE: Years in current role
  34. YEARS SINCE LAST PROMOTION: Years since last promotion
  35. YEARS WITH CURRENT MANAGER: Years spent with current manager

Feature Engineering & Descriptive Statistics

To prepare the dataset for the analysis, I conducted feature engineering by dropping non-relevant variables that are unlikely to influence the employee attrition rate and monthly income. I also removed variables with only one value, as well as those whose meaning was unclear and that were not correlated with the dependent variables. There were no missing values (NAs). I also converted character variables to factors so that the models treat them as categories. This process ensured that the dataset was ready for analysis and could be used to develop accurate and robust models.

HR = HR %>%
  mutate(
    # Drop uninformative or constant columns
    HourlyRate = NULL,
    DailyRate = NULL,
    MonthlyRate = NULL,
    Over18 = NULL,
    StandardHours = NULL,
    EmployeeCount = NULL,
    EmployeeNumber = NULL,
    # Convert categorical variables to factors
    Education = as.factor(Education),
    EnvironmentSatisfaction = as.factor(EnvironmentSatisfaction),
    JobInvolvement = as.factor(JobInvolvement),
    JobSatisfaction = as.factor(JobSatisfaction),
    PerformanceRating = as.factor(PerformanceRating),
    RelationshipSatisfaction = as.factor(RelationshipSatisfaction),
    WorkLifeBalance = as.factor(WorkLifeBalance),
    Attrition = as.factor(Attrition),
    BusinessTravel = as.factor(BusinessTravel),
    Department = as.factor(Department),
    EducationField = as.factor(EducationField),
    Gender = as.factor(Gender),
    JobRole = as.factor(JobRole),
    MaritalStatus = as.factor(MaritalStatus),
    OverTime = as.factor(OverTime),
    JobLevel = as.factor(JobLevel)
  )

This next part of the code conducts a basic descriptive analysis of the HR dataset by computing the averages of key variables: the mean values of PercentSalaryHike, Age, MonthlyIncome, and YearsAtCompany. This information can be used to gain insights into the central tendencies of these variables.

Descriptive = HR %>%
  summarize(PercentSalaryHike = mean(PercentSalaryHike),
            Age = mean(Age),
            MonthlyIncome = mean(MonthlyIncome),
            YearsAtCompany = mean(YearsAtCompany))

Descriptive
## # A tibble: 1 × 4
##   PercentSalaryHike   Age MonthlyIncome YearsAtCompany
##               <dbl> <dbl>         <dbl>          <dbl>
## 1              15.2  36.9         6503.           7.01

Plots

This next section generates density plots and bar charts to visualize the distribution of different variables in the HR dataset.

The first four plots are density plots for PercentSalaryHike, MonthlyIncome, YearsAtCompany, and Age, respectively. Each shows the distribution of the corresponding variable; the fill colors simply distinguish the panels.

The fifth plot is a bar chart of Attrition, showing the count of employees who left the company (Attrition=Yes) and those who stayed (Attrition=No).

The sixth plot is a bar chart of JobSatisfaction, showing the count of employees at each level of job satisfaction.

t2 = ggplot(HR, aes(x=PercentSalaryHike)) + geom_density(fill = "darkred")
t3 = ggplot(HR, aes(x=MonthlyIncome)) + geom_density(fill = "green")
t4 = ggplot(HR, aes(x=YearsAtCompany)) + geom_density(fill = "purple")
t5 = ggplot(HR, aes(x=Age)) + geom_density(fill = "orange")
t6 = ggplot(HR, aes(x = Attrition)) +
  geom_bar(color = "black", fill = "lightblue")   # counts of a categorical variable; binwidth does not apply here
t7 = ggplot(HR, aes(x = JobSatisfaction)) +
  geom_bar(color = "black", fill = "brown")
  
t2 + t3 + t4 + t5 + t6 + t7

The distribution of most continuous variables appears to be right-skewed, while Age has a roughly normal distribution between 20 and 60 years. It seems that the majority of employees have not left the company, which could be a positive indicator for the organization. There are numerous instances of lower Job Satisfaction ratings (1 or 2), which may raise concerns about employee retention and job satisfaction within the company.
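
A quick numeric check of the skewness claim (when the mean sits above the median, the distribution is right-skewed). This snippet is illustrative and was not part of the original analysis:

HR %>%
  summarize(across(c(PercentSalaryHike, MonthlyIncome, YearsAtCompany, Age),
                   list(mean = mean, median = median)))
# For right-skewed variables the *_mean columns exceed the *_median columns.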

Correlations:

In this section, we explore the correlations between the numeric variables in the HR dataset, focusing on interpreting Monthly Income.

HR_numeric = HR %>% 
  dplyr::select(where(is.numeric))   # dplyr:: prefix needed because MASS masks select()



corrdata <- cor(HR_numeric)
ggcorrplot(corrdata, title = "Correlation matrix of the data", lab = TRUE)

The correlation analysis indicates that Monthly Income is significantly correlated with most of the numeric variables, which is a valuable finding for our analysis. Notably, Monthly Income shows a strong positive correlation with Age, Working Years, and Years at Company, which is to be expected given that these variables are likely to impact an employee’s income.

Data splitting

In this part of the code, the dataset is being split into two parts, one for training the model and the other for testing it. The split is done in a 70-30 ratio, where 70% of the data is used for training the model, and 30% is used for testing its performance.

in_train <- createDataPartition(HR$Attrition, p = 0.70, list = FALSE)  # stratified on Attrition, preserving the class ratio
training <- HR[ in_train,]
testing <- HR[-in_train,]
nrow(training)
## [1] 1030
nrow(testing)
## [1] 440

Classification

The classification section aims to predict the variable of Attrition, which represents whether an employee left the company or not. It is important to note that the data is imbalanced, with a majority of “no” cases (i.e., employees who did not leave the company) and only a few “yes” cases (i.e., employees who left the company).

Imbalanced plot:

HR %>% 
ggplot(aes(x = Attrition)) +
  geom_bar(aes(fill= Attrition)) +
  labs(x="Attrition", y = "", title = "Distribution of Attrition in the Sample") +
  scale_x_discrete(labels = c("Stayed", "Left")) +
  guides(fill = "none") +
  theme_light()
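
One common way to mitigate this imbalance, not used here (a tuned probability threshold is applied later instead), is to resample inside cross-validation. A hedged sketch using caret's sampling option:

ctrl_down <- trainControl(method = "repeatedcv",
                          number = 5,
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary,
                          sampling = "down")   # down-sample the majority class within each fold
# Passing ctrl_down as trControl to train() would balance the classes during resampling.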

Logistic regression

Because we are dealing with binary classification, we can use the standard glm function in R for the logit model and use the AUC to evaluate it.

First, a model with all variables:

logit.model = glm(Attrition ~ ., family = binomial(link = "logit"), data = training)
summary(logit.model)
## 
## Call:
## glm(formula = Attrition ~ ., family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7983  -0.4423  -0.1788  -0.0456   3.6397  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -1.221e+01  7.397e+02  -0.017 0.986830    
## Age                              -2.419e-02  1.669e-02  -1.449 0.147201    
## BusinessTravelTravel_Frequently   2.773e+00  5.874e-01   4.720 2.36e-06 ***
## BusinessTravelTravel_Rarely       1.542e+00  5.366e-01   2.874 0.004048 ** 
## DepartmentResearch & Development  1.544e+01  7.397e+02   0.021 0.983344    
## DepartmentSales                   1.471e+01  7.397e+02   0.020 0.984133    
## DistanceFromHome                  6.897e-02  1.400e-02   4.926 8.41e-07 ***
## Education2                        3.422e-01  4.282e-01   0.799 0.424132    
## Education3                        4.457e-01  3.788e-01   1.177 0.239307    
## Education4                        1.883e-01  4.174e-01   0.451 0.651843    
## Education5                        7.256e-01  6.791e-01   1.068 0.285354    
## EducationFieldLife Sciences      -1.337e+00  9.553e-01  -1.399 0.161733    
## EducationFieldMarketing          -7.831e-01  1.018e+00  -0.769 0.441717    
## EducationFieldMedical            -8.572e-01  9.507e-01  -0.902 0.367278    
## EducationFieldOther              -9.097e-01  1.048e+00  -0.868 0.385285    
## EducationFieldTechnical Degree    4.296e-01  9.621e-01   0.447 0.655184    
## EnvironmentSatisfaction2         -1.533e+00  3.570e-01  -4.294 1.76e-05 ***
## EnvironmentSatisfaction3         -1.511e+00  3.282e-01  -4.603 4.16e-06 ***
## EnvironmentSatisfaction4         -1.973e+00  3.448e-01  -5.721 1.06e-08 ***
## GenderMale                        3.254e-01  2.374e-01   1.371 0.170449    
## JobInvolvement2                  -1.235e+00  4.346e-01  -2.842 0.004482 ** 
## JobInvolvement3                  -1.705e+00  4.164e-01  -4.096 4.20e-05 ***
## JobInvolvement4                  -2.290e+00  5.867e-01  -3.903 9.49e-05 ***
## JobLevel2                        -1.822e+00  6.050e-01  -3.012 0.002593 ** 
## JobLevel3                         5.742e-01  9.454e-01   0.607 0.543628    
## JobLevel4                        -6.035e-01  1.567e+00  -0.385 0.700206    
## JobLevel5                         2.954e+00  2.036e+00   1.451 0.146837    
## JobRoleHuman Resources            1.594e+01  7.397e+02   0.022 0.982804    
## JobRoleLaboratory Technician      2.302e-01  7.454e-01   0.309 0.757421    
## JobRoleManager                    5.449e-01  1.171e+00   0.465 0.641579    
## JobRoleManufacturing Director     6.886e-01  6.474e-01   1.064 0.287474    
## JobRoleResearch Director         -2.715e+00  1.432e+00  -1.896 0.057920 .  
## JobRoleResearch Scientist        -7.151e-01  7.710e-01  -0.927 0.353712    
## JobRoleSales Executive            2.203e+00  1.442e+00   1.527 0.126653    
## JobRoleSales Representative       1.588e+00  1.542e+00   1.030 0.302931    
## JobSatisfaction2                 -1.051e+00  3.646e-01  -2.883 0.003934 ** 
## JobSatisfaction3                 -8.813e-01  3.147e-01  -2.801 0.005097 ** 
## JobSatisfaction4                 -1.501e+00  3.347e-01  -4.483 7.36e-06 ***
## MaritalStatusMarried              5.387e-01  3.420e-01   1.575 0.115180    
## MaritalStatusSingle               1.297e+00  4.437e-01   2.923 0.003462 ** 
## MonthlyIncome                    -1.862e-04  1.254e-04  -1.485 0.137541    
## NumCompaniesWorked                2.190e-01  5.100e-02   4.295 1.75e-05 ***
## OverTimeYes                       2.155e+00  2.555e-01   8.436  < 2e-16 ***
## PercentSalaryHike                 4.456e-03  5.032e-02   0.089 0.929434    
## PerformanceRating4               -1.962e-01  5.192e-01  -0.378 0.705539    
## RelationshipSatisfaction2        -1.067e+00  3.723e-01  -2.866 0.004162 ** 
## RelationshipSatisfaction3        -1.129e+00  3.259e-01  -3.466 0.000528 ***
## RelationshipSatisfaction4        -1.077e+00  3.169e-01  -3.400 0.000674 ***
## StockOptionLevel                 -1.830e-01  1.930e-01  -0.948 0.342993    
## TotalWorkingYears                -7.790e-02  3.712e-02  -2.098 0.035871 *  
## TrainingTimesLastYear            -1.907e-01  9.218e-02  -2.068 0.038593 *  
## WorkLifeBalance2                 -1.271e+00  4.625e-01  -2.748 0.005993 ** 
## WorkLifeBalance3                 -1.775e+00  4.349e-01  -4.080 4.50e-05 ***
## WorkLifeBalance4                 -1.353e+00  5.242e-01  -2.580 0.009866 ** 
## YearsAtCompany                    1.103e-01  5.069e-02   2.175 0.029622 *  
## YearsInCurrentRole               -9.279e-02  6.257e-02  -1.483 0.138051    
## YearsSinceLastPromotion           1.872e-01  5.336e-02   3.508 0.000452 ***
## YearsWithCurrManager             -1.739e-01  6.159e-02  -2.824 0.004745 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 909.69  on 1029  degrees of freedom
## Residual deviance: 535.81  on  972  degrees of freedom
## AIC: 651.81
## 
## Number of Fisher Scoring iterations: 15

Interpretation

This logistic regression model suggests that Monthly Income is not a relevant predictor of Attrition, whereas variables related to the working environment, such as Distance from Home, Business Travel, Job Involvement, and Job Satisfaction, seem to be relevant. Variables related to position or salary do not appear to be as relevant.

However, it is important to note that the model may be over-fitted, which means that it may be too complex and may not generalize well to new data. To address this issue, it may be necessary to remove non-relevant variables and build a new model.
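
One way to automate that pruning is stepwise selection with stepAIC from the already-loaded MASS package. A hedged sketch, not run in the original analysis:

logit.step <- stepAIC(logit.model, direction = "backward", trace = FALSE)  # drop terms while AIC improves
summary(logit.step)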

Now, keeping only the significant variables, I build a new model:

logit.model = glm(Attrition ~ BusinessTravel + JobSatisfaction + MaritalStatus +  NumCompaniesWorked + OverTime + RelationshipSatisfaction + TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + YearsSinceLastPromotion + YearsWithCurrManager, family = binomial(link = "logit"), data = training)

summary(logit.model)
## 
## Call:
## glm(formula = Attrition ~ BusinessTravel + JobSatisfaction + 
##     MaritalStatus + NumCompaniesWorked + OverTime + RelationshipSatisfaction + 
##     TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsSinceLastPromotion + YearsWithCurrManager, family = binomial(link = "logit"), 
##     data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5099  -0.5661  -0.3495  -0.1831   3.7734  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -0.08015    0.65220  -0.123 0.902198    
## BusinessTravelTravel_Frequently  1.61737    0.44979   3.596 0.000323 ***
## BusinessTravelTravel_Rarely      0.74823    0.42056   1.779 0.075223 .  
## JobSatisfaction2                -0.57471    0.29400  -1.955 0.050604 .  
## JobSatisfaction3                -0.63130    0.26268  -2.403 0.016246 *  
## JobSatisfaction4                -1.18781    0.27890  -4.259 2.05e-05 ***
## MaritalStatusMarried             0.30440    0.28478   1.069 0.285127    
## MaritalStatusSingle              1.16205    0.28542   4.071 4.67e-05 ***
## NumCompaniesWorked               0.10369    0.03992   2.597 0.009397 ** 
## OverTimeYes                      1.50077    0.19946   7.524 5.31e-14 ***
## RelationshipSatisfaction2       -0.77985    0.30619  -2.547 0.010867 *  
## RelationshipSatisfaction3       -0.67063    0.26315  -2.548 0.010820 *  
## RelationshipSatisfaction4       -0.77916    0.26351  -2.957 0.003108 ** 
## TotalWorkingYears               -0.08657    0.01963  -4.410 1.03e-05 ***
## TrainingTimesLastYear           -0.14876    0.07807  -1.906 0.056708 .  
## WorkLifeBalance2                -1.14091    0.37505  -3.042 0.002350 ** 
## WorkLifeBalance3                -1.52866    0.34755  -4.398 1.09e-05 ***
## WorkLifeBalance4                -1.16489    0.42418  -2.746 0.006028 ** 
## YearsSinceLastPromotion          0.16989    0.04015   4.232 2.32e-05 ***
## YearsWithCurrManager            -0.13624    0.04101  -3.322 0.000894 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 909.69  on 1029  degrees of freedom
## Residual deviance: 714.39  on 1010  degrees of freedom
## AIC: 754.39
## 
## Number of Fisher Scoring iterations: 6

Based on the new model with 10 relevant variables, it appears that higher job satisfaction, better work-life balance, and better relationships with coworkers all reduce the likelihood of attrition, while frequent business travel increases it. It is important to note that although this model may be an improvement over the previous one, biases may still be present and caution should be taken when interpreting the results.
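
Since raw logit coefficients are hard to read, exponentiating them gives odds ratios. A hedged one-liner (confint.default uses fast Wald intervals rather than the slower profile likelihood):

exp(cbind(OR = coef(logit.model), confint.default(logit.model)))  # OR > 1 raises the odds of attrition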

Plots to better see the relationships

gg.pred = ggpredict(logit.model, terms = c("TotalWorkingYears", "NumCompaniesWorked", "WorkLifeBalance"))
plot(gg.pred, ci=F)

In these first plots, it can be seen that the higher the number of companies worked at, the higher the probability of attrition; an unbalanced work life also increases the probability of attrition; and finally, the more total working years, the lower the probability.

gg.pred = ggpredict(logit.model, terms = c("YearsSinceLastPromotion", "YearsWithCurrManager", "RelationshipSatisfaction"))
plot(gg.pred, ci=F)  

In these plots, it can be seen that the more years with the current manager, the lower the probability of attrition; low relationship satisfaction also increases the probability of attrition; and finally, the more years since the last promotion, the higher the probability.

Predictions

The previous part focused on interpreting the coefficients of the logistic regression model and making sense of them. The focus now shifts to predictions. I will first start with the logistic regression and explain all the steps involved, and then prioritize prediction performance by evaluating the area under the curve (AUC) rather than focusing on interpretation. This involves building multiple models and comparing their predictive power.

Prediction, using the Confusion Matrix with default 50% Bayes Rule:

probability <- predict(logit.model, newdata=testing, type='response')
prediction <- as.factor(ifelse(probability > 0.5, yes = "Yes", no = "No"))

confusionMatrix(prediction, testing$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  360  52
##        Yes   9  19
##                                           
##                Accuracy : 0.8614          
##                  95% CI : (0.8255, 0.8923)
##     No Information Rate : 0.8386          
##     P-Value [Acc > NIR] : 0.1076          
##                                           
##                   Kappa : 0.3219          
##                                           
##  Mcnemar's Test P-Value : 7.551e-08       
##                                           
##             Sensitivity : 0.9756          
##             Specificity : 0.2676          
##          Pos Pred Value : 0.8738          
##          Neg Pred Value : 0.6786          
##              Prevalence : 0.8386          
##          Detection Rate : 0.8182          
##    Detection Prevalence : 0.9364          
##       Balanced Accuracy : 0.6216          
##                                           
##        'Positive' Class : No              
## 

The confusion matrix shows the performance of the model in terms of predicting the actual outcome versus the predicted outcome. In this case, there are two possible outcomes: “Yes” for employees who left the company and “No” for those who stayed.

The confusion matrix is a table with four values: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

In this confusion matrix, there are 360 true negatives (employees who stayed and were correctly predicted to stay), 19 true positives (employees who left and were correctly predicted to leave), 52 false negatives (employees who left but were predicted to stay), and 9 false positives (employees who stayed but were predicted to leave). Note that caret labels "No" as the positive class, so the Sensitivity and Specificity in the output above are reported from that opposite perspective.

The statistics derived from the confusion matrix are used to evaluate the performance of the model, but I am going to focus on the AUC, which summarizes Sensitivity and Specificity across all classification thresholds.

roc.logit=roc(testing$Attrition ~ probability)

plot.roc(testing$Attrition, probability, col="darkblue", print.auc = TRUE,  auc.polygon=TRUE, grid=c(0.1, 0.2),
         grid.col=c("green", "red"), max.auc.polygon=TRUE,
         auc.polygon.col="lightblue", print.thres=TRUE)
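
The optimal threshold printed on the plot can also be extracted programmatically. A hedged one-liner with pROC (by default, "best" maximizes Youden's J):

coords(roc.logit, x = "best", ret = c("threshold", "sensitivity", "specificity"))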

The model has an AUC higher than 0.8, so it can already be considered a fair model. The best threshold, with the most balanced Specificity and Sensitivity, is 16.5%; in the next part I look at the corresponding confusion matrix:

prediction <- as.factor(ifelse(probability > 0.165, yes = "Yes", no = "No"))

confusionMatrix(prediction, testing$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  273  14
##        Yes  96  57
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.7068, 0.7898)
##     No Information Rate : 0.8386          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3701          
##                                           
##  Mcnemar's Test P-Value : 1.136e-14       
##                                           
##             Sensitivity : 0.7398          
##             Specificity : 0.8028          
##          Pos Pred Value : 0.9512          
##          Neg Pred Value : 0.3725          
##              Prevalence : 0.8386          
##          Detection Rate : 0.6205          
##    Detection Prevalence : 0.6523          
##       Balanced Accuracy : 0.7713          
##                                           
##        'Positive' Class : No              
## 

Now the results are more balanced and the model does a better job of predicting the individuals who are leaving the company.

To improve the quality of our predictions, we need to address the issue of overfitting by using regularization techniques. One approach is to use 5-fold cross-validation, which evaluates the performance of the model on different subsets of the data and helps to prevent overfitting. Additionally, we can automate the tuning of the models' hyperparameters to optimize their performance.

Cross-validation

In this next part, I use the caret package to set up the cross-validation function for all models. The number of folds is set to 5; although a higher number of folds can lead to better performance estimates, it is more computationally expensive. I also request class probabilities for the binary dependent variable and set the method to repeated cross-validation.

ctrl <- trainControl(method = "repeatedcv",             # repeated k-fold cross-validation
                     number = 5,                        # 5 folds
                     classProbs = T,                    # return class probabilities
                     summaryFunction = twoClassSummary, # report ROC, Sens and Spec
                     verboseIter = F)

Note that for each model, a grid of hyper-parameters has to be defined to make it work better.

Regularized Linear Discriminant Analysis

From now on, we will focus on ROC and AUC as evaluation metrics for our predictive models. We start with regularized linear discriminant analysis (rda). The reason for choosing rda is that it suits our binary classification problem with potential multicollinearity among the predictor variables; regularization helps prevent overfitting and improves generalization performance.

param_grid = expand.grid(gamma = seq(0, 1, 0.1), lambda = seq(0.1, 0.9, 0.1))


ldaFit <- train(Attrition ~ .,               
                method ="rda",          
                data = training,
                tuneGrid = param_grid,  
                preProcess = c("center", "scale"),
                metric="ROC",
                trControl = ctrl)

plot(ldaFit)

It can be seen that setting the hyperparameter lambda to 0.9 achieves the highest ROC. Now I get the exact values and the confusion matrix:

ldaPred = predict(ldaFit, testing, type = "prob")
prediction <- as.factor(ifelse(ldaPred[,2] > 0.5, "Yes", "No"))
confusionMatrix(prediction, testing$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  351  37
##        Yes  18  34
##                                           
##                Accuracy : 0.875           
##                  95% CI : (0.8404, 0.9044)
##     No Information Rate : 0.8386          
##     P-Value [Acc > NIR] : 0.01975         
##                                           
##                   Kappa : 0.4822          
##                                           
##  Mcnemar's Test P-Value : 0.01522         
##                                           
##             Sensitivity : 0.9512          
##             Specificity : 0.4789          
##          Pos Pred Value : 0.9046          
##          Neg Pred Value : 0.6538          
##              Prevalence : 0.8386          
##          Detection Rate : 0.7977          
##    Detection Prevalence : 0.8818          
##       Balanced Accuracy : 0.7150          
##                                           
##        'Positive' Class : No              
## 
auc <- roc(testing$Attrition, ldaPred[, 2])$auc
auc
## Area under the curve: 0.8484

The area under the curve (AUC) is 0.8484, a measure of the model's overall predictive power; this is a little higher than the logistic regression, but there is still room for improvement.

Machine Learning

In order to improve our prediction accuracy, we will explore the use of machine learning algorithms. Specifically, we will test four different models: K-nearest neighbors (KNN), decision tree, random forest, and gradient boosting.

Knn

KNN is a simple and intuitive algorithm that works by identifying the k-nearest data points in the training set to a given test point, and predicting the class of the test point based on the most common class of its neighbors.

knnFit <- train(Attrition ~ ., 
                  data = training,
                  method = "kknn",   
                  preProc=c('scale','center'),
                  tuneGrid = data.frame(kmax=c(17,19,21,23,25,27),distance=2,kernel='optimal'),  
                  trControl = ctrl)


plot(knnFit) 

knnProb = predict(knnFit, testing, type="prob")
prediction <- as.factor(ifelse(knnProb[,2] > 0.1, "Yes", "No"))

confusionMatrix(prediction, testing$Attrition)$table
##           Reference
## Prediction  No Yes
##        No  184   7
##        Yes 185  64
confusionMatrix(prediction, testing$Attrition)$overall[1:2]
##  Accuracy     Kappa 
## 0.5636364 0.1988012
auc <- roc(testing$Attrition, knnProb[, 2])$auc
auc
## Area under the curve: 0.7709

Based on the results, the KNN model performs moderately well with an AUC of 0.7709. The optimal number of maximum neighbors is around 24, but this is still not the best model for this data: it predicts almost all of the true Yes cases correctly, but also predicts Yes for many true No cases.

Decision tree

Decision trees are a popular method for classification tasks. They work by splitting the data based on the values of different features, recursively partitioning it until each subset is as homogeneous as possible in terms of the target variable. The next steps are mainly for interpretation, since a single tree probably does not have the best predictive power.

control = rpart.control(minsplit = 30, maxdepth = 10, cp=0.01)
model = Attrition ~.
dtFit <- rpart(model, data=training, method = "class", control = control)

rpart.plot(dtFit, digits=3)

Interpreting this decision tree, it can be seen that it first splits on the variable "Over Time", then on "Monthly Income" and "Total Working Years", and so on.

dtPred <- predict(dtFit, testing, type = "class")

dtProb <- predict(dtFit, testing, type = "prob")

prediction <- as.factor(ifelse(dtProb[,2] > 0.1, "Yes", "No"))

confusionMatrix(prediction, testing$Attrition)$table
##           Reference
## Prediction  No Yes
##        No  290  24
##        Yes  79  47
confusionMatrix(prediction, testing$Attrition)$overall[1:2]
##  Accuracy     Kappa 
## 0.7659091 0.3411642
auc <- roc(testing$Attrition, dtProb[, 2])$auc
auc
## Area under the curve: 0.7464

The AUC of 0.746 suggests that the model is not very effective at distinguishing between positive and negative cases, but this is expected from a single decision tree. In the next part, a combination of many trees is built under some conditions; the resulting model is called a Random Forest.

Random Forest

Random forest is a method that constructs many decision trees using different subsets of the data and features, and then aggregates their predictions to make a final classification.

rfFit <- train(Attrition ~ ., 
                  data = training,
                  method = "rf",   
                  preProc=c('scale','center'),
                  tuneLength = 10,     
                  trControl = ctrl)
plot(rfFit)    

rfProb = predict(rfFit, testing, type="prob")
prediction <- as.factor(ifelse(rfProb[,2] > 0.2, "Yes", "No"))

confusionMatrix(prediction, testing$Attrition)$table
##           Reference
## Prediction  No Yes
##        No  336  33
##        Yes  33  38
confusionMatrix(prediction, testing$Attrition)$overall[1:2] 
##  Accuracy     Kappa 
## 0.8500000 0.4457804
auc <- roc(testing$Attrition, rfProb[, 2])$auc
auc
## Area under the curve: 0.8303

The AUC of 0.8303 indicates good discriminative power, better than the single Decision Tree. The random forest selected an optimal value of 8 randomly sampled predictors for each split in the trees.

Next, the variable importance that the model used can be seen:

rf_imp <- varImp(rfFit, scale = F)
plot(rf_imp, scales = list(y = list(cex = .95)))

Monthly Income, Age, and Total Working Years seemed to be the most important variables for constructing the Random Forest model, while Job Role and Job Level were the least important.

partial(rfFit, pred.var = "MonthlyIncome", plot = TRUE, rug = TRUE)

In this plot, the relationship between Attrition and Monthly Income is shown, and it can be seen that it is not linear: Monthly Income matters a lot at first, then reaches a threshold beyond which it matters little.

partial(rfFit, pred.var = "TotalWorkingYears", plot = TRUE, rug = TRUE)

Total Working Years behaves very similarly: a strong effect at first, then a threshold beyond which it matters little.

Gradient Boosting

Gradient boosting is another machine learning method that constructs a sequence of decision trees, where each tree is trained to correct the errors of the previous tree, and the final prediction is made by aggregating the predictions of all the trees.

Initially, I attempted to run the Gradient Boosting model by restricting the sample and variables to the most important ones. However, my computer could only handle 5% of the sample and 7 variables, resulting in a poor AUC. I could have fixed this by running the model in the cloud for several days, but didn't want to get into that. Although I know this approach is not ideal, I decided to use the full training dataset for interpretation purposes. Unfortunately, the AUC results were still unsatisfactory, so I include the code without interpreting the results. In general, the interpretation process is similar to that of the Random Forest.
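
For reference, restricting the sample as described above could be sketched like this (the 5% fraction mirrors the text; purely illustrative):

small_idx <- createDataPartition(training$Attrition, p = 0.05, list = FALSE)  # stratified 5% subsample
training_small <- training[small_idx, ]
# training_small could then replace training in the train() call below.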

ctrlg <- trainControl(method = "repeatedcv",     
                     number = 5,               
                     classProbs = T,             
                     summaryFunction=twoClassSummary,
                     verboseIter = F) 

gbmFit <- train(Attrition ~ ., 
                  data = training,
                  method = "xgbTree",   
                  preProc=c('scale','center'),
                  tuneLength = 30,
                  metric="ROC",
                  trControl = ctrlg)
plot(gbmFit)
gbmProb = predict(gbmFit, testing, type="prob")
prediction <- as.factor(ifelse(gbmProb[,2] > 0.1, "Yes", "No"))

confusionMatrix(prediction, testing$Attrition)$table
confusionMatrix(prediction, testing$Attrition)$overall[1:2]
gbm_imp <- varImp(gbmFit, scale = F)
plot(gbm_imp, scales = list(y = list(cex = .95)))

Best Classification Model and why

After evaluating the different models and analyzing their performance, I found that the best predictive model was Regularized Linear Discriminant Analysis (LDA) by a small margin, followed by Logistic Regression and then Random Forest. Although I could not run Gradient Boosting at full power due to computational constraints, it is possible that it would have performed even better. Also, increasing the number of folds would probably have yielded a better Random Forest. Nonetheless, the models that were evaluated provide a good indication of the best approaches for predicting employee attrition with this dataset.
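
For reference, a small recap of the test-set AUCs reported above (the logistic regression's AUC appeared only on the ROC plot, so it is omitted here):

auc_recap <- tribble(
  ~model,            ~test_AUC,
  "Regularized LDA", 0.8484,
  "Random Forest",   0.8303,
  "KNN",             0.7709,
  "Decision Tree",   0.7464
)
auc_recap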

Advanced Regression

The goal in this next section is to predict Monthly Income and see if someone is receiving more or less than they should.

Among the candidate techniques are Elastic Net models, as well as Random Forest, Gradient Boosting, and Neural Networks. By comparing their performance, selecting the best models, and combining them in an ensemble, I can accurately predict Monthly Income and identify discrepancies in the current salary structure.

Benchmark

I first started with a benchmark by predicting all the new Monthly Income values as the average Monthly Income in the training set. However, I can do better than this approach by using advanced regression techniques.

benchFit <- lm(MonthlyIncome ~ 1, data=training)
predictions <- predict(benchFit, newdata=testing)
RMSE <- sqrt(mean((predictions - testing$MonthlyIncome)^2))
RMSE
## [1] 4786.233

In this part I use the RMSE, which measures the differences between predicted and actual values. I will compare the RMSE of the different models at the end.
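
For clarity, the same computation wrapped in a small helper (equivalent to the benchmark code above):

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))  # root mean squared error
rmse(predictions, testing$MonthlyIncome)                # reproduces the benchmark RMSE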

Cross-validation

Cross-validation parameters for the continuous outcome: five folds because of computational restrictions, although 10 would be better.

ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 1)

# Create data frame with test results
test_results <- data.frame(MonthlyIncome = testing$MonthlyIncome)

Again, a grid of hyper-parameters has to be defined for each model, just like before.

Linear Regression

The first model used for predicting Monthly Income is Linear Regression. However, since this model has a tendency to overfit, care is needed in interpreting the results.

lm_tune <- train(MonthlyIncome~., data = training, 
                 method = "lm", 
                 preProc=c('scale', 'center'),
                 trControl = ctrl)
lm_tune
## Linear Regression 
## 
## 1030 samples
##   27 predictor
## 
## Pre-processing: scaled (57), centered (57) 
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 824, 825, 823, 824, 824 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1103.227  0.9433317  836.3577
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Predict

test_results$lm <- predict(lm_tune, testing)
postResample(pred = test_results$lm,  obs = test_results$MonthlyIncome)
##         RMSE     Rsquared          MAE 
## 1076.5268799    0.9494219  824.8354356

The first model is a plain linear regression, whose predicted values are stored in test_results. Again, a good starting point, but the model is probably suffering from overfitting.

I am not going to use stepwise regression because Ridge, Lasso, and Elastic Net are generally better models, so I move directly to those.

Ridge

A linear regression model that adds a penalty term on the sum of squared coefficients, shrinking the betas to prevent overfitting.

ridge_grid <- expand.grid(lambda = seq(0, .1, length = 100))

ridge_tune <- train(MonthlyIncome~., data = training,
                    method='ridge',
                    preProc=c('scale','center'),
                    tuneGrid = ridge_grid,
                    trControl=ctrl)
plot(ridge_tune)

# the best tune
ridge_tune$bestTune
##        lambda
## 2 0.001010101

The best hyperparameter lambda is almost 0, and the predictions are the following:

# prediction
test_results$ridge <- predict(ridge_tune, testing)

postResample(pred = test_results$ridge,  obs = test_results$MonthlyIncome)
##         RMSE     Rsquared          MAE 
## 1076.3582981    0.9494489  825.0481612

The selected lambda is almost 0, meaning the regularization term has almost no effect on the model. As a result, the Ridge model performs essentially the same as the plain Linear Regression.

Lasso

A linear regression model that adds a penalty term on the absolute sum of coefficients, encouraging sparsity and feature selection.

lasso_grid <- expand.grid(fraction = seq(.01, 1, length = 100))

lasso_tune <- train(MonthlyIncome~., data = training,
                    method='lasso',
                    preProc=c('scale','center'), 
                    tuneGrid = lasso_grid,
                    trControl=ctrl)
plot(lasso_tune)

lasso_tune$bestTune
##    fraction
## 78     0.78
test_results$lasso <- predict(lasso_tune, testing)
postResample(pred = test_results$lasso,  obs = test_results$MonthlyIncome)
##         RMSE     Rsquared          MAE 
## 1052.8005719    0.9518523  812.7701040

The Lasso model produced an optimal hyperparameter fraction of 0.78. This resulted in a lower RMSE, indicating that the Lasso performed better than the previous two models in predicting Monthly Income by removing non-relevant variables.

Elastic Net

A linear regression model that combines the Ridge and Lasso penalties to balance overfitting control and feature selection.
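
For reference, the elastic-net objective as parameterized in glmnet, which nests the two previous models (alpha = 0 recovers Ridge, alpha = 1 recovers Lasso):

$$\min_{\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \lambda\left[(1-\alpha)\,\tfrac{1}{2}\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]$$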

elastic_grid = expand.grid(alpha = seq(0, .2, 0.01), lambda = seq(0, .1, 0.01))

glmnet_tune <- train(MonthlyIncome~., data = training,
                     method='glmnet',
                     preProc=c('scale','center'),
                     tuneGrid = elastic_grid,
                     trControl=ctrl)

plot(glmnet_tune)

glmnet_tune$bestTune
##     alpha lambda
## 231   0.2    0.1
test_results$glmnet <- predict(glmnet_tune, testing)

postResample(pred = test_results$glmnet,  obs = test_results$MonthlyIncome)
##         RMSE     Rsquared          MAE 
## 1068.8865944    0.9501097  821.2449474

When alpha is set to 1, the Elastic Net is equivalent to the Lasso. When alpha is set to 0, the Elastic Net is equivalent to the Ridge.

In this case, the optimal alpha value was 0.2 and the optimal lambda value was 0.1, which means the Elastic Net was more Ridge-like than Lasso-like. The fact that it predicted better than Ridge but worse than Lasso could be because it found a balance between the two regularization techniques without fully capturing the optimal regularization of either model individually. It is also possible that this dataset is simply better suited to one type of regularization, in this case the Lasso.

Machine Learning

In order to reduce our prediction error, we will explore machine learning algorithms. Specifically, we will test four different models: K-nearest neighbors (KNN), Random Forest, Gradient Boosting, and Neural Networks.

Knn

The same as in classification, but now predicting a continuous variable.

knn_tune <- train(MonthlyIncome~., 
                  data = training,
                  method = "kknn",   
                  preProc=c('scale','center'),
                  tuneGrid = data.frame(kmax=c(5,7,9,11,13,15,19,21),distance=2,kernel='optimal'),
                  trControl = ctrl)
plot(knn_tune)

test_results$knn <- predict(knn_tune, testing)

postResample(pred = test_results$knn,  obs = test_results$MonthlyIncome)
##         RMSE     Rsquared          MAE 
## 1558.9875235    0.9019586 1164.9368111

Not a very good model for this dataset. One possible explanation for the worse results of KNN compared to linear regression and the rest could be the dimensionality. KNN is sensitive to the number of variables in the dataset: as the number of dimensions increases, the volume of the space grows exponentially, leading to sparse data. This makes it hard for KNN to find genuinely close neighbors and degrades its predictions.
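
A hedged way to see the dimensionality KNN actually faces after dummy expansion (the lm output above reported 57 scaled columns):

ncol(model.matrix(MonthlyIncome ~ ., data = training))  # design-matrix columns, including the intercept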

Random Forest

Also the same as in classification, but now with a continuous dependent variable.

rf_tune <- train(MonthlyIncome~., 
                 data = training,
                 method = "rf",
                 preProc=c('scale','center'),
                 trControl = ctrl,
                 ntree = 100,
                 tuneGrid = data.frame(mtry=c(1,3,5,7,9,11,13,15)),
                 importance = TRUE)

plot(rf_tune)

The best number of randomly selected predictors (mtry) lies between 10 and 15 for the Random Forest.

test_results$rf <- predict(rf_tune, testing)

postResample(pred = test_results$rf,  obs = test_results$MonthlyIncome)
##         RMSE     Rsquared          MAE 
## 1036.5774561    0.9535208  795.4503318

The Random Forest model outperformed all other models in predictive accuracy, as shown by its lower root mean squared error (RMSE). This indicates that it was able to capture more complex patterns in the data, allowing more accurate predictions. Additionally, Random Forest can handle a large number of predictors, making it a versatile algorithm for various datasets, including this one.

rf_imp_cont <- varImp(rf_tune, scale = F)
plot(rf_imp_cont, scales = list(y = list(cex = .95)))

The Random Forest model shows that the most important variables in predicting Monthly Income are Job Level, Job Role, and Total Working Years. This result aligns with our understanding of the job market and how these factors typically impact salary.

Gradient Boosting

For the gradient boosting model, the same thing happened as with classification. I tried to run it with a smaller sample, fewer predictors, etc., but with what my computer could handle the result was not a very good model. So instead, I present the code using the full sample, written the way I would have run it.

ctrlgc <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 1)

xgb_tune <- train(MonthlyIncome ~ ., 
                  data = training,
                  method = "xgbTree",
                  preProc=c('scale','center'),
                  objective="reg:squarederror",
                  trControl = ctrlgc,
                  tuneGrid = expand.grid(nrounds = c(500,1000), 
                                         max_depth = c(5,6,7), 
                                         eta = c(0.01, 0.1, 1),
                                         gamma = c(1, 2, 3), 
                                         colsample_bytree = c(0.5, 1),  # must be a fraction in (0, 1]
                                         min_child_weight = c(1), subsample = c(0.2,0.5,0.8)))

test_results$xgb <- predict(xgb_tune, testing)

postResample(pred = test_results$xgb,  obs = test_results$MonthlyIncome)

plot(varImp(xgb_tune, scale = F), scales = list(y = list(cex = .95)))

Neural Networks

A neural network is a type of machine learning model that consists of multiple layers of interconnected nodes that perform computations on the input data to produce an output.

In R, categorical variables have to be converted into dummies so that they can enter the model (caret's formula interface does this automatically via model.matrix). Again, due to computational constraints I could not run it at full power, but it would probably beat Random Forest and be around Gradient Boosting in predictive power.
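
A hedged sketch of explicit dummy encoding with caret (train()'s formula interface already does this internally):

dmy <- dummyVars(MonthlyIncome ~ ., data = training)              # build the encoder
training_dummies <- data.frame(predict(dmy, newdata = training))  # all-numeric design matrix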

nn_tune <- train( MonthlyIncome ~ .,
                 data = training,
                 method = "neuralnet",
                 preProc=c('scale','center'),
                 trControl = ctrlgc,
                 tuneGrid = expand.grid(layer1 = c(4, 2),
                                        layer2 = c(2, 1, 0),
                                        layer3 = c(0)))



test_results$nn <- predict(nn_tune, testing)

postResample(pred = test_results$nn,  obs = test_results$MonthlyIncome)

plot(varImp(nn_tune, scale = F), scales = list(y = list(cex = .95)))

Ensemble

To further improve our predictive accuracy, I used an ensemble method to combine the predictions from the top-performing models. Specifically, I select the top 3 models based on their performance on the validation set, and average their predictions on the test set to obtain the final ensemble prediction.

First, I will compare the mean absolute error (MAE) of each individual model trained:

apply(test_results[-1], 2, function(x) mean(abs(x - test_results$MonthlyIncome)))
##        lm     ridge     lasso    glmnet       knn        rf 
##  824.8354  825.0482  812.7701  821.2449 1164.9368  795.4503

The models with the lowest MAE are, in order: Random Forest, Lasso regression, and Elastic Net. Gradient Boosting and Neural Networks would probably be up there too, and I would have used them.

Combine them and take the average of the top 3:

test_results$comb = (test_results$rf + test_results$lasso + test_results$glmnet)/3

postResample(pred = test_results$comb,  obs = test_results$MonthlyIncome)
##         RMSE     Rsquared          MAE 
## 1032.2411264    0.9536303  793.6562391

The MAE is lower than that of the best performing single model (Random Forest), so the ensemble is an improvement over using one model alone.

Results of predicted Monthly Income:

yhat = test_results$comb

hist(yhat, col="lightblue")

The model predicts many individuals with lower wages, and it does a good job given the other variables. The predicted distribution is clearly asymmetric.

Prediction Intervals

Finally, the aim is to obtain prediction intervals for the ensemble. To achieve this, a new division is added: the test set is split into two parts, "calibration" and "test". The calibration set is used to make predictions on Monthly Income and compute the errors; these errors provide a histogram that gives the size of the prediction interval.

y = test_results$MonthlyIncome
error = y-yhat
hist(error, col="lightblue")

Distribution of the errors of the model vs. the real values. It can be seen that it is approximately normal.

So now I build 90% prediction intervals and plot them:

noise = error[1:80]                                            # calibration errors
lwr = yhat[81:length(yhat)] + quantile(noise, 0.05, na.rm=T)   # lower bound of the 90% interval
upr = yhat[81:length(yhat)] + quantile(noise, 0.95, na.rm=T)   # upper bound of the 90% interval
predictions = data.frame(real=y[81:length(y)], fit=yhat[81:length(yhat)], lwr=lwr, upr=upr)
predictions = predictions %>% mutate(out=factor(if_else(real<lwr | real>upr,1,0)))

ggplot(predictions, aes(x=fit, y=real))+
  geom_point(aes(color=out)) + theme(legend.position="none") +
  geom_ribbon(data=predictions,aes(ymin=lwr,ymax=upr),alpha=0.3) +
  labs(title = "Prediction intervals", x = "Prediction",y="Real Monthly Income")

The red points falling within the prediction interval indicate that these individuals are being compensated appropriately based on the characteristics described in the dataset. On the other hand, the blue points lying outside the interval suggest that some individuals are being overpaid (on top) or underpaid (below) compared to their peers. It might be worthwhile for the company to investigate these cases and consider adjusting the salaries or factoring them into future salary decisions.

Conclusion

In conclusion, this analysis covered three main areas: feature engineering, attrition prediction, and monthly income prediction. In the data cleaning phase, I identified and addressed inconsistencies and produced basic descriptive statistics. In the attrition modeling phase, I developed and compared various models to interpret and predict employee attrition, identifying the most significant factors contributing to it and selecting the best models using the AUC. In the monthly income modeling phase, I used similar techniques, applying machine learning algorithms to predict employee monthly income and identifying the most important variables for accurate predictions. Finally, I built an ensemble of the best-performing models and plotted the prediction intervals.

This analysis provided several insights that can guide a company's decisions. For example, the most critical factors contributing to employee attrition were job satisfaction, years at the company, and monthly income. Additionally, the most important variables in predicting monthly income were job level, job role, and working years. The findings suggest that machine learning algorithms can significantly improve the accuracy of employee attrition and monthly income predictions, even though I could not run Gradient Boosting and Neural Networks at full power.