IBM Attrition Rate Prediction
1. Business Understanding
Present-day businesses are growing incredibly quickly, and as a result, corporations are in great need of skilled professionals. Whenever a corporation loses an experienced employee, it must either offer higher pay to retain that employee or hire a replacement.
Anticipating attrition in advance can save a lot of time and money. It also enables Human Resources departments to manage their limited resources efficiently and to stay flexible with regard to new hires and management of the current workforce.
We start by defining attrition: an employee leaving the company, whether voluntarily or involuntarily.
A corporation should aim for an average attrition rate of less than 10%; a rate of more than 20% should raise red flags. Attrition can be caused by multiple factors, such as poor management, lack of acknowledgement, a toxic work culture, and no career growth. To anticipate which employees are most likely to leave the company, we create and test several attrition prediction models.
We are interested in investigating the trends that lead to employee churn. After the analysis, we model employee attrition by applying machine learning models to forecast potential departures. We then use those predictions to help the company identify the employees predicted to leave and offer them compensation.
2. Data Understanding
2.1 Dataset Cleaning
Our data come from Kaggle’s IBM HR Analytics Employee Attrition & Performance dataset, which has 30 features and 1470 rows of employee data. We cleaned the original dataset by removing duplicate rows and columns, dropping columns that take a single value across all rows, and creating dummy variables for the categorical variables. We then randomly split the data into 70% training data and 30% testing data. We checked the class balance of both sets and found a similar proportion of employees who leave in the training and testing data, so we decided to use the randomly selected sets.
Here is a summary of the dataset provided for our analysis:
Target Variable: Attrition (1,0)
Features/Attributes (Appendix Table 2.1.1 and Table 2.1.2):
Demographics (age, marital status, roles, education, education field)
Employee responsibility (business travel, overtime hours)
Satisfaction (performance, relationship, work life balance)
Training data: 1,030 records with 54 variables
Test data: 440 records with 54 variables
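The random split and balance check described in Section 2.1 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the report; the function names, the seed, and the synthetic labels (roughly 16% leavers) are assumptions made for the example.

```python
import random

def split_train_test(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of rows and split it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def attrition_rate(rows):
    """Share of rows whose Attrition label equals 1."""
    return sum(r["Attrition"] for r in rows) / len(rows)

# Hypothetical stand-in for the cleaned data: 1,470 rows with synthetic labels.
rng = random.Random(1)
employees = [{"Attrition": 1 if rng.random() < 0.16 else 0} for _ in range(1470)]

train, test = split_train_test(employees)
print(len(train), len(test))
# Similar rates in both sets suggest the random split kept the class balance.
print(round(attrition_rate(train), 3), round(attrition_rate(test), 3))
```

With 1,470 rows this rounding rule gives 1,029 and 441 rows, so the exact rule behind the 1,030 and 440 records reported above may differ slightly.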
2.2 Exploratory Data Analysis
To better understand the data and choose variables for the models, we conducted an exploratory data analysis using principal component analysis (PCA). In the full PCA, the first two principal components explain the largest proportions of variance, 0.3306 and 0.1362, so we focus on them.
Next, we interpreted the meaning of the latent features (PCs). For each component, we displayed the top features that together account for 3/4 of the squared norm of the component’s loadings.
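The selection rule for the top features of a component can be sketched as below. This is one plausible reading of the 3/4 rule, assuming the loadings of a component are available as a name-to-weight mapping; the function name and the loading values are hypothetical and are not the report's actual numbers.

```python
def top_loadings(loadings, share=0.75):
    """Return the features whose squared loadings, taken largest first,
    add up to at least `share` of the component's squared norm."""
    total = sum(w * w for w in loadings.values())
    ranked = sorted(loadings.items(), key=lambda kv: kv[1] ** 2, reverse=True)
    selected, running = [], 0.0
    for name, weight in ranked:
        selected.append((name, weight))
        running += weight * weight
        if running >= share * total:
            break
    return selected

# Hypothetical loadings for one principal component (illustration only).
pc1 = {
    "YearsAtCompany": 0.6,
    "TotalWorkingYears": 0.5,
    "YearsInCurrentRole": 0.5,
    "Age": 0.3,
    "DistanceFromHome": 0.2,
    "NumCompaniesWorked": 0.1,
}
for name, weight in top_loadings(pc1):
    print(name, weight)
```

In this toy example the three tenure-related features are enough to pass the 3/4 mark, so only they would be displayed for the component.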
The first latent feature is closely related to working years, including YearsAtCompany, TotalWorkingYears and YearsInCurrentRole. The longer employees have worked for the company, the longer they may want to stay. Another latent feature is related to age and the number of companies an employee has worked for; it picks out younger employees with few years of experience who have worked for few companies.
3. Modeling
3.1 Model Selection (K-fold cross validation)
Because our target is to estimate the probability of attrition, we selected five classification models, ran K-fold cross-validation on the training data, and chose the best-performing model. The evaluation metric was the Performance Measure. Because each prediction is a probability score, we convert it to 1 or 0 by checking whether the prediction exceeds a threshold of 0.5. That is, we predict that the employee will leave if the probability of attrition is larger than 0.5; otherwise, we predict that the employee will stay.
Based on the data characteristics and model features, we chose logistic regression (m.lr), logistic regression with interactions using Lasso (m.lr.l), logistic regression with interactions using Post-Lasso (m.lr.pl), a classification tree (m.tree), and a random forest (m.randomforest). We performed 10-fold cross-validation and calculated the average performance measure of each model.
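The cross-validation procedure can be sketched as follows. This is a minimal illustration rather than the code behind the reported scores: the report does not define its Performance Measure, so plain accuracy at the 0.5 threshold is assumed here, and the base-rate "model" and synthetic labels are placeholders for the five real models and the training data.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def accuracy(probabilities, labels, threshold=0.5):
    """Share of rows where (probability > threshold) matches the 0/1 label."""
    hits = sum(int(p > threshold) == y for p, y in zip(probabilities, labels))
    return hits / len(labels)

def cross_validate(fit, labels, k=10):
    """Average held-out accuracy over k folds.

    `fit(train_idx)` must return a function mapping a row index to a probability.
    """
    folds = k_fold_indices(len(labels), k)
    scores = []
    for held_out in folds:
        train_idx = [i for fold in folds if fold is not held_out for i in fold]
        predict = fit(train_idx)
        probs = [predict(i) for i in held_out]
        scores.append(accuracy(probs, [labels[i] for i in held_out]))
    return sum(scores) / k

# Hypothetical labels and a placeholder model that predicts the training base rate.
rng = random.Random(1)
labels = [1 if rng.random() < 0.16 else 0 for _ in range(1030)]

def fit_base_rate(train_idx):
    rate = sum(labels[i] for i in train_idx) / len(train_idx)
    return lambda i: rate

print(round(cross_validate(fit_base_rate, labels), 4))
```

Each candidate model would be passed through the same `cross_validate` call, and the model with the highest average score would be kept.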
Out of five models, Logistic Regression achieved the highest Performance Measure score of 0.8699.
3.2 Feature Importance by Logistic Regression
The logistic regression model describes the conditional probability of 𝑌 = 1 given 𝑋 through the log odds:
\[ \log \frac{P(Y = 1 \mid X)}{1 - P(Y = 1 \mid X)} = X^\top \beta \]
Here β is the vector of coefficients. We analyzed the marginal impact of the significant factors in the logistic regression model. From the summary of the fitted model, we identified the important features and calculated their marginal impact as shown:
In the logistic regression model, we calculate the marginal impact of the top five significant factors. The first is marital status: if an employee is single, attrition is 0.67 times more likely. If the employee works overtime, attrition is 0.70 times more likely. Years at the company and number of companies worked for, which the EDA showed to be related, make attrition 1.61 and 1.79 times less likely, respectively. If an employee needs to travel frequently, attrition is 0.78 times more likely.
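One common way to read a logistic regression coefficient is as an odds ratio, sketched below. The report does not state how its marginal impacts were computed, so this is not necessarily the same calculation; the coefficient and baseline values are hypothetical and chosen only for illustration.

```python
import math

def odds_ratio(coefficient):
    """Multiplicative change in the odds of Y = 1 for a one-unit increase in the feature."""
    return math.exp(coefficient)

def probability(log_odds):
    """Inverse of the log-odds (logit) transform."""
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficient for a 0/1 overtime indicator (not the report's estimate).
beta_overtime = 0.5
# Hypothetical log odds for an otherwise identical employee without overtime.
baseline_log_odds = -2.0

print(round(odds_ratio(beta_overtime), 3))
print(round(probability(baseline_log_odds), 3))
print(round(probability(baseline_log_odds + beta_overtime), 3))
```

A positive coefficient gives an odds ratio above 1 (attrition more likely) and a negative coefficient gives one below 1 (attrition less likely); the change in probability also depends on the baseline.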
4. Evaluation
4.1 Performance measure
We compared the ROC curves of all the models to cross-check whether logistic regression is the best model, and confirmed that it is. We use the Confusion Matrix (Table 5.1.1) and ROC curve (Graph 5.1.2) to evaluate our model, and our optimal threshold is 0.5. Y=1 means the employee leaves; Y=0 means the employee stays.
The True Positive Rate is 33/(33+36) = 0.47.
The False Positive Rate is 18/(18+353) = 0.049.
In other words, the probability of predicting attrition when the employee actually leaves is 0.47, and the probability of predicting attrition when the employee does not leave is 0.049.
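The two rates can be recomputed directly from the confusion-matrix counts quoted above. The short sketch below does only that arithmetic; the variable names are chosen for the example, and the accuracy figure is derived from the same four counts rather than quoted from the report.

```python
# Confusion-matrix counts taken from the two rate formulas above.
true_positives = 33    # predicted to leave, actually left
false_negatives = 36   # predicted to stay, actually left
false_positives = 18   # predicted to leave, actually stayed
true_negatives = 353   # predicted to stay, actually stayed

tpr = true_positives / (true_positives + false_negatives)
fpr = false_positives / (false_positives + true_negatives)
total = true_positives + false_negatives + false_positives + true_negatives
accuracy = (true_positives + true_negatives) / total

print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}, accuracy = {accuracy:.3f}, n = {total}")
```

The four counts sum to 440, which matches the size of the test set, and the True Positive Rate comes out at about 0.478, which the text above gives as 0.47.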
4.2 Targeting Strategy
Overall, the attrition rate of our prediction is 19%. Since the company cannot give compensation to all employees, we would like to use our prediction to help the company target the employees who are likely to leave.
The curve shows that the more employees we target, the higher the True Positive Rate. Our predicted attrition rate is 19%. If we target only 19% of the population, the rate at which we target employees who are actually leaving the company is 60%, based on our prediction. The lift curve shows that if we target 20% of the employees, our model is about 3.5 times better than random guessing. Our model therefore helps the company to better identify and target the employees who are likely to leave.
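A targeting curve and lift of this kind can be computed as in the sketch below. It assumes the usual definitions (gain is the share of actual leavers among the targeted group, and lift is that gain divided by the share of employees targeted), which may differ in detail from the curves in the report. The scores and outcomes are hypothetical.

```python
def gains_and_lift(scores, labels, fraction):
    """Target the top `fraction` of rows by score.

    Returns (share of actual leavers captured, lift over random targeting).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_targeted = max(1, round(fraction * len(ranked)))
    captured = sum(label for _, label in ranked[:n_targeted])
    gain = captured / sum(labels)
    lift = gain / (n_targeted / len(ranked))
    return gain, lift

# Hypothetical predicted probabilities and outcomes for ten employees (1 = left).
scores = [0.91, 0.80, 0.72, 0.65, 0.40, 0.35, 0.30, 0.22, 0.15, 0.05]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

gain, lift = gains_and_lift(scores, labels, fraction=0.2)
print(gain, lift)
```

Random targeting captures leavers in proportion to the share of employees contacted, so a lift above 1 means the ranking finds leavers faster than chance.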
5. Deployment
5.1 Recommendations
Before Hiring:
In this report, we pay more attention to the inherent attributes of employees, that is, attributes we can know before deciding whether to hire (e.g. marital status, number of companies previously worked for, years of employment). Our model can greatly improve the accuracy of the expected attrition rate for a candidate before the hiring decision. This can support and improve a company’s recruiting strategy, thereby reducing the company’s costs.
For a company, employee turnover can be very costly. Turnover costs may include advertising the position, interviewing new candidates, background checks, and more. According to Employee Benefit News, the average cost of replacing an employee is 33% of the employee’s annual salary.
The company should accept a candidate only when the benefit of hiring that candidate is positive.
However, the true value of an employee is difficult to measure, so we tried a more intuitive attribute: time. According to Employee Benefit News, it takes an average of 6.2 months for a company to reach the break-even point on a manager it hires, because of the costs incurred.
Again, the company should accept a candidate only when the benefit of hiring that candidate is positive.
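The report does not give an explicit formula for the benefit of hiring, so the sketch below is only one possible way to turn the two cited figures into a screening rule. The function names, the weighting of replacement cost by predicted attrition probability, and the example candidate are all assumptions.

```python
REPLACEMENT_COST_SHARE = 0.33  # share of annual salary, as cited above
BREAK_EVEN_MONTHS = 6.2        # average break-even time for managers, as cited above

def expected_turnover_cost(annual_salary, attrition_probability):
    """Replacement cost weighted by the predicted probability of leaving."""
    return attrition_probability * REPLACEMENT_COST_SHARE * annual_salary

def passes_break_even(expected_tenure_months):
    """True if the employee is expected to stay past the break-even point."""
    return expected_tenure_months > BREAK_EVEN_MONTHS

# Hypothetical candidate: 60,000 salary, 20% predicted attrition, 18 months expected tenure.
print(expected_turnover_cost(60_000, 0.20))
print(passes_break_even(18))
```

A candidate with a high predicted attrition probability carries a higher expected turnover cost, and a candidate expected to leave before the break-even point would not repay the cost of hiring under this rule.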
After Hiring:
Our predictive model can also guide the company’s subsequent employee welfare programs to reduce turnover. Because employee welfare programs are often costly, the model can help the company decide which employees to offer them to. This saves the company money.
For example, we can offer employees stock benefits as incentives. Stock benefits tie the employee’s interests to the company’s. If the company is doing well, employees are more likely to stay. However, this also depends on how much stock an employee holds, which could itself become a factor in whether the employee leaves. According to our findings, we also need to address working hours, because overtime is one of the reasons employees leave. Employees who have stayed longer may need a compensation boost or stock options that give them an ownership stake in order to stay with the company.
As another example, we found that frequent business travel is associated with higher employee turnover. The company may therefore be able to reduce turnover by offering remote work opportunities. Offering remote work or flexible work schedules has been shown to decrease turnover by 25%, according to Owl Labs’ “State of Remote Work.”
5.2 Ethical Issues and Risk
Using the model before hiring can lead to ethical issues, as candidates with certain ‘labels’ may be discriminated against.
Using the model after hiring may also be unethical: it may be unfair to employees with a low “predicted likelihood of leaving”, because those with a high “predicted likelihood of leaving” will be treated better even if their performance is the same.
Ethical issues might also arise from dishonesty: employees may answer surveys or performance evaluations untruthfully in order to get higher pay based on a higher performance evaluation.
When we calculate the ‘benefit of hiring this employee’, we use average data. But in practice, the value of employees varies from company to company, from job to job, and even from employee to employee.
Competitors in the same industry might offer higher compensation than the current company. This also affects the turnover rate, but our model does not take it into account.
6. Conclusion
Overall, our predictive model can improve company decision-making both in hiring new employees and in reducing turnover among existing employees. When hiring, it improves the company’s decisions by predicting the ‘benefit of hiring this employee’. When reducing turnover, it helps the company target its employee incentive programs and thus save costs.
We used PCA to help select the most important features, including marital status, number of companies previously worked for, and years of employment, which are attributes we can screen for before hiring. Our logistic regression model identifies the features that actually affect attrition, so that we can target them with employee welfare plans. Using a threshold of 0.5, we obtained a true positive rate of 0.47 and a false positive rate of 0.049. As our curve demonstrates, our predicted attrition rate is 19%, and we will target the 20% of employees who are most likely to leave the company.