Skip to content

Miguel-G-Soares/Employee-Turnover-ML-Model

Repository files navigation

Employee Turnover Prediction at Salifort Motors

Project Overview

This project's goal was to analyze employee data from Salifort Motors to identify key factors that contribute to employee turnover and create various machine learning models to predict whom is most at risk. Predicting which employees are likely to leave the company, allows HR to take proactive steps to improve retention.

The dataset includes employee features like satisfaction, evaluation scores, tenure, and work-related statistics.

The best-performing models were the Decision Tree and Random Forest, with recall scores of 92.0% and 91.5% respectively.

Business Problem:

The HR department of Salifort Motors is concerned about employee turnover, a critical issue for many companies. High turnover rates lead to substantial recruitment and training costs and can cause operational disruptions. The department wants to understand what features contribute most to emloyee turnover rates and identify employees at risk. By retaining employees, Salifort Motors aims to lower these costs and maintain a more stable and experienced workforce.

Employee turnover costs have been studied in depth across industries, with estimates indicating that replacing an employee can cost anywhere from 50% to over 200% of their annual salary, depending on the role and expertise required. Actively improving retention, therefore, not only save direct costs but can also lead to better employee satisfaction and company culture.

Data

The dataset collected includes several variables that could influence employee turnover, such as:

  • Satisfaction Level: Self-reported satisfaction scores from employees.
  • Last Evaluation: Performance evaluations given to employees.
  • Number of Projects: The number of projects the employee has worked on.
  • Average Monthly Hours: Time spent working in the company each month.
  • Years at the Company: Employee tenure.
  • Work Accident: Whether the employee has experienced a work accident.
  • Promotion in Last 5 Years: If the employee has been promoted in the last five years.
  • Department: Department in which the employee works.
  • Salary: Salary levels (Low, Medium, High).
  • Turnover (Target): Whether the employee left the company or not.

Data Limitations:

  • There may be potential biases in self-reported data, such as satisfaction levels.
  • Salary data is categorized, limiting insights into precise pay discrepancies.
  • The time frame of the data is not explicitly provided, so trends over time may be difficult to analyze without that context.
  • Some features might be a scource of data leakage, such as Satisfaction levels and Evaluations, as such features are depenant on data scarsely collected.

Modeling

#Data manipulation
import numpy as np
import pandas as pd
from scipy import stats

#Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

#Data modeling
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import CategoricalNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

#Metrics and useful functions
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import GridSearchCV
from xgboost import plot_importance

#Saving models
import pickle
data = pd.read_csv("HR_capstone_dataset.csv")

EDA

Cleaning

data.head()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years Department salary
0 0.38 0.53 2 157 3 0 1 0 sales low
1 0.80 0.86 5 262 6 0 1 0 sales medium
2 0.11 0.88 7 272 4 0 1 0 sales medium
3 0.72 0.87 5 223 5 0 1 0 sales low
4 0.37 0.52 2 159 3 0 1 0 sales low
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64  
 3   average_montly_hours   14999 non-null  int64  
 4   time_spend_company     14999 non-null  int64  
 5   Work_accident          14999 non-null  int64  
 6   left                   14999 non-null  int64  
 7   promotion_last_5years  14999 non-null  int64  
 8   Department             14999 non-null  object 
 9   salary                 14999 non-null  object 
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
data.describe()
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years
count 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000 14999.000000
mean 0.612834 0.716102 3.803054 201.050337 3.498233 0.144610 0.238083 0.021268
std 0.248631 0.171169 1.232592 49.943099 1.460136 0.351719 0.425924 0.144281
min 0.090000 0.360000 2.000000 96.000000 2.000000 0.000000 0.000000 0.000000
25% 0.440000 0.560000 3.000000 156.000000 3.000000 0.000000 0.000000 0.000000
50% 0.640000 0.720000 4.000000 200.000000 3.000000 0.000000 0.000000 0.000000
75% 0.820000 0.870000 5.000000 245.000000 4.000000 0.000000 0.000000 0.000000
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000

Cleaning dataset's column names

data.columns
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'Department', 'salary'],
      dtype='object')
#Changes all column names to be in lowercase
for name in data.columns:
    data.rename(columns={name:name.lower()},inplace=True)

data.columns
Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'work_accident', 'left',
       'promotion_last_5years', 'department', 'salary'],
      dtype='object')
#imporve readability and correct misspelings
data.rename(columns={'average_montly_hours':'average_monthly_hours'},inplace=True)
data.rename(columns={'time_spend_company':'tenure'},inplace=True)

Checking for missing values

data.isna().sum()
satisfaction_level       0
last_evaluation          0
number_project           0
average_monthly_hours    0
tenure                   0
work_accident            0
left                     0
promotion_last_5years    0
department               0
salary                   0
dtype: int64

No missing values were found

Checking for duplicates

data.duplicated().sum()
3008
data.duplicated().sum()/len(data.index)
0.2005467031135409

There are 3008 duplicated entries, correlating to 20.05% of the dataset being comprised of duplicated entries

Pattern analysis within duplicated data

data[data.duplicated()].head()
satisfaction_level last_evaluation number_project average_monthly_hours tenure work_accident left promotion_last_5years department salary
396 0.46 0.57 2 139 3 0 1 0 sales low
866 0.41 0.46 2 128 3 0 1 0 accounting low
1317 0.37 0.51 2 127 3 0 1 0 sales medium
1368 0.41 0.52 2 132 3 0 1 0 RandD low
1461 0.42 0.53 2 142 3 0 1 0 sales low
data[data.duplicated()].describe(include='all')
satisfaction_level last_evaluation number_project average_monthly_hours tenure work_accident left promotion_last_5years department salary
count 3008.000000 3008.000000 3008.000000 3008.000000 3008.000000 3008.000000 3008.000000 3008.000000 3008 3008
unique NaN NaN NaN NaN NaN NaN NaN NaN 10 3
top NaN NaN NaN NaN NaN NaN NaN NaN sales low
freq NaN NaN NaN NaN NaN NaN NaN NaN 901 1576
mean 0.545765 0.713787 3.803856 203.349734 4.029920 0.106051 0.525266 0.038564 NaN NaN
std 0.266406 0.182012 1.477272 54.467101 1.795619 0.307953 0.499444 0.192585 NaN NaN
min 0.090000 0.360000 2.000000 97.000000 2.000000 0.000000 0.000000 0.000000 NaN NaN
25% 0.380000 0.540000 2.000000 151.000000 3.000000 0.000000 0.000000 0.000000 NaN NaN
50% 0.530000 0.725000 4.000000 204.000000 3.000000 0.000000 1.000000 0.000000 NaN NaN
75% 0.780000 0.880000 5.000000 253.000000 5.000000 0.000000 1.000000 0.000000 NaN NaN
max 1.000000 1.000000 7.000000 310.000000 10.000000 1.000000 1.000000 1.000000 NaN NaN
data.department.value_counts(normalize=True)['sales']*3008
830.2633508900592
data.salary.value_counts(normalize=True)['low']*3008
1467.1996799786652
#Chance of two people randomly having the exact same information
1/100*1/100*1/6*1/200*1/9*1/2*1/2*1/2*1/10*1/3
3.8580246913580245e-11

There isn't any specific pattern to the duplicated entries and they are too prolific to be random chance, suggest investigating further into possible problems regarding data entry or storage

As explained, data is unlikely to be representative and as such should be remomved

data1 = data.drop_duplicates(keep='first')

Checking for outliers

for column in data1.columns:
    plt.figure(figsize=(5,1))
    sns.boxplot(data1[column],orient='h')
    plt.show()

png

png

png

png

png

png

png

png

Only tenure has significant outliers that might need to be removed depending on the model chosen

Exploration

How many employees are leaving the company

data1['left'].value_counts()
data1['left'].value_counts(normalize=True)

Why are employees leaving

plt.figure(figsize=(6,3))
sns.histplot(data=data1,x='satisfaction_level',bins=25,hue='left',alpha=0.5)
plt.show()

png

plt.figure(figsize=(6,3))
sns.histplot(data=data1,x='last_evaluation',bins=25,hue='left',alpha=0.5)
plt.show()

png

Satisfaction level vs Last evaluation shows the relationship betwwen the satisfaction of the company and an employee with their own work

plt.figure(figsize=(10,5))
sns.scatterplot(data=data,x='satisfaction_level',y='last_evaluation',hue='left',alpha=0.3)
plt.yticks([x/10 for x in range(0,11,1)])
plt.show()

png

There appears to be three main grouping of employees who leave the company, those who are highly evaluated but have abysmal satisfaction scores, those who are evaluated lower but still have a below average satisfaction and those who are highly rated and are satisfied

Having established the three main groupings, proceed to try to understand more about them

plt.figure(figsize=(10,5))
sns.scatterplot(data=data,x='average_monthly_hours',y='satisfaction_level',hue='left',alpha=0.3)
plt.yticks([x/10 for x in range(0,11,1)])
plt.show()

png

On average a person working full time works between 160 to 170 hours per month.

A clear distinction starts appearing between two of the most populous groups. The first are highly overworked employees, with high evaluations but dissatisfied and probably wanted to leave as fast as possible and resigned. Whilke the second are employees who were working less than average, weren't performing up to standards and were probably laid-off.

The third group is still without major characteristics towards wanting to leave, possibly satisfied employeed who found better opportunities

Note: The clear distinctions and shape of the distribution are clear symptoms of either synthetic or manipulated data

It could also be interesting to further observe what impacts satisfaction levels

sns.boxplot(data=data1, x='satisfaction_level', y='tenure', hue='left', orient="h")
plt.gca().invert_yaxis()

png

Overall we see two distinct trends for employees that leave, those with lower tenures and lower satisfactions and those who stayed for longer.

It is also possible to see a decline in satisfaction from the second to fourth year, with it being unnusualy low for those who left during their 4th year, does this have a significant impact on the number of people leaving in that year?

sns.histplot(data=data1, x='tenure', hue='left', multiple='dodge', shrink=5)
plt.title('Employee count according to tenure')
plt.show()

png

The decline in satisfaction within the 3-4th year seems to translate higly into higher rates of turnover, is there any reason behind this significant drop in satisfaction?

#Setting up the subplots
fig, axs = plt.subplots(nrows=2, ncols=3, figsize = (14,5))
gs=axs[1, 2].get_gridspec()
for i in [0,1]:
    for j in [0,1]:
        axs[i,j].remove()
axbig0=fig.add_subplot(gs[0:2, 0])
axbig1=fig.add_subplot(gs[0:2, 1])
fig.tight_layout()

#Separating data by tenure
short_tenure=data1[data1.tenure<6]
long_tenure=data1[data1.tenure>=6]

#Graphs
sns.barplot(data=data1, x='tenure', y='average_monthly_hours', hue='left',ax=axbig0)
axbig0.set_title('Hours worked by tenure')

sns.barplot(data=data1, x='tenure', y='number_project', hue='left',ax=axbig1)
axbig1.set_title('Number of projects by tenure')

sns.histplot(data=short_tenure, x='tenure', hue='salary', discrete=1, 
             hue_order=['low', 'medium', 'high'], multiple='dodge', shrink=.5,ax=axs[0,2])
axs[0,2].set_title('Number of projects by tenure')

sns.histplot(data=long_tenure, x='tenure', hue='salary', discrete=1, 
             hue_order=['low', 'medium', 'high'], multiple='dodge', shrink=.5,ax=axs[1,2])
axs[1,2].set_title('Number of projects by tenure')
plt.show()

png

An overall increase in number of projects can be seen, especially for those who left at year 4, with them averaging more than 250 hours a month, working on average more than 12.5 hours a day. This increase in workload isn't matched with increases in salary, where the proportions are similair within the first 5 years and only later on do the proportions start shifting towards higher salaries

Beyond this it is also of interest to look whether the employees who're putting in the work are getting the promotions they deserve

plt.figure(figsize=(14,3))
sns.scatterplot(data=data1, x='average_monthly_hours', y='promotion_last_5years', hue='left', alpha=0.5)
plt.title('Employees promoted according to wours worked')
plt.show()

png

Overall very few employees are being promoted, and the vast majority of them weren't the highest workers. The plot also shows that the employees weren't promoted and worked the longest hours

Following this, inpecting the distribution accross departments

sns.histplot(data=data1, x='department', hue='left', discrete=1,hue_order=[0, 1], multiple='dodge', shrink=.5)
plt.xticks(rotation=45,ha='right')
plt.show()

png

No department differs significantly from the other in terms of turnover proportion

Lastly checking the correlation between variables

plt.figure(figsize=(10,6))
sns.heatmap(data1.iloc[:,0:-2].corr(),annot=True,cmap='binary')

png

High positive correlation between number of projects, hours worked and evaluation received, and turnover is negatively correlated with employee satisfaction

Creating the ML models

The goal is to predict whether an employee will leave the company or not, which is a binary categorical outcome. Defining our task as one of binary classification.

There are multiple models available for the task, with the ones considered being:

  • Binomial logistic regression
  • Naive Bayes
  • Single Decision Tree
  • Random Forest
  • Gradient Boosting

Saving and loading models

def save_model(model_name,model_object):
    with open(model_name+'.pickle' , 'wb' ) as to_write:
   		 pickle.dump( model_object , to_write )
        
def load_model(model_name,model_object):
    with open(model_name+'.pickle','rb') as to_read:
        model_object = pickle.load(to_read)

Preparing dataset for modeling

Dataset has two categorical features that need to be transformed into numeric:

  • department: non-ordinal categorical variable
  • salary: ordinal categorical variable (low-average-high)
data2=data1.copy()
data2.salary.value_counts()
salary
low       5740
medium    5261
high       990
Name: count, dtype: int64
ordinal_encoder = OrdinalEncoder(categories=[['low','medium','high']])
ordinal_encoder.fit(data2[['salary']])
data2.loc[:,['salary']]=ordinal_encoder.transform(data2[['salary']])
data2.salary=data2.salary.astype('int64')
data2.salary.value_counts()
salary
0    5740
1    5261
2     990
Name: count, dtype: int64
data2 = pd.get_dummies(data2,columns=['department'],drop_first=True)
data2.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11991 entries, 0 to 11999
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   satisfaction_level      11991 non-null  float64
 1   last_evaluation         11991 non-null  float64
 2   number_project          11991 non-null  int64  
 3   average_monthly_hours   11991 non-null  int64  
 4   tenure                  11991 non-null  int64  
 5   work_accident           11991 non-null  int64  
 6   left                    11991 non-null  int64  
 7   promotion_last_5years   11991 non-null  int64  
 8   salary                  11991 non-null  int64  
 9   department_RandD        11991 non-null  bool   
 10  department_accounting   11991 non-null  bool   
 11  department_hr           11991 non-null  bool   
 12  department_management   11991 non-null  bool   
 13  department_marketing    11991 non-null  bool   
 14  department_product_mng  11991 non-null  bool   
 15  department_sales        11991 non-null  bool   
 16  department_support      11991 non-null  bool   
 17  department_technical    11991 non-null  bool   
dtypes: bool(9), float64(2), int64(7)
memory usage: 1.3 MB

Logistic Regression

Logistic regression has 4 main assumptions

  • Linearity
  • Independent observations
  • No multicolinearity
  • No Extreme Outliers

To observe multicolinearity:

sns.pairplot(data2.iloc[:,:9])
plt.show()

png

No multicolinearity is present and observations are independent as each is referring to a distinct employee

Removing outliers, as determined previously only present in tenure

# Determining Q1 and Q3
tenure_q1=data2.tenure.quantile(0.25)
tenure_q3=data2.tenure.quantile(0.75)

# Calculating inter-quartile range
tenure_iqr=tenure_q3-tenure_q1

#Creating new dataframe without outliers
data_logreg=data2[(data2.tenure>(tenure_q1-1.5*tenure_iqr))|(data2.tenure<(tenure_q3+1.5*tenure_iqr))]
data_logreg.head()
satisfaction_level last_evaluation number_project average_monthly_hours tenure work_accident left promotion_last_5years salary department_RandD department_accounting department_hr department_management department_marketing department_product_mng department_sales department_support department_technical
0 0.38 0.53 2 157 3 0 1 0 0 False False False False False False True False False
1 0.80 0.86 5 262 6 0 1 0 1 False False False False False False True False False
2 0.11 0.88 7 272 4 0 1 0 1 False False False False False False True False False
3 0.72 0.87 5 223 5 0 1 0 0 False False False False False False True False False
4 0.37 0.52 2 159 3 0 1 0 0 False False False False False False True False False
#Selecting the variables
x_logreg = data_logreg.drop(columns=['left'])
y_logreg = data_logreg.left
x_logreg.info()
<class 'pandas.core.frame.DataFrame'>
Index: 11991 entries, 0 to 11999
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   satisfaction_level      11991 non-null  float64
 1   last_evaluation         11991 non-null  float64
 2   number_project          11991 non-null  int64  
 3   average_monthly_hours   11991 non-null  int64  
 4   tenure                  11991 non-null  int64  
 5   work_accident           11991 non-null  int64  
 6   promotion_last_5years   11991 non-null  int64  
 7   salary                  11991 non-null  int64  
 8   department_RandD        11991 non-null  bool   
 9   department_accounting   11991 non-null  bool   
 10  department_hr           11991 non-null  bool   
 11  department_management   11991 non-null  bool   
 12  department_marketing    11991 non-null  bool   
 13  department_product_mng  11991 non-null  bool   
 14  department_sales        11991 non-null  bool   
 15  department_support      11991 non-null  bool   
 16  department_technical    11991 non-null  bool   
dtypes: bool(9), float64(2), int64(6)
memory usage: 1.2 MB
xtrain,xtest,ytrain,ytest = train_test_split(x_logreg,y_logreg,stratify=y_logreg,test_size=0.2,random_state=0)
# Instantiating the model
logreg = LogisticRegression(random_state=0,max_iter=1000)

# Fitting the model
logreg.fit(xtrain,ytrain)
# Predicting turnover using test dataset
ypred_logreg = logreg.predict(xtest)

# Creating the confusion matrix
cm_logreg=metrics.confusion_matrix(ytest,ypred_logreg,labels=logreg.classes_)

# Diplaying the confusion matrix
metrics.ConfusionMatrixDisplay(confusion_matrix=cm_logreg,display_labels=logreg.classes_).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1ea6cb3c1d0>

png

model_results=pd.DataFrame({
    'Name' : ['Logistic Regression'],
    'Accuracy' : [metrics.accuracy_score(ytest,ypred_logreg)],
    'Precision' : [metrics.precision_score(ytest,ypred_logreg)],
    'Recall' : [metrics.recall_score(ytest,ypred_logreg)],
    'F1' : [metrics.f1_score(ytest,ypred_logreg)]})
model_results
Name Accuracy Precision Recall F1
0 Logistic Regression 0.825761 0.440476 0.18593 0.261484
save_model('LogReg',logreg)

Naive Bayes

The only assumption made by the naive Bayes model is independence among predictors, although this doesn't apply as demonstrated previously it can still preform satisfactorly with the assumption broken

x=data2.drop(columns=['left'])
y=data2.left
xtrain,xtest,ytrain,ytest=train_test_split(x,y,stratify=y,test_size=0.2,random_state=0)
nb = CategoricalNB()

nb.fit(xtrain,ytrain)
ypred_nb=nb.predict(xtest)

cm_nb=metrics.confusion_matrix(ytest,ypred_nb,labels=nb.classes_)

metrics.ConfusionMatrixDisplay(confusion_matrix=cm_nb,display_labels=nb.classes_).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1ea7000d0d0>

png

model_results=model_results._append(pd.DataFrame({
    'Name':'Naive Bayes',
    'Accuracy' : [metrics.accuracy_score(ytest,ypred_nb)],
    'Precision' : [metrics.precision_score(ytest,ypred_nb)],
    'Recall' : [metrics.recall_score(ytest,ypred_nb)],
    'F1' : [metrics.f1_score(ytest,ypred_nb)]}))
model_results
Name Accuracy Precision Recall F1
0 Logistic Regression 0.825761 0.440476 0.185930 0.261484
0 Naive Bayes 0.918299 0.782123 0.703518 0.740741
save_model('Naivebayes',nb)

Decision-Tree

There are no required assumptions from the model

In this scenarion the company is mostly interested in a model that allows them to know as much as possible who is in danger of leaving the company, for that the scoring metric used will be recall

# Instantiating the model
dt = DecisionTreeClassifier(random_state=0)

# Selecting parameters to tune
params_dt = {
    'max_depth':[4, 6, 8, None],
    'min_samples_leaf': [1, 2, 5],
    'min_samples_split': [2, 4, 6]
}

# Selecting scores
scoring=('accuracy','precision','recall','f1')

# Instantiating the cross-validation classifier
clf_dt = GridSearchCV(dt,param_grid=params_dt,scoring=scoring,cv=5,refit='recall')
%%time
clf_dt.fit(xtrain,ytrain)
CPU times: total: 4.75 s
Wall time: 4.82 s
print(clf_dt.best_params_)
print(clf_dt.best_score_)
{'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 2}
0.9202835117604147

Improving tuning

params_dt = {
    'max_depth':[5,6,7],
    'min_samples_leaf': [1,2],
    'min_samples_split': [2,3]
}
clf_dt = GridSearchCV(dt,param_grid=params_dt,scoring=scoring,cv=5,refit='recall')
%%time
clf_dt.fit(xtrain,ytrain)
CPU times: total: 1.38 s
Wall time: 1.41 s
print(clf_dt.best_params_)
print(clf_dt.best_score_)
{'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 2}
0.9202835117604147
ypred_dt = clf_dt.best_estimator_.predict(xtest)

cm_dt=metrics.confusion_matrix(
    ytest,
    ypred_dt,
    labels=clf_dt.best_estimator_.classes_
)

metrics.ConfusionMatrixDisplay(
    confusion_matrix=cm_dt,
    display_labels=clf_dt.best_estimator_.classes_
).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1ea71684510>

png

model_results=model_results._append(pd.DataFrame({
    'Name':['Decision Tree'],
    'Accuracy' : [metrics.accuracy_score(ytest,ypred_dt)],
    'Precision' : [metrics.precision_score(ytest,ypred_dt)],
    'Recall' : [metrics.recall_score(ytest,ypred_dt)],
    'F1' : [metrics.f1_score(ytest,ypred_dt)]}))
model_results
Name Accuracy Precision Recall F1
0 Logistic Regression 0.825761 0.440476 0.185930 0.261484
0 Naive Bayes 0.918299 0.782123 0.703518 0.740741
0 Decision Tree 0.981659 0.968254 0.919598 0.943299
save_model('DecisionTree',clf_dt.best_estimator_)

Random Forest

There are no required assumptions

rf = RandomForestClassifier(random_state=0)

params_rf = {
    'max_depth':[2,5,None],
    'min_samples_leaf':[1,2,3],
    'max_features':[0.25,0.5,0.75],
    'n_estimators':[50,100]
}

clf_rf = GridSearchCV(
    rf,
    param_grid=params_rf,
    scoring=scoring,
    cv=5,
    refit='recall')
%%time
clf_rf.fit(xtrain,ytrain)
CPU times: total: 2min 7s
Wall time: 2min 11s
print(clf_rf.best_params_)
print(clf_rf.best_score_)
{'max_depth': None, 'max_features': 0.75, 'min_samples_leaf': 1, 'n_estimators': 50}
0.9215354586857514

Tuning hyperparameters

params_rf = {
    'max_depth':[10,15,None],
    'min_samples_leaf':[1],
    'max_features':[0.6,0.7,0.8,0.9,1],
    'n_estimators':[40,60,70]
}
clf_rf = GridSearchCV(
    rf,
    param_grid=params_rf,
    scoring=scoring,
    cv=5,
    refit='recall')
%%time
clf_rf.fit(xtrain,ytrain)
CPU times: total: 2min 29s
Wall time: 2min 31s
print(clf_rf.best_params_)
print(clf_rf.best_score_)
{'max_depth': 10, 'max_features': 0.8, 'min_samples_leaf': 1, 'n_estimators': 40}
0.9221624179334003
params_rf = {
    'max_depth':[8,9,10,11,12],
    'min_samples_leaf':[1],
    'max_features':[0.8],
    'n_estimators':[20,30,40,50]
}
clf_rf = GridSearchCV(
    rf,
    param_grid=params_rf,
    scoring=scoring,
    cv=5,
    refit='recall')
%%time
clf_rf.fit(xtrain,ytrain)
CPU times: total: 44.2 s
Wall time: 44.7 s
print(clf_rf.best_params_)
print(clf_rf.best_score_)
{'max_depth': 10, 'max_features': 0.8, 'min_samples_leaf': 1, 'n_estimators': 30}
0.9221624179334003
ypred_rf = clf_rf.best_estimator_.predict(xtest)

cm_rf=metrics.confusion_matrix(
    ytest,
    ypred_rf,
    labels=clf_rf.best_estimator_.classes_
)

metrics.ConfusionMatrixDisplay(
    confusion_matrix=cm_rf,
    display_labels=clf_rf.best_estimator_.classes_
).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1ea6c9492d0>

png

model_results=model_results._append(pd.DataFrame({
    'Name':['Random Forest'],
    'Accuracy' : [metrics.accuracy_score(ytest,ypred_rf)],
    'Precision' : [metrics.precision_score(ytest,ypred_rf)],
    'Recall' : [metrics.recall_score(ytest,ypred_rf)],
    'F1' : [metrics.f1_score(ytest,ypred_rf)]}))
model_results
Name Accuracy Precision Recall F1
0 Logistic Regression 0.825761 0.440476 0.185930 0.261484
0 Naive Bayes 0.918299 0.782123 0.703518 0.740741
0 Decision Tree 0.981659 0.968254 0.919598 0.943299
0 Random Forest 0.984160 0.989130 0.914573 0.950392
save_model('RandomForest',clf_rf.best_estimator_)

Gradient Boosting

There are no required assumptions

xgb = XGBClassifier(objective='binary:logistic',random_state=0,enable_categorical=True)

params_xgb = {
    'max_depth':[2,5,10],
    'n_estimators':[30,50,80],
    'learning_rate':[0.01,0.1,0.3],
    'min_child_weight':[1,2,3],
    'colsample_bytree':[0.25,0.5,0.75]
}

clf_xgb = GridSearchCV(
    xgb,
    param_grid=params_xgb,
    scoring=scoring,
    cv=5,
    refit='recall')
%%time
clf_xgb.fit(xtrain,ytrain)
CPU times: total: 4min 5s
Wall time: 1min 11s
print(clf_xgb.best_params_)
print(clf_xgb.best_score_)
{'colsample_bytree': 0.5, 'learning_rate': 0.3, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 50}
0.9246761696338794
params_xgb = {
    'max_depth':[4,5,6],
    'n_estimators':[40,50,60],
    'learning_rate':[0.2,0.3,0.4],
    'min_child_weight':[1],
    'colsample_bytree':[0.4,0.5,0.6]
}

clf_xgb = GridSearchCV(
    xgb,
    param_grid=params_xgb,
    scoring=scoring,
    cv=5,
    refit='recall')
%%time
clf_xgb.fit(xtrain,ytrain)
CPU times: total: 1min 17s
Wall time: 22 s
print(clf_xgb.best_params_)
print(clf_xgb.best_score_)
{'colsample_bytree': 0.5, 'learning_rate': 0.3, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 50}
0.9246761696338794
params_xgb = {
    'max_depth':[5],
    'n_estimators':[50],
    'learning_rate':[0.25,0.3,0.35],
    'min_child_weight':[1],
    'colsample_bytree':[0.5]
}

clf_xgb = GridSearchCV(
    xgb,
    param_grid=params_xgb,
    scoring=scoring,
    cv=5,
    refit='recall')
%%time
clf_xgb.fit(xtrain,ytrain)
CPU times: total: 2.31 s
Wall time: 1.18 s
print(clf_xgb.best_params_)
print(clf_xgb.best_score_)
{'colsample_bytree': 0.5, 'learning_rate': 0.3, 'max_depth': 5, 'min_child_weight': 1, 'n_estimators': 50}
0.9246761696338794
ypred_xgb = clf_xgb.best_estimator_.predict(xtest)

cm_xgb=metrics.confusion_matrix(
    ytest,
    ypred_xgb,
    labels=clf_xgb.best_estimator_.classes_
)

metrics.ConfusionMatrixDisplay(
    confusion_matrix=cm_xgb,
    display_labels=clf_xgb.best_estimator_.classes_
).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x1ea6d4b01d0>

png

model_results=model_results._append(pd.DataFrame({
    'Name':['Gradient Boosting'],
    'Accuracy' : [metrics.accuracy_score(ytest,ypred_xgb)],
    'Precision' : [metrics.precision_score(ytest,ypred_xgb)],
    'Recall' : [metrics.recall_score(ytest,ypred_xgb)],
    'F1' : [metrics.f1_score(ytest,ypred_xgb)]}))
model_results
Name Accuracy Precision Recall F1
0 Logistic Regression 0.825761 0.440476 0.185930 0.261484
0 Naive Bayes 0.918299 0.782123 0.703518 0.740741
0 Decision Tree 0.981659 0.968254 0.919598 0.943299
0 Random Forest 0.984160 0.989130 0.914573 0.950392
0 Gradient Boosting 0.982493 0.975936 0.917085 0.945596
save_model('GradientBoosting',clf_xgb.best_estimator_)

Most important features for modeling

importances=pd.DataFrame({'feature':x.columns,'feature_importance':clf_dt.best_estimator_.feature_importances_})
importances=importances[importances.feature_importance>0].sort_values(by='feature_importance',ascending=False)
importances
feature feature_importance
0 satisfaction_level 0.523339
1 last_evaluation 0.166406
2 number_project 0.129295
4 tenure 0.114793
3 average_monthly_hours 0.066123
16 department_technical 0.000044

Since both satisfaction levels and evaluations are so significant for the model, it could be usefull to develop a model to predict the values that doesn't depend on self reported values

Note: Satisfaction levels might be a scource of data leakage, further development could be made to have models that aren't based of such features

Conclusion

Based on the analysis, the decision Tree and Random Forest models emerged as the most effective tools for predicting employee turnover as they achieved the highest recall and precision scores, suggesting they can reliably identify employees who may leave the company. These models can be used by HR to focus retention efforts on high-risk employees.

Recommendations

  • Focus on employee satisfaction: The company should prioritize initiatives to improve employee satisfaction, as this was a key factor in predicting turnover.

  • Work-life balance: Employees with the most work hours billed have the highest risk of leaving, so a more balanced approach to time-spent at work is essential.

  • Career development: Offering promotions and career advancement opportunities could be a way to retain employees, especially those who are highly evaluated and are dedicated to the work.

Future Steps

  • Expand the analysis by including time-series data, which could provide deeper insights into employee behavior over time.

  • Investigate the impact of external factors, such as economic conditions, on employee turnover.

  • Test models with additional features, such as work-life balance or employee project interest, to enhance prediction accuracy further.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published