Predicting Absenteeism at the workplace, a categorical approach

Aug 26, 2019

In this post, I build a machine learning model on the popular Absenteeism at work dataset. The dataset has been used by many organizations trying to find the main reasons behind employees’ absence, so that they can reduce expenses and increase productivity to keep customers satisfied.

Dataset

The dataset used is the popular “Absenteeism at work” dataset. It was created from records of absenteeism at work collected from July 2007 to July 2010 at a courier company in Brazil. It has also been used in academic research at the Universidade Nove de Julho (Postgraduate Program in Informatics and Knowledge Management).

Getting the data

We start by importing the necessary libraries: pandas, numpy, matplotlib, and seaborn. We then use pandas’ read_csv() to read the data.
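A minimal sketch of the loading step. The file name and the semicolon separator are assumptions based on the UCI copy of the dataset, and the three rows below are made up so that the snippet runs without the download. The plotting libraries are left out here because nothing is drawn yet.

```python
import io

import pandas as pd

# Assumption: the UCI copy is a semicolon-separated file named
# "Absenteeism_at_work.csv". With the real file you would call:
#   df = pd.read_csv("Absenteeism_at_work.csv", sep=";")

# Made-up stand-in rows using a few column names assumed to match the UCI file.
sample_csv = io.StringIO(
    "ID;Age;Education;Social smoker;Absenteeism time in hours\n"
    "11;33;1;0;4\n"
    "36;50;1;0;0\n"
    "3;38;1;0;2\n"
)
df = pd.read_csv(sample_csv, sep=";")

print(df.shape)   # (rows, columns)
print(df.head())  # first few records
```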

Data exploration/analysis

The dataset has 740 examples and 20 features, plus the target variable (Absenteeism time in hours).

A bit of feature engineering

Feature engineering gives us 8 more features: Age_Category, smoke_cat, absenteeism category, Disciplinary cat, drink_cat, Education_cat, transportation_category, and distance category.
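Here is a sketch of how three of these categorical features could be derived with pandas. The cut points and labels are hypothetical, since the post does not list the ones actually used, and the rows are made up.

```python
import pandas as pd

# Made-up rows; column names are assumed to match the UCI file.
df = pd.DataFrame({
    "Age": [27, 33, 41, 58],
    "Social smoker": [0, 1, 0, 1],
    "Absenteeism time in hours": [0, 4, 8, 40],
})

# Hypothetical age bands: (0, 30], (30, 45], (45, 120].
df["Age_Category"] = pd.cut(
    df["Age"],
    bins=[0, 30, 45, 120],
    labels=["young", "middle-aged", "old"],
)

# Turn a 0/1 flag into a readable category.
df["smoke_cat"] = df["Social smoker"].map({0: "non-smoker", 1: "smoker"})

# Bucket the numeric target into classes: 0 h, 1 to 8 h, more than 8 h.
df["absenteeism_category"] = pd.cut(
    df["Absenteeism time in hours"],
    bins=[-1, 0, 8, 120],
    labels=["none", "hours", "days"],
)
print(df)
```

Binning the target this way is what turns the original regression problem (hours absent) into a classification problem.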

Data Visualization

Now that we’ve created new features, let’s see how they affect workplace absence.

First, let’s see how employees’ age affects absenteeism.

We can see from the plot that young and old employees have the highest number of hours absent.
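The numbers behind a plot like this can be computed with a groupby. The sketch below uses made-up rows, so its output says nothing about the real data; it only shows the shape of the calculation.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the dataset.
df = pd.DataFrame({
    "Age_Category": ["young", "young", "middle-aged", "old", "old"],
    "Absenteeism time in hours": [8, 16, 2, 24, 8],
})

# Total, average, and number of records per age group, largest total first.
hours_by_age = (
    df.groupby("Age_Category")["Absenteeism time in hours"]
      .agg(["sum", "mean", "count"])
      .sort_values("sum", ascending=False)
)
print(hours_by_age)

# With matplotlib installed, the totals can be drawn as a bar chart:
# hours_by_age["sum"].plot(kind="bar")
```

The same pattern works for the other engineered features by swapping the grouping column.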

Next, we can see how employees’ educational background affects absenteeism.

As educational qualification increases, the hours of absenteeism decrease.

This might be because employees with higher qualifications receive more pay and benefits, including health insurance. Who knows?

Employees who smoke tend to have more hours of absence than those who don’t.

So far, disciplinary actions have the strongest effect on hours of absence. When disciplinary measures are taken, employees tend to become punctual.

Building the Machine Learning Models

We split our data into train and test sets in a 70:30 ratio, taking the absenteeism category as our target and choosing only features that could be obtained from an employee during recruitment.

We'll train on our data with the following machine learning algorithms:
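A sketch of the 70:30 split with scikit-learn. The two feature columns, the stratification, and the seed are assumptions chosen for illustration, and the rows are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 employees, two recruitment-time features, one target.
df = pd.DataFrame({
    "Age": [25, 31, 38, 44, 52, 29, 35, 41, 47, 56],
    "Education": [1, 1, 2, 3, 1, 2, 1, 3, 1, 2],
    "absenteeism_category": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

X = df[["Age", "Education"]]
y = df["absenteeism_category"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,    # the 70:30 ratio from the text
    stratify=y,        # assumption: keep class proportions similar in both parts
    random_state=42,   # arbitrary seed, only for repeatability
)
print(X_train.shape, X_test.shape)
```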

Naive Bayes
Random forest
Logistic regression
Support vector machine

Note: Check the target column to see if it’s balanced before you continue to the next step.

From the count plot above, we can see that our target column is imbalanced. If we train the model without fixing this problem, it will be biased toward the majority class.
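The same check can be done numerically with value_counts(). The labels below are made up and only mimic an imbalanced target.

```python
import pandas as pd

# Made-up labels; the real target would be the absenteeism category column.
y = pd.Series(["hours"] * 8 + ["none"] + ["days"], name="absenteeism_category")

counts = y.value_counts()                # absolute number of rows per class
shares = y.value_counts(normalize=True)  # fraction of rows per class
print(counts)
print(shares.round(2))

# Ratio of the largest to the smallest class; 1.0 would be perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
print(imbalance_ratio)
```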

Therefore, we apply a technique called oversampling. The most common oversampling technique is SMOTE (Synthetic Minority Over-sampling Technique). In simple terms, it takes a minority-class data point in feature space, looks at its k nearest neighbors from the same class, and creates new synthetic points on the line between the point and one of those neighbors.

Okay! It looks like SMOTE has increased the number of samples in the minority classes. Let’s move on with the training.
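To make the idea concrete, here is a simplified sketch of the interpolation step. It is not the library implementation: the function name smote_sketch and its parameters are hypothetical, and in practice one would use a tested implementation such as the SMOTE class in imbalanced-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))             # pick a minority sample
        neighbour = rng.choice(neighbour_idx[base, 1:])  # pick one of its k neighbours
        gap = rng.random()                               # position on the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic


# Made-up minority-class points in a 2-D feature space.
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
X_new = smote_sketch(X_min, n_new=6)
print(X_new)
```

Every synthetic point lies between two real minority points, which is why oversampling this way adds variety instead of plain duplicates. Oversampling is normally applied to the training split only, so that the test set still reflects the real class distribution.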

Naive Bayes classifier

Logistic Regression model

Support vector machine model

Random Forest model
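A compact way to train and score all four models is a single loop. The sketch below runs on synthetic data from make_classification, not on the absenteeism dataset, so its accuracies will differ from the ones reported in this post. GaussianNB and the hyperparameters are assumptions, since the post does not state which variants were used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in data, NOT the absenteeism dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Assumed model variants and settings, chosen only for illustration.
models = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from best to worst test accuracy.
for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```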

As we can see, the Random Forest classifier comes in first place, with an accuracy of 0.89.

Random Forest is a supervised learning algorithm that builds multiple decision trees and merges them to get a more accurate and stable prediction. One big advantage of random forest is that it can be used for both classification and regression problems.
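In scikit-learn, the "merging" is an average of the class probabilities predicted by the individual trees. The sketch below checks this on synthetic data; the dataset and the number of trees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each fitted decision tree is available in forest.estimators_.
tree_probas = np.stack(
    [tree.predict_proba(X[:5]) for tree in forest.estimators_]
)

# Average the per-tree class probabilities and compare with the forest's output.
averaged = tree_probas.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X[:5])))
```

For regression, RandomForestRegressor follows the same pattern but averages the trees' numeric predictions.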

Conclusion

In conclusion, the project has shown that absenteeism can be anticipated and avoided before it occurs, and that the predicted values can be used as an effective tool for crisis management. An organization could also use the hidden causes of absenteeism to set additional requirements for a new job, and so reduce the percentage of absenteeism.

This post is the write-up of a project assignment from the AI Wednesday cohort, organized by Data Science Nigeria.

Special thanks to everyone who contributed to making the project successful: Abdul Quadri, Ojo Olawale, Ogheneovo Idoghor Otulagun Daniel Oluwatosin, Haneefah Abdul-Rahman Lekki.

A special thanks to Data Science Nigeria for creating a world-class platform for learning Data Science and Artificial Intelligence.

We also say a big thank you to our instructors, Daniel Ajisafe and the entire AI Wednesday team.

Click here for the link to the notebook. If you find this post interesting 😍, let’s see how many times you can hit the clap button in a second.