Churn Prediction of Employees Using Machine Learning Techniques

Abstract: Employees are considered the most valuable assets of any organization. HR professionals have introduced various policies to create a good working environment for them, yet the rate at which employees quit the technology industry remains quite high. Early attrition is often driven by company-related or personal issues, such as lack of satisfaction at the workplace, fewer opportunities for learning, undue workload, and lack of encouragement, among many others. This paper presents a structured way of predicting employee churn by implementing classification techniques such as SVM, the Random Forest classifier, and the Naive Bayes classifier. The performance of the classifiers was compared using metrics such as the Confusion Matrix, Recall, False Positive Rate, and Accuracy to determine the best model for churn prediction. Among the models, the Random Forest classifier proved to be the best for IT employee churn prediction. A Correlation Matrix was generated in the form of a heatmap to identify the important features that might affect the attrition rate.


INTRODUCTION
"Attrition" is no longer a new term, as it has become an unavoidable situation in any business or organization, where staff and employees tend to leave because of their personal and professional circumstances. If it is not given timely attention, it can have a huge impact on an organization's growth curve [1]. The major battle against employee attrition is currently being fought by the technology industry in India. An analysis from LinkedIn shows that the software industry suffers from the highest turnover rate, about 13.2%, compared with the retail, entertainment, and professional industries. According to Maren Hogan, a talent acquisition expert, the following points need attention: 1. One-third of new joiners quit within six months of joining an organization. 2. After a week of working in a company, a few decide whether or not they want to stay there for the long term. 3. A third of heads in companies with more than 100 employees are searching for new job opportunities [2].
Today's millennials are often identified as "job-hoppers", as they change or quit their jobs more frequently than past generations in order to reach the next step of their career. Rather than staying loyal to one company, they tend to search for better opportunities so that they can keep up in the era of digital progression. A closer look reveals a distinct set of factors, such as industry, proper recognition, communication, ethnicity, age, and gender, that drive employees to leave a particular organization. Talent-hiring consultants face challenges of their own: they must sort through resumes and conversations to find the candidates who will become assets to the organization, and if a person quits they must repeat the entire hiring process. Hiring new talent and training them in current technologies each time imposes a considerable cost on the organization. Apart from this tangible expense, a newly hired person needs a fair amount of time to become a productive member of the project [3].
The Human Resource department of any organization generates a plethora of data related to employees' leave, promotion cycles, rewards, wages, evaluations, conflicts, policies, and benefits. As researchers, our task is to identify the parameters or areas in which employees regularly face issues at their workplace.
In this data-driven study, we analyze employee data using classification techniques and provide insights and suggestions so that the organization can retain and develop its employees before it is too late. HR professionals and managers should always focus on individuals or specific groups of employees, and especially on their specific needs or situations; only then can an organization grow without losing good employees.

Research Objective
In our study, we analyze data on technology professionals, especially the challenges that they face directly or indirectly at their workplace. The main objectives of this study are: 1. To identify the challenges or input variables that have a large impact on an employee's intention to leave the organization. 2. To accurately predict, using classification models, which employees will leave the organization in the next few years.

LITERATURE REVIEW
An evidence-based study by Janet et al. (2017) combined the published, peer-reviewed literature on HR Analytics and concentrated on answering major questions on HR Analytics: how it works, what its outcomes are, and why it needs to flourish. They stated that interest in analytics in the HR domain has gradually increased over the past few years [4].
The authors concluded that the adoption of HR Analytics across organizations is very low and that evidence on the topic is scattered, and they suggested areas for future research. Many departments, such as Marketing, Finance, and Supply Chain Management, today draw insights from the large volumes of data collected from employees so that they can stay competitive. The Human Resource department generates massive amounts of data on employee turnover, Return on Investment, and Cost per Hire, but it still has a hard time relating these data to the organization's performance. It should create reports on past performance and administrative tasks, and generate compliance reports, to understand the employees' contribution to the organization [5].
The HR applications followed by today's organizations can act as a mediator between the HR practices planned in an organization and positive employee outcomes. Innocenti et al. (2012) proposed a model based on survey data collected from over 6000 employees working in almost 37 Italian organizations, with employee commitment and job satisfaction as the outcome variables. Using maximum likelihood estimation and the correlations between the variables, they reported that experienced HR practices always have a positive effect on both affective engagement towards the organization and job satisfaction [6].
Line managers are considered assets of an organization, so it is necessary to keep them engaged so that they can add value to it. Sana et al. (2016) conducted a few semi-structured interviews to understand the experience and perceptions of line managers regarding the level of support and help provided by the HR professionals of their organization [7]. They reported that the line managers raised concerns and suggested ideas for improving a few areas, such as perceptions of policies, workload, inadequate training, and HR practices, which deserve attention in any research on the factors related to employee attrition or turnover.
There have been several studies on identifying the parameters that play a role in employees' job satisfaction and on predicting the attrition rate. Many data analytics techniques and classification models have been used to predict turnover. In any organization, innovation can seldom be duplicated, and once a group of productive employees leaves, they cannot easily be replaced. To retain such employees and predict the turnover rate, a neural network with 10-fold cross-validation was designed for a small Midwest manufacturing company and predicted turnover with high accuracy [8]. Among layoffs, discharges, unavoidable separation, and voluntary separation, voluntary separation was identified as the most difficult area, because the organization loses its investment in talent to its competitors. On the same note, Fan et al. (2012) focused on why technology enterprises in Taiwan are unable to retain their talented employees and discussed ways in which these organizations can increase their competitiveness. Techniques such as clustering analysis, hybrid artificial neural networks, and other machine learning techniques were applied to forecast patterns in the employee turnover rate [9]. Many classification models have also been used for prediction on an HR analytics dataset from Kaggle, an online community data site. Sisodia et al. (2017) evaluated the correlations between different attributes in their paper [10] and compared different classifiers using parameters such as Accuracy, Precision, True Positive Rate, F-Measure, and a few others. Weighted TPR-TNR has been proposed as another metric for evaluating classifiers, as it focuses on the imbalance ratio of a dataset and assigns different weights to TPR (Sensitivity) and TNR (Specificity), the two quantities chiefly considered when comparing the ROC curves of models. A mix of balanced and imbalanced datasets was used to evaluate the performance of 12 classifiers using this metric [11].
To help build and maintain a strong relationship between an organization and its employees, Hebbar et al. (2018) first applied Logistic Regression to an IBM Employee Attrition dataset available on Kaggle to get a basic idea of the outcome group into which each individual falls [12]. They then carried out a comparative study with SVM and Random Forest models, determined the major characteristics of the dataset through Exploratory Data Analysis, and represented the data using different visualizations.
On the same dataset, Bhartiya et al. (2019) applied the Synthetic Minority Oversampling Technique (SMOTE) to balance the imbalanced data, because the number of records with an "Attrition" value of 0 was greater than the number with a value of 1. This technique is often used to generate synthetic data records for the class with far fewer instances. Attributes such as Gender, Education Field, and Performance Rate were visualized against the Attrition parameter, giving an idea of the relevant features. A comparison of the performance metrics of the classification models provided new insights on improving work ethics [13].
With redundant data, identifying the correct features becomes challenging. XGBoost, an advanced machine learning algorithm, predicts the attrition rate with high accuracy and shorter running times. Jain et al. (2018) recommend XGBoost as a highly robust model that easily handles noisy data in a large dataset; in their study, it achieved an accuracy of about 90% on an online HR dataset [14]. They further suggest that IT organizations adopt it as a top-priority predictive model for identifying the employees who are willing to leave in the near future and their reasons for doing so.
A very common issue that today's IT professionals face is stress disorders. Although organizations offer a pleasant workplace environment and various activities or workshops to relieve this stress, the risk among employees continues to increase. Reddy et al. (2018) implemented machine learning techniques such as Boosting and Decision Trees and determined that data on family history of illness, gender, and health benefits provided by employers play an important role in evaluating this type of risk [15]. The ensemble method gave higher accuracy and precision than Random Forest. General characteristics, such as having peers to work with and the financial needs of the employees, become critical factors for those who have a longer tenure in a business or organization. For the hospitality industry in the USA, Self et al. (2011) conducted a qualitative study to identify the factors that might affect an employee's decision to stay with a company. By analyzing the interview transcripts obtained through an in-depth process, they identified four factors that have a positive effect on employees: Strong Responsibility towards the company, Financial Requirement, Proper Job Description, and Peers at the workplace [16].
One of the challenges that big organizations face is motivating their employees and investing in their further development. Organizations need to understand the importance of investing in employee development and the results it produces. A model proposed by Lee et al. (2003) describes the interrelationship between perceived investment, other job attitudes, and the employee's plan to quit an organization. Factor analysis and exploratory analysis were conducted to assess the dimensionality and to derive insights, respectively. The results suggest that the more resources an employer spends on the development of its employees, the more satisfied they will be at their workplace, which reduces the likelihood of an employee quitting his or her job in that organization [17]. Burnett et al. (2019) propose a few topics in which modern technology or tools can be used to measure both employee engagement and the HRM practices that can improve it [18]. The emotional states of employees affect their engagement at the workplace, either directly or indirectly. They point out that improving engagement requires attention at three levels: the individual, the team, and the organization. They suggest that real-time feedback from employees, together with rigorous research and analysis of the data, will help the HRM department understand the importance of employee engagement in its organization.
To stay in this competitive market, the technology industry needs to evolve continuously in terms of skills and be ready to embrace ever-changing products and services. Employees, too, make themselves proficient in new skills or technologies and search for better job opportunities outside. An analytics-driven approach can help organizations overcome this situation. By combining the historical record of each employee's skills in the HR database with predictive models, Ramamurthy et al. (2015) proposed an approach that evaluates a set of skills [19]. The algorithm in their study provides a list of skills to selected individuals, who fill in their target skills; this helps business leaders find potential candidates and offer them re-skilling opportunities.
Sentiment Analysis can be used to determine the factors affecting employee retention, and organizations can use such models to understand the concepts of People Analytics. A conceptual study was conducted to identify key indicators for assessing human factors. Six important areas, such as performance leadership, employee engagement, learning, workplace dynamics, and overall organizational development, have used sentiment analysis to evaluate various insights. The Enron email corpus was used as a test case to explain how predictions can be made from digital footprints. The study further encourages the use of data mining techniques or models to analyze real-time data and predict human factor patterns more accurately [20]. In addition, interpersonal environment factors often provide insights into employee development in an organization. Liu et al. (2019) concentrated on a state-owned enterprise in China, extracted the related features, and statistically analyzed the correlation between employee development and the interpersonal environment. The results of their predictive model indicate that colleagues and classmates have a great impact on the growth of employees in their workplaces [21].

Research Gap
A review of the existing work shows that many studies relied on secondary data, namely an HR analytics dataset available on an open-source dataset site, to predict employee turnover using data mining techniques. The attributes considered in those studies are generic parameters related to employees who have already left the organization. Discussions with IT professionals today reveal that they still face a set of challenges, both at their workplace and in their personal life, that result in early attrition. These challenges are often overlooked by HR executives in this industry.
Every newly recruited employee might face a different set of challenges while working. Analyzing the data of employees who have already left the organization might therefore not yield the features that apply to new joiners. Instead, we need to interact with current employees frequently, or take their feedback in real time, to obtain actual data on their challenges, such as Recognition, Challenging Work, Scope of Development, Satisfaction Level, Unhealthy Work Ethics, and the impact on them of peers leaving the organization. For this reason, our study uses primary data collected from employees working in various IT companies. We need to concentrate on what employees want for their betterment in the organization, and then predict who might leave within a couple of years, after which they can be offered suitable opportunities. This will not only encourage the employees but also help the organization retain its talent.

RESEARCH METHODOLOGY
This study focuses on employees in a specific age group, from 20 to 39 years old, who are considered to be the major contributors to high turnover in any organization. Surveys were conducted to obtain raw data from the employees; the data were first preprocessed and then analyzed to derive meaningful insights.
A questionnaire consisting of 35 questions was circulated among 200 employees, and the response rate was around 79%. Among the respondents, 83 were male and 75 were female. Of these employees, 80% had working experience of 4 years or less, and the remaining 20% had between more than 4 and 13 years of experience. The survey combined a few open-ended and closed-ended questions, including Likert-scale items and a few dichotomous answer types. This helps us understand the actual perception that employees have of the organization or employer.
The questionnaire was designed on the basis of our detailed review of previous work by other researchers on this topic and our discussions with a few experts from the technology industry. The questions were divided into 5 sections, namely Individual Beliefs, Management and Team, Engagement and Encouragement, Talent Development, and Organisation and Leadership, to capture the employees' overall view of the different verticals of an organization.
We implemented and analyzed a few classification models in RStudio.

Input Data Set
The collected data include 11 attributes for each employee. The target variable "Quit_in_2years" consists of three classes, "Maybe", "No" and "Yes", so our study is a multi-class classification problem. Tab. 1 gives the details of the attributes used in our study:

Data Pre-Processing
Among the 11 attributes, "Gender", "Peers_Leaving" and the target variable "Quit_in_2years" are categorical. To determine the impact of these predictors on the target variable and to evaluate the correlations among the attributes, the categorical fields were converted to numeric values. For example, "Female" was denoted by 1 and "Male" by 2. "Peers_Leaving" had three categories: "Yes" and "No" were assigned 1 and 0, respectively, while "Maybe" was assigned 0.5. Similarly, the values "Maybe", "Yes" and "No" of the target variable were denoted by 1, 2 and 3, respectively.
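The encoding described above can be sketched as follows. This is a minimal Python illustration, not the authors' code (the study was carried out in R); the numeric codes are those stated in the text, while the dictionary and function names and the sample record are hypothetical.

```python
# Numeric codes quoted in the text; the dictionary names are hypothetical.
GENDER_CODES = {"Female": 1, "Male": 2}
PEERS_LEAVING_CODES = {"No": 0.0, "Maybe": 0.5, "Yes": 1.0}
TARGET_CODES = {"Maybe": 1, "Yes": 2, "No": 3}


def encode_record(record):
    """Return a copy of one survey record with categorical fields replaced by numbers."""
    encoded = dict(record)
    encoded["Gender"] = GENDER_CODES[record["Gender"]]
    encoded["Peers_Leaving"] = PEERS_LEAVING_CODES[record["Peers_Leaving"]]
    encoded["Quit_in_2years"] = TARGET_CODES[record["Quit_in_2years"]]
    return encoded


# Illustrative record with made-up values, not taken from the survey data
sample = {"Gender": "Female", "Peers_Leaving": "Maybe", "Quit_in_2years": "Yes"}
encoded_sample = encode_record(sample)
```

Note that these codes impose an ordering on the categories, which matters for correlation-based analysis; the sketch simply reproduces the mapping given in the text.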
Although the dataset contained very few null values, we chose to replace them with the mean of the corresponding column rather than drop the whole entry. To summarize the data and to determine how close the variables are to having a linear relationship with one another, we plotted a correlation matrix. This helps identify the features that have weak and strong dependencies.
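The two steps above, mean imputation and pairwise linear correlation, can be sketched in a few lines. This is an illustrative Python version under the assumption that the correlation matrix holds Pearson coefficients (the text does not name the coefficient); the function names and the sample values are hypothetical.

```python
import math


def impute_with_mean(column):
    """Replace None entries with the mean of the observed values in the column."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only
age = [24, 28, None, 35, 31]
experience = [1, 4, 5, 11, 7]
age_filled = impute_with_mean(age)
r = pearson(age_filled, experience)
```

Computing `pearson` for every pair of attributes gives the entries of a correlation matrix that can then be drawn as a heatmap.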
For example, in Fig. 1, the darkest blue on the scale indicates a positive correlation between attributes, whereas dark red indicates a negative correlation. The figure shows a stronger relationship between "Age" and "Years_of_Experience"; likewise, "Satisfaction_Level" and "Work_Recognition" have a positive correlation, with a coefficient of 0.53. The rest of the variables do not have a strong, consistent relationship with each other. The matrix also contains negative coefficients, which indicate that as the value of one attribute increases, the value of the other tends to decrease.

Feature Selection and Ranking
Feature selection helps in recognizing the relevant features in a dataset, so that the features that play a significant role in predicting an employee's intention to leave in the next 2 years can be distinguished from the others. It also helps in building a reliable model with greater accuracy. Here we use the R package "caret", which reports the importance and relevance of the attributes in our dataset and helps in ranking those features.
For the feature selection process, RFE (Recursive Feature Elimination) was chosen. RFE, which is commonly used with SVM, repeatedly builds a model and removes the features with the lowest weights in order to discover the optimal number of features. The algorithm was configured to explore all possible subsets of the attributes. Next, to rank the features by importance, we used LVQ (Learning Vector Quantization), a form of ANN (Artificial Neural Network) algorithm that allows us to choose how many training instances to retain and learns what those instances should look like.
In Fig. 2, all the features are ranked with respect to the target classes. It can be inferred that, among the 11 features, "Satisfaction_Level", "Salary_Level", "Work_Recognition", "Gender" and "Challenging_Work" are the top 5 factors that have a large effect on the target variable, "Quit_in_2years", whereas "Promotion_in_last_year" and "Peers_Leaving" have the least impact on an employee's decision to leave the organization in the future.

MODELS AND IMPLEMENTATION
In this research, three classification models were used to predict whether a particular employee will leave the organization, based on the challenges he or she currently faces at the workplace. The classifiers implemented are the Random Forest classifier, SVM (Support Vector Machine), and Naive Bayes.
As per our research framework, the preprocessed data were split into a training dataset and a test dataset in a 70:30 ratio. We trained the classification models on the training dataset and then identified the most efficient model by predicting the target value on the test dataset.
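A 70:30 split of this kind can be sketched as below. This is an illustrative Python version, not the R code used in the study; the function name, the fixed seed, and the choice to round the training size down are assumptions made for the example.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into training and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the training size is rounded down to a whole number of rows
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in row identifiers; 158 is the number of responses reported in the study
rows = list(range(158))
train, test = train_test_split(rows)
```

Shuffling before the cut keeps the split independent of the order in which the responses were recorded, and fixing the seed makes the split reproducible.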
Our analysis of the performance metrics of the models is divided into two cases. Case 1 includes all three classes of the target variable. Case 2 includes only two classes of the target variable, "Yes" and "No". The 56 employees who are still undecided about whether they will leave their organization might affect the accuracy of the model. Hence, we removed them in Case 2 and analyzed the performance metrics again.
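The Case 2 filtering step amounts to dropping every record whose target value is "Maybe". A minimal Python sketch follows; the function name and the sample records are hypothetical, while the counts in the last line are those reported in the study.

```python
def drop_undecided(records, target="Quit_in_2years"):
    """Keep only the records whose target value is "Yes" or "No"."""
    return [r for r in records if r[target] in ("Yes", "No")]


# Made-up records for illustration only
records = [
    {"id": 1, "Quit_in_2years": "Yes"},
    {"id": 2, "Quit_in_2years": "Maybe"},
    {"id": 3, "Quit_in_2years": "No"},
]
case2 = drop_undecided(records)

# With the counts reported in the study: 158 responses minus 56 undecided
remaining = 158 - 56
```

With 56 of the 158 respondents removed, Case 2 is left with 102 records before the train and test split.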

Support Vector Machine
SVM is a supervised learning technique that is mainly used for classification but can also be applied to regression problems. In this technique, the data points are separated by a line or a hyperplane, and this division categorizes the dataset into two or more classes. The space between the classes is known as the margin, and it should be as large as possible to reduce classification error. The package "e1071" was used to implement the model.
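For the linearly separable two-class case, the margin described above can be written down explicitly. The following is the standard hard-margin formulation rather than a formula given in this study; the symbols $\mathbf{w}$, $b$, $\mathbf{x}_i$ and $y_i \in \{-1, +1\}$ are introduced here for illustration only.

```latex
\min_{\mathbf{w},\, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1, \qquad i = 1, \dots, n,
\qquad \text{margin} = \frac{2}{\lVert \mathbf{w} \rVert}
```

Minimizing $\lVert \mathbf{w} \rVert$ therefore maximizes the margin. Multi-class problems such as the three-class target here are typically handled by combining several two-class SVMs.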
Tab. 2 gives the Confusion Matrix of SVM for all the classes of the target variable, whereas Tab. 3 gives the Confusion Matrix for only two classes.

Naive Bayes Classifier
This classification method is based on Bayes' Theorem. It assumes that, within a class, a particular feature or attribute is independent of every other feature. The model is easy to build and is particularly useful for large datasets. Despite its simplicity, Naive Bayes can outperform more sophisticated classification models in multi-class prediction. Tab. 4 and Tab. 5 give the Confusion Matrices of this model for the two cases considered in our study.
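The independence assumption described above leads to the standard Naive Bayes decision rule, shown here for reference. The notation is introduced for illustration: $C_k$ denotes a class of the target variable and $x_1, \dots, x_n$ the feature values of one employee.

```latex
P(C_k \mid x_1, \dots, x_n) \;\propto\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k),
\qquad
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```

Because the joint likelihood factorizes into per-feature terms, each $P(x_i \mid C_k)$ can be estimated separately from the training data, which is what makes the model easy to build.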

Random Forest Classifier
This model is an ensemble, tree-based learning technique. Rather than using a single decision tree to classify the data, it trains a set of decision trees, each on a randomly selected subset of the data. The predictions of the individual trees are then put to a vote, and the class with the most votes is selected. Compared with a traditional single decision tree, this method reduces overfitting by averaging the results.
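The voting step described above can be sketched as follows. This illustrates majority voting only, not the "randomForest" package used in the study; the function name and the list of per-tree predictions are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Made-up predictions from five trees for a single employee
votes = ["Yes", "No", "Yes", "Maybe", "Yes"]
predicted = majority_vote(votes)
```

In this sketch a tie is resolved in favor of the class that appears first in the list; a real implementation may break ties differently.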
For the implementation of the classifier, a package called "randomForest" is used in our study. We can observe the values of predicted and actual instances from Tab. 6 and Tab. 7.

RESULTS AND DISCUSSION
To choose the best classifier for this study, we compare established performance metrics, namely Accuracy, Recall, Specificity, Precision, F-Measure, and Area Under Curve (AUC), together with one further metric, Weighted TPR-TNR. To compare the results of multi-class classification, we use the Macro Average Method for Recall (Sensitivity), Specificity, Precision, and F-Measure. This method helps in determining the performance of the overall system. As our dataset is balanced, we use this method to average the values obtained for each class.
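Macro averaging computes each metric per class in a one-vs-rest fashion and then takes the unweighted mean over the classes. The Python sketch below illustrates this under two assumptions: rows of the confusion matrix are actual classes and columns are predicted classes, and the macro F-Measure is the mean of the per-class F-Measures. The function name and the example matrix are hypothetical and are not the values in the paper's tables.

```python
def macro_metrics(confusion):
    """Macro-averaged metrics from a square confusion matrix.

    confusion[i][j] is the number of instances of actual class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    sums = {"recall": 0.0, "specificity": 0.0, "precision": 0.0, "f_measure": 0.0}
    for c in range(k):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(k)) - tp
        tn = total - tp - fn - fp
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        sums["recall"] += recall
        sums["specificity"] += specificity
        sums["precision"] += precision
        sums["f_measure"] += f_measure
    # Unweighted mean over classes, plus overall accuracy from the diagonal
    metrics = {name: value / k for name, value in sums.items()}
    metrics["accuracy"] = sum(confusion[c][c] for c in range(k)) / total
    return metrics


# Made-up 3-class confusion matrix for illustration only
example = [
    [5, 1, 0],
    [2, 6, 1],
    [0, 1, 4],
]
result = macro_metrics(example)
```

Because every class contributes equally to the mean, macro averaging is most informative when the classes are of similar size.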
As per our problem statement, we are mainly concerned with the people leaving the organization, so metrics such as Recall and AUC play a major role alongside the accuracy of the models. From the above graph, we observe that the Random Forest classifier achieved a prediction accuracy of 70.83%, far better than the other classifiers.

Case 1: Comparing the performance metrics for all the Target Variable classes
Apart from the model's accuracy, one must also look at the Recall and Precision values. Tab. 8 shows that, for the Random Forest classifier, the Recall value is higher but the Precision value is slightly lower than that of Naive Bayes. The weighted TPR-TNR value is highest for Random Forest among the three models.
The ROC curve shows the trade-off between sensitivity and specificity; the curve of a perfect classifier has the highest Recall (True Positive Rate) at the lowest False Positive Rate. To summarize the performance of the classifiers, we consider the area under the ROC curve, known as AUC. The higher the AUC, the better the model distinguishes between the classes. With the highest AUC and the lowest False Positive Rate, the Random Forest classifier stands out from the rest of the models.
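For a two-class problem, the area under the ROC curve equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The Python sketch below uses this equivalence; it is an illustration, not the procedure used in the study, and the function name, scores and labels are made up. For the three-class case the text does not state how AUC was obtained, so a one-vs-rest treatment per class would be an assumption.

```python
def auc(scores, labels):
    """AUC for binary labels (1 = positive, 0 = negative) via pairwise comparison."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores for six employees
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
value = auc(scores, labels)
```

In this example, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9.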

Case 2: Comparing the performance metrics for only two Target Variable classes (without "Maybe")
Compared with Case 1, the accuracy of each model increased noticeably after excluding the employees who were still undecided about leaving their organization in the next two years. Both the Naive Bayes classifier and the Random Forest classifier obtained an accuracy of 77.42%. The Recall and F-Measure values of Naive Bayes are greater than those of the other two models, whereas Tab. 9 shows that the Precision and weighted TPR-TNR of the Random Forest classifier are higher than those of SVM and Naive Bayes. Considering the area under the ROC curve, we can see from the above figure that the Random Forest classifier still leads SVM by 6.62% and the Naive Bayes classifier by 1.07%.
So, it can be stated that with the lowest False Positive Rate and highest AUC, Random Forest Classifier proves to be a good model in this case as well.
In addition, as our study focuses on predicting the employees who might leave in the near future, the False Negatives must not be forgotten, that is, the employees who are planning to leave but whom the model fails to identify. We need to identify these False Negatives and find ways to reduce them.
TECHNICAL JOURNAL 15, 1(2021), 51-59

CONCLUSION
As discussed above, employees leaving an organization has a major impact on the development of technology organizations. The challenges or issues that employees face at the workplace or in their personal life often have a great impact on their early attrition from the organization.
In our study, we identified the important factors that affect employees and result in future attrition. For the analysis, data were collected from professionals working in the IT industry. The majority of the attributes we considered did not have a significant correlation with each other. To obtain the top features that have an impact on employees, RFE (Recursive Feature Elimination) was chosen for variable selection. This method helped remove redundant and less important variables and highlighted the features that have more impact on the target attribute. In addition, LVQ (Learning Vector Quantization) was used to rank all the attributes by importance.
Secondly, our goal was to accurately predict, using a few classification models, the employees who are planning to leave the organization in the next 2 years. SVM, Naive Bayes, and Random Forest classifiers were implemented in this study. To analyze the pattern, we divided our analysis into two cases. In the first case, we considered all the classes of the target variable; in the second, we removed the employees who were still undecided about leaving their organization in the near future. From the results, we conclude that the models implemented in Case 2 gave better accuracy than those in Case 1. The most efficient model in our study was the Random Forest classifier, which gave the highest accuracy and Recall value compared with the other models.
Apart from getting a good raise and promotion, today's talent faces other kinds of challenges, which HR executives or project managers need to take care of. In the future, this study can be extended by including attributes such as Scope of Development, Views on Workload Distribution, Career Goal Discussion, and Issues of Unhealthy Work Ethics.
Organizing frequent feedback sessions or one-to-one interviews on the organization's policies can help HR understand employees' expectations. Our study had a limited data size of 158 entries; with more data points and features, higher accuracy may be achievable with these models.

Notice
This paper was presented at IC2ST-2021 - International Conference on Convergence of Smart Technologies. This conference was organized in Pune, India by Aspire Research Foundation, January 9-10, 2021. The paper will not be published anywhere else.