IBM Attrition Analysis

01 / The Problem

Attrition is expensive — and mostly preventable.

Replacing a knowledge worker typically costs 50–200% of their annual salary once recruiting fees, onboarding, and the productivity ramp-up period are counted. With a 16% attrition rate across 1,470 employees, the business was losing roughly 235 people per year, most of whom showed clear warning signals in the data long before they resigned.

The goal: identify which employees are most at risk, why they leave, and how much proactive intervention would be worth to the bottom line.

02 / The Dataset

IBM HR Analytics: 35 columns (34 features, one target).

Structured Data

Demographics, job role, department, salary band, tenure, overtime status, travel frequency, job satisfaction, environment satisfaction, work-life balance score.

Target Variable

Attrition: Yes / No — imbalanced at 84% No / 16% Yes. Required careful handling to avoid a model that just predicts "No" for everyone.

The class imbalance was addressed with stratified train/test splits and precision-recall evaluation rather than accuracy, ensuring the model was actually learning to identify at-risk employees rather than gaming the metric.
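
As a minimal sketch of why this matters, the snippet below builds synthetic labels with the stated 84/16 balance, splits them so each class keeps the same ratio, and scores a baseline that predicts "No" for everyone. The helper `stratified_split`, the 80/20 split, and the seed are illustrative assumptions, not the project's actual code; in scikit-learn the same effect is usually obtained with `train_test_split(..., stratify=y)`.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Synthetic labels mirroring the stated balance: 1,470 rows, about 16% "Yes" (1).
labels = [1] * 235 + [0] * 1235
train_idx, test_idx = stratified_split(labels)

test_labels = [labels[i] for i in test_idx]
positive_rate = sum(test_labels) / len(test_labels)

# Baseline that predicts "No" (0) for everyone: high accuracy, zero recall.
baseline = [0] * len(test_labels)
accuracy = sum(p == y for p, y in zip(baseline, test_labels)) / len(test_labels)
recall = sum(p == 1 and y == 1 for p, y in zip(baseline, test_labels)) / sum(test_labels)

print(f"test positive rate={positive_rate:.3f} accuracy={accuracy:.3f} recall={recall:.3f}")
```

The baseline scores well on accuracy while catching no leavers at all, which is the failure mode that precision-recall evaluation exposes.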

03 / Methodology

Exploratory Data Analysis

Visualized attrition rates segmented by department, job role, salary quartile, tenure band, and overtime status using Matplotlib and Seaborn. Identified that Sales Representatives and Laboratory Technicians had the highest attrition rates (>30%), while overtime workers left at nearly 3× the rate of non-overtime workers.
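
The segment comparison behind those charts is a grouped rate. The sketch below computes the attrition rate per group in plain Python; the column names `OverTime` and `Attrition` follow the dataset, while the helper `attrition_rate_by` and the rows themselves are made up purely for illustration.

```python
from collections import defaultdict

def attrition_rate_by(rows, key):
    """Map each value of `key` to the share of rows with Attrition == 'Yes'."""
    totals = defaultdict(int)
    leavers = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        if row["Attrition"] == "Yes":
            leavers[row[key]] += 1
    return {group: leavers[group] / totals[group] for group in totals}

# Made-up rows for illustration only; not taken from the real dataset.
rows = (
    [{"OverTime": "Yes", "Attrition": "Yes"}] * 3
    + [{"OverTime": "Yes", "Attrition": "No"}] * 7
    + [{"OverTime": "No", "Attrition": "Yes"}] * 2
    + [{"OverTime": "No", "Attrition": "No"}] * 18
)

rates = attrition_rate_by(rows, "OverTime")
ratio = rates["Yes"] / rates["No"]  # how many times higher the overtime rate is
```

The same function works for any categorical column, such as department or job role, by changing the `key` argument.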

Feature Engineering

Created composite features not present in the raw data. An Engagement Index (weighted average of JobSatisfaction, EnvironmentSatisfaction, and RelationshipSatisfaction) and Tenure Bands (0–2 years, 3–5, 6–10, 11+) both proved to be stronger predictors than any single raw feature. SQL was used to validate aggregations and cross-reference against compensation data.
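
The two composite features could be built roughly as follows. The write-up does not give the weights, so equal weights are assumed here, and the top tenure band is taken to start above 10 years so that the bands do not overlap. The function names are hypothetical.

```python
def engagement_index(job, environment, relationship, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted average of three 1-4 satisfaction scores.

    Equal weights are an assumption; the actual weights are not stated.
    """
    scores = (job, environment, relationship)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def tenure_band(years_at_company):
    """Assign a non-overlapping tenure band from whole years of service."""
    if years_at_company <= 2:
        return "0-2"
    if years_at_company <= 5:
        return "3-5"
    if years_at_company <= 10:
        return "6-10"
    return "11+"

# Example: a low-engagement new hire.
index = engagement_index(job=1, environment=2, relationship=2)
band = tenure_band(1)
```

Dividing by the sum of the weights keeps the index on the original 1 to 4 scale even if the weights do not add up to one.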

Model Selection & Evaluation

Trained and compared Logistic Regression, Random Forest, and a Gradient Boosting classifier using Scikit-learn. Evaluated on F1-score and the Precision-Recall curve rather than accuracy. Random Forest achieved the best balance: high recall (catching true at-risk employees) without flooding HR with false positives.
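
For reference, the metrics used in that comparison can be computed directly from predictions. This is a sketch on made-up labels and scores, not the project's evaluation code; `precision_recall_f1` and `pr_points` are hypothetical helper names. Sweeping the decision threshold traces the precision-recall trade-off that the curve visualizes.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def pr_points(y_true, scores, thresholds):
    """(threshold, precision, recall) for each decision threshold."""
    points = []
    for threshold in thresholds:
        y_pred = [1 if s >= threshold else 0 for s in scores]
        precision, recall, _ = precision_recall_f1(y_true, y_pred)
        points.append((threshold, precision, recall))
    return points

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

curve = pr_points(y_true, scores, thresholds=[0.3, 0.5, 0.8])
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Lower thresholds catch more true leavers at the cost of more false alarms, which is the balance the model comparison was judged on.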

Feature Importance & Interpretation

Extracted permutation importance scores to rank which features drove predictions. Presented findings as actionable HR insights rather than ML jargon, translating "OverTime has the highest permutation importance score" into "employees required to work overtime are 2.7× more likely to leave within 12 months."
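
Permutation importance measures how much a model's score drops when one feature's values are shuffled. The sketch below shows the idea in plain Python with a hypothetical rule-based `toy_model` standing in for the fitted classifier and made-up data; accuracy is used as the score only to keep the example short. scikit-learn offers this as `sklearn.inspection.permutation_importance`.

```python
import random

def accuracy(model, X, y):
    """Share of rows where the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, score_fn=accuracy, n_repeats=20, seed=0):
    """Mean drop in score when each column is shuffled, one value per feature."""
    rng = random.Random(seed)
    baseline = score_fn(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - score_fn(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical stand-in for a fitted classifier: uses overtime only."""
    overtime, _distance = row
    return 1 if overtime == 1 else 0

# Made-up rows: [overtime flag, commute distance]; labels: 1 = left.
X = [[1, 5], [0, 20], [1, 3], [0, 8], [0, 15], [1, 9], [0, 2], [0, 11]]
y = [1, 0, 1, 0, 0, 1, 0, 0]

importances = permutation_importance(toy_model, X, y)
```

Because the toy model ignores the second feature, shuffling it never changes the score, so its importance is zero, while shuffling the overtime flag degrades the predictions.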

04 / Key Findings

The data told a consistent story.

Overtime requirement: 2.7× attrition risk

Low Engagement Index (< 2.0 / 4.0): 4× attrition probability

Salary below 3rd quartile for role: 33% higher attrition rate

Tenure 0–2 years (new-hire window): 31% of all departures

Frequent business travel: 24% higher vs. non-travel roles

05 / The ROI Model

Translating predictions into dollars.

Using conservative industry estimates for knowledge-worker replacement costs, I modeled the financial impact of deploying the Random Forest classifier to flag high-risk employees for targeted retention outreach.

At-risk cohort identified: ~62 employees flagged per year

Avg. replacement cost: $42k per departed employee

Retention program cost: $9k per flagged employee

Calculation

62 flagged employees × 70% model precision = 43 true at-risk (all 43 assumed retained)

43 employees × $42k replacement cost avoided = $1,806,000

62 outreach interventions × $9k program cost = −$558,000

Of which 19 false positives × $9k (unnecessary spend, already included in the $558,000) = $171,000

Net annual savings ($1,806,000 − $558,000, using the conservative replacement-cost estimate) = $1,248,000
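
The line items can be checked with a small function. This is a sketch under the section's stated inputs (62 flagged, 70% precision, $42k replacement cost, $9k program cost) plus the assumption that every true at-risk employee who receives outreach is retained; `retention_roi` and its `retention_rate` parameter are hypothetical. Spend on false positives is part of the total outreach spend, so it is reported separately but not subtracted a second time.

```python
def retention_roi(flagged, precision, replacement_cost, program_cost, retention_rate=1.0):
    """Break down the yearly savings from outreach to flagged employees."""
    true_at_risk = round(flagged * precision)
    false_positives = flagged - true_at_risk
    # retention_rate=1.0 assumes every true at-risk employee contacted stays.
    retained = round(true_at_risk * retention_rate)
    savings = retained * replacement_cost
    program_spend = flagged * program_cost          # covers all flagged employees
    wasted_spend = false_positives * program_cost   # subset of program_spend
    return {
        "true_at_risk": true_at_risk,
        "false_positives": false_positives,
        "retained": retained,
        "savings": savings,
        "program_spend": program_spend,
        "wasted_spend": wasted_spend,
        "net": savings - program_spend,
    }

result = retention_roi(flagged=62, precision=0.70,
                       replacement_cost=42_000, program_cost=9_000)
for item, value in result.items():
    print(f"{item}: {value:,}")
```

Lowering `retention_rate` shows how sensitive the net figure is to the share of at-risk employees the program actually keeps.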