Analyzed 1,470 IBM employee records to surface the root causes of voluntary attrition — then translated those findings into a quantified ROI model demonstrating $181k in annual cost savings from targeted retention interventions.
01 / The Problem
Replacing a knowledge worker typically costs 50–200% of their annual salary once recruiting fees, onboarding, and the productivity ramp-up period are counted. With a 16% attrition rate across 1,470 employees, the business was losing roughly 235 people per year, most of whom showed clear warning signals in the data long before they resigned.
The goal: identify which employees are most at risk, why they leave, and how much proactive intervention would be worth to the bottom line.
02 / The Dataset
Structured Data
Demographics, job role, department, salary band, tenure, overtime status, travel frequency, job satisfaction, environment satisfaction, work-life balance score.
Target Variable
Attrition: Yes / No — imbalanced at 84% No / 16% Yes. Required careful handling to avoid a model that just predicts "No" for everyone.
The class imbalance was addressed with stratified train/test splits and precision-recall evaluation rather than accuracy, ensuring the model was actually learning to identify at-risk employees rather than gaming the metric.
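A minimal sketch of why stratification and recall matter here, assuming scikit-learn and NumPy. The labels and features are placeholders built to match the 84% / 16% split, not the real records, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder labels: 1 = left, 0 = stayed (about 16% / 84% of 1,470 rows).
y = np.array([1] * 235 + [0] * 1235)
X = rng.normal(size=(len(y), 5))  # placeholder features, pure noise

# stratify=y keeps the leaver share the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A "model" that predicts "No" for everyone looks good on accuracy
# but never catches a single leaver.
always_no = np.zeros_like(y_test)
naive_accuracy = accuracy_score(y_test, always_no)
naive_recall = recall_score(y_test, always_no)

print(f"leaver share train={y_train.mean():.3f} test={y_test.mean():.3f}")
print(f"always-No accuracy={naive_accuracy:.3f} recall={naive_recall:.3f}")
```

The always-"No" baseline scores around 84% accuracy with zero recall, which is the failure mode that precision-recall evaluation exposes.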
03 / Methodology
Visualized attrition rates segmented by department, job role, salary quartile, tenure band, and overtime status using Matplotlib and Seaborn. Identified that Sales Representatives and Laboratory Technicians had the highest attrition rates (>30%), while overtime workers left at nearly 3× the rate of non-overtime workers.
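The segment comparison above boils down to a grouped mean. The sketch below assumes pandas and uses a tiny made-up table, so its numbers are illustrative only; the `left` helper column is a hypothetical name.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the real dataset.
df = pd.DataFrame({
    "OverTime":  ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"],
    "Attrition": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
})

# Encode the target as 0/1 so that a group mean is an attrition rate.
df["left"] = (df["Attrition"] == "Yes").astype(int)

rate_by_overtime = df.groupby("OverTime")["left"].mean()

# Relative rate: how many times more often overtime workers leave.
overtime_ratio = rate_by_overtime["Yes"] / rate_by_overtime["No"]

print(rate_by_overtime)
print(f"overtime vs. non-overtime ratio: {overtime_ratio:.2f}")
```

Swapping `"OverTime"` for a department, job role, or tenure-band column gives the other cuts, and the resulting series can be passed directly to a bar plot.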
Created composite features not present in the raw data. An Engagement Index (weighted average of JobSatisfaction, EnvironmentSatisfaction, and RelationshipSatisfaction) and Tenure Bands (0–2 years, 3–5, 6–10, and over 10) both proved to be stronger predictors than any single raw feature. SQL was used to validate aggregations and cross-reference against compensation data.
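A sketch of the two composite features, assuming pandas. The actual weights are not given, so equal weights are used as a stand-in. `YearsAtCompany` is assumed to be the tenure column, and exactly 10 years is assigned to the 6–10 band.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "JobSatisfaction":          [1, 4, 3, 2],
    "EnvironmentSatisfaction":  [2, 4, 3, 1],
    "RelationshipSatisfaction": [3, 4, 3, 3],
    "YearsAtCompany":           [1, 4, 10, 15],  # assumed tenure column
})

# Assumption: equal weights; the real weighting is not stated.
weights = {
    "JobSatisfaction": 1 / 3,
    "EnvironmentSatisfaction": 1 / 3,
    "RelationshipSatisfaction": 1 / 3,
}
df["EngagementIndex"] = sum(df[col] * w for col, w in weights.items())

# Right-closed bins: (-1, 2], (2, 5], (5, 10], (10, inf).
df["TenureBand"] = pd.cut(
    df["YearsAtCompany"],
    bins=[-1, 2, 5, 10, np.inf],
    labels=["0-2", "3-5", "6-10", "10+"],
)

print(df[["EngagementIndex", "TenureBand"]])
```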
Trained and compared Logistic Regression, Random Forest, and a Gradient Boosting classifier using Scikit-learn. Evaluated on F1-score and the Precision-Recall curve rather than accuracy. Random Forest achieved the best balance: high recall (catching true at-risk employees) without flooding HR with false positives.
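The comparison can be sketched as a loop over the three estimators. It assumes scikit-learn and synthetic data from `make_classification`, with default-ish hyperparameters rather than the tuned settings, so the scores say nothing about the real results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same 84% / 16% class balance.
X, y = make_classification(
    n_samples=1470, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # probability of leaving
    results[name] = {
        # F1 at the default 0.5 threshold.
        "f1": f1_score(y_test, model.predict(X_test)),
        # Average precision summarises the precision-recall curve.
        "avg_precision": average_precision_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Average precision is threshold-free, so it complements F1, which depends on where the decision cut-off is placed.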
Extracted permutation importance scores to rank which features drove predictions. Presented findings as actionable HR insights rather than ML jargon, translating "OverTime has the highest permutation importance score" into "employees required to work overtime are 2.7× more likely to leave within 12 months."
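A minimal sketch of the ranking step using scikit-learn's `permutation_importance`. The data is synthetic: the leave probabilities (0.7 with overtime, 0.05 without) and the two noise columns are invented so that the expected ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000

X = pd.DataFrame({
    "OverTime": rng.integers(0, 2, size=n),  # 1 = works overtime
    "noise_a": rng.normal(size=n),           # unrelated to the target
    "noise_b": rng.normal(size=n),
})
# Invented rule: overtime workers leave far more often than others.
p_leave = np.where(X["OverTime"] == 1, 0.7, 0.05)
y = (rng.random(n) < p_leave).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure how much
# the score drops; a bigger drop means a more important feature.
result = permutation_importance(
    model, X_test, y_test,
    scoring="average_precision", n_repeats=10, random_state=0,
)
ranking = pd.Series(result.importances_mean, index=X.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

The scores rank features but are not multipliers. A statement such as "2.7× more likely to leave" comes from comparing attrition rates between groups, not from the importance value itself.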
04 / Key Findings
05 / The ROI Model
Using conservative industry estimates for knowledge-worker replacement costs, I modeled the financial impact of deploying the Random Forest classifier to flag high-risk employees for targeted retention outreach.
At-risk cohort identified: ~62 employees flagged per year
Avg. replacement cost: $42k per departed employee
Retention program cost: $9k per retained employee
Calculation
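A hedged sketch of the general shape of such a net-savings calculation. The share of flagged employees who are actually retained is not stated in the text, so it is left as a parameter. The function name and the round numbers in the example are hypothetical and do not reproduce the headline figure.

```python
def net_annual_savings(
    flagged: float,
    retention_rate: float,
    replacement_cost: float,
    program_cost_per_retained: float,
) -> float:
    """Avoided replacement cost minus retention spend, per year.

    Assumes program cost is incurred only for employees who stay,
    matching the "per retained employee" framing.
    """
    if not 0.0 <= retention_rate <= 1.0:
        raise ValueError("retention_rate must be between 0 and 1")
    retained = flagged * retention_rate
    return retained * (replacement_cost - program_cost_per_retained)


# Illustrative round numbers only (hypothetical inputs).
example = net_annual_savings(
    flagged=10,
    retention_rate=0.5,
    replacement_cost=40_000,
    program_cost_per_retained=10_000,
)
print(f"net savings: ${example:,.0f}")
```

The cohort size, replacement cost, and program cost map onto `flagged`, `replacement_cost`, and `program_cost_per_retained`; the result scales linearly with the assumed retention rate.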
Tools & Technologies
© 2026 Owen Duffy