Please use this identifier to cite or link to this item: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21806
Title: EVALUATION OF MACHINE LEARNING MODELS FOR CUSTOMER CHURN PREDICTION FROM IMBALANCED DATASET
Authors: KUMAR, AYUSH
Keywords: MACHINE LEARNING MODELS
CUSTOMER CHURN PREDICTION
IMBALANCED DATASET
XGBoost
KNN
Issue Date: May-2025
Series/Report no.: TD-8017
Abstract: Customer churn prediction remains a critical challenge for businesses, particularly in industries where retaining customers is more cost-effective than acquiring new ones. This study evaluates machine learning (ML) models for churn prediction on imbalanced datasets, addressing the inherent bias toward majority-class instances that plagues traditional approaches. Six classifiers—XGBoost, LightGBM, Logistic Regression, K-Nearest Neighbours (KNN), AdaBoost, and Naive Bayes—are systematically assessed alongside the Synthetic Minority Oversampling Technique (SMOTE) to mitigate class imbalance. The research employs a publicly available telecom dataset with a 26.5% churn rate, pre-processed to handle missing values, encode categorical variables, and engineer temporal features. SMOTE is applied to balance the training data, while evaluation prioritizes recall-oriented metrics (F2-score, AUC-PR, Matthews Correlation Coefficient) to reflect real-world business needs. Results demonstrate that tree-based ensemble models (XGBoost, LightGBM) outperform the other classifiers, achieving AUC-PR scores of 0.78 and 0.75, respectively, alongside F2-scores of 0.68 and 0.65. These models effectively leverage hierarchical splitting to capture nonlinear relationships, such as the correlation between short-term contracts and churn risk. SMOTE improves minority-class recall by 18–22% across all models but introduces precision trade-offs, particularly for KNN and Naive Bayes, which struggle to integrate synthetic samples. Logistic Regression, while interpretable, shows limited robustness to imbalance (AUC-PR: 0.62), whereas AdaBoost's iterative error correction improves stability but lags behind the gradient-boosted methods. This study highlights SMOTE's critical role in correcting dataset skew while emphasizing the importance of metric selection: models optimized for accuracy (e.g., Naive Bayes at 89%) fail to account for the business costs of false negatives. Practical insights include actionable retention strategies, such as targeting high-risk customers identified through feature importance analysis (e.g., tenure, monthly charges). This work contributes a framework for imbalanced churn prediction, advocating XGBoost/LightGBM with SMOTE in scenarios requiring high recall and model interpretability. Future directions include exploring dynamic resampling and ethical AI audits to address demographic biases in feature engineering.
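Note: the abstract's pipeline (SMOTE applied only to the training split, a gradient-boosted classifier, and recall-oriented scoring via F2, AUC-PR, and MCC) can be illustrated with a minimal Python sketch. This assumes scikit-learn, imbalanced-learn, and xgboost are installed; the synthetic dataset and all parameter values below are illustrative stand-ins, not the thesis's actual data or configuration.

# Minimal sketch of the evaluation pipeline described in the abstract:
# SMOTE on the training split only, an XGBoost classifier, and
# recall-oriented metrics (F2-score, AUC-PR, MCC).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score, average_precision_score, matthews_corrcoef
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Synthetic stand-in with roughly the 26.5% minority (churn) rate cited above.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.735], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample ONLY the training data so the test set keeps its real skew.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

model = XGBClassifier(eval_metric="logloss", random_state=42)
model.fit(X_res, y_res)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]

# Recall-weighted and imbalance-aware metrics, as prioritized in the study.
print(f"F2-score: {fbeta_score(y_test, y_pred, beta=2):.3f}")
print(f"AUC-PR:   {average_precision_score(y_test, y_proba):.3f}")
print(f"MCC:      {matthews_corrcoef(y_test, y_pred):.3f}")

Keeping the test set at its natural class ratio is the key design choice here: oversampling before the split would leak synthetic minority samples into evaluation and inflate the reported recall.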
URI: http://dspace.dtu.ac.in:8080/jspui/handle/repository/21806
Appears in Collections:M.E./M.Tech. Information Technology

Files in This Item:
File: Ayush Kumar m.Tech..pdf
Size: 720.06 kB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.