Analyzing & Predicting Customer Churn in the Telecom Industry Using Machine Learning Models

Churn (the loss of customers to competitors) is a problem for telecom companies because acquiring a new customer is expensive, so companies want to retain their existing customers.

Case Study:

For a Telecom Company “X”, churn is a problem for the business, and churn rates have been increasing steadily over the last year. The company wants to predict the propensity of its customers to churn, which would help it determine the right engagement or intervention plan. The company also wants to find out the factors influencing Customer Churn and to target those specific factors with offers more in line with other service providers, which could help it retain customers.

Objectives:

2. Highlighting the main variables/factors influencing Customer Churn.

3. Using various ML algorithms to build prediction models and evaluating the accuracy and performance of these models.

4. Finding out the best model and providing a final conclusion.

Model Building steps:

Reading the Data:

Description of the Variables:

Data Visualization & Analysis:

Main observations from Visualization Statistics:

2. People with fibre optic internet churned much more than people with DSL or no internet at all.

3. People without online backup, device protection, and online security churn fairly frequently.

4. Those without tech support tend to churn more frequently than those with tech support.

5. Those with Paperless Billing tend to churn more frequently than those without Paperless Billing.

6. Those with a month-to-month contract tend to churn more frequently than those with one- and two-year contracts.

7. Those paying by Electronic Check tend to churn more frequently than those using the other Payment Methods.

8. All of the categorical variables seem to have a reasonably broad distribution; therefore, all of them will be kept for further analysis.

Average total Charges for Churn and Not Churn Customers:

Building Predictive Models using ANOVA, Logistic Regression, Decision Tree and Random Forest.

  • The deviance table shows the drop in residual deviance as each variable is added to the model one at a time.
  • Adding Tenure, Internet Service, Contract and Total Charges significantly reduces the residual deviance.
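To make the idea of a "drop in deviance" concrete, here is a minimal Python sketch. It is not the code behind the deviance table above: the data is a made-up toy sample, and the function names are my own for illustration. It relies on the fact that a logistic regression with a single categorical predictor fits each group's observed churn rate, so the residual deviance can be computed directly without an optimizer.

```python
import math

def binomial_deviance(y, p):
    """Return -2 * log-likelihood of 0/1 outcomes y under probabilities p."""
    log_lik = 0.0
    for yi, pi in zip(y, p):
        # Clip so that a group with 0% or 100% churn does not hit log(0).
        pi = min(max(pi, 1e-12), 1 - 1e-12)
        log_lik += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -2.0 * log_lik

def deviance_drop(y, groups):
    """Null deviance, residual deviance and the drop from adding one factor."""
    overall_rate = sum(y) / len(y)
    null_dev = binomial_deviance(y, [overall_rate] * len(y))
    # With one categorical predictor, the fitted probability of each row
    # is simply the churn rate of that row's group.
    counts = {}
    for yi, g in zip(y, groups):
        n, churned = counts.get(g, (0, 0))
        counts[g] = (n + 1, churned + yi)
    rates = {g: churned / n for g, (n, churned) in counts.items()}
    resid_dev = binomial_deviance(y, [rates[g] for g in groups])
    return null_dev, resid_dev, null_dev - resid_dev

# Hypothetical toy data (NOT the telecom dataset): 1 = churned, 0 = stayed.
groups = ["month-to-month"] * 10 + ["one year"] * 10 + ["two year"] * 10
y = [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8 + [1] * 1 + [0] * 9

null_dev, resid_dev, drop = deviance_drop(y, groups)
# Three contract levels add 2 parameters, so the drop is compared with a
# chi-square distribution with 2 degrees of freedom, whose upper tail
# probability has the closed form exp(-x / 2).
p_value = math.exp(-drop / 2)
print(f"null={null_dev:.2f} residual={resid_dev:.2f} drop={drop:.2f} p={p_value:.3f}")
```

A large drop with a small p-value is what "significantly reduces the residual deviance" means for a variable; in a sequential table each variable's drop is measured relative to the model that already contains the variables added before it.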

Model 1: Logistic Regression:

Training Dataset: 5283 obs with 20 variables.

Test Dataset: 1760 obs with 20 variables.
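The two datasets add up to 7043 observations, which is roughly a 75/25 split. As a rough illustration only, a random split of that size could be sketched in Python as below. The function name, the seed and the choice to round the training size up are my assumptions; the split used for the models in this article may have been done differently (for example, stratified by churn).

```python
import math
import random

def train_test_split_indices(n_rows, train_fraction=0.75, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Assumption: the training size is rounded up to a whole row.
    n_train = math.ceil(train_fraction * n_rows)
    return indices[:n_train], indices[n_train:]

total_rows = 5283 + 1760  # training + test observations reported above
train_idx, test_idx = train_test_split_indices(total_rows)
print(len(train_idx), len(test_idx))  # 5283 1760
```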

Logistic Model Summary:

Model Evaluation using Confusion Matrix:

Observations — Logistic Regression model:

1. Tenure, Contract, Paperless Billing and Total Charges are the significant variables and can be identified as the best predictors of Customer Churn.

2. AIC: 4441.7 when all variables are taken as input, and 4476.8 when only the significant variables are taken as input. Hence the AIC does not change much even if we use only the significant variables.

3. Model Accuracy: 80.7%

4. Sensitivity (true positive rate, recall): 54.4%. The model correctly identified 54.4% of the customers who actually churned.

5. Specificity (true negative rate): 90.3%. The model correctly identified 90.3% of the customers who did not churn.

6. Precision (positive predictive value): 66.8%. Of the customers the model predicted to churn, 66.8% actually churned.
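For readers who want to see how these four numbers come out of a confusion matrix, here is a small Python sketch. The counts are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the confusion matrix of the model above, and the function name is my own.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four metrics used in this article from confusion-matrix counts.

    tp: churners predicted as churn      fn: churners predicted as no churn
    fp: non-churners predicted as churn  tn: non-churners predicted as no churn
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,      # share of all predictions that are right
        "sensitivity": tp / (tp + fn),      # share of actual churners that were caught
        "specificity": tn / (tn + fp),      # share of actual non-churners that were cleared
        "precision": tp / (tp + fp),        # share of churn predictions that are right
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=60, fn=40, fp=30, tn=270)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Note that accuracy can stay high even when sensitivity is modest, because customers who do not churn make up most of the test set.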

Model 2 — Decision Tree:

Interpretation of Decision Tree:

  • The contract variable is the most important. Customers with month-to-month contracts are more likely to churn.
  • Customers with DSL internet service are less likely to churn.
  • Customers who have stayed longer than 15 months are less likely to churn.
  • Customers without internet service are more likely to churn.

Model evaluation — Decision Tree Confusion Matrix:

Observations — Decision Tree Model:

1. Model Accuracy: 79.8% (approximately 80%)

2. Sensitivity (true positive rate, recall): 54.6%. The model correctly identified 54.6% of the customers who actually churned.

3. Specificity (true negative rate): 88.9%. The model correctly identified 88.9% of the customers who did not churn.

4. Precision (positive predictive value): 64.07%. Of the customers the model predicted to churn, 64.07% actually churned.

5. With an accuracy of 80.7%, the Logistic Regression model is slightly better than the Decision Tree.

Model 3 — Random Forest

  • The black line represents the error rate for the entire sample, the red line represents the error rate for Prediction “No” (not churned) and the green line represents the error rate for Prediction “Yes” (churned).
  • The overall error rate is around 20%.
  • As the number of trees increases, the error rate decreases, and then becomes almost constant after about 100 trees.

Model evaluation — Random Forest Confusion Matrix

Observations — Random forest Model:

1. Model Accuracy: 80.68% (approximately 80.7%)

2. Sensitivity (true positive rate, recall): 51.4%. The model correctly identified 51.4% of the customers who actually churned.

3. Specificity (true negative rate): 91.3%. The model correctly identified 91.3% of the customers who did not churn.

4. Precision (positive predictive value): 67.99% (approximately 68%). Of the customers the model predicted to churn, about 68% actually churned.

5. The Logistic Regression model, with an accuracy of 80.74%, performs almost the same as the Random Forest model.

Model Summary:

  1. The precision (percentage of churn predictions that are correct) of the Random Forest model (68%) is slightly better than that of Logistic Regression (66.8%).

3. Based on the above precision, the Random Forest model is the best-fit model.
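The comparison can also be laid out in a few lines of Python. The numbers below are the metrics reported in the three Observations sections of this article; the dictionary layout and the helper name are just my own way of organising them for illustration.

```python
# Metrics (in percent) as reported in the Observations sections above.
reported = {
    "Logistic Regression": {"accuracy": 80.7, "sensitivity": 54.4,
                            "specificity": 90.3, "precision": 66.8},
    "Decision Tree": {"accuracy": 79.8, "sensitivity": 54.6,
                      "specificity": 88.9, "precision": 64.07},
    "Random Forest": {"accuracy": 80.68, "sensitivity": 51.4,
                      "specificity": 91.3, "precision": 67.99},
}

def best_model(results, metric):
    """Return the name of the model with the highest value for one metric."""
    return max(results, key=lambda name: results[name][metric])

for metric in ("accuracy", "sensitivity", "specificity", "precision"):
    print(f"{metric:12s} -> {best_model(reported, metric)}")
```

The ranking depends on the metric you pick: precision and specificity favour the Random Forest, while sensitivity is marginally higher for the Decision Tree. The conclusion above holds for the precision criterion used here.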

Variable importance measured by Random Forest Model.

  1. As per the plot, the Random Forest model has identified Tenure, Contract and Total Charges as important predictors of churn.
  2. Internet Service, Paperless Billing and Payment Method do not appear as important in this model, while the Monthly Charges variable is now highly emphasized.

Data Visualization based on Random Forest Model, Churn Predictors: Contract, Tenure and Total Charges

Final Summary:

Customers expected to Churn:

  1. Customers with month-to-month contracts.
  2. Customers without internet services and with fibre optic internet services.
  3. Customers without online backup, device protection, online security and tech support.
  4. Customers with Paperless Billing and Electronic Check Payment method.

Customers expected to Not Churn:

  1. Customers who have been with the company for a longer period.
  2. The average Total Charges for customers who did not churn is approx. 2553 AED, which is higher than the approx. 1532 AED for churned customers.
  3. Customers with DSL Internet Services.
  4. Customers with multiple lines.

You can find the code in my GitHub repository: https://github.com/Reeka611/Churn-Prediction

Thank you :)

I am a Freelance Data Scientist based in Dubai, UAE. My previous job was in a Telecommunication Company in Dubai, UAE as a Supervisor in Sales Operations.