The goal of this project is to build a machine learning model that predicts which customers will churn from the company. Data analysis and feature engineering are carried out before the model is developed.
The Telco customer churn dataset contains information about 7,043 customers of a fictional telecom company in California in the third quarter. It includes information about which customers have left, stayed, or signed up for home phone and internet services.
- CustomerId: Customer ID
- Gender: Gender
- SeniorCitizen: Whether the customer is a senior citizen (1, 0)
- Partner: Whether the customer has a partner (Yes, No)
- Dependents: Whether the customer has dependents (Yes, No)
- Tenure: Number of months the customer has stayed with the company
- PhoneService: Whether the customer has phone service (Yes, No)
- MultipleLines: Whether the customer has multiple lines (Yes, No, No phone service)
- InternetService: Customer's internet service provider (DSL, Fiber optic, No)
- OnlineSecurity: Whether the customer has online security (Yes, No, No internet service)
- OnlineBackup: Whether the customer has online backup (Yes, No, No internet service)
- DeviceProtection: Whether the customer has device protection (Yes, No, No internet service)
- TechSupport: Whether the customer has technical support (Yes, No, No internet service)
- StreamingTV: Whether the customer has streaming TV (Yes, No, No internet service)
- StreamingMovies: Whether the customer has streaming movies (Yes, No, No internet service)
- Contract: Customer's contract term (Month-to-month, One year, Two years)
- PaperlessBilling: Whether the customer has paperless billing (Yes, No)
- PaymentMethod: Customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
- MonthlyCharges: Monthly charges collected from the customer
- TotalCharges: Total charges collected from the customer
- Churn: Whether the customer has churned (Yes, No); indicates whether the customer left in the last month or quarter
- Data is read from the provided CSV file.
- Data types and missing values are checked and corrected.
- Encoding is applied to binary categorical variables.
- Standardization is performed for numeric variables.
- A general overview of the dataset is provided.
- Numeric and categorical variables are identified and analyzed.
- Target variable analysis is conducted, including mean values by categorical variables and numeric variables by the target variable.
- Outlier analysis is performed.
- Missing observation analysis is conducted.
- Correlation analysis is done.
- Missing values and outliers are handled.
- Encoding operations are performed for categorical variables.
- Standardization is applied to numeric variables.
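The preparation steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the helper names (`cap_outliers`, `preprocess`), the IQR-based capping rule, the median fill, and the toy rows are all assumptions made for the example, and the sketch assumes `TotalCharges` arrives as text with blank entries.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

NUM_COLS = ["Tenure", "MonthlyCharges", "TotalCharges"]

def cap_outliers(series, low_q=0.25, high_q=0.75):
    """Clip values to [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (assumed rule, not from the source)."""
    q1, q3 = series.quantile(low_q), series.quantile(high_q)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

def preprocess(df):
    """Hypothetical helper: fix types, fill gaps, encode, and standardize."""
    df = df.copy()
    # Blank strings in TotalCharges become NaN, then are filled with the median.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())
    # Cap extreme values in the numeric columns.
    for col in NUM_COLS:
        df[col] = cap_outliers(df[col].astype(float))
    # Split the categorical columns into binary and multi-class ones.
    cat_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    binary_cols = [c for c in cat_cols if df[c].nunique() == 2]
    multi_cols = [c for c in cat_cols if df[c].nunique() > 2]
    # Binary columns become 0/1 codes (categories are ordered alphabetically).
    for col in binary_cols:
        df[col] = df[col].astype("category").cat.codes
    # Multi-class columns are one-hot encoded, dropping the first level.
    df = pd.get_dummies(df, columns=multi_cols, drop_first=True, dtype=int)
    # Standardize the numeric columns to zero mean and unit variance.
    df[NUM_COLS] = StandardScaler().fit_transform(df[NUM_COLS])
    return df

# Made-up toy rows for illustration; the real data would be read from the CSV file.
raw = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "Partner": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "InternetService": ["DSL", "Fiber optic", "No", "DSL", "Fiber optic", "No"],
    "Tenure": [1, 34, 2, 45, 8, 22],
    "MonthlyCharges": [29.85, 56.95, 20.05, 42.30, 99.65, 19.95],
    "TotalCharges": ["29.85", "1889.5", " ", "1840.75", "820.5", "435.6"],
    "Churn": ["No", "No", "Yes", "No", "Yes", "No"],
})
processed = preprocess(raw)
print(processed.head())
```

An identifier column such as CustomerId would need to be dropped before calling a helper like this, since it would otherwise be one-hot encoded.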
Several machine learning models are built and evaluated using cross-validation:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Tree
- Random Forest
- CatBoost
- LightGBM
- XGBoost
The models are tuned using hyperparameter optimization and compared on evaluation metrics such as accuracy, precision, recall, and F1 score, along with the confusion matrix.
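As an illustration of the tuning step, here is a minimal sketch using scikit-learn's `GridSearchCV`. The search strategy is not stated here, so grid search is an assumption; the synthetic data, the decision tree, and the parameter grid are likewise placeholders chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared churn features (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.75, 0.25], random_state=42
)

# Hypothetical grid; real ranges would depend on the model being tuned.
param_grid = {"max_depth": [2, 4, 8], "min_samples_split": [2, 10, 20]}

# Stratified folds keep the churn ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1",  # optimize the F1 score of the positive class
    cv=cv,
)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
print(best_params, round(best_f1, 3))
```

The same pattern should apply to the other classifiers by swapping the estimator and its grid.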
Each model's results are stored in a dataframe, and the best model is selected by F1 score.
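One way to collect cross-validated scores in a dataframe and pick a model by F1 is sketched below. It uses only scikit-learn estimators on synthetic data so that it runs without extra dependencies; CatBoost, LightGBM, and XGBoost classifiers could be added to the `models` dictionary in the same way. The variable names and the 5-fold setting are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared churn features (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.75, 0.25], random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
}
scoring = ["accuracy", "precision", "recall", "f1"]

rows = []
for name, model in models.items():
    # cross_validate returns one array of fold scores per metric, keyed "test_<metric>".
    scores = cross_validate(model, X, y, cv=5, scoring=scoring)
    rows.append({"model": name, **{m: scores[f"test_{m}"].mean() for m in scoring}})

# One row per model, sorted so the highest mean F1 comes first.
results = (
    pd.DataFrame(rows).sort_values("f1", ascending=False).reset_index(drop=True)
)
best_name = results.loc[0, "model"]
print(results)
print("Best by F1:", best_name)
```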
The best-performing model (XGBoost) is saved to a file named "best_model.pkl" using pickle for future use.
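Saving and reloading a fitted model with pickle might look like the sketch below. A scikit-learn logistic regression on synthetic data stands in for the XGBoost model, and the file is written to a temporary directory; both choices are assumptions made to keep the example self-contained.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model; the project saves its best-performing model instead.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Write the fitted model to best_model.pkl inside a temporary directory.
model_path = Path(tempfile.mkdtemp()) / "best_model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and check that predictions are unchanged.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a pickled model generally needs to be loaded with the same library versions that were used to save it.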