Product back orders prediction
Welcome to the Product-Backorders wiki!
Product Back orders
- What is a backorder? A customer order that has not been fulfilled. A backorder generally indicates that customer demand for a product or service exceeds a company’s capacity to supply it. Total backorders, also known as backlog, may be expressed in terms of units or dollar amount.
- A product backorder may be the result of strong sales performance (e.g. the product is in such high demand that production cannot keep up with sales). However, backorders can upset consumers and lead to canceled orders and decreased customer loyalty. Companies want to avoid backorders, but also want to avoid overstocking every product (which leads to higher inventory costs).
- Machine learning can identify patterns related to backorders before customers order. Production can then adjust to minimize delays, while customer service can provide accurate dates to keep customers informed and happy. The predictive analytics approach gets as much product as possible into the hands of customers at the lowest cost to the organization.
DATA
The data comes from the Kaggle dataset “Can You Predict Product Backorders?”. The data file contains the historical data for the 8 weeks prior to the week we are trying to predict. The data were taken as weekly snapshots at the start of each week. The target (or response) is the went_on_backorder variable. To model and predict the target, we’ll use the other features. The columns are:
- sku – Random ID for the product
- national_inv – Current inventory level for the part
- lead_time – Transit time for product (if available)
- in_transit_qty – Amount of product in transit from source
- forecast_3_month – Forecast sales for the next 3 months
- forecast_6_month – Forecast sales for the next 6 months
- forecast_9_month – Forecast sales for the next 9 months
- sales_1_month – Sales quantity for the prior 1 month time period
- sales_3_month – Sales quantity for the prior 3 month time period
- sales_6_month – Sales quantity for the prior 6 month time period
- sales_9_month – Sales quantity for the prior 9 month time period
- min_bank – Minimum recommended amount to stock
- potential_issue – Source issue for part identified
- pieces_past_due – Parts overdue from source
- perf_6_month_avg – Source performance for prior 6 month period
- perf_12_month_avg – Source performance for prior 12 month period
- local_bo_qty – Amount of stock orders overdue
- deck_risk – Part risk flag
- oe_constraint – Part risk flag
- ppap_risk – Part risk flag
- stop_auto_buy – Part risk flag
- rev_stop – Part risk flag
- went_on_backorder – Product actually went on backorder. This is the target value.
UNDERSTANDING
- Class imbalance (only 0.67% of products went on backorder)
- lead_time has NA values
- Last Row has all NA values
- Remove sku (all unique values)
- perf_6_month_avg and perf_12_month_avg encode missing data as -99.
- There are SKUs for which forecast and sales are 0 and the target class (went_on_backorder) is also “No”.
Near Zero Variance attributes:
- in_transit_qty (Numeric)
- potential_issue (Categorical)
- pieces_past_due (Numeric)
- local_bo_qty (Numeric)
- oe_constraint (Categorical)
- stop_auto_buy (Categorical)
- rev_stop (Categorical)
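The wiki does not say how near-zero variance was determined. As a rough illustration, one common heuristic flags a column when its most frequent value dominates the second most frequent one and the column has few distinct values. The cutoffs below (a 95/5 frequency ratio and 10% unique values) and the function name are assumptions chosen for this sketch, not the project's settings.

```python
from collections import Counter


def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Return True if a column looks near-zero-variance.

    Heuristic (assumed for illustration):
    - the most common value is more than `freq_cut` times as frequent
      as the second most common value, and
    - distinct values make up less than `unique_cut` percent of the rows.
    """
    counts = Counter(values).most_common()
    if len(counts) <= 1:
        return True  # empty or constant column
    freq_ratio = counts[0][1] / counts[1][1]
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut


# Hypothetical toy columns, not taken from the dataset.
potential_issue = ["No"] * 999 + ["Yes"]
national_inv = list(range(1000))
print(near_zero_variance(potential_issue))  # True
print(near_zero_variance(national_inv))     # False
```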
PREPROCESSING
- Remove Null/Duplicate Rows
- Remove sku (all unique values)
- Impute NAs in lead_time with the mean
- Normalize the quantity columns
- Convert categorical binary attributes to numeric attributes (Yes/No to 1/0)
- Remove the correlated columns based on corrplot
- Drop unused levels (when creating a subset of a dataframe)
- Remove the SKUs for which forecast and sales are 0 and the target class (went_on_backorder) is also “No”. Remaining number of records = 1015940
- Feature selection can be done using a random forest
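A few of the steps above can be sketched on plain Python rows. This is a minimal illustration, assuming each record is a dict keyed by the column names listed in the DATA section and that missing values appear as None; the helper name `preprocess` and the toy rows are hypothetical, and normalization, correlation filtering and feature selection are left out.

```python
from statistics import mean

# Yes/No columns from the feature list, including the target.
BINARY_COLUMNS = ["potential_issue", "deck_risk", "oe_constraint",
                  "ppap_risk", "stop_auto_buy", "rev_stop",
                  "went_on_backorder"]


def preprocess(rows):
    """Apply a subset of the preprocessing steps to a list of dict rows."""
    # Remove rows in which every field is missing (e.g. the all-NA last row).
    rows = [dict(r) for r in rows if any(v is not None for v in r.values())]

    # Remove exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    rows = unique

    # Remove the sku identifier (all unique values).
    for r in rows:
        r.pop("sku", None)

    # Impute missing lead_time with the mean of the observed values
    # (assumes at least one observed lead_time).
    lead_mean = mean(r["lead_time"] for r in rows if r["lead_time"] is not None)
    for r in rows:
        if r["lead_time"] is None:
            r["lead_time"] = lead_mean

    # Convert Yes/No attributes to 1/0.
    for r in rows:
        for col in BINARY_COLUMNS:
            if col in r:
                r[col] = 1 if r[col] == "Yes" else 0
    return rows


# Hypothetical toy rows with only a few of the columns.
raw = [
    {"sku": 1, "lead_time": 8.0, "deck_risk": "No", "went_on_backorder": "No"},
    {"sku": 2, "lead_time": None, "deck_risk": "Yes", "went_on_backorder": "Yes"},
    {"sku": 3, "lead_time": 4.0, "deck_risk": "No", "went_on_backorder": "No"},
    {"sku": None, "lead_time": None, "deck_risk": None, "went_on_backorder": None},
]
clean = preprocess(raw)
print(len(clean), clean[1]["lead_time"])  # 3 6.0
```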
DATA BALANCING USING LIBRARY UNBALANCED
One challenge with this problem is dataset imbalance, where the majority class significantly outweighs the minority class.
To deal with the unbalanced data set, SMOTE (Synthetic Minority Over-sampling Technique) was used.
Undersampling of the majority class was also used for each feature set.
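To make the two balancing ideas concrete, here is a simplified sketch in plain Python. It is not the unbalanced library's implementation: the function names, the tuple-of-floats row format and the parameters `n_new` and `k` are assumptions for illustration. Undersampling keeps a random subset of the majority class, while SMOTE creates synthetic minority points on the line segment between a minority point and one of its nearest minority neighbors.

```python
import random


def undersample(majority, minority, rng):
    """Randomly keep as many majority rows as there are minority rows."""
    return rng.sample(majority, len(minority)) + list(minority)


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority points by interpolation.

    Assumes at least two minority points, each a tuple of numeric features.
    """
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base` (squared Euclidean distance).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbor))
        )
    return synthetic


# Hypothetical toy data: 4 minority points and 20 majority points.
rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), 5.0) for i in range(20)]

oversampled = minority + smote(minority, n_new=16, k=2, rng=rng)
balanced = undersample(majority, minority, rng)
print(len(oversampled), len(balanced))  # 20 8
```

Resampling should be applied to the training split only, so that the evaluation data keeps the original class proportions.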
The second challenge is optimizing for the business case. To do so, we could explore cutoff (threshold) optimization, which searches for the classification cutoff that maximizes expected profit.
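As a minimal sketch of cutoff optimization, the code below scores each candidate threshold by the total profit it would produce on a labelled validation set and keeps the best one. The payoff values (benefit of catching a backorder, cost of a false alarm, cost of a missed backorder) and the toy scores are made up for illustration; real values would have to come from the business.

```python
def expected_profit(y_true, scores, threshold, benefit_tp, cost_fp, cost_fn):
    """Total profit on a labelled set when predicting 1 for score >= threshold."""
    profit = 0.0
    for y, s in zip(y_true, scores):
        predicted = s >= threshold
        if predicted and y == 1:
            profit += benefit_tp   # backorder correctly anticipated
        elif predicted and y == 0:
            profit -= cost_fp      # unnecessary extra stock
        elif not predicted and y == 1:
            profit -= cost_fn      # missed backorder
    return profit


def best_threshold(y_true, scores, **payoffs):
    """Try every observed score as a cutoff and return the most profitable."""
    candidates = sorted(set(scores))
    return max(candidates,
               key=lambda t: expected_profit(y_true, scores, t, **payoffs))


# Hypothetical validation labels, model scores and payoffs.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
payoffs = {"benefit_tp": 10, "cost_fp": 2, "cost_fn": 10}

cutoff = best_threshold(y_true, scores, **payoffs)
print(cutoff, expected_profit(y_true, scores, cutoff, **payoffs))  # 0.35 28.0
```

Because a missed backorder is assumed to cost more than a false alarm here, the chosen cutoff sits low enough to catch every positive in the toy data.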
METRICS
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical method that plots the true positive rate (y-axis) against the false positive rate (x-axis) as the classification threshold varies. The benefits of the ROC curve:
- We can visualize how the binary classification model compares to randomly guessing
- We can calculate AUC (Area Under the Curve), which gives a single number for comparing models (perfect classification = 1).
We could use the ROC curve to pick a classification threshold that corresponds to the point on the curve with our desired balance between the true positive rate and the false positive rate.
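The ROC construction can be sketched in a few lines: sort the examples by score, lower the threshold one distinct score at a time, and record the (false positive rate, true positive rate) pair at each step; AUC is then the trapezoidal area under those points. The function names and toy data are assumptions for illustration, and the sketch assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) points as the threshold sweeps from high to low."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ordered = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ordered):
        current = ordered[i][0]
        # Examples with tied scores cross the threshold together.
        while i < len(ordered) and ordered[i][0] == current:
            if ordered[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC points using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels and model scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
print(round(auc(roc_points(y_true, scores)), 3))  # 0.889
```

A model that guesses at random has an expected AUC of 0.5, which is the diagonal line the ROC curve is compared against.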
Alternatively, we may look at two different measures, precision and recall:
- Precision: the proportion of predicted backorders that actually go on backorder, i.e. true positives divided by total predicted positives
- Recall: the proportion of backordered items that are predicted to go on backorder, i.e. true positives divided by total actual positives
If we set a low threshold for classification, we predict that parts go on backorder more often. This leads to higher recall and lower precision. If we set a high threshold for classification, we do not predict that parts go on backorder as often. This leads to lower recall and higher precision.
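The precision and recall trade-off can be checked with a small sketch that computes both measures at a low and a high threshold. The function name and the toy labels and scores are hypothetical; when there are no predicted or no actual positives the sketch returns 0.0 by convention.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting backorder for score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        predicted = s >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels and model scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

# Low threshold: every backorder is caught, with some false alarms.
print(precision_recall(y_true, scores, 0.2))   # precision 0.6, recall 1.0
# High threshold: no false alarms, but one backorder is missed.
print(precision_recall(y_true, scores, 0.65))  # precision 1.0, recall about 0.667
```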