Product back orders prediction

Welcome to the Product-Backorders wiki!

Product Backorders

What is a backorder? A customer order that has not been fulfilled. A backorder generally indicates that customer demand for a product or service exceeds a company’s capacity to supply it. Total backorders, also known as backlog, may be expressed in units or as a dollar amount.
A product backorder may be the result of strong sales performance (e.g. the product is in such high demand that production cannot keep up with sales). However, backorders can upset consumers and lead to canceled orders and decreased customer loyalty. Companies want to avoid backorders, but they also want to avoid overstocking every product, which leads to higher inventory costs.
Machine learning can identify patterns related to backorders before customers order. Production can then adjust to minimize delays, while customer service can provide accurate dates to keep customers informed and happy. This predictive approach helps get as much product as possible into customers’ hands at the lowest cost to the organization.

DATA

The data comes from Kaggle’s “Can You Predict Product Backorders?” dataset. The data file contains the historical data for the 8 weeks prior to the week we are trying to predict. The data were taken as weekly snapshots at the start of each week. The target (or response) is the went_on_backorder variable. To model and predict the target, we’ll use the other features, which include:

sku – Random ID for the product
national_inv – Current inventory level for the part
lead_time – Transit time for product (if available)
in_transit_qty – Amount of product in transit from source
forecast_3_month – Forecast sales for the next 3 months
forecast_6_month – Forecast sales for the next 6 months
forecast_9_month – Forecast sales for the next 9 months
sales_1_month – Sales quantity for the prior 1 month time period
sales_3_month – Sales quantity for the prior 3 month time period
sales_6_month – Sales quantity for the prior 6 month time period
sales_9_month – Sales quantity for the prior 9 month time period
min_bank – Minimum recommended amount to stock
potential_issue – Source issue for part identified
pieces_past_due – Parts overdue from source
perf_6_month_avg – Source performance for prior 6 month period
perf_12_month_avg – Source performance for prior 12 month period
local_bo_qty – Amount of stock orders overdue
deck_risk – Part risk flag
oe_constraint – Part risk flag
ppap_risk – Part risk flag
stop_auto_buy – Part risk flag
rev_stop – Part risk flag
went_on_backorder – Product actually went on backorder. This is the target value.

UNDERSTANDING

Class imbalance: only 0.67% of products went on backorder
lead_time has NA values
The last row has all NA values
sku has all unique values, so it can be removed
perf_6_month_avg and perf_12_month_avg mark missing data with the value -99
There are SKUs for which forecast and sales are 0 and the target went_on_backorder is also “No”

Near-zero-variance attributes:
in_transit_qty (numeric)
potential_issue (categorical)
pieces_past_due (numeric)
local_bo_qty (numeric)
oe_constraint (categorical)
stop_auto_buy (categorical)
rev_stop (categorical)
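
The checks above can be sketched in a few lines of plain Python. This is only an illustration: the four sample rows are made up, and the helper names (`positive_rate`, `count_missing`, `dominant_share`) are hypothetical, not part of the original analysis, which appears to have been done in R.

```python
from collections import Counter

# Made-up sample rows standing in for the Kaggle file (illustrative values only).
rows = [
    {"lead_time": 8.0, "perf_6_month_avg": 0.90, "in_transit_qty": 0, "went_on_backorder": "No"},
    {"lead_time": None, "perf_6_month_avg": -99.0, "in_transit_qty": 0, "went_on_backorder": "No"},
    {"lead_time": 2.0, "perf_6_month_avg": 0.75, "in_transit_qty": 0, "went_on_backorder": "Yes"},
    {"lead_time": 12.0, "perf_6_month_avg": -99.0, "in_transit_qty": 5, "went_on_backorder": "No"},
]

def positive_rate(rows, target="went_on_backorder"):
    """Share of rows whose target is "Yes" (the class-imbalance check)."""
    return sum(r[target] == "Yes" for r in rows) / len(rows)

def count_missing(rows, column, sentinel=None):
    """Count None values, plus values equal to an optional sentinel such as -99."""
    return sum(
        r[column] is None or (sentinel is not None and r[column] == sentinel)
        for r in rows
    )

def dominant_share(rows, column):
    """Share of rows taken by the most common value; close to 1 suggests near-zero variance."""
    counts = Counter(r[column] for r in rows)
    return counts.most_common(1)[0][1] / len(rows)

print(positive_rate(rows))                                      # class balance
print(count_missing(rows, "lead_time"))                         # NA values
print(count_missing(rows, "perf_6_month_avg", sentinel=-99.0))  # -99 sentinel
print(dominant_share(rows, "in_transit_qty"))                   # near-zero variance hint
```

The dominant-value share is one simple proxy for near-zero variance; the cutoff at which a column counts as "near zero" is a judgment call that the page does not specify.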

PREPROCESSING

Remove Null/Duplicate Rows
Remove sku (all unique values)
Impute NAs in lead_time with the mean
Normalize the quantity columns
Convert categorical binary attributes to numeric attributes (Yes/No to 1/0)
Remove the correlated columns based on corrplot
Drop unused factor levels (when creating a subset of a dataframe)
Remove the SKUs for which forecast and sales are 0 and the target went_on_backorder is also “No”. Remaining number of records = 1015940
Feature selection using a random forest is a possible further step
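
A minimal sketch of some of these steps (dropping all-NA rows, removing sku, mean-imputing lead_time, and mapping Yes/No to 1/0) is shown below in plain Python. The sample rows are invented, `FLAG_COLUMNS` lists only a subset of the binary columns for brevity, and `preprocess` is a hypothetical helper rather than the code used for this project.

```python
# Subset of the Yes/No columns, kept short for the example.
FLAG_COLUMNS = ["potential_issue", "deck_risk", "went_on_backorder"]

def preprocess(rows):
    """Drop all-NA rows, remove sku, mean-impute lead_time, map Yes/No flags to 1/0."""
    # Drop rows in which every field is missing (for example an all-NA last row).
    rows = [r for r in rows if any(v is not None for v in r.values())]

    # Mean of the observed lead times, used to fill the gaps.
    observed = [r["lead_time"] for r in rows if r["lead_time"] is not None]
    mean_lead_time = sum(observed) / len(observed)

    cleaned = []
    for r in rows:
        out = {k: v for k, v in r.items() if k != "sku"}  # sku is a unique ID
        if out["lead_time"] is None:
            out["lead_time"] = mean_lead_time
        for col in FLAG_COLUMNS:
            out[col] = 1 if out[col] == "Yes" else 0
        cleaned.append(out)
    return cleaned

# Invented sample: the last row is entirely missing.
raw = [
    {"sku": 1, "lead_time": 8.0, "potential_issue": "No", "deck_risk": "Yes", "went_on_backorder": "No"},
    {"sku": 2, "lead_time": None, "potential_issue": "No", "deck_risk": "No", "went_on_backorder": "Yes"},
    {"sku": 3, "lead_time": 4.0, "potential_issue": "No", "deck_risk": "No", "went_on_backorder": "No"},
    {"sku": None, "lead_time": None, "potential_issue": None, "deck_risk": None, "went_on_backorder": None},
]
cleaned = preprocess(raw)
print(cleaned[1])  # lead_time filled with the mean of 8.0 and 4.0
```

Normalization, correlation filtering, and the zero forecast/sales filter are left out here because the page does not say which method or cutoff was used for them.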

DATA BALANCING USING LIBRARY UNBALANCED

One challenge with this problem is dataset imbalance: the majority class significantly outweighs the minority class. To deal with the unbalanced data set, SMOTE (synthetic minority over-sampling technique) was used. Undersampling was also used for each feature set.
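
To make the two ideas concrete, here is a simplified pure-Python sketch. It is not the `unbalanced` library's implementation: `smote` and `undersample` are hypothetical helpers, the points are made up, and this SMOTE variant picks base points at random instead of generating a fixed number of synthetic points per minority sample.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours (needs at least two points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest neighbours of base by squared Euclidean distance.
        neighbours = sorted(
            others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

def undersample(majority, n_keep, seed=0):
    """Keep a random subset of the majority class, without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)

# Made-up two-feature points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(float(i), float(i) + 2.0) for i in range(10)]

new_points = smote(minority, n_new=5)
kept = undersample(majority, n_keep=3)
print(len(new_points), len(kept))
```

Because each synthetic point lies on a segment between two real minority points, oversampling this way adds variety without stepping outside the region the minority class already occupies.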
The second challenge is optimizing for the business case. To do so, we could explore cutoff (threshold) optimization, which can be used to find the cutoff that maximizes expected profit.
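
A rough sketch of cutoff optimization follows. The gain per correctly predicted backorder (`gain_tp`) and the cost per false alarm (`cost_fp`) are invented numbers, as are the labels and scores; in practice they would come from the business case and from a fitted model.

```python
def expected_profit(y_true, scores, threshold, gain_tp=10.0, cost_fp=2.0):
    """Total profit if every item scoring at or above the threshold is treated
    as a predicted backorder. gain_tp and cost_fp are assumed values."""
    profit = 0.0
    for y, s in zip(y_true, scores):
        if s >= threshold:
            profit += gain_tp if y == 1 else -cost_fp
    return profit

def best_threshold(y_true, scores, candidates, **costs):
    """Return the candidate cutoff with the highest expected profit."""
    return max(candidates, key=lambda t: expected_profit(y_true, scores, t, **costs))

# Made-up labels (1 = went on backorder) and model scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
candidates = [0.2, 0.4, 0.5, 0.8]

best = best_threshold(y_true, scores, candidates)
print(best, expected_profit(y_true, scores, best))  # 0.2 26.0
```

With these assumed costs a missed backorder is far more expensive than a false alarm, so the low cutoff wins; changing the ratio of `gain_tp` to `cost_fp` moves the best cutoff.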

METRICS

ROC Curve

The Receiver Operating Characteristic (ROC) curve is a graphical method that plots the true positive rate (y-axis) against the false positive rate (x-axis). The benefits of the ROC curve:

We can visualize how the binary classification model compares to randomly guessing
We can calculate AUC (Area Under the Curve), which is a method to compare models (perfect classification = 1)

We could use the ROC curve and pick a threshold for classification that corresponds to the point on the line for our desired balance between the true positive rate and false positive rate.
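
As an illustration, the ROC points and the AUC can be computed by hand in plain Python. The labels and scores below are made up, and `roc_points` and `auc` are hypothetical helper names; a library routine would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per distinct
    score used as a threshold, starting from (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels (1 = went on backorder) and model scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

roc_auc = auc(roc_points(y_true, scores))
print(round(roc_auc, 3))  # 0.889, i.e. 8 of 9 positive/negative pairs ranked correctly

perfect_auc = auc(roc_points([0, 1], [0.2, 0.8]))
print(perfect_auc)  # 1.0 for a perfect ranking
```

A model that guesses at random would trace the diagonal from (0, 0) to (1, 1), giving an AUC of about 0.5.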

Alternatively, we may look at two different measures, precision and recall:

Precision: the proportion of predicted backorders that actually go on backorder, i.e. true positives divided by total predicted positives
Recall: the proportion of backordered items that are predicted to go on backorder, i.e. true positives divided by total actual positives

If we set a low threshold for classification, we predict that parts go on backorder more often. This leads to higher recall and lower precision.
If we set a high threshold for classification, we do not predict that parts go on backorder as often. This leads to lower recall and higher precision.
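
The trade-off can be seen on a small made-up example. `precision_recall` is a hypothetical helper, and returning 0.0 when a denominator is zero is a convention chosen for this sketch.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores at or above the threshold are predicted positive."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        predicted = s >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up labels (1 = went on backorder) and model scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

low = precision_recall(y_true, scores, threshold=0.2)
high = precision_recall(y_true, scores, threshold=0.8)
print(low)   # precision 0.6, recall 1.0
print(high)  # precision 1.0, recall about 0.33
```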
