Using Backblaze dataset on Kaggle.
This is a data science case study. The dataset used for this project is private, but a similar dataset and project can be found on Kaggle.com.
Disclaimer: This case study is based on a sample subset of a larger dataset and does not accurately solve the problem. It is meant to demonstrate the use of different tools and libraries in ML, how to present reports, and how to use Python for ML.
A sample of the SMART hard drive dataset can be downloaded at: https://www.kaggle.com/backblaze/hard-drive-test-data
SMART, or S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology), is a monitoring system for hard drives. SMART generates a collection of different metrics that help evaluate the overall health of a hard drive.
A single metric may not reliably predict a failure on its own, but SMART metrics are commonly accepted as a way to identify imminent failures so that backup and restore can be handled in time.
This case study relies on a data stream provided for this purpose. The goal is to analyze the given data and extract meaningful information that helps determine drive failure trends and the factors that may indicate whether a drive will fail, and to propose a more data-driven answer to future failures based on SMART metrics.
The study concludes by discussing opportunities and challenges with the existing model and the features that could help design a better predictive model in the future.
To access the entire analysis code as a Jupyter notebook, go to: Predicting Hard drive failure
Here's a quick overview of how this problem has been approached:
- Connect to the postgres server.
- Download the dataset offline
- Wrangle and explore
- Change dimensions, clean, and slice and dice
- Analyze dataset, plot most significant trends
- Feature Selection
- Model and predict
(This is Optional)
1. Number of hard drives per model
2. Number of positive failures by model
3. Failure trend over time
4. Daily failure trend, to determine the pattern of missing failure data
and more...
- Conclusion
- Challenges with the current dataset and ways to improve it
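
The exploratory steps in the overview above (drives per model, failures per model, daily failure trend and its gaps) can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the tiny DataFrame is synthetic, and the column names `date`, `serial_number`, `model` and `failure` are assumed to follow the layout of the public Backblaze data.

```python
import pandas as pd

# Synthetic sample (made-up values): one row per drive per day.
# Column names are assumed to match the public Backblaze layout.
df = pd.DataFrame(
    {
        "date": ["2016-01-01", "2016-01-01", "2016-01-01",
                 "2016-01-02", "2016-01-02", "2016-01-04"],
        "serial_number": ["A1", "B1", "B2", "A1", "B2", "B2"],
        "model": ["ModelA", "ModelB", "ModelB", "ModelA", "ModelB", "ModelB"],
        "failure": [0, 1, 0, 0, 0, 1],
    }
)
df["date"] = pd.to_datetime(df["date"])

# 1. Number of distinct hard drives per model.
drives_per_model = df.groupby("model")["serial_number"].nunique()

# 2. Number of recorded failures per model.
failures_per_model = df.groupby("model")["failure"].sum()

# 3. Failures per day.
daily_failures = df.groupby("date")["failure"].sum()

# 4. Reindex over the full calendar range so that days with no rows
#    at all show up as NaN instead of silently disappearing.
full_range = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
daily_failures = daily_failures.reindex(full_range)
missing_days = daily_failures[daily_failures.isna()].index

print(drives_per_model)
print(failures_per_model)
print(daily_failures)
print("Days with no data:", list(missing_days.strftime("%Y-%m-%d")))
```

Each of the resulting Series can be passed to `.plot()` to produce the corresponding chart. Reindexing before plotting is a design choice: it separates days with zero failures from days where no data was recorded.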
python, sql, pandas, scikit and other machine learning libraries, postgres
@geekidharsh: I am a Data Engineer with 4+ years of experience in e-commerce and digital acquisition, analyzing swiftly changing user behaviors to make data-driven decisions at scale. Currently, I work at Merck KGaA.