Data Wrangling A2: Data Cleansing
- Data_Cleansing_Specifications.pdf: Assignment specifications.
- Data_Cleansing.ipynb/pdf: Python code to analyse the dataset and find and fix the problems in the data.
- Input data: transactional retail data from an online electronics store.
- 30945305_dirty_data.csv, 30945305_missing_data.csv, 30945305_outlier_data.csv: Input files (unclean data)
- 30945305_dirty_data_solution.csv, 30945305_missing_data_solution.csv, 30945305_outlier_data_solution.csv: Output files (cleaned data)
Tasks completed:
- Explore the data with graphical and/or non-graphical EDA methods first, then find and fix the data problems.
- Detect and fix errors in 30945305_dirty_data.csv
- Detect and remove outlier rows in 30945305_outlier_data.csv (outliers are identified with respect to the delivery_charges attribute only)
- Impute the missing values in 30945305_missing_data.csv
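
As an illustration of the outlier task, here is a minimal sketch that drops rows whose delivery_charges value falls outside the usual 1.5 x IQR fences. The IQR rule is an assumption made for this example, and the notebook may use a different criterion (for instance, residuals from a fitted model). The function name `remove_iqr_outliers`, the `order_id` column, and the sample values are hypothetical.

```python
import pandas as pd


def remove_iqr_outliers(df, column="delivery_charges", k=1.5):
    """Return a copy of df without rows where `column` lies outside the IQR fences.

    Assumption: a plain IQR rule is used here for illustration only.
    """
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep rows whose value is inside [lower, upper] (bounds inclusive).
    mask = df[column].between(lower, upper)
    return df[mask].reset_index(drop=True)


# Hypothetical sample data; the real input would be 30945305_outlier_data.csv.
sample = pd.DataFrame(
    {
        "order_id": ["A1", "A2", "A3", "A4", "A5", "A6"],
        "delivery_charges": [60.0, 62.5, 58.0, 61.0, 59.5, 250.0],
    }
)
cleaned = remove_iqr_outliers(sample)
print(cleaned)
```

With the real file, the same function could be applied to the DataFrame returned by `pd.read_csv("30945305_outlier_data.csv")` and the result written out with `to_csv(..., index=False)`.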
Libraries used: pandas, numpy, matplotlib, nltk, nltk.sentiment.vader, sklearn.linear_model, scipy
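
Since sklearn.linear_model is among the libraries, one plausible way to impute a missing numeric column is to fit a linear regression on the complete rows and predict the missing values. The sketch below assumes that approach; it is not confirmed to be the method used in the notebook. The helper `impute_with_regression`, the `distance_km` feature, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_with_regression(df, target, features):
    """Fill missing values of `target` with predictions from a linear model.

    Assumption: `features` are numeric and have no missing values themselves.
    """
    result = df.copy()
    known = df[df[target].notna()]
    missing = df[df[target].isna()]
    if missing.empty:
        return result
    # Fit on rows where the target is present, then predict the gaps.
    model = LinearRegression().fit(known[features], known[target])
    result.loc[missing.index, target] = model.predict(missing[features])
    return result


# Hypothetical sample data; the real input would be 30945305_missing_data.csv.
sample = pd.DataFrame(
    {
        "distance_km": [1.0, 2.0, 3.0, 4.0, 5.0],
        "delivery_charges": [52.0, 54.0, np.nan, 58.0, 60.0],
    }
)
imputed = impute_with_regression(sample, "delivery_charges", ["distance_km"])
print(imputed)
```

Categorical or identifier columns with missing values would need a different strategy (for example, a lookup against related rows), which this sketch does not cover.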