A repository of SQL data cleaning projects.
This repo collects small projects for practicing data cleaning with SQL, Excel, or any other tool. It was inspired by a LinkedIn post by Sushanta Khara.
In data analysis, the analyst must ensure that the data is 'clean' before analyzing it. 'Dirty' data can lead to unreliable, inaccurate, or misleading results. Garbage in = garbage out.
These are some of the steps that can be taken to properly prepare your dataset for analysis.
- Check for duplicate entries and remove them.
- Remove extra spaces and/or other invalid characters.
- Separate or combine values as needed.
- Ensure that certain values (age, dates...) are within a valid range.
- Check for outliers.
- Correct misspellings and data-entry errors.
- Add new, relevant rows or columns to the new dataset.
- Check for null or empty values.
Using the criteria above, create a new SQL table with the properly formatted data.
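As an illustration, a few of the steps above can be sketched in SQL run through Python's built-in `sqlite3` module. The table names (`raw_customers`, `clean_customers`), the columns, the sample rows, and the 0 to 120 age range are all assumptions made up for this example; they do not come from any dataset in this repo. Dropping rows with a missing email is also just one possible policy; depending on the project, you may prefer to keep and flag them.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Hypothetical raw table with typical problems:
-- extra spaces, inconsistent case, duplicates, nulls, out-of-range values.
CREATE TABLE raw_customers (name TEXT, email TEXT, age INTEGER);
INSERT INTO raw_customers VALUES
  ('  Alice Smith ', 'ALICE@EXAMPLE.COM', 34),
  ('Alice Smith',    'alice@example.com', 34),
  ('Bob Jones',      NULL,                41),
  ('Carol White',    'carol@example.com', 230),
  ('Dan Brown',      '',                  28),
  ('Eve Adams',      ' eve@example.com',  29);

-- New table holding only the cleaned rows.
CREATE TABLE clean_customers AS
SELECT DISTINCT                          -- remove duplicates (after normalizing)
    TRIM(name)         AS name,          -- remove extra spaces
    LOWER(TRIM(email)) AS email,         -- normalize case
    age
FROM raw_customers
WHERE email IS NOT NULL                  -- drop null values
  AND TRIM(email) <> ''                  -- drop empty values
  AND age BETWEEN 0 AND 120;             -- assumed valid age range
""")

rows = conn.execute(
    "SELECT name, email, age FROM clean_customers ORDER BY name"
).fetchall()
for row in rows:
    print(row)
```

Note that `DISTINCT` is applied to the trimmed and lowercased values, so the two Alice rows collapse into one even though their raw text differs.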
This repository contains different projects/datasets to give the user many opportunities to practice:
- Basic select statements (select, where, group by, having).
- Aggregate functions (count, sum, min, max, avg).
- Joins (inner, outer, left, right).
- CTEs, temp tables and views.
- String & date manipulation functions.
- Window functions (rank, lead, lag, row_number, ntile...).
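To give a feel for a few of these topics, here is a minimal sketch that combines `group by`/`having`, a CTE, and the `row_number` and `lag` window functions. It again uses Python's built-in `sqlite3` module; the `orders` table and its rows are hypothetical and exist only for this example. Window functions require SQLite 3.25 or newer, which is assumed to be what your Python build ships with.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical sample data for the sketch.
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  (1, 'alice', '2024-01-05', 50.0),
  (2, 'alice', '2024-02-10', 75.0),
  (3, 'bob',   '2024-01-20', 20.0),
  (4, 'bob',   '2024-03-01', 40.0),
  (5, 'bob',   '2024-03-15', 10.0);
""")

# Aggregates with GROUP BY / HAVING: customers with more than two orders.
totals = conn.execute("""
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING COUNT(*) > 2
""").fetchall()

# CTE + window functions: each customer's latest order,
# alongside the amount of the order that preceded it.
latest = conn.execute("""
    WITH ranked AS (
        SELECT customer, order_date, amount,
               ROW_NUMBER() OVER (PARTITION BY customer
                                  ORDER BY order_date DESC) AS rn,
               LAG(amount)  OVER (PARTITION BY customer
                                  ORDER BY order_date)      AS prev_amount
        FROM orders
    )
    SELECT customer, order_date, amount, prev_amount
    FROM ranked
    WHERE rn = 1
    ORDER BY customer
""").fetchall()

print(totals)
print(latest)
```

The dates are stored as ISO 8601 text (`YYYY-MM-DD`), so ordering them as strings matches chronological order; other date formats would need converting first.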