This classic dataset contains the prices and other attributes of almost 54,000 diamonds.
- price: price in US dollars ($326--$18,823)
- carat: weight of the diamond (0.2--5.01)
- cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)
- color: diamond colour, from J (worst) to D (best)
- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
- x: length in mm (0--10.74)
- y: width in mm (0--58.9)
- z: depth in mm (0--31.8)
- depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
- table: width of top of diamond relative to widest point (43--95)
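As a quick check on the depth formula above, here is a minimal sketch in Python. The helper name `depth_percentage` and the measurements are hypothetical, and the factor of 100 is an assumption based on the value being described as a percentage with a 43--79 range.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)


# Hypothetical measurements in mm, chosen for illustration only.
depth = depth_percentage(x=4.0, y=4.0, z=2.4)
print(round(depth, 1))
```

A diamond that is 4.0 mm long and wide and 2.4 mm deep sits at about 60 percent, which is inside the range listed for the depth column.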
Steps involved in data preprocessing:
- Data Cleaning
- Identifying and removing outliers
- Encoding categorical variables
The first column ("Unnamed: 0") is just a row index, so we are going to remove it.
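Dropping the index column could look roughly like the sketch below. It assumes the data is held in a pandas DataFrame named `df`; the three-row frame here is a toy stand-in, not the real file.

```python
import pandas as pd

# Toy stand-in for the diamonds table; the values are illustrative only.
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "carat": [0.23, 0.21, 0.23],
    "price": [326, 326, 327],
})

# Remove the leftover index column by name.
df = df.drop(columns=["Unnamed: 0"])
print(df.columns.tolist())
```

When reading the file with `pd.read_csv`, passing `index_col=0` is another way to avoid keeping this column in the first place.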
The minimum values of "x", "y", and "z" are zero. A zero in any of these columns would describe a dimensionless or two-dimensional diamond, which is not physically possible, so these rows are clearly faulty data points and we need to filter them out.
We lost 20 data points by deleting the diamonds with a zero dimension (the 1-D or 2-D ones).
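One way to express this filter, sketched on a toy DataFrame rather than the real data (the frame `df` and its values are assumptions for illustration), is to keep only rows where all three dimensions are non-zero:

```python
import pandas as pd

# Toy stand-in: three of these four rows have a zero dimension.
df = pd.DataFrame({
    "x": [3.95, 0.0, 4.05, 6.55],
    "y": [3.98, 3.84, 0.0, 6.48],
    "z": [2.43, 2.31, 2.31, 0.0],
})

rows_before = len(df)

# True only for rows where x, y and z are all non-zero.
nonzero = (df[["x", "y", "z"]] != 0).all(axis=1)
df = df[nonzero].reset_index(drop=True)

rows_dropped = rows_before - len(df)
print(rows_dropped)
```

Comparing the row count before and after the filter is how a figure like the number of lost data points can be confirmed.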
Checking for null values
We can see that the data is clean.
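A null check of this kind can be sketched as follows. The toy frame deliberately contains one missing value so the output shows what a problem would look like; on clean data every count is zero. The name `df` and the values are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in with one deliberately missing carat value.
df = pd.DataFrame({
    "carat": [0.23, 0.21, None],
    "cut": ["Ideal", "Premium", "Good"],
})

# Count missing values per column.
null_counts = df.isnull().sum()
print(null_counts)

# A single flag that is True when no value is missing anywhere.
is_clean = not df.isnull().values.any()
print(is_clean)
```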