Kwame Taylor, Codeup Darden Cohort, Oct 2020
Welcome to my data science clustering project: Kwame's Zillow Zestimates Error Control! This project uses clustering to find the drivers of error in the Zestimates of single-unit properties listed on Zillow in 2017. I will demonstrate how this data can be used for quality control and for preventing future Zestimate errors.
I plan to create an MVP and then iterate through the data science pipeline multiple times.
Date | Goal | Finished? |
---|---|---|
10/15/2020 | Project planning, start on outline/bones of project. | |
10/17/2020 | Finish MVP of wrangle.py and preprocessing.py. | |
10/18/2020 | Finish MVP of explore.py and start model.py MVP. | |
10/19/2020 | Finish model.py MVP and iterate through data science pipeline 1x. | |
10/20/2020 | Practice presentation, 1x iteration, sleep. | |
10/21/2020 | Presentation day! (turn in project) | |
The project deliverables are the following: a Jupyter Notebook walkthrough of the data science pipeline with conclusions, data visualizations, this README, and modules with functions (`wrangle.py`, `preprocessing.py`, `explore.py`, and `model.py`).
Pipeline iteration 1:
- Project plan and timeline
- README outline
- Structure project bones
- Reach the minimum/MVP for each stage to be able to move on to the next stage.
Pipeline iteration 2:
- Recalibrate project plan timeline
- Tidy the data a little further
- Put functions into modules
- Flesh out README
- Run one statistical test
- Explore and engineer features with clustering (see the sketch below)
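As a rough illustration of the `cluster_area` feature described in the data dictionary below, here is a minimal sketch of location-based clustering with K-means. The scaler, number of clusters, and function name are assumptions for illustration, not the project's final implementation in `preprocessing.py`.

```python
# Sketch: engineer a cluster_area feature by clustering properties on location.
# Column names match the data dictionary; k and the scaler are assumptions.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

def add_cluster_area(df, k=6, random_state=123):
    """Add a cluster_area column based on latitude, longitude, and county."""
    features = df[["latitude", "longitude", "county"]]
    scaled = MinMaxScaler().fit_transform(features)  # put features on one scale
    kmeans = KMeans(n_clusters=k, random_state=random_state)
    df["cluster_area"] = kmeans.fit_predict(scaled)
    return df, kmeans
```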
Pipeline iteration 3 (to-do list):
- remove outliers with an isolation forest (see the sketch after this list)
- make README more thorough
- add the data dictionary and the hypotheses to the README
- review my notes and the project specs
- turn cluster_area into dummy variables (e.g. is_cluster_area_1, is_cluster_area_2; see the sketch after this list)
- change the statistical test to one better suited to the distribution of years built
- takeaways on where to focus efforts to reduce log error
- put remaining notebook code into functions
- add two more models and test the best model on test data
- copy comments from prepare code into presentation notebook
- conclusions
- practice presentation and make script/notes
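Below is a minimal sketch of two of the to-do items above: isolation-forest outlier removal and turning `cluster_area` into dummy variables. The contamination rate, feature list, and function names are assumptions for illustration, not the final implementation.

```python
# Sketch: isolation-forest outlier removal and cluster_area dummy variables.
import pandas as pd
from sklearn.ensemble import IsolationForest

def remove_outliers(df, cols, contamination=0.01, random_state=123):
    """Drop rows flagged as outliers by an isolation forest on the given columns."""
    iso = IsolationForest(contamination=contamination, random_state=random_state)
    flags = iso.fit_predict(df[cols])  # 1 = inlier, -1 = outlier
    return df[flags == 1]

def encode_cluster_area(df):
    """Turn cluster_area into is_cluster_area_* dummy variables."""
    dummies = pd.get_dummies(df["cluster_area"], prefix="is_cluster_area")
    return pd.concat([df, dummies], axis=1)
```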
Things I'll save for future iterations for the sake of time:
- title and label visualizations better
- plot centroids
- GeoPy implementation
- plot elevation's relation to latitude/longitude to see if log error has anything to do with topographical data
- make my module functions more generic and useful
- another hypothesis test
Term | Definition |
---|---|
parcelid (index) | Unique identifier for parcels (lots) |
bathcnt | Number of bathrooms in home including fractional bathrooms |
sqft | Calculated total finished living area of the home |
latitude | Latitude of the middle of the parcel multiplied by 10e6 |
longitude | Longitude of the middle of the parcel multiplied by 10e6 |
yearbuilt | The Year the principal residence was built |
value | The total tax assessed value of the parcel |
county (engineered) | County in which the parcel is located |
bathbedcnt (engineered) | Number of bedrooms plus bathrooms in home |
decade (engineered) | The Decade the principal residence was built |
century (engineered) | The Century the principal residence was built |
cluster_area (engineered) | Clusters based on latitude, longitude, and county |
logerror (prediction target) | The difference between log of Zestimate (prediction) and log of actual sales price of a property |
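For reference, the prediction target from the definition above can be written as:

$$\text{logerror} = \log(\text{Zestimate}) - \log(\text{SalePrice})$$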
Encoded value | County |
---|---|
0 | Los Angeles County |
1 | Orange County |
2 | Ventura County |
𝐻0: There is no difference between the Zestimate log error of properties built in the 1800s and the overall log error.
𝐻𝑎: There is a difference between the Zestimate log error of properties built in the 1800s and the overall log error.
𝐻0: There is no difference between the Zestimate log error of properties built in the 1960s and the overall log error.
𝐻𝑎: There is a difference between the Zestimate log error of properties built in the 1960s and the overall log error.
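Below is a minimal sketch of one way the 1960s hypothesis could be tested, assuming a prepared dataframe with `decade` and `logerror` columns. The one-sample t-test, the alpha value, and the function name are assumptions; the to-do list above notes the test may still change.

```python
# Sketch: compare the log error of homes built in a given decade to the overall
# mean log error with a one-sample t-test. Column names and alpha are assumptions.
from scipy import stats

def test_decade_logerror(df, decade=1960, alpha=0.05):
    """Test whether one decade's mean logerror differs from the overall mean."""
    overall_mean = df.logerror.mean()
    subset = df[df.decade == decade].logerror
    t, p = stats.ttest_1samp(subset, overall_mean)
    decision = "Reject H0" if p < alpha else "Fail to reject H0"
    print(f"{decision} for the {decade}s (t = {t:.3f}, p = {p:.3f})")
    return t, p
```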
Instructions for use and reproduction:
To see and read through the main notebook, you can navigate to `kwames-zillow-zestimates-error-control.ipynb` in this GitHub repository.
You can explore the functions from the notebook in more depth in the `wrangle.py`, `preprocessing.py`, `explore.py`, and `model.py` files.
In order to run the code in this repository, you'll need:
- An installation of Python through Anaconda
- An `env.py` file that defines the following variables:
- 'user'
- 'host'
- 'password'
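A hypothetical `env.py` might look like the sketch below. The values are placeholders, and the `get_db_url` helper is an assumption based on a common pattern, not necessarily what this repository expects.

```python
# Example env.py (placeholder values; never commit real credentials).
user = "your_username"
host = "database_host_address"
password = "your_password"

# Hypothetical helper; the three variables above are all the repository asks for.
def get_db_url(database, user=user, host=host, password=password):
    """Build a MySQL connection string for pandas.read_sql."""
    return f"mysql+pymysql://{user}:{password}@{host}/{database}"
```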
The code in this repository was developed on macOS, but should run fine anywhere you can install Python + Anaconda.
Resources and acknowledgments:
- Codeup curriculum
- Simple and Multiple Linear Regression in Python
- GeoPy
- 4 Automatic Outlier Detection Algorithms in Python
- Isolation Forest documentation
- Outlier Detection with Isolation Forest
- Markdown Table generator
- Geographic Data with Basemap
- Preprocessing: why you should generate polynomial features first before standardizing
- World Elevation Contours data from UCLA
- California school districts data from UCLA
- California school districts from CA's DoE
- Understanding K-means Clustering in Machine Learning
- Faith's Darden reviews, of course!
- And extra big thanks to my Codeup Darden cohort colleagues for being a constant source of knowledge, help, and motivation!