
License: MIT

Machine-Learning-House-Price-Prediction

Usage and implementation of a machine learning algorithm for house price prediction in California. Streamlit is used to provide the UI/UX for the end-user experience.

This repo demonstrates usage of a custom pre-trained model that can be reused over and over again without having to retrain it.

This project is an end-to-end machine learning pipeline for predicting house values based on user-input parameters. The pipeline covers all MLOps stages, including data gathering, preprocessing, model training, testing, deployment, monitoring, and retraining, with a focus on automation, version control, and reproducibility.

# the notebook used to carry out the ML / data science work
research.ipynb

# the Python code that delivers the MVP end-user experience with Streamlit,
# built with the MLOps backend considerations outlined below in this README.md
app.py
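
A minimal sketch of what app.py could look like: loading a saved model and serving predictions through Streamlit. The model file name ("model.pkl") and the feature names below are illustrative assumptions, not necessarily the exact ones used in this repo.

```python
# minimal Streamlit sketch: load a pre-trained model and predict from user input
# NOTE: "model.pkl" and the feature names are assumptions made for illustration
import pickle

import pandas as pd
import streamlit as st

st.title("California House Price Prediction")

# load the pre-trained model once; no retraining is needed at serving time
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# collect the user-input parameters the model expects
median_income = st.number_input("Median income (tens of thousands of dollars)", value=3.0)
house_age = st.number_input("Housing median age (years)", value=20.0)

if st.button("Predict house value"):
    features = pd.DataFrame([{"median_income": median_income,
                              "housing_median_age": house_age}])
    st.write(f"Estimated house value: ${model.predict(features)[0]:,.0f}")
```

A sketch like this is launched with `streamlit run app.py`.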

A diagram of the general flow of implementation can be seen below:

  1. GATHERING DATA:
    The most important step is understanding what a successful result looks like. In this case we want to predict house values and provide an interface where the end-user can enter specific input parameters and receive a house value generated by a machine learning model.

    Once we understand the final outcome we want, we need to gather the relevant data that would allow us to achieve it.

    Key Points for MLOPS:

    • Objective: Collect relevant, high-quality data to train a robust model.
    • Source: U.S. Census data on housing prices (from Kaggle).
    • Process: Define success metrics for the model, such as mean absolute error (MAE), and collect data that aligns with these goals.
    • Versioning: Use DVC (Data Version Control) to track data changes.
    • We landed on U.S. Census data on house prices, with a target column of house values that we can train a model on (a minimal loading sketch follows this list).
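
    A minimal sketch of the data-gathering step, assuming the Kaggle CSV has been downloaded locally as housing.csv (the file name is an assumption for illustration):

```python
# load the California housing CSV and run quick sanity checks before any modelling
import pandas as pd

housing = pd.read_csv("housing.csv")  # hypothetical local copy of the Kaggle dataset

print(housing.shape)           # number of rows and columns
print(housing.head())          # first few records
print(housing.isnull().sum())  # missing values per column, useful for the next step
```
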
  2. DATA PREPARATION:
    Here we bring the data into a notebook and perform EDA: getting a good sense of the data, removing null values, and cleaning it. This tells us how we can work with the data, which machine learning model would be the best fit, and whether we need more or different data. We continue preparing the data until it is clean and adequate to address the question we want our model to answer.

    Key Points for MLOPS:

    • EDA & Preprocessing: Clean the data, handle missing values, and document all steps. Ensure transformations are clearly documented and stored in scripts for reproducibility.
    • Feature Engineering: Perform feature selection and transformation based on correlation, variance, and model requirements.
    • Consistency: Document transformations to ensure new data undergoes identical preprocessing steps before inference.
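
    A minimal preprocessing sketch, assuming the column names commonly found in the Kaggle dataset (e.g. total_bedrooms, ocean_proximity, median_house_value), which may differ from the exact columns used in research.ipynb:

```python
# basic cleaning steps on the raw CSV (column names are assumptions for illustration)
import pandas as pd

housing = pd.read_csv("housing.csv")

# impute missing values (total_bedrooms has nulls in the Kaggle dataset)
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(housing["total_bedrooms"].median())

# one-hot encode the categorical column so models can consume it
housing = pd.get_dummies(housing, columns=["ocean_proximity"])

# inspect correlations against the target to guide feature selection
print(housing.corr()["median_house_value"].sort_values(ascending=False))
```
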
  3. DATA WRANGLING:
    This goes hand in hand with data preparation above: obtaining all the data and prepping, preprocessing, cleaning, and transforming it. Please note: the preprocessing step needs to be clearly documented and defined, as any new data being fed into the machine learning model will always have to be preprocessed in the same consistent manner, otherwise the machine learning model will not produce valid results (one way to enforce this is sketched below).
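
    One way to enforce that consistency is to bundle the preprocessing and the model into a single scikit-learn Pipeline, so any new data automatically goes through identical steps at inference time. A minimal sketch, with column names assumed from the Kaggle dataset:

```python
# preprocessing + model in one object: new data is always transformed the same way
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["median_income", "housing_median_age", "total_rooms", "total_bedrooms"]
categorical_features = ["ocean_proximity"]

preprocessing = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

model = Pipeline([("prep", preprocessing), ("reg", LinearRegression())])
# model.fit(X_train, y_train) now applies the exact same transformations every time,
# and model.predict(new_data) repeats them for inference
```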

  4. ANALYSE DATA:

    • Again, this step goes hand in hand with the two previous steps, typically inside your Jupyter notebook environment, and may start to involve f1-scores, variance/bias analysis, feature engineering, correlation metrics, accuracy scores, and other statistical metrics to try to pinpoint the best algorithm and solution for model training results and outputs. A machine learning model that gives poor results is ultimately of no use.
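
    A minimal sketch of the kind of regression metrics used to compare candidates, assuming a held-out test split (X_test, y_test) already exists:

```python
# standard regression metrics for comparing candidate models on held-out data
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def report(y_true, y_pred):
    print("MAE :", mean_absolute_error(y_true, y_pred))
    print("RMSE:", mean_squared_error(y_true, y_pred) ** 0.5)
    print("R2  :", r2_score(y_true, y_pred))

# report(y_test, model.predict(X_test))  # call once a model has been fit
```
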
  5. TRAIN MODEL:
    In this case we utilized the scikit-learn Python library to train and test multiple models to see which would provide the best outcome:

    • linear regression (used as a high-level demo in research.ipynb)
    • random forest (used as a high-level demo in research.ipynb)
    • XGBoost regressor (performed the best when coupled with a GridSearchCV layer for additional optimization; used as a high-level demo in research.ipynb)

    Key Points for MLOPS:

    • Model Selection: Evaluate models (e.g., Linear Regression, Random Forest, XGBoost) using k-fold cross-validation and metrics like RMSE and R².
    • Hyperparameter Tuning: Use GridSearchCV or other tuning methods to optimize model parameters.
    • Automated Training: Set up training in a pipeline (e.g., Airflow) to automate retraining when new data is added or metrics degrade.
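
    A minimal training and tuning sketch along these lines. To keep the example self-contained it uses scikit-learn's built-in copy of the California housing data rather than the Kaggle CSV, and the hyperparameter grid is an illustrative assumption, not the one from research.ipynb:

```python
# compare candidate models with k-fold cross-validation, then tune the best one
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from xgboost import XGBRegressor

# built-in copy of the California housing data, used only to keep the sketch runnable
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=42),
    "xgboost": XGBRegressor(random_state=42),
}
for name, estimator in candidates.items():
    rmse = -cross_val_score(estimator, X_train, y_train, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: cross-validated RMSE = {rmse:.3f}")

# tune the strongest candidate (XGBoost here) with GridSearchCV
param_grid = {"n_estimators": [200, 400], "max_depth": [4, 6], "learning_rate": [0.05, 0.1]}
search = GridSearchCV(XGBRegressor(random_state=42), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```
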
  6. TEST MODEL:
    This is all still carried out in the research.ipynb notebook, with metrics that show which model performed the best. Ideally, the data science / machine learning engineering team would have sprints set aside during development, as well as after deployment, to continue to refine the machine learning model and algorithms and optimize results. This can be done with algorithm fine-tuning, better data, more data, and more feature engineering. Time needs to be allocated for this if we want to see better results, and it should be incorporated as continuous work during an Agile sprint and treated as such.

    This is also the step where we would deploy the model, host the end-user experience, and test outputs to ensure everything will deploy as expected in the next step. This would ideally be done on a test/dev/staging environment designed to mimic the actual production environment.

    Data engineers and ML engineers would take over at this step once a working model has been produced and is ready to use. The data scientists and ML engineers should focus solely on the machine learning models, and a separate team would ideally handle deployment of the models. Additional, separate layers of coding, scripting, and pipeline work are needed to deploy the models and potentially even automate the process.

    • The data science team, with the help of data engineers, would implement the models in the form of pickle files or some other format.
    • Yes, a machine learning model, along with its results, data, and algorithm, can be saved as a neat package and applied to any new data so that no additional training is necessary (a save/load sketch follows).
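
    A minimal save/load sketch using pickle, so the trained model can be reused without retraining (the file name model.pkl is an assumption):

```python
# persist the best trained model once, then reload it anywhere without retraining
import pickle

best_model = search.best_estimator_  # e.g. the tuned model from the training sketch above

with open("model.pkl", "wb") as f:
    pickle.dump(best_model, f)

# later, in app.py or a scoring service, reload it and predict on new data
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
# predictions = model.predict(new_data)
```

    joblib.dump and joblib.load work the same way and are often preferred for large scikit-learn models.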

    Key Points for MLOPS:

    • Testing Environment: Conduct testing in a controlled staging environment mimicking production.
    • Metrics Evaluation: Track performance metrics (accuracy, bias-variance analysis) and save results in a database (e.g., MLflow or WandB).
    • Validation: Perform A/B testing and feature importance analysis to validate results against expected outcomes.
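
    A minimal sketch of logging test results to an experiment tracker, using MLflow (one of the tools mentioned above) and reusing the search object from the training sketch; by default this writes to a local mlruns directory:

```python
# record parameters, metrics, and the model artifact so runs can be compared and audited
import mlflow
import mlflow.sklearn

with mlflow.start_run(run_name="xgboost_gridsearch"):
    mlflow.log_params(search.best_params_)
    mlflow.log_metric("cv_rmse", -search.best_score_)
    mlflow.sklearn.log_model(search.best_estimator_, "model")
```
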
  7. DEPLOYMENT:
    As touched upon in the previous step's last bullet point, the data engineering team would ideally abstract this pickle file in the code so that any new model simply replaces the old one and no other portion of the established code-base needs to change. This allows the data science team or ML engineers to focus their efforts on algorithms and improving results without breaking any other section of the architecture or delivery pipeline. Good micro-services architecture and potentially object-oriented programming (OOP) on the part of the data engineers would need to be implemented here.

    These pickle files are essentially pre-trained models (the best one we want to use) that are incorporated into the code used by end-users. If the data science team finds a better model, they simply provide the new pickle file and the old file gets replaced, whereupon the end-user immediately sees better and more accurate results. The exact mechanism can change depending on what kind of infrastructure your team is using or has available.
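
    A minimal sketch of that abstraction, assuming the model path comes from configuration (the environment variable name is hypothetical), so swapping in a new pickle file requires no code changes elsewhere:

```python
# load whichever pre-trained model the configuration points at; replacing the
# pickle file (or changing MODEL_PATH) swaps models without touching other code
import os
import pickle

MODEL_PATH = os.environ.get("MODEL_PATH", "model.pkl")  # hypothetical config variable

def load_model(path: str = MODEL_PATH):
    with open(path, "rb") as f:
        return pickle.load(f)

model = load_model()
# the rest of app.py only ever calls model.predict(...), regardless of the algorithm inside
```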

    Key Points for MLOPS:

    • Containerization: Use Docker to package the model with dependencies for consistent deployment.
    • Orchestration: Deploy the container using Kubernetes to ensure scalability and resilience.
    • CI/CD Pipeline: Automate deployment with CI/CD tools (e.g., Jenkins or GitHub Actions) to push model updates seamlessly.

    For Model Monitoring and Maintenance for the MLOPS team:

    • Monitoring: Track model performance in production with drift detection, alerting for performance degradation, and logging in tools like Prometheus or Grafana.
    • Retraining Triggers: Set up conditions (e.g., data drift or accuracy drop) to trigger retraining and notify the team.
    • Model Governance: Maintain a record of deployed models, their versions, and performance metrics for audit and rollback as needed.
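
    A minimal sketch of one drift check that could feed such a retraining trigger, using a two-sample Kolmogorov-Smirnov test on a key feature (the feature name, threshold, and downstream action are illustrative assumptions):

```python
# compare the distribution of a feature in live/production data against training data;
# a small p-value suggests data drift and can trigger retraining or an alert
from scipy.stats import ks_2samp

def has_drifted(train_values, live_values, alpha=0.01):
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < alpha

# if has_drifted(train_df["median_income"], live_df["median_income"]):
#     notify_team_and_schedule_retraining()  # hypothetical downstream action
```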

    The end result of this exercise, demonstrating the process described above, can be seen by clicking the link below:
