
🌆 City Predictor 🌇: A fun and exciting machine learning project that guesses one of four cities based on UofT student survey answers 🎓. Built entirely from scratch, our custom Random Forest model boasts a high test accuracy! 🧠🔍


kxnoun/CityClassifier


🌆 City Predictor: UofT Student Survey Analysis 🌇

Welcome to the City Predictor project! This repository contains the code and data for a machine learning model that predicts cities based on survey responses from students at the University of Toronto. The model guesses one of four cities: Rio de Janeiro, Dubai, New York City, or Paris, based on answers to 10 questions.

Our model achieved a test accuracy of 87.5% (63 of 72 test examples classified correctly).

📊 Project Overview

In this project, we:

  • Collected survey data with responses to 10 questions describing a city.
  • Explored and cleaned the data, addressing missing values and outliers.
  • Tested multiple machine learning algorithms.
  • Chose the Random Forest model for its high accuracy.
  • Built a custom model from scratch, including preprocessing pipelines and hyperparameter tuning.

Feel free to explore the repository to see how different models were tested and to understand the custom implementation of our chosen model.

🚀 Getting Started

To run the code, you'll need Python 3 installed on your machine, along with pandas and numpy. The cool part of this project is that the final implementation does not use any prebuilt models or machine learning libraries such as scikit-learn!

Prerequisites

Make sure you have the following packages installed:

pip install pandas numpy

Running the Model

  1. Clone the repository:

    git clone https://github.com/kxnoun/city-predictor.git
    cd city-predictor
  2. Explore the code:

    • preprocessing.py: Functions for data preprocessing.
    • pipeline.py: Custom pipeline and transformers.
    • random_forest.py: Custom Random Forest implementation.
    • alternate_models/: Directory containing different models tried during the project.
    • main.py: Main script for training and predicting.
  3. Run the main script:

    python main.py

This will train the model on the provided dataset and make predictions based on the survey responses.

📂 Project Structure

city-predictor/
│
├── preprocessing.py
├── pipeline.py
├── random_forest.py
├── main.py
├── alternate_models/
│   ├── kNearestNeighbour.ipynb
│   ├── LogisticRegression.ipynb
│   └── ...
├── clean_dataset.csv
└── stopwords.txt

📚 Data

The dataset contains survey responses with the following features:

  • Q1-Q4: Likert scale questions (1 to 5)
  • Q5: Travel companions (one binary indicator each for Siblings, Co-worker, Partner, Friends)
  • Q6: Rankings of city aspects (1 to 6: Skyscrapers, Sport, Art and Music, Carnival, Cuisine, Economic)
  • Q7: Average temperature
  • Q8: Number of languages spoken
  • Q9: Number of fashion styles
  • Q10: Descriptive text about the city
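
As an illustration of how Q5 and Q6 might be turned into numeric features, here is a minimal sketch. The raw string formats (comma-separated companions for Q5, `Aspect=>rank` pairs for Q6) and the function names `encode_q5` and `extract_q6` are assumptions made for this example; they are not taken from `preprocessing.py`.

```python
# Hypothetical raw formats, assumed for illustration only:
#   Q5: comma-separated companions, e.g. "Friends,Partner"
#   Q6: "Aspect=>rank" pairs, e.g. "Skyscrapers=>6,Sport=>2"
Q5_CATEGORIES = ["Siblings", "Co-worker", "Partner", "Friends"]
Q6_ASPECTS = ["Skyscrapers", "Sport", "Art and Music",
              "Carnival", "Cuisine", "Economic"]


def encode_q5(raw):
    """Return one 0/1 indicator per travel-companion category."""
    chosen = {part.strip() for part in (raw or "").split(",")}
    return [1 if category in chosen else 0 for category in Q5_CATEGORIES]


def extract_q6(raw):
    """Return the rank given to each aspect, or None where it is missing."""
    ranks = {}
    for pair in (raw or "").split(","):
        name, sep, value = pair.partition("=>")
        # Keep only well-formed "name=>integer" pairs.
        if sep and value.strip().isdigit():
            ranks[name.strip()] = int(value)
    return [ranks.get(aspect) for aspect in Q6_ASPECTS]
```

For example, `encode_q5("Friends,Partner")` gives `[0, 0, 1, 1]`, following the category order listed above. Missing ranks come back as `None` so that a later imputation step can fill them.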

🛠️ Custom Pipeline

The custom pipeline includes steps like:

  • Imputing missing values.
  • Converting numeric inputs.
  • Adjusting outliers.
  • Encoding categorical data.
  • Extracting rankings.
  • Cleaning text.
  • Vectorizing text using TF-IDF.
  • Training the Random Forest model.
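
The text cleaning and TF-IDF steps can be sketched without scikit-learn roughly as follows. This is a minimal sketch, not the code in `pipeline.py`: the function names, the tokenizer regex, and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

import numpy as np


def tokenize(text, stopwords=frozenset()):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]


def fit_tfidf(docs, stopwords=frozenset()):
    """Learn the vocabulary and smoothed IDF weights from training texts."""
    tokenized = [tokenize(doc, stopwords) for doc in docs]
    # Document frequency: number of texts in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = {term: i for i, term in enumerate(sorted(doc_freq))}
    n_docs = len(docs)
    idf = np.zeros(len(vocab))
    for term, i in vocab.items():
        # Assumed smoothing: ln((1 + N) / (1 + df)) + 1, so no weight is 0.
        idf[i] = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
    return vocab, idf


def transform_tfidf(docs, vocab, idf, stopwords=frozenset()):
    """Map texts to L2-normalised TF-IDF rows; unseen words are ignored."""
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for term, count in Counter(tokenize(doc, stopwords)).items():
            if term in vocab:
                X[row, vocab[term]] = count * idf[vocab[term]]
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Leave all-zero rows (empty or fully unseen text) as zeros.
    return X / np.where(norms == 0, 1.0, norms)
```

Fitting the vocabulary and IDF weights on the training split only, then reusing them in `transform_tfidf` for test responses, keeps test data from leaking into the features. The `stopwords` argument could be loaded from a file such as `stopwords.txt`.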

📈 Model Performance

After trying various models, we found that the Random Forest model provided the highest accuracy on the test set. After fine-tuning its hyperparameters, it reached a test accuracy of 87.5% (63/72).
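
This README does not show the forest's internals. As a rough sketch of the two ideas a random forest generally relies on (bootstrap sampling for each tree and majority voting across trees), plus the accuracy calculation, consider the following. The helper names are hypothetical and are not taken from `random_forest.py`.

```python
import numpy as np


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement, as bagging does for each tree."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]


def majority_vote(tree_predictions):
    """Combine per-tree labels (n_trees x n_samples) by plurality vote."""
    tree_predictions = np.asarray(tree_predictions)
    voted = []
    for column in tree_predictions.T:
        labels, counts = np.unique(column, return_counts=True)
        # np.unique sorts labels, so ties go to the first label in sort order.
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

With 72 test examples, 63 correct predictions give `accuracy` of 63 / 72 = 0.875, which matches the reported 87.5%.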

🎉 Contributors

  • Adam: Data Pre-processing, Model Exploration (Random Forest), Model Implementation, Hyper-parameter Fine-tuning, Cross-validation
  • Neha: Data and Model Exploration (Logistic Regression, kNN), pred.py script Implementation, Report Writing
  • Katerina: Data Exploration and Visualization
  • Manahil: Model Exploration (Decision Tree), Custom RF Implementation for pred.py, Report Writing
