Movie Rating Prediction Model

Objective

This project builds a regression model that predicts a movie's rating from features such as genre, director, and actors. Using historical movie data, the model estimates the rating that users or critics give a movie. The work covers data analysis, preprocessing, feature engineering, and machine learning modeling, both to identify the factors that influence movie ratings and to produce a usable prediction model.

Data Preprocessing

Loading the Dataset

The dataset is loaded with a specified encoding to ensure proper reading of data.
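
As a rough illustration of loading with an explicit encoding, the sketch below reads a small in-memory CSV with pandas. The sample rows, the column names, and the 'latin-1' encoding are assumptions made for the example; the project's actual file and encoding are not stated here.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file. The column names
# and the 'latin-1' encoding are assumptions, not taken from the project.
raw_bytes = "Name,Genre,Rating\nAmélie,Romance,8.3\nLéon,Crime,8.5\n".encode("latin-1")

# Passing the encoding explicitly avoids decode errors or mangled characters
# when the file is not UTF-8.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(df.head())
```

With a file on disk, the same call would take the file path in place of the in-memory buffer.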

Handling Missing Values

Missing values in the 'Rating' column are filled with the mean rating.
In the other columns, missing values are imputed with the mean (numeric columns) or the most frequent value (categorical columns).
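
The imputation rules above might look roughly like this in pandas. The toy DataFrame and the 'Votes' and 'Genre' column names are hypothetical; only 'Rating' is named in this README.

```python
import numpy as np
import pandas as pd

# Toy data; 'Votes' and 'Genre' are assumed column names for illustration.
df = pd.DataFrame({
    "Rating": [7.0, np.nan, 5.0],
    "Votes": [100.0, 300.0, np.nan],
    "Genre": ["Drama", None, "Drama"],
})

# Fill the target column with its mean.
df["Rating"] = df["Rating"].fillna(df["Rating"].mean())

# Fill every other column: mean for numeric, most frequent value otherwise.
for col in df.columns.drop("Rating"):
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```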

Feature Engineering

A new feature, 'Total Actors', counts the number of actors listed for each movie.
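
One possible way to derive such a count, assuming the actors are stored in separate columns, is to count the non-missing entries per row. The 'Actor 1', 'Actor 2', and 'Actor 3' column names are assumptions; the dataset's real layout may differ.

```python
import numpy as np
import pandas as pd

# Assumed layout: one column per listed actor (hypothetical names).
actor_cols = ["Actor 1", "Actor 2", "Actor 3"]

df = pd.DataFrame({
    "Actor 1": ["A", "B", np.nan],
    "Actor 2": ["C", np.nan, np.nan],
    "Actor 3": ["D", "E", np.nan],
})

# Count the non-missing actor entries in each row.
df["Total Actors"] = df[actor_cols].notna().sum(axis=1)
print(df)
```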

Handling Non-numeric Values

Empty strings in non-numeric columns are converted to NaN so that they are treated as missing values.
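
A minimal sketch of this conversion, using made-up columns, could select the non-numeric columns and replace empty strings there:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({
    "Genre": ["Drama", "", "Comedy"],
    "Director": ["X", "Y", ""],
    "Year": [2001, 2005, 2010],
})

# Restrict the replacement to non-numeric columns.
text_cols = df.select_dtypes(exclude="number").columns
df[text_cols] = df[text_cols].replace("", np.nan)

print(df.isna().sum())
```

After this step the empty cells are picked up by the same imputation logic as any other missing value.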

Model Building and Evaluation

Splitting the Data

The dataset is split into training and testing sets.
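
The split could be done with scikit-learn's train_test_split, as sketched below. The 80/20 ratio, the random seed, and the feature columns are assumptions; the README does not state the values actually used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; all column names except 'Rating' are hypothetical.
df = pd.DataFrame({
    "Votes": range(10),
    "Genre": ["Drama", "Comedy"] * 5,
    "Rating": [6.1, 7.2, 5.5, 8.0, 6.8, 7.4, 4.9, 6.3, 7.7, 5.8],
})

X = df.drop(columns="Rating")  # features
y = df["Rating"]               # target

# Assumed 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```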

Preprocessing Pipeline

Numeric features are scaled, and missing values are imputed with the mean.
Categorical features are one-hot encoded, and missing values are imputed with the most frequent value.
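
A preprocessing pipeline along these lines can be assembled from scikit-learn's ColumnTransformer, SimpleImputer, StandardScaler, and OneHotEncoder. This is a sketch under assumptions: the feature lists are hypothetical, and imputing before scaling is one reasonable ordering rather than a confirmed detail of the project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature lists.
numeric_features = ["Votes", "Total Actors"]
categorical_features = ["Genre"]

# Numeric branch: mean imputation, then standard scaling.
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical branch: most-frequent imputation, then one-hot encoding.
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

X = pd.DataFrame({
    "Votes": [100.0, np.nan, 300.0, 200.0],
    "Total Actors": [3, 2, 3, 1],
    "Genre": pd.Series(["Drama", np.nan, "Comedy", "Drama"], dtype=object),
})

X_transformed = preprocessor.fit_transform(X)
print(X_transformed.shape)  # 2 numeric columns + 2 one-hot columns
```

Setting handle_unknown="ignore" is a design choice for the sketch: categories that appear only in the test set are encoded as all zeros instead of raising an error.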

Model Selection

Four regression models are selected for evaluation:

Linear Regression
Ridge Regression
Lasso Regression
Random Forest Regressor

Cross-Validation

Each model is evaluated using 5-fold cross-validation to assess its performance on the training data.
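
The four models and the 5-fold evaluation could be wired together as follows. The synthetic data and all hyperparameters (alpha values, number of trees, seeds) are assumptions for illustration, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic training data standing in for the preprocessed movie features.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hyperparameters here are illustrative defaults, not tuned values.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=20, random_state=0),
}

# Mean R-squared across the 5 folds for each model.
cv_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean CV R-squared = {cv_r2[name]:.3f}")
```

On the real dataset, each estimator would typically be wrapped in a Pipeline together with the preprocessing step, so that imputation and scaling are fitted only on the training folds.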

Model Fitting and Testing

Each model is then fitted on the full training set and evaluated on the held-out test set using Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2 score).
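
The fit-and-test step might be sketched like this for a single model. Ridge is used only as an example, and the data is synthetic; the same pattern would apply to each of the four models.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the preprocessed movie features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the full training set, then predict on the held-out test set.
model = Ridge(alpha=1.0).fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```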

Visualizations

The following visualizations are generated to provide insight into the models' performance and the characteristics of the data:

Distribution of Movie Ratings

A histogram showing the distribution of movie ratings in the dataset.

Correlation Matrix

A heatmap showing the correlation between numeric features.

Comparison of R-squared Scores

A bar plot comparing the cross-validated R-squared scores of different models.

Actual vs. Predicted Ratings

A scatter plot showing the relationship between actual and predicted ratings for the test set.

Residual Plot

A scatter plot showing the residuals (errors) of the predicted ratings.
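
As a sketch of how the last two plots could be produced, the example below draws the actual-vs-predicted scatter and the residual plot side by side with matplotlib. The rating values and the output file name are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical test-set ratings and predictions.
y_test = np.array([6.5, 7.2, 5.8, 8.1])
y_pred = np.array([6.0, 6.4, 6.1, 6.3])
residuals = y_test - y_pred

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs. predicted: points on the dashed diagonal are perfect predictions.
ax1.scatter(y_test, y_pred)
lims = [y_test.min(), y_test.max()]
ax1.plot(lims, lims, linestyle="--")
ax1.set_xlabel("Actual rating")
ax1.set_ylabel("Predicted rating")
ax1.set_title("Actual vs. Predicted Ratings")

# Residual plot: residuals should scatter around zero without a clear pattern.
ax2.scatter(y_pred, residuals)
ax2.axhline(0, linestyle="--")
ax2.set_xlabel("Predicted rating")
ax2.set_ylabel("Residual (actual - predicted)")
ax2.set_title("Residual Plot")

fig.tight_layout()
fig.savefig("prediction_diagnostics.png")  # hypothetical output file name
```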

Outputs

The key outputs from the model evaluation are:

Cross-validated R-squared: The R-squared score obtained from 5-fold cross-validation on the training data. R-squared measures the proportion of variance in the ratings that the features explain; a higher value indicates better model performance.
Mean Squared Error (MSE): The average squared difference between the predicted and actual ratings on the test set. A lower MSE indicates better model performance.
Root Mean Squared Error (RMSE): The square root of MSE, which expresses the prediction error in the same units as the ratings.
R-squared (R2 Score): The same R-squared measure computed on the test set. A value near 1 indicates a close fit; a value near 0 means the model does little better than always predicting the mean rating.
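
To make the metric definitions concrete, here is a small worked example computed by hand with NumPy. The four ratings are invented for the arithmetic and are not results from this project.

```python
import numpy as np

# Invented ratings, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_pred                       # [1, 0, -1, 0]
mse = np.mean(residuals ** 2)                     # (1 + 0 + 1 + 0) / 4 = 0.5
rmse = np.sqrt(mse)                               # sqrt(0.5), about 0.707

ss_res = np.sum(residuals ** 2)                   # residual sum of squares = 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares = 20
r2 = 1 - ss_res / ss_tot                          # 1 - 2/20 = 0.9

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

If the predictions were all equal to the mean rating (6.0 here), ss_res would equal ss_tot and R-squared would be exactly 0.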

Conclusion

The models built in this project aim to predict movie ratings from features such as genre, director, and actors. The evaluation metrics show that their predictive power is weak: R-squared values are close to zero, meaning the models explain little of the variance in ratings. This suggests that, although these features may contribute to movie ratings, factors not captured in this dataset likely play a significant role. Further feature engineering, data enrichment, and model tuning could improve the accuracy of the predictions.

The project demonstrates the full workflow for predicting movie ratings, from data analysis and preprocessing through feature engineering and machine learning modeling, and provides a foundation for further exploration and improvement.
