The objective of this project is to build a model that predicts the rating of a movie based on features such as genre, director, and actors. By analyzing historical movie data, we aim to develop a regression model that accurately estimates the rating given to a movie by users or critics. This project involves data analysis, preprocessing, feature engineering, and machine learning modeling techniques to gain insights into the factors that influence movie ratings and build a reliable prediction model.
The dataset is loaded with an explicitly specified character encoding so that non-ASCII text, such as accented names, is read correctly.
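A minimal sketch of loading with an explicit encoding is shown below. The source does not name the file or the encoding, so `latin-1`, the in-memory stand-in for the file, and the titles in it are all assumptions made for illustration.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: made-up rows encoded as Latin-1.
# Both the encoding and the column names are assumptions.
raw_bytes = (
    "Name,Genre,Rating\n"
    "Café Noir,Drama,7.1\n"
    "Señor Lobo,Action,6.4\n"
).encode("latin-1")

# Passing the matching encoding keeps accented characters intact.
# With a real file this would be pd.read_csv(path, encoding="latin-1").
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(df)
```

If the encoding passed to `read_csv` does not match the file, pandas either raises a decoding error or silently produces garbled characters, which is why the encoding is set explicitly.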
- The 'Rating' column's missing values are filled with the mean rating.
- Missing values in other columns are imputed with the mean for numeric columns and the most frequent value for categorical columns.
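The two imputation rules above can be sketched with pandas as follows. The column names other than 'Rating' and all values are made up for illustration; the project's actual code may differ.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns; values are invented for illustration.
df = pd.DataFrame({
    "Rating": [7.0, np.nan, 5.0],
    "Votes": [100.0, 300.0, np.nan],
    "Genre": ["Drama", np.nan, "Drama"],
})

# Fill missing ratings with the mean rating.
df["Rating"] = df["Rating"].fillna(df["Rating"].mean())

# Numeric columns get the column mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Categorical columns get the most frequent value (the mode).
categorical_cols = df.select_dtypes(exclude="number").columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```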
A new feature 'Total Actors' is created to capture the number of actors listed in each movie.
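One way to derive such a count is sketched below, assuming the dataset stores one actor per column. The column names 'Actor 1' to 'Actor 3' and the actor names are hypothetical; only 'Total Actors' comes from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: one column per billed actor.
actor_cols = ["Actor 1", "Actor 2", "Actor 3"]
df = pd.DataFrame({
    "Actor 1": ["A. Rao", "B. Sen", np.nan],
    "Actor 2": ["C. Das", np.nan, np.nan],
    "Actor 3": ["D. Roy", np.nan, np.nan],
})

# Count the non-missing actor entries in each row.
df["Total Actors"] = df[actor_cols].notna().sum(axis=1)
print(df)
```

A count like this is only informative when taken before missing actor entries are imputed, since imputation fills every slot.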
Empty strings in non-numeric columns are converted to NaN so that they are treated as missing values.
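A short sketch of this conversion follows. The column names are hypothetical, and treating whitespace-only strings as empty is an added assumption that the source does not state.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns.
df = pd.DataFrame({
    "Director": ["X. Example", "", "  "],
    "Genre": ["Drama", "Action", ""],
    "Votes": [10, 20, 30],
})

# Only non-numeric columns are touched.
text_cols = df.select_dtypes(exclude="number").columns

# The regex matches empty strings and (by assumption) whitespace-only strings.
df[text_cols] = df[text_cols].replace(r"^\s*$", np.nan, regex=True)
print(df.isna())
```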
The dataset is split into training and testing sets.
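The split can be sketched with scikit-learn's `train_test_split`. The source gives neither the split ratio nor a random seed, so `test_size=0.2` and `random_state=42` are assumed values, and the data is invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with hypothetical feature columns and made-up values.
df = pd.DataFrame({
    "Genre": ["Drama", "Action", "Comedy", "Drama", "Action",
              "Comedy", "Drama", "Action", "Comedy", "Drama"],
    "Votes": [120, 340, 90, 560, 75, 410, 230, 150, 300, 60],
    "Rating": [6.1, 7.2, 5.4, 8.0, 4.9, 6.8, 7.1, 5.5, 6.3, 5.0],
})

# Separate the features from the target.
X = df.drop(columns="Rating")
y = df["Rating"]

# Assumed 80/20 split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```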
- Numeric features are scaled, and missing values are imputed with the mean.
- Categorical features are one-hot encoded, and missing values are imputed with the most frequent value.
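The preprocessing described above maps naturally onto a scikit-learn `ColumnTransformer`. The sketch below is one possible arrangement, not necessarily the project's: the choice of `StandardScaler`, the order of the steps (impute first, then scale or encode), and the column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy features with hypothetical columns; object dtype keeps the gap as np.nan.
X = pd.DataFrame({
    "Votes": [100.0, np.nan, 300.0, 200.0],
    "Genre": pd.Series(["Drama", "Action", np.nan, "Drama"], dtype=object),
})

# Numeric branch: mean imputation, then scaling (scaler choice is assumed).
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical branch: most-frequent imputation, then one-hot encoding.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, ["Votes"]),
    ("cat", categorical_pipe, ["Genre"]),
])

X_ready = preprocessor.fit_transform(X)
print(X_ready.shape)
```

Setting `handle_unknown="ignore"` lets the encoder cope with categories, such as a director seen only in the test set, that were absent during fitting.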
Four regression models are selected for evaluation:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
Each model is evaluated using 5-fold cross-validation to assess its performance on the training data.
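The cross-validation loop over the four models can be sketched as below. The data is synthetic and stands in for the preprocessed training set, and the hyperparameters are library defaults or assumed values, since the source does not list them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed training data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
y_train = X_train @ np.array([1.0, -0.5, 0.0, 0.25]) + rng.normal(scale=0.5, size=100)

# Hyperparameters below are assumptions, not values from the project.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# 5-fold cross-validation, scored with R-squared, on the training data only.
cv_r2 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

In practice the preprocessing steps are often wrapped together with the model in a `Pipeline` so that they are refitted inside each fold; the source does not say whether this was done here.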
Models are fitted on the full training set and evaluated on the test set using metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2 Score).
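The test-set evaluation can be sketched as follows for a single model. The data is synthetic, and Ridge with `alpha=1.0` is used only as an example; the same three calls apply to each of the four models.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed dataset.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.8, -0.3, 0.1]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the full training set, then predict on the held-out test set.
model = Ridge(alpha=1.0).fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the ratings
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```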
The following visualizations are generated to provide insights into the models' performance and the data's characteristics:
- A histogram showing the distribution of movie ratings in the dataset.
- A heatmap showing the correlation between numeric features.
- A bar plot comparing the cross-validated R-squared scores of different models.
- A scatter plot showing the relationship between actual and predicted ratings for the test set.
- A scatter plot showing the residuals (errors) of the predicted ratings.
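As an illustration, the last two plots in the list could be drawn with matplotlib along the following lines. The data is synthetic, the axis labels are illustrative, and defining the residual as actual minus predicted is an assumed sign convention.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the test-set ratings and model predictions.
rng = np.random.default_rng(2)
y_test = rng.uniform(1, 10, size=50)
y_pred = y_test + rng.normal(scale=1.0, size=50)
residuals = y_test - y_pred  # assumed convention: actual minus predicted

fig, (ax_fit, ax_res) = plt.subplots(1, 2, figsize=(10, 4))

# Actual vs. predicted: points on the dashed diagonal are perfect predictions.
ax_fit.scatter(y_test, y_pred, alpha=0.6)
limits = [y_test.min(), y_test.max()]
ax_fit.plot(limits, limits, linestyle="--", color="gray")
ax_fit.set(xlabel="Actual rating", ylabel="Predicted rating",
           title="Actual vs. predicted")

# Residuals: a structureless band around zero suggests no systematic bias.
ax_res.scatter(y_pred, residuals, alpha=0.6)
ax_res.axhline(0, linestyle="--", color="gray")
ax_res.set(xlabel="Predicted rating", ylabel="Residual", title="Residuals")

fig.tight_layout()
```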
The key outputs from the model evaluation are:
- Cross-validated R-squared: The R-squared score computed across the 5 cross-validation folds on the training data. R-squared measures the proportion of variance in the ratings that is explained by the features, so a higher value indicates better model performance.
- Mean Squared Error (MSE): Measures the average squared difference between the predicted and actual ratings. Lower MSE indicates better model performance.
- Root Mean Squared Error (RMSE): The square root of MSE, providing a measure of prediction error in the same units as the ratings.
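- R-squared (R2 Score): The proportion of variance in the ratings explained by the model on the held-out test set. A value of 1 indicates perfect predictions, while a value near 0 means the model does no better than always predicting the mean rating.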
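To make these definitions concrete, the metrics can be computed by hand with NumPy. The ratings below are made up for the example.

```python
import numpy as np

# Made-up actual and predicted ratings.
y_true = np.array([6.0, 7.5, 5.0, 8.0])
y_pred = np.array([6.5, 7.0, 5.5, 7.0])

errors = y_true - y_pred

# MSE: average of the squared errors.
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, in the same units as the ratings.
rmse = np.sqrt(mse)

# R-squared: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Baseline: always predicting the mean rating gives an R-squared of 0.
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / ss_tot

print(f"MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}  baseline R2={r2_baseline:.4f}")
```

The baseline line shows why an R-squared near zero signals weak predictive power: the model is then roughly equivalent to predicting the mean rating for every movie.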
The models built in this project aim to predict movie ratings based on features like genre, director, and actors. The evaluation metrics show that their predictive power is weak: R-squared values close to zero mean the models explain very little of the variance in ratings and do barely better than always predicting the mean rating. This suggests that, while these features may contribute to movie ratings, other factors not captured in this dataset may play significant roles. Further feature engineering, data enrichment, and model tuning could improve the accuracy of these predictions.
The project demonstrates an end-to-end workflow for predicting movie ratings, from data analysis and preprocessing through feature engineering and model evaluation, and provides a foundation for further exploration and improvement.