This project was started by team Datastic for the module BT4222 Mining Web Data for Business Insights at the National University of Singapore. Movie data provided by MovieLens is used along with movie plots scraped from Wikipedia. The aim of this project is to discover the characteristics of certain categories of audience and to build a movie recommender model that recommends movies based on audience preferences.
- Ang Kian Hwee (A0150445M)
- Lee Ying Yang (A0170208N)
- Zhai Chen (A0156995H)
- Tang Wenqian (A0161814J)
Google Drive of all datasets here
Note that not all datasets are uploaded to GitHub due to size limitations
Setup
- pip install the following packages (if not yet installed, from requirements.txt):
- afinn
- textblob
- colour (for bar chart's bar colour)
- prettytable
- nltk
- Download the popular nltk corpora using nltk.download("popular") (see the setup sketch after this list)
- Ignore the Google Drive mounting step and the path change
- Set file paths for users.csv and user_ratings.csv
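A minimal setup sketch, assuming the packages above are listed in requirements.txt; the data/ paths are placeholders and should point to wherever the CSVs live locally:

```python
# Install dependencies first, e.g.:
#   pip install -r requirements.txt
import nltk

# Download the "popular" collection of NLTK corpora and models used by the notebook
nltk.download("popular")

# Placeholder file paths -- point these at your local copies of the datasets
USERS_PATH = "data/users.csv"
USER_RATINGS_PATH = "data/user_ratings.csv"
```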
Movie Metadata Cleaning
This step was run beforehand. Do not run this section, as genome-scores.csv is not uploaded due to GitHub's file size limit.
- Set file paths for the following files:
- movie_data_merged_v1.csv
- genome-scores.csv
- genome_tags.csv
- Delete the leftover index column 'Unnamed: 0' after reading in movie_data_merged_v1.csv
- Merge the genome tags and genome scores dataframes and keep tags with a relevance score of 70% and above
- Merge master_movies_df with the filtered tags from the previous step
- Run Text Cleaning on the genres, Casts and Director columns in master_movies_df
- Join the first names and last names of Casts and Director respectively using the join_name function (a pandas sketch of these steps follows this list)
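A rough pandas sketch of the tag filtering and merging described above. The column names (movieId, tagId, relevance, tag) follow the standard MovieLens genome files and are assumptions about the notebook's actual code; how the kept tags are aggregated per movie is also an assumption:

```python
import pandas as pd

master_movies_df = pd.read_csv("movie_data_merged_v1.csv")
# Drop the leftover index column written by an earlier to_csv call
master_movies_df = master_movies_df.drop(columns=["Unnamed: 0"])

genome_scores = pd.read_csv("genome-scores.csv")  # movieId, tagId, relevance
genome_tags = pd.read_csv("genome_tags.csv")      # tagId, tag

# Attach the tag text to each score and keep only tags with relevance >= 0.70
tags = genome_scores.merge(genome_tags, on="tagId")
tags = tags[tags["relevance"] >= 0.70]

# Collapse the surviving tags into one string per movie, then merge into the master frame
movie_tags = tags.groupby("movieId")["tag"].apply(" ".join).reset_index()
master_movies_df = master_movies_df.merge(movie_tags, on="movieId", how="left")
```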
Datasets Overview & Merging
- Run this section of code to get basic statistics of the Movie, User and Rating dataframes
- This section is needed to initialise our master_data dataframe, which is used for analysis later on (a sketch of the merge follows)
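A hedged sketch of what the overview and master_data merge might look like; the join keys (userId, movieId) follow the MovieLens schema and are assumptions about the notebook's code:

```python
import pandas as pd

users = pd.read_csv("users.csv")
ratings = pd.read_csv("user_ratings.csv")
movies = pd.read_csv("movie_data_merged_v1.csv")

# Basic statistics for each dataframe
for name, df in [("users", users), ("ratings", ratings), ("movies", movies)]:
    print(name, df.shape)
    print(df.describe(include="all"))

# Join ratings to user attributes and movie metadata to form the analysis table
master_data = ratings.merge(users, on="userId").merge(movies, on="movieId")
```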
Exploratory Data Analysis on the population
- This section consists of several subsections that help uncover insights for our problem statement
- Running these subsections is straightforward; the user should not run into problems unless the required variables have not been initialised
Setup
- pip install the following packages (if not yet installed, from requirements.txt):
- afinn
- textblob
- colour (for bar chart's bar colour)
- prettytable
- community
- networkx
- Ignore the Google Drive mounting step and the path change
- Set file paths for the following files:
- users.csv
- movie_data_merged_v2.csv
- user_ratings.csv
- Run the Read in Datasets block of code
Network Building
- Run code blocks 1 to 4 to add the nodes and edges to the network
- Run the ForceAtlas2 algorithm. This might take quite a long time.
- Run the remaining code blocks under the Build Network section
- Run the code blocks under Centralities Measures to get various statistics about the network (a sketch of the layout and centrality steps follows this list)
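A sketch of the layout and centrality steps, using a built-in toy graph in place of the user network built in code blocks 1 to 4. It assumes the fa2 package supplies the ForceAtlas2 layout; how the notebook actually runs ForceAtlas2 may differ:

```python
import networkx as nx
from fa2 import ForceAtlas2  # assumption: the fa2 package is used for the layout

# Stand-in for the user network built in code blocks 1 to 4
G = nx.karate_club_graph()

# ForceAtlas2 layout -- this is the slow step on a large graph
layout = ForceAtlas2().forceatlas2_networkx_layout(G, pos=None, iterations=2000)

# Centrality measures of the kind reported under Centralities Measures
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
```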
Community Formation
- First, split the network into clusters using the community package (.best_partition) under the Community Detection section
- Draw out the network with the communities formed, using colour coding
- Save the community list of users into a json file for use in the Community Study notebook (a sketch follows this list)
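A minimal sketch of the community detection and export steps, again on a toy graph; the exact JSON structure written by the notebook may differ:

```python
import json
import community  # python-louvain
import matplotlib.pyplot as plt
import networkx as nx

G = nx.karate_club_graph()  # stand-in for the user network

# Louvain partition: maps each node to a community id
partition = community.best_partition(G)

# Colour-code nodes by community when drawing the network
colours = [partition[node] for node in G.nodes()]
nx.draw_networkx(G, node_color=colours, cmap=plt.cm.tab10, with_labels=False, node_size=30)
plt.show()

# Persist the user -> community mapping for the Community Study notebook
with open("community_list.json", "w") as f:
    json.dump(partition, f)
```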
Setup
- pip install the following packages (if not yet installed, from requirements.txt):
- afinn
- textblob
- colour (for bar chart's bar colour)
- prettytable
- nltk
- Download the popular nltk corpora using nltk.download("popular")
- Ignore the Google Drive mounting step and the path change
- Set file paths for the following files:
- users.csv
- movie_data_merged_v2.csv
- user_ratings.csv
- community_list.json
- Run the Read in Datasets block of code
- Run all pre-defined functions and variables that will be used for analysis later on
- A few key functions/variables to take note of:
- plot_ratings_genres_dist - plot the distribution of genres for each rating score
- plot_user_distribution - plot the distribution of a predictor variable
- plot_genre_dist_over_var - plot distribution of genres over each category of a predictor variable
- plot_rating_dist_over_var - plot distribution of ratings over each category of a predictor variable
- com_movie_var - returns a list of genres which the audience has watched
- get_user_rating - returns a dataframe of user-movieId pairs that have a particular rating score
- com_giant_tags_5stars/com_giant_tags_1stars - lists of 4 strings consisting of concatenated genome tags for movies with either rating-5 or rating-1
- get_keywords_community2_tags - prints the top 10 tags with the highest TF-IDF values in that community. These are the tags that define the movies given a particular rating.
- get_tfidf_string - returns a dataframe of words with their TF-IDF values. Used to plot word clouds (see the illustrative sketch after this list).
- comX_users_movies - dataframe of user-movieId pairs belonging to community X, where X is 1 to 4
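get_tfidf_string and get_keywords_community2_tags are the notebook's own helpers; the sketch below only illustrates the same idea with scikit-learn's TfidfVectorizer, on made-up tag strings:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for concatenated genome-tag strings (e.g. com_giant_tags_5stars)
docs = [
    "dark thriller twist ending atmospheric",
    "feel good romance comedy light hearted",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# TF-IDF value of every tag word in the first document, sorted descending
scores = pd.DataFrame(
    {"word": vectorizer.get_feature_names_out(), "tfidf": tfidf.toarray()[0]}
).sort_values("tfidf", ascending=False)
print(scores.head(10))  # top 10 tags, as in get_keywords_community2_tags
```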
Community Analysis
- Run the code for the different types of study:
- Distribution Study
- Sentiment analysis on Genome Tags
- Tags Study
- Movie Metadata Study (Casts & Director)
- Each Study section should run without problems, provided all pre-defined functions and variables have been initialised (a sentiment-scoring sketch follows)
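A small sketch of how tag sentiment could be scored with the afinn and textblob packages listed in Setup; the actual scoring logic in the notebook may differ:

```python
from afinn import Afinn
from textblob import TextBlob

# Made-up stand-in for one community's concatenated genome tags
tag_string = "dark gritty depressing violent"

# AFINN: sum of word-level valence scores (negative total = negative sentiment)
afinn_score = Afinn().score(tag_string)

# TextBlob polarity: a float in [-1, 1]
polarity = TextBlob(tag_string).sentiment.polarity

print(afinn_score, polarity)
```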
Setup
- Set file paths for users.csv, movie_data_merged_v2.csv and user_ratings.csv
- Run the data merging, rating bucketization, and train/validation/test split.
- Run each of the data preprocessing steps; these fit the data transformers and transform the training data. The data_pipe function then combines all preprocessing steps and transformers into a single function, which is then applied to the validation and test datasets (a sketch of this pattern follows this list).
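data_pipe is the notebook's own function; the sketch below only illustrates the general pattern of fitting transformers on the training split and reusing them on the validation and test splits, with made-up columns and bucket thresholds:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up master table
master = pd.DataFrame({
    "age":    [23, 45, 31, 52, 19, 40, 28, 60],
    "rating": [5.0, 2.0, 4.0, 1.5, 3.5, 4.5, 3.0, 2.5],
})

# Bucketize ratings into classes (thresholds are illustrative only)
master["rating_bucket"] = pd.cut(master["rating"], bins=[0, 2.5, 3.5, 5],
                                 labels=["low", "mid", "high"])

train, test = train_test_split(master, test_size=0.25, random_state=42)
train, val = train_test_split(train, test_size=0.25, random_state=42)

# Fit transformers on the training split only
scaler = StandardScaler().fit(train[["age"]])

def data_pipe(df):
    """Apply every fitted transformer to a new split (mirrors the data_pipe idea)."""
    out = df.copy()
    out[["age"]] = scaler.transform(out[["age"]])
    return out

val, test = data_pipe(val), data_pipe(test)
```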
Modelling
- Naive Bayes and logistic regression do not take much training time.
- However, XGBoost does take some time, so rerunning it is not recommended.
- Right after training XGBoost, there is a section to save and reload the model. Depending on whether you are running directly after training or loading a pre-saved model, uncomment and run only the corresponding code blocks. It is recommended to use the loaded model from here on instead of rerunning XGBoost; all diagnostic steps should then work, provided the data has already been processed and the model is loaded (a save/load sketch follows this list).
- Grid search logs are at the bottom of the notebook due to their length. Re-running them is not recommended.
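A sketch of the save-then-reload pattern so the expensive XGBoost fit never has to be repeated; whether the notebook uses joblib or another serializer is an assumption:

```python
import joblib
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=6)
# model.fit(X_train, y_train)  # the slow step; run once

# Save right after training
joblib.dump(model, "xgb_model.joblib")

# In later sessions, load the pre-saved model instead of rerunning XGBoost
model = joblib.load("xgb_model.joblib")
```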
- Load the data and run the merging, similar to predictive modelling
- Split into train/validation/test sets, and convert each into the format required by the surprise package
- The package used is surprise
- The function diagnose_reccomendation is used to check the accuracy and F1-score of each trained model; make sure to load it before proceeding.
- Each of the modelling steps can be run without issues.
- SVD++ takes extremely long to run; running it is not recommended
- SVD-tuned takes around a few minutes to run.
- The rest of the modelling runs relatively quickly.
- The model is saved at the end, using the package's dump module.
- Trained models can also be loaded from there (a sketch is shown below).
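A minimal sketch of the surprise workflow, including saving and reloading with the dump module; the column names in user_ratings.csv are assumptions:

```python
import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy, dump
from surprise.model_selection import train_test_split

ratings = pd.read_csv("user_ratings.csv")  # assumed columns: userId, movieId, rating

reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

# Save and reload the trained model with the package's dump module
dump.dump("svd_model.dump", algo=algo)
_, algo = dump.load("svd_model.dump")
```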