This project was started by team Datastic for the module BT4222 Mining Web Data for Business Insights at the National University of Singapore. Movie data provided by MovieLens is used along with movie plots scraped from Wikipedia. The aim of this project is to discover the characteristics of certain categories of audience and to build a movie recommender model that recommends movies based on audience preferences.
- Ang Kian Hwee (A0150445M)
- Lee Ying Yang (A0170208N)
- Zhai Chen (A0156995H)
- Tang Wenqian (A0161814J)
Google Drive of all datasets here
Note that not all datasets are uploaded to GitHub due to size limitations
Setup
- pip install the following packages (if not yet installed, from requirements.txt):
- afinn
- textblob
- colour (for bar chart's bar colour)
- prettytable
- nltk
- Download the popular nltk corpora using nltk.download("popular") (see the setup sketch after this list)
- Ignore the Google Drive mounting step and the path change
- Set file paths for users.csv and user_ratings.csv
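A minimal setup sketch, assuming the packages above are listed in requirements.txt; the data/ paths are placeholders and should point to wherever the CSVs live locally:

```python
# Install dependencies first, e.g.:
#   pip install -r requirements.txt
import nltk

# Download the "popular" collection of NLTK corpora and models used by the notebook
nltk.download("popular")

# Placeholder file paths -- point these at your local copies of the datasets
USERS_PATH = "data/users.csv"
USER_RATINGS_PATH = "data/user_ratings.csv"
```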
Movie Metadata Cleaning
This step was run beforehand. Do not run this section, as genome-scores.csv is not uploaded due to GitHub's file size limit.
- Set file paths for the following files:
- movie_data_merged_v1.csv
- genome-scores.csv
- genome_tags.csv
- Delete the leftover index column 'Unnamed: 0' after reading in movie_data_merged_v1.csv
- Merge the genome tags and genome scores dataframes and keep tags with a relevance score of 70% and above
- Merge master_movies_df with the filtered tags from the previous step
- Run Text Cleaning on the genres, Casts and Director columns in master_movies_df
- Join the first names and last names of Casts and Director respectively using the join_name function (a pandas sketch of these steps follows this list)
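A rough pandas sketch of the tag filtering and merging described above. The column names (movieId, tagId, relevance, tag) follow the standard MovieLens genome files and are assumptions about the notebook's actual code; how the kept tags are aggregated per movie is also an assumption:

```python
import pandas as pd

master_movies_df = pd.read_csv("movie_data_merged_v1.csv")
# Drop the leftover index column written by an earlier to_csv call
master_movies_df = master_movies_df.drop(columns=["Unnamed: 0"])

genome_scores = pd.read_csv("genome-scores.csv")  # movieId, tagId, relevance
genome_tags = pd.read_csv("genome_tags.csv")      # tagId, tag

# Attach the tag text to each score and keep only tags with relevance >= 0.70
tags = genome_scores.merge(genome_tags, on="tagId")
tags = tags[tags["relevance"] >= 0.70]

# Collapse the surviving tags into one string per movie, then merge into the master frame
movie_tags = tags.groupby("movieId")["tag"].apply(" ".join).reset_index()
master_movies_df = master_movies_df.merge(movie_tags, on="movieId", how="left")
```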
Datasets Overview & Merging
- Run this section of code to get basic statistics of the Movie, User and Rating dataframes
- This section is needed to initialise our master_data dataframe, which is used for analysis later on (a sketch of the merge follows)
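A hedged sketch of what the overview and master_data merge might look like; the join keys (userId, movieId) follow the MovieLens schema and are assumptions about the notebook's code:

```python
import pandas as pd

users = pd.read_csv("users.csv")
ratings = pd.read_csv("user_ratings.csv")
movies = pd.read_csv("movie_data_merged_v1.csv")

# Basic statistics for each dataframe
for name, df in [("users", users), ("ratings", ratings), ("movies", movies)]:
    print(name, df.shape)
    print(df.describe(include="all"))

# Join ratings to user attributes and movie metadata to form the analysis table
master_data = ratings.merge(users, on="userId").merge(movies, on="movieId")
```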
Exploratory Data Analysis on the population
- This section consists of several subsections that help uncover insights for our problem statement
- Running these subsections is straightforward; the user should not run into problems unless the required variables have not been initialised
Setup
- pip install the following packages (if not yet installed, from requirements.txt):
- afinn
- textblob
- colour (for bar chart's bar colour)
- prettytable
- community
- networkx
- Ignore the Google Drive mounting step and the path change
- Set file paths for the following files:
- users.csv
- movie_data_merged_v2.csv
- user_ratings.csv
- Run the Read in Datasets block of code
Network Building
- Run code blocks 1 to 4 to add the nodes and edges to the network
- Run the ForceAtlas2 algorithm. This might take quite a long time.
- Run the remaining code blocks under the Build Network section
- Run the code blocks under Centralities Measures to get various statistics about the network (a sketch of the layout and centrality steps follows this list)
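A sketch of the layout and centrality steps, using a built-in toy graph in place of the user network built in code blocks 1 to 4. It assumes the fa2 package supplies the ForceAtlas2 layout; how the notebook actually runs ForceAtlas2 may differ:

```python
import networkx as nx
from fa2 import ForceAtlas2  # assumption: the fa2 package is used for the layout

# Stand-in for the user network built in code blocks 1 to 4
G = nx.karate_club_graph()

# ForceAtlas2 layout -- this is the slow step on a large graph
layout = ForceAtlas2().forceatlas2_networkx_layout(G, pos=None, iterations=2000)

# Centrality measures of the kind reported under Centralities Measures
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
```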
Community Formation
- First, split the network into clusters using the community package (.best_partition) under the Community Detection section
- Draw out the network with the communities formed, using colour coding
- Save the community list of users into a json file for use in the Community Study notebook (a sketch follows this list)
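A minimal sketch of the community detection and export steps, again on a toy graph; the exact JSON structure written by the notebook may differ:

```python
import json
import community  # python-louvain
import matplotlib.pyplot as plt
import networkx as nx

G = nx.karate_club_graph()  # stand-in for the user network

# Louvain partition: maps each node to a community id
partition = community.best_partition(G)

# Colour-code nodes by community when drawing the network
colours = [partition[node] for node in G.nodes()]
nx.draw_networkx(G, node_color=colours, cmap=plt.cm.tab10, with_labels=False, node_size=30)
plt.show()

# Persist the user -> community mapping for the Community Study notebook
with open("community_list.json", "w") as f:
    json.dump(partition, f)
```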
Setup
- pip install the following packages (if not yet installed, from requirements.txt):
- afinn
- textblob
- colour (for bar chart's bar colour)
- prettytable
- nltk
- Download the popular nltk corpora using nltk.download("popular")
- Ignore the Google Drive mounting step and the path change
- Set file paths for the following files:
- users.csv
- movie_data_merged_v2.csv
- user_ratings.csv
- community_list.json
- Run the Read in Datasets block of code
- Run all pre-defined functions and variables that will be used for analysis later on
- A few key functions/variables to take note of:
- plot_ratings_genres_dist - plot the distribution of genres for each rating score
- plot_user_distribution - plot the distribution of a predictor variable
- plot_genre_dist_over_var - plot distribution of genres over each category of a predictor variable
- plot_rating_dist_over_var - plot distribution of ratings over each category of a predictor variable
- com_movie_var - returns a list of genres which the audience has watched
- get_user_rating - returns a dataframe of user-movieId pairs that have a particular rating score
- com_giant_tags_5stars/com_giant_tags_1stars - lists of 4 strings consisting of concatenated genome tags for movies with either rating-5 or rating-1
- get_keywords_community2_tags - prints the top 10 tags with the highest TF-IDF values in that community. These are the tags that define the movies given a particular rating.
- get_tfidf_string - returns a dataframe of words with their TF-IDF values. Used to plot word clouds (see the illustrative sketch after this list).
- comX_users_movies - dataframe of user-movieId pairs belonging to community X, where X is 1 to 4
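get_tfidf_string and get_keywords_community2_tags are the notebook's own helpers; the sketch below only illustrates the same idea with scikit-learn's TfidfVectorizer, on made-up tag strings:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for concatenated genome-tag strings (e.g. com_giant_tags_5stars)
docs = [
    "dark thriller twist ending atmospheric",
    "feel good romance comedy light hearted",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# TF-IDF value of every tag word in the first document, sorted descending
scores = pd.DataFrame(
    {"word": vectorizer.get_feature_names_out(), "tfidf": tfidf.toarray()[0]}
).sort_values("tfidf", ascending=False)
print(scores.head(10))  # top 10 tags, as in get_keywords_community2_tags
```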
Community Analysis
- Run the code for the different types of study:
- Distribution Study
- Sentiment analysis on Genome Tags
- Tags Study
- Movie Metadata Study (Casts & Director)
- Each Study section should run without problems, provided all pre-defined functions and variables have been initialised (a sentiment-scoring sketch follows)
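A small sketch of how tag sentiment could be scored with the afinn and textblob packages listed in Setup; the actual scoring logic in the notebook may differ:

```python
from afinn import Afinn
from textblob import TextBlob

# Made-up stand-in for one community's concatenated genome tags
tag_string = "dark gritty depressing violent"

# AFINN: sum of word-level valence scores (negative total = negative sentiment)
afinn_score = Afinn().score(tag_string)

# TextBlob polarity: a float in [-1, 1]
polarity = TextBlob(tag_string).sentiment.polarity

print(afinn_score, polarity)
```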
Setup
- Set file paths for users.csv, movie_data_merged_v2.csv and user_ratings.csv
- Run the data merging, rating bucketization, and train/validation/test split.
- Run each of the data preprocessing steps; these fit the data transformers and transform the training data. The data_pipe function then combines all preprocessing steps and transformers into a single function, which is then applied to the validation and test datasets (a sketch of this pattern follows this list).
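data_pipe is the notebook's own function; the sketch below only illustrates the general pattern of fitting transformers on the training split and reusing them on the validation and test splits, with made-up columns and bucket thresholds:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up master table
master = pd.DataFrame({
    "age":    [23, 45, 31, 52, 19, 40, 28, 60],
    "rating": [5.0, 2.0, 4.0, 1.5, 3.5, 4.5, 3.0, 2.5],
})

# Bucketize ratings into classes (thresholds are illustrative only)
master["rating_bucket"] = pd.cut(master["rating"], bins=[0, 2.5, 3.5, 5],
                                 labels=["low", "mid", "high"])

train, test = train_test_split(master, test_size=0.25, random_state=42)
train, val = train_test_split(train, test_size=0.25, random_state=42)

# Fit transformers on the training split only
scaler = StandardScaler().fit(train[["age"]])

def data_pipe(df):
    """Apply every fitted transformer to a new split (mirrors the data_pipe idea)."""
    out = df.copy()
    out[["age"]] = scaler.transform(out[["age"]])
    return out

val, test = data_pipe(val), data_pipe(test)
```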
Modelling
- Naive Bayes and logistic regression do not take much training time.
- However, XGBoost does take some time, so rerunning it is not recommended.
- Right after training XGBoost, there is a section to save and reload the model. Depending on whether you are running directly after training or loading a pre-saved model, uncomment and run only the corresponding code blocks. It is recommended to use the loaded model from here on instead of rerunning XGBoost; all diagnostic steps should then work, provided the data has already been processed and the model is loaded (a save/load sketch follows this list).
- Grid search logs are at the bottom of the notebook due to their length. Re-running them is not recommended.
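A sketch of the save-then-reload pattern so the expensive XGBoost fit never has to be repeated; whether the notebook uses joblib or another serializer is an assumption:

```python
import joblib
from xgboost import XGBClassifier

model = XGBClassifier(n_estimators=200, max_depth=6)
# model.fit(X_train, y_train)  # the slow step; run once

# Save right after training
joblib.dump(model, "xgb_model.joblib")

# In later sessions, load the pre-saved model instead of rerunning XGBoost
model = joblib.load("xgb_model.joblib")
```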
- Load the data and run the merging, similar to predictive modelling
- Split into train/validation/test sets, and convert each into the format required by the surprise package
- The package used is surprise
- The function diagnose_reccomendation is used to check the accuracy and F1-score of each trained model; make sure to load it before proceeding.
- Each of the modelling steps can be run without issues.
- SVD++ takes extremely long to run; running it is not recommended
- SVD-tuned takes around a few minutes to run.
- The rest of the modelling runs relatively quickly.
- The model is saved at the end, using the package's dump module.
- Trained models can also be loaded from there (a sketch is shown below).
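A minimal sketch of the surprise workflow, including saving and reloading with the dump module; the column names in user_ratings.csv are assumptions:

```python
import pandas as pd
from surprise import Dataset, Reader, SVD, accuracy, dump
from surprise.model_selection import train_test_split

ratings = pd.read_csv("user_ratings.csv")  # assumed columns: userId, movieId, rating

reader = Reader(rating_scale=(0.5, 5.0))
data = Dataset.load_from_df(ratings[["userId", "movieId", "rating"]], reader)
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

# Save and reload the trained model with the package's dump module
dump.dump("svd_model.dump", algo=algo)
_, algo = dump.load("svd_model.dump")
```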