This is the repository for the RBDA final project. We use the following three datasets:
- Inherent movie attributes, including budget, release date, etc., collected from TMDb;
- Movie production information, including cast, director, etc., collected from TMDb;
- Social media comments, collected from Twitter.
The Movie Database (TMDb) is a community-built movie and TV database. TMDb offers data on up to 364904 movies, including budget, production crew, release date, genres, language, and runtime. It also provides detailed information about each actor, recording all the movies he or she has participated in.
- Data/tmdb_5000_credits.csv - crew information.
- Data/tmdb_5000_movies.csv - overview information.
We select the following features to train our model:
- Genre: represented by a 0/1 matrix. There are 19 genres in total, including Fantasy, Adventure, etc.
- Lang: language is represented by a 0/1 matrix. There are 27 languages in total.
- Budget.
- Release year.
- Cast impression index, describing the box-office appeal of the whole cast. We use a weighted average of the historical revenue of all the cast members.
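The 0/1 matrix encoding used for Genre and Lang can be sketched as follows. This is a minimal illustration rather than the project's actual code: the record layout, the toy values, and the helper names `build_vocab` and `multi_hot` are all assumptions.

```python
def build_vocab(records, key):
    """Collect the sorted list of distinct labels found under `key`."""
    return sorted({label for rec in records for label in rec[key]})


def multi_hot(labels, vocab):
    """Return one 0/1 row: 1 where the vocab label is present, else 0."""
    present = set(labels)
    return [1 if label in present else 0 for label in vocab]


# Hypothetical toy records; the real data has 19 genres and 27 languages.
records = [
    {"genres": ["Fantasy", "Adventure"], "langs": ["en"]},
    {"genres": ["Drama"], "langs": ["fr"]},
]

genre_vocab = build_vocab(records, "genres")
lang_vocab = build_vocab(records, "langs")

# One row per movie, one column per genre or language.
genre_matrix = [multi_hot(rec["genres"], genre_vocab) for rec in records]
lang_matrix = [multi_hot(rec["langs"], lang_vocab) for rec in records]
```

A movie can belong to several genres, so a row may contain more than one 1.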
We use Hadoop MapReduce to process the data. The original data source contains bad records that we need to filter out, including records with a missing budget, records with missing cast information, and movies that are too old to have statistical value. For the data integration stage, the steps are as follows:
- List all the movies that an actor has ever participated in;
- Find the revenue of all those movies;
- For each movie, calculate the cast impression according to the cast list.
These steps require very complex join operations across both tables, so we use HBase to store the movie information and cast information because of its extraordinarily high scalability.
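The filtering and the three integration steps can be sketched in plain Python on toy data. This is only an illustration under stated assumptions, not the project's MapReduce or HBase code: the record layout, the `MIN_YEAR` cutoff, the function names, and the choice of weighting each cast member by 1/(billing position) are all hypothetical, since the actual weights are not stated here.

```python
from collections import defaultdict

MIN_YEAR = 1980  # hypothetical cutoff for "too old"; the real threshold is not stated


def is_usable(rec, min_year=MIN_YEAR):
    """Reject records with a missing budget, an empty cast, or an old release year."""
    return bool(rec.get("budget")) and bool(rec.get("cast")) and (rec.get("year") or 0) >= min_year


def actor_filmography(movies):
    """Step 1: list every movie each actor participated in."""
    films = defaultdict(list)
    for movie_id, rec in movies.items():
        for actor in rec["cast"]:
            films[actor].append(movie_id)
    return dict(films)


def actor_mean_revenue(movies, films):
    """Step 2: look up the revenue of those movies and average it per actor."""
    return {
        actor: sum(movies[m]["revenue"] for m in ids) / len(ids)
        for actor, ids in films.items()
    }


def cast_impression(cast, mean_revenue):
    """Step 3: weighted average over the cast list (assumed weight: 1 / billing position)."""
    weights = [1.0 / (pos + 1) for pos in range(len(cast))]
    weighted = sum(w * mean_revenue[actor] for w, actor in zip(weights, cast))
    return weighted / sum(weights)


# Hypothetical toy records keyed by movie id; cast is in billing order.
movies = {
    "m1": {"budget": 10.0, "year": 2010, "revenue": 100.0, "cast": ["a", "b"]},
    "m2": {"budget": 20.0, "year": 2012, "revenue": 300.0, "cast": ["a"]},
    "m3": {"budget": 5.0, "year": 2014, "revenue": 50.0, "cast": ["b", "c"]},
    "m4": {"budget": 0, "year": 1950, "revenue": 999.0, "cast": []},  # bad record
}

clean = {movie_id: rec for movie_id, rec in movies.items() if is_usable(rec)}
films = actor_filmography(clean)
mean_rev = actor_mean_revenue(clean, films)
impression = {m: cast_impression(rec["cast"], mean_rev) for m, rec in clean.items()}
```

Note that this sketch includes the movie itself in each actor's history; whether the real pipeline excludes it is not specified.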
Twitter is a social media site used by over three hundred million users, and movies are one of the most popular topics users discuss. To estimate a movie's social attention, we manually find its official Twitter account and extract that account's data as features.
- Data/Movie_Twitter_Account.csv - manually noted official Twitter accounts.
- Data/twitter_data.json - Twitter data we collected using the Twitter API.
- average tweet "favorite" count.
- maximum tweet "favorite" count.
- overall tweet "favorite" count.
- average tweet "retweet" count.
- maximum tweet "retweet" count.
- overall tweet "retweet" count.
- follower count.
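The seven Twitter features listed above could be computed from an account's tweets roughly as follows. This is a sketch under assumptions: the field names `favorite_count` and `retweet_count` follow the Twitter API tweet object, but the layout of twitter_data.json is not documented here, and `twitter_features` and the output keys are hypothetical names.

```python
def twitter_features(tweets, follower_count):
    """Aggregate per-tweet counts into the account-level features."""
    favorites = [t["favorite_count"] for t in tweets]
    retweets = [t["retweet_count"] for t in tweets]
    n = len(tweets)
    return {
        "avg_favorite": sum(favorites) / n if n else 0.0,
        "max_favorite": max(favorites, default=0),
        "total_favorite": sum(favorites),
        "avg_retweet": sum(retweets) / n if n else 0.0,
        "max_retweet": max(retweets, default=0),
        "total_retweet": sum(retweets),
        "follower_count": follower_count,
    }


# Hypothetical tweets from one official movie account.
sample_tweets = [
    {"favorite_count": 10, "retweet_count": 2},
    {"favorite_count": 30, "retweet_count": 4},
]
features = twitter_features(sample_tweets, follower_count=1000)
```

An account with no tweets yields zeros for every tweet-based feature instead of raising an error.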
- We use a virtual-machine-based platform to build the whole project, which needs the following open-source software:
- Clone the project and start the virtual machine. We have written a Vagrant script to help install all the essential dependencies, such as HBase, Hadoop, and Spark.
git clone git@github.com:nerokapa/RBDA_Project.git
cd RBDA_Project
vagrant up
# log in to the VM; /vagrant is Vagrant's default synced folder inside the guest
vagrant ssh
cd /vagrant
- Process data in parallel. This procedure will produce processed_movie_data.dat in ./Data
cd ETL
# if Hadoop not installed
sh ETL_run_local.sh
# if Hadoop installed
sh ETL_run_hadoop.sh
- Process cast information in parallel. This procedure will produce processed_movie_data.dat in ./Data
cd ../Analytics
# if Hadoop not installed
sh cast_process_local.sh
# if Hadoop installed
sh cast_process_hadoop.sh
- Start the HBase server and Thrift service; the data will be inserted into HBase.
sh start_hbase.sh
# use the script below to test HBase
python filter_test.py
- Model training. Some parameters need to be filled in.
sh run_predict.sh
- Data post-processing. You can use Result/harness.py to run the data post-processing.