An end-to-end Data Science application for predicting revenue for films from the 2010s.
Ensure you have Python 3.5 or greater installed; you can download the latest version here. You can manage the environment with either pip or Anaconda.
Navigate to the folder in which you want to store this repository, then clone it and change into its directory:
git clone https://github.com/jklewis99/furiosa.git
cd furiosa
Windows
py -m venv [ENV_NAME]
.\[ENV_NAME]\Scripts\activate
Linux/Mac
python3 -m venv [ENV_NAME]
source [ENV_NAME]/bin/activate
Anaconda
conda update conda
conda create -n [ENV_NAME]
conda activate [ENV_NAME]
conda install pip # install pip to allow easy requirements.txt install
pip install -r requirements.txt
The data for this project was generated using a set of APIs and databases. All databases can be found here.
The initial dataset came from the MovieLens 25M Dataset. Only movies released in the 2010s decade (2010-2019) were kept. For each of these movies, a request was made to the TMDB API for updated or new features: `budget`, `title`, `vote_count`, `vote_average`, `revenue`, `runtime`, `popularity`, and `overview`. During and after these requests, additional requests were made to get information on `credits`, `crew`, and `release_dates`.
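As a rough, hedged sketch of what one of these TMDB requests can look like (the project's own collection scripts may differ), the example below assumes a TMDB API key stored in a `TMDB_API_KEY` environment variable and uses TMDB's documented `/movie/{id}`, `/movie/{id}/credits`, and `/movie/{id}/release_dates` endpoints:

```python
import os
import requests

TMDB_BASE = "https://api.themoviedb.org/3"
API_KEY = os.environ["TMDB_API_KEY"]  # assumed: your own TMDB API key

def fetch_movie_features(tmdb_id):
    """Fetch core details, credits, and release dates for a single movie."""
    params = {"api_key": API_KEY}

    # Core details: budget, title, vote_count, vote_average, revenue, runtime, ...
    details = requests.get(f"{TMDB_BASE}/movie/{tmdb_id}", params=params).json()
    # Cast and crew
    credits = requests.get(f"{TMDB_BASE}/movie/{tmdb_id}/credits", params=params).json()
    # Release dates by country
    releases = requests.get(f"{TMDB_BASE}/movie/{tmdb_id}/release_dates", params=params).json()

    keep = ["budget", "title", "vote_count", "vote_average",
            "revenue", "runtime", "popularity", "overview"]
    features = {key: details.get(key) for key in keep}
    features["cast"] = credits.get("cast", [])
    features["crew"] = credits.get("crew", [])
    features["release_dates"] = releases.get("results", [])
    return features

print(fetch_movie_features(10193)["title"])  # 10193: a valid TMDB movie id (Toy Story 3)
```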
To get data for trailers, the YouTube Data API was used. The API's `search.list` and `videos.list` methods were used to collect data on trailers, specifically `title`, `channel_title`, `channel_id`, `description`, `release_date`, `tags`, `view_count`, `like_count`, `dislike_count`, and `comment_count` (features renamed for Python syntax), along with a computed `similarity_score` metric for each trailer.
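The repository's full request logic lives in `youtubeAPIrequests.py` (described next); the following is only a minimal, hedged sketch of how the two methods can be chained with the `google-api-python-client` library. The `YOUTUBE_API_KEY` environment variable and the search query are illustrative assumptions, not names taken from the project.

```python
import os
from googleapiclient.discovery import build

# Assumed: an API key stored in the YOUTUBE_API_KEY environment variable.
youtube = build("youtube", "v3", developerKey=os.environ["YOUTUBE_API_KEY"])

# search.list: find candidate trailer videos for a movie title.
search_response = youtube.search().list(
    q="Toy Story 3 trailer",
    part="snippet",
    type="video",
    maxResults=5,
).execute()
video_ids = [item["id"]["videoId"] for item in search_response["items"]]

# videos.list: pull the snippet (title, channel, tags, publish date, ...) and
# statistics (view/like/comment counts) for those candidate videos.
videos_response = youtube.videos().list(
    part="snippet,statistics",
    id=",".join(video_ids),
).execute()

for video in videos_response["items"]:
    snippet, stats = video["snippet"], video["statistics"]
    print(snippet["title"], snippet["channelTitle"], stats.get("viewCount"))
```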
The main API requests used in this project can be found in the `youtubeAPIrequests.py` file. To make this file work on your computer without access tokens and refresh tokens (and without signing in at every execution), there are a few steps to follow:
Follow directions at the YouTube Data API Overview page to get started.
Follow directions at the Getting Started with authentication page to get started. When you have the JSON file that identifies your application's credentials, set up the `GOOGLE_APPLICATION_CREDENTIALS` environment variable (System Properties -> Advanced -> Environment Variables -> User variables for {USER} -> New) and point it to the path where your application's JSON file is saved. If you do not wish to set up this environment variable globally, you can set it at the beginning of each shell session instead. These instructions can be found here.
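As a small, hedged alternative (not code from this repository), you can also point the variable at the credentials file from inside Python itself, before any Google client is created; the path below is a placeholder:

```python
import os

# Placeholder path; replace with wherever you saved your application's JSON key file.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"C:\path\to\your-credentials.json"

# Any Google client built after this point picks up the credentials from the
# environment variable, just as if it had been set globally.
```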
Run the following commands in the console at the root of the furiosa directory:
cd examples
python youtube_api_test.py
This should return the following:
Toy Story 3: Trailer - Walt Disney Studios
Much of the visualization and evaluation of the data itself can be found in the `notebooks` folder. Additionally, many graphs that represent the significance of each feature, along with some other visual representations of the data, are available in the `correlations` folder inside `figures`.
The following machine learning models were used in this repository:
- Neural Networks (Regression)
- Support Vector Machines (Regression)
- Linear Regression
- Decision Tree (Regression)
- Random Forest (Regression)
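As a hedged illustration of how these regressors could be trained and compared with scikit-learn (the CSV path and column names below are placeholders, and the repository's neural network may be implemented with a different framework):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# Placeholder file and column names for illustration only.
data = pd.read_csv("data/movies_with_trailer_features.csv")
X = data[["budget", "runtime", "popularity", "view_count", "like_count"]]
y = data["revenue"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Support Vector Machine": SVR(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=42),
    "Neural Network": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42),
}

# Fit each model and report its R^2 score on the held-out test split.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R^2 = {r2_score(y_test, model.predict(X_test)):.3f}")
```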