An exploratory data analysis of movie revenues and the factors influencing them using the TMDb API and Python.

herrerovir/Python-movies-eda

🎬 Movies: an exploratory data analysis

This repository contains an exploratory data analysis of movies using Python.

Introduction

The global movie industry is a multi-billion-dollar sector with considerable economic impact and influence. For producers, investors and other parties involved in the moviemaking process, it is crucial to understand the potential revenue a film can generate.

Goal

The main objective of this project is to perform an exploratory analysis of a movie dataset to learn how much revenue movies generate and which factors influence it.

The analysis aims to identify:

  • Movie genres with the highest revenues
  • Movie genres with the highest budgets
  • Factors that influence (positively or negatively) a movie's revenue

Project overview

  1. Data collection
  2. Data loading
  3. Data cleaning
  4. Data analysis
  5. Insights

Dependencies

The following tools are required to carry out this project:

  • Python 3
  • Jupyter Notebooks
  • Python libraries:
    • urllib.request
    • gzip
    • json
    • NumPy
    • pandas
    • matplotlib.pyplot
    • seaborn

Technical skills

Throughout the implementation of this project, the following skills were applied:

  • RESTful API integration
  • JSON file management
  • Data loading
  • Data cleaning
  • Data exploration
  • Data analysis
  • Data visualization

Dataset

The dataset used for this analysis is a JSON file, included in this repository as a zip archive.

The dataset consists of:

  • 30498 entries
  • 27 columns

Data collection

The Movie Database (TMDb) is a popular database for movies and TV shows. TMDb provides a web API in the form of a RESTful web service. To get access to the TMDb API, it is necessary to register on their website and generate an API key. The process is as follows:

  1. Register at https://www.themoviedb.org/account/signup
  2. After successful registration, go to Settings > API
  3. Generate the API key and store it somewhere safe
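
Once a key is available, a request to the API might look like the following minimal sketch. It assumes the TMDb v3 movie-details endpoint with the key passed as the `api_key` query parameter; the helper names `build_movie_url` and `fetch_movie` are hypothetical and are not taken from the notebook.

```python
import json
import urllib.parse
import urllib.request

# Base URL of the TMDb v3 API (assumption: the v3 movie-details endpoint is used)
TMDB_BASE_URL = "https://api.themoviedb.org/3"


def build_movie_url(movie_id, api_key):
    """Return the URL for the details of a single movie (hypothetical helper)."""
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{TMDB_BASE_URL}/movie/{movie_id}?{query}"


def fetch_movie(movie_id, api_key):
    """Download one movie record and decode the JSON response into a dict."""
    with urllib.request.urlopen(build_movie_url(movie_id, api_key)) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (needs network access and a valid key, so it is left commented out):
# movie = fetch_movie(550, "YOUR_API_KEY")
# print(movie["title"], movie["revenue"])
```

Keeping the URL construction in its own function makes it possible to check the request without calling the API.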

Data loading

As explained in the Jupyter Notebook EDA-movies, the data collection section is only an educational step for getting familiar with the TMDb API. The dataset used in this project was provided by HLRS in Stuttgart and was loaded into the Notebook using Pandas.
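
As a rough illustration of this step, the sketch below reads a gzip-compressed JSON file into a Pandas DataFrame. The file layout (a single JSON array of movie records), the function name `load_movies` and the sample records are assumptions made for the example; the actual notebook may load the data differently.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_movies(path):
    """Read a gzip-compressed JSON array of movie records into a DataFrame."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        records = json.load(handle)
    return pd.DataFrame(records)


# Round trip with two made-up records, only to show the call
sample = [
    {"title": "A", "budget": 10, "revenue": 25},
    {"title": "B", "budget": 20, "revenue": 35},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.json.gz")
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        json.dump(sample, handle)
    movies = load_movies(sample_path)

print(movies.shape)
```

For the real dataset, `movies.shape` would be expected to report the 30498 entries and 27 columns listed in the Dataset section.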

Data cleaning

Once the data is loaded, it must undergo a cleaning and preprocessing phase. This part of the analysis is essential, as a clean and normalized dataset underpins the integrity and reliability of every later result.
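
The exact cleaning steps are documented in the notebook. As a hedged sketch of what such a phase often involves, the example below removes duplicate records, treats zero budgets and revenues as missing, and parses release dates. The column names (`id`, `budget`, `revenue`, `release_date`) and the rule that 0 means "not reported" are assumptions, and the sample data is made up.

```python
import pandas as pd


def clean_movies(df):
    """Return a cleaned copy of the movie table (illustrative rules only)."""
    cleaned = df.drop_duplicates(subset="id").copy()
    # Assumption: a budget or revenue of 0 means the value was not reported
    for column in ("budget", "revenue"):
        cleaned[column] = cleaned[column].mask(cleaned[column] == 0)
    # Unparseable dates become NaT instead of raising an error
    cleaned["release_date"] = pd.to_datetime(cleaned["release_date"], errors="coerce")
    # Keep only rows where both financial figures are known
    return cleaned.dropna(subset=["budget", "revenue"]).reset_index(drop=True)


raw = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "budget": [100, 100, 0, 50],
    "revenue": [300, 300, 80, 120],
    "release_date": ["2010-05-01", "2010-05-01", "2012-07-15", "not a date"],
})
movies = clean_movies(raw)
print(movies)
```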

Data exploration

To obtain useful information from this dataset, an in-depth exploratory analysis was carried out. The dataset was analyzed on both a univariate and a bivariate basis.

Univariate analysis involved analyzing each variable in the dataset separately.

The bivariate analysis consisted of examining two different variables to determine whether there is a dependence or relationship between them.
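
The difference between the two kinds of analysis can be sketched with Pandas as follows. The column names and values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up sample; the real dataset has many more rows and columns
movies = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 35, 70, 80, 120],
})

# Univariate: summarise a single variable on its own
revenue_summary = movies["revenue"].describe()

# Bivariate: measure the linear (Pearson) relationship between two variables
budget_revenue_corr = movies["budget"].corr(movies["revenue"])

print(revenue_summary)
print(budget_revenue_corr)
```

A coefficient close to 1 indicates a strong positive linear relationship, while a value near 0 indicates no linear relationship.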

Data visualization

Data visualization plays a crucial role in data analysis, as it is the stage at which the conclusions drawn from the analysis are effectively communicated.

This stage focuses on creating visual representations of the insights gained during the analysis. The Python libraries Matplotlib and Seaborn were used for this purpose.
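
A minimal sketch of such a plot, here a budget-versus-revenue scatter plot drawn with Matplotlib alone, is shown below. The data is made up, the off-screen `Agg` backend is chosen only so the example runs without a display, and the figures in the notebook may be styled differently. Seaborn's `scatterplot` could be used in a similar way.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; in a notebook this line is not needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample values for illustration
movies = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 35, 70, 80, 120],
})

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(movies["budget"], movies["revenue"], alpha=0.7)
ax.set_xlabel("Budget")
ax.set_ylabel("Revenue")
ax.set_title("Budget vs. revenue (made-up sample)")
fig.tight_layout()

# Write the figure to an in-memory PNG; fig.savefig("plot.png") would write a file
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```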

Insights

The main goal of this project is to explore the movie dataset to find insightful data on the factors that affect a movie's revenue.

Through this analysis it was found that the top 4 movie genres that generate the highest revenues are: action, adventure, science fiction and fantasy.

Revenues for these four genres have been on the rise since 2013. Action, adventure and science fiction have seen a greater increase than any other genre since 2015. In contrast, comedy, crime, drama, horror, romance and thriller have not increased in revenue since 1990.
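
A ranking of this kind can be computed along the lines of the sketch below. It assumes a hypothetical `genres` column that holds a list of genre names per movie and ranks genres by mean revenue; the notebook may store genres differently or rank by another statistic, and the values here are made up.

```python
import pandas as pd

# Made-up sample: each movie can belong to several genres
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Action", "Adventure"], ["Drama"], ["Action"]],
    "revenue": [300, 50, 100],
})

# One row per (movie, genre) pair, so a multi-genre movie counts towards each genre
per_genre = movies.explode("genres")

# Mean revenue per genre, highest first
mean_revenue = (
    per_genre.groupby("genres")["revenue"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_revenue)
```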

It was also found that the top 4 movie genres with the largest budgets are: action, adventure, science fiction and fantasy.

The budgets of the following genres have increased since 2012: action, adventure, animation, family, fantasy, and science fiction. On the other hand, drama, horror, mystery and thriller haven't increased their budgets since 1990.

This study concludes that there is a clear positive correlation between movie budget and revenue. Movie popularity is positively correlated with both budget and movie revenue. There is no correlation between revenue and movie length. It is also found that film rating is not correlated with either budget or revenue.