This repository contains an exploratory data analysis of movies using Python.
- Intro
- Goal
- Project Overview
- Dependencies
- Technical skills
- Data set
- Data Cleaning
- Data Exploration
- Data Visualization
- Insights
The global movie industry is a multi-billion-dollar sector with considerable economic impact and influence. For producers, investors and other parties involved in the movie-making process, it is crucial to understand the potential revenue a film can generate.
The main objective of this project is to perform an exploratory analysis of movie data to understand the revenue that movies generate and the factors that influence it.
The analysis is expected to answer the following key questions:
- Movie genres with the highest revenues
- Movie genres with the highest budgets
- Factors that influence (positively or negatively) a movie's revenue
- Data collection
- Data loading
- Data cleaning
- Data analysis
- Insights
The following tools are required to carry out this project:
- Python 3
- Jupyter Notebooks
- Python libraries:
  - urllib.request
  - gzip
  - json
  - NumPy
  - pandas
  - matplotlib.pyplot
  - seaborn
Throughout the implementation of this project, the following skills were applied:
- RESTful API integration
- JSON file management
- Data loading
- Data cleaning
- Data exploration
- Data analysis
- Data visualization
The dataset used for this analysis is a JSON file, included in this repository as a zip archive.
The dataset consists of:
- 30498 entries
- 27 columns
The Movie Database (TMDb) is a popular database for movies and TV shows. TMDb provides a web API in the form of a RESTful web service. To get access to the TMDb API, it is necessary to register on their website and generate an API key. The process is the following:
- register at https://www.themoviedb.org/account/signup
- after successful registration, go to Settings > API
- generate the API key and keep a copy of it in a safe place
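Once a key is available, a request against the API could look like the following minimal sketch. It assumes the v3 `movie/{id}` endpoint with the key passed as the `api_key` query parameter. The helper names `build_movie_url` and `fetch_movie` are illustrative and are not taken from the notebook.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.themoviedb.org/3"  # TMDb API v3 base URL


def build_movie_url(movie_id, api_key):
    """Build the details URL for one movie (hypothetical helper)."""
    query = urllib.parse.urlencode({"api_key": api_key})
    return f"{API_BASE}/movie/{movie_id}?{query}"


def fetch_movie(movie_id, api_key, timeout=10):
    """Download and decode the JSON document for one movie.

    Needs network access and a valid key; not called at import time.
    """
    url = build_movie_url(movie_id, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_movie(550, "YOUR_API_KEY")` with a real key in place of the placeholder should return a dictionary describing the movie with that id. Keeping the key out of the notebook itself (for example in an environment variable) avoids committing it to the repository.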
As explained in the Jupyter Notebook EDA-movies, the data collection section is just an educational step to get familiar with working with the TMDb API services. The dataset used in this project was provided by HLRS in Stuttgart and was loaded into the Notebook using Pandas.
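The loading step might be sketched as follows. This is an assumption-heavy example: it supposes the file is gzip-compressed and holds one JSON object per line, which may not match the actual layout of the file in this repository. The function name `load_movies` is hypothetical.

```python
import gzip
import json

import pandas as pd


def load_movies(path):
    """Read a gzip-compressed JSON Lines file into a DataFrame.

    Assumption: one JSON object per line; adjust if the file differs.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not break parsing.
        records = [json.loads(line) for line in handle if line.strip()]
    return pd.DataFrame(records)
```

After loading, `df.shape` and `df.info()` are quick ways to confirm that the number of entries and columns matches what is expected.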
Once the data is loaded, it must undergo a cleaning and preprocessing phase. This step is essential, because a clean, normalized dataset is what makes the results of the analysis reliable.
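A few typical cleaning steps are sketched here for illustration. The column names (`id`, `budget`, `revenue`, `release_date`) and the decision to treat a value of 0 as missing are assumptions, not a description of what the notebook actually does.

```python
import numpy as np
import pandas as pd


def clean_movies(df):
    """Return a cleaned copy of the raw movie table (illustrative steps only)."""
    # Keep the first row for each movie id.
    out = df.drop_duplicates(subset="id").copy()
    # Assumption: a budget or revenue of 0 means "unknown", not a real amount.
    for col in ("budget", "revenue"):
        out[col] = pd.to_numeric(out[col], errors="coerce").replace(0, np.nan)
    # Unparseable dates become NaT instead of raising an error.
    out["release_date"] = pd.to_datetime(out["release_date"], errors="coerce")
    return out.reset_index(drop=True)
```

Returning a copy rather than modifying the input keeps the raw table available for comparison while the cleaning rules are still being tuned.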
To obtain useful information from this dataset, an in-depth exploratory analysis was carried out. The dataset was analyzed on both a univariate and a bivariate basis.
Univariate analysis involved analyzing each variable in the dataset separately.
The bivariate analysis consisted of examining two different variables to determine whether there is a dependence or relationship between them.
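The difference between the two kinds of analysis can be shown on a small example. The data here is made up and the column names are assumptions; only the pandas calls (`describe` and `corr`) are the point of the sketch.

```python
import pandas as pd

# Toy data standing in for the real dataset; all values are invented.
movies = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 45, 70, 85, 110],
    "runtime": [95, 120, 88, 130, 101],
})

# Univariate: summarise a single variable on its own.
revenue_summary = movies["revenue"].describe()

# Bivariate: Pearson correlation between every pair of numeric columns.
correlations = movies.corr(method="pearson")

print(revenue_summary)
print(correlations.round(2))
```

In this toy table revenue rises almost linearly with budget, so their correlation coefficient is close to 1. A coefficient near 0 would indicate no linear relationship between the two variables.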
Data visualization plays a crucial role in data analysis, as it is the stage at which the conclusions drawn from the analysis are effectively communicated.
This stage focuses on creating visual representations of the insights gained during the analysis. The Python libraries Matplotlib and Seaborn were used for this purpose.
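As an example of this stage, the sketch here computes the mean revenue per genre and draws it as a bar chart with Matplotlib. The data is invented, and it assumes that each movie's genres are stored as a list in a `genres` column, which may differ from the real dataset.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; this line can be dropped in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; each movie can belong to several genres.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Action", "Adventure"], ["Drama"], ["Action"]],
    "revenue": [300.0, 50.0, 100.0],
})

# One row per (movie, genre) pair, then the mean revenue per genre.
per_genre = (
    movies.explode("genres")
    .groupby("genres")["revenue"]
    .mean()
    .sort_values(ascending=False)
)

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(list(per_genre.index), per_genre.values)
ax.set_xlabel("Genre")
ax.set_ylabel("Mean revenue")
ax.set_title("Mean revenue by genre (toy data)")
fig.tight_layout()
```

A Seaborn `barplot` could be used in place of the Matplotlib `bar` call. Note that a movie with several genres contributes its full revenue to each of them, which is one possible design choice among others.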
The main goal of this project is to explore the movie dataset to find insightful data on the factors that affect a movie's revenue.
This analysis found that the four movie genres that generate the highest revenues are action, adventure, science fiction and fantasy.
Revenues for these four genres have been on the rise since 2013. Action, adventure and science fiction have seen a greater increase than any other genre since 2015. In contrast, comedy, crime, drama, horror, romance and thriller have not increased in revenue since 1990.
It was also found that the four movie genres with the largest budgets are action, adventure, science fiction and fantasy.
The budgets of the following genres have increased since 2012: action, adventure, animation, family, fantasy and science fiction. On the other hand, drama, horror, mystery and thriller have not increased their budgets since 1990.
This study concludes that there is a clear positive correlation between movie budget and revenue. Movie popularity is positively correlated with both budget and movie revenue. There is no correlation between revenue and movie length. It is also found that film rating is not correlated with either budget or revenue.