
ETL pipeline using Pandas in Python

Context

This project builds an ETL pipeline, covering data extraction, transformation, and loading, to collect movie information from The Movie Database (TMDB) website. At the outset I set up an account to obtain an API key for the data requests. To keep the key secure and out of the source code, it lives in a separate configuration file named config.py.
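A minimal sketch of this setup; the variable name api_key is an assumption, and config.py should be kept out of version control (e.g. via .gitignore):

```python
# config.py (do not commit this file)
api_key = "YOUR_TMDB_API_KEY"  # placeholder; paste your real TMDB key here
```

```python
# Anywhere the key is needed, import the module instead of hard-coding the key
import config

url = f"https://api.themoviedb.org/3/movie/550?api_key={config.api_key}"
```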

These CSV files contain the movie datasets:
dataset: the main movies-metadata file, holding information on the movies featured in the database. Features include backdrops, budget, revenue, release dates, posters, languages, and production companies and countries.
tmdb_movies_info: contains each movie's title, vote, and plot from the TMDB website, available in the form of a stringified JSON object.


API key

Create pythonFile.py to:

  • Extract data from the website without any transformation or cleaning.
  • Explore the raw data and understand its structure.
  • Save the raw data to a CSV file as-is; this step helps in understanding the nature of the data and planning the ETL process. A sketch follows.
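A minimal sketch of pythonFile.py under these assumptions: the movie IDs come from a simple numeric range, and the output file name raw_movies.csv is mine, not the repo's:

```python
# pythonFile.py: pull raw TMDB records with no cleaning or transformation
import pandas as pd
import requests

import config  # local module holding the TMDB API key

responses = []
for movie_id in range(1, 101):  # hypothetical ID range for exploration
    url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={config.api_key}"
    responses.append(requests.get(url).json())

# Save the raw records as-is so their structure can be inspected later
pd.DataFrame(responses).to_csv("raw_movies.csv", index=False)
```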

Create pythonCSV.py to:

  • Apply the full ETL pipeline (data extraction, transformation, and loading) to the data obtained from the website using the API key.
  • Import the necessary libraries for data manipulation and API requests.
  • Save the cleaned and transformed data to a CSV file.
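The individual steps are shown below; as an overview, here is a compact sketch of how pythonCSV.py could be laid out (the function boundaries and file names are my assumptions):

```python
# pythonCSV.py: extract, transform, and load in one script
import pandas as pd
import requests

import config  # local module holding the TMDB API key

def extract(movie_ids):
    """One GET request per movie; returns the parsed JSON records."""
    return [
        requests.get(
            f"https://api.themoviedb.org/3/movie/{i}?api_key={config.api_key}"
        ).json()
        for i in movie_ids
    ]

def transform(records):
    """Drop error records; the real script also flattens nested fields."""
    good = [r for r in records if r.get("success", True)]
    return pd.DataFrame(good)

df = transform(extract(range(1, 101)))          # extract + transform
df.to_csv("tmdb_movies_info.csv", index=False)  # load
```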

Import necessary libraries
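The screenshot showed the import block; a plausible minimal set for this pipeline is:

```python
import pandas as pd
import requests

import config  # local module holding the TMDB API key
```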

Data Extraction

  • The raw dataset has 28 columns and 837 rows, with no cleaning applied.
    • A loop requests each movie one at a time and appends the response to a list.
    • Each iteration sends a single GET request to the API; a missing movie returns the JSON record {'success': False, 'status_code': 34, 'status_message': 'The resource you requested could not be found.'}. A sketch of the loop follows.
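A sketch of that loop, assuming an illustrative ID range; the error records can be filtered out before building the dataframe:

```python
import pandas as pd
import requests

import config

records = []
for movie_id in range(1, 1001):  # hypothetical ID range
    resp = requests.get(
        f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={config.api_key}"
    ).json()
    if resp.get("success", True):  # skip {'success': False, 'status_code': 34, ...}
        records.append(resp)

df = pd.DataFrame(records)  # the author's run yielded 837 rows x 28 columns
```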


Data Transformation

  • Define the list of column names to keep from the main dataframe: df_columns = ["column1", "column2", "column3", ...] (replace with the actual column names).
  • Select only the columns specified in df_columns.
  • Save the selected columns to a CSV file.
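A short sketch of the selection step; the column names and output file name are placeholders:

```python
df_columns = ["column1", "column2", "column3"]  # replace with the actual column names

selected = df[df_columns]  # keep only the listed columns
selected.to_csv("selected_movies.csv", index=False)  # file name is an assumption
```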


Data Loading
Define a function, get_spoken_languages(row), to extract the English name of the spoken languages:

  • Read the spoken-languages information: spoken_languages = row['spoken_languages'].
  • If the field is not empty, iterate through its list of dictionaries looking for an 'english_name' key and return the first match.
  • Return None if no 'english_name' is found.
  • Apply the function to create a new column with cleaned data: df['cleaned_spoken_languages'] = df.apply(get_spoken_languages, axis=1).
  • Save the cleaned and transformed data to a CSV file.
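Reassembled from the notes above into a runnable form (the output file name is an assumption):

```python
def get_spoken_languages(row):
    """Return the first English language name found, or None."""
    spoken_languages = row["spoken_languages"]  # list of dicts from the TMDB payload
    if spoken_languages:
        for language in spoken_languages:
            if "english_name" in language:
                return language["english_name"]
    return None  # no 'english_name' present

df["cleaned_spoken_languages"] = df.apply(get_spoken_languages, axis=1)
df.to_csv("movies_spoken_languages.csv", index=False)  # file name is an assumption
```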


Define a function, get_production_countries(row), to extract the production-country name:

  • Read the production-countries information: production_countries = row['production_countries'].
  • If the field is not empty, iterate through its list of dictionaries and return 'United Kingdom' when a 'name' entry matches it.
  • Return None if 'United Kingdom' is not found.
  • Apply the function to create a new column with cleaned data: df['cleaned_production_countries'] = df.apply(get_production_countries, axis=1).
  • Save the cleaned and transformed data to a CSV file.
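Reassembled the same way; note that as written, the function only flags 'United Kingdom', so any other country yields None:

```python
def get_production_countries(row):
    """Return 'United Kingdom' if it appears among the production countries, else None."""
    production_countries = row["production_countries"]  # list of dicts
    if production_countries:
        for country in production_countries:
            if country.get("name") == "United Kingdom":
                return "United Kingdom"
    return None

df["cleaned_production_countries"] = df.apply(get_production_countries, axis=1)
df.to_csv("movies_production_countries.csv", index=False)  # file name is an assumption
```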


Define a function, create_genre_columns(row, genres), to build one indicator column per genre:

  • Read the genres information: movie_genres = row['genres'].
  • Initialize a dictionary with one zero-valued entry per genre: genre_columns = {genre: 0 for genre in genres}.
  • If the genres field is not empty, iterate through its list of dictionaries and set the matching genre's entry to 1, then return the dictionary.
  • Define the list of genres to create columns for: genre_list = ['Crime', 'Drama', 'Thriller', ...] (replace with the actual genre names).
  • Apply the function to create the genre columns and merge them with the main dataframe:
    • genre_columns_df = df.apply(create_genre_columns, axis=1, genres=genre_list)
    • df = pd.concat([df, genre_columns_df], axis=1)
  • Save the cleaned and transformed data to a CSV file.
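Reassembled with two small fixes: since genre_list holds plain names, the zero-filled dictionary is keyed by the names directly, and the dictionary is returned on every path so apply() always gets one dict per row. Passing result_type="expand" spreads the dicts into columns:

```python
def create_genre_columns(row, genres):
    """Return a {genre_name: 0/1} dict flagging the genres this movie belongs to."""
    genre_columns = {genre: 0 for genre in genres}  # genres is a list of names
    movie_genres = row["genres"]  # list of dicts like {'id': 80, 'name': 'Crime'}
    if movie_genres:
        for genre in movie_genres:
            if genre.get("name") in genre_columns:  # ignore genres not in the list
                genre_columns[genre["name"]] = 1
    return genre_columns

genre_list = ["Crime", "Drama", "Thriller"]  # replace with the actual genre names

genre_columns_df = df.apply(create_genre_columns, axis=1, genres=genre_list,
                            result_type="expand")
df = pd.concat([df, genre_columns_df], axis=1)
df.to_csv("movies_genres.csv", index=False)  # file name is an assumption
```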


Create a list of unique genres to explode out.

In the process of preparing our data for analysis or machine learning, we've enhanced our dataset. First, we expanded the list of columns we're working with, including unique genres identified through the get_unique_genres function. Next, we broke down the 'genres_all' column into individual rows, allowing us to represent each movie's genres more granularly. To make this information suitable for analytical and machine learning algorithms, we utilized pd.crosstab to generate binary columns for each unique genre. In this binary representation, a value of 1 indicates that a specific genre is associated with a movie, while a value of 0 indicates its absence. This approach simplifies the categorical genre data, making it easier to process and analyze in our tasks.

  • Create a list of column names, df_columns, that selects the columns we want from the main dataframe: df_columns = ['title', 'spoken_languages', 'production_countries', ...]
  • Add the genre column names: df_columns.extend(get_unique_genres(genres_list))
  • Break the genre lists into pieces: s = df['genres_all'].explode()
  • Mark 0/1 for each genre: df = df.join(pd.crosstab(s.index, s))
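Put together, assuming genres_list holds the per-movie genre-name lists and that get_unique_genres (referenced above but not shown) flattens them to the distinct names:

```python
import pandas as pd

def get_unique_genres(genres_list):
    # Hypothetical reconstruction: flatten per-movie genre lists into sorted unique names
    return sorted({genre for genres in genres_list for genre in genres})

df_columns = ["title", "spoken_languages", "production_countries"]  # truncated in the notes
df_columns.extend(get_unique_genres(genres_list))  # one column name per unique genre

s = df["genres_all"].explode()         # one row per (movie, genre) pair
df = df.join(pd.crosstab(s.index, s))  # 0/1 indicator column for each genre
```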


The process is finalized by persisting the data into a table: the data is written to CSV files stored in the same directory as the script.
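The save itself is a single pandas call; the file name is an assumption:

```python
df.to_csv("etl_movies_final.csv", index=False)  # written next to the script
```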


This is the result of the ETL pipeline.


Source: https://www.kaggle.com/code/kanangnuts/etl-pipeline-using-pandas-in-python (my Kaggle notebook)
