This project uses pandas, NumPy, and a PostgreSQL server to download data from various sources, combine it all into one DataFrame, and then clean it. It demonstrates several different ways of accessing data and then cleaning it.
- It creates a class that lists all the tables in a PostgreSQL server and then downloads all the data from them.
- It then uses the `requests` module to query an online API and receives data.
- It then accesses an online AWS S3 bucket using `boto3`, downloading the data into a file called `product_list.csv`.
- Finally, it extracts all the data from a `.pdf` file.
This script takes all the data extracted by `data_extraction.py` and combines it into one pandas DataFrame. Pandas functions are then used to clean the data in numerous ways, such as:
- ensuring the phone numbers are of the format `xxxxxxxxxxx`, where each `x` is a numeral. This gets rid of any country codes, bad formatting, etc.
- ensuring all dates are in the format `YYYY-MM-DD` and do not contain strings such as `March`.
- ensuring the emails have a username and a domain. The domain is not checked, but this could be changed to accept only values such as `gmail.com` or `hotmail.com`.

There are also other cleaning functions for the other data sources. Once cleaned, the script uploads the data to a new PostgreSQL server.
A small module of scripts providing convenience functions to connect to, upload to, and read data from a PostgreSQL server. The PostgreSQL and AWS S3 bucket credentials used to access the data are stored in a `.yaml` file.
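A minimal sketch of such a utility module, assuming the `.yaml` file holds keys like `USER`, `PASSWORD`, `HOST`, `PORT`, and `DATABASE` (the real key names and file name may differ):

```python
import yaml
import pandas as pd
from sqlalchemy import create_engine

def read_db_creds(path: str = "db_creds.yaml") -> dict:
    # Load the PostgreSQL credentials from the YAML file
    with open(path) as f:
        return yaml.safe_load(f)

def init_db_engine(creds: dict):
    # Build a SQLAlchemy engine from the loaded credentials
    url = (f"postgresql://{creds['USER']}:{creds['PASSWORD']}"
           f"@{creds['HOST']}:{creds['PORT']}/{creds['DATABASE']}")
    return create_engine(url)

def upload_to_db(df: pd.DataFrame, table: str, engine) -> None:
    # Write the cleaned DataFrame to the target server,
    # replacing the table if it already exists
    df.to_sql(table, engine, if_exists="replace", index=False)

def read_from_db(table: str, engine) -> pd.DataFrame:
    # Read a whole table back into a DataFrame
    return pd.read_sql_table(table, engine)
```

Keeping credentials in a separate `.yaml` file (and out of version control) avoids hard-coding secrets in the scripts themselves.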