Skip to content

This project showcases a complete SQL-based data analysis using the real-world Olist Brazilian E-commerce dataset. The goal is to explore customer behavior, sales performance, product trends, and logistics efficiency by writing clear, structured SQL queries over a relational schema.

Notifications You must be signed in to change notification settings

crowjames/Data_Analysis_Project_Ecommerce

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Analysis Project 1 - Olist Ecommerce

Microsoft SQL Server SQL Tableau Excel

📚 Table of Contents


🔍 Overview

This project is a hands-on SQL analysis of the Olist Brazilian E-commerce dataset — a real marketplace operating in Brazil with multiple sellers and thousands of customers. The idea is to treat this like a real data analyst case: dive into the raw data, write exploratory and KPI-driven queries, and surface insights that could actually help a business.

Everything is documented as a step-by-step journal. I walk through what questions I’m trying to answer, what SQL I’m writing to get there, and what I’m learning along the way.


🛠️ Tools Used

  • SQL: For data cleaning, transformation, and analytical querying
  • SMSS: Used as the database engine to store and manage the relational dataset
  • Tableau: For visualization of results and reporting

📦 Dataset

The dataset used in this project comes from Olist, a real Brazilian e-commerce marketplace that connects multiple sellers to customers across Brazil. It contains detailed historical data on thousands of orders placed between 2016 and 2018. The dataset is publicly available on Kaggle and is known for its completeness and authenticity.

It includes information such as:

  • Order and delivery timestamps
  • Customer and seller location data
  • Product details and categories
  • Payment methods and amounts
  • Customer reviews and ratings

This rich structure allows for in-depth analysis of customer behavior, product performance, logistics, and satisfaction trends.


🗄️ Database

To manage and explore the dataset effectively, a relational SQL database has been created. T-SQL is used to store and query the data. The schema has been designed to reflect the natural relationships between orders, customers, sellers, and products.

Key relationships include:

  • Customers placing multiple orders
  • Each order containing one or more products
  • Sellers fulfilling different items across various orders
  • Reviews linked to each order for quality assessment

This structure supports complex queries and enables robust data slicing for KPIs, trends, and insight generation.


Schema structure

schema

🧹 Data Processing

The raw CSV files from the Kaggle dataset required some initial cleaning and preparation. SQL scripts were written to:

  • Import and load the data into SQL Server Management Studio

    The raw dataset files (CSV format) were uploaded directly into SQL Server Management Studio (SSMS) using the "Import Flat File..." wizard. This allowed me to create all tables easily with their respective columns and data types, without (mostly) changing the type.

image image
  • Handle missing or null values

I’m checking for blank and NULL values across all key tables in the Olist dataset. This helps identify any data quality issues that could affect the accuracy of the analysis. For each table, I count how many rows are missing critical fields like IDs, timestamps, payment values, or location data.

Example:

 
 
 SELECT 
    COUNT(*) AS total_rows,
    SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS missing_order_id,
    SUM(CASE WHEN order_item_id IS NULL THEN 1 ELSE 0 END) AS missing_order_item_id,
    SUM(CASE WHEN product_id IS NULL THEN 1 ELSE 0 END) AS missing_product_id,
    SUM(CASE WHEN seller_id IS NULL THEN 1 ELSE 0 END) AS missing_seller_id,
    SUM(CASE WHEN shipping_limit_date IS NULL THEN 1 ELSE 0 END) AS missing_shipping_limit_date,
    SUM(CASE WHEN price IS NULL THEN 1 ELSE 0 END) AS missing_price,
    SUM(CASE WHEN freight_value IS NULL THEN 1 ELSE 0 END) AS missing_freight_value
FROM olist_order_items_dataset;
 
  

I then run the equivalent code for each table, confirming the number of rows containing NULL values and excluding them.

  • Normalize formats (e.g., dates)

Here the scope is to check the formats of key fields such as dates or costs. If in a wrong format change that.

Example:

image

Here we change the date using the CAST function, we are not going to do an analysis that requires the precise moment.

 
 
 ALTER TABLE olist_orders_dataset
ALTER COLUMN order_purchase_timestamp DATE;

ALTER TABLE olist_orders_dataset
ALTER COLUMN order_approved_at DATE;

ALTER TABLE olist_orders_dataset
ALTER COLUMN order_delivered_carrier_date DATE;

ALTER TABLE olist_orders_dataset
ALTER COLUMN order_delivered_customer_date DATE;

ALTER TABLE olist_orders_dataset
ALTER COLUMN order_estimated_delivery_date DATE;

 
 

Result: image


📈 Project Includes

  • Overview: A high-level summary of key performance indicators (KPIs), such as total revenue, average delivery time, top product categories.

Total revenue:
image

Running total revenue per year :
image

Average delivery time:
image


Top product categories (10 best performing by tot sales):
image





👁️‍🗨️ Visualization


While the foundational analysis and data preparation were done in SQL, the remaining exploration particularly around customer behavior, delivery patterns, and geographical trends will be completed in Tableau. This decision was made to leverage Tableau's interactive and visual capabilities, making the insights easier to interpret and communicate to both technical and non-technical stakeholders. The dashboard will consolidate key KPIs and allow users to explore the data through filters, maps, and dynamic charts.

Here the result:

olist cx analysis

Click here for the interactive version of the dashboard.


About

This project showcases a complete SQL-based data analysis using the real-world Olist Brazilian E-commerce dataset. The goal is to explore customer behavior, sales performance, product trends, and logistics efficiency by writing clear, structured SQL queries over a relational schema.

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published