Our customers (subscribers) seek help building the skills to deploy simple, viable batch pipelines entirely on Docker with the following relational and NoSQL databases:
- Cassandra
- MySQL
- Redis
I engineered 3 batch data processing pipelines with PySpark, with the databases running entirely on Docker (a setup sketch follows below).
I ingested, pre-processed and visualized the data in these databases to validate that each deployment succeeded.
I also analyzed customer purchasing behavior.
I plan to write a blog post soon about how to deploy these 3 batch pipelines on Docker. Stay tuned!
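As a rough sketch of that setup (package versions, hostnames and ports here are assumptions, not necessarily the exact ones I used), a single SparkSession can be configured with connectors for all three Dockerized stores:

```python
from pyspark.sql import SparkSession

# One SparkSession wired to all three Dockerized stores.
# NOTE: package versions, hosts and ports below are assumptions.
spark = (
    SparkSession.builder
    .appName("ecommerce-batch-pipelines")
    # Cassandra and Redis connectors plus the MySQL JDBC driver
    .config(
        "spark.jars.packages",
        ",".join([
            "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
            "com.redislabs:spark-redis_2.12:3.1.0",
            "mysql:mysql-connector-java:8.0.33",
        ]),
    )
    # Containers published on localhost with their default ports
    .config("spark.cassandra.connection.host", "localhost")
    .config("spark.redis.host", "localhost")
    .config("spark.redis.port", "6379")
    .getOrCreate()
)
```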
I chose the eCommerce behavior data from multi category store dataset, available on Kaggle, so that I could focus on implementing the 3 batch pipelines.
Real business data would require more pre-processing than the transformations I performed on this dataset.
The data file contains customer behavior data from a large multi-category online store's website for 1 month (November 2019).
Each row in the file represents an event.
- All events are related to products and users
- There are 3 different types of events → view, cart and purchase
The 2 purchase funnels are (a quick session-level sketch follows this list):
- view → cart → purchase
- view → purchase
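To make the funnels concrete, here's a minimal sketch that counts sessions at each stage. It assumes the raw events sit in a DataFrame named `events` with `user_session` and `event_type` columns, as in the Kaggle schema:

```python
from pyspark.sql import functions as F

# Flag, per session, whether it contained a view, a cart and a purchase event.
# Assumes `events` has `user_session` and `event_type` columns (Kaggle schema).
session_flags = events.groupBy("user_session").agg(
    F.max((F.col("event_type") == "view").cast("int")).alias("viewed"),
    F.max((F.col("event_type") == "cart").cast("int")).alias("carted"),
    F.max((F.col("event_type") == "purchase").cast("int")).alias("purchased"),
)

# Number of sessions reaching each funnel stage
session_flags.agg(
    F.sum("viewed").alias("view_sessions"),
    F.sum("carted").alias("cart_sessions"),
    F.sum("purchased").alias("purchase_sessions"),
).show()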
Here's the distribution of events in the data, broken down by event type.
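A minimal sketch to compute that breakdown, again assuming the raw file is loaded into the `events` DataFrame with an `event_type` column:

```python
from pyspark.sql import functions as F

# Count events per type (view / cart / purchase), largest first
(
    events.groupBy("event_type")
          .agg(F.count("*").alias("event_count"))
          .orderBy(F.desc("event_count"))
          .show()
)
```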
Cassandra
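A minimal write to the Cassandra container from PySpark might look like the sketch below. The keyspace and table names are hypothetical and assumed to be created beforehand; the connection host was set on the SparkSession configured earlier:

```python
# Append the pre-processed DataFrame to a Cassandra table.
# `ecommerce` / `events_by_category` are hypothetical names and
# are assumed to exist before the write.
(
    transformed_df.write
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="ecommerce", table="events_by_category")
    .mode("append")
    .save()
)
```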
MySQL
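The MySQL pipeline can persist the same DataFrame over JDBC. The database name, table and credentials below are placeholders, with the container published on the default port 3306:

```python
# Append the pre-processed DataFrame to MySQL over JDBC.
# Database, table and credentials are placeholders.
(
    transformed_df.write
    .format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/ecommerce")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "events")
    .option("user", "root")
    .option("password", "example")
    .mode("append")
    .save()
)
```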
Redis
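For Redis, the spark-redis connector writes a DataFrame as hashes under a key prefix. The prefix and key column in this sketch are hypothetical:

```python
# Write the DataFrame to Redis as hashes keyed by a column value.
# `events` (key prefix) and `user_session` (key column) are hypothetical.
(
    transformed_df.write
    .format("org.apache.spark.sql.redis")
    .option("table", "events")
    .option("key.column", "user_session")
    .mode("append")
    .save()
)
```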
I performed the following analyses on the pre-processed (transformed) data in storage (a sketch of one such query follows the list):
- Views by category
- Purchase volume by category
- Top 20 brands purchased
- Purchase conversion volume
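As an example of one of these queries, here's a top-20-brands sketch, assuming the transformed data is loaded into a DataFrame named `events` with `brand` and `event_type` columns:

```python
from pyspark.sql import functions as F

# Top 20 brands ranked by number of purchase events
top_brands = (
    events.filter(F.col("event_type") == "purchase")
          .groupBy("brand")
          .agg(F.count("*").alias("purchases"))
          .orderBy(F.desc("purchases"))
          .limit(20)
)
top_brands.show(20, truncate=False)
```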
All the data my analysis is based on was collected by and belongs to the Open CDP project.