Introductory workshops for beginners in Apache Spark with Python (PySpark) and SQL (Spark SQL). The repository includes Jupyter notebooks (.ipynb) and data.
Note: the file paths in the notebooks will need to be updated to match your environment.
Covers some core concepts of using Spark for data analysis, including:
- Loading data
- Spark SQL & basic data transformations
- Writing data
- Caching data for performance
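
The four topics above can be sketched end to end as follows. This is a minimal illustration, not code taken from the notebooks: the file names (`sales.csv`, `sales_totals.parquet`), the column names (`region`, `amount`) and the helper `resolve` are all hypothetical, and `data_dir` stands in for whatever path you update the notebooks to use.

```python
def resolve(data_dir: str, name: str) -> str:
    """Join a base directory and a file name into a path Spark can read."""
    return f"{data_dir.rstrip('/')}/{name}"


def run_workflow(data_dir: str) -> int:
    """Load a CSV, query it with Spark SQL, cache the result, and write Parquet."""
    # Imported here so the path helper above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("workshop-sketch").getOrCreate()

    # Loading data: read a CSV with a header row and let Spark infer types.
    # "sales.csv" and its columns are assumed for illustration only.
    df = spark.read.csv(resolve(data_dir, "sales.csv"), header=True, inferSchema=True)

    # Spark SQL: register the DataFrame as a temporary view, then query it.
    df.createOrReplaceTempView("sales")
    totals = spark.sql(
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )

    # Caching: mark the result for reuse; the first action materialises it.
    totals.cache()
    row_count = totals.count()

    # Writing data: save as Parquet, replacing any previous output.
    totals.write.mode("overwrite").parquet(resolve(data_dir, "sales_totals.parquet"))

    spark.stop()
    return row_count
```

Calling `run_workflow("/path/to/data")` requires a local PySpark installation and a matching input file. Caching pays off only when the cached DataFrame is used more than once, as it is here by `count()` and then by the write.
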

Demonstrates the concept of "Tidy Data" with example Apache Spark code that tidies five common types of untidy data:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables.
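
As an illustration of the first case in the list above (column headers are values), the sketch below reshapes a wide table into a long one using Spark SQL's `stack` generator. It is one possible approach and not necessarily the one the notebooks use. The sample rows, the column names (`region`, `2019`, `2020`) and the helper `stack_expr` are made up for this example.

```python
def stack_expr(value_columns, key_name="variable", value_name="value"):
    """Build a Spark SQL stack() expression that unpivots the given columns."""
    # Each pair is: the header as a string literal, then the column itself.
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_columns)
    return f"stack({len(value_columns)}, {pairs}) AS ({key_name}, {value_name})"


def tidy_wide_table():
    """Turn year columns into rows: one (region, year, sales) row per value."""
    # Imported here so stack_expr can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tidy-sketch").getOrCreate()

    # Untidy input (invented data): the headers "2019" and "2020" are values
    # of a "year" variable, not variable names.
    wide = spark.createDataFrame(
        [("north", 10, 20), ("south", 30, 40)],
        ["region", "2019", "2020"],
    )

    # Tidy output: keep the identifier column and unpivot the year columns.
    tidy = wide.selectExpr("region", stack_expr(["2019", "2020"], "year", "sales"))

    rows = sorted(tuple(r) for r in tidy.collect())
    spark.stop()
    return rows
```

The columns passed to `stack` must share a compatible type, since they end up in a single value column. Spark 3.4 and later also provide `DataFrame.unpivot` for the same reshaping.
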