workshops-spark_intro

Introductory workshops for beginners in Apache Spark, using Python (PySpark) and SQL (Spark SQL). The repository includes Jupyter (IPYNB) notebooks and data.

Note: file paths in the notebooks must be updated to match your environment before running.

I - Intro

Covers core concepts for data analysis with Spark, including:

  • Loading data
  • Spark SQL & basic data transformations
  • Writing data
  • Caching data for performance
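
The four steps above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the notebook's actual code: the paths `data/example.csv` and `out/category_counts`, the view name `events`, and the `category` column are all hypothetical placeholders, and it assumes `pyspark` is installed with a local Spark runtime available.

```python
# Hypothetical query: assumes the input has a column named "category".
QUERY = """
    SELECT category, COUNT(*) AS n
    FROM events
    GROUP BY category
    ORDER BY n DESC
"""


def run_intro(spark, csv_path, out_dir):
    """Load a CSV, cache it, query it with Spark SQL, and write the result."""
    # Loading data: read a CSV with a header row, letting Spark infer types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Caching: mark the DataFrame for in-memory reuse. Caching is lazy, so
    # the data is only materialized by the first action (the write below).
    df.cache()

    # Spark SQL: register a temporary view so the data can be queried in SQL.
    df.createOrReplaceTempView("events")
    counts = spark.sql(QUERY)

    # Writing data: save the aggregated result as Parquet.
    counts.write.mode("overwrite").parquet(out_dir)

    # Release the cached data once it is no longer needed.
    df.unpersist()
    return counts


def main():
    # Imported here so the module can be read without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]")
        .appName("spark-intro-sketch")
        .getOrCreate()
    )
    try:
        # Placeholder paths: update these to match your environment.
        run_intro(spark, "data/example.csv", "out/category_counts")
    finally:
        spark.stop()
```

Call `main()` after updating the placeholder paths. Caching pays off only when the same DataFrame is reused across several actions; for a single pass like this one it is shown purely to illustrate the call.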

II - Tidy Data

Demonstrates the concept of "Tidy Data" with example code in Apache Spark, tidying five common types of untidy data:

  • Column headers are values, not variable names.
  • Multiple variables are stored in one column.
  • Variables are stored in both rows and columns.
  • Multiple types of observational units are stored in the same table.
  • A single observational unit is stored in multiple tables.
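
As an illustration of the first case (column headers are values, not variable names), one common approach in Spark is to unpivot the wide columns with the SQL `stack` generator. The sketch below is an assumption about how this could be done, not necessarily the notebook's method; the helper names `stack_expr` and `melt`, and the example column names, are hypothetical.

```python
def stack_expr(value_cols, key_name="variable", value_name="value"):
    """Build a Spark SQL `stack` expression that unpivots `value_cols`.

    Each column contributes a (label, value) pair; backticks allow column
    names that are not plain identifiers, such as "2019".
    """
    pairs = ", ".join(f"'{c}', `{c}`" for c in value_cols)
    return f"stack({len(value_cols)}, {pairs}) as ({key_name}, {value_name})"


def melt(df, id_cols, value_cols, key_name="variable", value_name="value"):
    """Turn wide columns into (key, value) rows, keeping `id_cols` as-is.

    `df` is a PySpark DataFrame. The columns in `value_cols` need compatible
    types, since `stack` places them all in a single output column.
    """
    return df.selectExpr(*id_cols, stack_expr(value_cols, key_name, value_name))


# The generated expression for two hypothetical year columns:
print(stack_expr(["2019", "2020"], key_name="year", value_name="sales"))
```

With a DataFrame `wide` holding hypothetical columns `region`, `2019`, and `2020`, calling `melt(wide, ["region"], ["2019", "2020"], "year", "sales")` would yield one row per region and year, with `year` as a variable rather than a header.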
