Learning Apache Spark by performing data preparation and aggregation on a CSV dataset of local climate data.

jkrajcir/LocalClimateData

Data Preparation and Aggregation of Local Climatological Data Dataset

Overview

The dataset includes data points of climatic values at monthly, daily, and hourly intervals.

The average, variance, and standard deviation were calculated for the data points in each interval. Monthly data points are grouped by year, daily data points by day of the week, and hourly data points by hour of the day.
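The grouping and aggregation described above can be illustrated with a minimal sketch in plain Python. It is not the Spark.NET code from this repository: the sample readings, the `aggregate_by_hour` helper, and the choice of sample (rather than population) variance are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev, variance

# Hypothetical hourly readings: (ISO timestamp, wind speed in miles per hour).
# These values are made up for illustration and are not taken from the dataset.
readings = [
    ("2020-01-01T00:15:00", 4.0),
    ("2020-01-02T00:45:00", 6.0),
    ("2020-01-01T13:05:00", 10.0),
    ("2020-01-02T13:30:00", 14.0),
    ("2020-01-03T13:55:00", 12.0),
]


def aggregate_by_hour(rows):
    """Group values by hour of the day and compute summary statistics."""
    groups = defaultdict(list)
    for timestamp, value in rows:
        groups[datetime.fromisoformat(timestamp).hour].append(value)

    # variance() and stdev() are the sample versions and need at least
    # two values per group; whether the original job used sample or
    # population statistics is an assumption here.
    return {
        hour: {
            "avg": mean(values),
            "variance": variance(values),
            "stddev": stdev(values),
        }
        for hour, values in sorted(groups.items())
    }


hourly_stats = aggregate_by_hour(readings)
for hour, stats in hourly_stats.items():
    print(hour, stats)
```

Grouping monthly data by year or daily data by day of the week would follow the same pattern, using `.year` or `.weekday()` on the parsed timestamp as the grouping key.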

All three intervals share several columns, such as:

  • Wind speed (miles per hour)
  • Minimum/Maximum temperature (Fahrenheit)
  • Atmospheric pressure (in Hg)
  • Precipitation (inches)

There are also several "outlier" columns that are specific to one interval:

  • Monthly: Days with Heavy Fog
  • Monthly: Days with Thunderstorms
  • Hourly: Relative Humidity (percentage)
  • Hourly: Visibility (miles)

Dataset and documentation:

Tech

The Spark.NET library is used to implement the Spark job that performs the data preparation and aggregation. The job runs on serverless Spark pools in Azure Synapse Analytics. The CSV dataset file and the resulting Parquet files are stored in Azure Data Lake Storage Gen2.

Implementation Steps Summary

  1. Write the CSV dataset to Parquet files
  2. Filter out unnecessary columns, including:
    • Columns that have no values or only one distinct value
    • Redundant columns
    • Raw data column
  3. Split the filtered DataFrame into three DataFrames based on report type:
    • Monthly summary DataFrame
    • Daily summary DataFrame
    • Hourly DataFrame
  4. Cast several columns and perform aggregation on each DataFrame; the results are listed under Results
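Steps 2 to 4 above can be sketched in plain Python on a list of dictionaries. This is an illustration of the idea only, not the repository's Spark.NET job: the column names (`station`, `report_type`, `wind_speed`, `unused`, `raw`), the report type values, and both helper functions are hypothetical, and step 1 (writing Parquet files) is omitted.

```python
from collections import defaultdict

# Hypothetical rows standing in for the CSV dataset; all names and values
# are invented for illustration.
rows = [
    {"station": "A1", "report_type": "monthly", "wind_speed": "7.5", "unused": "", "raw": "r1"},
    {"station": "A1", "report_type": "daily", "wind_speed": "9.0", "unused": "", "raw": "r2"},
    {"station": "A1", "report_type": "hourly", "wind_speed": "11.0", "unused": "", "raw": "r3"},
    {"station": "A1", "report_type": "hourly", "wind_speed": "13.0", "unused": "", "raw": "r4"},
]


def drop_uninformative_columns(rows, extra_drop=()):
    """Step 2: drop columns with zero or one distinct value, plus named extras."""
    columns = list(rows[0].keys()) if rows else []
    keep = []
    for col in columns:
        # Treat None and the empty string as missing values.
        distinct = {row[col] for row in rows if row[col] not in (None, "")}
        if len(distinct) > 1 and col not in extra_drop:
            keep.append(col)
    return [{col: row[col] for col in keep} for row in rows]


def split_by_report_type(rows, type_column):
    """Step 3: split rows into one list per report type value."""
    frames = defaultdict(list)
    for row in rows:
        frames[row[type_column]].append(row)
    return dict(frames)


# "raw" stands in for the raw data column, which is dropped explicitly.
filtered = drop_uninformative_columns(rows, extra_drop=("raw",))
frames = split_by_report_type(filtered, "report_type")

# Step 4: cast the string column to float before aggregating.
hourly_speeds = [float(row["wind_speed"]) for row in frames["hourly"]]
hourly_avg = sum(hourly_speeds) / len(hourly_speeds)
print(sorted(filtered[0].keys()), hourly_avg)
```

In a Spark job the same steps would typically be expressed as column selection, filters on the report type column, and casts followed by grouped aggregation, so that the work is distributed rather than done in local loops.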

Results (text files)

Other

  • "Home page" for U.S. Local Climatological Data.
  • Other climate data datasets provided by the NOAA can be found here.
