Your data IS your schema
This tiny library helps you avoid the excessive complexity of hand-written PySpark DataFrame schemas.
Shape your data as NamedTuple or dataclasses; the two can be mixed freely:
    from dataclasses import dataclass
    from tinsel import struct, transform
    from typing import NamedTuple, Optional, Dict, List

    @struct
    @dataclass
    class UserInfo:
        hobby: List[str]
        last_seen: Optional[int]
        pet_ages: Dict[str, int]

    @struct
    class User(NamedTuple):
        login: str
        age: int
        active: bool
        info: Optional[UserInfo]
Transform the root node (User in our case) into a schema:
    schema = transform(User)
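To make the idea concrete, here is a minimal standard-library sketch of how type hints can be walked to derive field types and nullability. It is not tinsel's implementation: the `describe` function, the `SCALARS` table, and the nested tuple/dict output format are all hypothetical, chosen only for illustration. The classes repeat the example above without the `@struct` decorator so the snippet runs without tinsel installed.

```python
from dataclasses import dataclass
from typing import (Dict, List, NamedTuple, Optional, Union,
                    get_args, get_origin, get_type_hints)

# Hypothetical mapping from Python scalar types to Spark type names.
SCALARS = {str: "string", int: "integer", bool: "boolean", float: "double"}


def describe(tp, nullable=False):
    """Return a (type_description, nullable) pair for a type hint."""
    origin, args = get_origin(tp), get_args(tp)
    # Optional[X] is Union[X, None]: unwrap it and mark the field nullable.
    if origin is Union and len(args) == 2 and type(None) in args:
        inner = args[0] if args[1] is type(None) else args[1]
        return describe(inner, nullable=True)
    if tp in SCALARS:
        return SCALARS[tp], nullable
    if origin is list:
        return {"array": describe(args[0])}, nullable
    if origin is dict:
        return {"map": (describe(args[0]), describe(args[1]))}, nullable
    # Anything else is assumed to be a record (NamedTuple or dataclass).
    fields = get_type_hints(tp)
    return {"struct": {name: describe(hint) for name, hint in fields.items()}}, nullable


@dataclass
class UserInfo:
    hobby: List[str]
    last_seen: Optional[int]
    pet_ages: Dict[str, int]


class User(NamedTuple):
    login: str
    age: int
    active: bool
    info: Optional[UserInfo]


description, root_nullable = describe(User)
print(description["struct"]["info"])
```

The point of the sketch is that `Optional[...]` is the only place nullability needs to be stated; every other field defaults to non-nullable, which matches the schema printed later in this README.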
Create some data, if necessary:
    data = [
        User(
            login="Ben",
            age=18,
            active=False,
            info=None
        ),
        User(
            login="Tom",
            age=32,
            active=True,
            info=UserInfo(
                hobby=["pets", "flowers"],
                last_seen=16,
                pet_ages={"Jack": 2, "Sunshine": 6}
            )
        )
    ]
And… voilà!:
    from pyspark.sql import SparkSession

    sc = SparkSession.builder.master('local').getOrCreate()
    df = sc.createDataFrame(data=data, schema=schema)
    df.printSchema()
    df.show(truncate=False)
This will output:
    root
     |-- login: string (nullable = false)
     |-- age: integer (nullable = false)
     |-- active: boolean (nullable = false)
     |-- info: struct (nullable = true)
     |    |-- hobby: array (nullable = false)
     |    |    |-- element: string (containsNull = false)
     |    |-- last_seen: integer (nullable = true)
     |    |-- pet_ages: map (nullable = false)
     |    |    |-- key: string
     |    |    |-- value: integer (valueContainsNull = false)

    +-----+---+------+----------------------------------------------+
    |login|age|active|info                                          |
    +-----+---+------+----------------------------------------------+
    |Ben  |18 |false |null                                          |
    |Tom  |32 |true  |[[pets, flowers],, [Jack -> 2, Sunshine -> 6]]|
    +-----+---+------+----------------------------------------------+
- use native Python types; no extra DSL, no cryptic API, just plain Python;
- small and fast;
- provide type shims for some types absent in Python, like long or short;
- nullable fields naturally fit into the schema definition.
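Python has a single `int` type, so a schema builder needs some extra marker to tell a Spark long from a short. One common way to express such a marker is `typing.NewType`, sketched below. The names `Long`, `Short`, `SPARK_NAMES`, and `Event` are hypothetical; this README does not show tinsel's actual shim names or import path, so treat this as an illustration of the idea rather than the library's API.

```python
from typing import NamedTuple, NewType, get_type_hints

# Hypothetical shims: distinct annotations that are still plain ints at runtime.
Long = NewType("Long", int)
Short = NewType("Short", int)

# Hypothetical lookup table from annotation to Spark type name.
SPARK_NAMES = {int: "integer", Long: "long", Short: "short"}


class Event(NamedTuple):
    id: Long
    retries: Short
    code: int


# Resolve each field's annotation to its Spark type name.
names = {field: SPARK_NAMES[hint] for field, hint in get_type_hints(Event).items()}
print(names)
```

Because `NewType` adds no runtime wrapper, `Long(5)` is just the integer `5`, so data built from such classes stays ordinary Python values.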
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.