dbt_synth_data
a dbt package for creating synthetic data


Data Engineering @

Approach


Two main methods for creating fake data:

disguise
  • de-identify real data
  • possibly "fuzz" some values
  • ⚠️ can be susceptible to re-identification

sculpting
  • start with nothing
  • synthesize data by describing it
  • 👍 safer, but more work
  • dbt_synth_data does this

motivation

Other tools for creating synthetic data already exist (Faker and Mimesis for Python, for example)... why build another?

  • cross-platform: write code once, run on
    Snowflake, Postgres, SQLite...
  • performance: Faker takes ~10 min to make 100K rows;
    dbt_synth_data on Snowflake makes 10B rows in 1.2 hrs

how it works

💡   most SQL dialects can generate() rows:

-- Snowflake:
select
  row_number() over (order by 1) as rownum
from table(generator( rowcount => 100 ));

-- Postgres:
select
  s.rownum as rownum
from generate_series( 1, 100 ) as s(rownum);
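
-- SQLite has no generator table function, but a recursive CTE
-- does the same job; a minimal sketch (not package output):
with recursive gen(rownum) as (
  select 1
  union all
  select rownum + 1 from gen where rownum < 100
)
select rownum from gen;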

produces...

| rownum |
|--------|
| 1 |
| 2 |
| 3 |
| … |
| 100 |

💡   most SQL dialects can generate random() numbers:

-- Snowflake:
select random();
--> random 64-bit int between BIGINT_MIN : BIGINT_MAX
select uniform(0::float, 1::float, random());
--> float between 0.0 : 1.0

-- Postgres:
select random();
--> float between 0.0 : 1.0

💡💡   put together, generate rows of randomness:

-- Snowflake:
select
  row_number() over (order by 1) as rownum,
  uniform(0::float, 1::float, random()) as randval
from table(generator( rowcount => 100 ))

-- Postgres:
select
  s.idx as rownum,
  random() as randval
from generate_series( 1, 100 ) as s(idx)

produces...

| rownum | randval |
|--------|---------------|
| 1 | 0.01736233267 |
| 2 | 0.8667546837 |
| 3 | 0.4433875307 |

yay!

📍 (non-uniform distributions are possible too)
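
for instance, a normal distribution can be derived from uniform randoms via the Box-Muller transform; a minimal Postgres sketch (illustrative, not the package's exact SQL):

-- Postgres (sketch): Box-Muller turns two uniform randoms into
-- one normally-distributed value (mean 0, stddev 1);
-- 1.0 - random() keeps ln() away from zero
select sqrt(-2 * ln(1.0 - random())) * cos(2 * pi() * random()) as normal_val
from generate_series(1, 100);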

Challenges

precise syntax varies by SQL dialect

-- Snowflake:
select
  uniform(0::float, 1::float, random()) as randval
from table(generator( rowcount => 100 ))

-- Postgres:
select
  random() as randval
from generate_series( 1, 100 ) as s(idx)

(come on! gif)

(nodding gif)

dbt to the rescue

dbt (data build tool)

  • used extensively by Data Engineering
  • compiles and runs SQL
  • write macros (functions) that produce SQL
  • abstract across different SQL dialects
{{ synth_distribution_continuous_uniform(min=0.0, max=10.0) }}

compiles to

uniform(0.0::float, 10.0::float, random())
-- when run on Snowflake

((10.0 - 0.0) * random()) + 0.0
-- when run on Postgres
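
this kind of dialect branching typically uses dbt's adapter.dispatch; a minimal sketch of how such a macro could look (bodies illustrative, not the package's exact source):

{% macro synth_distribution_continuous_uniform(min=0.0, max=10.0) %}
  {# dispatch resolves to the adapter-specific implementation below #}
  {{ return(adapter.dispatch('synth_distribution_continuous_uniform', 'dbt_synth_data')(min, max)) }}
{% endmacro %}

{% macro snowflake__synth_distribution_continuous_uniform(min, max) %}
  uniform({{ min }}::float, {{ max }}::float, random())
{% endmacro %}

{% macro postgres__synth_distribution_continuous_uniform(min, max) %}
  (({{ max }} - {{ min }}) * random()) + {{ min }}
{% endmacro %}

dbt picks the snowflake__ or postgres__ variant based on the active adapter.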

dbt_synth_data uses CTEs
(common table expressions)
extensively to deal with two problems:

problem 1: SQL engines "optimize away"
randomness inside subqueries

really?

-- grab 100 random schools by id
select
  ceil(uniform(100::float, 500::float, random())) as rand_school_id,
  (
    select k_school
    from db.schema.dim_school
    where school_id=ceil(uniform(100::float, 500::float, random()))
    limit 1
  ) as k_school
from table(generator( rowcount => 100 ))

the optimizer runs the inner subquery once,
and its result is reused for every row of the outer query!

produces...

| rand_school_id | k_school |
|----------------|----------|
| 165 | f8de717cfb7ed59631852d1c7f63bc70 |
| 456 | f8de717cfb7ed59631852d1c7f63bc70 |
| 339 | f8de717cfb7ed59631852d1c7f63bc70 |

oops!

CTEs fix this:

-- grab 100 random schools by id
with dim_school as (
  select school_id, k_school
  from db.schema.dim_school
),
base as (
  select
    ceil(uniform(100::float, 500::float, random())) as rand_school_id
  from table(generator( rowcount => 100 ))
)
select base.*, dim_school.k_school
from base
  join dim_school on base.rand_school_id=dim_school.school_id

produces...

| rand_school_id | k_school |
|----------------|----------|
| 474 | 23bd8cc78d10d8a3f6a9ec35caa38d33 |
| 171 | 80ed1896bc0fed93d078ac5126181f3d |
| 256 | 7b96ed1ecf4d0f6ff41ee65f698e7d69 |

much better

problem 2: some SQL dialects
don't support column reuse

select
  ceil(uniform(0::float, 10::float, random())) as randint,
  5 * randint as multint
from table(generator( rowcount => 100 ))
-- works in Snowflake, but not Postgres :(

CTEs fix this:

with step1 as (
  select
    ceil(uniform(0::float, 10::float, random())) as randint
  from table(generator( rowcount => 100 ))
),
step2 as (
  select
    randint,
    5 * randint as multint
  from step1
)
select * from step2
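
the same pattern runs on Postgres once the generator syntax is swapped; a sketch:

-- Postgres (sketch):
with step1 as (
  select ceil(10 * random()) as randint
  from generate_series(1, 100)
),
step2 as (
  select randint, 5 * randint as multint
  from step1
)
select * from step2;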

Features

dbt_synth_data puts all of this together:

  • implements cross-platform macros
  • construct complex distributions of randomness
  • provides many types of columns
  • provides seed data for names, cities, and more
  • builds up data using CTEs
{{ synth_distribution_continuous_normal(mean=0.0, stddev=1.0) }}

normal distribution

(histogram with 1M values)

{{ synth_distribution_continuous_exponential(lambda=0.1) }}

exponential distribution

(histogram with 1M values)

{{ synth_distribution_union(
  synth_distribution_continuous_normal(mean=5.0, stddev=1.0),
  synth_distribution_continuous_normal(mean=8.0, stddev=1.0),
  weights=[1, 2]
) }}

weighted union of two normal distributions

(histogram with 1M values)

{{ synth_distribution_average(
  synth_distribution_continuous_exponential(lambda=0.1),
  synth_distribution_continuous_normal(mean=2.0, stddev=1.0),
  weights=[1,4]
) }}

weighted average of an exponential and a normal distribution

(histogram with 1M values)

dbt_synth_data provides many column types

  • basic: boolean, int, int_sequence, numeric, date, date_sequence, primary_key, string, value, expression, mapping, values
  • statistical: distribution, correlation
  • referential: foreign_key, lookup, select
  • data: word, words, firstname, lastname, language, country, geo_region, city
  • composite: address, phone_number

see docs for full syntax and examples
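
for example, a distribution can drive a column (macro and parameter names here follow the package's naming pattern but are unverified; check the docs for exact syntax):

{{ synth_column_distribution(name='gpa',
    distribution=synth_distribution_continuous_normal(mean=3.0, stddev=0.5)) }}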

dim_customer:

with
{{ synth_column_primary_key(name='k_customer') }}
{{ synth_column_firstname(name='first_name') }}
{{ synth_column_lastname(name='last_name') }}
{{ synth_column_expression(name='full_name', expression="first_name || ' ' || last_name") }}
{{ synth_column_expression(name='sort_name', expression="last_name || ', ' || first_name") }}
{{ synth_column_date(name='birth_date', min='1938-01-01', max='1994-12-31') }}
{{ synth_column_address(name='shipping_address', countries=['United States'],
  parts=['street_address', 'city', 'geo_region_abbr', 'postal_code']) }}
{{ synth_column_phone_number(name='phone_number') }}
{{ synth_table(rows=100000) }}
select * from synth_table

(compiles into a list of CTEs that build up synthetic data)
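
as with any dbt model, building it is a normal dbt invocation (assuming the model file is named dim_customer.sql):

dbt run --select dim_customer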

produces...

| k_customer | first_name | last_name | full_name | sort_name | birth_date | shipping_address | phone_number |
|------------|------------|-----------|-----------|-----------|------------|------------------|--------------|
| 1de0...059a | Lucio | Ferguson | Lucio Ferguson | Ferguson, Lucio | 1986-04-24 | 3474 Plenty Ordinary Dr., Bryan, WI 93117 | (293) 959-0612 |
| d1be...ae12 | Lon | Cherry | Lon Cherry | Cherry, Lon | 1954-08-22 | 1752 Consumer Ln. #11, Columbus, SC 895 | (498) 173-0332 |
| c33b...0ef5 | Laverne | Villa | Laverne Villa | Villa, Laverne | 1951-07-13 | PO Box 514, Fremont, CA 20052 | (462) 191-0042 |
| 655a...79c9 | Carly | Ayala | Carly Ayala | Ayala, Carly | 1944-09-16 | 9684 Buildings Rd. #447, The Bronx, OR 81518 | (542) 746-0487 |
| 237a...57c7 | Tamika | Morales | Tamika Morales | Morales, Tamika | 1993-12-04 | 8666 Beauty Blvd. No. 748, Plymouth, AR 90491 | (961) 212-0923 |

performance

various distributions on xsmall Snowflake warehouse

| table | rows | cols | run time | data size |
|-------|------|------|----------|-----------|
| distributions | 10B | 15 | 81 min | 614.7 GB |

e-commerce store on xsmall Snowflake warehouse

| table | rows | cols | run time | data size |
|-------|------|------|----------|-----------|
| customers | 1M | 8 | ~20 secs | 53.0 MB |
| products | 50k | 3 | ~20 secs | 2.4 MB |
| stores | 20k | 5 | ~2 secs | 1.3 MB |
| orders | 50M | 4 | ~2 hrs | 1.0 GB |
| inventory | 100M | 4 | ~5.5 hrs | 2.5 GB |

usage

we are using dbt_synth_data to build
edu_edfi_source_synth: synthetic education data
which can feed into Stadium/EDU and Podium

(useful for demos, training, development)

dependency graph

(edu_edfi_source_synth model dependency graph)

(highlighted dependency graph)

We had to first create the universe of Courses, then Course Offerings, then Sections, so that their relationships mirror a realistic school system.

sample sections

Here's some example query output from the synthetic Ed-Fi warehouse!

thank you

Any Questions?