dbt_synth_data
a dbt package for creating synthetic data
Two main methods for creating fake data:
- de-identify real data, possibly "fuzzing" some values
- start with nothing; synthesize data by describing it
  👍 safer, but more work
dbt_synth_data does this.

Other tools (Faker for Python, Mimesis for Postgres) exist for creating synthetic data... why build another?
- cross-platform: write code once, run on Snowflake, Postgres, SQLite...
- performance: Faker: 10 min for 100K rows; Snowflake: 1.2 hrs for 10B rows
💡 most SQL dialects can generate rows:
-- Snowflake:
select
row_number() over (order by 1) as rownum
from table(generator( rowcount => 100 ));
-- Postgres:
select
s.rownum as rownum
from generate_series( 1, 100 ) as s(rownum);
produces...
rownum |
---|
1 |
2 |
3 |
⋮ |
100 |
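SQLite, another target platform, exposes no generator table function through Python's built-in sqlite3 driver, but a recursive CTE does the same job. A minimal sketch (the `gen` CTE name is arbitrary):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Recursive CTE: SQLite's equivalent of Snowflake's generator()
# or Postgres's generate_series() for producing N numbered rows.
rows = con.execute("""
    with recursive gen(rownum) as (
        select 1
        union all
        select rownum + 1 from gen where rownum < 100
    )
    select rownum from gen
""").fetchall()

print(len(rows))          # 100
print(rows[0], rows[-1])  # (1,) (100,)
```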
💡 most SQL dialects can generate random numbers:
-- Snowflake:
select random();
--> int between BIGINT_MIN : BIGINT_MAX
select uniform(0::float, 1::float, random());
--> float between 0.0 : 1.0
-- Postgres:
select random();
--> float between 0.0 : 1.0
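SQLite's random(), like Snowflake's, yields a signed 64-bit integer. A sketch of normalizing it onto [0, 1) the way Postgres's random() behaves (the divide-by-2^64-and-shift trick is this example's, not the package's):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# SQLite's random() returns a signed 64-bit integer; dividing by
# 2^64 maps it onto [-0.5, 0.5), and adding 0.5 shifts it to [0, 1).
val = con.execute(
    "select (random() / 18446744073709551616.0) + 0.5"
).fetchone()[0]

print(0.0 <= val < 1.0)  # True
```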
💡💡 put together, generate rows of randomness:
-- Snowflake:
select
row_number() over (order by 1) as rownum,
uniform(0::float, 1::float, random()) as randval
from table(generator( rowcount => 100 ))
-- Postgres:
select
s.idx as rownum,
random() as randval
from generate_series( 1, 100 ) as s(idx)
produces...
rownum | randval |
---|---|
1 | 0.01736233267 |
2 | 0.8667546837 |
3 | 0.4433875307 |
⋮ | ⋮ |
yay!
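For completeness, the SQLite analogue of the two queries above, combining the recursive-CTE row generator with a normalized random value (a sketch, not package code):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Numbered rows plus a uniform random value per row -- the SQLite
# analogue of the Snowflake/Postgres "rows of randomness" queries.
rows = con.execute("""
    with recursive gen(rownum) as (
        select 1 union all select rownum + 1 from gen where rownum < 100
    )
    select rownum, (random() / 18446744073709551616.0) + 0.5 as randval
    from gen
""").fetchall()

print(len(rows))                             # 100
print(all(0.0 <= r[1] < 1.0 for r in rows))  # True
```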
📍 (non-uniform distributions are possible too)
but the precise syntax varies by SQL dialect:
-- Snowflake:
select
uniform(0::float, 1::float, random()) as randval
from table(generator( rowcount => 100 ))
-- Postgres:
select
random() as randval
from generate_series( 1, 100 ) as s(idx)
dbt to the rescue
dbt (data build tool)
- used extensively by data engineers
- compiles and runs SQL
- write macros (functions) that produce SQL
- abstract across different SQL dialects
{{ synth_distribution_continuous_uniform(min=0.0, max=10.0) }}
compiles to
uniform(0.0::float, 10.0::float, random())
-- when run on Snowflake
((10.0 - 0.0) * random()) + 0.0
-- when run on Postgres
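Conceptually the macro is a function from a dialect plus parameters to a SQL fragment. A hypothetical Python sketch of that dispatch, mirroring the two compilations above (not the package's actual Jinja implementation):

```python
def synth_distribution_continuous_uniform(dialect: str, min: float = 0.0, max: float = 1.0) -> str:
    """Emit a dialect-specific SQL expression for a uniform draw on [min, max)."""
    if dialect == "snowflake":
        return f"uniform({min}::float, {max}::float, random())"
    elif dialect == "postgres":
        return f"(({max} - {min}) * random()) + {min}"
    raise ValueError(f"unsupported dialect: {dialect}")

print(synth_distribution_continuous_uniform("snowflake", 0.0, 10.0))
# uniform(0.0::float, 10.0::float, random())
print(synth_distribution_continuous_uniform("postgres", 0.0, 10.0))
# ((10.0 - 0.0) * random()) + 0.0
```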
dbt_synth_data uses CTEs (common table expressions) extensively to deal with two problems:
problem 1: SQL engines "optimize away"
randomness inside subqueries
-- grab 100 random schools by id
select
ceil(uniform(100::float, 500::float, random())) as rand_school_id,
(
select k_school
from db.schema.dim_school
where school_id=ceil(uniform(100::float, 500::float, random()))
limit 1
) as k_school
from table(generator( rowcount => 100 ))
optimizer runs inner subquery once,
result reused for every row of outer query!
produces...
rand_school_id | k_school |
---|---|
165 | f8de717cfb7ed59631852d1c7f63bc70 |
456 | f8de717cfb7ed59631852d1c7f63bc70 |
339 | f8de717cfb7ed59631852d1c7f63bc70 |
⋮ | ⋮ |
oops!
CTEs fix this:
-- grab 100 random schools by id
with dim_school as (
select school_id, k_school
from db.schema.dim_school
),
base as (
select
ceil(uniform(100::float, 500::float, random())) as rand_school_id
from table(generator( rowcount => 100 ))
)
select base.*, dim_school.k_school
from base
join dim_school on base.rand_school_id=dim_school.school_id
produces...
rand_school_id | k_school |
---|---|
474 | 23bd8cc78d10d8a3f6a9ec35caa38d33 |
171 | 80ed1896bc0fed93d078ac5126181f3d |
256 | 7b96ed1ecf4d0f6ff41ee65f698e7d69 |
⋮ | ⋮ |
much better
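The same CTE-plus-join shape runs on SQLite too; here is a toy version with a five-row dim_school (ids 100-104) invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table dim_school (school_id int, k_school text)")
con.executemany(
    "insert into dim_school values (?, ?)",
    [(100 + i, f"school-{i}") for i in range(5)],
)

# Randomness lives in its own CTE ("base"), then a plain join picks
# the matching dimension row -- nothing for an optimizer to collapse.
rows = con.execute("""
    with recursive gen(rownum) as (
        select 1 union all select rownum + 1 from gen where rownum < 100
    ),
    base as (
        select 100 + abs(random() % 5) as rand_school_id from gen
    )
    select base.rand_school_id, dim_school.k_school
    from base
    join dim_school on base.rand_school_id = dim_school.school_id
""").fetchall()

print(len(rows))  # 100
```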
problem 2: some SQL dialects
don't support column reuse
select
ceil(uniform(0::float, 10::float, random())) as randint,
5 * randint as multint
from table(generator( rowcount => 100 ))
-- works in Snowflake, but not Postgres :(
CTEs fix this:
with step1 as (
select
ceil(uniform(0::float, 10::float, random())) as randint
from table(generator( rowcount => 100 ))
),
step2 as (
select
randint,
5 * randint as multint
from step1
)
select * from step2
dbt_synth_data puts all of this together:
- implements cross-platform macros
- construct complex distributions of randomness
- provides many types of columns
- provides seed data for names, cities, and more
- builds up data using CTEs
{{ synth_distribution_continuous_normal(mean=0.0, stddev=1.0) }}
(histogram with 1M values)
{{ synth_distribution_continuous_exponential(lambda=0.1) }}
(histogram with 1M values)
{{ synth_distribution_union(
synth_distribution_continuous_normal(mean=5.0, stddev=1.0),
synth_distribution_continuous_normal(mean=8.0, stddev=1.0),
weights=[1, 2]
) }}
(histogram with 1M values)
{{ synth_distribution_average(
synth_distribution_continuous_exponential(lambda=0.1),
synth_distribution_continuous_normal(mean=2.0, stddev=1.0),
weights=[1,4]
) }}
(histogram with 1M values)
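A union distribution reads as: choose a component with probability proportional to its weight, then sample from it. A hypothetical Python rendering of that semantics for the two-normal example above (not the macro's generated SQL):

```python
import random as rnd

def sample_union(samplers, weights):
    """Pick one component sampler by weight, then draw from it."""
    chosen = rnd.choices(samplers, weights=weights, k=1)[0]
    return chosen()

samplers = [
    lambda: rnd.gauss(5.0, 1.0),  # N(5, 1)
    lambda: rnd.gauss(8.0, 1.0),  # N(8, 1)
]
values = [sample_union(samplers, weights=[1, 2]) for _ in range(100_000)]

# With weights [1, 2], about two thirds of draws come from N(8, 1),
# so the sample mean lands near (1*5 + 2*8) / 3 = 7.
print(round(sum(values) / len(values)))  # 7
```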
dbt_synth_data provides many column types:
- basic: boolean, int, int_sequence, numeric, date, date_sequence, primary_key, string, value, expression, mapping, values
- statistical: distribution, correlation
- referential: foreign_key, lookup, select
- data: word, words, firstname, lastname, language, country, geo_region, city
- composite: address, phone_number
see docs for full syntax and examples
dim_customer:
with
{{ synth_column_primary_key(name='k_customer') }}
{{ synth_column_firstname(name='first_name') }}
{{ synth_column_lastname(name='last_name') }}
{{ synth_column_expression(name='full_name', expression="first_name || ' ' || last_name") }}
{{ synth_column_expression(name='sort_name', expression="last_name || ', ' || first_name") }}
{{ synth_column_date(name='birth_date', min='1938-01-01', max='1994-12-31') }}
{{ synth_column_address(name='shipping_address', countries=['United States'],
parts=['street_address', 'city', 'geo_region_abbr', 'postal_code']) }}
{{ synth_column_phone_number(name='phone_number') }}
{{ synth_table(rows=100000) }}
select * from synth_table
(compiles into a list of CTEs that build up synthetic data)
produces...
k_customer | first_name | last_name | full_name | sort_name | birth_date | shipping_address | phone_number |
---|---|---|---|---|---|---|---|
1de0...059a | Lucio | Ferguson | Lucio Ferguson | Ferguson, Lucio | 1986-04-24 | 3474 Plenty Ordinary Dr., Bryan, WI 93117 | (293) 959-0612 |
d1be...ae12 | Lon | Cherry | Lon Cherry | Cherry, Lon | 1954-08-22 | 1752 Consumer Ln. #11, Columbus, SC 895 | (498) 173-0332 |
c33b...0ef5 | Laverne | Villa | Laverne Villa | Villa, Laverne | 1951-07-13 | PO Box 514, Fremont, CA 20052 | (462) 191-0042 |
655a...79c9 | Carly | Ayala | Carly Ayala | Ayala, Carly | 1944-09-16 | 9684 Buildings Rd. #447, The Bronx, OR 81518 | (542) 746-0487 |
237a...57c7 | Tamika | Morales | Tamika Morales | Morales, Tamika | 1993-12-04 | 8666 Beauty Blvd. No. 748, Plymouth, AR 90491 | (961) 212-0923 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
various distributions on xsmall Snowflake warehouse
table | rows | cols | run time | data size |
---|---|---|---|---|
distributions | 10B | 15 | 81 min | 614.7 GB |
e-commerce store on xsmall Snowflake warehouse
table | rows | cols | run time | data size |
---|---|---|---|---|
customers | 1M | 8 | ~20 secs | 53.0 MB |
products | 50k | 3 | ~20 secs | 2.4 MB |
stores | 20k | 5 | ~2 secs | 1.3 MB |
orders | 50M | 4 | ~2 hrs | 1.0 GB |
inventory | 100M | 4 | ~5.5 hrs | 2.5 GB |
we are using dbt_synth_data to build edu_edfi_source_synth: synthetic education data,
which can feed into Stadium/EDU and Podium
(useful for demos, training, development)
(edu_edfi_source_synth model dependency graph)
We had to first create the universe of Courses, then Course Offerings, then Sections, so that their relationships mirror a realistic school system.
Here's an example query output from the synthetic Ed-Fi warehouse!
Any Questions?