dbt_synth_data
a dbt package for creating synthetic data
Two main methods for creating fake data:
- de-identify real data, possibly "fuzzing" some values
- start with nothing; synthesize data by describing it
  👍 safer, but more work
dbt_synth_data does this.

Other tools (Faker for Python, Mimesis for Postgres) exist for creating synthetic data... why build another?
- cross-platform: write code once, run on Snowflake, Postgres, SQLite...
- performance: Faker: 10 min for 100K rows; Snowflake: 1.2 hrs for 10B rows
💡 most SQL dialects can generate rows:
-- Snowflake:
select
row_number() over (order by 1) as rownum
from table(generator( rowcount => 100 ));
-- Postgres:
select
s.rownum as rownum
from generate_series( 1, 100 ) as s(rownum);
produces...
rownum |
---|
1 |
2 |
3 |
⋮ |
100 |
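SQLite, another target platform, exposes no generator table function through Python's built-in sqlite3 driver, but a recursive CTE does the same job. A minimal sketch (the `gen` CTE name is arbitrary):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Recursive CTE: SQLite's equivalent of Snowflake's generator()
# or Postgres's generate_series() for producing N numbered rows.
rows = con.execute("""
    with recursive gen(rownum) as (
        select 1
        union all
        select rownum + 1 from gen where rownum < 100
    )
    select rownum from gen
""").fetchall()

print(len(rows))          # 100
print(rows[0], rows[-1])  # (1,) (100,)
```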
💡 most SQL dialects can generate random numbers:
-- Snowflake:
select random();
--> int between BIGINT_MIN : BIGINT_MAX
select uniform(0::float, 1::float, random());
--> float between 0.0 : 1.0
-- Postgres:
select random();
--> float between 0.0 : 1.0
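SQLite's random(), like Snowflake's, yields a signed 64-bit integer. A sketch of normalizing it onto [0, 1) the way Postgres's random() behaves (the divide-by-2^64-and-shift trick is this example's, not the package's):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# SQLite's random() returns a signed 64-bit integer; dividing by
# 2^64 maps it onto [-0.5, 0.5), and adding 0.5 shifts it to [0, 1).
val = con.execute(
    "select (random() / 18446744073709551616.0) + 0.5"
).fetchone()[0]

print(0.0 <= val < 1.0)  # True
```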
💡💡 put together, generate rows of randomness:
-- Snowflake:
select
row_number() over (order by 1) as rownum,
uniform(0::float, 1::float, random()) as randval
from table(generator( rowcount => 100 ))
-- Postgres:
select
s.idx as rownum,
random() as randval
from generate_series( 1, 100 ) as s(idx)
produces...
rownum | randval |
---|---|
1 | 0.01736233267 |
2 | 0.8667546837 |
3 | 0.4433875307 |
⋮ | ⋮ |
yay!
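For completeness, the SQLite analogue of the two queries above, combining the recursive-CTE row generator with a normalized random value (a sketch, not package code):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Numbered rows plus a uniform random value per row -- the SQLite
# analogue of the Snowflake/Postgres "rows of randomness" queries.
rows = con.execute("""
    with recursive gen(rownum) as (
        select 1 union all select rownum + 1 from gen where rownum < 100
    )
    select rownum, (random() / 18446744073709551616.0) + 0.5 as randval
    from gen
""").fetchall()

print(len(rows))                             # 100
print(all(0.0 <= r[1] < 1.0 for r in rows))  # True
```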
📍 (non-uniform distributions are possible too)
but the precise syntax varies by SQL dialect:
-- Snowflake:
select
uniform(0::float, 1::float, random()) as randval
from table(generator( rowcount => 100 ))
-- Postgres:
select
random() as randval
from generate_series( 1, 100 ) as s(idx)
dbt to the rescue
dbt (data build tool)
- used extensively by data engineers
- compiles and runs SQL
- write macros (functions) that produce SQL
- abstract across different SQL dialects
{{ synth_distribution_continuous_uniform(min=0.0, max=10.0) }}
compiles to
uniform(0.0::float, 10.0::float, random())
-- when run on Snowflake
((10.0 - 0.0) * random()) + 0.0
-- when run on Postgres
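Conceptually the macro is a function from a dialect plus parameters to a SQL fragment. A hypothetical Python sketch of that dispatch, mirroring the two compilations above (not the package's actual Jinja implementation):

```python
def synth_distribution_continuous_uniform(dialect: str, min: float = 0.0, max: float = 1.0) -> str:
    """Emit a dialect-specific SQL expression for a uniform draw on [min, max)."""
    if dialect == "snowflake":
        return f"uniform({min}::float, {max}::float, random())"
    elif dialect == "postgres":
        return f"(({max} - {min}) * random()) + {min}"
    raise ValueError(f"unsupported dialect: {dialect}")

print(synth_distribution_continuous_uniform("snowflake", 0.0, 10.0))
# uniform(0.0::float, 10.0::float, random())
print(synth_distribution_continuous_uniform("postgres", 0.0, 10.0))
# ((10.0 - 0.0) * random()) + 0.0
```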
dbt_synth_data uses CTEs (common table expressions) extensively to deal with two problems:
problem 1: SQL engines "optimize away"
randomness inside subqueries
-- grab 100 random schools by id
select
ceil(uniform(100::float, 500::float, random())) as rand_school_id,
(
select k_school
from db.schema.dim_school
where school_id=ceil(uniform(100::float, 500::float, random()))
limit 1
) as k_school
from table(generator( rowcount => 100 ))
optimizer runs inner subquery once,
result reused for every row of outer query!
produces...
rand_school_id | k_school |
---|---|
165 | f8de717cfb7ed59631852d1c7f63bc70 |
456 | f8de717cfb7ed59631852d1c7f63bc70 |
339 | f8de717cfb7ed59631852d1c7f63bc70 |
⋮ | ⋮ |
oops!
CTEs fix this:
-- grab 100 random schools by id
with dim_school as (
select school_id, k_school
from db.schema.dim_school
),
base as (
select
ceil(uniform(100::float, 500::float, random())) as rand_school_id
from table(generator( rowcount => 100 ))
)
select base.*, dim_school.k_school
from base
join dim_school on base.rand_school_id=dim_school.school_id
produces...
rand_school_id | k_school |
---|---|
474 | 23bd8cc78d10d8a3f6a9ec35caa38d33 |
171 | 80ed1896bc0fed93d078ac5126181f3d |
256 | 7b96ed1ecf4d0f6ff41ee65f698e7d69 |
⋮ | ⋮ |
much better
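The same CTE-plus-join shape runs on SQLite too; here is a toy version with a five-row dim_school (ids 100-104) invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table dim_school (school_id int, k_school text)")
con.executemany(
    "insert into dim_school values (?, ?)",
    [(100 + i, f"school-{i}") for i in range(5)],
)

# Randomness lives in its own CTE ("base"), then a plain join picks
# the matching dimension row -- nothing for an optimizer to collapse.
rows = con.execute("""
    with recursive gen(rownum) as (
        select 1 union all select rownum + 1 from gen where rownum < 100
    ),
    base as (
        select 100 + abs(random() % 5) as rand_school_id from gen
    )
    select base.rand_school_id, dim_school.k_school
    from base
    join dim_school on base.rand_school_id = dim_school.school_id
""").fetchall()

print(len(rows))  # 100
```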
problem 2: some SQL dialects
don't support column reuse
select
ceil(uniform(0::float, 10::float, random())) as randint,
5 * randint as multint
from table(generator( rowcount => 100 ))
-- works in Snowflake, but not Postgres :(
CTEs fix this:
with step1 as (
select
ceil(uniform(0::float, 10::float, random())) as randint
from table(generator( rowcount => 100 ))
),
step2 as (
select
randint,
5 * randint as multint
from step1
)
select * from step2
dbt_synth_data puts all of this together:
- implements cross-platform macros
- construct complex distributions of randomness
- provides many types of columns
- provides seed data for names, cities, and more
- builds up data using CTEs
{{ synth_distribution_continuous_normal(mean=0.0, stddev=1.0) }}
(histogram with 1M values)
{{ synth_distribution_continuous_exponential(lambda=0.1) }}
(histogram with 1M values)
{{ synth_distribution_union(
synth_distribution_continuous_normal(mean=5.0, stddev=1.0),
synth_distribution_continuous_normal(mean=8.0, stddev=1.0),
weights=[1, 2]
) }}
(histogram with 1M values)
{{ synth_distribution_average(
synth_distribution_continuous_exponential(lambda=0.1),
synth_distribution_continuous_normal(mean=2.0, stddev=1.0),
weights=[1,4]
) }}
(histogram with 1M values)
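A union distribution reads as: choose a component with probability proportional to its weight, then sample from it. A hypothetical Python rendering of that semantics for the two-normal example above (not the macro's generated SQL):

```python
import random as rnd

def sample_union(samplers, weights):
    """Pick one component sampler by weight, then draw from it."""
    chosen = rnd.choices(samplers, weights=weights, k=1)[0]
    return chosen()

samplers = [
    lambda: rnd.gauss(5.0, 1.0),  # N(5, 1)
    lambda: rnd.gauss(8.0, 1.0),  # N(8, 1)
]
values = [sample_union(samplers, weights=[1, 2]) for _ in range(100_000)]

# With weights [1, 2], about two thirds of draws come from N(8, 1),
# so the sample mean lands near (1*5 + 2*8) / 3 = 7.
print(round(sum(values) / len(values)))  # 7
```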
dbt_synth_data provides many column types:
- basic: boolean, int, int_sequence, numeric, date, date_sequence, primary_key, string, value, expression, mapping, values
- statistical: distribution, correlation
- referential: foreign_key, lookup, select
- data: word, words, firstname, lastname, language, country, geo_region, city
- composite: address, phone_number
see docs for full syntax and examples
dim_customer:
with
{{ synth_column_primary_key(name='k_customer') }}
{{ synth_column_firstname(name='first_name') }}
{{ synth_column_lastname(name='last_name') }}
{{ synth_column_expression(name='full_name', expression="first_name || ' ' || last_name") }}
{{ synth_column_expression(name='sort_name', expression="last_name || ', ' || first_name") }}
{{ synth_column_date(name='birth_date', min='1938-01-01', max='1994-12-31') }}
{{ synth_column_address(name='shipping_address', countries=['United States'],
parts=['street_address', 'city', 'geo_region_abbr', 'postal_code']) }}
{{ synth_column_phone_number(name='phone_number') }}
{{ synth_table(rows=100000) }}
select * from synth_table
(compiles into a list of CTEs that build up synthetic data)
produces...
k_customer | first_name | last_name | full_name | sort_name | birth_date | shipping_address | phone_number |
---|---|---|---|---|---|---|---|
1de0...059a | Lucio | Ferguson | Lucio Ferguson | Ferguson, Lucio | 1986-04-24 | 3474 Plenty Ordinary Dr., Bryan, WI 93117 | (293) 959-0612 |
d1be...ae12 | Lon | Cherry | Lon Cherry | Cherry, Lon | 1954-08-22 | 1752 Consumer Ln. #11, Columbus, SC 895 | (498) 173-0332 |
c33b...0ef5 | Laverne | Villa | Laverne Villa | Villa, Laverne | 1951-07-13 | PO Box 514, Fremont, CA 20052 | (462) 191-0042 |
655a...79c9 | Carly | Ayala | Carly Ayala | Ayala, Carly | 1944-09-16 | 9684 Buildings Rd. #447, The Bronx, OR 81518 | (542) 746-0487 |
237a...57c7 | Tamika | Morales | Tamika Morales | Morales, Tamika | 1993-12-04 | 8666 Beauty Blvd. No. 748, Plymouth, AR 90491 | (961) 212-0923 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
various distributions on xsmall Snowflake warehouse
table | rows | cols | run time | data size |
---|---|---|---|---|
distributions | 10B | 15 | 81 min | 614.7 GB |
e-commerce store on xsmall Snowflake warehouse
table | rows | cols | run time | data size |
---|---|---|---|---|
customers | 1M | 8 | ~20 secs | 53.0 MB |
products | 50k | 3 | ~20 secs | 2.4 MB |
stores | 20k | 5 | ~2 secs | 1.3 MB |
orders | 50M | 4 | ~2 hrs | 1.0 GB |
inventory | 100M | 4 | ~5.5 hrs | 2.5 GB |
we are using dbt_synth_data to build edu_edfi_source_synth: synthetic education data,
which can feed into Stadium/EDU and Podium
(useful for demos, training, development)
(edu_edfi_source_synth model dependency graph)
We had to first create the universe of Courses, then Course Offerings, then Sections, so that their relationships mirror a realistic school system.
Here's an example query output from the synthetic Ed-Fi warehouse!
Any Questions?