Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-2967] [Feature] run dbt in --sample mode #8378

Open
3 tasks done
MichelleArk opened this issue Aug 12, 2023 · 12 comments
Open
3 tasks done

[CT-2967] [Feature] run dbt in --sample mode #8378

MichelleArk opened this issue Aug 12, 2023 · 12 comments
Labels
enhancement New feature or request Refinement Maintainer input needed

Comments

@MichelleArk
Copy link
Contributor

MichelleArk commented Aug 12, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Introduce a new boolean flag, --sample/--no-sample, that informs whether dbt should run in 'sample' mode.

In sample mode, dbt would resolve ref and source identifiers as select * from <identifier> limit <sample-size>.

A default sample size could be used (perhaps 100?), or set in config at the model-level to inform the size of the sample to select when referencing the model.

models:
  - name: my_model
     config: 
       sample: 
         size: 10

This would also enable setting a sample size using the hierarchical config syntax in dbt_project.yml:

--- dbt_project.yml
models: 
  <resource-path>:
    + sample: {<dictionary>}

It would also be interesting to explore providing column-level filters as configs, so that samples would resolve as select * from <identifier> where <filters> limit <sample-size>. This way, it would be possible to tune samples to ensure referential integrity within a run in sample mode.

We may also want to create an equivalent env_var, so folks could set a sample for a given environment

Describe alternatives you've considered

It's possible to implement something like this with overrides of ref & source macros, but that makes dbt upgrades more difficult.

Also considered a simpler version of this where --sample is an integer argument representing the sample size in #8337. This doesn't leave design space for configuring filters on model/source samples however.

Who will this benefit?

any dbt user looking for speedier (at the cost of completeness/accuracy) development & CI runs!

Are you interested in contributing this feature?

yep!

Anything else?

Took a first pass at this in a spike here: #8337

@MichelleArk MichelleArk added enhancement New feature or request triage labels Aug 12, 2023
@github-actions github-actions bot changed the title [Feature] run dbt in --sample mode [CT-2967] [Feature] run dbt in --sample mode Aug 12, 2023
@jaypeedevlin
Copy link
Contributor

Hey @MichelleArk, excited about this enhancement, as it intersects with some planned work that we were going to execute using ref/source overrides!

We definitely would love to see this in a really configurable form, like the where config you've suggested, or even additionally an order by (ie order by random()). Essentially, the more customizable the better from our perspective!

@jtcohen6 jtcohen6 removed the triage label Aug 14, 2023
@jtcohen6
Copy link
Contributor

jtcohen6 commented Aug 14, 2023

Thanks for opening @MichelleArk! :)

I love how simple the --sample boolean flag is, with all the actual configuration defined in code.

Question: Is this sufficient in practice? Will users need the ability to define multiple different sampling strategies for the same models, to run in different environment (dev vs. CI vs. QA vs. ...)?

If so: Should that look like lots of vars / env vars within the sample config block? Or, should it look like different "named" sample strategies that can be referenced on the CLI?


I'd offer a few more arguments in favor of supporting column-specific filters:

  • Pick a few representative segments (customers/regions/etc) in development & CI → sense-check in downstream dashboarding ("last-month revenue for Bouches-du-Rhône isn't looking quite right")
  • For time-series data, limit based on recency (e.g. "last 3 days"), rather than data scattered across time
  • Benefit from partition pruning on DWHs (like BQ) where that's the only sure way to achieve better time/cost savings

This could look like, defining additional attributes on columns in a model's properties:

# models/properties.yml
models:
  - name: my_model
    columns:
      - name: customer_id
        sample: [10, 30]  # array -> 'in' filter
      - name: created_at
        sample: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}"  # string -> filter as is

Then, anywhere you write:

-- models/another_model.sql
with my_model as (
    select * from {{ ref('my_model') }}
), ...

dbt would actually resolve that as a subquery like:

-- models/another_model.sql
with my_model as (
    select * from (
       select * from <database>.<schema>.my_model
       where customer_id in (10, 30)
          and created_at > '2023-08-11'  # writing this on Aug 14
    ) subq
), ...

Or, it could also be a config that's set for many models at once (or even every model in the project):

# dbt_project.yml
models:
  my_project:
    +sample:
      customer_id: [10, 30]  # array -> 'in' filter
      created_at: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}"  # string -> filter as is

The latter option would require dbt to check the set of columns first, to see if fields named customer_id and created_at actually exist (case-insensitive of course). As dbt is templating out each source / ref call, it would figure out if the model/source being referenced has those columns. I think we could do that in one of two ways:

  • If those models have enforced contracts → save an introspective DWH query
  • If those models already exist in the development schema (or defer to prod) → introspective DWH query

Bonus considerations:

  • Making sure this works for tests. For generic tests, it's like a default set of where configs—maybe even better!—but the two features should work happily together.
  • Making sure this works for Python models (!), where the filters might need to be applied via filter(...).filter(...) (snowpark, pyspark, pandas) instead of where ... and ....

@jaypeedevlin I'm happy to hear that this piques your interest! I'm curious - what would be your motivation for order by random()? I'm thinking it's the exact opposite of what I'm describing above: give me a truly random smattering of data, which is unlikely to match up with other models in downstream joins / dashboarding.

@jtcohen6 jtcohen6 added the Refinement Maintainer input needed label Aug 14, 2023
@QuentinCoviaux
Copy link

Love this topic!

I also agree that column-specific filters predicates might be more interesting for consistency across models or even just on its own. If you have years of data in your model, I feel it might be more relevant to get everything from the past month rather than random data points scattered - although probably depends on the use cases and what you are trying to test!

I'm wondering if we could customize where to apply this filter though. As I understand it, if one sets this in a .yml file then I assume it would be executed at the end of the model whereas it might be better performance-wise to run it early - especially if you have big tables and a lot of transformations in a model.

We have been doing something similar, on the first group of CTEs of certain models where we simply select from source or ref, we added this block:

select * from {{ ref('stage') }}
{% if target.name != 'prod' and var("dev_load_full") != true %}
where to_date("timestamp") >= dateadd('day', -{{ var("dev_load_days") }}, current_date())
{% endif %}

And we define those variables in the dbt_project.yml as default:

vars:
  dev_load_full: false
  dev_load_days: 7

This way we can sample the data early in the query rather than later.

@jaypeedevlin
Copy link
Contributor

@jtcohen6 that was a reasonably arbitrary example of truly random sampling. Most of our models are incremental so we'd be (in theory) randomly sampling just the most recent data anyway. Definitely not the most sophisticated method for my example though, I'll grant you that, but just arguing to make this as configurable as possible.

@graciegoheen
Copy link
Contributor

graciegoheen commented Aug 23, 2023

Really love the conversation going here - I have personally written a handful of hack-y work-arounds to accomplish this exact thing, so would be excited to see a native approach in dbt-core, especially for folks trying to optimize time and warehouse spend! Curious about the name "sample" vs. something like "limit" - I know we already have a limit config for test outputs and a limit for the dbt show command. Thinking more about it, I like that the name "sample" is distinguished from "limiting the outputs" (which it's not!).

I can also imagine folks wanting to configure different samples per environment, which is something we could enable using environment variables.

# dbt_project.yml
models:
  my_project:
    +sample: {{ env_var('sample') }}

I'm definitely in favor of folks being able to define a filter, which we recommend in our best practices guide and has come up in a handful of solutions.

@QuentinCoviaux my understanding here is that running dbt in --sample mode actually affects what {{ ref() }} and {{ source() }} compile to, rather than adding a limit at the end of a model. So, for example, if we had a model with many CTEs:

with customers as (

    select * from {{ ref('stg_customers') }}

),

orders as (

    select * from {{ ref('stg_orders') }}

),

payments as (

    select * from {{ ref('stg_payments') }}

),

customer_orders as (

        select
        customer_id,

        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders

    group by customer_id

),

customer_payments as (

    select
        orders.customer_id,
        sum(amount) as total_amount

    from payments

    left join orders on
         payments.order_id = orders.order_id

    group by orders.customer_id

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value

    from customers

    left join customer_orders
        on customers.customer_id = customer_orders.customer_id

    left join customer_payments
        on  customers.customer_id = customer_payments.customer_id

)

select * from final

when executing in --sample mode, the {{ ref() }} statements here would compile to their raw location plus their configured limits. So, if we had configured samples as such:

models:
  - name: stg_customers
     config: 
       sample: 
         size: 10
  - name: stg_orders
     config: 
       sample: 
         size: 50
  - name: stg_payments
     config: 
       sample: 
         size: 75

the compiled code would look something like this:

with customers as (

    select * from my_warehouse.my_schema.stg_customers limit 10

),

orders as (

    select * from my_warehouse.my_schema.stg_orders limit 50

),

payments as (

    select * from my_warehouse.my_schema.stg_payments limit 75

),

customer_orders as (

        select
        customer_id,

        min(order_date) as first_order,
        max(order_date) as most_recent_order,
        count(order_id) as number_of_orders
    from orders

    group by customer_id

),

customer_payments as (

    select
        orders.customer_id,
        sum(amount) as total_amount

    from payments

    left join orders on
         payments.order_id = orders.order_id

    group by orders.customer_id

),

final as (

    select
        customers.customer_id,
        customers.first_name,
        customers.last_name,
        customer_orders.first_order,
        customer_orders.most_recent_order,
        customer_orders.number_of_orders,
        customer_payments.total_amount as customer_lifetime_value

    from customers

    left join customer_orders
        on customers.customer_id = customer_orders.customer_id

    left join customer_payments
        on  customers.customer_id = customer_payments.customer_id

)

select * from final

In that case, we would actually be filtering as early as possible.

I have seen this sort of problem solved in 2 ways:

  1. overriding the ref & source macros (this solution)
  2. adding a limit macro to the end of relevant model (there's lots of old discussions on this)

I'd be curious @jtcohen6 and @MichelleArk for y'all's thoughts on the pros / cons of each approach - and why "sampling" is better than "limiting".

@dbeatty10
Copy link
Contributor

Insights about sampling

"Why not just take samples?"

Sampling is actually pretty hard. Data developers will sometimes use sample-ish hacks (like testing with the last 3 days’ worth of data, or using a sample function built into the database) but these can often come back to bite you. Samples need to take into account referential integrity and unevenly distributed data if they are to be any use when testing.

One solution

SnowShu is a sampling engine designed to support testing in data development.

Here's one solution for sampling by Health Union:

Caveat: I haven't tried it out personally.

Interesting links

Here's some other interesting links from Health Union:

@graciegoheen
Copy link
Contributor

It would also be interesting to explore providing column-level filters as configs, so that samples would resolve as select * from where limit . This way, it would be possible to tune samples to ensure referential integrity within a run in sample mode.

Just adding a +1 here to being able to sample via adding a filter!

@aranke
Copy link
Member

aranke commented Nov 15, 2023

Great discussion here, just wanted to add my 2c stemming from a conversation with @dbeatty10:

Could we sample with an optional join_key config?

Here's an example of how I could see this working:

  1. Sample selects n random rows from the customers table
  2. join_key on transactions table allows us to find all transactions for those customers instead of randomly sampling from customers

This could be a natural extension to the work in #8652 (unit-test rows from seeds) to enable unit-testing on sample rows from models.

To wrap it all up, we could create a new “hybrid” test that takes this sample as an input but checks for data assertions as output (e.g., check that customer status is churned if they haven't made a purchase in the last 12 months).

This could help alleviate mocking fatigue in unit testing for fct/dim tables that depend on dozens of int/stg tables.

@moseleyi
Copy link

Sampling sounds like a great idea! I would also second the ask for full customisation of this configuration. Not just limit but ordering and even using a SQL filtering for it. For example in BigQuery, I would use one partition for example. However it could be expanded with time. Having it to begin with is a great start!

@graciegoheen graciegoheen removed this from the v1.9 milestone Aug 27, 2024
@joellabes
Copy link
Contributor

Bonus considerations:

  • Making sure this works for tests. For generic tests, it's like a default set of where configs—maybe even better!—but the two features should work happily together.

Just checking whether this will be included? (🥺)

@graciegoheen
Copy link
Contributor

Making sure this works for tests. For generic tests, it's like a default set of where configs—maybe even better!—but the two features should work happily together.

@joellabes @jtcohen6 Does this mean, a data test for a sampled model should only run on the sample of data? That seems easy, given that the data test will run on the filtered table that's been created in the warehouse.

Or, does this mean -> #10877

@joellabes
Copy link
Contributor

Does this mean, a data test for a sampled model should only run on the sample of data?

Yes this is what I meant! I was thinking about dbt test --select the_full_version_of_my_big_model --sample which defers to production. Would the expectation here be that it would bring through the subset of rows defined on the_full_version_of_my_big_model's sample dictionary and give them all to the dbt test? That sounds great to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Refinement Maintainer input needed
Projects
None yet
Development

No branches or pull requests

10 participants