[CT-2967] [Feature] run dbt in --sample mode #8378
Comments
Hey @MichelleArk, excited about this enhancement, as it intersects with some planned work that we were going to execute using ref/source overrides! We definitely would love to see this in a really configurable form, like the |
Thanks for opening @MichelleArk! :) I love how simple the
Question: Is this sufficient in practice? Will users need the ability to define multiple different sampling strategies for the same models, to run in different environments (dev vs. CI vs. QA vs. ...)? If so: Should that look like lots of vars / env vars within the
I'd offer a few more arguments in favor of supporting column-specific filters:
This could look like defining additional attributes on
# models/properties.yml
models:
  - name: my_model
    columns:
      - name: customer_id
        sample: [10, 30] # array -> 'in' filter
      - name: created_at
        sample: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}" # string -> filter as is
Then, anywhere you write:
-- models/another_model.sql
with my_model as (
  select * from {{ ref('my_model') }}
), ...
dbt would actually resolve that as a subquery like:
-- models/another_model.sql
with my_model as (
  select * from (
    select * from <database>.<schema>.my_model
    where customer_id in (10, 30)
    and created_at > '2023-08-11' -- writing this on Aug 14
  ) subq
), ...
Or, it could also be a config that's set for many models at once (or even every model in the project):
# dbt_project.yml
models:
  my_project:
    +sample:
      customer_id: [10, 30] # array -> 'in' filter
      created_at: "> {{ run_started_at.date() - modules.datetime.timedelta(days=3) }}" # string -> filter as is
The latter option would require dbt to check the set of columns first, to see if fields named
Bonus considerations:
@jaypeedevlin I'm happy to hear that this piques your interest! I'm curious - what would be your motivation for |
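To make the column-level proposal above concrete, here is a minimal Python sketch of how a `sample` mapping might be turned into the subquery shown in the resolved example. It is not dbt's implementation: the function names (`sql_literal`, `render_predicate`, `sampled_relation`) and the `subq` alias default are hypothetical, and only the two rules stated in the proposal (array -> 'in' filter, string -> filter as is) are modelled.

```python
from datetime import date, timedelta


def sql_literal(value):
    """Render a Python value as a SQL literal (hypothetical helper)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # Quote everything else as a string, doubling embedded single quotes.
    return "'" + str(value).replace("'", "''") + "'"


def render_predicate(column, sample):
    """array -> 'in' filter, string -> filter as is."""
    if isinstance(sample, (list, tuple)):
        values = ", ".join(sql_literal(v) for v in sample)
        return f"{column} in ({values})"
    return f"{column} {sample}"


def sampled_relation(relation, column_samples, alias="subq"):
    """Wrap a relation in a filtering subquery; pass it through if unfiltered."""
    if not column_samples:
        return relation
    predicates = [render_predicate(c, s) for c, s in column_samples.items()]
    where = "\n  and ".join(predicates)
    return f"(\n  select * from {relation}\n  where {where}\n) {alias}"


if __name__ == "__main__":
    # Stand-in for the rendered Jinja expression, pinned to Aug 14, 2023.
    cutoff = date(2023, 8, 14) - timedelta(days=3)
    print(sampled_relation(
        "<database>.<schema>.my_model",
        {"customer_id": [10, 30], "created_at": f"> '{cutoff}'"},
    ))
```

Passing string filters through untouched keeps the config flexible, but it also means the filter text is trusted as SQL, which is one of the trade-offs a real design would need to weigh.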
Love this topic! I also agree that column-specific filter predicates might be more interesting for consistency across models, or even just on their own. If you have years of data in your model, I feel it might be more relevant to get everything from the past month rather than random data points scattered across the whole history, although that probably depends on the use case and what you are trying to test! I'm wondering if we could customize where to apply this filter, though. As I understand it, if one sets this in a .yml file, I assume it would be executed at the end of the model, whereas it might be better performance-wise to run it early, especially if you have big tables and a lot of transformations in a model. We have been doing something similar: on the first group of CTEs of certain models, where we simply select from source or ref, we added this block:
And we define those variables in the
This way we can sample the data early in the query rather than later. |
@jtcohen6 that was a reasonably arbitrary example of truly random sampling. Most of our models are incremental so we'd be (in theory) randomly sampling just the most recent data anyway. Definitely not the most sophisticated method for my example though, I'll grant you that, but just arguing to make this as configurable as possible. |
Really love the conversation going here - I have personally written a handful of hack-y work-arounds to accomplish this exact thing, so would be excited to see a native approach in I can also imagine folks wanting to configure different samples per environment, which is something we could enable using environment variables.
I'm definitely in favor of folks being able to define a filter, which we recommend in our best practices guide and has come up in a handful of solutions. @QuentinCoviaux my understanding here is that running dbt in
when executing in
the compiled code would look something like this:
In that case, we would actually be filtering as early as possible. I have seen this sort of problem solved in 2 ways:
I'd be curious @jtcohen6 and @MichelleArk for y'all's thoughts on the pros / cons of each approach - and why "sampling" is better than "limiting". |
Insights about sampling
One solution
Here's one solution for sampling by Health Union:
Caveat: I haven't tried it out personally.
Interesting links
Here are some other interesting links from Health Union:
|
Just adding a +1 here to being able to sample via adding a filter! |
Great discussion here, just wanted to add my 2c stemming from a conversation with @dbeatty10:
Here's an example of how I could see this working:
This could be a natural extension to the work in #8652 (unit-test rows from seeds) to enable unit-testing on sample rows from models. To wrap it all up, we could create a new “hybrid” test that takes this sample as an input but checks for data assertions as output (e.g., check that customer status is This could help alleviate mocking fatigue in unit testing for |
Sampling sounds like a great idea! I would also second the ask for full customisation of this configuration: not just a limit, but ordering and even SQL filtering. In BigQuery, for example, I would sample a single partition. That said, it could be expanded over time, and having it to begin with is a great start! |
Just checking whether this will be included? (🥺) |
@joellabes @jtcohen6 Does this mean, a data test for a sampled model should only run on the sample of data? That seems easy, given that the data test will run on the filtered table that's been created in the warehouse. Or, does this mean -> #10877 |
Yes this is what I meant! I was thinking about |
Is this your first time submitting a feature request?
Describe the feature
Introduce a new boolean flag, --sample/--no-sample, that informs whether dbt should run in 'sample' mode.
In sample mode, dbt would resolve ref and source identifiers as select * from <identifier> limit <sample-size>.
A default sample size could be used (perhaps 100?), or set in config at the model-level to inform the size of the sample to select when referencing the model.
This would also enable setting a sample size using the hierarchical config syntax in dbt_project.yml:
It would also be interesting to explore providing column-level filters as configs, so that samples would resolve as select * from <identifier> where <filters> limit <sample-size>. This way, it would be possible to tune samples to ensure referential integrity within a run in sample mode.
We may also want to create an equivalent env_var, so folks could set a sample for a given environment.
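As a rough illustration of the behaviour described above, the following Python sketch resolves a reference either to the plain relation or to a limited (and optionally filtered) subquery. It is a sketch under stated assumptions, not dbt's API: the `sample_size` config key, the `DBT_SAMPLE_SIZE` environment variable, the precedence order (model config, then environment, then default), and the function names are all hypothetical. Only the default of 100 and the `select * from <identifier> where <filters> limit <sample-size>` shape come from the proposal.

```python
import os

DEFAULT_SAMPLE_SIZE = 100  # the proposal floats "perhaps 100?"


def resolve_sample_size(model_config, env=None):
    """Pick a sample size: model config, then env var, then the default.

    Both the 'sample_size' key and DBT_SAMPLE_SIZE are hypothetical names.
    """
    env = os.environ if env is None else env
    raw = model_config.get("sample_size", env.get("DBT_SAMPLE_SIZE"))
    size = DEFAULT_SAMPLE_SIZE if raw is None else int(raw)
    if size <= 0:
        raise ValueError(f"sample size must be positive, got {size}")
    return size


def resolve_ref(relation, sample_mode, model_config=None, filters=None, env=None):
    """Return the relation as-is, or a sampled subquery when in sample mode."""
    if not sample_mode:
        return relation
    size = resolve_sample_size(model_config or {}, env)
    where = f" where {' and '.join(filters)}" if filters else ""
    # Wrapped in parentheses so it can stand in for the relation in a FROM
    # clause; some warehouses also require an alias on the subquery.
    return f"(select * from {relation}{where} limit {size})"


if __name__ == "__main__":
    print(resolve_ref("<database>.<schema>.orders", sample_mode=True,
                      model_config={"sample_size": 10},
                      filters=["customer_id in (10, 30)"]))
```

Note that a bare `limit` without an `order by` returns an arbitrary set of rows, so two sampled models are not guaranteed to join; that is the referential-integrity concern the column-level filters are meant to address.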
Describe alternatives you've considered
It's possible to implement something like this with overrides of ref & source macros, but that makes dbt upgrades more difficult.
Also considered a simpler version of this where --sample is an integer argument representing the sample size in #8337. This doesn't leave design space for configuring filters on model/source samples, however.
Who will this benefit?
Any dbt user looking for speedier (at the cost of completeness/accuracy) development & CI runs!
Are you interested in contributing this feature?
yep!
Anything else?
Took a first pass at this in a spike here: #8337