-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CT-1481] [Feature] Expose async arg in run_query()
#315
Comments
run_query()
run_query()
@jaysobel I buy into your vision of speeding up the zero-copy clone process for large projects! Reminds me of a famous song by the Proclaimers:
I think implementing dbt-labs/dbt-core#5005 would be a better bet than exposing I saw your comment here -- would dbt-labs/dbt-core#5005 be just as good an outcome for you? Rationale
|
@dbeatty10 I don't see 5005 as a substitute since it would just enable cloning fewer models in CI runs, as opposed to cloning faster, in all contexts. |
Had a chance to chat more about this with @dbeatty10 & @Fleid. Super interesting stuff. Let's stay concrete to the two use cases called out in the issue:
These are completely legit! We should seek to make these as fast + painless as possible. That said, if I'm being candid, I live in fear of code like: {% macro clone_all_my_tables() %}
{% for query in queries %}
{% do run_query(query, async=true) %} -- "will I ever get a callback?" - dbt, or jerco during high school theatre auditions
{% endfor %}
{% set wait_sql %}
select system$wait(30); -- hopefully long enough for everything to complete...?
{% end wait_sql %}
{% do run_query(wait_sql, async=false) %}
{% endmacro %}
Snowflake calls out some "best practices," which feel (to me) pretty nontrivial:
I worry that we'd be setting ourselves on the precipice of some pretty gnarly request-handling, all of which would need to handled with Jinja code... tl;dr What if we used threads instead?If you're doing parallelism in dbt today, it's because you're executing a DAG, and some resources within that DAG can be built / operated on independently from each other — insofar as the dependency order allows. This holds true for Async queries, parallelism... there be dragons in these waters. We want to be thoughtful about how we expose these capabilities to end users, rather than throwing them right into the deep. Async calls != DAG: these things are mutually exclusive. "Houston, we have a leak in the abstraction." IMO the best answer here looks:
Let's imagine a CI environment, and a form of To be clear, this would require a change within |
Just noting that I'm not on the same page here. May just be our implementation, or out-of-date docs in the dbt Cloud CI Job but pic implies full 500 models are cloned as step 1. And we do ours this way so we can do full-refresh, then incremental, then tests as separate steps. I think what I'm hearing is |
Yes, this is what I'm thinking! Sorry for not being clearer in the leadup; this doesn't yet exist today. You could picture it as something like:
Except, what if dbt did the first step for you implicitly? And what you actually run is a more configurable version of "deferral," where rewriting refs for unbuilt upstream models is one option, and cloning is another:
Fair - we'd want to do some testing / benchmarking to confirm. |
Both of these seem like great options. The |
Wow this would be such a win for dbt on Snowflake. Cloning our dev schemas takes upwards of 12 minutes right now.
|
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days. |
Is this your first time submitting a feature request?
Describe the feature
In dbt Snowflake projects with ~1,000 models, cloning schemas takes 10s of minutes, and typically comes up once per-user per-day to refresh development environments, or even more frequently to set up CI testing schemas, as in the documentation for the dbt Cloud CI Job.
Typically the
create schema ci_temp.x as clone prod.x
command is run as a dbt macro using therun_query()
function, which (presumably) uses the Snowflake connector for Python'scursor.execute()
function. However this blog post from select.dev demonstrates that in the context of cloning schemas (or many tables at a time) theexecute_async()
variant of this function (a convenience function settingasync = true
onexecute()
) can be 60x faster than the basicexecute()
of the same command.Can the
run_query()
macro, and underlying Jinja interface with Snowflake's Python cursor, expose this optional parameter?Describe alternatives you've considered
Waiting for Snowflake to improve the performance of
clone schema
. Why does a "zero copy clone" take so long?In the context of clones for CI test runs the whole schema isn't needed. The specific models required to run
state:modified+1
is given by the selector-s @state:modified --exclude state:modfied+1
(assuming the subsequent run has --full-refresh). However selectors cannot be passed to therun-operation
command (per #5005). If cloning the whole schema can take 20 seconds, this level of optimization would be moot anyway.Who will this benefit?
Enterprise clients of
dbt Cloud
using thedbt Cloud CI Job
will save 10 minutes off the top of CI runs (which can also only run one at a time, and often queue up).AE teams blocked from automatically refreshing clones everyday by the runtime of
10 minutes * headcount
.Are you interested in contributing this feature?
No response
Anything else?
The article by Niall Woodward.
https://select.dev/posts/snowflake-clones
Snowflake Python docs "Performing an Asynchronous Query"
https://docs.snowflake.com/en/user-guide/python-connector-example.html#performing-an-asynchronous-query
A sad log of dbt Cloud CI job executions taking 10+ minutes.
Typical 10 minutes spent on cloning
The text was updated successfully, but these errors were encountered: