[ADAP-912] Support Metadata Freshness #938
Comments
I understand this won't make 1.7, but is it committed for 1.8? It would have a huge impact on our (Medialab's) use of dbt. Happy to support with info on where to find this in the BQ information schema tables. I submitted the original request, which has made it into 1.7, but we just need the BQ support. Thanks!
I've done some spiking and rough benchmarking of whether a batch implementation of metadata-based freshness would lead to performance gains here, and my conclusion is that there isn't currently a way to implement a batch strategy that achieves performance improvements for metadata-based source freshness, given limitations of BigQuery's Python SDK.

Details

The current implementation of metadata-based source freshness iterates over selected sources, and retrieves the freshness metadata for each source by fetching a BigQuery It is significantly faster than using a

Batch Implementation -- Approach 1

A simple implementation would be to iterate over the selected. This works, but is fundamentally the same approach as the current implementation, just called earlier on during execution. It would still require the same 1 API call per selected source, and so does not lead to significant performance improvements. This was roughed-in here: 29e533f

From my benchmarking (running 10 invocations of each implementation on 6 sources in jaffle-shop), the current (non-batch) implementation took an average of 2.57s while the batch implementation took an average of 2.62s. The fastest non-batch invocation took 2.35s, and the fastest batch invocation took 2.17s. I believe the variability here actually comes from variability in BigQuery's

Batch Implementation -- Approach 2

Another implementation would be to fetch all the datasets available to the client upfront, and for each dataset, fetch all its tables. This would (in theory) only require 2 API calls per dataset the client has access to, regardless of the number of sources selected. I've roughed this in here: b1035e2

However, the objects (
We'd end up needing to fetch the

If there are other implementation options that would get us to a constant number of API calls for the batch of sources, I'm super open to hearing them, but from my exploration so far it does not seem worthwhile to implement a naive batch strategy: it is more complex, with negligible performance gains.
@MichelleArk this is a super terrifying conclusion. It was batch via BQ, from my initial feature request, that has driven all of this work around source freshness, which has been much anticipated since I raised the issue 15 months ago. I have a team week planned for 3rd June where we are hoping to finally shift to the benefits of this whole endeavour.

In the linked FR, there is correspondence in the comments about how to get this exact information from the INFORMATION_SCHEMA table. The BQ docs are now updated: https://cloud.google.com/bigquery/docs/information-schema-table-storage#schema

Your 2 investigated approaches suffer from the same limitation, which is that you're iterating through table metadata. I think that's the antithesis of the batch approach here. The whole point would be to query the single information schema table once. :(
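To make the "query once, then look up" idea concrete, here is a minimal sketch in Python. It is not the dbt-bigquery implementation: the `region-us` qualifier, the function names, and the row shape (plain dicts) are assumptions for illustration, and the query is only a string here, never executed. The `storage_last_modified_time` column is taken from the TABLE_STORAGE schema linked above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: sources live in the US multi-region; TABLE_STORAGE is region-qualified.
QUERY = """
SELECT table_schema, table_name, storage_last_modified_time
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
"""


def index_last_modified(rows):
    """Map (dataset, table) -> last-modified timestamp from one result set."""
    return {
        (row["table_schema"], row["table_name"]): row["storage_last_modified_time"]
        for row in rows
    }


def source_ages(rows, sources, now):
    """Look up every selected source in the single result set.

    Returns (dataset, table) -> age as a timedelta, or None when the
    source does not appear in the metadata rows.
    """
    index = index_last_modified(rows)
    return {
        source: (now - index[source]) if source in index else None
        for source in sources
    }


if __name__ == "__main__":
    # Hypothetical rows standing in for the result of running QUERY once.
    now = datetime(2024, 6, 1, 12, tzinfo=timezone.utc)
    rows = [
        {
            "table_schema": "raw",
            "table_name": "orders",
            "storage_last_modified_time": datetime(2024, 6, 1, 10, tzinfo=timezone.utc),
        }
    ]
    print(source_ages(rows, [("raw", "orders"), ("raw", "missing")], now))
```

The point of the shape is that the number of queries stays at one however many sources are selected; only the in-memory dictionary lookup grows with the source count.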
@adamcunnington-mlg -- thanks for linking the updated docs, and sorry I hadn't been aware of the INFORMATION_SCHEMA.TABLE_STORAGE option. I hear the disappointment re: not having this implemented yet as part of the recent 1.8 releases.

I did some further spiking using the and after some tweaking I think we have a promising path forward in

An initial naive implementation of

However, filtering within the metadata macro by relation schema name, in addition to filtering for exact schema & table name matches (51243e5), does the trick in getting the query time constant! Both a project with 50 sources and one with 100 sources took just under 5s to complete

Note that this was all done while hard-coding the

Based on the benchmarking in this spike, I'd expect that a project with ~1000 sources would take about 3 minutes to compute source freshness using the current implementation (non-batch). @adamcunnington-mlg if you can share, does that line up with what you're seeing in a production project? It would be awesome to get a sense of the performance impact these changes would have on some real-world projects.
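The two-level filter described above (a schema-name filter plus exact schema and table matches) could look roughly like the following. This is a sketch only: the actual macro in 51243e5 may be structured differently, `build_freshness_filter` is a hypothetical helper, and real code should escape or parameterize identifiers rather than interpolate them.

```python
def build_freshness_filter(sources):
    """Build a WHERE predicate for a batch of (dataset, table) pairs.

    The leading `table_schema IN (...)` clause narrows by dataset name;
    the OR-ed pairs then keep only the exact tables requested.
    """
    if not sources:
        raise ValueError("at least one source is required")

    # Distinct dataset names, sorted so the generated SQL is deterministic.
    schemas = sorted({schema for schema, _ in sources})
    schema_list = ", ".join(f"'{schema}'" for schema in schemas)

    # One exact (dataset, table) match per selected source.
    pairs = " OR ".join(
        f"(table_schema = '{schema}' AND table_name = '{table}')"
        for schema, table in sources
    )
    return f"table_schema IN ({schema_list}) AND ({pairs})"


if __name__ == "__main__":
    print(build_freshness_filter([("raw", "orders"), ("crm", "leads")]))
```

The resulting predicate would be appended to a single query against INFORMATION_SCHEMA.TABLE_STORAGE, so the batch still costs one query regardless of how many pairs it contains.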
@MichelleArk thanks for the fast response on this - it's much appreciated - and this sounds promising. I'm not close to how the source freshness process actually works so please forgive any naivety here in my comments/questions:
Overall, I think I'd expect the source freshness time to be as slow as 1 query of the information schema (which may be very loosely correlated with the size of that result, but it's going to be relatively flat) plus the time for dbt internals to iterate through the tables and extract freshness information, which is linear but small. So 3 minutes for 1000 tables surprises me. I'd be expecting more like 10-20 seconds? 3 minutes is definitely better than our current 15, but not quite what I was expecting. I suspect I'm missing something fundamental here, though, in how this process could work.
Thank you for the spike and the write-up @MichelleArk! Other considerations for this approach:
@adamcunnington-mlg Seconding Michelle, I would ask that you please try testing these out in your actual project with many many sources:
If you could provide us with concrete numbers for both approaches, that would help me a lot in deciding on the appropriate next step here: to switch this behavior for everyone (strictly better), to consider implementing it as another configurable option (with both pros & cons relative to the v1.8 behavior), or to put it down for good ("do nothing").
Describe the feature
Support metadata-based freshness by implementing the new macro and feature flag described in dbt-labs/dbt-core#7012.
Who will this benefit?
Everyone who wants faster freshness results from BigQuery!