Fix error when setting a large number of properties #312
### Bugfix

Fix Velir#269. This change greatly reduces the likelihood of an error when specifying a large number of property_ids in `ga4.combine_property_data()`.

* Fixed the following bug
* Changed to copy a table for each property_id

dbt_project.yml

```yml
vars:
  ga4:
    source_project: source-project-id
    property_ids: [000000001, 000000002, ..., 000000040]
    start_date: 20210101
    static_incremental_days: 3
    combined_dataset: combined_dataset_name
```

```shell
$ dbt run -s base_ga4__events --full-refresh
06:51:19  Running with dbt=1.5.0
06:52:05  Found 999 models, 999 tests, 999 snapshots, 999 analyses, 999 macros, 999 operations, 999 seed files, 999 sources, 999 exposures, 999 metrics, 999 groups
06:52:06
06:52:14  Concurrency: 4 threads (target='dev')
06:52:14
06:52:14  1 of 1 START sql view model dataset_name.base_ga4__events ......... [RUN]
06:56:17  BigQuery adapter: https://console.cloud.google.com/bigquery?project=project-id&j=bq:asia-northeast1:????????-????-????-????-????????????&page=queryresults
06:56:17  1 of 1 ERROR creating sql view model dataset_name.base_ga4__events [ERROR in 243.80s]
06:56:18
06:56:18  Finished running 1 view model in 0 hours 4 minutes and 11.62 seconds (251.62s).
06:56:22
06:56:22  Completed with 1 error and 0 warnings:
06:56:22
06:56:23  Database Error in model base_ga4__events (models/staging/base/base_ga4__events.sql)
06:56:23    The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters.
06:56:23
06:56:23  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
```

Merging this pull request will enable execution.
```shell
$ dbt run -s base_ga4__events --full-refresh
HH:mm:ss  Running with dbt=1.5.0
HH:mm:ss  Found 999 models, 999 tests, 999 snapshots, 999 analyses, 999 macros, 999 operations, 999 seed files, 999 sources, 999 exposures, 999 metrics, 999 groups
HH:mm:ss
HH:mm:ss  Concurrency: 4 threads (target='dev')
HH:mm:ss
HH:mm:ss  1 of 1 START sql incremental model dataset_name.base_ga4__events ... [RUN]
HH:mm:ss  Cloned from `source-project-id.analytics_000000001.events_*[20210101-20240324]` to `project-id.combined_dataset_name.events_YYYYMMDD000000001`.
HH:mm:ss  Cloned from `source-project-id.analytics_000000002.events_*[20210101-20240324]` to `project-id.combined_dataset_name.events_YYYYMMDD000000002`.
....
HH:mm:ss  Cloned from `source-project-id.analytics_000000040.events_*[20210101-20240324]` to `project-id.combined_dataset_name.events_YYYYMMDD000000040`.
HH:mm:ss  1 of 1 OK created sql incremental model dataset_name.base_ga4__events [CREATE TABLE (? rows, ? processed) in ?]
HH:mm:ss
HH:mm:ss  Finished running 1 incremental model in ? (?).
HH:mm:ss
HH:mm:ss  Completed successfully
HH:mm:ss
HH:mm:ss  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```

---

### Fixed timeout in clone operation

The following error will almost never occur now that the clone operations run separately for each property_id.

* Removed https://github.com/Velir/dbt-ga4/blame/6.0.1/README.md#L323-L332 from README.md
* Resolved the operational workaround described at https://github.com/Velir/dbt-ga4/blame/6.0.1/README.md#L323-L332

> Jobs that run a large number of clone operations are prone to timing out. As a result, it is recommended that you increase the query timeout if you need to backfill or full-refresh the table, when first setting up or when the base model gets modified. Otherwise, it is best to prevent the base model from rebuilding on full refreshes unless needed to minimize timeouts.
>
> ```
> models:
>   ga4:
>     staging:
>       base:
>         base_ga4__events:
>           +full_refresh: false
> ```
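The shape of the fix can be sketched as follows. This is a simplified sketch, not the actual dbt-ga4 source: the macro name, the single date-shard argument, and the loop structure are illustrative only (the real macro also iterates over every shard between `start_date` and `end_date`).

```sql
{# Sketch only: issue one small clone statement per property instead of one
   giant combined query, so no single statement approaches BigQuery's
   1024.00K-character query-length limit. #}
{% macro combine_property_data_sketch(property_ids, date_shard) %}
  {% for property_id in property_ids %}
    {% set clone_sql %}
      create or replace table
        `{{ target.project }}.{{ var('combined_dataset') }}.events_{{ date_shard }}{{ property_id }}`
      clone
        `{{ var('source_project') }}.analytics_{{ property_id }}.events_{{ date_shard }}`;
    {% endset %}
    {# execute immediately, one statement at a time #}
    {% do run_query(clone_sql) %}
  {% endfor %}
{% endmacro %}
```

Running each statement as soon as it is rendered keeps every individual query small, which is what avoids the 1024.00K-character limit.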
@yamotech, thank you for this. I've done a quick review of the code and it looks fine, but I want to test it on a project first. I have a client with six combined GA4 properties that should work as a test case. Pardon my ignorance, but how do I check whether parallel cloning is running in parallel? Ideally, I'd like to write a Python test to verify this behavior, but I minimally want to confirm that the code works. Also, why do you nest the … The outer macro basically does nothing other than a … Minimally, I'll want you to update the macro to use …
Thanks for the review, @dgitis! I worked on parallel cloning in #303, but closed it because the implementation was inadequate. Reasons for poor implementation:
This pull request only addresses … I have removed the following from …

### TODO
This pull request does not address the items below, but they seem worthwhile for us to tackle.
#### Merges only the data for the specified property_id into base_ga4__events.
For example, "Merges only the data for property_id `000000041` and `000000042` into `base_ga4__events`".
```shell
$ dbt run-operation insert_overwrite_properties_data --args '{ property_ids: [000000041, 000000042], start_date: 20210101, end_date: 20240324 }'
```

This would execute `{{ ga4.combine_property_data(property_ids, start_date, end_date) }}` within `insert_overwrite_properties_data`.
We don't need to combine the data again for property_ids that are already included in `base_ga4__events`, so I want to address this.
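A minimal sketch of such a run-operation, assuming the proposed `combine_property_data(property_ids, start_date, end_date)` signature. The name `insert_overwrite_properties_data` and its arguments come from the idea above and are not part of dbt-ga4:

```sql
{# Hypothetical run-operation wrapper; not part of dbt-ga4. Renders the
   per-property clone DDL and executes it for just the listed properties. #}
{% macro insert_overwrite_properties_data(property_ids, start_date, end_date) %}
  {% do run_query(ga4.combine_property_data(property_ids, start_date, end_date)) %}
{% endmacro %}
```

This keeps the existing shards for other property_ids untouched, since only the listed properties are re-cloned.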
#### Execute combine_property_data in parallel to speed up cloning.
The content I wanted to address in #303.
My dbt project has more than 40 property_ids.
I want to combine these property_ids in parallel to reduce latency.
I'll create a new pull request if I come up with a better approach.
I was thinking of cloning only the specified property_id:

```shell
$ dbt run-operation combine_specified_property_data --args '{ property_id: 000000100, start_date: 20240101, end_date: 20240325 }'
$ dbt run-operation combine_specified_property_data --args '{ property_id: 000000101, start_date: 20240301, end_date: 20240325 }'
```

Getting parallelism to work does not require …
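One way to get that parallelism from the shell rather than from dbt itself is to fan out one run-operation per property with `xargs -P`. Note that `combine_specified_property_data` is the hypothetical macro name from the commands above, and `echo` stands in for the real invocation so the pattern is runnable anywhere:

```shell
# Fan out one run-operation per property id, up to 4 at a time.
# `echo dbt ...` only prints the commands; drop `echo` to actually run them.
printf '%s\n' 000000100 000000101 000000102 |
  xargs -n 1 -P 4 -I {} \
    echo dbt run-operation combine_specified_property_data \
      --args "{ property_id: {}, start_date: 20240101, end_date: 20240325 }"
```

Each property's clone then runs in its own dbt process, so one slow or failing property no longer blocks the rest.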
I removed …
My apologies. I understand. I think I can probably reproduce this by copying the same source table over and over again to test this fix. I've tested some multi-site issues that way before, and I know our code doesn't check for the same ID listed more than once in …
```diff
 {# Copy intraday tables #}
 {%- set relations = dbt_utils.get_relations_by_pattern(schema_pattern=schema_name, table_pattern='events_intraday_%', database=var('source_project')) -%}
 {% for relation in relations %}
   {%- set relation_suffix = relation.identifier|replace('events_intraday_', '') -%}
   {%- if relation_suffix|int >= earliest_shard_to_retrieve|int -%}
-    CREATE OR REPLACE TABLE `{{target.project}}.{{var('combined_dataset')}}.events_intraday_{{relation_suffix}}{{property_id}}` CLONE `{{var('source_project')}}.analytics_{{property_id}}.events_intraday_{{relation_suffix}}`;
+    create or replace table `{{target.project}}.{{var('combined_dataset')}}.events_intraday_{{relation_suffix}}{{property_id}}` clone `{{var('source_project')}}.analytics_{{property_id}}.events_intraday_{{relation_suffix}}`;
```
It seems best to minimize the time between query generation and execution.
With the changes made in this pull request, the likelihood of this happening should decrease significantly, so it seems prudent to monitor the situation for now.
The following error occurred:

```
23:07:31  BigQuery adapter: https://console.cloud.google.com/bigquery?project={{target.project}}&j=bq:asia-northeast1:xxxxxxxxx-xxxx-xxxx-xxx-xxxxxxxxxxxx&page=queryresults
23:07:31  1 of 999 ERROR creating sql incremental model {{dataset_id}}.base_ga4__events ... [ERROR in 260.79s]
23:08:48  Completed with 1 error and 0 warnings:
23:08:48
23:08:48  Database Error in model base_ga4__events (models/staging/base/base_ga4__events.sql)
23:08:48    Not found: Table {{target.project}}:analytics_{{property_id}}.events_intraday_20240327 was not found in location asia-northeast1 at [77:5]
```
It appears that the source table transitioned from `events_intraday_20240327` to `events_20240327` before `{{target.project}}:analytics_{{property_id}}.events_intraday_20240327` was referenced, resulting in a `Not found` error.
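One possible mitigation (an assumption on my part, not code from this pull request) is to re-check that the intraday shard still exists immediately before cloning it, since GA4 can replace `events_intraday_YYYYMMDD` with `events_YYYYMMDD` at any moment:

```sql
{# Sketch only: guard each intraday clone with a metadata check right before
   execution, skipping shards that have already become daily tables. #}
{% set check_sql %}
  select count(*) as n
  from `{{ var('source_project') }}.analytics_{{ property_id }}.INFORMATION_SCHEMA.TABLES`
  where table_name = 'events_intraday_{{ relation_suffix }}'
{% endset %}
{% if run_query(check_sql).columns[0].values()[0] > 0 %}
  create or replace table `{{ target.project }}.{{ var('combined_dataset') }}.events_intraday_{{ relation_suffix }}{{ property_id }}`
  clone `{{ var('source_project') }}.analytics_{{ property_id }}.events_intraday_{{ relation_suffix }}`;
{% endif %}
```

There is still a small race window between the check and the clone, so this narrows rather than eliminates the error.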
I've been experiencing the error mentioned above quite frequently lately, and it's been causing some trouble.
If possible, I would like to proceed with addressing this error here. Is merging this pull request difficult?
If there are any specific fixes needed, please let me know so I can take care of them.
Also, if there are any other good solutions available, I'd like to consider them.
Sorry for the delay. I just had to wrap my head around what was actually being changed beyond the reduction of whitespace characters. Thanks for the PR! Approved.
Checklist

- Ran `dbt test` and `python -m pytest .` to validate existing tests