
Fix error when setting a large number of properties #312

Merged
merged 3 commits into Velir:main from fix-269 on Apr 16, 2024

Conversation

yamotech
Contributor

@yamotech yamotech commented Mar 24, 2024

Description & motivation

Bugfix

Fix #269.

This change greatly reduces the likelihood of an error when specifying a large number of property_ids in ga4.combine_property_data().

  • Fixed the following bug
    • Changed the macro to copy (clone) the tables for each property_id separately (see the sketch below)
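The idea, in a minimal sketch (illustrative only; macro and variable names here are assumptions, not the exact code merged in this PR): rather than rendering every clone statement for every property into one giant script, each property_id gets its own small batch of statements, so no single submitted query approaches BigQuery's 1024K-character limit.

```sql
{# Illustrative sketch of per-property cloning; not the exact macro from this PR. #}
{# Intraday tables are ignored here to keep the sketch short. #}
{% macro combine_property_data_sketch() %}
    {% for property_id in var('property_ids') %}
        {%- set schema_name = 'analytics_' ~ property_id -%}
        {# List the sharded event tables for this property only #}
        {%- set relations = dbt_utils.get_relations_by_pattern(
                schema_pattern=schema_name,
                table_pattern='events_%',
                database=var('source_project')) -%}
        {%- set clone_statements = [] -%}
        {% for relation in relations %}
            {%- set relation_suffix = relation.identifier | replace('events_', '') -%}
            {%- do clone_statements.append(
                'create or replace table `' ~ target.project ~ '.' ~ var('combined_dataset')
                ~ '.events_' ~ relation_suffix ~ property_id ~ '` clone `'
                ~ var('source_project') ~ '.' ~ schema_name ~ '.events_' ~ relation_suffix ~ '`;'
            ) -%}
        {% endfor %}
        {# One script per property_id keeps each submitted query well below the limit #}
        {% if clone_statements | length > 0 %}
            {% do run_query(clone_statements | join('\n')) %}
        {% endif %}
    {% endfor %}
{% endmacro %}
```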

dbt_project.yml

```yml
vars:
    ga4:
        source_project: source-project-id
        property_ids: [
          000000001
          , 000000002
          , ...
          , 000000040
        ]
        start_date: 20210101
        static_incremental_days: 3
        combined_dataset: combined_dataset_name
```
```shell
$ dbt run -s base_ga4__events --full-refresh
06:51:19  Running with dbt=1.5.0
06:52:05  Found 999 models, 999 tests, 999 snapshots, 999 analyses, 999 macros, 999 operations, 999 seed files, 999 sources, 999 exposures, 999 metrics, 999 groups
06:52:06
06:52:14  Concurrency: 4 threads (target='dev')
06:52:14
06:52:14  1 of 1 START sql view model dataset_name.base_ga4__events ......... [RUN]
06:56:17  BigQuery adapter: https://console.cloud.google.com/bigquery?project=project-id&j=bq:asia-northeast1:????????-????-????-????-????????????&page=queryresults
06:56:17  1 of 1 ERROR creating sql view model dataset_name.base_ga4__events  [ERROR in 243.80s]
06:56:18
06:56:18  Finished running 1 view model in 0 hours 4 minutes and 11.62 seconds (251.62s).
06:56:22
06:56:22  Completed with 1 error and 0 warnings:
06:56:22
06:56:23  Database Error in model base_ga4__events (models/staging/base/base_ga4__events.sql)
06:56:23    The query is too large. The maximum standard SQL query length is 1024.00K characters, including comments and white space characters.
06:56:23
06:56:23  Done. PASS=0 WARN=0 ERROR=1 SKIP=0 TOTAL=1
```

After merging this pull request, the same command completes successfully:

```shell
$ dbt run -s base_ga4__events --full-refresh
HH:mm:ss  Running with dbt=1.5.0
HH:mm:ss  Found 999 models, 999 tests, 999 snapshots, 999 analyses, 999 macros, 999 operations, 999 seed files, 999 sources, 999 exposures, 999 metrics, 999 groups
HH:mm:ss
HH:mm:ss  Concurrency: 4 threads (target='dev')
HH:mm:ss
HH:mm:ss  1 of 1 START sql incremental model dataset_name.base_ga4__events ... [RUN]
HH:mm:ss  Cloned from `source-project-id.analytics_000000001.events_*` to `project-id.combined_dataset_name.events_YYYYMMDD000000001`.
HH:mm:ss  Cloned from `source-project-id.analytics_000000002.events_*` to `project-id.combined_dataset_name.events_YYYYMMDD000000002`.
....
HH:mm:ss  Cloned from `source-project-id.analytics_000000040.events_*` to `project-id.combined_dataset_name.events_YYYYMMDD000000040`.
HH:mm:ss  1 of 1 OK created sql incremental model dataset_name.base_ga4__events  [CREATE TABLE (? rows, ? processed) in ?]
HH:mm:ss
HH:mm:ss  Finished running 1 incremental model in ? (?).
HH:mm:ss
HH:mm:ss  Completed successfully
HH:mm:ss
HH:mm:ss  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
```

Fixed timeout in clone operation

Because tables are now cloned separately for each property_id, the timeout described below should rarely occur, so the corresponding guidance has been removed from README.md:

https://github.com/Velir/dbt-ga4/blame/6.0.1/README.md#L323-L332

> Jobs that run a large number of clone operations are prone to timing out. As a result, it is recommended that you increase the query timeout if you need to backfill or full-refresh the table, when first setting up or when the base model gets modified. Otherwise, it is best to prevent the base model from rebuilding on full refreshes unless needed to minimize timeouts.
>
> ```yml
> models:
>   ga4:
>     staging:
>       base:
>         base_ga4__events:
>           +full_refresh: false
> ```

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • [na] I have added tests & descriptions to my models (and macros if applicable)
  • I have run dbt test and python -m pytest . to validate existing tests

@dgitis
Collaborator

dgitis commented Mar 25, 2024

@yamotech, thank you for this.

I've done a quick review of the code and it looks fine, but I want to test it on a project first. I have a client with six combined GA4 properties that should work for this.

Pardon my ignorance, but how do I check whether the cloning is actually running in parallel?

Ideally, I'd like to write a Python test to verify this behavior, but I minimally want to confirm that the code works.

Also, why do you nest the combine_specified_property_data macro in the combine_property_data macro?

The outer macro basically does nothing other than a for ... in statement. Is this structure necessary for the parallelism to work?

At a minimum, I'll want you to update the macro to use adapter.dispatch with a default__combine_specified_property_data. This allows people to override the macro, which I think is pretty important in this case.
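For reference, the dispatch pattern being requested looks roughly like this (a sketch only; the argument list is assumed from the run-operation examples later in this thread, and the default body is a placeholder):

```sql
{# Sketch of adapter.dispatch so users can override the macro in their own project. #}
{% macro combine_specified_property_data(property_id, start_date, end_date) %}
    {{ return(adapter.dispatch('combine_specified_property_data', 'ga4')(property_id, start_date, end_date)) }}
{% endmacro %}

{% macro default__combine_specified_property_data(property_id, start_date, end_date) %}
    {# Default clone logic for a single property_id would go here. #}
{% endmacro %}
```

A consuming project could then define its own default__combine_specified_property_data and point the ga4 macro namespace at it with the dispatch search_order config in dbt_project.yml.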

@yamotech
Contributor Author

> I've done a quick review of the code and it looks fine, but I want to test it on a project first. I have a client with six combined GA4 properties that should work for this.
>
> Pardon my ignorance, but how do I check whether the cloning is actually running in parallel?

Thanks for the review, @dgitis!
I apologize for the lack of explanation, but this pull request does not address parallel cloning.

I worked on parallel cloning in #303, but closed it because the implementation was inadequate.
You can confirm parallel cloning behavior by running $ dbt run -s +base_ga4__events in #303.

Reasons the implementation was inadequate:

  • It is necessary to create a models/staging/base/intermediate/combine_property/int_ga4__combine_property_{{ INDEX }}.sql file for each configurable property_id.
  • You need to reference the int_ga4__combine_property_{{ INDEX }} model within base_ga4__events.

This pull request only addresses the Bugfix and the clone-operation timeout fix described above.

I have removed the following sections from Description & motivation; they are not addressed in this pull request.
If I come up with a better approach than #303, I'll create a new pull request.

### TODO

This pull request does not address it, but the content seems worthwhile for us to tackle.

#### Merges only the data for the specified property_id into base_ga4__events.

For example, "Merges only the data for property_id `000000041` and `000000042` into `base_ga4__events`".

```shell
$ dbt run-operation insert_overwrite_properties_data --args '{ property_ids: [000000041, 000000042], start_date: 20210101, end_date: 20240324 }'
```

This would execute `{{ ga4.combine_property_data(property_ids, start_date, end_date) }}` within `insert_overwrite_properties_data`. Data for property_ids that are already included in base_ga4__events would not need to be re-processed, so I want to address this.

#### Execute combine_property_data in parallel to speed up cloning.

This is the content I wanted to address in #303.

My dbt project has more than 40 property_ids.
I want to combine these property_ids in parallel to reduce latency.
I'll create a new pull request if I come up with a better approach.

> Also, why do you nest the combine_specified_property_data macro in the combine_property_data macro?
>
> The outer macro basically does nothing other than a for ... in statement. Is this structure necessary for the parallelism to work?

I was thinking of cloning only the specified property_id.

```shell
$ dbt run-operation combine_specified_property_data --args '{ property_id: 000000100, start_date: 20240101, end_date: 20240325 }'
$ dbt run-operation combine_specified_property_data --args '{ property_id: 000000101, start_date: 20240301, end_date: 20240325 }'
```

Parallelism does not require combine_specified_property_data, so in commit 5d829f1 I moved that processing into combine_property_data.

> At a minimum, I'll want you to update the macro to use adapter.dispatch with a default__combine_specified_property_data. This allows people to override the macro, which I think is pretty important in this case.

Since I removed combine_specified_property_data, I didn't address that point.

@dgitis
Copy link
Collaborator

dgitis commented Mar 26, 2024

My apologies. I understand.

I think I can probably reproduce this by copying the same source table over and over again to test this fix. I've tested some multi-site issues that way before, and I know our code doesn't check whether the same ID is listed more than once in property_ids.

This review thread is anchored on the following part of the macro in the diff:

```sql
{# Copy intraday tables #}
{%- set relations = dbt_utils.get_relations_by_pattern(schema_pattern=schema_name, table_pattern='events_intraday_%', database=var('source_project')) -%}
{% for relation in relations %}
{%- set relation_suffix = relation.identifier|replace('events_intraday_', '') -%}
{%- if relation_suffix|int >= earliest_shard_to_retrieve|int -%}
CREATE OR REPLACE TABLE `{{target.project}}.{{var('combined_dataset')}}.events_intraday_{{relation_suffix}}{{property_id}}` CLONE `{{var('source_project')}}.analytics_{{property_id}}.events_intraday_{{relation_suffix}}`;
create or replace table `{{target.project}}.{{var('combined_dataset')}}.events_intraday_{{relation_suffix}}{{property_id}}` clone `{{var('source_project')}}.analytics_{{property_id}}.events_intraday_{{relation_suffix}}`;
```
Contributor Author


It seems best to minimize the time between query generation and execution.

With the changes made in this pull request, the likelihood of this happening should decrease significantly, so it seems prudent to monitor the situation for now.

The following error occurred:

```shell
23:07:31  BigQuery adapter: https://console.cloud.google.com/bigquery?project={{target.project}}&j=bq:asia-northeast1:xxxxxxxxx-xxxx-xxxx-xxx-xxxxxxxxxxxx&page=queryresults
23:07:31  1 of 999 ERROR creating sql incremental model {{dataset_id}}.base_ga4__events ... [ERROR in 260.79s]
23:08:48  Completed with 1 error and 0 warnings:
23:08:48
23:08:48  Database Error in model base_ga4__events (models/staging/base/base_ga4__events.sql)
23:08:48    Not found: Table {{target.project}}:analytics_{{property_id}}.events_intraday_20240327 was not found in location asia-northeast1 at [77:5]
```

It appears that events_intraday_20240327 transitioned to events_20240327 between query generation and execution, so the reference to {{target.project}}:analytics_{{property_id}}.events_intraday_20240327 failed with a Not found error.
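One way to shrink that window, sketched here purely for illustration (the macro name is hypothetical and this is not necessarily how the merged macro behaves), is to execute each clone statement as soon as it is rendered instead of accumulating a script first:

```sql
{# Sketch only: clone each intraday shard right after it is discovered, narrowing #}
{# the window in which GA4 can replace events_intraday_* with the daily events_* table. #}
{% macro clone_intraday_immediately(property_id, earliest_shard_to_retrieve) %}
    {%- set schema_name = 'analytics_' ~ property_id -%}
    {%- set relations = dbt_utils.get_relations_by_pattern(
            schema_pattern=schema_name,
            table_pattern='events_intraday_%',
            database=var('source_project')) -%}
    {% for relation in relations %}
        {%- set relation_suffix = relation.identifier | replace('events_intraday_', '') -%}
        {%- if relation_suffix | int >= earliest_shard_to_retrieve | int -%}
            {# Run the clone immediately rather than appending it to a larger script #}
            {% do run_query(
                'create or replace table `' ~ target.project ~ '.' ~ var('combined_dataset')
                ~ '.events_intraday_' ~ relation_suffix ~ property_id ~ '` clone `'
                ~ var('source_project') ~ '.' ~ schema_name ~ '.events_intraday_' ~ relation_suffix ~ '`'
            ) %}
        {%- endif -%}
    {% endfor %}
{% endmacro %}
```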

Contributor Author


@dgitis

I've been experiencing the error mentioned above quite frequently lately, and it's been causing some trouble.
If possible, I would like to proceed with addressing this error here; is there anything blocking the merge of this pull request?
If any specific fixes are needed, please let me know and I'll take care of them.

Also, if there are any other good solutions available, I'd like to consider them.

Collaborator

@adamribaudo-velir adamribaudo-velir left a comment


Sorry for the delay. I just had to wrap my head around what was actually being changed beyond the reduction of whitespace characters. Thanks for the PR! Approved.

@adamribaudo-velir adamribaudo-velir merged commit f58011f into Velir:main Apr 16, 2024
2 checks passed
@yamotech yamotech deleted the fix-269 branch April 17, 2024 00:55
Labels: None yet

Successfully merging this pull request may close these issues: Cannot have a large number of properties

3 participants