Currently, the dbt-bigquery profiles.yml has two relevant configurations: timeouts and retries.

We at Semios have been using dbt extensively to run high-frequency data pipelines on BigQuery (some of them run every 10 minutes). Because of strict SLA requirements, we need more granular control over the BigQuery timeout and retry mechanisms to minimize intermittent query failures and their impact on our pipelines.

We found that the existing configs are not enough to mitigate all intermittent query failures, for example the `Deadline of 120.0s exceeded` 503 errors raised by the BigQuery JobInsert API, which occur from time to time and have impacted several dbt-BigQuery users in the dbt community.

We would like to suggest a more comprehensive set of BigQuery retry configs to minimize all of these errors.
Technical Context
Before presenting the proposal, let me explain how a BigQuery query works in dbt and how we would like to retry it.

How BigQuery query works in dbt

At its core, a BigQuery query in dbt is made in two steps:
def _query_and_results(self, client, sql, conn, job_params, timeout=None):
    """Query the client and wait for results."""
    # Cannot reuse job_config if destination is set and ddl is used
    job_config = google.cloud.bigquery.QueryJobConfig(**job_params)
    query_job = client.query(sql, job_config=job_config)  # <--- Step 1
    iterator = query_job.result(timeout=timeout)  # <--- Step 2
    return query_job, iterator
In the first step, `client.query()` submits the query to the BQ JobInsert API server. When it succeeds, the BQ server creates a new BigQuery query job and returns the query job id to the client as part of the `query_job` object. This step should be very quick, normally under a few seconds; however, in some rare cases it can take much longer, even up to 4 minutes according to the BigQuery engineering team.

In the second step, `query_job.result()` waits for BigQuery to execute (run) the query and returns the results to the client as an iterator. Depending on the complexity of the query, this step can take a long time, from tens of seconds to tens of minutes.
How dbt currently retries BigQuery queries
Currently, the dbt `retries` config is used to set the retries of the overall `_query_and_results`, and the dbt `timeout` config is used only to control step 2, `query_job.result(timeout=timeout)`. These configs make sense on their own.
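As a simplified illustration only (not dbt's actual implementation), the existing configs apply roughly like this: `retries` re-runs the whole `_query_and_results` call on transient server errors, while `timeout` only bounds `result()` in step 2.

```python
from google.api_core.exceptions import InternalServerError, ServiceUnavailable

# Simplified sketch, not dbt's actual code: `retries` wraps the whole
# _query_and_results call, while `timeout` only bounds query_job.result().
def run_with_retries(connections, client, sql, conn, job_params,
                     retries=1, timeout=300):
    last_error = None
    for attempt in range(1 + retries):
        try:
            return connections._query_and_results(
                client, sql, conn, job_params, timeout=timeout
            )
        except (ServiceUnavailable, InternalServerError) as exc:
            last_error = exc  # transient 5xx from BigQuery; try again
    raise last_error
```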
However, you might notice that there is no control over the timeout of `query()` in step 1, other than relying on its [default value](https://github.com/googleapis/python-bigquery/blob/1762e5bd98ab089abf65a0d146234c77217b8dbe/google/cloud/bigquery/client.py#L3193). When the BQ JobInsert API sharding cluster is unstable, it can take up to 4 minutes to create the query job id in some rare cases. Because `query()` had a client-side default deadline of 120s (`DEFAULT_RETRY = retry.Retry(predicate=_should_retry)`), the client gives up at 120s while the server side is still waiting for the job creation. That's why several dbt community members have experienced the [Deadline of 120.0s exceeded errors](https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400).

The BQ team has since patched `query()` to use a client-side default deadline of 10 minutes (`DEFAULT_RETRY = retry.Retry(predicate=_should_retry, deadline=600.0)`). It means that, by default, `query()` can take up to 4 minutes to make one attempt to create a new job id, and if that fails, the client side has up to 10 minutes to retry. This works fine for users who don't have stringent timing requirements; they won't mind waiting 10 minutes to create a new BQ job id and then another 10 minutes to get the result.
Unfortunately, that doesn't work for us. In situations like this, we would rather fail faster and try again. It should not take more than 30 seconds to create a new job id; if it takes too long, we would rather let `query()` time out and fail, so that we can retry it.
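For illustration, here is a rough sketch of how a tighter job-creation deadline could be combined with a long execution timeout using the google-cloud-bigquery client directly. This is not dbt's current code; the function name and the 30s/1200s values are placeholders.

```python
from google.cloud import bigquery
from google.cloud.bigquery.retry import DEFAULT_RETRY

def query_with_tight_creation_deadline(client, sql, job_params,
                                       job_creation_timeout=30.0,
                                       job_execution_timeout=1200.0):
    job_config = bigquery.QueryJobConfig(**job_params)
    # Cap the client-side retry deadline for jobs.insert (step 1)
    # instead of relying on the library default of 600s.
    create_retry = DEFAULT_RETRY.with_deadline(job_creation_timeout)
    query_job = client.query(
        sql,
        job_config=job_config,
        retry=create_retry,            # step 1: job creation retries
        timeout=job_creation_timeout,  # per-request timeout for the API call
    )
    iterator = query_job.result(timeout=job_execution_timeout)  # step 2
    return query_job, iterator
```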
The Proposed Solution
In order to gain fine control over the timeout mechanisms of step 1 and step 2, we would like to propose the following four dbt configs:
job_creation_timeout_seconds   # specific for initiating the BQ job; controls the timeout of step 1, query()
job_execution_timeout_seconds  # specific for awaiting the job result; controls the timeout of step 2, result()
job_retry_deadline_seconds     # controls the overall query: the retry_deadline of _query_and_results()
job_retries                    # controls the overall query: the retries of _query_and_results()
For example, we could set the configs below to fail faster on the BQ job creation step, while still allowing queries with long-running results.
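A minimal sketch of what this might look like in profiles.yml, assuming the proposed config names; the profile, project, and dataset names and the numeric values are placeholders, not recommendations:

```yaml
my-bigquery-profile:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: my-gcp-project   # placeholder
      dataset: analytics        # placeholder
      # proposed configs
      job_creation_timeout_seconds: 30     # fail fast if the BQ job id isn't created within 30s
      job_execution_timeout_seconds: 1200  # allow query results to take up to 20 minutes
      job_retry_deadline_seconds: 1500     # overall deadline across retries of _query_and_results
      job_retries: 3                       # retry the whole query up to 3 times
```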
These settings would let us control the timeout behavior of step 1 and step 2 independently, maximizing our chances of mitigating the different kinds of intermittent errors.
NOTE: `job_execution_timeout_seconds` is a renaming of the current `timeout` config, and `job_retries` is a renaming of the current `retries` config.
Describe alternatives you've considered
Currently, we have to override the dbt-core code to allow `query()` to fail faster and retry, in order to mitigate the 503 errors.
Who will this benefit?
dbt-bigquery users who need finer control over BigQuery query behavior.

Are you interested in contributing this feature?

Yes.

Anything else?

This issue has been discussed by several dbt community members in Slack: https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400
@hui-zheng Your proposal makes a lot of sense to me! I really appreciate the level of detail in your explanation. While I'd obviously prefer fewer than four separate configs related to job timeout/retry, it sounds like that's just the level of fine-grained control that many community members need.
From an implementation point of view, it sounds like everything should be quite simple: adding more properties to BigQueryCredentials and passing them into the various methods you've identified above. To that end, I'm going to transfer this issue to the dbt-bigquery repo (where the code now lives), mark this a good first issue, and welcome a PR for it.
For my part, I'd just like to see us maintain backwards compatibility for the existing config names (timeout_seconds + retries). I agree that the proposed names are much more precise, and they're what we should document going forward. If users upgrade with the old config names, they should continue working, and we should map their values over to the new configs.
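For illustration, here is a rough, hypothetical sketch of what that mapping could look like; it is just to show the idea, not necessarily how dbt implements config aliases.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified stand-in for dbt's BigQueryCredentials,
# showing the new fields plus an alias map from the old config names.
@dataclass
class BigQueryCredentials:
    job_creation_timeout_seconds: Optional[int] = None
    job_execution_timeout_seconds: Optional[int] = 300
    job_retry_deadline_seconds: Optional[int] = None
    job_retries: int = 1

    # Old names keep working and map onto the new fields.
    _ALIASES = {
        "timeout_seconds": "job_execution_timeout_seconds",
        "retries": "job_retries",
    }
```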