Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] dbt-Bigquery retry for DML job BQ API errors (503 errors, etc. ) #45

Closed
1 task done
hui-zheng opened this issue Oct 22, 2021 · 3 comments · Fixed by #50
Closed
1 task done

[Feature] dbt-Bigquery retry for DML job BQ API errors (503 errors, etc. ) #45

hui-zheng opened this issue Oct 22, 2021 · 3 comments · Fixed by #50
Labels
good_first_issue Good for newcomers type:enhancement New feature or request

Comments

@hui-zheng
Copy link
Contributor

Is there an existing feature request for this?

  • I have searched the existing issues

Describe the Feature

Background

Currently, dbt-bigquery profile.yml has two configurations: Timeouts and Retries

We at Semios have been using dbt extensively to run high-frequency data pipelines on BigQuery (some of them run every 10 minutes). Because of the strict SLA requirement, we require more granular control on BigQuery timeout and retry mechanism to minimize the intermittent query failures and its impact to our pipeline.

We found that the existing configs are not enough to mitigate all intermittent query failures. for example, the Deadline of 120.0s exceeded 503 errors raised by BigQuery JobInsert API that occurred from time to time and impacted several dbt-BigQuery users in the dbt community.

We would like to suggest a more comprehensive BigQuery retry config to minimize all these errors.

Technical Context

Before I propose the suggestion, let me explain how BigQuery query works and how we would like to retry.

How BigQuery query works in dbt

at the core, BigQuery query is made by two steps in dbt

    def _query_and_results(self, client, sql, conn, job_params, timeout=None):
        """Query the client and wait for results."""
        # Cannot reuse job_config if destination is set and ddl is used
        job_config = google.cloud.bigquery.QueryJobConfig(**job_params)    # <--- Step 1
        query_job = client.query(sql, job_config=job_config)               # <--- Step 2
        iterator = query_job.result(timeout=timeout)
        return query_job, iterator

In the first step, client.query() submits a query to BQ JobInsert API server, when succeeded, BQ server creates a new BigQuery query job, and return the query job id back to the client as part ofquery_job object. This step shall be very quick, normally under a few seconds. however, in some rare cases, it would take much longer and might even up to 4 minutes according to the BigQuery engineering team.

In the 2nd step, query_job.result() await for the BigQuery executing (running) the query and return the results back to the client as an iterator. depending on the complexity of the query, this step could takes long, from tens of seconds to tens of minutes.

how dbt currently retries BigQuery queries

Currently, The dbt retry config is used to set the retries of the overall _query_and_results. the dbt timeout config is used only to control step 2, query_job.result(timeout=timeout). These configs make sense on its own.

However, you might notice that there is no control over the timeout of query() in step 1, other than relying on its [default value] (https://github.com/googleapis/python-bigquery/blob/1762e5bd98ab089abf65a0d146234c77217b8dbe/google/cloud/bigquery/client.py#L3193). When BQ JobInsert API sharding cluster is unstable. it could take up to 4 minutes to create the query job id in some rare cases. Because the query() had a client-side [default deadline of 120s](DEFAULT_RETRY = retry.Retry(predicate=_should_retry)), the client quits at 120s while server side is still waiting for the job creation. That's why several dbt community members had experienced the [Deadline of 120.0s exceeded errors (https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400).

BQ team has since patched the query() to make a client-side [default deadline of 10 mins](DEFAULT_RETRY = retry.Retry(predicate=_should_retry, deadline=600.0). It means by default, query() could take up to 4 minutes to make one attempt to create a new job id. If it fails, the client-side could have up to 10 minutes to retry. This would work fine for users who don't have stringent timing requirements. They won't mind waiting for 10 minutes to create a new BQ job ID and then waiting for another 10 minutes to get the result.

Unfortunately, it doesn't work for us. In situations like this, we would rather fail faster and try again. It shall not take more than 30 seconds to create a new job id. if it takes too long, we would rather let the query() timeout and fail, so that we could retry it again.

the Proposed Solution

In order to gain fine control of the timeout mechanism of step 1 and step 2, we would like to propose the following 4 dbt configs

job_creation_timeout_seconds     # specific for initiate BQ job, to control the timeout of step 1, query()
job_execution_timeout_seconds    # specific for awaiting job result, to control the timeout of step 2, result()

job_retry_deadline_seconds       # to control the overal query, retry_deadline of _query_and_results()
job_retries                      # to control the overall query, retries of _query_and_results()

For example, we could set the configs below to fail faster on the step of BQ job creation, while allowing queries with long-running results.

job_creation_timeout_seconds=30
job_execution_timeout_seconds=1200
job_retry_deadline_seconds=1500
job_retries=3

These settings would allow us to control the timeout behaviors of step 1 and step 2 on their own, hence maximizing our chances to mitigate different kinds of intermittent errors.

NOTE:
job_execution_timeout_seconds is the renaming of the current timeout config.
job_retries is the renaming of the current retries config.

Describe alternatives you've considered

Currently, we have to override the dbt-core code to allow the query() to fail faster and retry, and to mitigate the 503 error.

Who will this benefit?

the dbt-bigquery users who would need more fine control of BigQuery query behaviors

Are you interested in contributing this feature?

yes

Anything else?

this issue has been discussed by several dbt community members in slack
https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400

@jtcohen6
Copy link
Contributor

@hui-zheng Your proposal makes a lot of sense to me! I really appreciate the level of detail in your explanation. While I'd obviously prefer fewer than four separate configs related to job timeout/retry, it sounds like that's just the level of fine-grained control that many community members need.

From an implementation point of view, it sounds like everything should be quite simple: adding more properties to BigQueryCredentials and passing them into the various methods you've identified above. To that end, I'm going to transfer this issue to the dbt-bigquery repo (where the code now lives), mark this a good first issue, and welcome a PR for it.

For my part, I'd just like to see us maintain backwards compatibility for the existing config names (timeout_seconds + retries). I agree that the proposed names are much more precise, and they're what we should document going forward. If users upgrade with the old config names, they should continue working, and we should map their values over to the new configs.

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Oct 26, 2021
@jtcohen6 jtcohen6 added type:enhancement New feature or request good_first_issue Good for newcomers labels Oct 26, 2021
@hui-zheng
Copy link
Contributor Author

@jtcohen6,

thank you. your suggestions sound good.

Just a note, we need to also make sure we use google-bigquery version 2.26 or above to make this work.

@hui-zheng
Copy link
Contributor Author

I will take the first stab at this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good_first_issue Good for newcomers type:enhancement New feature or request
Projects
None yet
2 participants