Currently, the dbt-bigquery profiles.yml has two relevant configurations: timeouts and retries.

We at Semios have been using dbt extensively to run high-frequency data pipelines on BigQuery (some of them run every 10 minutes). Because of strict SLA requirements, we need more granular control over the BigQuery timeout and retry mechanisms to minimize intermittent query failures and their impact on our pipelines.

We found that the existing configs are not enough to mitigate all intermittent query failures, for example the `Deadline of 120.0s exceeded` 503 errors raised by the BigQuery JobInsert API, which occur from time to time and have impacted several dbt-BigQuery users in the dbt community.

We would like to suggest a more comprehensive set of BigQuery retry configs to minimize all of these errors.
Technical Context
Before presenting the proposal, let me explain how a BigQuery query works in dbt and how we would like to retry it.

How BigQuery query works in dbt

At its core, a BigQuery query in dbt is made in two steps:
def _query_and_results(self, client, sql, conn, job_params, timeout=None):
    """Query the client and wait for results."""
    # Cannot reuse job_config if destination is set and ddl is used
    job_config = google.cloud.bigquery.QueryJobConfig(**job_params)
    query_job = client.query(sql, job_config=job_config)  # <--- Step 1
    iterator = query_job.result(timeout=timeout)  # <--- Step 2
    return query_job, iterator
In the first step, `client.query()` submits the query to the BQ JobInsert API server. When it succeeds, the BQ server creates a new BigQuery query job and returns the query job id to the client as part of the `query_job` object. This step should be very quick, normally under a few seconds; however, in some rare cases it can take much longer, even up to 4 minutes according to the BigQuery engineering team.

In the second step, `query_job.result()` waits for BigQuery to execute (run) the query and returns the results to the client as an iterator. Depending on the complexity of the query, this step can take a long time, from tens of seconds to tens of minutes.
How dbt currently retries BigQuery queries
Currently, the dbt `retries` config is used to set the retries of the overall `_query_and_results`, and the dbt `timeout` config is used only to control step 2, `query_job.result(timeout=timeout)`. These configs make sense on their own.
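As a simplified illustration only (not dbt's actual implementation), the existing configs apply roughly like this: `retries` re-runs the whole `_query_and_results` call on transient server errors, while `timeout` only bounds `result()` in step 2.

```python
from google.api_core.exceptions import InternalServerError, ServiceUnavailable

# Simplified sketch, not dbt's actual code: `retries` wraps the whole
# _query_and_results call, while `timeout` only bounds query_job.result().
def run_with_retries(connections, client, sql, conn, job_params,
                     retries=1, timeout=300):
    last_error = None
    for attempt in range(1 + retries):
        try:
            return connections._query_and_results(
                client, sql, conn, job_params, timeout=timeout
            )
        except (ServiceUnavailable, InternalServerError) as exc:
            last_error = exc  # transient 5xx from BigQuery; try again
    raise last_error
```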
However, you might notice that there is no control over the timeout of `query()` in step 1, other than relying on its [default value](https://github.com/googleapis/python-bigquery/blob/1762e5bd98ab089abf65a0d146234c77217b8dbe/google/cloud/bigquery/client.py#L3193). When the BQ JobInsert API sharding cluster is unstable, it can take up to 4 minutes to create the query job id in some rare cases. Because `query()` had a client-side default deadline of 120s (`DEFAULT_RETRY = retry.Retry(predicate=_should_retry)`), the client gives up at 120s while the server side is still waiting for the job creation. That's why several dbt community members have experienced the [Deadline of 120.0s exceeded errors](https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400).

The BQ team has since patched `query()` to use a client-side default deadline of 10 minutes (`DEFAULT_RETRY = retry.Retry(predicate=_should_retry, deadline=600.0)`). It means that, by default, `query()` can take up to 4 minutes to make one attempt to create a new job id, and if that fails, the client side has up to 10 minutes to retry. This works fine for users who don't have stringent timing requirements; they won't mind waiting 10 minutes to create a new BQ job id and then another 10 minutes to get the result.
Unfortunately, that doesn't work for us. In situations like this, we would rather fail faster and try again. It should not take more than 30 seconds to create a new job id; if it takes too long, we would rather let `query()` time out and fail, so that we can retry it.
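For illustration, here is a rough sketch of how a tighter job-creation deadline could be combined with a long execution timeout using the google-cloud-bigquery client directly. This is not dbt's current code; the function name and the 30s/1200s values are placeholders.

```python
from google.cloud import bigquery
from google.cloud.bigquery.retry import DEFAULT_RETRY

def query_with_tight_creation_deadline(client, sql, job_params,
                                       job_creation_timeout=30.0,
                                       job_execution_timeout=1200.0):
    job_config = bigquery.QueryJobConfig(**job_params)
    # Cap the client-side retry deadline for jobs.insert (step 1)
    # instead of relying on the library default of 600s.
    create_retry = DEFAULT_RETRY.with_deadline(job_creation_timeout)
    query_job = client.query(
        sql,
        job_config=job_config,
        retry=create_retry,            # step 1: job creation retries
        timeout=job_creation_timeout,  # per-request timeout for the API call
    )
    iterator = query_job.result(timeout=job_execution_timeout)  # step 2
    return query_job, iterator
```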
The Proposed Solution
In order to gain fine control over the timeout mechanisms of step 1 and step 2, we would like to propose the following four dbt configs:
job_creation_timeout_seconds   # specific for initiating the BQ job; controls the timeout of step 1, query()
job_execution_timeout_seconds  # specific for awaiting the job result; controls the timeout of step 2, result()
job_retry_deadline_seconds     # controls the overall query: the retry_deadline of _query_and_results()
job_retries                    # controls the overall query: the retries of _query_and_results()
For example, we could set the configs below to fail faster on the BQ job creation step, while still allowing queries with long-running results.
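A minimal sketch of what this might look like in profiles.yml, assuming the proposed config names; the profile, project, and dataset names and the numeric values are placeholders, not recommendations:

```yaml
my-bigquery-profile:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: my-gcp-project   # placeholder
      dataset: analytics        # placeholder
      # proposed configs
      job_creation_timeout_seconds: 30     # fail fast if the BQ job id isn't created within 30s
      job_execution_timeout_seconds: 1200  # allow query results to take up to 20 minutes
      job_retry_deadline_seconds: 1500     # overall deadline across retries of _query_and_results
      job_retries: 3                       # retry the whole query up to 3 times
```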
These settings would let us control the timeout behavior of step 1 and step 2 independently, maximizing our chances of mitigating the different kinds of intermittent errors.
NOTE: `job_execution_timeout_seconds` is a renaming of the current `timeout` config, and `job_retries` is a renaming of the current `retries` config.
Describe alternatives you've considered
Currently, we have to override the dbt-core code to allow `query()` to fail faster and retry, in order to mitigate the 503 errors.
Who will this benefit?
dbt-bigquery users who need finer control over BigQuery query behavior.

Are you interested in contributing this feature?

Yes.

Anything else?

This issue has been discussed by several dbt community members in Slack: https://getdbt.slack.com/archives/C99SNSRTK/p1627926805026400
@hui-zheng Your proposal makes a lot of sense to me! I really appreciate the level of detail in your explanation. While I'd obviously prefer fewer than four separate configs related to job timeout/retry, it sounds like that's just the level of fine-grained control that many community members need.
From an implementation point of view, it sounds like everything should be quite simple: adding more properties to BigQueryCredentials and passing them into the various methods you've identified above. To that end, I'm going to transfer this issue to the dbt-bigquery repo (where the code now lives), mark this a good first issue, and welcome a PR for it.
For my part, I'd just like to see us maintain backwards compatibility for the existing config names (timeout_seconds + retries). I agree that the proposed names are much more precise, and they're what we should document going forward. If users upgrade with the old config names, they should continue working, and we should map their values over to the new configs.
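For illustration, here is a rough, hypothetical sketch of what that mapping could look like; it is just to show the idea, not necessarily how dbt implements config aliases.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified stand-in for dbt's BigQueryCredentials,
# showing the new fields plus an alias map from the old config names.
@dataclass
class BigQueryCredentials:
    job_creation_timeout_seconds: Optional[int] = None
    job_execution_timeout_seconds: Optional[int] = 300
    job_retry_deadline_seconds: Optional[int] = None
    job_retries: int = 1

    # Old names keep working and map onto the new fields.
    _ALIASES = {
        "timeout_seconds": "job_execution_timeout_seconds",
        "retries": "job_retries",
    }
```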