Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future cancellation impacts decision to retry transactions #1417

Closed
divergentdave opened this issue May 26, 2023 · 0 comments · Fixed by #1418
Closed

Future cancellation impacts decision to retry transactions #1417

divergentdave opened this issue May 26, 2023 · 0 comments · Fixed by #1418
Assignees

Comments

@divergentdave
Copy link
Collaborator

In places where we pipeline many database query futures with futures::future::try_join_all or tokio::try_join!, the short-circuiting behavior of these combinators may keep us from retrying database transactions when we otherwise would. If one query gets a T_R_SERIALIZATION_FAILURE/T_R_DEADLOCK_DETECTED error, all queries submitted after it will get IN_FAILED_SQL_TRANSACTION. The connection task will submit these error results over per-query future channels in this order, but the query futures may wake up and receive their results in a different order. Thus, a IN_FAILED_SQL_TRANSACTION error may be passed out of our datastore method to try_join_all/try_join! first, which cancels all outstanding futures, and in turn may cause a T_R_SERIALIZATION_FAILURE to be dropped while in a tokio-postgres channel, before it gets to our check_error() function.

This means that expected serialization errors may inadvertently cause transactions to fail while handling a background processing job or an HTTP request. Normally, job retry logic or whichever HTTP client's retry logic will compensate for this with a larger-scope retry, at the cost of some repeated work. However, the Janus-Daphne integration test is not as forgiving, because Daphne does not yet support idempotent handling of aggregation job requests. There's a critical window from when Daphne receives an aggregation job request from Janus, until Janus successfully store the results in its database, where any error will result in the aggregators becoming out of sync, stranding that job's reports, and causing the integration test's collection to fail. Recent changes in this window increased the degree to which we pipeline requests, and we are now seeing CI failures due to this test. I examined logs for two of them, and saw that an aggregation job PUT was getting retried, failing the first time due to "current transaction is aborted", and then succeeding the second time, but with "ReportReplayed" errors for each report share. The database log showed that the reason the transaction was aborted was a serialization failure. I put together a set of changes to not cancel futures within this window, and confirmed that this fixes the test's flakiness, see d3ac65e.

I think that addressing this issue would either require intrusive modifications to tokio-postgres, to push our transaction retry decision inside the connection task or expose copies of errors through a separate future/stream, or significant changes to our coding discipline, to treat all futures interacting with the database as not cancellation safe, due to their retry flag side effects. I looked at sqlx, the other major Postgres client library in Rust, and it doesn't support query pipelining, thus making it both immune to this issue, and not a good fit for our codebase, since we rely on pipelining currently to keep the number of network round trips down. We could work around the impact of this issue in our test suite by adding a retry loop to the Daphne integration test. We should keep an eye on the error rate of aggregation jobs, collection jobs, and HTTP server requests during load tests for possible impact from this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants