In places where we pipeline many database query futures with `futures::future::try_join_all` or `tokio::try_join!`, the short-circuiting behavior of these combinators may keep us from retrying database transactions when we otherwise would. If one query gets a `T_R_SERIALIZATION_FAILURE`/`T_R_DEADLOCK_DETECTED` error, all queries submitted after it will get `IN_FAILED_SQL_TRANSACTION`. The connection task will submit these error results over per-query future channels in this order, but the query futures may wake up and receive their results in a different order. Thus, an `IN_FAILED_SQL_TRANSACTION` error may be passed out of our datastore method to `try_join_all`/`try_join!` first, which cancels all outstanding futures, and in turn may cause a `T_R_SERIALIZATION_FAILURE` to be dropped while still in a `tokio-postgres` channel, before it ever reaches our `check_error()` function.
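To make the failure mode concrete, here is a minimal, self-contained sketch (not Janus code; the `query` stand-ins and their delays are hypothetical) showing how `try_join_all` surfaces only the first error polled to completion and drops the rest:

```rust
use futures::future::try_join_all;
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical stand-in for a pipelined query future; the delay models when
// its result arrives over the per-query channel.
async fn query(delay_ms: u64, result: Result<u64, &'static str>) -> Result<u64, &'static str> {
    sleep(Duration::from_millis(delay_ms)).await;
    result
}

#[tokio::main]
async fn main() {
    // The serialization failure's result arrives *later* than the follow-on
    // "in failed SQL transaction" error, mirroring the out-of-order wakeups
    // described above.
    let results = try_join_all(vec![
        query(50, Err("T_R_SERIALIZATION_FAILURE")),
        query(10, Err("IN_FAILED_SQL_TRANSACTION")),
        query(30, Ok(1)),
    ])
    .await;

    // try_join_all returns the first error it observes and cancels the
    // remaining futures, so the serialization failure is dropped and any
    // retry decision keyed on it never gets a chance to run.
    assert_eq!(results, Err("IN_FAILED_SQL_TRANSACTION"));
}
```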
This means that expected serialization errors may inadvertently cause transactions to fail while handling a background processing job or an HTTP request. Normally, job retry logic or the HTTP client's retry logic will compensate for this with a larger-scope retry, at the cost of some repeated work. However, the Janus-Daphne integration test is not as forgiving, because Daphne does not yet support idempotent handling of aggregation job requests. There's a critical window from when Daphne receives an aggregation job request from Janus until Janus successfully stores the results in its database, during which any error will result in the aggregators becoming out of sync, stranding that job's reports, and causing the integration test's collection to fail. Recent changes in this window increased the degree to which we pipeline requests, and we are now seeing CI failures due to this test. I examined the logs for two of them and saw that an aggregation job PUT was getting retried: it failed the first time due to "current transaction is aborted", then succeeded the second time, but with "ReportReplayed" errors for each report share. The database log showed that the reason the transaction was aborted was a serialization failure. I put together a set of changes to not cancel futures within this window, and confirmed that this fixes the test's flakiness, see d3ac65e.
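For illustration only (this is the general shape of "don't cancel futures within this window", not the contents of d3ac65e), the non-cancelling approach can be expressed as driving every pipelined query to completion with `join_all` and only then surfacing an error, so a late-arriving serialization failure is still observed:

```rust
use futures::future::join_all;

/// Drive all query futures to completion, then report the first error (if
/// any). Because no sibling future is cancelled, every error result is
/// delivered and can be inspected (e.g. to set a transaction retry flag)
/// before the combined error is returned.
async fn join_all_then_check<T, E, F>(futures: Vec<F>) -> Result<Vec<T>, E>
where
    F: std::future::Future<Output = Result<T, E>>,
{
    let mut values = Vec::new();
    let mut first_error = None;
    for result in join_all(futures).await {
        match result {
            Ok(value) => values.push(value),
            // In the real datastore this is where something like check_error()
            // would get to see each error before any future is dropped.
            Err(err) if first_error.is_none() => first_error = Some(err),
            Err(_) => {}
        }
    }
    match first_error {
        Some(err) => Err(err),
        None => Ok(values),
    }
}
```

The trade-off is that we give up early cancellation: once one query fails, the remaining queries still run to completion even though their results will be discarded.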
I think that addressing this issue would require either intrusive modifications to `tokio-postgres`, to push our transaction retry decision inside the connection task or to expose copies of errors through a separate future/stream, or significant changes to our coding discipline, to treat all futures interacting with the database as not cancellation-safe because of their retry-flag side effects. I looked at `sqlx`, the other major Postgres client library in Rust, and it doesn't support query pipelining, which makes it both immune to this issue and a poor fit for our codebase, since we currently rely on pipelining to keep the number of network round trips down. We could work around the impact of this issue in our test suite by adding a retry loop to the Daphne integration test. We should keep an eye on the error rate of aggregation jobs, collection jobs, and HTTP server requests during load tests for possible impact from this issue.
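A rough sketch of the test-suite workaround mentioned above: wrap the integration test's collection step in a bounded retry loop. The `run_collection` closure, attempt count, and backoff are hypothetical placeholders, not the actual Daphne integration test code.

```rust
use std::time::Duration;

/// Retry a fallible async operation up to `max_attempts` times, with a simple
/// linear backoff between attempts. Panics if `max_attempts` is zero.
async fn collect_with_retries<T, E, Fut>(
    mut run_collection: impl FnMut() -> Fut,
    max_attempts: u64,
) -> Result<T, E>
where
    Fut: std::future::Future<Output = Result<T, E>>,
{
    for attempt in 1..=max_attempts {
        match run_collection().await {
            Ok(output) => return Ok(output),
            // Out of attempts: surface the last error to the test harness.
            Err(err) if attempt == max_attempts => return Err(err),
            Err(_) => {
                // Back off briefly before retrying; the duration is arbitrary.
                tokio::time::sleep(Duration::from_millis(500 * attempt)).await;
            }
        }
    }
    panic!("collect_with_retries requires max_attempts >= 1")
}
```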