Fix tasks spawned by `tower_batch::Batch` only being alive during one test execution #2390

jvff · 2021-06-24T20:00:45Z

Motivation

While implementing tests in #2371, I ran into an issue that caused the tests to fail in CI. As far as I understand, the issues arises because of three things:

tower_batch::Batch::new spawns a task in the Tokio runtime;
the primitives::*::Verifiers are stored in shared global variables, properly handled by Lazy wrappers;
#[tokio::test] creates separate runtimes for separate tests

This means that shared batch verifiers will spawn a task on the runtime that first uses it, but that task will not survive for the next runtime (i.e., the next test execution), causing it to fail.

The #2371 PR included a workaround that runs two tests in a single runtime, in order to guarantee that the underlying spawned task is alive for both of the tests.

Note: I think it's important to mention that this issue only affects tests, since in a running Zebra instance I think there should only be one Tokio runtime.

Additional work

Separate the two tests that compose the v4_with_sprout_transfers in zebra_consensus::transaction::tests.

Solution Ideas

This section contains some things I considered while investigating the issue and trying to find a solution or workaround.

Create an independent verifier for each test

Add a zebra_test or batch verifier function that creates an independent verifier.

This solution is robust, because tests can't affect each other.

We can simplify the tests by moving verifier setup code into a common function or method.

Create a global verifier tokio runtime

Add a zebra_test function that enters a shared verifier tokio runtime:
https://docs.rs/tokio/1.7.1/tokio/runtime/struct.Runtime.html#shutdown

The first test creates the shared runtime, the others just re-use it.

Remove global shared verifiers

This might be the simplest solution, but I'm not sure if it violates any existing design decisions that could lead to more issues in the future. I'm not entirely sure, but I think that currently the global verifiers are only used inside zebra_consensus::transaction::Verifier, which is in turn only used inside zebra_consensus::block::Verifier, and that is only used inside the zebra_consensus::chain::init function.

If the scope is indeed limited, it might be possible to just remove the global shared verifiers so that they are just instantiated inside the transaction::Verifier. If the verifiers are instantiated elsewhere and sent to the transaction::Verifier through generic type parameters, it could also help with testing, because mock verifiers could be sent.

The advantage is its limited scope and possible simplicity. The disadvantage is that it might affect an active design.

Create a `RestartableLazy` type for tests

Since the issue arises because the Batch service is created only once, it might be possible to override the Lazy implementation when in #[cfg(test)] with a RestartableLazy type, that exposes a restart method that forces the verifier to be recreated. This type can then be aliased to Lazy when building the tests.

I looked into how hard this can be, and the Lazy implementation is quite simple. However, there is one big issue. Lazy is safe because it can assume that the inner value will never change (or will only change through &mut Lazy. In our case, the Lazy is placed in a global variable, so changes would have to happen through a &Lazy reference, and that requires runtime checks, like using RwLock and RefCell. That also makes things harder for implementing Deref and DerefMut, because it's not a simple dereference, but actual usage of guard types.

More research would be required to have a good implementation. The advantages are that it should be minimal change to the existing code, and it would only impact tests. The disadvantages are that the new type has some complex edge-cases that need to be figured out, and that the tests would run through a separate path than actual code, which can be risky.

Refactor `tower_batch::Batch` to not spawn the worker task

The idea here is to have some sort of SharedFuture type that wraps the actual worker.run() future in an Arc, so that this future can be included in all returned ResponseFutures. This means that the worker progresses every time the ResponseFuture is polled.

The difficulty here is to poll a shared future, so the SharedFuture abstraction might come in handy to isolate the complexity. The future might need to be in an Arc<tokio::sync::Mutex<...>>, and acquiring the lock might be a little complex, since first it requires polling the future to acquire the lock, then polling the actual future, and ensuring the obtained guard is dropped so that it is never held between poll calls.

The advantages are that this makes tower_batch more robust, and that no code outside tower_batch would need to be changed. The disadvantages are that the SharedFuture abstraction needs to be properly implemented, and that the worker no longer runs in a separate task. The latter can be mitigated by having the actual response spawned in a new task, but the user of the library must be aware of this.

Recreate the worker task if `Batch` detects that it has stopped

This can be done in #[cfg(test)]. There is already a check to see if the worker is alive, so the change would be to update that code to respawn the task. To do so, the service would need to be wrapped in an Arc, so that it can be shared between the multiple worker instances that end up appearing. Other parameters (max_items and max_latency) will also need to be stored in the Batch in order to recreate the Worker.

If this is done only in #[cfg(test)], the extra Batch fields will also require that attribute, which might make the code harder to read and possibly easier to make mistakes.

The advantage is the limited changes. The disadvantages are the possible performance, memory and/or code readability hit.

Related Work

A potential solution might be related to Generalize/deduplicate async batch verification services / create a derive-able proc macro #1951

The text was updated successfully, but these errors were encountered:

teor2345 · 2021-06-24T20:59:21Z

I think we should deduplicate the batch verifiers before doing this task

teor2345 · 2021-06-24T21:01:47Z

We can't re-use the same tokio runtime in all the tests, because some tests pause and advance Tokio's internal time.

jvff · 2021-06-24T21:15:14Z

Yep, which is why that solution worked for #2371, but will likely not work for some new test we implement soon :(

dconnolly · 2021-06-29T01:34:11Z

If the scope is indeed limited, it might be possible to just remove the global shared verifiers so that they are just instantiated inside the transaction::Verifier. If the verifiers are instantiated elsewhere and sent to the transaction::Verifier through generic type parameters, it could also help with testing, because mock verifiers could be sent.

I think this is doable. The goal with the global batch verifiers is to amortize signature/proof verification across many transactions, including across blocks, so if the TransactionVerifier is the same across those, that seems doable.

jvff added C-enhancement Category: This is an improvement S-needs-triage Status: A bug report needs triage labels Jun 24, 2021

This was referenced Jun 24, 2021

Generalize/deduplicate async batch verification services / create a derive-able proc macro #1951

Closed

Refactor Sprout Join Split validation by transaction verifier #2371

Merged

mpguerra added the P-Low label Jun 25, 2021

jvff mentioned this issue Jun 25, 2021

Create a shared Tokio runtime for tests #2397

Merged

3 tasks

mpguerra removed the S-needs-triage Status: A bug report needs triage label Jun 28, 2021

teor2345 mentioned this issue Jun 29, 2021

Fix intermittent errors in the groth16 verifier tests #2412

Merged

3 tasks

mpguerra linked a pull request Jun 29, 2021 that will close this issue

Fix intermittent errors in the groth16 verifier tests #2412

Merged

3 tasks

jvff mentioned this issue Jun 30, 2021

Refactor validation of Sapling shielded data in transaction::Verifier #2419

Merged

3 tasks

mpguerra linked a pull request Jul 1, 2021 that will close this issue

Create a shared Tokio runtime for tests #2397

Merged

3 tasks

jvff mentioned this issue Jul 9, 2021

Consider refactoring the verifiers in transaction::Verifier #2473

Closed

jvff added the S-incomplete label Feb 24, 2022

jvff closed this as completed Feb 24, 2022

arya2 mentioned this issue Mar 31, 2023

feat(script): Add binary for finding references to closed issues #6347

Merged

6 tasks

mpguerra mentioned this issue Apr 11, 2023

Tracking: TODOs with closed tasks #6281

Closed

4 tasks

mpguerra closed this as not planned Won't fix, can't repro, duplicate, stale May 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix tasks spawned by `tower_batch::Batch` only being alive during one test execution #2390

Fix tasks spawned by `tower_batch::Batch` only being alive during one test execution #2390

jvff commented Jun 24, 2021 •

edited

Loading

teor2345 commented Jun 24, 2021

teor2345 commented Jun 24, 2021

jvff commented Jun 24, 2021

dconnolly commented Jun 29, 2021

Fix tasks spawned by tower_batch::Batch only being alive during one test execution #2390

Fix tasks spawned by tower_batch::Batch only being alive during one test execution #2390

Comments

jvff commented Jun 24, 2021 • edited Loading

Motivation

Additional work

Solution Ideas

Create an independent verifier for each test

Create a global verifier tokio runtime

Remove global shared verifiers

Create a RestartableLazy type for tests

Refactor tower_batch::Batch to not spawn the worker task

Recreate the worker task if Batch detects that it has stopped

Related Work

teor2345 commented Jun 24, 2021

teor2345 commented Jun 24, 2021

jvff commented Jun 24, 2021

dconnolly commented Jun 29, 2021

Fix tasks spawned by `tower_batch::Batch` only being alive during one test execution #2390

Fix tasks spawned by `tower_batch::Batch` only being alive during one test execution #2390

jvff commented Jun 24, 2021 •

edited

Loading

Create a `RestartableLazy` type for tests

Refactor `tower_batch::Batch` to not spawn the worker task

Recreate the worker task if `Batch` detects that it has stopped