Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tasks spawned by tower_batch::Batch only being alive during one test execution #2390

Closed
1 task
jvff opened this issue Jun 24, 2021 · 4 comments · Fixed by #2412 or #2397
Closed
1 task

Fix tasks spawned by tower_batch::Batch only being alive during one test execution #2390

jvff opened this issue Jun 24, 2021 · 4 comments · Fixed by #2412 or #2397
Labels
C-enhancement Category: This is an improvement

Comments

@jvff
Copy link
Contributor

jvff commented Jun 24, 2021

Motivation

While implementing tests in #2371, I ran into an issue that caused the tests to fail in CI. As far as I understand, the issues arises because of three things:

  • tower_batch::Batch::new spawns a task in the Tokio runtime;
  • the primitives::*::Verifiers are stored in shared global variables, properly handled by Lazy wrappers;
  • #[tokio::test] creates separate runtimes for separate tests

This means that shared batch verifiers will spawn a task on the runtime that first uses it, but that task will not survive for the next runtime (i.e., the next test execution), causing it to fail.

The #2371 PR included a workaround that runs two tests in a single runtime, in order to guarantee that the underlying spawned task is alive for both of the tests.

Note: I think it's important to mention that this issue only affects tests, since in a running Zebra instance I think there should only be one Tokio runtime.

Additional work

Solution Ideas

This section contains some things I considered while investigating the issue and trying to find a solution or workaround.

Create an independent verifier for each test

Add a zebra_test or batch verifier function that creates an independent verifier.

This solution is robust, because tests can't affect each other.

We can simplify the tests by moving verifier setup code into a common function or method.

Create a global verifier tokio runtime

Add a zebra_test function that enters a shared verifier tokio runtime:
https://docs.rs/tokio/1.7.1/tokio/runtime/struct.Runtime.html#shutdown

The first test creates the shared runtime, the others just re-use it.

Remove global shared verifiers

This might be the simplest solution, but I'm not sure if it violates any existing design decisions that could lead to more issues in the future. I'm not entirely sure, but I think that currently the global verifiers are only used inside zebra_consensus::transaction::Verifier, which is in turn only used inside zebra_consensus::block::Verifier, and that is only used inside the zebra_consensus::chain::init function.

If the scope is indeed limited, it might be possible to just remove the global shared verifiers so that they are just instantiated inside the transaction::Verifier. If the verifiers are instantiated elsewhere and sent to the transaction::Verifier through generic type parameters, it could also help with testing, because mock verifiers could be sent.

The advantage is its limited scope and possible simplicity. The disadvantage is that it might affect an active design.

Create a RestartableLazy type for tests

Since the issue arises because the Batch service is created only once, it might be possible to override the Lazy implementation when in #[cfg(test)] with a RestartableLazy type, that exposes a restart method that forces the verifier to be recreated. This type can then be aliased to Lazy when building the tests.

I looked into how hard this can be, and the Lazy implementation is quite simple. However, there is one big issue. Lazy is safe because it can assume that the inner value will never change (or will only change through &mut Lazy. In our case, the Lazy is placed in a global variable, so changes would have to happen through a &Lazy reference, and that requires runtime checks, like using RwLock and RefCell. That also makes things harder for implementing Deref and DerefMut, because it's not a simple dereference, but actual usage of guard types.

More research would be required to have a good implementation. The advantages are that it should be minimal change to the existing code, and it would only impact tests. The disadvantages are that the new type has some complex edge-cases that need to be figured out, and that the tests would run through a separate path than actual code, which can be risky.

Refactor tower_batch::Batch to not spawn the worker task

The idea here is to have some sort of SharedFuture type that wraps the actual worker.run() future in an Arc, so that this future can be included in all returned ResponseFutures. This means that the worker progresses every time the ResponseFuture is polled.

The difficulty here is to poll a shared future, so the SharedFuture abstraction might come in handy to isolate the complexity. The future might need to be in an Arc<tokio::sync::Mutex<...>>, and acquiring the lock might be a little complex, since first it requires polling the future to acquire the lock, then polling the actual future, and ensuring the obtained guard is dropped so that it is never held between poll calls.

The advantages are that this makes tower_batch more robust, and that no code outside tower_batch would need to be changed. The disadvantages are that the SharedFuture abstraction needs to be properly implemented, and that the worker no longer runs in a separate task. The latter can be mitigated by having the actual response spawned in a new task, but the user of the library must be aware of this.

Recreate the worker task if Batch detects that it has stopped

This can be done in #[cfg(test)]. There is already a check to see if the worker is alive, so the change would be to update that code to respawn the task. To do so, the service would need to be wrapped in an Arc, so that it can be shared between the multiple worker instances that end up appearing. Other parameters (max_items and max_latency) will also need to be stored in the Batch in order to recreate the Worker.

If this is done only in #[cfg(test)], the extra Batch fields will also require that attribute, which might make the code harder to read and possibly easier to make mistakes.

The advantage is the limited changes. The disadvantages are the possible performance, memory and/or code readability hit.

Related Work

@teor2345
Copy link
Contributor

I think we should deduplicate the batch verifiers before doing this task

@teor2345
Copy link
Contributor

We can't re-use the same tokio runtime in all the tests, because some tests pause and advance Tokio's internal time.

@jvff
Copy link
Contributor Author

jvff commented Jun 24, 2021

Yep, which is why that solution worked for #2371, but will likely not work for some new test we implement soon :(

@mpguerra mpguerra added the P-Low label Jun 25, 2021
@mpguerra mpguerra removed the S-needs-triage Status: A bug report needs triage label Jun 28, 2021
@dconnolly
Copy link
Contributor

If the scope is indeed limited, it might be possible to just remove the global shared verifiers so that they are just instantiated inside the transaction::Verifier. If the verifiers are instantiated elsewhere and sent to the transaction::Verifier through generic type parameters, it could also help with testing, because mock verifiers could be sent.

I think this is doable. The goal with the global batch verifiers is to amortize signature/proof verification across many transactions, including across blocks, so if the TransactionVerifier is the same across those, that seems doable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: This is an improvement
Projects
None yet
4 participants