Fix tasks spawned by `tower_batch::Batch` only being alive during one test execution #2390

Comments
I think we should deduplicate the batch verifiers before doing this task.
We can't re-use the same tokio runtime in all the tests, because some tests pause and advance Tokio's internal time.
Yep, which is why that solution worked for #2371, but will likely not work for some new test we implement soon :(
I think this is doable. The goal with the global batch verifiers is to amortize signature/proof verification across many transactions, including across blocks, so if the `TransactionVerifier` is the same across those, that seems doable.
Motivation
While implementing tests in #2371, I ran into an issue that caused the tests to fail in CI. As far as I understand, the issue arises because of three things:

- `tower_batch::Batch::new` spawns a task in the Tokio runtime;
- `primitives::*::Verifier`s are stored in shared global variables, properly handled by `Lazy` wrappers;
- `#[tokio::test]` creates separate runtimes for separate tests.

This means that shared batch verifiers will spawn a task on the runtime that first uses them, but that task will not survive into the next runtime (i.e., the next test execution), causing the next test to fail.
The #2371 PR included a workaround that runs two tests in a single runtime, in order to guarantee that the underlying spawned task is alive for both of the tests.
Note: I think it's important to mention that this issue only affects tests, since in a running Zebra instance I think there should only be one Tokio runtime.
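To make the mechanism concrete, here is a minimal, self-contained sketch of the failure pattern; an mpsc channel stands in for the batch worker, and none of the names below are Zebra's actual code:

```rust
use once_cell::sync::Lazy;
use tokio::sync::mpsc;

// Stand-in for a global batch verifier: the `Lazy` initializer spawns a
// worker task on whichever Tokio runtime first dereferences it, just like
// `tower_batch::Batch::new` does.
static VERIFIER: Lazy<mpsc::UnboundedSender<u32>> = Lazy::new(|| {
    let (tx, mut rx) = mpsc::unbounded_channel();
    // This task belongs to the runtime of the first test that runs.
    tokio::spawn(async move {
        while let Some(item) = rx.recv().await {
            println!("verified {item}");
        }
    });
    tx
});

#[tokio::test]
async fn first_test() {
    // Spawns the worker on this test's runtime; the send succeeds.
    VERIFIER.send(1).expect("worker is alive");
}

#[tokio::test]
async fn second_test() {
    // A new runtime: the worker task was dropped when the first test's
    // runtime shut down, so the receiver is gone and this send fails.
    VERIFIER.send(2).expect("this panics: the worker task is gone");
}
```

(Which of the two tests fails depends on scheduling, but only the runtime that first dereferences the `Lazy` ever owns a live worker task.)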
Additional work
- `v4_with_sprout_transfers` in `zebra_consensus::transaction::tests`.

Solution Ideas
This section contains some things I considered while investigating the issue and trying to find a solution or workaround.
Create an independent verifier for each test
Add a `zebra_test` or batch verifier function that creates an independent verifier. This solution is robust, because tests can't affect each other. We can simplify the tests by moving verifier setup code into a common function or method, as sketched below.
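A sketch of what such a helper could look like; the function name, the constants, and the exact `ed25519` paths are placeholders, so treat this as an illustration rather than a drop-in implementation:

```rust
use std::time::Duration;
use tower_batch::Batch;
use zebra_consensus::primitives::ed25519;

// Hypothetical helper, e.g. in zebra_test or next to the batch verifiers:
// every call builds a fresh verifier, so its worker task is spawned on the
// calling test's runtime and dies with that runtime, leaving other tests
// unaffected.
pub fn independent_ed25519_verifier() -> Batch<ed25519::Verifier, ed25519::Item> {
    Batch::new(ed25519::Verifier::default(), 64, Duration::from_millis(10))
}
```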
Create a global verifier tokio runtime
Add a `zebra_test` function that enters a shared verifier tokio runtime: https://docs.rs/tokio/1.7.1/tokio/runtime/struct.Runtime.html#shutdown

The first test creates the shared runtime, the others just re-use it, as in the sketch below.
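A rough sketch of this approach; the helper name is hypothetical:

```rust
use once_cell::sync::Lazy;
use tokio::runtime::Runtime;

// Hypothetical zebra_test helper: one runtime shared by every test that
// uses the global batch verifiers, so the worker tasks they spawn survive
// from one test to the next. The runtime is created on first use and never
// shut down (see the Runtime::shutdown docs linked above).
static VERIFIER_RUNTIME: Lazy<Runtime> =
    Lazy::new(|| Runtime::new().expect("failed to create the shared verifier runtime"));

pub fn verifier_runtime() -> &'static Runtime {
    &VERIFIER_RUNTIME
}

// Tests then use #[test] with block_on instead of #[tokio::test]:
#[test]
fn uses_shared_runtime() {
    verifier_runtime().block_on(async {
        // ... exercise the global batch verifiers here ...
    });
}
```

As noted in the comments above, this conflicts with tests that pause and advance Tokio's internal time, since those tests can't share a runtime with the others.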
Remove global shared verifiers
This might be the simplest solution, but I'm not sure if it violates any existing design decisions that could lead to more issues in the future. I'm not entirely sure, but I think that currently the global verifiers are only used inside `zebra_consensus::transaction::Verifier`, which is in turn only used inside `zebra_consensus::block::Verifier`, and that is only used inside the `zebra_consensus::chain::init` function.

If the scope is indeed limited, it might be possible to just remove the global shared verifiers so that they are instantiated inside the `transaction::Verifier`. If the verifiers are instantiated elsewhere and sent to the `transaction::Verifier` through generic type parameters, it could also help with testing, because mock verifiers could be sent (see the sketch below).

The advantage is its limited scope and possible simplicity. The disadvantage is that it might affect an active design.
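A shape sketch of the generic-parameter variant; the struct and field names are simplified stand-ins for the real `transaction::Verifier`:

```rust
// Simplified sketch: the transaction verifier owns its batch verifiers
// instead of reaching for globals, and the caller decides what to inject.
pub struct TransactionVerifier<State, Ed25519Verifier> {
    state: State,
    ed25519: Ed25519Verifier,
    // ... one field per primitive batch verifier ...
}

impl<State, Ed25519Verifier> TransactionVerifier<State, Ed25519Verifier> {
    pub fn new(state: State, ed25519: Ed25519Verifier) -> Self {
        Self { state, ed25519 }
    }
}
```

In production, `chain::init` would build the real batch verifiers once and pass them down; tests could instead inject mocks built with `tower::service_fn`.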
Create a `RestartableLazy` type for tests

Since the issue arises because the `Batch` service is created only once, it might be possible to override the `Lazy` implementation when in `#[cfg(test)]` with a `RestartableLazy` type that exposes a `restart` method that forces the verifier to be recreated. This type can then be aliased to `Lazy` when building the tests.

I looked into how hard this can be, and the `Lazy` implementation is quite simple. However, there is one big issue. `Lazy` is safe because it can assume that the inner value will never change (or will only change through `&mut Lazy`). In our case, the `Lazy` is placed in a global variable, so changes would have to happen through a `&Lazy` reference, and that requires runtime checks, like using `RwLock` or `RefCell`. That also makes things harder for implementing `Deref` and `DerefMut`, because it's not a simple dereference, but actual usage of guard types. A rough sketch follows below.

More research would be required to have a good implementation. The advantages are that it should be a minimal change to the existing code, and it would only impact tests. The disadvantages are that the new type has some complex edge cases that need to be figured out, and that the tests would run through a separate path than the actual code, which can be risky.
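A rough sketch of the shape such a type could take, using plain `fn` pointers for the initializer to keep it short; the races between `restart` and a live guard are exactly the edge cases that would need figuring out:

```rust
use std::ops::Deref;
use std::sync::{RwLock, RwLockReadGuard};

// Hypothetical test-only replacement for `Lazy`: runtime-checked interior
// mutability (`RwLock`) replaces the write-once guarantee.
pub struct RestartableLazy<T> {
    init: fn() -> T,
    value: RwLock<Option<T>>,
}

/// Guard type returned instead of a plain `&T`: this is the cost of the
/// runtime checks that replace `Lazy`'s write-once guarantee.
pub struct Guard<'a, T>(RwLockReadGuard<'a, Option<T>>);

impl<T> Deref for Guard<'_, T> {
    type Target = T;

    fn deref(&self) -> &T {
        self.0
            .as_ref()
            .expect("initialized in `get`; don't race `restart` against a live guard")
    }
}

impl<T> RestartableLazy<T> {
    pub const fn new(init: fn() -> T) -> Self {
        Self {
            init,
            value: RwLock::new(None),
        }
    }

    /// Drop the current value, forcing the next access to recreate it
    /// (e.g. at the start of each test).
    pub fn restart(&self) {
        *self.value.write().unwrap() = None;
    }

    /// Initialize the value if needed, then hand out a read guard.
    pub fn get(&self) -> Guard<'_, T> {
        {
            let mut value = self.value.write().unwrap();
            if value.is_none() {
                *value = Some((self.init)());
            }
        }
        Guard(self.value.read().unwrap())
    }
}
```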
Refactor `tower_batch::Batch` to not spawn the worker task

The idea here is to have some sort of `SharedFuture` type that wraps the actual `worker.run()` future in an `Arc`, so that this future can be included in all returned `ResponseFuture`s. This means that the worker progresses every time a `ResponseFuture` is polled.

The difficulty here is to poll a shared future, so the `SharedFuture` abstraction might come in handy to isolate the complexity. The future might need to be in an `Arc<tokio::sync::Mutex<...>>`, and acquiring the lock might be a little complex, since it first requires polling a future to acquire the lock, then polling the actual worker future, while ensuring the obtained guard is dropped so that it is never held between `poll` calls (see the sketch below).

The advantages are that this makes `tower_batch` more robust, and that no code outside `tower_batch` would need to be changed. The disadvantages are that the `SharedFuture` abstraction needs to be properly implemented, and that the worker no longer runs in a separate task. The latter can be mitigated by having the actual response spawned in a new task, but the user of the library must be aware of this.
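To make the lock-handling subtlety concrete, here is a standalone sketch of a hypothetical `SharedFuture`; it is not a proposal for `tower_batch`'s exact internals:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll};
use tokio::sync::{Mutex, OwnedMutexGuard};

type Worker = Pin<Box<dyn Future<Output = ()> + Send>>;
type LockFuture = Pin<Box<dyn Future<Output = OwnedMutexGuard<Worker>> + Send>>;

// Hypothetical SharedFuture: each clone lives inside one ResponseFuture and
// nudges the shared worker forward on every poll.
pub struct SharedFuture {
    worker: Arc<Mutex<Worker>>,
    // Kept between polls so the waker registered while waiting for the
    // lock is not lost when `poll_worker` returns.
    pending_lock: Option<LockFuture>,
}

impl Clone for SharedFuture {
    fn clone(&self) -> Self {
        Self {
            worker: self.worker.clone(),
            pending_lock: None,
        }
    }
}

impl SharedFuture {
    pub fn new(worker: Worker) -> Self {
        Self {
            worker: Arc::new(Mutex::new(worker)),
            pending_lock: None,
        }
    }

    /// Drive the shared worker by at most one poll step. Called from
    /// `ResponseFuture::poll` before polling the response itself.
    pub fn poll_worker(&mut self, cx: &mut Context<'_>) {
        // Resume an in-flight lock acquisition, or start a new one.
        let mut lock = self
            .pending_lock
            .take()
            .unwrap_or_else(|| Box::pin(self.worker.clone().lock_owned()));

        match lock.as_mut().poll(cx) {
            Poll::Ready(mut guard) => {
                // Poll the worker once; `Pending` just registers our waker.
                let _ = guard.as_mut().poll(cx);
                // `guard` drops here, so the lock is never held between
                // `poll` calls.
            }
            // Whoever holds the lock is driving the worker right now.
            Poll::Pending => self.pending_lock = Some(lock),
        }
    }
}
```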
Recreate the worker task if `Batch` detects that it has stopped

This can be done in `#[cfg(test)]`. There is already a check to see if the worker is alive, so the change would be to update that code to respawn the task. To do so, the service would need to be wrapped in an `Arc`, so that it can be shared between the multiple worker instances that end up appearing. Other parameters (`max_items` and `max_latency`) will also need to be stored in the `Batch` in order to recreate the `Worker`. A standalone model of this idea is sketched below.

If this is done only in `#[cfg(test)]`, the extra `Batch` fields will also require that attribute, which might make the code harder to read and possibly easier to make mistakes in.

The advantage is the limited changes. The disadvantages are the possible performance, memory, and/or code readability hit.
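A standalone model of the respawn logic; the real change would live inside `tower_batch` and would also store `max_items` and `max_latency` on `Batch` as described above, which this toy version omits:

```rust
use std::sync::Arc;
use tokio::sync::mpsc;

// A minimal model of the respawn idea (not tower_batch's real internals):
// the handle keeps everything needed to rebuild its worker, and respawns
// it when it notices the worker is gone.
struct Batch {
    // Shared so each respawned worker can reuse the same service.
    service: Arc<dyn Fn(u32) + Send + Sync>,
    tx: mpsc::UnboundedSender<u32>,
}

impl Batch {
    fn new(service: Arc<dyn Fn(u32) + Send + Sync>) -> Self {
        let tx = Self::spawn_worker(service.clone());
        Self { service, tx }
    }

    fn spawn_worker(service: Arc<dyn Fn(u32) + Send + Sync>) -> mpsc::UnboundedSender<u32> {
        let (tx, mut rx) = mpsc::unbounded_channel();
        tokio::spawn(async move {
            while let Some(request) = rx.recv().await {
                service(request);
            }
        });
        tx
    }

    fn call(&mut self, request: u32) {
        // Stand-in for Batch's existing worker-aliveness check: a closed
        // channel means the worker task died with its runtime.
        if self.tx.is_closed() {
            self.tx = Self::spawn_worker(self.service.clone());
        }
        self.tx.send(request).expect("worker was just respawned");
    }
}
```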
Related Work