This repository has been archived by the owner on Jul 27, 2022. It is now read-only.

Problem: liveness tracker doesn't contain information about some nodes in multi-node integration test #1007

Closed
tomtau opened this issue Feb 6, 2020 · 13 comments

@tomtau
Contributor

tomtau commented Feb 6, 2020

From the multi-node integration test log:

[2020-02-06T00:55:32Z INFO  chain_abci::app] received commit request
[2020-02-06T00:55:37Z INFO  chain_abci::app] received beginblock request
[2020-02-06T00:55:37Z WARN  chain_abci::app::app_init::validator_state] Validator in `last_commit_info` not found in liveness tracker
[2020-02-06T00:55:37Z INFO  chain_abci::app] received endblock request
[2020-02-06T00:55:37Z INFO  chain_abci::app] received commit request
[2020-02-06T00:55:39Z INFO  chain_abci::app] received beginblock request
[2020-02-06T00:55:39Z WARN  chain_abci::app::app_init::validator_state] Validator in `last_commit_info` not found in liveness tracker
[2020-02-06T00:55:39Z INFO  chain_abci::app] received endblock request

This warning shouldn't normally be displayed, as it denotes that the validator wasn't added for tracking.

@tomtau tomtau added this to the CRO Sprint 23 milestone Feb 6, 2020
@tomtau tomtau added the bug Something isn't working label Feb 6, 2020
@yihuang
Collaborator

yihuang commented Feb 10, 2020

The problem is this: when a node is detected as having a non-live fault at `begin_block` of block N, then at `end_block` of N the validator is jailed, its `liveness_tracker` entry is immediately removed, and the validator change is returned to tendermint.
But the node will still appear in `last_commit_info` for the next two blocks. The appearance in the first block is understandable, since `last_commit_info` of block N+1 is about the votes on block N, where the validator was still in effect; but it also appears in the second block, and I guess it's because tendermint has some latency in actually removing the validator.
Since this looks like normal behavior, one solution might be to simply ignore it?

@tomtau
Contributor Author

tomtau commented Feb 10, 2020

I guess it's because tendermint has some latency in actually removing the validator.

It's because of the semantics of how validator updates work:
https://docs.tendermint.com/master/spec/abci/abci.html#endblock

Validator updates returned by block H impact blocks H+1, H+2, and H+3, but only effects changes on the validator set of H+2

--
I think the problem is: what if this wasn't expected?
Some solution options:

  • scheduling the removal of the liveness tracking at N+2?
  • providing more details in the log -- e.g. the validator address + whether its metadata is scheduled to be deleted? The expected ones vs. unexpected ones could have different log levels

@yihuang
Collaborator

yihuang commented Feb 11, 2020

I propose adding a field `inactive_block_height` to the validator state, which records the block height at which the node becomes inactive. When the liveness update procedure finds that the `liveness_tracker` entry doesn't exist, it can further check `inactive_block_height`: if the diff is <= 2, then it's ok; otherwise it should log a real problem.
The field is cleared when the node re-joins.
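
Roughly like this (a minimal sketch; `inactive_block_heights`, `check_vote`, and the address type are placeholders, not the actual chain-abci types):

```rust
use std::collections::{BTreeMap, BTreeSet};
use log::warn;

type TendermintValidatorAddress = [u8; 20];
type BlockHeight = u64;

struct ValidatorState {
    // simplified stand-in for the real liveness tracker
    liveness_tracker: BTreeSet<TendermintValidatorAddress>,
    // proposed: height at which each node became inactive
    inactive_block_heights: BTreeMap<TendermintValidatorAddress, BlockHeight>,
}

fn check_vote(state: &ValidatorState, addr: &TendermintValidatorAddress, height: BlockHeight) {
    if state.liveness_tracker.contains(addr) {
        return; // normal case: the vote is tracked as usual
    }
    match state.inactive_block_heights.get(addr) {
        // removed within the last two blocks: tendermint still reports the
        // vote, so this is expected and needs no warning
        Some(inactive) if height - *inactive <= 2 => {}
        // no recent removal recorded: this really is a problem
        _ => warn!("validator {:?} missing from liveness tracker at height {}", addr, height),
    }
}
```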

@tomtau
Contributor Author

tomtau commented Feb 11, 2020

could it also be cleared if diff == 2 ?

@yihuang
Collaborator

yihuang commented Feb 11, 2020

could it also be cleared if diff == 2 ?

If it gets cleared when diff == 2, it's dedicated to the `liveness_tracker`; if we keep it, it can serve as an indication of inactiveness and of the exact block/time when the node became inactive, which could be useful in other situations.
There are two choices regarding where to store this state:

  • Store an Option<Timespec> in StakedState.
  • Add a standalone BTreeMap<StakedStateAddress, Timespec>.
    They are cleared either when the node re-joins or when the validator metadata gets cleared.
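
For illustration, the two options would look roughly like this (everything besides `StakedState`, `StakedStateAddress`, and `Timespec` is made up):

```rust
use std::collections::BTreeMap;

type Timespec = i64;
type StakedStateAddress = [u8; 20];

// Option 1: store the inactive time inline in the staked state.
struct StakedState {
    // ...existing fields...
    inactive_time: Option<Timespec>,
}

// Option 2: keep a standalone map alongside the validator state.
struct ValidatorState {
    // ...existing fields...
    inactive_times: BTreeMap<StakedStateAddress, Timespec>,
}
```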

@yihuang
Collaborator

yihuang commented Feb 12, 2020

With the current implementation, the simplest thing to do seems to be to reuse the `after_unbond_delete` scheduling and clear the `liveness_tracker` in `metadata_clean`?

@tomtau
Contributor Author

tomtau commented Feb 12, 2020

With the current implementation, the simplest thing to do seems to be to reuse the `after_unbond_delete` scheduling and clear the `liveness_tracker` in `metadata_clean`?

The only thing I'd worry about with that is that `get_nonlive_validators` currently iterates through all trackers -- the code is the same as the original behaviour from @devashishdxt, it was just moved to one place -- so depending on the parameters, it may be the case that, through this interaction, every node that left (whether punished or unbonded) will be punished for being non-live.

@tomtau
Contributor Author

tomtau commented Feb 12, 2020

Maybe the easiest option is to keep it as it is: before printing the "warn" log, check if the address is in `tendermint_validator_addresses` -- if it is, just print it at a lower level (e.g. `debug!` or `info!`) ... if it is not, print it with `warn!` or `error!`, and include the address in both printouts, so one knows from the logs which address wasn't found?
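
Something along these lines (just a sketch -- the function and map types are assumptions; only `tendermint_validator_addresses` and the log macros come from the discussion above):

```rust
use std::collections::BTreeMap;
use log::{info, warn};

type TendermintValidatorAddress = [u8; 20];
type StakedStateAddress = [u8; 20];

fn report_missing_tracker(
    tendermint_validator_addresses: &BTreeMap<TendermintValidatorAddress, StakedStateAddress>,
    addr: &TendermintValidatorAddress,
) {
    if tendermint_validator_addresses.contains_key(addr) {
        // the address is still known to the chain state,
        // so this is the expected post-removal case
        info!(
            "validator {:?} in `last_commit_info` not found in liveness tracker (recently removed)",
            addr
        );
    } else {
        // unknown address: this should never happen
        warn!(
            "validator {:?} in `last_commit_info` not found in liveness tracker",
            addr
        );
    }
}
```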

@yihuang
Collaborator

yihuang commented Feb 12, 2020

Maybe the easiest option is to keep it as it is: before printing the "warn" log, check if the address is in `tendermint_validator_addresses` -- if it is, just print it at a lower level (e.g. `debug!` or `info!`) ... if it is not, print it with `warn!` or `error!`, and include the address in both printouts, so one knows from the logs which address wasn't found?

Sounds good.

yihuang added a commit to yihuang/chain that referenced this issue Feb 12, 2020
…mation about some nodes in multi-node integration test

Solution:
- Check `tendermint_validator_addresses` when `liveness_tracker` not
found, if cleaned up, log error, otherwise log info.
@devashishdxt
Collaborator

devashishdxt commented Feb 12, 2020

I think we should not lower the log level here.

Explanation:
Assume a situation where a valid validator is not added to the liveness tracker (which would be a bug). If we lower the log level to `info!` or `debug!`, we'll never catch this unless we read all the `info!` or `debug!` logs. On the other hand, we can easily filter the `warn!` and `error!` logs and see if there was an issue. If we see this `warn!` message more than 2-3 times for the same validator, then it should be a bug (seeing this log 2-3 times is ok because of #1007 (comment)).

@tomtau
Contributor Author

tomtau commented Feb 12, 2020

I think we should not lower the log level here.

Explanation:
Assume a situation where a valid validator is not added to the liveness tracker (which would be a bug). If we lower the log level to `info!` or `debug!`, we'll never catch this unless we read all the `info!` or `debug!` logs. On the other hand, we can easily filter the `warn!` and `error!` logs and see if there was an issue. If we see this `warn!` message more than 2-3 times for the same validator, then it should be a bug (seeing this log 2-3 times is ok because of #1007 (comment)).

Add a log when it's added to the tracker?

@devashishdxt
Collaborator

devashishdxt commented Feb 12, 2020

I think we should not lower the log level here.
Explanation:
Assume a situation where a valid validator is not added to the liveness tracker (which would be a bug). If we lower the log level to `info!` or `debug!`, we'll never catch this unless we read all the `info!` or `debug!` logs. On the other hand, we can easily filter the `warn!` and `error!` logs and see if there was an issue. If we see this `warn!` message more than 2-3 times for the same validator, then it should be a bug (seeing this log 2-3 times is ok because of #1007 (comment)).

Add a log when it's added to the tracker?

Yes. Add logs at the entry and exit points. Also print the validator address in the logs.
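
For example, something like this (method names and the tracker shape are placeholders for the real implementation):

```rust
use std::collections::BTreeSet;
use log::info;

type TendermintValidatorAddress = [u8; 20];

#[derive(Default)]
struct LivenessTracker {
    validators: BTreeSet<TendermintValidatorAddress>, // simplified
}

impl LivenessTracker {
    fn add_validator(&mut self, addr: TendermintValidatorAddress) {
        // entry point: log the address when tracking starts
        info!("adding validator {:?} to liveness tracker", addr);
        self.validators.insert(addr);
    }

    fn remove_validator(&mut self, addr: &TendermintValidatorAddress) {
        // exit point: log the address when tracking stops
        info!("removing validator {:?} from liveness tracker", addr);
        self.validators.remove(addr);
    }
}
```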

bors bot added a commit that referenced this issue Feb 12, 2020
1055: Problem (Fix #1007): liveness tracker doesn't contain information about some nodes in multi-node integration test r=tomtau a=yihuang

Solution:
- Check `tendermint_validator_addresses` when `liveness_tracker` not found, if cleaned up, log error otherwise log info.

Co-authored-by: yihuang <huang@crypto.com>
@bors bors bot closed this as completed in 667d823 Feb 12, 2020
@tomtau tomtau reopened this Feb 12, 2020
@yihuang
Collaborator

yihuang commented Feb 12, 2020

Then I think we'd better accurately implement our intentions here: the `liveness_tracker` entry for the vote is allowed to not exist if and only if `block_height - inactive_block_height <= 2`.
Implementation options:

  • Record the `inactive_block_height` (cleared when the node becomes active again?)
  • Add a scheduling queue that removes the `liveness_tracker` entry two blocks later. (clear it when the node re-joins? don't trigger a non-live fault while it is scheduled?)
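
A rough sketch of the second option, assuming a simple height-keyed queue (all names here are illustrative, not the current code):

```rust
use std::collections::BTreeMap;

type TendermintValidatorAddress = [u8; 20];
type BlockHeight = u64;

#[derive(Default)]
struct LivenessCleanupQueue {
    // removal height -> addresses whose tracker entries can be dropped then
    scheduled: BTreeMap<BlockHeight, Vec<TendermintValidatorAddress>>,
}

impl LivenessCleanupQueue {
    /// Schedule the tracker removal two blocks after the validator is removed.
    fn schedule(&mut self, addr: TendermintValidatorAddress, removed_at: BlockHeight) {
        self.scheduled.entry(removed_at + 2).or_default().push(addr);
    }

    /// Called each block; returns the addresses whose trackers can now be dropped.
    fn due(&mut self, current_height: BlockHeight) -> Vec<TendermintValidatorAddress> {
        let heights: Vec<BlockHeight> = self
            .scheduled
            .range(..=current_height)
            .map(|(h, _)| *h)
            .collect();
        let mut due = Vec::new();
        for h in heights {
            due.extend(self.scheduled.remove(&h).unwrap_or_default());
        }
        due
    }
}
```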

@tomtau tomtau closed this as completed Feb 12, 2020