This repository has been archived by the owner on Jul 27, 2022. It is now read-only.

Problem: liveness tracker doesn't contain information about some nodes in multi-node integration test #1007

Closed
tomtau opened this issue Feb 6, 2020 · 13 comments

@tomtau
Contributor

tomtau commented Feb 6, 2020

From the multi-node integration test log:

[2020-02-06T00:55:32Z INFO  chain_abci::app] received commit request
[2020-02-06T00:55:37Z INFO  chain_abci::app] received beginblock request
[2020-02-06T00:55:37Z WARN  chain_abci::app::app_init::validator_state] Validator in `last_commit_info` not found in liveness tracker
[2020-02-06T00:55:37Z INFO  chain_abci::app] received endblock request
[2020-02-06T00:55:37Z INFO  chain_abci::app] received commit request
[2020-02-06T00:55:39Z INFO  chain_abci::app] received beginblock request
[2020-02-06T00:55:39Z WARN  chain_abci::app::app_init::validator_state] Validator in `last_commit_info` not found in liveness tracker
[2020-02-06T00:55:39Z INFO  chain_abci::app] received endblock request

This warning shouldn't normally be displayed, as it denotes that the validator wasn't added for tracking.

@tomtau tomtau added this to the CRO Sprint 23 milestone Feb 6, 2020
@tomtau tomtau added the bug Something isn't working label Feb 6, 2020
@yihuang
Collaborator

yihuang commented Feb 10, 2020

The problem is this: when a node is detected as having a non-live fault at `begin_block` of block N, then at `end_block` of N the validator is jailed, its `liveness_tracker` entry is immediately removed, and the validator change is returned to tendermint.
But the node will still appear in `last_commit_info` for the next two blocks. The appearance in the first block is understandable, since `last_commit_info` of block N+1 is about the votes on block N, where the validator was still in effect; but it also appears in the second block, and I guess it's because tendermint has some latency in actually removing the validator.
Since this looks like normal behavior, one solution might be to simply ignore it?

@tomtau
Contributor Author

tomtau commented Feb 10, 2020

I guess it's because tendermint has some latency in actually removing the validator.

It's because of the semantics of how validator updates work:
https://docs.tendermint.com/master/spec/abci/abci.html#endblock

Validator updates returned by block H impact blocks H+1, H+2, and H+3, but only effects changes on the validator set of H+2

--
I think the problem is: what if this wasn't expected?
Some solution options:

  • scheduling the removal of the liveness tracking at N+2?
  • providing more details in the log -- e.g. the validator address + whether its metadata is scheduled to be deleted? The expected ones vs. unexpected ones could have different log levels

@yihuang
Collaborator

yihuang commented Feb 11, 2020

I propose adding a field `inactive_block_height` to the validator state, which records the block height at which the node becomes inactive. When the liveness update procedure finds that the `liveness_tracker` entry doesn't exist, it can further check `inactive_block_height`: if the diff is <= 2, then it's ok; otherwise it should log a real problem.
The field is cleared when the node re-joins.
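
Roughly like this (a minimal sketch; `inactive_block_heights`, `check_vote`, and the address type are placeholders, not the actual chain-abci types):

```rust
use std::collections::{BTreeMap, BTreeSet};
use log::warn;

type TendermintValidatorAddress = [u8; 20];
type BlockHeight = u64;

struct ValidatorState {
    // simplified stand-in for the real liveness tracker
    liveness_tracker: BTreeSet<TendermintValidatorAddress>,
    // proposed: height at which each node became inactive
    inactive_block_heights: BTreeMap<TendermintValidatorAddress, BlockHeight>,
}

fn check_vote(state: &ValidatorState, addr: &TendermintValidatorAddress, height: BlockHeight) {
    if state.liveness_tracker.contains(addr) {
        return; // normal case: the vote is tracked as usual
    }
    match state.inactive_block_heights.get(addr) {
        // removed within the last two blocks: tendermint still reports the
        // vote, so this is expected and needs no warning
        Some(inactive) if height - *inactive <= 2 => {}
        // no recent removal recorded: this really is a problem
        _ => warn!("validator {:?} missing from liveness tracker at height {}", addr, height),
    }
}
```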

@tomtau
Contributor Author

tomtau commented Feb 11, 2020

could it also be cleared if diff == 2 ?

@yihuang
Collaborator

yihuang commented Feb 11, 2020

could it also be cleared if diff == 2 ?

If it gets cleared when diff == 2, it's dedicated to the `liveness_tracker`; if we keep it, it can serve as an indication of inactiveness and of the exact block/time when the node became inactive, which could be useful in other situations.
There are two choices regarding where to store this state:

  • Store an Option<Timespec> in StakedState.
  • Add a standalone BTreeMap<StakedStateAddress, Timespec>.
    They are cleared either when the node re-joins or when the validator metadata gets cleared.
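
For illustration, the two options would look roughly like this (everything besides `StakedState`, `StakedStateAddress`, and `Timespec` is made up):

```rust
use std::collections::BTreeMap;

type Timespec = i64;
type StakedStateAddress = [u8; 20];

// Option 1: store the inactive time inline in the staked state.
struct StakedState {
    // ...existing fields...
    inactive_time: Option<Timespec>,
}

// Option 2: keep a standalone map alongside the validator state.
struct ValidatorState {
    // ...existing fields...
    inactive_times: BTreeMap<StakedStateAddress, Timespec>,
}
```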

@yihuang
Collaborator

yihuang commented Feb 12, 2020

With the current implementation, the simplest thing to do seems to be to reuse the `after_unbond_delete` scheduling and clear the `liveness_tracker` in `metadata_clean`?

@tomtau
Contributor Author

tomtau commented Feb 12, 2020

With the current implementation, the simplest thing to do seems to be to reuse the `after_unbond_delete` scheduling and clear the `liveness_tracker` in `metadata_clean`?

The only thing I'd worry about with that is that `get_nonlive_validators` currently iterates through all trackers -- the code is the same as the original behaviour from @devashishdxt, it was just moved to one place -- so depending on the parameters, it may be the case that, through this interaction, every node that left (whether punished or unbonded) will be punished for being non-live.

@tomtau
Contributor Author

tomtau commented Feb 12, 2020

Maybe the easiest option is to keep it as it is: before printing the "warn" log, check if the address is in `tendermint_validator_addresses` -- if it is, just print it at a lower level (e.g. `debug!` or `info!`) ... if it is not, print it with `warn!` or `error!`, and include the address in both printouts, so one knows from the logs which address wasn't found?
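
Something along these lines (just a sketch -- the function and map types are assumptions; only `tendermint_validator_addresses` and the log macros come from the discussion above):

```rust
use std::collections::BTreeMap;
use log::{info, warn};

type TendermintValidatorAddress = [u8; 20];
type StakedStateAddress = [u8; 20];

fn report_missing_tracker(
    tendermint_validator_addresses: &BTreeMap<TendermintValidatorAddress, StakedStateAddress>,
    addr: &TendermintValidatorAddress,
) {
    if tendermint_validator_addresses.contains_key(addr) {
        // the address is still known to the chain state,
        // so this is the expected post-removal case
        info!(
            "validator {:?} in `last_commit_info` not found in liveness tracker (recently removed)",
            addr
        );
    } else {
        // unknown address: this should never happen
        warn!(
            "validator {:?} in `last_commit_info` not found in liveness tracker",
            addr
        );
    }
}
```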

@yihuang
Collaborator

yihuang commented Feb 12, 2020

Maybe the easiest option is to keep it as it is: before printing the "warn" log, check if the address is in `tendermint_validator_addresses` -- if it is, just print it at a lower level (e.g. `debug!` or `info!`) ... if it is not, print it with `warn!` or `error!`, and include the address in both printouts, so one knows from the logs which address wasn't found?

Sounds good.

yihuang added a commit to yihuang/chain that referenced this issue Feb 12, 2020
…mation about some nodes in multi-node integration test

Solution:
- Check `tendermint_validator_addresses` when `liveness_tracker` not
found, if cleaned up, log error, otherwise log info.
@devashishdxt
Collaborator

devashishdxt commented Feb 12, 2020

I think we should not lower the log level here.

Explanation:
Assume a situation where a valid validator is not added to the liveness tracker (which would be a bug). If we lower the log level to `info!` or `debug!`, we'll never catch this unless we read all the `info!` or `debug!` logs. On the other hand, we can easily filter the `warn!` and `error!` logs and see if there was an issue. If we see this `warn!` message more than 2-3 times for the same validator, then it should be a bug (seeing this log 2-3 times is ok because of #1007 (comment)).

@tomtau
Contributor Author

tomtau commented Feb 12, 2020

I think we should not lower the log level here.

Explanation:
Assume a situation where a valid validator is not added to the liveness tracker (which would be a bug). If we lower the log level to `info!` or `debug!`, we'll never catch this unless we read all the `info!` or `debug!` logs. On the other hand, we can easily filter the `warn!` and `error!` logs and see if there was an issue. If we see this `warn!` message more than 2-3 times for the same validator, then it should be a bug (seeing this log 2-3 times is ok because of #1007 (comment)).

Add a log when it's added to the tracker?

@devashishdxt
Collaborator

devashishdxt commented Feb 12, 2020

I think we should not lower the log level here.
Explanation:
Assume a situation where a valid validator is not added to the liveness tracker (which would be a bug). If we lower the log level to `info!` or `debug!`, we'll never catch this unless we read all the `info!` or `debug!` logs. On the other hand, we can easily filter the `warn!` and `error!` logs and see if there was an issue. If we see this `warn!` message more than 2-3 times for the same validator, then it should be a bug (seeing this log 2-3 times is ok because of #1007 (comment)).

Add a log when it's added to the tracker?

Yes. Add logs at the entry and exit points. Also print the validator address in the logs.
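
For example, something like this (method names and the tracker shape are placeholders for the real implementation):

```rust
use std::collections::BTreeSet;
use log::info;

type TendermintValidatorAddress = [u8; 20];

#[derive(Default)]
struct LivenessTracker {
    validators: BTreeSet<TendermintValidatorAddress>, // simplified
}

impl LivenessTracker {
    fn add_validator(&mut self, addr: TendermintValidatorAddress) {
        // entry point: log the address when tracking starts
        info!("adding validator {:?} to liveness tracker", addr);
        self.validators.insert(addr);
    }

    fn remove_validator(&mut self, addr: &TendermintValidatorAddress) {
        // exit point: log the address when tracking stops
        info!("removing validator {:?} from liveness tracker", addr);
        self.validators.remove(addr);
    }
}
```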

bors bot added a commit that referenced this issue Feb 12, 2020
1055: Problem (Fix #1007): liveness tracker doesn't contain information about some nodes in multi-node integration test r=tomtau a=yihuang

Solution:
- Check `tendermint_validator_addresses` when `liveness_tracker` not found, if cleaned up, log error otherwise log info.

Co-authored-by: yihuang <huang@crypto.com>
@bors bors bot closed this as completed in 667d823 Feb 12, 2020
@tomtau tomtau reopened this Feb 12, 2020
@yihuang
Collaborator

yihuang commented Feb 12, 2020

Then I think we'd better accurately implement our intentions here: the `liveness_tracker` entry for the vote is allowed to not exist if and only if `block_height - inactive_block_height <= 2`.
Implementation options:

  • Record the `inactive_block_height` (cleared when the node becomes active again?)
  • Add a scheduling queue that removes the `liveness_tracker` entry two blocks later. (clear it when the node re-joins? don't trigger a non-live fault while it is scheduled?)
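
A rough sketch of the second option, assuming a simple height-keyed queue (all names here are illustrative, not the current code):

```rust
use std::collections::BTreeMap;

type TendermintValidatorAddress = [u8; 20];
type BlockHeight = u64;

#[derive(Default)]
struct LivenessCleanupQueue {
    // removal height -> addresses whose tracker entries can be dropped then
    scheduled: BTreeMap<BlockHeight, Vec<TendermintValidatorAddress>>,
}

impl LivenessCleanupQueue {
    /// Schedule the tracker removal two blocks after the validator is removed.
    fn schedule(&mut self, addr: TendermintValidatorAddress, removed_at: BlockHeight) {
        self.scheduled.entry(removed_at + 2).or_default().push(addr);
    }

    /// Called each block; returns the addresses whose trackers can now be dropped.
    fn due(&mut self, current_height: BlockHeight) -> Vec<TendermintValidatorAddress> {
        let heights: Vec<BlockHeight> = self
            .scheduled
            .range(..=current_height)
            .map(|(h, _)| *h)
            .collect();
        let mut due = Vec::new();
        for h in heights {
            due.extend(self.scheduled.remove(&h).unwrap_or_default());
        }
        due
    }
}
```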

@tomtau tomtau closed this as completed Feb 12, 2020