Unrecoverable fork - cause 3 - Lisk 0.2 testnet #82
@karek314: I will give my explanation on why this fork occurs. This type of fork is caused by a mismatch between the expected public key and the received block's generator public key when processing a block. The following is an example of the error message recorded when a node enters this type of fork. Note: in 0.2.1 we improved the logging in this case to give more detailed information, and within this explanation I have added comments to the code.

Error message:

```
{
"level":"error",
"message":"Expected generator: 7ac9d4b708fb19eaa200eb883be56601ddceed96290a3a033114750b7fda9d0b Received generator: fa7bfd3a2dc0ca55b700247aae4694709d6cdfa34c6bfb0237e032d7aae404f0",
"timestamp":"2016-05-07 03:40:44"
}
```

Fork log:

```
{
"level":"info",
"message": "Fork",
"timestamp":"2016-05-07 03:40:44",
"data": {
"delegate":"fa7bfd3a2dc0ca55b700247aae4694709d6cdfa34c6bfb0237e032d7aae404f0",
"block":{
"id":"13609465205005996164",
"timestamp":34054830,
"height":61118,
"previousBlock":
"1280968842307106697"
},
"cause":3
}
}
```

In this example the node has received the generator public key `fa7bfd3a2dc0ca55b700247aae4694709d6cdfa34c6bfb0237e032d7aae404f0`, but expected `7ac9d4b708fb19eaa200eb883be56601ddceed96290a3a033114750b7fda9d0b`. The expected public key is determined like so:

1. Lisk gets a list of all 101 delegates sorted by vote:

```js
private.getKeysSortByVote = function (cb) {
modules.accounts.getAccounts({
isDelegate: 1,
sort: { "vote": -1, "publicKey": 1 },
limit: slots.delegates
}, ["publicKey"], function (err, rows) {
if (err) {
return cb(err);
}
cb(null, rows.map(function (el) {
return el.publicKey
}))
});
}
```

2. Lisk determines an order of delegates based on an initial seed:

```js
Delegates.prototype.generateDelegateList = function (height, cb) {
private.getKeysSortByVote(function (err, truncDelegateList) {
if (err) {
return cb(err);
}
var seedSource = modules.round.calc(height).toString(); // Establish current round as the seed source
var currentSeed = crypto.createHash('sha256').update(seedSource, 'utf8').digest(); // Create first sha256 hash seed from current round
// Iterate over the list of 101 delegates sorted by vote
for (var i = 0, delCount = truncDelegateList.length; i < delCount; i++) {
// Iterate over each byte in current hash (up to 4th byte)
for (var x = 0; x < 4 && i < delCount; i++, x++) {
var newIndex = currentSeed[x] % delCount; // Determine new index
var b = truncDelegateList[newIndex]; // Get current delegate at new index
// Swap delegate stack positions
truncDelegateList[newIndex] = truncDelegateList[i];
truncDelegateList[i] = b;
}
// Generate a new hash seed based on current seed
currentSeed = crypto.createHash('sha256').update(currentSeed).digest();
}
cb(null, truncDelegateList);
});
}
```

3. Lisk determines a slot number according to the block timestamp, and selects the expected public key from the previously generated delegates list:

```js
Delegates.prototype.validateBlockSlot = function (block, cb) {
// Generate delegate list based on current block height (round)
self.generateDelegateList(block.height, function (err, activeDelegates) {
if (err) {
return cb(err);
}
// Get slot number according to given block timestamp
var currentSlot = slots.getSlotNumber(block.timestamp); // block.timestamp / 10 second block time: 34054830 / 10 = 3405483
var delegate_id = activeDelegates[currentSlot % slots.delegates]; // delegate_id = 3405483 % 101 delegates = 66
// Expected: 7ac9d4b708fb19eaa200eb883be56601ddceed96290a3a033114750b7fda9d0b (66)
// Received: fa7bfd3a2dc0ca55b700247aae4694709d6cdfa34c6bfb0237e032d7aae404f0
// Compare public keys
if (delegate_id && block.generatorPublicKey == delegate_id) {
return cb(); // Keys match
} else {
// Keys do not match
library.logger.error("Expected generator: " + delegate_id + " Received generator: " + block.generatorPublicKey);
return cb("Failed to verify slot: " + currentSlot);
}
});
}
```

If the received and expected public keys do not match, the node enters a fork, designated cause 3.

Possible Causes
I hope this is a full enough explanation on why this fork is occurring. I am currently working on a solution, but posting this explanation to open the discussion and improve understanding. |
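To make the mechanism above concrete, here is a minimal, self-contained sketch (illustrative only; the helper names, the round calculation, and the constants are assumptions, not the actual Lisk modules) of how every node is supposed to derive the same expected forger from the vote-sorted delegate list, the current round, and the block timestamp:

```js
// Minimal sketch of the expected-forger derivation, assuming 101 delegates,
// 10 second block slots, and the round number used as the shuffle seed.
var crypto = require('crypto');

var ACTIVE_DELEGATES = 101;
var BLOCK_TIME = 10;

// Deterministic shuffle of the vote-sorted delegate list, seeded by the round,
// mirroring the loop structure of generateDelegateList quoted above.
function shuffleDelegates(round, delegatesSortedByVote) {
  var list = delegatesSortedByVote.slice();
  var seed = crypto.createHash('sha256').update(String(round), 'utf8').digest();
  for (var i = 0, count = list.length; i < count; i++) {
    for (var x = 0; x < 4 && i < count; i++, x++) {
      var newIndex = seed[x] % count;
      var tmp = list[newIndex];
      list[newIndex] = list[i];
      list[i] = tmp;
    }
    seed = crypto.createHash('sha256').update(seed).digest();
  }
  return list;
}

// Expected generator public key for a block: slot number modulo delegate count.
function expectedGenerator(blockHeight, blockTimestamp, delegatesSortedByVote) {
  var round = Math.ceil(blockHeight / ACTIVE_DELEGATES);     // assumed round calculation
  var shuffled = shuffleDelegates(round, delegatesSortedByVote);
  var slot = Math.floor(blockTimestamp / BLOCK_TIME);        // 34054830 / 10 = 3405483
  return shuffled[slot % ACTIVE_DELEGATES];                  // 3405483 % 101 = 66
}
```

Any two nodes that disagree on the vote-sorted input list (for example because rounds were not applied identically, as discussed below) will shuffle different lists, expect different generators for the same slot, and one of them will log the mismatch above and enter a cause 3 fork.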
Upon checking the database snapshot from which everyone is seeding their nodes, I've found a problem. For example, as of writing the current round is:
Rounds are normally applied and "flushed" when they end. However, this is the current state of the db:
Therefore, that's a total of 3 rounds which have not been applied properly. As a consequence this affects delegate votes, which in turn affects the delegate listing sorted by vote mentioned in this issue, causing the number of forks we've been experiencing. Taking one block which occurred in round 4 as an example: in block https://explorer.lisk.io/block/15972574090333109954 at height 376, there are 25 transactions, all of which are still present in Conclusions:
Next steps: I will verify all blocks in the current snapshot, replace the current one at downloads.lisk.io, and then later implement the required checking for the next release. Any snapshots I take in the meantime I will manually check for "unapplied" rounds. |
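For anyone who wants to check their own snapshot for the "unapplied" rounds described above, here is a hedged sketch using pg-promise; the mem_round table is the one mentioned later in this thread, but the query and connection details are assumptions rather than an official Lisk tool:

```js
// Sketch: list rounds still sitting in mem_round other than the current one.
// A non-empty result suggests rounds that ended but were never applied/flushed.
var pgp = require('pg-promise')();
var db = pgp('postgres://lisk:password@localhost:5432/lisk_test'); // assumed connection string

function findUnappliedRounds(currentRound) {
  return db.any('SELECT DISTINCT round FROM mem_round ORDER BY round')
    .then(function (rows) {
      return rows
        .map(function (row) { return Number(row.round); })
        .filter(function (round) { return round !== currentRound; });
    });
}

// Usage (the round number here is a placeholder):
findUnappliedRounds(123).then(function (stale) {
  console.log('Unapplied rounds still in mem_round:', stale);
});
```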
Interestingly, there were many forks when active delegates were changing (like yesterday). Provided that the numerous forks were due to this issue (which we don't know for sure), that would explain why many nodes are miscalculating the delegate votes. As far as I understood, the If a fork 3 happens (or actually whatever error check on |
This would explain why Crypti would be unstable for a while after a major change in voting from large transactions, shuffling delegate ranks, though it eventually resolves itself. The time syncing has also been a chronic issue. |
Other idea: remove the mem_round table and perform the computation in memory for the current round. |
I believe we found definitive proof of fork cause 3 occurring because of delegates shifting. Here is an error message from earlier:
Below we can see two nodes, one healthy, the other not, and on the right we can see the two delegates' public keys which can be compared to the two wallets. It's clearly showing that the delegates swapped places, and if we look at the public keys from above, it's the same two delegates mentioned. I'm not quite sure what can be done to alleviate this problem, but one solution that I am proposing is the following: at the beginning of the round, delegate votes are fixed in place. These votes will be collected among all nodes until the end of the round. On the final block of the round these votes would be tallied and signed onto that block. It may be helpful for the next block afterwards to pause for longer than the normal block time (if that's possible) to allow this final block to propagate through the network. In this way, the shifting issue is resolved and every node is allowed a brief period to catch up and get fully in sync to generate the newest delegate list for the next block. |
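A rough sketch of what this proposal could look like (all names, helpers and the round length are illustrative assumptions, not Lisk code): the delegate list used for slot validation is frozen at the first block of a round, and vote transactions only change the ranking at the round boundary.

```js
// Illustrative sketch: freeze the ranking for the round, tally votes at its end.
var ROUND_LENGTH = 101; // assumed blocks per round

var frozenDelegateList = null; // list used for slot validation during this round
var pendingVotes = [];         // vote transactions collected during the round

function onBlockApplied(block, currentDelegateListByVote, isVote, tallyAndApplyVotes) {
  // isVote and tallyAndApplyVotes are hypothetical helpers supplied by the caller.
  if ((block.height - 1) % ROUND_LENGTH === 0) {
    // First block of the round: snapshot the ranking once, so mid-round votes
    // cannot reshuffle forging slots on some nodes but not others.
    frozenDelegateList = currentDelegateListByVote.slice();
  }
  pendingVotes = pendingVotes.concat(block.transactions.filter(isVote));

  if (block.height % ROUND_LENGTH === 0) {
    // Final block of the round: tally the collected votes so the ranking only
    // changes at the round boundary, which every node observes at the same height.
    tallyAndApplyVotes(pendingVotes);
    pendingVotes = [];
  }
}
```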
I agree, the nodes don't have the same information about the votes, above all if votes occur in the last blocks at the end of each round. So they generate an erroneous list of active delegates if the order changes at the end. Either we prevent this fork from occurring, or we make it recoverable. I expect the solution of @Isabello to be heavy to implement because it introduces a new type of block into the blockchain. However, waiting 60s to start the next round (i.e. there will be 60s between two cross-round blocks) could be enough to start the mainnet and prevent "enough" of fork cause 3. We can then work on a better solution later. |
I'll admit to not understanding every aspect of the code, but the solution mentioned sounds good to me. Was just talking to rabid about it. I could be missing something, but how much overhead can there be with one more table with 101 rows and 3 columns: rank, publicKey, and forging slot (since this is randomized)? Edit: GreXX's suggestion of just another column in the current table that contains the previous round's rank as of that round's last block seems more efficient than a whole new table. Edit 2: Looking at the code, I wonder if it would be easier to wait to apply votes until the last or first block in a round instead of recording the previous round's ranks. |
I'm thinking there should be 2 blockchains. One is the complete history: transactions of a complete forging round, voting, and delegate rank in a block, with a block cycle of once every forging round; the other holds the blocks forged by each delegate, going back a number of forging rounds for redundancy. The permanent blockchain is verified in the background by the delegates as a separate task from forging to the short-term temporary chain. Voting and changes of delegate ranking seem to cause network instability, and snapshots are coming out invalid. Having a permanent blockchain with a long block time allows better validation and prevents partial, truncated, invalid data from corrupting the main chain. A working short-term chain going back several cycles for forging gives some redundancy and can be processed faster. |
namecoin and blockstack both use a 2 step process for registering names on a blockchain. The first step is a "pre-order" that records the users request on the blockchain but does not officially register it. After 6 blocks (btc/nmc), the user must send a registration transaction to complete. If the pre-order is not followed by the registration within Y blocks, the pre-order expires and is discarded. Similarly, a registration without a prior pre-order is ignored as invalid. Perhaps a similar system for casting votes would help?
Both the "record" and "cast" transactions must be present within the correct time spacing and sequence to be valid |
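As a rough illustration of how such a two-step vote could be validated (the parameters and names are assumptions, not an existing Lisk rule), a "cast" would only count if a matching "record" landed in an earlier block within an allowed window:

```js
// Sketch of the suggested record/cast validation rule.
var MIN_GAP = 6;   // assumed: cast must come at least 6 blocks after the record
var MAX_GAP = 101; // assumed: records older than this window have expired

// recordsByVoter maps a voter address to the height at which its "record"
// transaction was included.
function isCastValid(castTx, castHeight, recordsByVoter) {
  var recordHeight = recordsByVoter[castTx.senderAddress];
  if (recordHeight === undefined) {
    return false; // a cast without a prior record is ignored as invalid
  }
  var gap = castHeight - recordHeight;
  return gap >= MIN_GAP && gap <= MAX_GAP; // wrong spacing or sequence is rejected
}
```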
There is something similar built into Lisk, but that you don't see:
What can happen, though it is very unlikely, is that 2 usernames get registered in different areas of the network. After a few blocks, while the user got a success registering his name, it will disappear (and his fees be returned) because there was no consensus. |
We had a deep discussion recently. What came out:
|
Do nodes do any type of P2P timestamp comparison with each other, like in Bitcoin? https://en.bitcoin.it/wiki/Block_timestamp |
- Adding checks for missed/unapplied rounds.
- Utilizing pg-promise tasks.
- Wrapping memory table updates within db transaction.
- Adding reversible RoundPromiser constructor.
- Keeping track of round ticking state.
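A hedged sketch of what "wrapping memory table updates within a db transaction" via pg-promise can look like; the SQL and the table/column names are assumptions for illustration, not the exact queries used by Lisk:

```js
// Apply and flush one round atomically: either every memory-table update for the
// round commits, or none do, so a crash can no longer leave a round half-applied.
var pgp = require('pg-promise')();
var db = pgp('postgres://lisk:password@localhost:5432/lisk_test'); // assumed connection string

function applyAndFlushRound(round) {
  return db.tx(function (t) {
    return t.any('SELECT delegate, SUM(amount) AS diff FROM mem_round WHERE round = $1 GROUP BY delegate', [round])
      .then(function (changes) {
        var queries = changes.map(function (change) {
          // Fold each delegate's accumulated vote change into the accounts table.
          return t.none('UPDATE mem_accounts SET vote = vote + $1 WHERE ENCODE("publicKey", \'hex\') = $2', [change.diff, change.delegate]);
        });
        // Flush the round's rows inside the same transaction.
        queries.push(t.none('DELETE FROM mem_round WHERE round = $1', [round]));
        return t.batch(queries);
      });
  });
}
```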
Stale being: no blocks received within 30 seconds.
Stale being: no blocks received within 15 seconds.
As the issue has been closed a couple of times, I assume that some of these fixes have been applied in 0.2.3, but the issue still occurs. I'm aware it's in progress now, but maybe this will be helpful.
|
- Disabling when entering into a fork.
- Disabling when blocks not received on time.
- Enabling when block received on time.
When block has not been received yet.
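A minimal sketch of the broadcast gating described in the notes above (thresholds and names are assumptions): broadcasting is switched off on a fork or when blocks stop arriving, and switched back on once blocks are received on time.

```js
var STALE_AFTER_MS = 30 * 1000; // assumed "stale" threshold, matching the 30 second note above

var broadcastEnabled = true;
var lastBlockReceivedAt = Date.now();

function onBlockReceived() {
  lastBlockReceivedAt = Date.now();
  broadcastEnabled = true; // enabling when a block is received on time
}

function onFork() {
  broadcastEnabled = false; // disabling when entering into a fork
}

function shouldBroadcast() {
  if (Date.now() - lastBlockReceivedAt > STALE_AFTER_MS) {
    broadcastEnabled = false; // disabling when blocks are not received on time
  }
  return broadcastEnabled;
}
```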
Fixes unrecoverable fork due to first round not being applied.
As far as I heard it's fixed, but I have just reproduced it with 0.3.0.
|
Here I have reproduced it twice, because of low internet connectivity. It sounds like lagging could trigger fork 3 now. |
Yes, I agree. I couldn't reproduce it on a machine with stable connectivity, but one with a poor internet connection enters forks very often. How about implementing a mechanism to remove the last block and its data when the node can't move on with the blockchain? |
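A sketch of that recovery idea (deleteLastBlock and loadBlocksFromNetwork are hypothetical callbacks here, not the actual Lisk API): after repeated failures to build on the current last block, drop it and resync from peers.

```js
var MAX_CONSECUTIVE_FAILURES = 5; // assumed threshold
var consecutiveFailures = 0;

function onBlockProcessingFailed(deleteLastBlock, loadBlocksFromNetwork) {
  consecutiveFailures += 1;
  if (consecutiveFailures < MAX_CONSECUTIVE_FAILURES) {
    return; // treat as transient and keep waiting for a valid block
  }
  // Stuck: remove the possibly-bad last block and its data, then resync.
  deleteLastBlock(function (err) {
    consecutiveFailures = 0;
    if (!err) {
      loadBlocksFromNetwork();
    }
  });
}

function onBlockProcessed() {
  consecutiveFailures = 0; // healthy again
}
```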
It is already done, but this case is much more complicated and involves a
|
@karmacoma @MaxKK can we close this bug? There is some internal documentation to improve the reliability of the network, preventing such forks. Maybe we could open up some discussion somewhere in the wiki instead, because it is difficult to keep track of this kind of issue in GitHub issues. |
@MaxKK Agree, we will close here and focus on improving network reliability on a case-by-case basis. For example: #213. We should also improve our fork documentation here: https://github.com/LiskHQ/lisk-docs/blob/development/Troubleshooting.md or in a wiki. |
As far as I know, DPOS consensus is not BFT, so the solution is improving the consensus layer. In bad network conditions, DPOS consensus will cause many nodes to see a different new block (same height, different hash) being produced by different delegate nodes. Sometimes, bad network conditions can even cause the new block produced by a delegate to never be confirmed, because the last irreversible block (LIB) needs (n-1)*2/3 confirmations by witness nodes, so in this condition the block will be suspended forever. So, DPOS is not safe and is not BFT; Lisk, BitShares and Steemit have the same problem. |
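For reference, here is a small sketch of the last-irreversible-block rule described in that comment, using its (n-1)*2/3 threshold; the inputs and names are illustrative, not Lisk or BitShares internals.

```js
// LIB sketch: the highest block height that at least (n-1)*2/3 witnesses have built on.
function lastIrreversibleBlock(lastConfirmedHeights) {
  // lastConfirmedHeights: for each active witness, the highest height it has confirmed.
  var heights = lastConfirmedHeights.slice().sort(function (a, b) { return a - b; });
  var n = heights.length;
  var required = Math.floor((n - 1) * 2 / 3); // confirmations needed, per the comment above
  // Highest height confirmed by at least `required` witnesses:
  return heights[Math.min(n - 1, n - required)];
}

// Example: 4 witnesses whose last confirmed heights are [100, 99, 98, 97]
// give required = 2, so height 99 is irreversible while 100 is not yet.
console.log(lastIrreversibleBlock([100, 99, 98, 97])); // 99
```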
It has been fixed with `bash lisk.sh rebuild`, but Lisk couldn't automatically fix the db. I guess the issue happened during connectivity problems; I have reported a similar issue with the previous database engine.