State Recovery management implementation #477
Conversation
Force-pushed from e48a5c3 to 565c115.
Codecov Report
```diff
@@           Coverage Diff            @@
##            main     #477     +/-   ##
=========================================
- Coverage    12.4%    12.3%    -0.1%
=========================================
  Files          34       34
  Lines        8906     9033     +127
=========================================
+ Hits         1108     1111       +3
- Misses       7798     7922     +124
```
Force-pushed from 5da0ca6 to 07878b8.
Force-pushed from 07878b8 to 9217074.
Not a trivial PR to review; the logic related to the multipart messages creates a lot of noise. I tried to ignore those hacky bits, as they are temporary.
I'd like some feedback on my comments, for the sake of understanding.
```diff
@@ -225,6 +223,124 @@ impl Runtime {
        }
    }

pub fn checkpoint_handle_multipart_receive(
    checkpoint_multipart_chunk: request::CheckpointMultipartChunk,
    pending_checkpoint_chunks: &mut HashMap<[u8; 20], HashSet<CheckpointChunk>>,
```
Why HashSet and not Vec?
They might not arrive in order, and by their format there should be no duplicates.
How come they might not arrive in order?
Again, this is part of the multipart messages, so no need to reply; it will soon be gone.
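To illustrate why a HashSet works here, a minimal reassembly sketch (the CheckpointChunk fields and the receive_chunk helper below are hypothetical, not the PR's actual code): chunks accumulate per checksum in any order, exact duplicates are absorbed by the set, and the payload is rebuilt once all parts have arrived.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical chunk shape for illustration; field names are assumptions.
#[derive(Clone, PartialEq, Eq, Hash)]
struct CheckpointChunk {
    msg_index: usize,
    serialized_state_chunk: Vec<u8>,
}

/// Collect chunks keyed by the message checksum until all parts are present,
/// then reassemble the payload by index. A HashSet tolerates out-of-order
/// arrival and drops exact duplicates for free.
fn receive_chunk(
    pending: &mut HashMap<[u8; 20], HashSet<CheckpointChunk>>,
    checksum: [u8; 20],
    msgs_total: usize,
    chunk: CheckpointChunk,
) -> Option<Vec<u8>> {
    let entry = pending.entry(checksum).or_default();
    entry.insert(chunk);
    if entry.len() < msgs_total {
        return None;
    }
    // All parts received: sort by index and concatenate the payloads.
    let mut chunks: Vec<_> = pending.remove(&checksum)?.into_iter().collect();
    chunks.sort_by_key(|c| c.msg_index);
    Some(
        chunks
            .into_iter()
            .flat_map(|c| c.serialized_state_chunk)
            .collect(),
    )
}
```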
src/farcasterd/runtime.rs (outdated)
```rust
Request::RestoreCheckpoint(swap_id) => {
    if self.running_swaps.contains(&swap_id) {
        endpoints.send_to(
            ServiceBus::Ctl,
            ServiceId::Farcasterd,
            source,
            Request::Failure(Failure {
                code: 1,
                info: "Cannot restore an already running swap".to_string(),
            }),
        )?;
        return Ok(());
```
This implies the farcasterd Runtime will always have an up-to-date self.running_swaps.
Yes, this is not ideal. Should I add a TODO for finding a better metric, or do you have a better idea?
Not sure. Maybe send a message to swapd, and if it doesn't error, then swapd is up and running, I guess?
It prints another ESB error if it sends to a non-existing / dead endpoint. I'd like to avoid that.
Is that true even if we handle the error at the level above internet2, i.e. not letting the error propagate down to internet2?
In any case, checking whether a service is alive is possibly something that should end up in internet2.
Yeah, the error is logged at the ESB level first, before we can capture it in the runtime.
I removed the running_swaps check and am instead now checking if walletd is still running by sending a Hello request on the Msg bus.
Good point @zkao. @TheCharlatan and I had a long discussion exploring multiple options for addressing this.
Ultimately, we should have a maintenance daemon that performs aliveness checks against all services. For now, we should just perform this check ad-hoc when restoring a checkpoint, regardless of yet another ESB error.
Checking if walletd is up in b6b1b30
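For readers following the thread, a rough sketch of what such an ad-hoc aliveness check could look like, written as a match arm in the same style as the diff quoted above. The send_to call shape and the Failure reply mirror that diff; ServiceId::Wallet, Request::Hello, and the error handling are assumptions here, not the contents of b6b1b30.

```rust
Request::RestoreCheckpoint(swap_id) => {
    // Probe walletd with a Hello on the Msg bus; if the send fails, walletd
    // is presumed down and the restore is refused with a Failure reply.
    if endpoints
        .send_to(
            ServiceBus::Msg,
            ServiceId::Farcasterd,
            ServiceId::Wallet,
            Request::Hello,
        )
        .is_err()
    {
        endpoints.send_to(
            ServiceBus::Ctl,
            ServiceId::Farcasterd,
            source,
            Request::Failure(Failure {
                code: 1,
                info: "Cannot restore checkpoint, walletd is not running".to_string(),
            }),
        )?;
        return Ok(());
    }
    // ... proceed with restoring the checkpointed swap state ...
}
```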
Force-pushed from 372099b to 91756f6.