
State Recovery management implementation #477

Merged: 11 commits merged into farcaster-project:main from state_recovery2 on Jun 26, 2022

Conversation

TheCharlatan (Member)

No description provided.

TheCharlatan force-pushed the state_recovery2 branch 2 times, most recently from e48a5c3 to 565c115 on June 8, 2022 14:43
codecov-commenter commented Jun 8, 2022

Codecov Report

Merging #477 (b6b1b30) into main (cda13b7) will decrease coverage by 0.1%.
The diff coverage is 1.3%.

@@           Coverage Diff           @@
##            main    #477     +/-   ##
=======================================
- Coverage   12.4%   12.3%   -0.1%     
=======================================
  Files         34      34             
  Lines       8906    9033    +127     
=======================================
+ Hits        1108    1111      +3     
- Misses      7798    7922    +124     
Impacted Files              Coverage Δ
src/cli/command.rs          25.7% <0.0%> (-1.3%) ⬇️
src/databased/runtime.rs    30.3% <0.0%> (-13.4%) ⬇️
src/error.rs                 0.0% <0.0%> (ø)
src/farcasterd/runtime.rs    0.0% <0.0%> (ø)
src/walletd/runtime.rs       0.0% <0.0%> (ø)
src/rpc/request.rs          15.3% <13.3%> (+0.1%) ⬆️
src/cli/opts.rs             45.0% <33.3%> (-0.3%) ⬇️

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cda13b7...b6b1b30.

TheCharlatan force-pushed the state_recovery2 branch 2 times, most recently from 5da0ca6 to 07878b8 on June 10, 2022 23:02
zkao mentioned this pull request Jun 20, 2022
TheCharlatan marked this pull request as ready for review June 21, 2022 19:06
zkao (Member) left a comment:

Not a trivial PR to review; the logic related to the multipart messages creates a lot of noise, and I tried to ignore those hacky bits, as they are temporary.

I'd like to have some feedback on my comments, for the sake of understanding.

doc/staterecovery_sequencediagram.txt (outdated, resolved)
src/cli/command.rs (resolved)
src/databased/runtime.rs (resolved)
@@ -225,6 +223,124 @@ impl Runtime {
    }
}

pub fn checkpoint_handle_multipart_receive(
    checkpoint_multipart_chunk: request::CheckpointMultipartChunk,
    pending_checkpoint_chunks: &mut HashMap<[u8; 20], HashSet<CheckpointChunk>>,
Member:

Why HashSet and not Vec?

Member Author:

They might not arrive in order, and by their format there should be no duplicates.

Member:

How come they might not arrive in order?

zkao (Member), Jun 22, 2022:

Again, this is part of the multipart messages, so no need to reply; it will soon be gone.
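
Side note (not part of the PR): a minimal, self-contained sketch of the chunk-accumulation idea discussed in this thread. The CheckpointChunk fields and the handle_chunk helper below are hypothetical stand-ins for the real checkpoint_handle_multipart_receive; the point is simply that a HashSet tolerates out-of-order arrival and deduplicates identical chunks, and reassembly happens only once all chunks are present.

use std::collections::{HashMap, HashSet};

// Hypothetical chunk of a checkpoint split across multiple messages.
#[derive(PartialEq, Eq, Hash)]
struct CheckpointChunk {
    msg_index: usize,
    msgs_total: usize,
    serialized_state_chunk: Vec<u8>,
}

// Accumulate a chunk keyed by a 20-byte checksum. Returns the reassembled
// serialized checkpoint once all chunks for that checksum have arrived.
fn handle_chunk(
    pending: &mut HashMap<[u8; 20], HashSet<CheckpointChunk>>,
    checksum: [u8; 20],
    chunk: CheckpointChunk,
) -> Option<Vec<u8>> {
    let msgs_total = chunk.msgs_total;
    let entry = pending.entry(checksum).or_default();
    entry.insert(chunk);
    if entry.len() < msgs_total {
        return None;
    }
    // All chunks received: sort by index and concatenate.
    let mut chunks: Vec<CheckpointChunk> = pending.remove(&checksum)?.into_iter().collect();
    chunks.sort_by_key(|c| c.msg_index);
    Some(chunks.into_iter().flat_map(|c| c.serialized_state_chunk).collect())
}

The same properties motivate the choice discussed above: no ordering guarantee on arrival, and no duplicates by construction of the chunk format.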

Comment on lines 898 to 909
Request::RestoreCheckpoint(swap_id) => {
    if self.running_swaps.contains(&swap_id) {
        endpoints.send_to(
            ServiceBus::Ctl,
            ServiceId::Farcasterd,
            source,
            Request::Failure(Failure {
                code: 1,
                info: "Cannot restore an already running swap".to_string(),
            }),
        )?;
        return Ok(());
Member:

This implies the farcasterd Runtime will always have an up-to-date self.running_swaps.

Member Author:

Yes, this is not ideal. Should I add a TODO for finding a better metric, or do you have a better idea?

Member:

Not sure. Maybe send a msg to swapd, and if it doesn't error, then swapd is up and running, I guess?

TheCharlatan (Member Author), Jun 22, 2022:

It prints another ESB error if it sends to a non-existing / dead endpoint. I'd like to avoid that.

Member:

Is that true even if we handle the error at the level above internet2, meaning not letting the error propagate to internet2?

Anyway, possibly checking whether a service is alive is something that should end up in internet2.

Member Author:

Yeah, the error is logged at the ESB level first, before we can capture it in the runtime.

Member Author:

I removed the running_swaps check and am instead now checking if walletd is still running by sending a Hello request on the Msg bus.

Member:

Good point @zkao. @TheCharlatan & I had a long discussion exploring multiple options for addressing this.
Ultimately, we should have a maintenance daemon that performs aliveness checks with all services. For now, we should just perform this check ad hoc when restoring a checkpoint, regardless of yet another ESB error.

TheCharlatan (Member Author), Jun 25, 2022:

Checking if walletd is up in b6b1b30
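
Side note (not part of the PR): a conceptual sketch of the ad-hoc aliveness check settled on in this thread. MsgBus, send_hello, and restore_checkpoint are stand-in names, not the node's actual ESB API; the idea is only that a restore proceeds if a Hello can be delivered to walletd.

// Stand-in abstraction over the message bus: delivering a Hello either
// succeeds or fails, which doubles as a liveness probe for the destination.
trait MsgBus {
    fn send_hello(&self, dest: &str) -> Result<(), String>;
}

// Ad-hoc aliveness check before restoring a checkpoint: bail out early if
// walletd does not appear to be running, instead of relying on a possibly
// stale running_swaps set.
fn restore_checkpoint<B: MsgBus>(bus: &B, swap_id: &str) -> Result<(), String> {
    bus.send_hello("walletd")
        .map_err(|e| format!("cannot restore swap {swap_id}: {e}"))?;
    // ... load the checkpointed state for `swap_id` and hand it back to the
    // respawned services ...
    Ok(())
}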

src/rpc/request.rs (resolved)
src/walletd/runtime.rs (outdated, resolved)
src/rpc/request.rs (resolved)
Lederstrumpf merged commit a2f8071 into farcaster-project:main Jun 26, 2022
4 participants