State Recovery management implementation #477
Conversation
Force-pushed from e48a5c3 to 565c115.
Codecov Report
```diff
@@           Coverage Diff            @@
##            main     #477     +/-   ##
=========================================
- Coverage    12.4%    12.3%    -0.1%
=========================================
  Files          34       34
  Lines        8906     9033     +127
=========================================
+ Hits         1108     1111       +3
- Misses       7798     7922     +124
```
Force-pushed from 5da0ca6 to 07878b8.
Force-pushed from 07878b8 to 9217074.
Not a trivial PR to review; the logic related to the multipart messages creates a lot of noise. I tried to ignore those hacky bits, as they are temporary.
I'd like some feedback on my comments, for the sake of understanding.
```diff
@@ -225,6 +223,124 @@ impl Runtime {
        }
    }

pub fn checkpoint_handle_multipart_receive(
    checkpoint_multipart_chunk: request::CheckpointMultipartChunk,
    pending_checkpoint_chunks: &mut HashMap<[u8; 20], HashSet<CheckpointChunk>>,
```
Why HashSet and not Vec?
They might not arrive in order, and by their format there should be no duplicates.
How come they might not arrive in order?
Again, this is part of the multipart messages, so no need to reply; it will soon be gone.
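To illustrate why a HashSet works here, a minimal reassembly sketch (the CheckpointChunk fields and the receive_chunk helper below are hypothetical, not the PR's actual code): chunks accumulate per checksum in any order, exact duplicates are absorbed by the set, and the payload is rebuilt once all parts have arrived.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical chunk shape for illustration; field names are assumptions.
#[derive(Clone, PartialEq, Eq, Hash)]
struct CheckpointChunk {
    msg_index: usize,
    serialized_state_chunk: Vec<u8>,
}

/// Collect chunks keyed by the message checksum until all parts are present,
/// then reassemble the payload by index. A HashSet tolerates out-of-order
/// arrival and drops exact duplicates for free.
fn receive_chunk(
    pending: &mut HashMap<[u8; 20], HashSet<CheckpointChunk>>,
    checksum: [u8; 20],
    msgs_total: usize,
    chunk: CheckpointChunk,
) -> Option<Vec<u8>> {
    let entry = pending.entry(checksum).or_default();
    entry.insert(chunk);
    if entry.len() < msgs_total {
        return None;
    }
    // All parts received: sort by index and concatenate the payloads.
    let mut chunks: Vec<_> = pending.remove(&checksum)?.into_iter().collect();
    chunks.sort_by_key(|c| c.msg_index);
    Some(
        chunks
            .into_iter()
            .flat_map(|c| c.serialized_state_chunk)
            .collect(),
    )
}
```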
src/farcasterd/runtime.rs (outdated)
```rust
Request::RestoreCheckpoint(swap_id) => {
    if self.running_swaps.contains(&swap_id) {
        endpoints.send_to(
            ServiceBus::Ctl,
            ServiceId::Farcasterd,
            source,
            Request::Failure(Failure {
                code: 1,
                info: "Cannot restore an already running swap".to_string(),
            }),
        )?;
        return Ok(());
```
This implies the farcasterd Runtime will always have an up-to-date self.running_swaps.
Yes, this is not ideal. Should I add a TODO for finding a better metric, or do you have a better idea?
Not sure. Maybe send a message to swapd, and if it doesn't error, then swapd is up and running, I guess?
It prints another ESB error if it sends to a non-existing / dead endpoint. I'd like to avoid that.
Is that true even if we handle the error at the level above internet2, i.e. not letting the error propagate down to internet2?
In any case, checking whether a service is alive is possibly something that should end up in internet2.
Yeah, the error is logged at the ESB level first, before we can capture it in the runtime.
I removed the running_swaps check and am instead now checking if walletd is still running by sending a Hello request on the Msg bus.
Good point @zkao. @TheCharlatan and I had a long discussion exploring multiple options for addressing this.
Ultimately, we should have a maintenance daemon that performs aliveness checks against all services. For now, we should just perform this check ad-hoc when restoring a checkpoint, regardless of yet another ESB error.
Checking if walletd is up in b6b1b30
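For readers following the thread, a rough sketch of what such an ad-hoc aliveness check could look like, written as a match arm in the same style as the diff quoted above. The send_to call shape and the Failure reply mirror that diff; ServiceId::Wallet, Request::Hello, and the error handling are assumptions here, not the contents of b6b1b30.

```rust
Request::RestoreCheckpoint(swap_id) => {
    // Probe walletd with a Hello on the Msg bus; if the send fails, walletd
    // is presumed down and the restore is refused with a Failure reply.
    if endpoints
        .send_to(
            ServiceBus::Msg,
            ServiceId::Farcasterd,
            ServiceId::Wallet,
            Request::Hello,
        )
        .is_err()
    {
        endpoints.send_to(
            ServiceBus::Ctl,
            ServiceId::Farcasterd,
            source,
            Request::Failure(Failure {
                code: 1,
                info: "Cannot restore checkpoint, walletd is not running".to_string(),
            }),
        )?;
        return Ok(());
    }
    // ... proceed with restoring the checkpointed swap state ...
}
```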
Force-pushed from 372099b to 91756f6.