hotfix(el/cl): allow multi clients to start if at least one node is up #2000
Conversation
eth/executionclient/multi_client.go
Outdated
mc.clientsMu[i].Lock()

if mc.clients[i] == nil {
	if err := mc.connect(ctx, i); err != nil {
Might be smart to extract this into a function, since it's not the only place we do it.
We do it in two places:
if mc.clients[i] == nil {
	if err := mc.connect(ctx, i); err != nil {
		mc.logger.Warn("failed to connect to client",
			zap.String("addr", mc.nodeAddrs[i]),
			zap.Error(err))
		mc.clientsMu[i].Unlock()
		return err
	}
}

if client == nil {
	if err := mc.connect(ctx, i); err != nil {
		mc.logger.Warn("failed to connect to client",
			zap.String("addr", mc.nodeAddrs[i]),
			zap.Error(err))
		allErrs = errors.Join(allErrs, err)
		mc.currentClientIndex.Store(int64(nextClientIndex)) // Advance.
		mc.clientsMu[i].Unlock()
		continue
	}
}
So the repetitive parts are the nil check and the log. Do you want me to move them inside connect and rename it to something like connectIfDisconnected?
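For illustration, a minimal sketch of such a helper, reusing the names from the snippets above (the receiver type and the locking contract are assumptions; the caller is assumed to already hold mc.clientsMu[i]):

// connectIfDisconnected connects the i-th client only if it is currently nil,
// logging a warning on failure, so callers don't repeat the nil check and log.
// Assumes the caller already holds mc.clientsMu[i].
func (mc *MultiClient) connectIfDisconnected(ctx context.Context, i int) error {
	if mc.clients[i] != nil {
		return nil // already connected
	}
	if err := mc.connect(ctx, i); err != nil {
		mc.logger.Warn("failed to connect to client",
			zap.String("addr", mc.nodeAddrs[i]),
			zap.Error(err))
		return err
	}
	return nil
}

Callers that need the continue/errors.Join handling (the second snippet) would still do that at the call site, since only they know whether to advance to the next client.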
Spec Alignment failure is supposed to be fixed by #2003
beacon/goclient/goclient.go
Outdated
if err := gc.assertSameGenesis(genesis.Data); err != nil {
	gc.genesisMu.Lock()
	defer gc.genesisMu.Unlock()

	gc.log.Fatal("client returned unexpected genesis",
		zap.String("address", s.Address()),
		zap.Any("client_genesis", genesis.Data),
		zap.Any("expected_genesis", gc.genesis),
		zap.Error(err),
	)
	return // Tests may override Fatal's behavior
}
This might crash the program after it's been running for a while, if the second client isn't active at start but becomes active later. We should instead just stop using this client.
This is much easier to achieve if we check the genesis against a locally stored value (e.g. saving the genesis fork version in networkconfig).
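For illustration, a hedged sketch of that alternative: validating a client's genesis against a locally stored expected value instead of against whichever client connected first (the networkconfig field and the GoClient receiver shown here are assumptions):

// assertExpectedGenesis compares a client's genesis fork version with a value
// stored locally (e.g. in networkconfig), so a mismatching client can simply be
// dropped instead of crashing the node later.
func (gc *GoClient) assertExpectedGenesis(clientGenesis *apiv1.Genesis) error {
	expected := gc.network.GenesisForkVersion // assumed networkconfig field
	if clientGenesis.GenesisForkVersion != expected {
		return fmt.Errorf("unexpected genesis fork version: got %x, want %x",
			clientGenesis.GenesisForkVersion, expected)
	}
	return nil
}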
Clients should never have different geneses, because that would mean they use different ETH networks, so IMO we shouldn't continue in this case. If it were just a log, the multi-client would quietly log it and switch to the next client.
…replace connected atomic with regular bool value
Left some suggestions
-	logger.Error("Consensus http client initialization failed",
+	gc.log.Error("Consensus http client initialization failed",
 		zap.String("address", addr),
 		zap.Error(err),
 	)

-	return nil, fmt.Errorf("create http client: %w", err)
+	return fmt.Errorf("create http client: %w", err)
Not really specific to this PR, but we often (here as well) do two "duplicate" things:
- log the error
- return that same error (which is always going to be logged eventually by the caller, resulting in a roughly duplicate log line)
Maybe it would be simpler to just return the error (formatted with fmt.Errorf to provide the necessary context) in places like this. Bringing it up so we can get on the same page (whether we want to keep an eye on things like this or not).
The main difference and issue is that when logging with zap and adding fields, it's easy to search them by the label value. We'd need to squeeze everything into the fmt.Errorf message; it'll still be searchable, but not by label.
I agree that we could think more about our logging approach, but I think this package currently uses logging in a way similar to other packages. If we decide to improve logging (e.g. use custom error types with fields), we need to do it project-wide
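To make the trade-off concrete, a before/after sketch under the assumption that a hypothetical gc.dial helper does the actual connection and that the top-level caller logs the returned error exactly once:

// Before (sketch): log with zap fields and also return the error, which the
// caller typically logs again, producing a near-duplicate log line.
func (gc *GoClient) initClientBefore(addr string) error {
	if err := gc.dial(addr); err != nil { // gc.dial is hypothetical
		gc.log.Error("Consensus http client initialization failed",
			zap.String("address", addr),
			zap.Error(err))
		return fmt.Errorf("create http client: %w", err)
	}
	return nil
}

// After (sketch): return a single wrapped error carrying the context; only the
// top-level caller logs it. The address is still searchable, but as part of the
// message rather than as a structured zap label.
func (gc *GoClient) initClientAfter(addr string) error {
	if err := gc.dial(addr); err != nil { // gc.dial is hypothetical
		return fmt.Errorf("create http client %s: %w", addr, err)
	}
	return nil
}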
 if len(mc.clients) == 1 {
-	return f(mc.clients[0])
+	return f(mc.clients[0]) // no need for mutex because one client is always non-nil
 }
Is that only because, if we are using just 1 client and it doesn't initialize in MultiClient.New, we simply terminate the SSV node? I guess it works that way, but do we need to handle it as "an edge case" - why not just let it be handled by the code below?
Yes, it works that way. IMO, this check doesn't add much code and it's very easy to read. Thinking about how the code below would handle one client is much harder. Generally, we never use the multi-client with only one client except for tests (I think we need to use the previous implementation with only one client because it's been used for a long time without any issues)
I think it does handle the case of 1-client correctly (if it doesn't that would indicate there might be some other issue with that code, plus it's something that's easy to test)
I guess a related discussion would be - #1308 (comment) (where we also "fall back" to older code rather than using the new implementation)
I understand that "fall back to previous code" might be slightly safer to do short-term, but this approach keeps accruing tech-debt over time
In this case, I think it's not tech debt because the ExecutionClient implementation remains. I don't like using the multi-client for just one client because it also changes logging and adds some overhead. Similarly, we didn't use the beacon client's multi-client implementation for just one client. If the first implementation had had MultiClient, I would prefer using a single client implementation for everything. But we decided to keep the original version as minimally modified as possible, so keeping the old implementation for the one-client case looks good to me.
Maybe we should allow MultiClient to be used only if 2+ clients are provided, though.
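A minimal sketch of that last idea (the function name is hypothetical; the constructor would call it before building the MultiClient):

// validateMultiClientAddrs is a sketch of the suggested guard: MultiClient is
// only constructed for 2+ nodes, while a single-node setup keeps using the
// existing ExecutionClient implementation.
func validateMultiClientAddrs(nodeAddrs []string) error {
	if len(nodeAddrs) < 2 {
		return fmt.Errorf("multi client requires at least 2 node addresses, got %d; use ExecutionClient for a single node", len(nodeAddrs))
	}
	return nil
}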
@@ -426,9 +404,6 @@ func (ec *ExecutionClient) streamLogsToChan(ctx context.Context, logs chan<- Blo
 	case <-ec.closed:
 		return fromBlock, ErrClosed
-
-	case <-ec.healthyCh:
-		return fromBlock, ErrUnhealthy
This is not working as expected: I saw it returning ErrUnhealthy when the client was healthy. This channel looks tricky and dangerous to work with (see another comment about closing the channel), so I'm removing it.
FYI @moshe-blox
+1
Just briefly looking at this channel (and the corresponding mutex), it's not entirely clear how it is supposed to be used (what it's for, etc.), so if we aren't gonna remove it entirely we need to at least re-think its purpose/usage.
defer ec.healthyChMu.Unlock()

if err := ec.healthy(ctx); err != nil {
	close(ec.healthyCh)
This triggers closing of an already-closed channel, which panics; I'm removing it as it's dangerous to use.
FYI @moshe-blox
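For context on why this panics and what a guard would look like if such a channel were kept, here is a small illustrative sketch using sync.Once (the PR removes the channel instead, so this is not what the code does):

// unhealthySignal is a sketch of a close-once "unhealthy" signal: closeOnce
// guards against the "close of closed channel" panic described above.
type unhealthySignal struct {
	ch        chan struct{}
	closeOnce sync.Once
}

func newUnhealthySignal() *unhealthySignal {
	return &unhealthySignal{ch: make(chan struct{})}
}

// markUnhealthy closes the channel at most once, so repeated health-check
// failures cannot panic.
func (s *unhealthySignal) markUnhealthy() {
	s.closeOnce.Do(func() { close(s.ch) })
}

// done exposes the channel for use in select statements.
func (s *unhealthySignal) done() <-chan struct{} {
	return s.ch
}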
eth/executionclient/multi_client.go
Outdated
// If forever is false, it tries all clients only once and if no client is available then it returns an error.
// If forever is true, it iterates clients forever.
Well, it seems now this is not entirely true for len(mc.clients) == 1 (which is why I sort of called it tech debt).
Fixed the comment
It kind of begs the question now: why is the behavior for 1 client different (my guess would be because it's just "old behavior we want to preserve")? Maybe worth leaving a comment about it.
eth/executionclient/multi_client.go
Outdated
if forever {
	limit = math.MaxInt32
}
for i := 0; i < limit; i++ {
It seems when "spinning forever" here, under certain circumstances (when clients error immediately):
- we can potentially exhaust math.MaxInt32 (so I'd use 64 bit instead)
- we can generate a lot of log spam, so we'd probably want to introduce some kind of delay (when "spinning forever") after we've done one round over all the clients (after i has increased by len(mc.clients))
This is also why I'd prefer "spinning forever" to be implemented in a separate method - #2000 (comment) - otherwise this code gets more complex than it should be.
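For illustration, a hedged sketch of that suggested shape: a dedicated "forever" method with a short pause after each full unsuccessful round over the clients (the method name, delay value, SingleClientProvider type, and mc.call helper are all assumptions):

// callForever keeps iterating over all clients; within a round it switches to
// the next client immediately, and only after a full unsuccessful round does it
// pause briefly to avoid log spam. The plain for loop also removes the need for
// a math.MaxInt32 counter entirely.
func (mc *MultiClient) callForever(ctx context.Context, f func(client SingleClientProvider) (any, error)) (any, error) {
	const roundDelay = 200 * time.Millisecond // assumed value, tune as needed

	for {
		for i := range mc.clients {
			res, err := mc.call(ctx, i, f) // assumed per-client helper
			if err == nil {
				return res, nil
			}
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(roundDelay):
		}
	}
}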
Yes, it can potentially be exhausted, but I thought that's hardly reachable and we'd need to review the logic if it were, so I thought this would be a bit simpler to understand. I agree about the log spam, but I think we need to switch clients ASAP without any delays.
I rewrote this using another approach.
I agree with spamming logs, but I think we need to switch clients ASAP without any delays.
I was thinking of something like a 100ms-1000ms delay, but maybe for a time-constrained duty (like block production) that's too high of an additional burden...
We can get back to it if the log spamming proves to be an actual issue in prod.
LGTM (with minor comments)
#1964 was merged, but it doesn't allow nodes to be unreachable on start. The idea was that when we have issues with nodes it's always an out-of-sync issue and the node is still reachable, and we want to catch cases where we provide a bad node URL. However, that makes testing more complicated, as we cannot shut down the node and have to make it desynced instead. Also, there may be cases when a node is down due to something like an infra issue, so this PR allows the SSV node to start if at least one node is up.
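For clarity, a minimal sketch of the relaxed startup rule described here, reusing names from the review snippets above (connect, nodeAddrs, logger); the method name and the exact error handling are assumptions:

// connectAtLeastOne tries to connect to every configured node, logs the
// failures, and only returns an error if no node could be reached at all,
// which is the "start if at least one node is up" behavior of this PR.
func (mc *MultiClient) connectAtLeastOne(ctx context.Context) error {
	var connected int
	var allErrs error

	for i := range mc.nodeAddrs {
		if err := mc.connect(ctx, i); err != nil {
			mc.logger.Warn("failed to connect to client",
				zap.String("addr", mc.nodeAddrs[i]),
				zap.Error(err))
			allErrs = errors.Join(allErrs, err)
			continue
		}
		connected++
	}

	if connected == 0 {
		return fmt.Errorf("no execution clients are reachable: %w", allErrs)
	}
	return nil
}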