
hotfix(el/cl): allow multi clients to start if at least one node is up #2000

Merged · 44 commits into stage · Jan 30, 2025

Conversation

@y0sher (Contributor) commented Jan 22, 2025:

#1964 was merged, but it doesn't allow nodes to be unreachable on start. The idea was that when we have issues with nodes, it's always the out-of-sync issue and the node is still reachable, and we want to catch cases when we provide a bad node URL. However, it makes testing more complicated because we cannot shut down the node; we have to make it desynced instead. Also, there may be cases when a node is down due to some issue such as infra issues, so this PR allows the SSV node to start if at least one node is up.
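In outline, the startup behavior described here could look like the following minimal sketch (hypothetical names, not the actual multi_client.go code):

// Sketch only: multiClientSketch stands in for the real MultiClient.
package executionclient

import (
	"context"
	"errors"
	"fmt"
)

type multiClientSketch struct {
	nodeAddrs []string
	connect   func(ctx context.Context, i int) error // stands in for mc.connect
}

func (mc *multiClientSketch) Start(ctx context.Context) error {
	var connected int
	var allErrs error
	for i := range mc.nodeAddrs {
		if err := mc.connect(ctx, i); err != nil {
			// Tolerate a node that is down at startup; remember the error.
			allErrs = errors.Join(allErrs, err)
			continue
		}
		connected++
	}
	if connected == 0 {
		// Fail startup only when every configured node is unreachable.
		return fmt.Errorf("no nodes available: %w", allErrs)
	}
	return nil
}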


Comment on lines 241 to 244
mc.clientsMu[i].Lock()

if mc.clients[i] == nil {
if err := mc.connect(ctx, i); err != nil {
@y0sher (Contributor, Author):
Might be smart to extract this into a function since it's not the only place we do it.

Contributor:
We do it in two places:

			if mc.clients[i] == nil {
				if err := mc.connect(ctx, i); err != nil {
					mc.logger.Warn("failed to connect to client",
						zap.String("addr", mc.nodeAddrs[i]),
						zap.Error(err))

					mc.clientsMu[i].Unlock()
					return err
				}
			}
		if client == nil {
			if err := mc.connect(ctx, i); err != nil {
				mc.logger.Warn("failed to connect to client",
					zap.String("addr", mc.nodeAddrs[i]),
					zap.Error(err))

				allErrs = errors.Join(allErrs, err)
				mc.currentClientIndex.Store(int64(nextClientIndex)) // Advance.
				mc.clientsMu[i].Unlock()
				continue
			}
		}

So the repetitive parts are the nil check and the log. Do you want me to move them inside connect and rename it to something like connectIfDisconnected?
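A minimal sketch of what that connectIfDisconnected helper could look like, based on the two snippets quoted above (assumes the zap import and the MultiClient fields shown there; not the code that was actually merged):

// Sketch of the proposed connectIfDisconnected helper (hypothetical).
// Assumes: import "go.uber.org/zap" and the MultiClient fields quoted above.
func (mc *MultiClient) connectIfDisconnected(ctx context.Context, i int) error {
	// Caller is expected to hold mc.clientsMu[i].
	if mc.clients[i] != nil {
		return nil // already connected, nothing to do
	}
	if err := mc.connect(ctx, i); err != nil {
		mc.logger.Warn("failed to connect to client",
			zap.String("addr", mc.nodeAddrs[i]),
			zap.Error(err))
		return err
	}
	return nil
}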

@nkryuchkov nkryuchkov changed the title fix: allow one client with multi clients. hotfix(el/cl): allow multi clients to start if at least one node is up Jan 23, 2025
@nkryuchkov nkryuchkov requested a review from moshe-blox January 23, 2025 17:44
@nkryuchkov (Contributor):

Spec Alignment failure is supposed to be fixed by #2003

Comment on lines 283 to 292
if err := gc.assertSameGenesis(genesis.Data); err != nil {
gc.genesisMu.Lock()
defer gc.genesisMu.Unlock()

gc.log.Fatal("client returned unexpected genesis",
zap.String("address", s.Address()),
zap.Any("client_genesis", genesis.Data),
zap.Any("expected_genesis", gc.genesis),
zap.Error(err),
)
return // Tests may override Fatal's behavior
}
@y0sher (Contributor, Author):

This might crash the program after it's been running for a while, if the second client isn't active at start but becomes active later. We should instead just stop using this client.
This is much easier to achieve if we just check the genesis against a locally stored value (e.g. saving the genesis fork version in networkconfig).

Contributor:

Clients should never have different geneses, because that would mean they use different ETH networks, so IMO we shouldn't continue in this case. If it were just a log, the multi-client would silently log it and switch to the next client.
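For reference, a minimal sketch of the locally-stored comparison suggested above (the function name and the [4]byte fork-version type are assumptions, not the actual networkconfig API):

// Sketch only: check a client's genesis fork version against a locally
// configured expected value instead of against another client.
func assertExpectedGenesis(got, want [4]byte) error {
	if got != want {
		return fmt.Errorf("unexpected genesis fork version: got %x, want %x", got, want)
	}
	return nil
}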

…replace connected atomic with regular bool value
@iurii-ssv (Contributor) left a comment:

Left some suggestions

Comment on lines -256 to +245
-	logger.Error("Consensus http client initialization failed",
+	gc.log.Error("Consensus http client initialization failed",
 		zap.String("address", addr),
 		zap.Error(err),
 	)

-	return nil, fmt.Errorf("create http client: %w", err)
+	return fmt.Errorf("create http client: %w", err)
@iurii-ssv (Contributor) commented Jan 26, 2025:

Not really specific to this PR, but we often (here as well) do two "duplicate" things:

  • log the error
  • return that same error (which is almost always logged eventually by the caller, resulting in a roughly duplicate log line)

Maybe it would be simpler to just return the error (formatted with fmt.Errorf to provide the necessary context) in places like this. Bringing it up so we can get on the same page (whether we want to keep an eye on things like this or not).

@y0sher (Contributor, Author):

The main difference and issue is that when logging with zap and adding fields, it's easy to search by the label value. We'd need to squeeze everything into the fmt.Errorf; it would still be searchable, but not by label.

@nkryuchkov (Contributor) commented Jan 27, 2025:

I agree that we could think more about our logging approach, but I think this package currently uses logging in a way similar to other packages. If we decide to improve logging (e.g. use custom error types with fields), we need to do it project-wide.
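For illustration, a hedged sketch contrasting the two styles being discussed (function names are hypothetical; assumes import "fmt" and "go.uber.org/zap"):

// Style A: log with structured zap fields and also return the error; the
// fields are searchable by label, but the caller typically logs it again.
func styleA(log *zap.Logger, addr string, err error) error {
	log.Error("Consensus http client initialization failed",
		zap.String("address", addr),
		zap.Error(err),
	)
	return fmt.Errorf("create http client: %w", err)
}

// Style B: only wrap with context and let a single caller log it once;
// still text-searchable, but not by structured label.
func styleB(addr string, err error) error {
	return fmt.Errorf("create http client (address=%s): %w", addr, err)
}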

beacon/goclient/goclient.go — resolved (outdated)
beacon/goclient/goclient.go — resolved
eth/executionclient/multi_client.go — resolved (outdated)
Comment on lines 363 to 406
 if len(mc.clients) == 1 {
-	return f(mc.clients[0])
+	return f(mc.clients[0]) // no need for mutex because one client is always non-nil
 }
Contributor:

Is that only because, if we are using just 1 client and it doesn't initialize in MultiClient.New, we just terminate the SSV node? I guess it works that way, but do we need to handle it as "an edge case"? Why not just let it get handled by the code below?

Contributor:

Yes, it works that way. IMO, this check doesn't add much code and it's very easy to read; thinking about how the code below would handle one client is much harder. Generally, we never use the multi-client with only one client except for tests. (I think we need to use the previous implementation with only one client because it's been used for a long time without any issues.)

Contributor:

I think it does handle the 1-client case correctly (if it doesn't, that would indicate there might be some other issue with that code; plus, it's something that's easy to test).

I guess a related discussion would be - #1308 (comment) (where we also "fall back" to older code rather than using the new implementation)

I understand that "fall back to previous code" might be slightly safer short-term, but this approach keeps accruing tech debt over time.

Contributor:

In this case, I think it's not tech debt because the ExecutionClient implementation remains. I don't like using the multi-client for just one client because it also changes logging and adds some overhead. Similarly, we didn't use the beacon client's multi-client implementation for just one client. The first implementation didn't have MultiClient; in that case, I would have preferred using just one client for everything. But we decided to keep the original version as unmodified as possible, so keeping the old implementation just for one client looks good to me.

Maybe we need to allow MultiClient to be used only if 2+ clients are provided, though.
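A minimal sketch of that 2+ clients guard (the constructor shape and field names are assumptions, not the actual MultiClient.New signature):

// Sketch only: reject single-node configurations up front, since the plain
// ExecutionClient path would be used for a single node instead.
func New(nodeAddrs []string) (*MultiClient, error) {
	if len(nodeAddrs) < 2 {
		return nil, fmt.Errorf("MultiClient requires at least 2 nodes, got %d", len(nodeAddrs))
	}
	// ... proceed with multi-client initialization ...
	return &MultiClient{nodeAddrs: nodeAddrs}, nil
}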

eth/executionclient/multi_client.go — resolved (outdated)
eth/executionclient/multi_client.go — resolved (outdated)
@y0sher y0sher requested a review from oleg-ssvlabs January 27, 2025 09:43
@@ -426,9 +404,6 @@ func (ec *ExecutionClient) streamLogsToChan(ctx context.Context, logs chan<- Blo
 case <-ec.closed:
 	return fromBlock, ErrClosed

-case <-ec.healthyCh:
-	return fromBlock, ErrUnhealthy
Contributor:

This is not working as expected; I saw it returning ErrUnhealthy when the client was healthy. This channel looks tricky and dangerous to work with (see the other comment about closing the channel), so I'm removing it.
FYI @moshe-blox

@iurii-ssv (Contributor) commented Jan 28, 2025:

+1

Just briefly looking at this channel (and the corresponding mutex), it's not entirely clear how it is supposed to be used (what it's for, etc.),

so if we aren't going to remove it entirely, we need to at least re-think its purpose/usage.

defer ec.healthyChMu.Unlock()

if err := ec.healthy(ctx); err != nil {
close(ec.healthyCh)
Contributor:

This triggers closing of an already-closed channel, which panics. I'm removing it as it's dangerous to use.
FYI @moshe-blox
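For context, a common pattern that avoids the double-close panic if such a channel were kept (a sketch, not the code in this PR; the PR removes the channel instead):

// Sketch only: sync.Once guarantees close runs at most once, even under
// concurrent calls, so repeated health failures cannot panic.
type healthSignal struct {
	ch   chan struct{}
	once sync.Once
}

func (h *healthSignal) markUnhealthy() {
	h.once.Do(func() { close(h.ch) })
}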

eth/executionclient/multi_client.go — resolved (outdated)
Comment on lines 366 to 367
// If forever is false, it tries all clients only once and if no client is available then it returns an error.
// If forever is true, it iterates clients forever.
Contributor:

Well, it seems this is now not entirely true for len(mc.clients) == 1 (which is why I sort of called it tech debt).

Contributor:

Fixed the comment

@iurii-ssv (Contributor) commented Jan 29, 2025:

It kind of begs the question now: why is the behavior for 1 client different? (My guess would be that it's just "old behavior we want to preserve".) Maybe it's worth leaving a comment about it.

if forever {
limit = math.MaxInt32
}
for i := 0; i < limit; i++ {
@iurii-ssv (Contributor) commented Jan 28, 2025:

It seems that when "spinning forever" here, under certain circumstances (when clients error immediately):

  • we can potentially exhaust math.MaxInt32 (so I'd use 64 bits instead)
  • we can generate a lot of log spam

So we'd probably want to introduce some kind of delay (when "spinning forever") after we've done one round over all the clients (after i has increased by len(mc.clients)).

This is also why I'd prefer "spinning forever" to be implemented in a separate method - #2000 (comment) - otherwise this code gets more complex than it should be.

Contributor:

Yes, it can potentially exhaust it, but I thought that's hardly reachable and we'd need to review the logic if it happens, so this seemed a bit simpler to understand. I agree about the log spam, but I think we need to switch clients ASAP without any delays.

I rewrote this using another approach

@iurii-ssv (Contributor) commented Jan 29, 2025:

I agree with spamming logs, but I think we need to switch clients ASAP without any delays.

I was thinking of something like a 100ms-1000ms delay, but maybe for a time-constrained duty (like block production) that's too high an additional burden ...

We can get back to it if log spam proves to be an actual issue in prod.
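A sketch of the per-round delay idea for reference (illustrative only; the PR rewrote the loop differently, and callWithRetries is a hypothetical name):

// Sketch only: retry each client immediately, but after a full failed round
// over all clients, wait briefly instead of spinning, honoring cancellation.
func callWithRetries(ctx context.Context, calls []func() error, roundDelay time.Duration) error {
	for {
		var allErrs error
		for _, call := range calls {
			if err := call(); err != nil {
				allErrs = errors.Join(allErrs, err)
				continue // switch to the next client immediately, no delay
			}
			return nil // a client succeeded
		}
		// Every client failed this round; delay before retrying to limit log spam.
		select {
		case <-ctx.Done():
			return errors.Join(allErrs, ctx.Err())
		case <-time.After(roundDelay):
		}
	}
}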

@iurii-ssv (Contributor) left a comment:

LGTM (with minor comments)

@y0sher y0sher force-pushed the fix/multiclient-oneclient branch from 23d393d to 6b51a6a on January 30, 2025 16:00
@nkryuchkov nkryuchkov force-pushed the fix/multiclient-oneclient branch from 6b51a6a to 85779ec on January 30, 2025 16:06
@nkryuchkov nkryuchkov merged commit 285cd60 into stage Jan 30, 2025
6 of 7 checks passed
@nkryuchkov nkryuchkov deleted the fix/multiclient-oneclient branch January 30, 2025 17:08