hotfix(el/cl): allow multi clients to start if at least one node is up #2000
Conversation
eth/executionclient/multi_client.go
Outdated
mc.clientsMu[i].Lock()

if mc.clients[i] == nil {
	if err := mc.connect(ctx, i); err != nil {
Might be smart to extract this into a function, since it's not the only place we do it.
We do it in two places:
if mc.clients[i] == nil {
	if err := mc.connect(ctx, i); err != nil {
		mc.logger.Warn("failed to connect to client",
			zap.String("addr", mc.nodeAddrs[i]),
			zap.Error(err))
		mc.clientsMu[i].Unlock()
		return err
	}
}

if client == nil {
	if err := mc.connect(ctx, i); err != nil {
		mc.logger.Warn("failed to connect to client",
			zap.String("addr", mc.nodeAddrs[i]),
			zap.Error(err))
		allErrs = errors.Join(allErrs, err)
		mc.currentClientIndex.Store(int64(nextClientIndex)) // Advance.
		mc.clientsMu[i].Unlock()
		continue
	}
}
So the repetitive parts are the nil check and the log. Do you want me to move them inside connect and rename it to something like connectIfDisconnected?
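For illustration, a minimal sketch of such a helper, reusing the names from the snippets above (the receiver type and the locking contract are assumptions; the caller is assumed to already hold mc.clientsMu[i]):

// connectIfDisconnected connects the i-th client only if it is currently nil,
// logging a warning on failure, so callers don't repeat the nil check and log.
// Assumes the caller already holds mc.clientsMu[i].
func (mc *MultiClient) connectIfDisconnected(ctx context.Context, i int) error {
	if mc.clients[i] != nil {
		return nil // already connected
	}
	if err := mc.connect(ctx, i); err != nil {
		mc.logger.Warn("failed to connect to client",
			zap.String("addr", mc.nodeAddrs[i]),
			zap.Error(err))
		return err
	}
	return nil
}

Callers that need the continue/errors.Join handling (the second snippet) would still do that at the call site, since only they know whether to advance to the next client.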
Spec Alignment failure is supposed to be fixed by #2003
beacon/goclient/goclient.go
Outdated
if err := gc.assertSameGenesis(genesis.Data); err != nil {
	gc.genesisMu.Lock()
	defer gc.genesisMu.Unlock()

	gc.log.Fatal("client returned unexpected genesis",
		zap.String("address", s.Address()),
		zap.Any("client_genesis", genesis.Data),
		zap.Any("expected_genesis", gc.genesis),
		zap.Error(err),
	)
	return // Tests may override Fatal's behavior
}
This might crash the program after it's been running for a while, if the second client isn't active at start but becomes active later. We should instead just stop using this client.
This is much easier to achieve if we check the genesis against a locally stored value (e.g. saving the genesis fork version in networkconfig).
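For illustration, a hedged sketch of that alternative: validating a client's genesis against a locally stored expected value instead of against whichever client connected first (the networkconfig field and the GoClient receiver shown here are assumptions):

// assertExpectedGenesis compares a client's genesis fork version with a value
// stored locally (e.g. in networkconfig), so a mismatching client can simply be
// dropped instead of crashing the node later.
func (gc *GoClient) assertExpectedGenesis(clientGenesis *apiv1.Genesis) error {
	expected := gc.network.GenesisForkVersion // assumed networkconfig field
	if clientGenesis.GenesisForkVersion != expected {
		return fmt.Errorf("unexpected genesis fork version: got %x, want %x",
			clientGenesis.GenesisForkVersion, expected)
	}
	return nil
}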
Clients should never have different geneses, because that would mean they use different ETH networks, so IMO we shouldn't continue in this case. If it were just a log, the multi-client would quietly log it and switch to the next client.
…replace connected atomic with regular bool value
Left some suggestions
-	logger.Error("Consensus http client initialization failed",
+	gc.log.Error("Consensus http client initialization failed",
 		zap.String("address", addr),
 		zap.Error(err),
 	)

-	return nil, fmt.Errorf("create http client: %w", err)
+	return fmt.Errorf("create http client: %w", err)
Not really specific to this PR, but we often (here as well) do two "duplicate" things:
- log the error
- return that same error (which is always going to be logged eventually by the caller, resulting in a roughly duplicate log line)
Maybe it would be simpler to just return the error (formatted with fmt.Errorf to provide the necessary context) in places like this. Bringing it up so we can get on the same page (whether we want to keep an eye on things like this or not).
The main difference and issue is that when logging with zap and adding fields, it's easy to search them by the label value. We'd need to squeeze everything into the fmt.Errorf message; it'll still be searchable, but not by label.
I agree that we could think more about our logging approach, but I think this package currently uses logging in a way similar to other packages. If we decide to improve logging (e.g. use custom error types with fields), we need to do it project-wide
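To make the trade-off concrete, a before/after sketch under the assumption that a hypothetical gc.dial helper does the actual connection and that the top-level caller logs the returned error exactly once:

// Before (sketch): log with zap fields and also return the error, which the
// caller typically logs again, producing a near-duplicate log line.
func (gc *GoClient) initClientBefore(addr string) error {
	if err := gc.dial(addr); err != nil { // gc.dial is hypothetical
		gc.log.Error("Consensus http client initialization failed",
			zap.String("address", addr),
			zap.Error(err))
		return fmt.Errorf("create http client: %w", err)
	}
	return nil
}

// After (sketch): return a single wrapped error carrying the context; only the
// top-level caller logs it. The address is still searchable, but as part of the
// message rather than as a structured zap label.
func (gc *GoClient) initClientAfter(addr string) error {
	if err := gc.dial(addr); err != nil { // gc.dial is hypothetical
		return fmt.Errorf("create http client %s: %w", addr, err)
	}
	return nil
}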
 if len(mc.clients) == 1 {
-	return f(mc.clients[0])
+	return f(mc.clients[0]) // no need for mutex because one client is always non-nil
 }
Is that only because, if we are using just 1 client and it doesn't initialize in MultiClient.New, we simply terminate the SSV node? I guess it works that way, but do we need to handle it as "an edge case" - why not just let it be handled by the code below?
Yes, it works that way. IMO, this check doesn't add much code and it's very easy to read. Thinking about how the code below would handle one client is much harder. Generally, we never use the multi-client with only one client except for tests (I think we need to use the previous implementation with only one client because it's been used for a long time without any issues)
I think it does handle the case of 1-client correctly (if it doesn't that would indicate there might be some other issue with that code, plus it's something that's easy to test)
I guess a related discussion would be - #1308 (comment) (where we also "fall back" to older code rather than using the new implementation)
I understand that "fall back to previous code" might be slightly safer to do short-term, but this approach keeps accruing tech-debt over time
In this case, I think it's not tech debt because the ExecutionClient implementation remains. I don't like using the multi-client for just one client because it also changes logging and adds some overhead. Similarly, we didn't use the beacon client's multi-client implementation for just one client. If the first implementation had had MultiClient, I would prefer using a single client implementation for everything. But we decided to keep the original version as minimally modified as possible, so keeping the old implementation for the one-client case looks good to me.
Maybe we should allow MultiClient to be used only if 2+ clients are provided, though.
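A minimal sketch of that last idea (the function name is hypothetical; the constructor would call it before building the MultiClient):

// validateMultiClientAddrs is a sketch of the suggested guard: MultiClient is
// only constructed for 2+ nodes, while a single-node setup keeps using the
// existing ExecutionClient implementation.
func validateMultiClientAddrs(nodeAddrs []string) error {
	if len(nodeAddrs) < 2 {
		return fmt.Errorf("multi client requires at least 2 node addresses, got %d; use ExecutionClient for a single node", len(nodeAddrs))
	}
	return nil
}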
@@ -426,9 +404,6 @@ func (ec *ExecutionClient) streamLogsToChan(ctx context.Context, logs chan<- Blo
 	case <-ec.closed:
 		return fromBlock, ErrClosed
-
-	case <-ec.healthyCh:
-		return fromBlock, ErrUnhealthy
This is not working as expected: I saw it returning ErrUnhealthy when the client was healthy. This channel looks tricky and dangerous to work with (see another comment about closing the channel), so I'm removing it.
FYI @moshe-blox
+1
Just briefly looking at this channel (and the corresponding mutex), it's not entirely clear how it is supposed to be used (what it's for, etc.), so if we aren't gonna remove it entirely we need to at least re-think its purpose/usage.
defer ec.healthyChMu.Unlock()

if err := ec.healthy(ctx); err != nil {
	close(ec.healthyCh)
This triggers closing of an already-closed channel, which panics; I'm removing it as it's dangerous to use.
FYI @moshe-blox
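For context on why this panics and what a guard would look like if such a channel were kept, here is a small illustrative sketch using sync.Once (the PR removes the channel instead, so this is not what the code does):

// unhealthySignal is a sketch of a close-once "unhealthy" signal: closeOnce
// guards against the "close of closed channel" panic described above.
type unhealthySignal struct {
	ch        chan struct{}
	closeOnce sync.Once
}

func newUnhealthySignal() *unhealthySignal {
	return &unhealthySignal{ch: make(chan struct{})}
}

// markUnhealthy closes the channel at most once, so repeated health-check
// failures cannot panic.
func (s *unhealthySignal) markUnhealthy() {
	s.closeOnce.Do(func() { close(s.ch) })
}

// done exposes the channel for use in select statements.
func (s *unhealthySignal) done() <-chan struct{} {
	return s.ch
}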
eth/executionclient/multi_client.go
Outdated
// If forever is false, it tries all clients only once and if no client is available then it returns an error.
// If forever is true, it iterates clients forever.
Well, it seems now this is not entirely true for len(mc.clients) == 1 (which is why I sort of called it tech debt).
Fixed the comment
It kind of begs the question now: why is the behavior for 1 client different (my guess would be because it's just "old behavior we want to preserve")? Maybe worth leaving a comment about it.
eth/executionclient/multi_client.go
Outdated
if forever {
	limit = math.MaxInt32
}
for i := 0; i < limit; i++ {
It seems when "spinning forever" here, under certain circumstances (when clients error immediately):
- we can potentially exhaust math.MaxInt32 (so I'd use 64 bit instead)
- we can generate a lot of log spam, so we'd probably want to introduce some kind of delay (when "spinning forever") after we've done one round over all the clients (after i has increased by len(mc.clients))
This is also why I'd prefer "spinning forever" to be implemented in a separate method - #2000 (comment) - otherwise this code gets more complex than it should be.
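For illustration, a hedged sketch of that suggested shape: a dedicated "forever" method with a short pause after each full unsuccessful round over the clients (the method name, delay value, SingleClientProvider type, and mc.call helper are all assumptions):

// callForever keeps iterating over all clients; within a round it switches to
// the next client immediately, and only after a full unsuccessful round does it
// pause briefly to avoid log spam. The plain for loop also removes the need for
// a math.MaxInt32 counter entirely.
func (mc *MultiClient) callForever(ctx context.Context, f func(client SingleClientProvider) (any, error)) (any, error) {
	const roundDelay = 200 * time.Millisecond // assumed value, tune as needed

	for {
		for i := range mc.clients {
			res, err := mc.call(ctx, i, f) // assumed per-client helper
			if err == nil {
				return res, nil
			}
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(roundDelay):
		}
	}
}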
Yes, it can potentially be exhausted, but I thought that's hardly reachable and we'd need to review the logic if it were, so I thought this would be a bit simpler to understand. I agree about the log spam, but I think we need to switch clients ASAP without any delays.
I rewrote this using another approach.
I agree with spamming logs, but I think we need to switch clients ASAP without any delays.
I was thinking of something like a 100ms-1000ms delay, but maybe for a time-constrained duty (like block production) that's too high of an additional burden...
We can get back to it if the log spamming proves to be an actual issue in prod.
LGTM (with minor comments)
#1964 was merged, but it doesn't allow nodes to be unreachable on start. The idea was that when we have issues with nodes it's always an out-of-sync issue and the node is still reachable, and we want to catch cases where we provide a bad node URL. However, that makes testing more complicated, as we cannot shut down the node and have to make it desynced instead. Also, there may be cases when a node is down due to something like an infra issue, so this PR allows the SSV node to start if at least one node is up.
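For clarity, a minimal sketch of the relaxed startup rule described here, reusing names from the review snippets above (connect, nodeAddrs, logger); the method name and the exact error handling are assumptions:

// connectAtLeastOne tries to connect to every configured node, logs the
// failures, and only returns an error if no node could be reached at all,
// which is the "start if at least one node is up" behavior of this PR.
func (mc *MultiClient) connectAtLeastOne(ctx context.Context) error {
	var connected int
	var allErrs error

	for i := range mc.nodeAddrs {
		if err := mc.connect(ctx, i); err != nil {
			mc.logger.Warn("failed to connect to client",
				zap.String("addr", mc.nodeAddrs[i]),
				zap.Error(err))
			allErrs = errors.Join(allErrs, err)
			continue
		}
		connected++
	}

	if connected == 0 {
		return fmt.Errorf("no execution clients are reachable: %w", allErrs)
	}
	return nil
}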