
config+grpc: allow configuring gRPC keepalive params #7730

Merged
guggero merged 3 commits into lightningnetwork:master from guggero:grpc-keepalive on Jun 10, 2023

Conversation

@guggero (Collaborator) commented May 25, 2023

Fixes #7727.

Adds the following new config options to lnd:

grpc:
      --grpc.server-ping-time=                                               How long the server waits on a gRPC stream with no activity before pinging the
                                                                             client. (default: 1m0s)
      --grpc.server-ping-timeout=                                            How long the server waits for the response from the client for the keepalive ping
                                                                             response. (default: 20s)
      --grpc.client-ping-min-wait=                                           The minimum amount of time the client should wait before sending a keepalive ping.
                                                                             (default: 5s)
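
For reference, here is a minimal sketch of how these options typically map onto the grpc-go keepalive API on the server side. The option names and default values are the ones from this PR; the helper function and its wiring are illustrative only, not lnd's actual server code:

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

// serverKeepaliveOpts is a hypothetical helper showing where the three new
// config values would plug into a gRPC server.
func serverKeepaliveOpts() []grpc.ServerOption {
    return []grpc.ServerOption{
        // grpc.server-ping-time / grpc.server-ping-timeout: ping an idle
        // connection after Time, give up if no ack arrives within Timeout.
        grpc.KeepaliveParams(keepalive.ServerParameters{
            Time:    time.Minute,      // default: 1m0s
            Timeout: 20 * time.Second, // default: 20s
        }),
        // grpc.client-ping-min-wait: clients that ping more often than
        // MinTime are disconnected with ENHANCE_YOUR_CALM ("too_many_pings").
        grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
            MinTime: 5 * time.Second, // default: 5s
        }),
    }
}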

Tested with a node.js client (GRPC_TRACE=all GRPC_VERBOSITY=DEBUG node test.js):

let lightning = new lnrpc.Lightning(
    'localhost:10019', credentials, {'grpc.keepalive_time_ms': 5500},
);

Which showed the client was allowed to ping every 5.5s without the connection being closed:

D 2023-05-25T12:06:45.375Z | resolving_load_balancer | dns:localhost:10019 READY -> READY
D 2023-05-25T12:06:45.375Z | connectivity_state | (1) dns:localhost:10019 READY -> READY
D 2023-05-25T12:06:49.900Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms
D 2023-05-25T12:06:49.900Z | keepalive | (3) 127.0.0.1:10019 Received ping response
D 2023-05-25T12:06:55.400Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms
D 2023-05-25T12:06:55.400Z | keepalive | (3) 127.0.0.1:10019 Received ping response
D 2023-05-25T12:07:00.903Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms
D 2023-05-25T12:07:00.904Z | keepalive | (3) 127.0.0.1:10019 Received ping response
D 2023-05-25T12:07:06.406Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms
D 2023-05-25T12:07:06.406Z | keepalive | (3) 127.0.0.1:10019 Received ping response
D 2023-05-25T12:07:11.907Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms
D 2023-05-25T12:07:11.908Z | keepalive | (3) 127.0.0.1:10019 Received ping response
D 2023-05-25T12:07:17.407Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms
D 2023-05-25T12:07:17.408Z | keepalive | (3) 127.0.0.1:10019 Received ping response
D 2023-05-25T12:07:22.911Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms
D 2023-05-25T12:07:22.912Z | keepalive | (3) 127.0.0.1:10019 Received ping response
D 2023-05-25T12:07:28.415Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms
D 2023-05-25T12:07:28.416Z | keepalive | (3) 127.0.0.1:10019 Received ping response
D 2023-05-25T12:07:33.918Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms
D 2023-05-25T12:07:33.918Z | keepalive | (3) 127.0.0.1:10019 Received ping response
D 2023-05-25T12:07:39.422Z | keepalive | (3) 127.0.0.1:10019 Sending ping with timeout 20000ms

@positiveblue (Collaborator) left a comment

👍

@dstadulis (Collaborator)

Prio'ing an additional reviewer to this PR as it'll be an improvement in multiple venues.

@AndySchroder

I'm thinking more about this and I'm wondering if there should be any comments in the documentation or release notes mentioning that all clients should now consider defining grpc.keepalive_time_ms in order to take advantage of grpc.client-ping-min-wait in this PR. As I discussed in #7727, the client really needs to define what frequency it expects pings to happen at, otherwise it won't realize the connection is broken, because the default is to expect no pings. If we just tell the server to send more frequent keepalives using grpc.server-ping-time, that will still help keep firewalls and the like from forgetting about the TCP socket in their connection tracker, making the connection a bit more reliable, but it's not the complete solution. More frequent keepalives sent by the server will also force it to close out broken/stale sockets, but that won't be seen as a major improvement for most users unless they have an enormous number of gRPC connections on their server.
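
For a Go client, the corresponding knob is keepalive.ClientParameters; a minimal grpc-go sketch of the client-side setting being discussed (the 15-second interval is only an example value chosen to stay above the server's grpc.client-ping-min-wait, not something prescribed by this PR):

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

// clientKeepaliveOpts is a hypothetical helper showing how a Go client opts
// into keepalive pings; without it, the client sends no pings at all.
func clientKeepaliveOpts() []grpc.DialOption {
    return []grpc.DialOption{
        grpc.WithKeepaliveParams(keepalive.ClientParameters{
            // Ping after this much inactivity. Must not be shorter than the
            // server's grpc.client-ping-min-wait, or the server disconnects.
            Time: 15 * time.Second, // example value
            // Consider the connection dead if the ping is not acked in time.
            Timeout: 20 * time.Second,
        }),
    }
}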

@guggero (Collaborator, Author) commented Jun 2, 2023

Good point! I added some additional information to the release notes. And my plan is to add the client side ping to https://github.com/lightninglabs/lndclient as well (with a version detection to not ping too often on older lnd nodes), so it will automatically be added to Loop/Pool/Faraday/Lightning Terminal/Taproot Assets.

sending pings. So clients are encouraged to set `grpc.keepalive_time_ms` when
creating long open streams over a network topology that might silently fail
connections. By default, clients now only need to wait 5 seconds between pings
(was 5 minutes before), so a value of `grpc.keepalive_time_ms=5100` is
Collaborator

do you mean grpc.client-ping-min-wait=5100ms?

Collaborator Author

No, grpc.client-ping-min-wait dictates how often the server allows a ping from the client. So if the client pings too often, the server will disconnect. But to tell the client that the server allows it to ping more often, the grpc.keepalive_time_ms value has to be lowered on the client side so it actually checks the connection more often. Not all clients set this, so I think for normal lndclient/gRPC connections no ping happens at all.

Collaborator

I was confused because it looked like a JS-specific setting to me, but I guess it's used everywhere.

But to tell the client that the server allows it to ping more often, the grpc.keepalive_time_ms value has to be lowered on the client side so it actually checks the connection more often.

Not sure I follow. So in order for the server to tell the client that the server allows more frequent pings, the client needs to lower the grpc.keepalive_time_ms?

Collaborator Author

The thing is, every gRPC library in every programming language has a different way to configure the client ping time.
For Golang it's:

grpc.WithKeepaliveParams(keepalive.ClientParameters{
    Time: defaultClientPingTime,
}),

(see lightninglabs/pool#457).

But for JavaScript it's actually:

let lightning = new lnrpc.Lightning(
    'localhost:10019', credentials, {'grpc.keepalive_time_ms': 5500},
);

and for C++ it's yet another way. So I'm not sure how to best communicate the setting's name. But it looks like grpc.keepalive_time_ms gives the most search results on Google, so it seems to be pretty well understood.

So in order for the server to tell the client that the server allows more frequent pings, the client needs to lower the grpc.keepalive_time_ms?

Not quite. By default, the client does not ping at all and the server only allows pings from the client every 5 minutes (which is a bit slow for our purposes). So with the change in this PR we configure the server to allow the client to ping way more often. BUT to actually initiate the pinging, the client also needs to enable the keepalive.

I'm not sure why, but it looks like in the setup I tested (lnd with a Node.js client), the server didn't actually ping the client (or it was not logged for some reason). Debugging this isn't easy, especially with an encrypted connection. But it looked like the client initiated pings were going through regularly, so that seems to be the most reliable option.


The LND server pings every 2 hours by default (at least before this PR), so you probably didn't wait long enough to see that happen?

Also, even if the LND server pings every 2 hours, the client doesn't know how frequently the server will be pinging, so the client will still not disconnect after 2 hours have gone by with no ping. Instead, the client will disconnect based on the interval at which it is itself configured to send pings to the server. However, I'm not sure what happens if the server sends pings as quickly as the client plans to send them: will the client count the server's ping as its own, or will they both be pinging?


For Python, the option is grpc.keepalive_time_ms.

Collaborator

Cool, thanks for the discussion, guys!

ServerPingTimeout time.Duration `long:"server-ping-timeout" description:"How long the server waits for the response from the client for the keepalive ping response."`

// ClientPingMinWait is the minimum amount of time a client should wait
// before sending a keepalive ping.
Collaborator

Waiting because there's no activity?

Collaborator Author

No, wait between pings to not run into the situation where the server disconnects because the client is too eager with the pings (error code: ENHANCE_YOUR_CALM, debug data: "too_many_pings").

@yyforyongyu (Collaborator) left a comment

There's a failed itest, which is rare; in Bob's log:

2023-06-02 08:18:06.748 [ERR] LTND: Shutting down because error in main method: unable to listen on 127.0.0.1:8570: listen tcp4 127.0.0.1:8570: bind: address already in use
2023-06-02 08:18:06.748 [INF] LTND: Stopping pprof server...
2023-06-02 08:18:06.748 [INF] LTND: Shutdown complete

unable to listen on 127.0.0.1:8570: listen tcp4 127.0.0.1:8570: bind: address already in use

I think it's not related, just wanna bring it up to double confirm.

@guggero (Collaborator, Author) commented Jun 6, 2023

There's a failed itest, which is rare; in Bob's log:

2023-06-02 08:18:06.748 [ERR] LTND: Shutting down because error in main method: unable to listen on 127.0.0.1:8570: listen tcp4 127.0.0.1:8570: bind: address already in use
2023-06-02 08:18:06.748 [INF] LTND: Stopping pprof server...
2023-06-02 08:18:06.748 [INF] LTND: Shutdown complete

unable to listen on 127.0.0.1:8570: listen tcp4 127.0.0.1:8570: bind: address already in use

I think it's not related, just wanna bring it up to double confirm.

Hmm, interesting, I haven't seen that one often before either. But it looks unrelated indeed; probably just something not properly stopped, or ports overlapping for another reason?

As long as the client doesn't change its pinging behavior, the changes in this PR shouldn't have any effect (other than perhaps the server starting to ping by itself, which the client will never object to).

@yyforyongyu (Collaborator) left a comment

LGTM 🙏

As long as the client doesn't change its pinging behavior, the changes in this PR shouldn't have any effect (other than perhaps the server starting to ping by itself, which the client will never object to).

Yeah I think so too.

@guggero (Collaborator, Author) commented Jun 6, 2023

cc @saubyk

sample-lnd.conf Outdated
@@ -1501,3 +1501,21 @@ litecoin.node=ltcd
; are sent over a short period.
; htlcswitch.mailboxdeliverytimeout=60s

[grpc]

; How long the server waits on a gRPC stream with no inactivity before pinging
Collaborator

My understanding of grpc.server-ping-time (GRPC_ARG_KEEPALIVE_TIME_MS) is that it's the amount of time the connection must be idle before a keepalive probe (ping) is sent.

I'm getting confused by the wording "no inactivity" here. Shouldn't it be "no activity" or just "inactivity" instead?


I think you are right, that double negative is likely a mistake.

Collaborator Author

Ah yes, my bad. Fixed.

@@ -49,6 +49,16 @@ package](https://github.com/lightningnetwork/lnd/pull/7356)
a helpful note-to-self containing arbitrary useful information about the
channel.

* [gRPC keepalive parameters can now be set in the
Collaborator

Two suggestions for the release notes:

  • Include all the params which have been added in the config (currently it only refers to the client setting for the keepalive)
  • Mention that the grpc.keepalive_time_ms setting can be named differently depending on the programming language used for client development, and that developers should look up the specific setting to utilize this feature


It is my understanding that languages that link to the C-based version of grpc are all going to be the same. The golang implementation of grpc seems to be the only other variant. Can anyone else confirm this?

Collaborator Author

Added both suggestions, how does the new text look to you, @saubyk?

The golang implementation of grpc seems to be the only other variant. Can anyone else confirm this?

Yes, the golang library unfortunately behaves differently from most others. I've only ever used JS or Python so can't say anything about the other ones.

This change allows users to customize the gRPC keepalive settings. The
server ping values configure when the server sends out a ping by itself
and how quickly the client is expected to respond.
The client ping min wait configures the minimum time a client has to
wait before sending another ping. A connection might be closed by the
server if a client pings more frequently than the allowed min wait time.

We choose lower default values than the gRPC library used before:
 - ServerPingTime: 1 minute instead of 2 hours
 - ClientPingMinWait: 5 seconds instead of 5 minutes

This should by itself already make long-standing gRPC streams more
stable but also allow clients to set a custom client ping time of 5
seconds or more (a value of slightly more than 5000 milliseconds should
be chosen to avoid the server disconnecting due to slight timing skews).
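
As a usage note, operators can also set these values in lnd.conf; a sketch using the flag names introduced above, with the new defaults written out explicitly:

[grpc]
; Ping an idle client after 1 minute of inactivity.
grpc.server-ping-time=1m0s
; Give up on a keepalive ping that is not answered within 20 seconds.
grpc.server-ping-timeout=20s
; Allow clients to send their own keepalive pings as often as every 5 seconds.
grpc.client-ping-min-wait=5s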
@saubyk (Collaborator) left a comment

LGTM 📶

@guggero guggero merged commit 753af11 into lightningnetwork:master Jun 10, 2023
@guggero guggero deleted the grpc-keepalive branch June 10, 2023 11:45
Successfully merging this pull request may close these issues.

[feature]: allow configuring standard grpc server keepalive settings