-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
config+grpc: allow configuring gRPC keepalive params #7730
Conversation
cd511d5
to
471f28d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Prio'ing an additional reviewer to this PR as it'll be an improvement in multiple venues. |
I'm thinking more about this and I'm wondering if there should be any comments in the documentation or release notes that mentions that all clients should now consider defining |
Good point! I added some additional information to the release notes. And my plan is to add the client side ping to https://github.com/lightninglabs/lndclient as well (with a version detection to not ping too often on older |
sending pings. So clients are encouraged to set `grpc.keepalive_time_ms` when | ||
creating long open streams over a network topology that might silently fail | ||
connections. By default, clients now only need to wait 5 seconds between pings | ||
(was 5 minutes before), so a value of `grpc.keepalive_time_ms=5100` is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean grpc.client-ping-min-wait=5100ms
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, grpc.client-ping-min-wait
dictates how often the server allows a ping from the client. So if the client pings too often, the server will disconnect. But to tell the client that the server allows it to ping more often, the grpc.keepalive_time_ms
value has to be lowered on the client side so it actually checks the connection more often. Not all clients set this, so I think for normal lndclient
/gRPC connections no ping happens at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was confused as it looked like a js specific setting to me, but I guess it's used everywhere.
But to tell the client that the server allows it to ping more often, the grpc.keepalive_time_ms value has to be lowered on the client side so it actually checks the connection more often.
Not sure I follow. So in order for the server to tell the client that the server allows more frequent pings, the client needs to lower the grpc.keepalive_time_ms
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is, every gRPC library in every programming language has a different way to identify the client ping time.
For Golang it's:
grpc.WithKeepaliveParams(keepalive.ClientParameters{
Time: defaultClientPingTime,
}),
(see lightninglabs/pool#457).
But for JavaScript it's actually:
let lightning = new lnrpc.Lightning(
'localhost:10019', credentials, {'grpc.keepalive_time_ms': 5500},
);
and for C++ it's yet another way. So not sure how to best communicate the setting's name. But it looks like grpc.keepalive_time_ms
gives the most search results on Google so it seems to be pretty well understood.
So in order for the server to tell the client that the server allows more frequent pings, the client needs to lower the grpc.keepalive_time_ms?
Not quite. By default, the client does not ping at all and the server only allows pings from the client every 5 minutes (which is a bit slow for our purposes). So with the change in this PR we configure the server to allow the client to ping way more often. BUT to actually initiate the pinging, the client also needs to enable the keepalive.
I'm not sure why, but it looks like in the setup I tested (lnd
with a Node.js client), the server didn't actually ping the client (or it was not logged for some reason). Debugging this isn't easy, especially with an encrypted connection. But it looked like the client initiated pings were going through regularly, so that seems to be the most reliable option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LND server pings ever 2 hours by default (at least before this PR), so you probably didn't wait long enough to see that happen?
Also, even if the LND server pings every 2 hours, the client doesn't know how frequent the server will be pinging, so the client will still not disconnect after 2 hours have gone by with no ping. Instead, the client will disconnect after the time has passed that it is configured to send pings to the server. However, I'm not sure if the server sends pings as quickly as the client plans to send them, will the client count the server's ping as it's own, or will they both be pinging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For python, the option is grpc.keepalive_time_ms
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool thanks for the discussion guys!
ServerPingTimeout time.Duration `long:"server-ping-timeout" description:"How long the server waits for the response from the client for the keepalive ping response."` | ||
|
||
// ClientPingMinWait is the minimum amount of time a client should wait | ||
// before sending a keepalive ping. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting because there's no activity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, wait between pings to not run into the situation where the server disconnects because the client is too eager with the pings (error code: ENHANCE_YOUR_CALM, debug data: "too_many_pings"
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an itest failed which is rare, in Bob's log,
2023-06-02 08:18:06.748 [ERR] LTND: Shutting down because error in main method: unable to listen on 127.0.0.1:8570: listen tcp4 127.0.0.1:8570: bind: address already in use
2023-06-02 08:18:06.748 [INF] LTND: Stopping pprof server...
2023-06-02 08:18:06.748 [INF] LTND: Shutdown complete
unable to listen on 127.0.0.1:8570: listen tcp4 127.0.0.1:8570: bind: address already in use
I think it's not related, just wanna bring it up to double confirm.
Hmm, interesting, haven't seen that often before either. But it looks unrelated indeed, probably just something not properly stopped or ports overlapping for another reason? As long as the client doesn't change its pinging behavior, the changes in this PR shouldn't have any effect (other than perhaps the server starting to ping by itself, which the client will never object to). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM🙏
As long as the client doesn't change its pinging behavior, the changes in this PR shouldn't have any effect (other than perhaps the server starting to ping by itself, which the client will never object to).
Yeah I think so too.
cc @saubyk |
sample-lnd.conf
Outdated
@@ -1501,3 +1501,21 @@ litecoin.node=ltcd | |||
; are sent over a short period. | |||
; htlcswitch.mailboxdeliverytimeout=60s | |||
|
|||
[grpc] | |||
|
|||
; How long the server waits on a gRPC stream with no inactivity before pinging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding of grpc.server-ping-time (GRPC_ARG_KEEPALIVE_TIME_MS) is, the amount of time that the connection must be idle before a keepalive probe (ping) is sent.
Getting confused by the wording no inactivity
here. Shouldn't it be no activity
or just inactivity
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are right, that double negative is likely a mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, my bad. Fixed.
@@ -49,6 +49,16 @@ package](https://github.com/lightningnetwork/lnd/pull/7356) | |||
a helpful note-to-self containing arbitrary useful information about the | |||
channel. | |||
|
|||
* [gRPC keepalive parameters can now be set in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two suggestions for the release notes:
- Include all the params which have been added in the config (currently it only refers to the client setting for the keepalive)
- Mention that the grpc.keepalive_time_ms setting can be different depending on the programming language used for client development and developers should lookup the specific setting to utilize this feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is my understanding that languages that link to the C based version of grpc are all going to be the same. The golang implementation of grpc seems to be the only other variant. Can anyone else confirm this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added both suggestions, how does the new text look to you, @saubyk?
The golang implementation of grpc seems to be the only other variant. Can anyone else confirm this?
Yes, the golang library unfortunately behaves differently from most others. I've only ever used JS or Python so can't say anything about the other ones.
This change allows users to customize the gRPC keepalive settings. The server ping values configure when the server sends out a ping by itself and how quickly the client is expected to respond. The client ping min wait configures the minimum time a client has to wait before sending another ping. A connection might be closed by the server if a client pings more frequently than the allowed min wait time. We choose lower default values than the gRPC library used before: - ServerPingTime: 1 minute instead of 2 hours - ClientPingMinWait: 5 seconds instead of 5 minutes This should by itself already make long-standing gRPC streams more stable but also allow clients to set a custom client ping time of 5 seconds or more (a value of slightly more than 5000 milliseconds should be chosen to avoid the server disconnecting due to slight timing skews).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 📶
Fixes #7727.
Adds the following new config options to
lnd
:Tested with a node.js client (
GRPC_TRACE=all GRPC_VERBOSITY=DEBUG node test.js
):Which showed the client was allowed to ping every 5.5s without the connection being closed: