[feature]: allow configuring standard grpc server keepalive settings #7727

Closed · AndySchroder opened this issue May 24, 2023 · 9 comments · Fixed by #7730
Labels: enhancement (Improvements to existing features / behaviour)

Comments

@AndySchroder

Is your feature request related to a problem? Please describe.

SubscribeInvoices hangs randomly.

SubscribeInvoices (and, I believe, other streaming gRPC calls) is perfectly quiet on the network when no new invoice activity is happening. This is great for connection efficiency; however, if the TCP connection is quietly interrupted somewhere between the client and the server, the client sits indefinitely (days on end) without realizing anything went wrong, and at that point no new information can be obtained from the server. This situation can be commonplace in machine-to-machine applications where clients in remote parts of the world connect to the lnd server over unstable, potentially wireless, internet connections.

Operating system level TCP keepalives exist, but grpc does not use them, and their default (and minimum?) timeout of over two hours is far too slow anyway.

grpc has its own keepalive mechanism that operates over HTTP/2, but the default interval is also two hours. Additionally, by default only the server is configured to send these keepalives; the client doesn't care if it never sees any keepalive messages, so it will keep a broken connection in an apparently open but idle state for an indefinite amount of time.

Describe the solution you'd like

I'd like LND to allow the client to send it more frequent keepalive messages over gRPC, and to allow the server to send more frequent keepalives as well. Currently, the lnd grpc server appears to use the grpc defaults (https://grpc.github.io/grpc/cpp/md_doc_keepalive.html).

If all of the standard grpc options (https://grpc.github.io/grpc/cpp/md_doc_keepalive.html) can be exposed as settings in LND, I think that would be a good solution, because LND could then be configured to accept more frequent keepalives instead of simply terminating the connection when the client tries to send them.

For the client side we can already do whatever we want, but as noted above, if the server isn't willing to see that many keepalives, it doesn't work.
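To make the request concrete, here is a minimal sketch of the kind of server-side configuration being asked for, using grpc-go's keepalive package. The specific durations and the PermitWithoutStream choice below are illustrative assumptions, not lnd's actual flags or defaults:

```go
package keepalivesketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer builds a gRPC server whose keepalive behaviour is tuned, roughly
// what this issue asks lnd to expose as settings. All values are placeholders.
func newServer() *grpc.Server {
	return grpc.NewServer(
		// How often the server itself pings idle clients, and how long it
		// waits for the ping ack before considering the connection dead.
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    time.Minute,
			Timeout: 20 * time.Second,
		}),
		// How often the server is willing to *receive* pings from clients.
		// grpc-go's default MinTime of 5 minutes is what makes the server
		// drop connections from clients that ping more aggressively.
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             10 * time.Second,
			PermitWithoutStream: true,
		}),
	)
}
```

Which of these knobs lnd actually exposes, and under which flag names, is determined by #7730 itself.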

Describe alternatives you've considered

Tell the client to send regular, frequent keepalive pings; when the lnd gRPC server decides they are coming in too quickly, it kills the connection and forces the client to reconnect. This works, but it is far from ideal: it causes the entire connection to be rebuilt frequently rather than only when the connection actually breaks and needs to be re-established. This workaround also uses considerably more bandwidth to solve a problem that occurs precisely in situations where bandwidth is already limited.
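For context, a rough sketch of what that workaround looks like from a grpc-go client's point of view (endpoint, credentials, and durations are placeholders); with the server's default enforcement policy, pinging this often eventually gets the connection closed with a GOAWAY ("too_many_pings"):

```go
package keepalivesketch

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/keepalive"
)

// dialAggressively illustrates the workaround: the client pings far more often
// than the server's default enforcement policy allows, so the server will
// periodically drop the connection and the client has to rebuild it.
func dialAggressively(creds credentials.TransportCredentials) *grpc.ClientConn {
	conn, err := grpc.Dial(
		"lnd.example.com:10009", // hypothetical endpoint
		grpc.WithTransportCredentials(creds),
		grpc.WithKeepaliveParams(keepalive.ClientParameters{
			Time:    time.Minute,      // ping after one minute of inactivity
			Timeout: 20 * time.Second, // wait this long for the ping ack
		}),
	)
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	return conn
}
```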

AndySchroder added the enhancement label on May 24, 2023
@guggero (Collaborator) commented May 25, 2023

Good point, I think this makes a lot of sense.
Can you test if #7730 works for your purposes?

@AndySchroder (Author)

Wondering why you haven't just implemented all 6 of the options found in https://grpc.github.io/grpc/cpp/md_doc_keepalive.html (and the lower case versions of the variables can be found here: https://github.com/grpc/grpc/blob/master/include/grpc/impl/grpc_types.h) using the same variable names?

It looks like you've taken the options you've implemented from https://github.com/grpc/grpc-go/blob/master/keepalive/keepalive.go? It's a bit unclear to me what the mapping is between the parameters defined there and in https://grpc.github.io/grpc/cpp/md_doc_keepalive.html. Do you know of any table that exists?

It's likely that what is in keepalive.go is good enough, just trying to understand the motivation.

Also, the variable name ClientPingMinWait seems inconsistent with defaultGrpcClientPingWait.

I like the default values that you've picked for grpc.server-ping-time and grpc.server-ping-timeout. They make a lot of sense to me. However, I'm wondering if jumping from 7200 seconds to 60 seconds for grpc.server-ping-time is too big of a jump considering all the other infrastructure out in the field? Probably not an issue, but I just wanted to bring it up.

@guggero (Collaborator) commented May 25, 2023

> Wondering why you haven't just implemented all 6 of the options found in https://grpc.github.io/grpc/cpp/md_doc_keepalive.html

The Golang library (AFAIK) only supports configuring the first 3 options in https://grpc.github.io/grpc/cpp/md_doc_keepalive.html (through https://github.com/grpc/grpc-go/blob/master/keepalive/keepalive.go).

I omitted the other values in https://github.com/grpc/grpc-go/blob/master/keepalive/keepalive.go because I don't see how it would ever be useful for lnd to disconnect because a client was connected for too long. Allowing anyone to configure that will just make things break in very unexpected ways.

> Also, the variable name ClientPingMinWait seems inconsistent with defaultGrpcClientPingWait.

Thanks, fixed.

> However, I'm wondering if jumping from 7200 seconds to 60 seconds for grpc.server-ping-time is too big of a jump considering all the other infrastructure out in the field?

I also think it should be fine. I honestly don't understand how the initial 7200 seconds are at all useful in any context...

@AndySchroder (Author)

> Wondering why you haven't just implemented all 6 of the options found in https://grpc.github.io/grpc/cpp/md_doc_keepalive.html

> The Golang library (AFAIK) only supports configuring the first 3 options in https://grpc.github.io/grpc/cpp/md_doc_keepalive.html (through https://github.com/grpc/grpc-go/blob/master/keepalive/keepalive.go).

So is the golang version a reimplementation of the C version, or does it link to some C libraries? The documentation page for the C version suggests it is used by the golang version.

> I omitted the other values in https://github.com/grpc/grpc-go/blob/master/keepalive/keepalive.go because I don't see how it would ever be useful for lnd to disconnect because a client was connected for too long. Allowing anyone to configure that will just make things break in very unexpected ways.

Agreed on those connection age parameters.

One thing people might want, though, is PermitWithoutStream. They might want to keep the connection active even when there are no streaming RPCs in use, so that when they are ready to make a new call they know the connection is already alive. That can make the call quicker because you don't have to redo the gRPC connection establishment, or re-establish your entire internet connection.
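A sketch of that idea, assuming both sides opt in: in grpc-go, PermitWithoutStream exists on the client's keepalive.ClientParameters and, for the server's willingness to accept such pings, on keepalive.EnforcementPolicy (the values below are placeholders):

```go
package keepalivesketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// clientKeepalive lets the client ping even when no RPC or stream is
// currently open, keeping the connection verified and ready for the next call.
func clientKeepalive() grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                time.Minute,
		Timeout:             20 * time.Second,
		PermitWithoutStream: true,
	})
}

// serverEnforcement is the matching server-side setting: without it, the
// server counts pings on a stream-less connection as "bad" pings and
// eventually closes the connection.
func serverEnforcement() grpc.ServerOption {
	return grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime:             10 * time.Second,
		PermitWithoutStream: true,
	})
}
```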

Digging deeper, here is what I think the mapping is:

  • GRPC_ARG_KEEPALIVE_TIME_MS -- seems to be your grpc.server-ping-time
  • GRPC_ARG_KEEPALIVE_TIMEOUT_MS -- seems to be your grpc.server-ping-timeout
  • GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA -- doesn't seem to be in keepalive.go; maybe they use the default value of 2, or it is ignored (set to 0) because they say they may deprecate it soon?
  • GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS -- seems to be PermitWithoutStream in keepalive.go with the default value of 0. However, does your example shown in the pull request actually open a call? It doesn't look like it unless you've omitted some things, so I don't know why your example worked.
  • GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS -- seems to be your grpc.client-ping-min-wait
  • GRPC_ARG_HTTP2_MAX_PING_STRIKES -- doesn't seem to be in keepalive.go, maybe they use the default value of 2?

> However, I'm wondering if jumping from 7200 seconds to 60 seconds for grpc.server-ping-time is too big of a jump considering all the other infrastructure out in the field?

> I also think it should be fine. I honestly don't understand how the initial 7200 seconds are at all useful in any context...

Agreed, surprised I'm the first person to report this related to LND.

@AndySchroder (Author)

> Describe alternatives you've considered

> Tell the client to send regular, frequent keepalive pings; when the lnd gRPC server decides they are coming in too quickly, it kills the connection and forces the client to reconnect. This works, but it is far from ideal: it causes the entire connection to be rebuilt frequently rather than only when the connection actually breaks and needs to be re-established. This workaround also uses considerably more bandwidth to solve a problem that occurs precisely in situations where bandwidth is already limited.

After studying the options further and doing more testing, it appears that the default value of GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS, 300,000 ms (5 minutes), is what gets the grpc server upset enough to kill the connection. I had thought that the GRPC_ARG_KEEPALIVE_TIME_MS setting of 2 hours was what it was getting upset over, but that is just the frequency at which the server sends pings, not the frequency at which it is willing to accept them.

If, on the client end, I set grpc.keepalive_time_ms=300001, then it doesn't kill the connection; the client sends pings every 5 minutes. This is "okay" and works in current releases of lnd. Five minutes is a long time for your application to hang, but I think I'm going to opt for this for now over re-connecting every minute. What is in your fix (plus adding PermitWithoutStream as I've suggested above) is much better, and I'll switch to that once the fix makes it into a formal release.
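In grpc-go terms, that interim client setting would look roughly like the sketch below; the point is to keep the client's ping interval just above the server's default enforcement minimum of five minutes (the timeout value is an arbitrary placeholder):

```go
package keepalivesketch

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// interimClientKeepalive mirrors setting grpc.keepalive_time_ms=300001 on a
// C-core client: ping just slightly less often than the server's default
// minimum accepted interval (5 minutes), so a default lnd server tolerates
// the pings instead of dropping the connection.
func interimClientKeepalive() grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:    300001 * time.Millisecond, // just over 5 minutes
		Timeout: 20 * time.Second,          // placeholder ack timeout
	})
}
```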

@guggero (Collaborator) commented May 26, 2023

> So is the golang version a reimplementation of the C version or does it link to some C libraries?

It's a re-implementation: we can't link against C libraries in lnd because we need to support cross-compilation to all sorts of platforms (WASM and mobile, to mention just two) and full binary reproducibility.
So I think the Go implementation is just lagging behind in exposing all of those configuration variables at the API level. AFAIK even https://github.com/grpc/grpc-go/blob/master/keepalive/keepalive.go is fairly new (or we were only recently able to upgrade to that version due to other dependency constraints).

I do think that GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS translates to grpc.client-ping-min-wait. And I looked through the gRPC library code and found the hard-coded GRPC_ARG_HTTP2_MAX_PING_STRIKES here: https://github.com/grpc/grpc-go/blob/v1.41.0/internal/transport/http2_server.go#L788

So if I interpret your responses correctly, you're happy with the PR once I add grpc.client-allow-ping-without-stream (which I'll do in a moment)? If so, cool!

saubyk added this to the v0.18.0 milestone on May 29, 2023
@AndySchroder (Author)

I haven't tested #7730, but as it is now, it "looks" to meet my expectations and I think you've answered all of my questions. You can close this issue once the PR is merged.

saubyk modified the milestones: v0.18.0, v0.17.0 on Jun 5, 2023
@saubyk (Collaborator) commented Jun 7, 2023

> This situation can be commonplace in machine-to-machine applications where clients in remote parts of the world connect to the lnd server over unstable, potentially wireless, internet connections.

Are there any examples of such use cases? I understand the motivation to address the requirement stated in the issue, but I'm failing to see specific use cases like this one, where lnd provides streaming updates to remote clients on unreliable wireless connections.

@AndySchroder (Author)

Distributed Charge is one example of an application that might be on cellular. Other examples are retail fuel pumps, oil and gas pipelines, water meters, candy and soda machines, toll roads, and parking meters. All of these could be on a cellular or wired connection, depending on their location, but in places like Africa it's much more likely to be cellular. Whenever an invoice is issued, you want to subscribe to invoice updates so you can detect payment using the least amount of bandwidth (without needing to actively poll specific invoices). The machine out in the field needs to connect to the LND node and monitor the invoices in order to know when to deliver a physical product or service to the customer.

I filed this issue because I was having reliability problems even on a fiber optic connection; I brought up cellular as an even lower-quality connection that suffers much more severely from the problem.
