-
Notifications
You must be signed in to change notification settings - Fork 538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unable to detect if a watch is active #559
Comments
Thanks, please see the discussion of alternatives to The main problem with switching libraries is that all of this code is generated by the OpenAPI Generator project: We could change generators, but that would also result in significant breaking API changes in the library. I think that the best option is to fork the request generator in OpenAPI Generator ( This is a pretty significant undertaking though, and for now no one has taken it on. If there is interest, I'm happy to advise on what needs to be done. I don't have the cycles for it personally, but I can act as a guide. Regarding hanging watches, I think the best answer is to set the socket timeout pretty low (e.g. 1 minute) which should cause the TCP connection to reset anyway. |
@brendandburns I would be willing to attempt migrating to a new generator. |
@maxweisel Thank you for discovering this! We have been plagued by this and I wasn't sure what the problem was @jhagestedt It seems 1b313ce might be the culprit. Why is the timeout set to 0? (Although I'm using 0.11.1 and I also seem to have the problem) |
I also wonder whether https://github.com/kubernetes-client/javascript/blame/master/src/cache.ts#L86 should maybe be before calling the error handlers so that if the error handlers immediately restart, it might end up in a stopped state because we first started (stopped = false) and then the error handler sets it back to "stopped" |
@brendanburns fwiw, I don't think setting the socket timeout to one minute is an acceptable solution here. Our control plane runs a very large number of pods and when the stream initially connects it will send the entire list of all running pods and their current states. Our application fields requests by spinning up pods to do work. If our watch stream goes down, our service grinds to a halt as it can't confirmation that work for the request has started. For our service to run properly, I need to detect that the stream has gone down and reconnect within 5-10 seconds. I could set the timeout to 5 seconds, but that means we'll be pulling the entire list of pods every 5 seconds which is also too much data to comb through. I believe the correct solution here is to support http2 ping and ping every X seconds on the stream which is what the kubernetes go client is doing. The pings are cheap and the stream won't reset if the connection is active. |
@flyboarder I'd welcome the help. The generator is here: I don't know if it is easier to clone the generator into a new directory and create a whole new generator, or to introduce a flag and just patch around the calls to Please take a look and let me know what you think. Let's continue the discussion on #414 since that is the relevant issue and I don't want to hijack this issue. |
@entropitor that code hasn't been released in any published versions yet, so I don't think it is causing this issue. (we should make it configurable though) |
@maxweisel if you want to re-implement the watch code to use http 2 I'd be glad to take a PR. fwiw, though, the Kubernetes API was not intended to be an event bus. I think you will find that using a true event bus like Kafka is a more resilient architecture, alternately, you could register an admission controller that gets called whenever a Pod is created, which would get you out of the watch business entirely. |
Thanks for the notes. My use case is more complex than I let on and an admission controller is not a reasonable solution for me unfortunately :( I'd be happy to throw an http2 ping PR together. A quick glance at I'm tempted to start with my own RequestInterface object that uses the raw http2 nodejs API, but I don't know if that's something you'd want upstream as it would mean I'm going to experiment a bit and will report back. |
@maxweisel thanks for being willing to take on a PR. I don't mind if we use a totally separate HTTP2 API for watch, however we should definitely make sure that we can reuse all of the authentication and configuration code. e.g. if you wanted to re-implement |
What is the best way to move forward? @brendanburns Do you see a way of fixing this bug this without having to go to HTTP2? |
I found this issue as we're seeing the same problem. Sorry this isn't a full PR fix, but as a slightly fudged workaround, we fixed the issue by wrapping the Watch and Informer in our own class and injecting an HTTP2 request (rather than using the default request implementation) which we can keep open and check the status of the underlying stream. Hope this example code helps someone: ` function KubernetesPodCache(namespace) {
} KubernetesPodCache.prototype.start = function (onError, onAdd, onUpdate, onDelete) { KubernetesPodCache.prototype.onError = function (error) { KubernetesPodCache.prototype.onAdd = function (obj) { KubernetesPodCache.prototype.onUpdate = function (obj) { KubernetesPodCache.prototype.onDelete = function (obj) { |
@geoffpatehutch Have you seen any issues with the code above? I am running into a similar problem where the watch connection closed but the library still thinks its open. @brendandburns Is there any thought to adjusting the implementation simply for watch? |
@brendandburns I took @geoffpatehutch idea here and made a more general purpose drop in replacement for the request implementation for watches, i am goign to test this out the next few days. I realize there has been work in master changing the signature of this so it no longer takes a callback. what i have below matches the latest release 0.12.1 const http2 = require('http2');
const queryString = require('query-string');
class WatchRequest {
webRequest(options, callback) {
const {
ca, headers, method, qs, uri
} = options;
const url = new URL(uri);
const session = http2.connect(url, { ca });
let ping = null;
let error = '';
session.on('error', err => error += err);
session.on('close', () => {
clearInterval(ping);
callback(error);
});
const stream = session.request({
...headers,
':method': method,
':path' : `${url.pathname}?${queryString.stringify(qs)}`,
'accept' : 'application/json'
}, { 'endStream': false });
stream.setEncoding('utf8');
ping = setInterval(() => {
session.ping(error => {
if (error || stream.closed) {
clearInterval(ping);
if (!session.destroyed) {
session.destroy(error || 'stream was closed');
} else {
console.error('session was already destroyed this is unexpected');
}
}
});
}, 2000);
stream.on('error', () => {/* no opt this will allow session 'error' to be emitted instead of throwing an exception */});
stream.on('close', () => {
clearInterval(ping);
session.close();
});
return stream;
}
} usage is just simply const watch = new k8s.Watch(config, new WatchRequest()); |
@brendandburns I was able to reproduce this fairly consistently in an AKS cluster, where the connection would seem to be open but in reality it was not Curious if you experience the same: |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-contributor-experience at kubernetes/community. |
/remove-lifecycle stale |
I got sucked onto another project, but I'm finally getting around to working on a PR for this. @arsnyder16 I'm a big fan of your approach of using vanilla the built-in http2 module. did you happen to have a version of this that works with the new callback-less |
I've started looking into this one, and it appears this bug should already be fixed now potentially. The fix comes from PR #630, which uses tcp keepalive to send an empty packet on a regular basis to check if the connection is alive. The net-keepalive dependency was dropped in a3e6c0a, which is awesome too. I'm a huge fan of this approach too if it means we don't need to use http2 ping frames, however, so far it hasn't been that easy. At the moment, both the net-keepalive version and the updated version do not work for me on mac or linux. Some quick napkin math suggests it can take up to 5 minutes and 30 seconds if there's a default TCP_KEEP_CNT value of 10 and an initial 30 second delay, but even after 10 minutes the connection does not close for me. I'm going to see if I can get tcp keepalive working on a vanilla socket over here, and once I've got that working, I'll backport it to this and we can finally close this issue. Max |
I've done some more testing and can confirm that 0.14.3 works correctly on GKE! My test connects to a watch stream on a machine that I control. I manually start blocking tcp packets via iptables and in versions before 0.14.2 it will stay connected forever, but in 0.14.2 and 0.14.3 it will disconnect after 30 seconds when using Node 12 LTS and Node 14 LTS (I didn't test any others). It's worth noting, it seems tcp keepalive mileage can vary wildly depending on your OS / kernel settings, so folks who are not using this with a linux image provided by a cloud provider may have issues with this (my vanilla Ubuntu Desktop VM does not work for example) I believe we can also close #596 too, but I'll let @jkryl / @brendanburns handle that. |
@maxweisel Given it doesn't work with e.g. Ubuntu, shouldn't we consider this an open issue? |
Ubuntu Server images do work for me, it's purely the Ubuntu Desktop VM that did not work, but it's possible that was also due to my VM's virtual network adapter. |
This will send sends keep-alive probes to the server every 30 seconds. These features were present prior to the 1.0 refactor but were inadvertently removed. Fixes kubernetes-client#2127 Previous relevant issues: - Initial issue: kubernetes-client#559 - PR: kubernetes-client#630 - Improvement: kubernetes-client#632 - PR: kubernetes-client#635
This will send sends keep-alive probes to the server every 30 seconds. These features were present prior to the 1.0 refactor but were inadvertently removed. Fixes kubernetes-client#2127 Previous relevant issues: - Initial issue: kubernetes-client#559 - PR: kubernetes-client#630 - Improvement: kubernetes-client#632 - PR: kubernetes-client#635
This is the same bug that occurred in the Go client library here: kubernetes/kubernetes#65012
Essentially if you create a watch stream and the connection is severed, the stream is unaware and believes it is open. It will stay open forever and just stop reporting events instead of trying to reconnect. This occurs anytime our k8s control plane is updated. Ideally it would detect the broken connection and reconnect to another replica or at least mark the connection closed.
It seems the solution is to occasionally http2 ping the connection to ensure that it's still open, and close if the ping is not received.
I've dug into the library a little and it seems watch.js uses request which is now deprecated. The watch class does allow specifying my own request implementation in the constructor, which I imagine I could use to add the ping feature, but I imagine this is a bug that will impact everyone who uses the watch event, so I figured I'd file an issue.
Happy to contribute a patch that includes this functionality, I'd love some guidance on what the preferred direction should be. Is there a suitable replacement for request that I could implement while I'm working on this patch?
Max
The text was updated successfully, but these errors were encountered: