
Unable to drain endpoints before removal #7218

Open
concaf opened this issue Jun 9, 2019 · 13 comments
Labels
design proposal (Needs design doc/proposal before implementation), help wanted (Needs help!)

Comments

@concaf

concaf commented Jun 9, 2019


Description:
I'm trying to implement endpoint connection draining (draining a single endpoint, not Envoy itself) with sticky sessions configured.

At t=0, there are 5 endpoints configured in my Envoy cluster with cookie-based sticky sessions.
At t=10, I wish to drain 1 of the 5 endpoints, i.e., I do not want any new connections to be routed to that endpoint, but I want the current sessions to persist until I remove the endpoint from the list.

In order to drain an endpoint, I've tried 2 approaches -

  1. Set 'health_status': 'DRAINING' in the endpoint (see https://www.envoyproxy.io/docs/envoy/latest/api-v2/api/v2/core/health_check.proto#envoy-api-enum-core-healthstatus)
    This takes care of not sending any new connections to the endpoint, but existing sticky sessions do not persist - they're broken off.
    I got leads for this from Feature request: a means to refuse subsequent TCP connections while allowing current connections enough time to drain #2920 (comment), but it doesn't seem to work 🤷‍♂️

  2. Set load_balancing_weight: 1 on the endpoint, hoping that a lower weight would stop new requests from being routed to the endpoint while letting the sticky sessions persist, but that didn't happen either.

So, what am I missing here? Any pointers are appreciated! Thanks :)

Config:
This is how the endpoint configuration looks when I'm trying to drain it -

    {
        "endpoint": {
            "address": {
                "socket_address": {
                    "address": "172.17.0.9",
                    "port_value": 8080
                }
            }
        }
    },
    {
        "endpoint": {
            "address": {
                "socket_address": {
                    "address": "172.17.0.14",
                    "port_value": 8080
                }
            }
        },
        "health_status": "DRAINING"
    },
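For reference, the second approach set a minimal weight on the same endpoint - a sketch (reusing the addresses above), with `load_balancing_weight` being the standard LbEndpoint field:

```json
{
    "endpoint": {
        "address": {
            "socket_address": {
                "address": "172.17.0.14",
                "port_value": 8080
            }
        }
    },
    "load_balancing_weight": 1
}
```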
@mattklein123 mattklein123 added the question Questions that are neither investigations, bugs, nor enhancements label Jun 10, 2019
@mattklein123
Member

I think I understand what you are after, but I'm not sure how it can be implemented cleanly. How would Envoy know that a session is OK to keep sending to an endpoint? Do you expect this to be somehow encoded in the cookie?

@concaf
Author

concaf commented Jun 10, 2019

@mattklein123 thanks for the prompt reply ;)

I don't think the cookie needs to carry any information about the upcoming drain; I'm expecting a way to signal Envoy of this drain somewhere in the configuration.

I think the best way I can explain this is by quoting how Nginx drains sticky sessions (more at https://www.nginx.com/blog/nginx-plus-backend-upgrades-individual-servers#server-api-persistence)

Session persistence is required for any application that keeps state information for users (such as a shopping cart), but it complicates upgrades because now it is not enough just to wait until there are no active connections to our server before taking it offline. There might be clients that aren’t sending requests right now but haven’t ended their session with the server. For the best user experience, we want to keep the active sessions open – the amount of time depends on the application – but don’t want any new sessions to start.

In this case, before taking the server offline we don’t only want the number of active connections to be zero, but also for all sessions to end. That translates to the server being idle for some amount of time, which depends on the application. We can periodically check the dashboard or use the status API to determine that the server is idle, but we can also automate the process of marking a server drain and verifying it is idle before marking it down.

So, from what I understand, Envoy could do this in 2 ways -

  1. Allow the application to specify a drain time - during which no new connections are sent but older ones persist - after which the endpoint is removed from the configuration
  2. Allow an endpoint to be marked as draining (which is more like 'unhealthy soon' than 'unhealthy now') - once marked, no new connections are sent but older ones persist.
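To make option 1 concrete, a hypothetical configuration could look like the following - note that `drain_duration` is an invented field name for illustration only; it does not exist in the Envoy API today:

```json
{
    "endpoint": {
        "address": {
            "socket_address": {
                "address": "172.17.0.14",
                "port_value": 8080
            }
        }
    },
    "health_status": "DRAINING",
    "drain_duration": "3600s"
}
```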

More reading

  1. haproxy-ingress implements this by setting the endpoint's weight to 0, which translates to no new connections while older ones continue - the implementation is at Support a "drain" mode for terminating pods and servers failing readiness checks jcmoraisjr/haproxy-ingress#95, which is well explained
  2. An outstanding feature request on Nginx's Kubernetes ingress at Enter drain mode when a pod is terminating and sticky-sessions are enabled nginxinc/kubernetes-ingress#267 - some pointers in the discussion there
  3. Feature request on Ambassador (which I am working on) - Support for draining node during pod termination emissary-ingress/emissary#1473

@concaf
Author

concaf commented Jun 10, 2019

I just tried separating the draining endpoints and healthy endpoints into 2 different clusters, and then pointed the route to both of them via weighted_clusters (draining cluster with weight 0, and healthy cluster with weight 100), but no luck. Sticky sessions do not persist -

    "weighted_clusters": {
        "clusters": [
            {
                "name": "cluster_qotm_er_ring_hash_cookie_sticky_cookie",
                "weight": 100
            },
            {
                "name": "draining_cluster",
                "weight": 0
            }
        ]
    }

@mattklein123
Member

I'm sorry I don't fully understand what you are after. Session stickiness is supported effectively via hashing of an incoming request cookie. How can Envoy understand whether to continue to route existing sessions without some state in the cookie itself?

Do you mean that you want users with existing cookies to keep routing, but no new cookies to be issued that target that upstream?

@concaf
Author

concaf commented Jun 11, 2019

Do you mean that you want users with existing cookies to keep routing, but no new cookies to be issued that target that upstream?

@mattklein123 Yep, exactly 👍

I want to drain an endpoint/upstream and I want to give it 1 hour to drain. In that one hour, I want already existing sticky sessions to keep routing to that very upstream, but I don't want any newer sticky sessions to be issued to that upstream.

@mattklein123
Member

OK I think I understand what you are after. Right now the cookie hashing policies just effectively store a hash value in the cookie and are meant for use with a consistent hashing load balancer. When a host is failed, the rest of the hosts will rehash. There is no actual state stored in the cookie itself.

I think we could support something like this but it would take a bunch of thinking on how to do it. @alyssawilk any thoughts on this one given the existing cookie support?

@alyssawilk
Contributor

I'm not terribly familiar with Envoy upstream selection. I can kind of see how this would work for H2, where you could check connection lifetime vs. drain time if we moved a bunch of APIs around, but I'm less convinced it works well for HTTP/1, given that you may have a preexisting pipelined connection, then prefetch more and end up with inconsistent state.

I think rather than basing this on connection lifetime, it might make sense to do some logic around cookie max-age: if the client connection were in continued use, Envoy could keep extending the max-age, and if the client went idle the cookie would expire. Then it's just a question of having a state where backend selection always uses the hash when the cookie is present, and only selects from non-draining backends when the cookie is not present?
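A minimal sketch of the max-age idea (my own illustration in Python, not Envoy code; the `refresh_cookie` helper and the `session_hash` cookie name are hypothetical): each response re-issues the session cookie with a fresh max-age while the session is in use, so an idle client's cookie eventually expires and its next request counts as a new session.

```python
# Sketch only: a hypothetical response-side helper that keeps extending the
# session cookie's lifetime while the client is active. Once the client goes
# idle for longer than DRAIN_MAX_AGE, the cookie expires on its own and the
# next request is hashed as a brand-new session.
DRAIN_MAX_AGE = 3600  # seconds - e.g. "give the endpoint 1 hour to drain"

def refresh_cookie(request_cookies, new_session_hash):
    """Return the Set-Cookie header values for a response.

    request_cookies: dict of cookies the client sent.
    new_session_hash: hash to issue if the client had no session cookie.
    """
    if "session_hash" in request_cookies:
        # Existing session still in use: re-issue the same hash with a
        # fresh max-age, extending the session's life.
        return [f"session_hash={request_cookies['session_hash']}; Max-Age={DRAIN_MAX_AGE}"]
    # No (valid) cookie: issue a fresh one with the same bounded lifetime.
    return [f"session_hash={new_session_hash}; Max-Age={DRAIN_MAX_AGE}"]
```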

@kflynn
Contributor

kflynn commented Jun 20, 2019

After kicking this around at Datawire for a bit, we think that it would work to add a way to mark a given upstream as "blacklisted for new hashes only":

  • if a request comes in with a valid (non-expired) cookie directing it to the blacklisted upstream, fine, go ahead and route.
  • if a request comes in with no cookie, or with an invalid or expired cookie, never allow the hash to select any blacklisted upstream.
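The two bullets above can be sketched as a selection function (illustrative Python, not Envoy internals; the host list shape and `stable_hash` are my own stand-ins):

```python
# Sketch of the proposed selection rule: a valid cookie always wins, even if
# it points at a blacklisted (draining) upstream; without a cookie, the hash
# is only ever taken over the non-blacklisted upstreams.
import hashlib

def stable_hash(key):
    # Deterministic stand-in for the load balancer's hashing (illustrative).
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def pick_host(hosts, client_key, cookie_hash=None):
    """hosts: list of (name, blacklisted) pairs.

    cookie_hash: the value a previous response stored in the session
    cookie, or None if the request carried no valid/non-expired cookie.
    """
    if cookie_hash is not None:
        # Valid cookie: honor the stored hash over the full host list, so
        # blacklisted upstreams keep receiving their existing sessions.
        return hosts[cookie_hash % len(hosts)][0]
    # No valid cookie: never let the hash select a blacklisted upstream.
    allowed = [name for name, blacklisted in hosts if not blacklisted]
    return allowed[stable_hash(client_key) % len(allowed)]
```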

Does that sound like a reasonable approach?

@mattklein123
Member

Does that sound like a reasonable approach?

Conceptually it would work, but I think the implementation will be complex, as we then need a per-host draining concept (which I guess we have today, but I would need to refresh the details). Worse - and this is where I would really have to think it through - I'm not convinced this works well at all if the upstream hosts are not stable and things rehash. Are we assuming upstream stability? Even if we assume upstream stability, once a host drains and then goes away, approximately 1/N of sessions are going to rehash again anyway, so don't you still need to support migration?
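The "1/N rehash" point can be checked with a toy consistent-hash ring (my own illustration, unrelated to Envoy's actual ring hash implementation): removing one of N hosts remaps roughly 1/N of the keys - exactly the sessions that lived on the removed host - so some migration has to be handled regardless.

```python
# Toy consistent-hash ring: each host is hashed onto the ring at many
# virtual points; a key maps to the first host point at or after the key's
# own hash (wrapping around at the end of the ring).
import bisect
import hashlib

def h(s):
    return int(hashlib.sha256(s.encode()).hexdigest(), 16)

def build_ring(hosts, vnodes=100):
    return sorted((h(f"{host}-{i}"), host) for host in hosts for i in range(vnodes))

def lookup(ring, key):
    idx = bisect.bisect(ring, (h(key),))
    return ring[idx % len(ring)][1]

hosts = [f"host{i}" for i in range(5)]
before = build_ring(hosts)
after = build_ring(hosts[:-1])  # drain one host and finally remove it

keys = [f"session-{i}" for i in range(2000)]
moved = sum(lookup(before, k) != lookup(after, k) for k in keys)
# moved / len(keys) should land near 1/5: only the removed host's keys remap.
```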

@stale

stale bot commented Jul 21, 2019

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Jul 21, 2019
@stale

stale bot commented Jul 28, 2019

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@stale stale bot closed this as completed Jul 28, 2019
@concaf
Author

concaf commented Aug 20, 2019

@mattklein123 @alyssawilk can we mark this as 'help wanted'?

@alyssawilk alyssawilk reopened this Aug 20, 2019
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Aug 20, 2019
@alyssawilk alyssawilk added the help wanted Needs help! label Aug 20, 2019
@mattklein123 mattklein123 added design proposal Needs design doc/proposal before implementation and removed question Questions that are neither investigations, bugs, nor enhancements labels Aug 20, 2019
@Pawan-Bishnoi
Contributor

OK I think I understand what you are after. Right now the cookie hashing policies just effectively store a hash value in the cookie and are meant for use with a consistent hashing load balancer. When a host is failed, the rest of the hosts will rehash. There is no actual state stored in the cookie itself.

I think we could support something like this but it would take a bunch of thinking on how to do it. @alyssawilk any thoughts on this one given the existing cookie support?

@mattklein123 this isn't a concern now with sessionFilter, since podIP is used directly.
