feat(kuma-cp) introduce Health Discovery Service (HDS) #1418
Conversation
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Wow, that is pretty awesome!! What is missing:
Configuration of HC is hardcoded. It should be a dynamic configuration, either with a separate policy
I think there is one critical thing to consider here.
One of the points of introducing this was to be able to introduce a new instance of the service but only use it when it's ready. Correct me if I'm wrong, but right now when `inbound.health` is nil, we treat the inbound as healthy. When you have a dataplane with `inbound.probe`, it will be treated as healthy UNTIL the first probe result arrives from HDS, so we will take this instance into account too soon.
We could:
- consider an inbound as unhealthy when `health` is nil but `probe` is not nil (extra method in dataplane_helpers)
- set `health.ready=false` in DataplaneManager when applying an inbound with a probe
```proto
message Tcp {}

Tcp tcp = 5;
```
are we shipping TCP probes only in this release?
I'm not sure I have enough time to do it, so yes, let's stick to TCP only in this release
```go
}, "10s", "1s").Should(ContainSubstring("ready: true"))
})

})
```
Nice! But here we are only checking that we update the Dataplane definition.
For me the real E2E test would be to:
- spawn web
- spawn backend-1
- start a stream of requests from web to backend
- spawn backend-2
- verify no requests are dropped and eventually requests are load-balanced between backend-1 and backend-2
- take the service of backend-2 down
- verify backend-2 is eventually ejected and all requests are sent to backend-1

Not a blocker for this PR to be merged, but I'd strongly consider extending the test to cover this scenario.
One more thing: multizone Universal. I'm not sure we are doing it right now, but I think we should exclude unhealthy inbounds from the available services in Ingress.
```yaml
# On Kubernetes this feature is disabled for now regardless of the flag value
enabled: true # ENV: KUMA_DP_SERVER_HDS_ENABLED
# Interval for Envoy to send statuses for HealthChecks
interval: 1s # ENV: KUMA_DP_SERVER_HDS_INTERVAL
```
how will this scale? 100 DPs sending their statuses every second?
I guess 15s might be a better default.
```yaml
interval: 1s # ENV: KUMA_DP_SERVER_HDS_CHECK_INTERVAL
# NoTrafficInterval is a special health check interval that is used when a cluster has
# never had traffic routed to it
noTrafficInterval: 1s # ENV: KUMA_DP_SERVER_HDS_CHECK_NO_TRAFFIC_INTERVAL
```
OK I see lots of timeouts and intervals here. Since this is critical I hope we have thorough checks that these are not in conflict with each other. Also, as mentioned above, maybe relax these by a factor of 10.
```go
if h.NoTrafficInterval <= 0 {
	return errors.New("NoTrafficInterval must be greater than 0s")
}
return nil
```
Please make sure these are not conflicting. For example, can the timeout be lower than the check interval?
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

Conflicts:
- app/kumactl/cmd/install/testdata/install-control-plane.cni-enabled.golden.yaml
- app/kumactl/cmd/install/testdata/install-control-plane.defaults.golden.yaml
- app/kumactl/cmd/install/testdata/install-control-plane.global.golden.yaml
- app/kumactl/cmd/install/testdata/install-control-plane.overrides.golden.yaml
- app/kumactl/cmd/install/testdata/install-control-plane.remote.golden.yaml
- app/kumactl/cmd/install/testdata/install-control-plane.with-ingress.golden.yaml
- app/kumactl/pkg/install/k8s/control-plane/helmtemplates_vfsdata.go
- deployments/charts/kuma/templates/cp-deployment.yaml
pkg/dp-server/components.go
```diff
@@ -23,7 +24,7 @@ func SetupServer(rt runtime.Runtime) error {
 	if err := bootstrap.RegisterBootstrap(rt, dpServer.httpMux); err != nil {
 		return err
 	}
-	if rt.Config().Environment == core.UniversalEnvironment && rt.Config().DpServer.Hds.Enabled {
+	if rt.Config().Mode != core.Global && rt.Config().DpServer.Hds.Enabled {
```
nit: DP server is not run on Global at all. No reason to do the check here
* feat(kuma-cp) initial hds support
* feat(kuma-cp) get rid of old tests
* feat(kuma-cp) authn for hds
* feat(kuma-cp) metrics for hds
* feat(kuma-cp) fix install metrics test

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
(cherry picked from commit f6636c8)

Conflicts:
- app/kumactl/pkg/install/k8s/control-plane/helmtemplates_vfsdata.go
- pkg/xds/bootstrap/generator_test.go
- pkg/xds/bootstrap/template_v3.go
- pkg/xds/bootstrap/testdata/generator.default-config.golden.yaml

Co-authored-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Co-authored-by: Nikolay Nikolaev <nikolay.nikolaev@konghq.com>
Summary
Health Discovery Service is an Envoy protocol that allows the management server to receive information about the state of particular hosts.
In Kuma we want to use this protocol for health checking of applications in Universal mode. Before, we were only able to check the status of the proxy itself, which is not enough, especially on Universal: if an application dies but its proxy is alive, we keep sending traffic to that proxy.
This PR introduces initial HDS support and provides a simple TCP health check for application ports.
Health Discovery Service protocol
1. Envoy sends `HealthCheckRequest` with `Node` and `Capabilities` to the management server.
2. The management server sends back a `HealthCheckSpecifier`, which tells Envoy what and how it should check: it contains `cluster`, `host` and `HealthChecker`. It also contains `Interval`, which defines how often Envoy should report the status.
3. Every `Interval`, Envoy sends `EndpointHealthResponse` with the health status of the hosts.

Health Discovery Service implementation
The management server for HDS is implemented in the same fashion as the xDS server: there are Snapshot, SnapshotCache, Reconciler, etc. `Snapshot` consists of a single resource, the `HealthCheckSpecifier`, which is generated from the Dataplane resource, with 1 health checker per dataplane inbound.

Documentation