feat(kuma-cp) introduce Health Discovery Service (HDS) #1418

lobkovilya · 2021-01-13T15:52:03Z

Summary

Health Discovery Service (HDS) is an Envoy protocol that lets a management server receive information about the health of particular hosts.

In Kuma, we want to use this protocol to health check applications in Universal mode. Until now we could only check the status of the proxy, which is not enough, especially on Universal: if an application dies while its proxy stays alive, we keep sending traffic to that proxy.

This PR introduces initial HDS support and provides a simple TCP health check for application ports.

Health Discovery Service protocol

1. Right after startup, Envoy sends a HealthCheckRequest with Node and Capabilities to the management server.
2. The management server replies with a HealthCheckSpecifier, which tells Envoy what to check and how: it contains the cluster, the host, and the HealthChecker. It also contains an Interval that defines how often Envoy should report the status.
3. Every Interval, Envoy sends an EndpointHealthResponse with the health status of the hosts.

Health Discovery Service implementation

The management server for HDS is implemented in the same fashion as the xDS server: there are a Snapshot, a SnapshotCache, a Reconciler, and so on. A Snapshot consists of a single resource, a HealthCheckSpecifier, which is generated from the Dataplane resource with one health checker per dataplane inbound.
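A minimal sketch of the "one health checker per inbound" rule, assuming hypothetical `Dataplane`, `Inbound`, and `HealthCheck` types (the real Kuma resources and the generator in `pkg/hds` are richer, and the cluster naming here is invented for the example):

```go
package main

import "fmt"

// Hypothetical, simplified model of a Dataplane resource.
type Inbound struct {
	Address     string
	ServicePort uint32 // port the application listens on
}

type Dataplane struct {
	Name     string
	Inbounds []Inbound
}

// HealthCheck is a stand-in for one entry of the HealthCheckSpecifier.
type HealthCheck struct {
	Cluster  string
	Endpoint string
	Type     string
}

// generateHealthChecks emits exactly one TCP health check per inbound.
func generateHealthChecks(dp Dataplane) []HealthCheck {
	checks := make([]HealthCheck, 0, len(dp.Inbounds))
	for _, in := range dp.Inbounds {
		checks = append(checks, HealthCheck{
			// Cluster naming is an assumption of this sketch.
			Cluster:  fmt.Sprintf("%s:%d", dp.Name, in.ServicePort),
			Endpoint: fmt.Sprintf("%s:%d", in.Address, in.ServicePort),
			Type:     "tcp",
		})
	}
	return checks
}
```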

Documentation

Link to the website documentation PR

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

jakubdyszkiewicz · 2021-01-15T11:37:39Z

Wow, that is pretty awesome!!

What is missing:

Auth, you are sending DP token but there is no verification
Stats (including dashboards)

The HC configuration is hardcoded. It should be dynamic, either through a separate policy (AppHealthChecks) or by extending the Dataplane model.

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

jakubdyszkiewicz

I think there is one critical thing to consider here.

One of the points of introducing this was to be able to introduce a new instance of the service but only use it once it's ready. Correct me if I'm wrong, but right now when inbound.health is nil, we treat the inbound as healthy. A dataplane with inbound.probe will therefore be treated as healthy UNTIL the first probe result arrives from HDS, so we will start using this instance too soon.

We could:

consider inbound as unhealthy when health is nil, but probe is not nil (extra method in dataplane_helpers)
add health.ready=false in DataplaneManager when you are applying inbound with a probe
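The first option could look roughly like the helper below. The types are hypothetical simplifications of the Dataplane inbound (only the `health` and `probe` fields discussed here), and `isInboundReady` is an invented name, not the method that ended up in dataplane_helpers.

```go
package main

// Hypothetical, simplified inbound model with only the fields under discussion.
type Health struct {
	Ready bool
}

// Probe marks that a probe is configured for the inbound.
type Probe struct{}

type Inbound struct {
	Health *Health
	Probe  *Probe
}

// isInboundReady decides whether traffic may be sent to the inbound.
//   - health reported: trust the reported value
//   - no health, probe configured: not ready until the first HDS report
//   - no health, no probe: keep the old behaviour and treat as ready
func isInboundReady(in Inbound) bool {
	if in.Health != nil {
		return in.Health.Ready
	}
	return in.Probe == nil
}
```

With this rule, a freshly registered instance with a probe receives no traffic until HDS reports it as ready, while dataplanes without probes behave as before.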

jakubdyszkiewicz · 2021-01-21T09:06:49Z

api/mesh/v1alpha1/dataplane.proto

+
+        message Tcp {}
+
+        Tcp tcp = 5;


are we shipping TCP probes only in this release?

I'm not sure I have enough time to do it, so yes, let's stick to TCP only in this release

app/kumactl/data/install/k8s/metrics/grafana/kuma-cp.json

pkg/config/app/kuma-cp/kuma-cp.defaults.yaml

pkg/dp-server/components.go

pkg/hds/tracker/callbacks.go

pkg/hds/tracker/healthcheck_generator.go

pkg/xds/bootstrap/generator.go

pkg/xds/bootstrap/template_v2.go

jakubdyszkiewicz · 2021-01-21T10:21:57Z

test/e2e/healthcheck_universal_test.go

+		}, "10s", "1s").Should(ContainSubstring("ready: true"))
+	})
+
+})


Nice! But here we are only checking whether the Dataplane definition gets updated.

For me the real E2E test would be to:

spawn web

spawn backend-1

start stream of requests from web to backend

spawn backend-2

no requests are dropped, eventually requests are loadbalanced between backend-1 and backend-2.

when service of backend-2 is down

then eventually backend-2 is ejected and all requests are being sent to backend-1

Not a blocker for this PR to be merged, but I'd strongly consider extending the test to check such a scenario.

jakubdyszkiewicz · 2021-01-21T10:37:50Z

One more thing: multizone Universal.

Not sure we are doing it right now, but I think we should exclude unhealthy instances from the available services in Ingress.
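As a rough sketch of that idea, the number of instances advertised per service could be computed from ready instances only. The `Instance` type and `availableInstances` function are hypothetical and do not reflect how Kuma's Ingress actually builds its list of available services.

```go
package main

// Instance is a hypothetical view of one service instance in a zone.
type Instance struct {
	Service string
	Ready   bool
}

// availableInstances counts only ready instances per service, so a
// service whose instances are all unhealthy is not advertised at all.
func availableInstances(instances []Instance) map[string]int {
	out := make(map[string]int)
	for _, i := range instances {
		if i.Ready {
			out[i.Service]++
		}
	}
	return out
}
```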

api/mesh/v1alpha1/dataplane.proto

nickolaev · 2021-01-21T09:03:23Z

pkg/config/app/kuma-cp/kuma-cp.defaults.yaml

+    # On Kubernetes this feature disabled for now regardless the flag value
+    enabled: true # ENV: KUMA_DP_SERVER_HDS_ENABLED
+    # Interval for Envoy to send statuses for HealthChecks
+    interval: 1s # ENV: KUMA_DP_SERVER_HDS_INTERVAL


how will this scale? 100 DPs sending their statuses every second?
I guess 15s might be a better default.

nickolaev · 2021-01-21T09:05:56Z

pkg/config/app/kuma-cp/kuma-cp.defaults.yaml

+      interval: 1s # ENV: KUMA_DP_SERVER_HDS_CHECK_INTERVAL
+      # NoTrafficInterval is a special health check interval that is used when a cluster has
+      #	never had traffic routed to it
+      noTrafficInterval: 1s # ENV: KUMA_DP_SERVER_HDS_CHECK_NO_TRAFFIC_INTERVAL


OK I see lots of timeouts and intervals here. Since this is critical I hope we have thorough checks that these are not in conflict with each other. Also, as mentioned above, maybe relax these by a factor of 10.

nickolaev · 2021-01-21T11:28:24Z

pkg/config/dp-server/config.go

+	if h.NoTrafficInterval <= 0 {
+		return errors.New("NoTrafficInterval must be greater than 0s")
+	}
+	return nil


Please make sure these are not conflicting. For example, can the timeout be lower than the check interval?

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

# Conflicts:
# app/kumactl/cmd/install/testdata/install-control-plane.cni-enabled.golden.yaml
# app/kumactl/cmd/install/testdata/install-control-plane.defaults.golden.yaml
# app/kumactl/cmd/install/testdata/install-control-plane.global.golden.yaml
# app/kumactl/cmd/install/testdata/install-control-plane.overrides.golden.yaml
# app/kumactl/cmd/install/testdata/install-control-plane.remote.golden.yaml
# app/kumactl/cmd/install/testdata/install-control-plane.with-ingress.golden.yaml
# app/kumactl/pkg/install/k8s/control-plane/helmtemplates_vfsdata.go
# deployments/charts/kuma/templates/cp-deployment.yaml

jakubdyszkiewicz · 2021-01-22T13:38:40Z

pkg/dp-server/components.go

@@ -23,7 +24,7 @@ func SetupServer(rt runtime.Runtime) error {
 	if err := bootstrap.RegisterBootstrap(rt, dpServer.httpMux); err != nil {
 		return err
 	}
-	if rt.Config().Environment == core.UniversalEnvironment && rt.Config().DpServer.Hds.Enabled {
+	if rt.Config().Mode != core.Global && rt.Config().DpServer.Hds.Enabled {


nit: DP server is not run on Global at all. No reason to do the check here

* feat(kuma-cp) initial hds support
* feat(kuma-cp) get rid of old tests
* feat(kuma-cp) authn for hds
* feat(kuma-cp) metrics for hds
* feat(kuma-cp) fix install metrics test

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
(cherry picked from commit f6636c8)

# Conflicts:
# app/kumactl/pkg/install/k8s/control-plane/helmtemplates_vfsdata.go
# pkg/xds/bootstrap/generator_test.go
# pkg/xds/bootstrap/template_v3.go
# pkg/xds/bootstrap/testdata/generator.default-config.golden.yaml

* feat(kuma-cp) initial hds support
* feat(kuma-cp) get rid of old tests
* feat(kuma-cp) authn for hds
* feat(kuma-cp) metrics for hds
* feat(kuma-cp) fix install metrics test

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

)

* feat(kuma-cp) introduce Health Discovery Service (HDS) (#1418)
* feat(kuma-cp) initial hds support
* feat(kuma-cp) get rid of old tests
* feat(kuma-cp) authn for hds
* feat(kuma-cp) metrics for hds
* feat(kuma-cp) fix install metrics test

Co-authored-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Co-authored-by: Nikolay Nikolaev <nikolay.nikolaev@konghq.com>

lobkovilya added 6 commits January 13, 2021 22:30

feat(kuma-cp) initial hds support

493410f

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) get rid of old tests

413336e

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) make check

2e17065

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) tests and comments

68c1904

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

Merge branch 'master' into feat/hds-intro

9836670

feat(kuma-cp) after-merge 'make check'

48e5978

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

lobkovilya marked this pull request as ready for review January 15, 2021 09:43

lobkovilya requested a review from a team as a code owner January 15, 2021 09:43

feat(kuma-cp) fix tests, more comments

14c2c74

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

lobkovilya added 7 commits January 20, 2021 19:08

feat(kuma-cp) serviceProbes + hds fixes

ce26671

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) make check

dd79588

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) fix test

9305540

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) authn for hds

7240e1a

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) metrics for hds

2042d98

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) make check

30f3f33

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) fix install metrics test

a522d3b

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

jakubdyszkiewicz reviewed Jan 21, 2021


nickolaev reviewed Jan 21, 2021


nickolaev changed the title ~~feat(kuma-cp) HDS intro~~ feat(kuma-cp) introduce Health Discovery Service (HDS) Jan 21, 2021

lobkovilya added 3 commits January 21, 2021 22:30

feat(kuma-cp) review

1ae23f4

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

feat(kuma-cp) review

0a61b36

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

nickolaev added the backport-to-stable label Jan 22, 2021

nickolaev approved these changes Jan 22, 2021


jakubdyszkiewicz reviewed Jan 22, 2021


jakubdyszkiewicz approved these changes Jan 22, 2021


nickolaev merged commit f6636c8 into master Jan 22, 2021

nickolaev deleted the feat/hds-intro branch January 22, 2021 13:45

mergify bot mentioned this pull request Jan 22, 2021

feat(kuma-cp) introduce Health Discovery Service (HDS) (bp #1418) #1467

Merged

nickolaev mentioned this pull request Jan 22, 2021

HDS optimizations #1468

Open

lahabana mentioned this pull request Nov 22, 2021

HDS, support HTTP protocol for serviceProbe #3232

Closed
