Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(kuma-cp) introduce Health Discovery Service (HDS) #1418

Merged
merged 17 commits into from
Jan 22, 2021
Merged

Conversation

lobkovilya
Copy link
Contributor

@lobkovilya lobkovilya commented Jan 13, 2021

Summary

Health Discovery Service is an Envoy protocol, which allows the receiving of information about the state of particular hosts.

In Kuma we want to utilize this protocol for health checking of application in Universal mode. Before that, we were able to check the status only for proxy, but that's not enough especially on Universal. If an application dies but the proxy is alive we still will be sending traffic to this proxy.

Current PR introduces the initial HDS support and provides a simple TCP health check for application ports.

Health Discovery Service protocol

  1. Right after start Envoy sends HealthCheckRequest with Node and Capabilites to management server
  2. Management server replies with HealthCheckSpecifier which tells Envoy what and how it should check. So HealthCheckSpecifier contains cluster, host and HealthChecker. Also HealthCheckSpecifier contains Interval which defines how often Envoy should report the status.
  3. Every Interval Envoy sends EndpointHealthResponse with the health status of the hosts.

Health Discovery Service implementation

Management server for HDS is implemented in the same fashion as xDS server. There are Snapshot, SnapshotCache, Reconciler, and etc. Snapshot consists of a single resource - HealthCheckSpecifier, which is generated based on Dataplane resource, 1 health checker per dataplane's inbound.

Documentation

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
@lobkovilya lobkovilya marked this pull request as ready for review January 15, 2021 09:43
@lobkovilya lobkovilya requested a review from a team as a code owner January 15, 2021 09:43
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
@jakubdyszkiewicz
Copy link
Contributor

Wow, that is pretty awesome!!

What is missing:

  • Auth, you are sending DP token but there is no verification
  • Stats (including dashboards)

Configuration of HC is hardcoded. It should be a dynamic configuration. Either with separate policy AppHealthChecks or by extending Dataplane model.

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Copy link
Contributor

@jakubdyszkiewicz jakubdyszkiewicz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is one critical thing to consider here.

One of the points of introducing this was to be able to introduce new instance of the service but only use it when it's ready. Correct me if I'm wrong, but right now when inbound.health is nil, then we are treating it as healthy. When you have a dataplane with inbound.probe it will be treated as healthy UNTIL the first probe arrives from HDS, so we will take into account this instance too soon.

We could:

  • consider inbound as unhealthy when health is nil, but probe is not nil (extra method in dataplane_helpers)
  • add health.ready=false in DataplaneManager when you are applying inbound with a probe


message Tcp {}

Tcp tcp = 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we shipping TCP probes only in this release?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I have enough time to do it, so yes, let's stick to TCP only in this release

}, "10s", "1s").Should(ContainSubstring("ready: true"))
})

})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! But here we are checking if we are updating the Dataplane definition.

For me the real E2E test would be to:

  • spawn web
  • spawn backend-1
  • start stream of requests from web to backend
  • spawn backend-2
  • no requests are dropped, eventually requests are loadbalanced between backend-1 and backend-2.
  • when service of backend-2 is down
  • then eventually backend-2 is ejected and all requests are being sent to backend-1

Not a blocker for this PR to be merged, but I'd strongly consider extending the test to check such scenario.

@jakubdyszkiewicz
Copy link
Contributor

one more thing, multizone Universal.

Not sure we are doing it right now, but I think we should exclude unhealthy from available services in Ingress

# On Kubernetes this feature disabled for now regardless the flag value
enabled: true # ENV: KUMA_DP_SERVER_HDS_ENABLED
# Interval for Envoy to send statuses for HealthChecks
interval: 1s # ENV: KUMA_DP_SERVER_HDS_INTERVAL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how will this scale? 100 DPs sending their statuses every second?
I guess 15s might be a better default.

interval: 1s # ENV: KUMA_DP_SERVER_HDS_CHECK_INTERVAL
# NoTrafficInterval is a special health check interval that is used when a cluster has
# never had traffic routed to it
noTrafficInterval: 1s # ENV: KUMA_DP_SERVER_HDS_CHECK_NO_TRAFFIC_INTERVAL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I see lots of timeouts and intervals here. Since this is critical I hope we have thorough checks that these are not in conflict with each other. Also, as mentioned above, maybe relax these by a factor of 10.

if h.NoTrafficInterval <= 0 {
return errors.New("NoTrafficInterval must be greater than 0s")
}
return nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure these are not conflicting. For example, can the timeout be lower than the check interval?

@nickolaev nickolaev changed the title feat(kuma-cp) HDS intro feat(kuma-cp) introduce Health Discovery Service (HDS) Jan 21, 2021
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>

# Conflicts:
#	app/kumactl/cmd/install/testdata/install-control-plane.cni-enabled.golden.yaml
#	app/kumactl/cmd/install/testdata/install-control-plane.defaults.golden.yaml
#	app/kumactl/cmd/install/testdata/install-control-plane.global.golden.yaml
#	app/kumactl/cmd/install/testdata/install-control-plane.overrides.golden.yaml
#	app/kumactl/cmd/install/testdata/install-control-plane.remote.golden.yaml
#	app/kumactl/cmd/install/testdata/install-control-plane.with-ingress.golden.yaml
#	app/kumactl/pkg/install/k8s/control-plane/helmtemplates_vfsdata.go
#	deployments/charts/kuma/templates/cp-deployment.yaml
@@ -23,7 +24,7 @@ func SetupServer(rt runtime.Runtime) error {
if err := bootstrap.RegisterBootstrap(rt, dpServer.httpMux); err != nil {
return err
}
if rt.Config().Environment == core.UniversalEnvironment && rt.Config().DpServer.Hds.Enabled {
if rt.Config().Mode != core.Global && rt.Config().DpServer.Hds.Enabled {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: DP server is not run on Global at all. No reason to do the check here

@nickolaev nickolaev merged commit f6636c8 into master Jan 22, 2021
@nickolaev nickolaev deleted the feat/hds-intro branch January 22, 2021 13:45
mergify bot pushed a commit that referenced this pull request Jan 22, 2021
* feat(kuma-cp) initial hds support

* feat(kuma-cp) get rid of old tests

* feat(kuma-cp) authn for hds

* feat(kuma-cp) metrics for hds

* feat(kuma-cp) fix install metrics test

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
(cherry picked from commit f6636c8)

# Conflicts:
#	app/kumactl/pkg/install/k8s/control-plane/helmtemplates_vfsdata.go
#	pkg/xds/bootstrap/generator_test.go
#	pkg/xds/bootstrap/template_v3.go
#	pkg/xds/bootstrap/testdata/generator.default-config.golden.yaml
@nickolaev nickolaev mentioned this pull request Jan 22, 2021
nickolaev pushed a commit that referenced this pull request Jan 22, 2021
* feat(kuma-cp) initial hds support

* feat(kuma-cp) get rid of old tests

* feat(kuma-cp) authn for hds

* feat(kuma-cp) metrics for hds

* feat(kuma-cp) fix install metrics test

Signed-off-by: Ilya Lobkov <ilya.lobkov@konghq.com>
nickolaev pushed a commit that referenced this pull request Jan 22, 2021
)

* feat(kuma-cp) introduce Health Discovery Service (HDS) (#1418)

* feat(kuma-cp) initial hds support

* feat(kuma-cp) get rid of old tests

* feat(kuma-cp) authn for hds

* feat(kuma-cp) metrics for hds

* feat(kuma-cp) fix install metrics test

Co-authored-by: Ilya Lobkov <ilya.lobkov@konghq.com>
Co-authored-by: Nikolay Nikolaev <nikolay.nikolaev@konghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants