
When the data plane can't reach the apiserver, it silently fails to forward #1085

Closed
evankanderson opened this issue Apr 22, 2019 · 17 comments

@evankanderson
Member

Describe the bug
In a non-default namespace, the default-broker-filter deployment fails to load any Trigger data and instead reports:

sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1alpha1.Trigger: Get https://10.15.240.1:443/apis/eventing.knative.dev/v1alpha1/namespaces/demo/triggers?limit=500&resourceVersion=0: net/http: TLS handshake timeout

and

jsonPayload: {
  caller:  "filter/main.go:52"   
  error:  "Get https://10.15.240.1:443/api?timeout=32s: dial tcp 10.15.240.1:443: connect: connection refused"   
  level:  "fatal"   
  logger:  "provisioner"   
  msg:  "Error starting up."   
  stacktrace:  "main.main
	/go/src/github.com/knative/eventing/cmd/broker/filter/main.go:52
runtime.main
	/usr/local/go/src/runtime/proc.go:200"   
  ts:  "2019-04-18T22:36:24.188Z"   
 }

Expected behavior
Broker and Trigger work in a namespace other than the default namespace.

To Reproduce
Deploy Broker/Trigger to a non-default namespace and then attempt to deliver via Trigger.

Knative release version
Release 0.5

@evankanderson evankanderson added the kind/bug Categorizes issue or PR as related to a bug. label Apr 22, 2019
@evankanderson
Member Author

/priority awaiting-more-evidence

Should we repro with an e2e test or by hand?

@knative-prow-robot knative-prow-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Apr 22, 2019
@mchmarny
Member

mchmarny commented Apr 22, 2019

I can replicate using this source/trigger/service combo in the demo namespace. Will try to validate in the default namespace.

@n3wscott
Contributor

/cc @nachocano
Nacho is currently rewriting the broker controller.

@Harwayne
Contributor

I don't think this is caused by using a Broker in a non-default namespace. The Broker e2e tests have been running in a non-default namespace since their creation.

TestDefaultBrokerWithManyTriggers runs in a non-default namespace (its name is generated here).

@evankanderson
Member Author

I think Mark indicated that deleting and re-creating the pods got him unstuck.

However, the Broker and Triggers' status was Ready during this entire incident, despite the fact that the actual filter was unable to talk to the apiserver. I wonder if there is a way that we could detect this and report a failed status when the broker-filter pod(s) are unable to talk to the apiserver.

@evankanderson evankanderson changed the title Istio Injection with non-default namespace may crash Broker When the data plane can't reach the apiserver, it fails to forward silently Apr 24, 2019
@evankanderson evankanderson changed the title When the data plane can't reach the apiserver, it fails to forward silently When the data plane can't reach the apiserver, it silently fails to forward Apr 24, 2019
@evankanderson
Member Author

Renaming this bug; it appears to have been an intermittent network issue.

The larger issue is that all the objects reported Ready, but no Triggers were actually being routed. Traffic went as follows:

[source] --> [broker-filter] --> [broker-internal-channel] --> [subscription] --> [broker-filter] 🗑

@grantr
Contributor

grantr commented Apr 29, 2019

@Harwayne has been working on making sure Ready status is only reported when the data plane is truly ready: #1064 #1071. Those PRs were merged before this bug was reported, but maybe the report was made with an older nightly? 🤞

@Harwayne
Contributor

Ready status is only reported when the data plane is truly ready: #1064 #1071.

That's not quite accurate. Those PRs don't mark the Broker/Trigger ready until all their pieces are ready, but that is all control plane activity. In particular, if the Pod is considered healthy and available even when it can't talk to the API server, then we would still have this problem. An 'easy' fix would be to crash-loop the container until it can talk to the API server. A better fix would be to add a readiness probe to the Deployments.
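
A minimal sketch of what both fixes could look like, assuming client-go; the waitForAPIServer helper, the /readyz path, and the port are illustrative only, not the actual broker-filter code:

// Sketch: fail startup until the API server is reachable, and expose a
// readiness endpoint that re-checks connectivity.
package main

import (
    "fmt"
    "log"
    "net/http"
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// waitForAPIServer blocks until a trivial discovery call succeeds, so a pod
// that cannot reach the API server fails fast (and crash-loops) instead of
// serving traffic while silently dropping events.
func waitForAPIServer(client kubernetes.Interface, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        _, err := client.Discovery().ServerVersion()
        if err == nil {
            return nil
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("apiserver not reachable after %v: %w", timeout, err)
        }
        time.Sleep(2 * time.Second)
    }
}

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatalf("Error building in-cluster config: %v", err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatalf("Error building clientset: %v", err)
    }

    // The 'easy' fix: refuse to start (and let Kubernetes restart us) until
    // the API server is reachable.
    if err := waitForAPIServer(client, time.Minute); err != nil {
        log.Fatalf("Error starting up: %v", err)
    }

    // The better fix: a readiness endpoint that re-checks connectivity, so a
    // Deployment readinessProbe can pull the pod out of rotation if the API
    // server becomes unreachable later.
    http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
        if _, err := client.Discovery().ServerVersion(); err != nil {
            http.Error(w, err.Error(), http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

A readinessProbe on the filter Deployment pointed at that endpoint would then keep the pod out of the Service until the check passes.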

@grantr
Contributor

grantr commented Apr 29, 2019

A better fix would be to add a readiness probe to the Deployments.

IIUC this used to be impossible, or at least very difficult, with mTLS enabled in Istio. According to istio/istio#4429 that's fixed, but it's unclear whether the fix has been released.

@tcnghia can we start using liveness and readiness probes in the data path yet?

@tcnghia

tcnghia commented Apr 30, 2019

@grantr In serving we have been using readiness probes in the data path for a while now. @ZhiminXiang is busy with AutoTLS work and hasn't gotten to confirming mTLS (probe rewrite) yet. Maybe we will put more investigation into the 0.7 milestone after AutoTLS is done.

@evankanderson
Member Author

I believe we've also removed the Istio requirement at this point.

@matzew
Member

matzew commented May 7, 2019

Related: #1166?

@vaikas
Contributor

vaikas commented Aug 1, 2019

@Harwayne did this get fixed somewhere along the way?

@Harwayne
Contributor

Harwayne commented Aug 1, 2019

@Harwayne did this get fixed somewhere along the way?

I don't think so. The removal of Istio has likely made it far less likely to occur, but I think it would still present the same problem. I think the readiness probe described in #1085 (comment) is still the proper, long-term fix.

@vaikas
Contributor

vaikas commented Aug 1, 2019

Should we then file an issue to add a readiness probe, to explicitly spell out what the problem is? The title here seems to state a different issue.

@Harwayne
Contributor

Harwayne commented Aug 7, 2019

/close

Closing in favor of #1656, which describes the desired solution.

@knative-prow-robot
Contributor

@Harwayne: Closing this issue.

In response to this:

/close

Closing in favor of #1656, which describes the desired solution.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
