
When the data plane can't reach the apiserver, it silently fails to forward #1085

Closed
evankanderson opened this issue Apr 22, 2019 · 17 comments

@evankanderson
Member

Describe the bug
In a non-default namespace, the default-broker-filter deployment fails to load any Trigger data and instead reports:

sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1alpha1.Trigger: Get https://10.15.240.1:443/apis/eventing.knative.dev/v1alpha1/namespaces/demo/triggers?limit=500&resourceVersion=0: net/http: TLS handshake timeout

and

jsonPayload: {
  caller:  "filter/main.go:52"   
  error:  "Get https://10.15.240.1:443/api?timeout=32s: dial tcp 10.15.240.1:443: connect: connection refused"   
  level:  "fatal"   
  logger:  "provisioner"   
  msg:  "Error starting up."   
  stacktrace:  "main.main
	/go/src/github.com/knative/eventing/cmd/broker/filter/main.go:52
runtime.main
	/usr/local/go/src/runtime/proc.go:200"   
  ts:  "2019-04-18T22:36:24.188Z"   
 }

Expected behavior
Broker and Trigger work in a namespace other than the default namespace.

To Reproduce
Deploy Broker/Trigger to a non-default namespace and then attempt to deliver via Trigger.

Knative release version
Release 0.5

@evankanderson evankanderson added the kind/bug Categorizes issue or PR as related to a bug. label Apr 22, 2019
@evankanderson
Member Author

/priority awaiting-more-evidence

Should we repro with an e2e test or by hand?

@knative-prow-robot knative-prow-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Apr 22, 2019
@mchmarny
Member

mchmarny commented Apr 22, 2019

I can replicate using this source/trigger/service combo in the demo namespace. Will try to validate in the default namespace.

@n3wscott
Contributor

/cc @nachocano
Nacho is currently rewriting the broker controller.

@Harwayne
Contributor

I don't think this is caused by using a Broker in a non-default namespace. The Broker e2e tests have been running in a non-default namespace since their creation.

TestDefaultBrokerWithManyTriggers runs in a non-default namespace (its name is generated here).

@evankanderson
Member Author

I think Mark indicated that deleting and re-creating the pods got him unstuck.

However, the Broker and Triggers' status was Ready during this entire incident, despite the fact that the actual filter was unable to talk to the apiserver. I wonder if there is a way that we could detect this and report a failed status when the broker-filter pod(s) are unable to talk to the apiserver.

@evankanderson evankanderson changed the title Istio Injection with non-default namespace may crash Broker When the data plane can't reach the apiserver, it fails to forward silently Apr 24, 2019
@evankanderson evankanderson changed the title When the data plane can't reach the apiserver, it fails to forward silently When the data plane can't reach the apiserver, it silently fails to forward Apr 24, 2019
@evankanderson
Member Author

Renaming this bug; it appears to have been an intermittent network issue.

The larger issue is that all the objects reported Ready, but no Triggers were actually being routed. Traffic went as follows:

[source] --> [broker-filter] --> [broker-internal-channel] --> [subscription] --> [broker-filter] 🗑

@grantr
Contributor

grantr commented Apr 29, 2019

@Harwayne has been working on making sure Ready status is only reported when the data plane is truly ready: #1064 #1071. Those PRs were merged before this bug was reported, but maybe the report was made with an older nightly? 🤞

@Harwayne
Contributor

Ready status is only reported when the data plane is truly ready: #1064 #1071.

That's not quite accurate. Those PRs don't mark the Broker/Trigger ready until all their pieces are ready, but that is all control plane activity. In particular, if the Pod is considered healthy and available even when it can't talk to the API server, then we would still have this problem. An 'easy' fix would be to crash-loop the container until it can talk to the API server. A better fix would be to add a readiness probe to the Deployments.
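
A minimal sketch of what both fixes could look like, assuming client-go; the waitForAPIServer helper, the /readyz path, and the port are illustrative only, not the actual broker-filter code:

// Sketch: fail startup until the API server is reachable, and expose a
// readiness endpoint that re-checks connectivity.
package main

import (
    "fmt"
    "log"
    "net/http"
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

// waitForAPIServer blocks until a trivial discovery call succeeds, so a pod
// that cannot reach the API server fails fast (and crash-loops) instead of
// serving traffic while silently dropping events.
func waitForAPIServer(client kubernetes.Interface, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        _, err := client.Discovery().ServerVersion()
        if err == nil {
            return nil
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("apiserver not reachable after %v: %w", timeout, err)
        }
        time.Sleep(2 * time.Second)
    }
}

func main() {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        log.Fatalf("Error building in-cluster config: %v", err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        log.Fatalf("Error building clientset: %v", err)
    }

    // The 'easy' fix: refuse to start (and let Kubernetes restart us) until
    // the API server is reachable.
    if err := waitForAPIServer(client, time.Minute); err != nil {
        log.Fatalf("Error starting up: %v", err)
    }

    // The better fix: a readiness endpoint that re-checks connectivity, so a
    // Deployment readinessProbe can pull the pod out of rotation if the API
    // server becomes unreachable later.
    http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
        if _, err := client.Discovery().ServerVersion(); err != nil {
            http.Error(w, err.Error(), http.StatusServiceUnavailable)
            return
        }
        w.WriteHeader(http.StatusOK)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}

A readinessProbe on the filter Deployment pointed at that endpoint would then keep the pod out of the Service until the check passes.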

@grantr
Contributor

grantr commented Apr 29, 2019

A better fix would be to add a readiness probe to the Deployments.

IIUC this used to be impossible, or at least very difficult, with mTLS enabled in Istio. According to istio/istio#4429 that's fixed, but it's unclear whether the fix has been released.

@tcnghia can we start using liveness and readiness probes in the data path yet?

@tcnghia

tcnghia commented Apr 30, 2019

@grantr In serving we have been using readiness probes in the data path for a while now. @ZhiminXiang is busy with AutoTLS work and hasn't gotten to confirming mTLS (probe rewrite) yet. Maybe we will put more investigation into the 0.7 milestone after AutoTLS is done.

@evankanderson
Member Author

I believe we've also removed the Istio requirement at this point.

@matzew
Member

matzew commented May 7, 2019

Related: #1166?

@vaikas
Contributor

vaikas commented Aug 1, 2019

@Harwayne did this get fixed somewhere along the way?

@Harwayne
Contributor

Harwayne commented Aug 1, 2019

@Harwayne did this get fixed somewhere along the way?

I don't think so. The removal of Istio has likely made it far less likely to occur, but I think it would still present the same problem. I think the readiness probe described in #1085 (comment) is still the proper, long-term fix.

@vaikas
Contributor

vaikas commented Aug 1, 2019

Should we then file an issue to add a readiness probe, to explicitly spell out what the problem is? The title here seems to state a different issue.

@Harwayne
Contributor

Harwayne commented Aug 7, 2019

/close

Closing in favor of #1656, which describes the desired solution.

@knative-prow-robot
Contributor

@Harwayne: Closing this issue.

In response to this:

/close

Closing in favor of #1656, which describes the desired solution.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
