Gloo pod is failing when all the upstreams are static and configured with consul integration #8425

Open
prasanth-openet opened this issue Jun 28, 2023 · 4 comments
Labels: stale, Type: Bug

Comments

prasanth-openet commented Jun 28, 2023

Gloo Edge Version

1.13.x (beta)

Kubernetes Version

v1.24.0

Describe the bug

I am trying to install the gloo chart (version v1.13.0) on Kubernetes in a namespace other than 'gloo-system'. However, I can see that the sds container in the gloo pod is not getting ready. In my values.yaml file, I have disabled service discovery and provided the Consul integration details. This issue does not happen when I install version v1.12.56 or previous versions, so it seems to be broken from gloo versions >=1.13.0 onwards.

I would appreciate any help, thank you.

Steps to reproduce the bug

  1. Install the chart in a namespace other than gloo-system:

helm upgrade --install gloo -n my-namespace --create-namespace --wait --debug --values values.yaml gloo-1.13.0.tgz
The following is my values.yaml file:

discovery:
  enabled: false
settings:
  singleNamespace: true
  disableKubernetesDestinations: true
  integrations:
    consul:
      httpAddress: http://consul-consul-server.consul.svc:8500
      dnsAddress: kube-dns.kube-system.svc:53
      serviceDiscovery: {}
global:
  glooMtls:
    enabled: true
  image:
    registry: quay.io/solo-io
  2. kubectl get pods -n gloo-system
    You can observe that only 2 of the 3 containers in the gloo pod have started.

The kubectl output is as follows:

Every 2.0s: kubectl get pods -n gloo-system                                   Wed Jun 28 14:57:13 2023

NAME                            READY   STATUS    RESTARTS   AGE
gateway-proxy-c5b68c9df-n4mzd   2/2     Running   0          43s
gloo-bd6954685-8cg7z            2/3     Running   0          43s
  3. glooctl check gives the following results:
 Checking deployments... 1 Errors!
Checking pods... 2 Errors!
Checking upstreams... OK
Checking upstream groups... OK
Checking auth configs... OK
Checking rate limit configs... OK
Checking VirtualHostOptions... OK
Checking RouteOptions... OK
Checking secrets... OK
Checking virtual services... OK
Checking gateways... OK
Checking proxies... Skipping due to an error in checking deployments
Skipping due to an error in checking deployments
Error: 5 errors occurred:
        * Deployment gloo in namespace gloo-system is not available! Message: Deployment does not have minimum availability.
        * Pod gloo-bd6954685-8cg7z in namespace gloo-system is not ready! Message: containers with unready status: [sds]
        * Not all containers in pod gloo-bd6954685-8cg7z in namespace gloo-system are ready! Message: containers with unready status: [sds]
        * proxy check was skipped due to an error in checking deployments
        * xds metrics check was skipped due to an error in checking deployments

Note: the issue is reproducible only if you use a namespace other than 'gloo-system'. In this test I am using 'my-namespace'. If we use the namespace gloo-system, the gloo pod runs without any issues.

Expected Behavior

The gloo pod should start and become fully ready.

Additional Context

No response

@prasanth-openet prasanth-openet added the Type: Bug label Jun 28, 2023
@prasanth-openet prasanth-openet changed the title from "Gloo pod not up and running, when service discovery is disable and using static consul upstream." to "Gloo pod is failing when all the upstreams are static and configured with consul integration" Jun 28, 2023

ncouse commented Jun 29, 2023

Some additional detail on the issue observed here.

This requires at least the following conditions:

  • Gloo mTLS is enabled
  • Discovery mode is disabled
  • Consul Integration is defined (with ServiceDiscovery)
  • SingleNamespace is used and the install namespace is not gloo-system
  • There are no Upstreams defined yet

We use the Consul integration as our default discovery mechanism. Discovery is turned off because we don't want the auto-discovered upstreams; the side effect is that there are no upstreams until we add our own after the Gloo chart installation.

From checking the OSS code, we observe that the SDS container does not become Ready because the Gloo container has not opened its gRPC port yet. This is why we only see the issue if mTLS is enabled.

This seems to be due to the startup order: the Gloo container only opens the gRPC port after it checks for healthy endpoints. If we set endpointsWarmingTimeout to 0s to disable the feature, we do not have this problem.
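
For reference, this is roughly how we disable it; a minimal sketch of the relevant fragment of the default Settings resource, assuming the install namespace from the original description (how you set it, via Helm values or by editing the Settings CR directly, may vary by version, and any existing spec fields must be kept as-is):

apiVersion: gloo.solo.io/v1
kind: Settings
metadata:
  name: default
  namespace: my-namespace
spec:
  gloo:
    # 0s disables endpoint warming, so the Gloo container opens its gRPC
    # port immediately and the sds container can become Ready
    endpointsWarmingTimeout: 0s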

Also, we set singleNamespace and install in a custom namespace (not gloo-system). This seems to be an important part of the problem. We don't want Gloo looking in other namespaces in our case, since the installation is restricted to the single installed namespace. We observed that the issue does not occur if Gloo is installed in the gloo-system namespace, but does if it is installed in any other namespace.

In summary, installing with the settings in the original description, in a namespace other than gloo-system, will reproduce the problem, and the chart installation will fail/time out.

ncouse commented Jun 29, 2023

Currently the only ways to work around this are to either:

  • Create Consul Upstreams before installing the Gloo chart (see the sketch at the end of this comment)
    • If no Consul Upstreams are needed at install time, then the Consul integration must be disabled.
    • This approach doesn't allow the flexibility to define Consul upstreams later when needed, or to install charts in any order.
  • Disable the endpoints warming feature

This was not an issue in previous versions of Gloo.
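
For the first workaround, a rough sketch of the kind of static Consul Upstream we create before installing the chart; the service name and tag below are placeholders for whatever is registered in the Consul catalog:

apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  # placeholder name, any Consul-backed upstream is enough here
  name: placeholder-consul-service
  namespace: my-namespace
spec:
  consul:
    # service name as registered in the Consul catalog (placeholder)
    serviceName: placeholder-consul-service
    serviceTags:
    - http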

ncouse commented Jun 29, 2023

Further analysis, by process of elimination of versions, shows that the bug was introduced in 1.13.0-beta10.

This version introduces some new settings for Consul.

We are also now using the settings under consulUpstreamDiscovery, as we have noticed that Gloo is occasionally out of sync with the services in the Consul catalog; changing these settings solves that.

However, these settings do not seem to directly affect the reproducibility of the issue in this ticket, although they may be related given the features added in that beta release. For reference, our settings block is as follows:

settings:
  singleNamespace: true
  disableKubernetesDestinations: true
  integrations:
    consul:
      httpAddress: http://consul-consul-server.consul.svc:8500
      dnsAddress: kube-dns.kube-system.svc:53
      serviceDiscovery: {}
    consulUpstreamDiscovery:
      consistencyMode: ConsistentMode
      edsBlockingQueries: true
      queryOptions:
        useCache: false

This issue has been marked as stale because of no activity in the last 180 days. It will be closed in the next 180 days unless it is tagged "no stalebot" or other activity occurs.

@github-actions github-actions bot added the stale label Jun 17, 2024