Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes issue where success before passing and failures before critical were not being honoured #10256

Conversation

nicholasjackson
Copy link
Contributor

This PR fixes an issue where the health check features success_before_passing and failures_before_critical were not being honoured and status was changed immediately.

This is probably also the cause of the Nomad bug listed in this issue.
hashicorp/nomad#9189

Co-authored-by: Erik Veld eveld@hashicorp.com

Previously given the following service config where the destination service http://localhost:9090 always returns a 500 or no response for the /health and /ready endpoints. Consul would immediately set the check status to critical.

{
  "service": {
    "id": "api-v1",
    "name": "api",
    "tags": ["primary"],
    "address": "127.0.0.1",
    "meta": {
      "meta": "for my service"
    },
    "port": 9090,
    "checks": [
      {
        "name": "HTTP API Ready",
        "http": "http://localhost:9090/ready",
        "interval": "10s",
        "timeout": "5s",
        "status": "warning",
        "success_before_passing": 1,
        "failures_before_critical": 10
      },
      {
        "name": "HTTP API Health",
        "http": "http://localhost:9090/health",
        "interval": "10s",
        "timeout": "5s",
        "status": "warning",
        "success_before_passing": 3,
        "failures_before_critical": 10
      }
    ]
  }
}

This was tested by running Consul locally.

consul agent -dev

Registering the service.

consul services register ./service.json

And checking the logs and output from the /agent/checks endpoint

curl localhost:8500/v1/agent/checks | jq '. | map(. | {check_id: .CheckID, name: .Name, status: .Status})'

Before the first health check, the status correctly reported as warning, however, after the first check the status immediately moved to critical.

[
  {
    "check_id": "service:api-v1:1",
    "name": "HTTP API Ready",
    "status": "critical"
  },
  {
    "check_id": "service:api-v1:2",
    "name": "HTTP API Health",
    "status": "critical"
  }
]

After the code changes, the service correctly stays in warning until the number of consecutive failures defined in the config has been reached.

We have also tested the happy path using fake-service and the behaviour is as expected.

were not being honored

Co-authored-by: Erik Veld <eveld@hashicorp.com>
@nicholasjackson nicholasjackson added the type/bug Feature does not function as expected label May 19, 2021
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging May 19, 2021 08:33 Inactive
@vercel vercel bot temporarily deployed to Preview – consul May 19, 2021 08:33 Inactive
@dnephin
Copy link
Contributor

dnephin commented Jun 8, 2021

Thank you for opening this PR! We looked into this issue a bit more, and found that the current behaviour seems like it is expected by the contributor:

// NewStatusHandler set counters values to threshold in order to immediatly update status after first check.

It seems like there are two use cases here:

  1. reduce load on Consul when health checks are flapping
  2. delay the initial healthcheck

It seems like we may not be able to use these settings for the second use case without it being a breaking change. Maybe what we need is a new field to delay running the healthcheck when it is first created.

@dnephin dnephin added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jun 8, 2021
@dnephin
Copy link
Contributor

dnephin commented Aug 17, 2021

Thanks again for this PR! As mentioned above, unfortunately it looks like this behaviour is expected. If we want to delay initial checks we'll need a new setting. Since this PR would be a breaking change I am going to close it.

@dnephin
Copy link
Contributor

dnephin commented Aug 17, 2021

I created #10864 to track this work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Feature does not function as expected waiting-reply Waiting on response from Original Poster or another individual in the thread
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants