Silences slow to update alerts in Alertmanager with Sharding Enabled #6248

Open
ShunjiTakano opened this issue Oct 2, 2024 · 0 comments

ShunjiTakano (Contributor) commented Oct 2, 2024

Describe the bug
I am running Alertmanager with sharding enabled across 3 pods, and I'm experiencing long delays (~15 min) between creating a silence and that being reflected in the alerts UI/API, and likewise when expiring a silence.

For example,

  1. I create a silence from the UI/API
  2. The affected alerts still appear in the UI and in API responses for quite a long time.
  3. Eventually, the affected alerts are removed from the list of active alerts.

Whereas I expected the change to be reflected almost instantly.

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex v1.17.1 (with the config defined in 'Additional Context')
  2. Navigate to the alertmanager UI
  3. Send a test alert (can be anything)
  4. Create a silence, matching one of the labels in the test alert.
  5. Navigate back to the Alerts page
  6. Confirm that the alert is still showing, even after creating the silence.

I've attached a video of the test scenario.
https://github.com/user-attachments/assets/cba15fa8-a2fd-4ed5-ad71-01207a035727
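
For anyone who wants to script steps 3–6, this is roughly what I do. This is a minimal sketch, assuming the standard Alertmanager v2 API exposed under the configured /alertmanager prefix and Cortex's X-Scope-OrgID tenant header; the host/port, tenant, and label values are placeholders:

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// Assumed endpoint: the Cortex alertmanager service, with the API served
// under the configured `alertmanager_http_prefix` (/alertmanager).
const base = "http://cortex-alertmanager:8080/alertmanager/api/v2"

func post(path, body string) error {
	req, err := http.NewRequest(http.MethodPost, base+path, bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	// Tenant header used by Cortex; "user1" matches the tenant in the logs below.
	req.Header.Set("X-Scope-OrgID", "user1")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println(path, resp.Status)
	return nil
}

func main() {
	now := time.Now().UTC()

	// Step 3: send a test alert (labels are arbitrary placeholders).
	alert := fmt.Sprintf(`[{"labels":{"alertname":"TestAlert","severity":"warning"},
		"annotations":{"summary":"test"},
		"startsAt":%q,"endsAt":%q}]`,
		now.Format(time.RFC3339), now.Add(time.Hour).Format(time.RFC3339))
	if err := post("/alerts", alert); err != nil {
		panic(err)
	}

	// Step 4: create a silence matching one of the alert's labels.
	silence := fmt.Sprintf(`{"matchers":[{"name":"alertname","value":"TestAlert","isRegex":false,"isEqual":true}],
		"startsAt":%q,"endsAt":%q,"createdBy":"me","comment":"repro"}`,
		now.Format(time.RFC3339), now.Add(time.Hour).Format(time.RFC3339))
	if err := post("/silences", silence); err != nil {
		panic(err)
	}
}

Once the silence is created, a GET on the same /alerts endpoint (with the same tenant header) still returns the alert as active for roughly the next maintenance cycle, as described below.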

Expected behavior
When creating a silence, I expect the matched alerts to be silenced almost immediately. Instead, it takes several minutes (up to ~15) for the alerts to change from active to silenced.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Kustomize

Additional Context
Config Used:

api:
  alertmanager_http_prefix: /alertmanager
server:
  log_level: debug
memberlist:
  bind_port: 7946
  join_members:
    - 'cortex-alertmanager-memberlist' # A headless service, pointing to the cortex alertmanager pods.
alertmanager:
  data_dir: /data
  enable_api: true
  external_url: /alertmanager
  persist_interval: 1m
  sharding_enabled: true 
  sharding_ring:
    kvstore:
      store: memberlist     
    replication_factor: 3
  alertmanager_client:
    grpc_compression: gzip
alertmanager_storage:
  backend: gcs
  gcs:
    bucket_name: ${BUCKET_NAME}
    service_account: ${SERVICE_ACCOUNT}
runtime_config:
  file: /etc/cortex-rt/runtime.yml

Deployed as a StatefulSet on Kubernetes, running 3+ replicas.

Looking at the logs, I've noticed that silences only actually start silencing alerts after silence maintenance has completed on all replicas. Looking at the code, the maintenance interval is hardcoded to 15 minutes.

cortex-alertmanager-0 alertmanager ts=2024-10-02T02:59:16.711960932Z caller=silence.go:411 level=debug component=MultiTenantAlertmanager user=user1 component=silences msg="Running maintenance"
cortex-alertmanager-0 alertmanager ts=2024-10-02T02:59:16.812300961Z caller=silence.go:419 level=debug component=MultiTenantAlertmanager user=user1 component=silences msg="Maintenance done" duration=100.331208ms size=1545
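
For context, this is roughly how the per-tenant maintenance gets started, based on my reading of the code. It is a simplified sketch using the upstream prometheus/alertmanager silence package as Cortex vendors it, not the literal source:

package sketch

import (
	"path/filepath"
	"time"

	"github.com/prometheus/alertmanager/silence"
)

// startSilenceMaintenance paraphrases how the per-tenant Alertmanager in
// Cortex wires up silence maintenance (simplified; not the exact source).
func startSilenceMaintenance(silences *silence.Silences, tenantDataDir string, stopCh chan struct{}) {
	snapshotPath := filepath.Join(tenantDataDir, "silences")
	go func() {
		// Expired silences are garbage-collected and a snapshot is written on
		// this cadence. The 15-minute interval is passed in as a constant;
		// there is no configuration option to change it. nil keeps the
		// default maintenance behaviour.
		silences.Maintenance(15*time.Minute, snapshotPath, stopCh, nil)
	}()
}

This 15-minute cadence, run per replica, matches the delay I'm seeing between creating a silence and the alerts being reported as silenced.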

Also, I have tried changing various configs such as poll_interval, push_pull_interval, persist_interval, grpc_compression, and gc_interval, without much luck. I have tried Consul as the kvstore as well; it made no difference.
