Silences slow to update alerts in Alertmanager with Sharding Enabled #6248

Open
ShunjiTakano opened this issue Oct 2, 2024 · 0 comments

ShunjiTakano (Contributor) commented Oct 2, 2024

Describe the bug
I am running Alertmanager with sharding enabled across 3 pods, and I'm experiencing long delays (~15 min) between creating a silence and that being reflected in the alerts UI/API, and likewise when expiring a silence.

For example,

  1. I create a silence from the UI/API
  2. The affected alerts still appear in the UI and in API responses for quite a long time.
  3. Eventually, the affected alerts are removed from the list of active alerts.

Whereas I expected the change to be reflected almost instantly.

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex v1.17.1 (with the config defined in 'Additional Context')
  2. Navigate to the alertmanager UI
  3. Send a test alert (can be anything)
  4. Create a silence, matching one of the labels in the test alert.
  5. Navigate back to the Alerts page
  6. Confirm that the alert is still showing, even after creating the silence.

I've attached a video of the test scenario.
https://github.com/user-attachments/assets/cba15fa8-a2fd-4ed5-ad71-01207a035727
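
For anyone who wants to script steps 3–6, this is roughly what I do. This is a minimal sketch, assuming the standard Alertmanager v2 API exposed under the configured /alertmanager prefix and Cortex's X-Scope-OrgID tenant header; the host/port, tenant, and label values are placeholders:

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// Assumed endpoint: the Cortex alertmanager service, with the API served
// under the configured `alertmanager_http_prefix` (/alertmanager).
const base = "http://cortex-alertmanager:8080/alertmanager/api/v2"

func post(path, body string) error {
	req, err := http.NewRequest(http.MethodPost, base+path, bytes.NewBufferString(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	// Tenant header used by Cortex; "user1" matches the tenant in the logs below.
	req.Header.Set("X-Scope-OrgID", "user1")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Println(path, resp.Status)
	return nil
}

func main() {
	now := time.Now().UTC()

	// Step 3: send a test alert (labels are arbitrary placeholders).
	alert := fmt.Sprintf(`[{"labels":{"alertname":"TestAlert","severity":"warning"},
		"annotations":{"summary":"test"},
		"startsAt":%q,"endsAt":%q}]`,
		now.Format(time.RFC3339), now.Add(time.Hour).Format(time.RFC3339))
	if err := post("/alerts", alert); err != nil {
		panic(err)
	}

	// Step 4: create a silence matching one of the alert's labels.
	silence := fmt.Sprintf(`{"matchers":[{"name":"alertname","value":"TestAlert","isRegex":false,"isEqual":true}],
		"startsAt":%q,"endsAt":%q,"createdBy":"me","comment":"repro"}`,
		now.Format(time.RFC3339), now.Add(time.Hour).Format(time.RFC3339))
	if err := post("/silences", silence); err != nil {
		panic(err)
	}
}

Once the silence is created, a GET on the same /alerts endpoint (with the same tenant header) still returns the alert as active for roughly the next maintenance cycle, as described below.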

Expected behavior
When creating a silence, I expect the matched alerts to be silenced almost immediately. Instead, it takes several minutes (up to ~15) for the alerts to change from active to silenced.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: Kustomize

Additional Context
Config Used:

api:
  alertmanager_http_prefix: /alertmanager
server:
  log_level: debug
memberlist:
  bind_port: 7946
  join_members:
    - 'cortex-alertmanager-memberlist' # A headless service, pointing to the cortex alertmanager pods.
alertmanager:
  data_dir: /data
  enable_api: true
  external_url: /alertmanager
  persist_interval: 1m
  sharding_enabled: true 
  sharding_ring:
    kvstore:
      store: memberlist     
    replication_factor: 3
  alertmanager_client:
    grpc_compression: gzip
alertmanager_storage:
  backend: gcs
  gcs:
    bucket_name: ${BUCKET_NAME}
    service_account: ${SERVICE_ACCOUNT}
runtime_config:
  file: /etc/cortex-rt/runtime.yml

Deployed as a StatefulSet on Kubernetes, running 3+ replicas.

Looking at the logs, I've noticed that silences only actually start silencing alerts after silence maintenance has completed on all replicas. Looking at the code, the maintenance interval is hardcoded to 15 minutes.

cortex-alertmanager-0 alertmanager ts=2024-10-02T02:59:16.711960932Z caller=silence.go:411 level=debug component=MultiTenantAlertmanager user=user1 component=silences msg="Running maintenance"
cortex-alertmanager-0 alertmanager ts=2024-10-02T02:59:16.812300961Z caller=silence.go:419 level=debug component=MultiTenantAlertmanager user=user1 component=silences msg="Maintenance done" duration=100.331208ms size=1545
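
For context, this is roughly how the per-tenant maintenance gets started, based on my reading of the code. It is a simplified sketch using the upstream prometheus/alertmanager silence package as Cortex vendors it, not the literal source:

package sketch

import (
	"path/filepath"
	"time"

	"github.com/prometheus/alertmanager/silence"
)

// startSilenceMaintenance paraphrases how the per-tenant Alertmanager in
// Cortex wires up silence maintenance (simplified; not the exact source).
func startSilenceMaintenance(silences *silence.Silences, tenantDataDir string, stopCh chan struct{}) {
	snapshotPath := filepath.Join(tenantDataDir, "silences")
	go func() {
		// Expired silences are garbage-collected and a snapshot is written on
		// this cadence. The 15-minute interval is passed in as a constant;
		// there is no configuration option to change it. nil keeps the
		// default maintenance behaviour.
		silences.Maintenance(15*time.Minute, snapshotPath, stopCh, nil)
	}()
}

This 15-minute cadence, run per replica, matches the delay I'm seeing between creating a silence and the alerts being reported as silenced.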

Also, I have tried changing various configs such as poll_interval, push_pull_interval, persist_interval, grpc_compression, and gc_interval, without much luck. I have tried Consul as the kvstore as well; it made no difference.
