
Fix prometheus histogram rate overflows #17753

Merged

Conversation

jsoriano
Member

@jsoriano jsoriano commented Apr 16, 2020

What does this PR do?

Fix some overflows in Prometheus histogram rate calculations. They can be caused by:

  • New buckets added to existing histograms at runtime; this happens at least with CockroachDB (see Use Elasticsearch types in Cockroachdb module #17736).
  • Buckets with bigger upper limits reporting lower counters than buckets below them. This should not happen and I have only reproduced it in tests, but it is handled just in case, to avoid losing other data if it happens with some service.

Rate calculation methods now also return a boolean, to differentiate whether a zero value is due to this being the first call or to the rate actually being zero.
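To illustrate the idea, here is a minimal Go sketch (not the actual Beats implementation; the `rateCache` type and its `rate` method are hypothetical names used only for illustration). It clamps negative deltas to zero and uses a boolean to mark the first observation:

```go
package main

import "fmt"

// rateCache keeps the previously observed value of each cumulative counter,
// keyed by metric name. Illustrative only, not the Beats code.
type rateCache struct {
	previous map[string]float64
}

func newRateCache() *rateCache {
	return &rateCache{previous: make(map[string]float64)}
}

// rate returns the increase of a cumulative counter since the last call.
// The boolean is false on the first observation, so callers can skip the
// value instead of reporting a misleading zero. If the counter appears to
// decrease (e.g. a new bucket was added at runtime), the negative delta is
// clamped to zero instead of being propagated.
func (c *rateCache) rate(name string, current float64) (float64, bool) {
	prev, found := c.previous[name]
	c.previous[name] = current
	if !found {
		return 0, false
	}
	if current < prev {
		return 0, true
	}
	return current - prev, true
}

func main() {
	c := newRateCache()

	if _, ok := c.rate("requests_total", 10); !ok {
		fmt.Println("first call: no rate yet")
	}

	r, _ := c.rate("requests_total", 17)
	fmt.Println("normal delta:", r) // 7

	// Counter went "backwards": clamp to 0 instead of a negative delta.
	r, _ = c.rate("requests_total", 3)
	fmt.Println("guarded delta:", r) // 0
}
```

With a guard like this, a counter that appears to go backwards produces a zero rate instead of a negative delta that later shows up as a huge unsigned value such as the 9223372036854775801 seen in the logs below.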

Why is it important?

These overflows produce huge values; documents containing them are rejected by Elasticsearch, causing the loss of these and other metrics in the same documents.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  • Configure Metricbeat to monitor a service that exposes Prometheus metrics, where some of these metrics are histograms whose bucket distribution can change over time. I have only seen this with CockroachDB; the issue can be reproduced with a config like the following one:
metricbeat.modules:
  - module: prometheus
    metricsets: ['collector']
    hosts: ['172.21.0.2:8080']
    metrics_path: '/_status/vars'
    use_types: true

output.elasticsearch.hosts: ["http://localhost:9200"]
  • Check that metrics are ingested and no errors appear in Metricbeat logs.

Related issues

Logs

Example error:

2020-04-16T14:13:24.057+0200	WARN	[elasticsearch]	elasticsearch/client.go:385	Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xbf9e30d8bb1146e2, ext:18152263511, loc:(*time.Location)(0x7b0cda0)}, Meta:null, Fields:{"agent":{"ephemeral_id":"f625d601-5fe0-4fd4-834c-a093e836c8ef","hostname":"voyager","id":"9aed890c-78d3-4460-bd44-54c141fc352d","type":"metricbeat","version":"8.0.0"},"ecs":{"version":"1.5.0"},"event":{"dataset":"prometheus.collector","duration":9085986,"module":"prometheus"},"host":{"name":"voyager"},"metricset":{"name":"collector","period":10000},"prometheus":{"addsstable_applications":{"counter":0.000000},"addsstable_copies":{"counter":0.000000},"addsstable_proposals":{"counter":0.000000},"capacity":{"value":107374182400.000000},"capacity_available":{"value":26611531776.000000},"capacity_reserved":{"value":0.000000},"capacity_used":{"value":679969.000000},"compactor_compactingnanos":{"counter":0.000000},"compactor_compactions_failure":{"counter":0.000000},"compactor_compactions_success":{"counter":0.000000},"compactor_suggestionbytes_compacted":{"counter":0.000000},"compactor_suggestionbytes_queued":{"value":0.000000},"compactor_suggestionbytes_skipped":{"counter":0.000000},"follower_reads_success_count":{"counter":0.000000},"gcbytesage":{"value":16466.000000},"intentage":{"value":0.000000},"intentbytes":{"value":0.000000},"intentcount":{"value":0.000000},"intentresolver_async_throttled":{"counter":0.000000},"intents_abort_attempts":{"counter":0.000000},"intents_poison_attempts":{"counter":0.000000},"intents_resolve_attempts":{"counter":30.000000},"keybytes":{"value":33026.000000},"keycount":{"value":725.000000},"kv_closed_timestamp_max_behind_nanos":{"value":36006495619.000000},"kv_rangefeed_catchup_scan_nanos":{"counter":0.000000},"labels":{"instance":"172.22.0.2:8080","job":"prometheus","store":"1"},"lastupdatenanos":{"value":1587039199844044288.000000},"leases_epoch":{"value":18.000000},"leases_error":{"counter":0.000000},"leases_expiration":{"value":2.000000},"leases_success":{"counter":28.000000},"leases_transfers_error":{"counter":0.000000},"leases_transfers_success":{"counter":0.000000},"livebytes":{"value":129018.000000},"livecount":{"value":724.000000},"queue_consistency_pending":{"value":0.000000},"queue_consistency_process_failure":{"counter":0.000000},"queue_consistency_process_success":{"counter":20.000000},"queue_consistency_processingnanos":{"counter":3259882231.000000},"queue_gc_info_abortspanconsidered":{"counter":0.000000},"queue_gc_info_abortspangcnum":{"counter":0.000000},"queue_gc_info_abortspanscanned":{"counter":0.000000},"queue_gc_info_intentsconsidered":{"counter":0.000000},"queue_gc_info_intenttxns":{"counter":0.000000},"queue_gc_info_numkeysaffected":{"counter":0.000000},"queue_gc_info_pushtxn":{"counter":0.000000},"queue_gc_info_resolvesuccess":{"counter":0.000000},"queue_gc_info_resolvetotal":{"counter":0.000000},"queue_gc_info_transactionspangcaborted":{"counter":0.000000},"queue_gc_info_transactionspangccommitted":{"counter":0.000000},"queue_gc_info_transactionspangcpending":{"counter":0.000000},"queue_gc_info_transactionspanscanned":{"counter":0.000000},"queue_gc_pending":{"value":0.000000},"queue_gc_process_failure":{"counter":0.000000},"queue_gc_process_success":{"counter":0.000000},"queue_gc_processingnanos":{"counter":0.000000},"queue_merge_pending":{"value":0.000000},"queue_merge_process_failure":{"counter":0.000000},"queue_merge_process_success":{"counter":0.000000},"queue_merge_processingnanos":{"counter":
0.000000},"queue_merge_purgatory":{"value":0.000000},"queue_raftlog_pending":{"value":0.000000},"queue_raftlog_process_failure":{"counter":0.000000},"queue_raftlog_process_success":{"counter":23.000000},"queue_raftlog_processingnanos":{"counter":360904963.000000},"queue_raftsnapshot_pending":{"value":0.000000},"queue_raftsnapshot_process_failure":{"counter":0.000000},"queue_raftsnapshot_process_success":{"counter":0.000000},"queue_raftsnapshot_processingnanos":{"counter":0.000000},"queue_replicagc_pending":{"value":0.000000},"queue_replicagc_process_failure":{"counter":0.000000},"queue_replicagc_process_success":{"counter":0.000000},"queue_replicagc_processingnanos":{"counter":0.000000},"queue_replicagc_removereplica":{"counter":0.000000},"queue_replicate_addreplica":{"counter":0.000000},"queue_replicate_pending":{"value":0.000000},"queue_replicate_process_failure":{"counter":21.000000},"queue_replicate_process_success":{"counter":0.000000},"queue_replicate_processingnanos":{"counter":923795838.000000},"queue_replicate_purgatory":{"value":20.000000},"queue_replicate_rebalancereplica":{"counter":0.000000},"queue_replicate_removedeadreplica":{"counter":0.000000},"queue_replicate_removereplica":{"counter":0.000000},"queue_replicate_transferlease":{"counter":0.000000},"queue_split_pending":{"value":0.000000},"queue_split_process_failure":{"counter":0.000000},"queue_split_process_success":{"counter":0.000000},"queue_split_processingnanos":{"counter":0.000000},"queue_split_purgatory":{"value":0.000000},"queue_tsmaintenance_pending":{"value":0.000000},"queue_tsmaintenance_process_failure":{"counter":0.000000},"queue_tsmaintenance_process_success":{"counter":3.000000},"queue_tsmaintenance_processingnanos":{"counter":619011410.000000},"raft_commandsapplied":{"counter":186.000000},"raft_enqueued_pending":{"value":0.000000},"raft_entrycache_accesses":{"counter":44.000000},"raft_entrycache_bytes":{"value":172159.000000},"raft_entrycache_hits":{"counter":4.000000},"raft_entrycache_size":{"value":173.000000},"raft_heartbeats_pending":{"value":0.000000},"raft_process_applycommitted_latency":{"histogram":{"counts":[0,0,0,0,0,0,1,0,3,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9223372036854775801,8,0,0],"values":[40959.5,86015,100351,112639,116735,124927,135167,143359,151551,159743,167935,176127,184319,192511,200703,208895,217087,225279,233471,241663,249855,258047,270335,286719,303103,319487,335871,352255,368639,385023,401407,417791,434175,450559,466943,483327,507903,540671,573439,606207,638975,688127,753663,819199,884735,1015807,1146879,1343487,1605631,2555903,3538943,3932159]}},"raft_process_commandcommit_latency":{"histogram":{"counts":[0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,9223372036854775806,3,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,9223372036854775801,8,0,0],"values":[3199.5,6527,6911,7423,7935,8447,8959,9727,10495,11263,12031,12543,13055,13823,14847,15615,16639,17919,18943,19967,20991,22015,23039,24063,25087,26111,27135,28159,29183,30207,31231,32255,33791,35839,37887,39935,41983,44031,46079,48127,50175,52223,54271,56319,58367,60415,62463,64511,67583,71679,75775,79871,83967,88063,92159,96255,100351,106495,129023,303103,688127,1277951,1769471,2162687]}},"raft_process_handleready_latency":{"histogram":{"counts":[0,2,0,0,9223372036854775806,0,4,9223372036854775804,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"values":[491519.5,999423,1032191,1179647,1409023,1572863,1835007,5996543,10747903,12
058623,14155775,20971519,26738687,27787263,28835839,29884415,30932991,31981567,33030143,34603007,36700159,38797311,40894463,42991615,45088767,47185919,49283071,51380223,53477375,55574527,57671679,59768831,61865983,63963135,66060287,69206015,73400319,77594623,81788927,88080383,94371839,98566143,102760447,109051903,115343359,130023423,146800639,159383551,260046847,360710143,385875967]}},"raft_process_logcommit_latency":{"histogram":{"counts":[0,2,0,0,9223372036854775806,0,4,9223372036854775804,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"values":[393215.5,819199,868351,933887,1114111,1343487,1540095,4882431,9830399,11796479,13631487,20185087,25690111,26738687,27787263,28835839,29884415,30932991,31981567,33030143,34603007,36700159,38797311,40894463,42991615,45088767,47185919,49283071,51380223,53477375,55574527,57671679,59768831,61865983,63963135,66060287,69206015,73400319,77594623,81788927,88080383,94371839,98566143,102760447,109051903,127926271,146800639,159383551,251658239,343932927,360710143,385875967]}},"raft_process_tickingnanos":{"counter":1858744.000000},"raft_process_workingnanos":{"counter":8547434611.000000},"raft_rcvd_app":{"counter":0.000000},"raft_rcvd_appresp":{"counter":0.000000},"raft_rcvd_dropped":{"counter":0.000000},"raft_rcvd_heartbeat":{"counter":0.000000},"raft_rcvd_heartbeatresp":{"counter":0.000000},"raft_rcvd_prevote":{"counter":0.000000},"raft_rcvd_prevoteresp":{"counter":0.000000},"raft_rcvd_prop":{"counter":0.000000},"raft_rcvd_snap":{"counter":0.000000},"raft_rcvd_timeoutnow":{"counter":0.000000},"raft_rcvd_transferleader":{"counter":0.000000},"raft_rcvd_vote":{"counter":0.000000},"raft_rcvd_voteresp":{"counter":0.000000},"raft_ticks":{"counter":150.000000},"raftlog_behind":{"value":0.000000},"raftlog_truncated":{"counter":13.000000},"range_adds":{"counter":0.000000},"range_merges":{"counter":0.000000},"range_raftleadertransfers":{"counter":0.000000},"range_removes":{"counter":0.000000},"range_snapshots_generated":{"counter":0.000000},"range_snapshots_normal_applied":{"counter":0.000000},"range_snapshots_preemptive_applied":{"counter":0.000000},"range_splits":{"counter":0.000000},"ranges":{"value":20.000000},"ranges_overreplicated":{"value":0.000000},"ranges_unavailable":{"value":0.000000},"ranges_underreplicated":{"value":20.000000},"rebalancing_lease_transfers":{"counter":0.000000},"rebalancing_queriespersecond":{"value":9.125939},"rebalancing_range_rebalances":{"counter":0.000000},"rebalancing_writespersecond":{"value":69.916395},"replicas":{"value":20.000000},"replicas_leaders":{"value":20.000000},"replicas_leaders_not_leaseholders":{"value":0.000000},"replicas_leaseholders":{"value":20.000000},"replicas_quiescent":{"value":20.000000},"replicas_reserved":{"value":0.000000},"requests_backpressure_split":{"value":0.000000},"requests_slow_latch":{"value":0.000000},"requests_slow_lease":{"value":0.000000},"requests_slow_raft":{"value":0.000000},"rocksdb_block_cache_hits":{"value":0.000000},"rocksdb_block_cache_misses":{"value":0.000000},"rocksdb_block_cache_pinned_usage":{"value":0.000000},"rocksdb_block_cache_usage":{"value":0.000000},"rocksdb_bloom_filter_prefix_checked":{"value":0.000000},"rocksdb_bloom_filter_prefix_useful":{"value":0.000000},"rocksdb_compactions":{"value":0.000000},"rocksdb_encryption_algorithm":{"value":0.000000},"rocksdb_flushes":{"value":0.000000},"rocksdb_memtable_total_size":{"value":724080.000000},"rocksdb_num_sstables":{"value":0.000000},"rocksdb_read_amplification":{"value":0.000000},"rocksd
b_table_readers_mem_estimate":{"value":0.000000},"sysbytes":{"value":4630.000000},"syscount":{"value":103.000000},"totalbytes":{"value":129817.000000},"tscache_skl_read_pages":{"value":1.000000},"tscache_skl_read_rotations":{"counter":0.000000},"tscache_skl_write_pages":{"value":1.000000},"tscache_skl_write_rotations":{"counter":0.000000},"txnwaitqueue_deadlocks_total":{"counter":0.000000},"txnwaitqueue_pushee_waiting":{"value":0.000000},"txnwaitqueue_pusher_slow":{"value":0.000000},"txnwaitqueue_pusher_wait_time":{"histogram":{"counts":[0],"values":[0]}},"txnwaitqueue_pusher_waiting":{"value":0.000000},"txnwaitqueue_query_wait_time":{"histogram":{"counts":[0],"values":[0]}},"txnwaitqueue_query_waiting":{"value":0.000000},"valbytes":{"value":96791.000000},"valcount":{"value":736.000000}},"service":{"address":"172.22.0.2:8080","type":"prometheus"}}, Private:interface {}(nil), TimeSeries:true}, Flags:0x0, Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400): {"type":"mapper_parsing_exception","reason":"failed to parse field [prometheus.raft_process_applycommitted_latency.histogram] of type [histogram]","caused_by":{"type":"input_coercion_exception","reason":"Numeric value (9223372036854775801) out of range of int (-2147483648 - 2147483647)\n at [Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 1, column: 855]"}}

@jsoriano jsoriano added bug, review, Metricbeat, needs_backport, [zube]: In Review, Team:Platforms, v7.8.0, v7.7.1 labels Apr 16, 2020
@jsoriano jsoriano self-assigned this Apr 16, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@jsoriano jsoriano merged commit 0afffa8 into elastic:master Apr 17, 2020
@jsoriano jsoriano deleted the fix-prometheus-histogram-rate-overflow branch April 17, 2020 10:03
@jsoriano jsoriano removed the needs_backport label Apr 17, 2020
jsoriano added a commit to jsoriano/beats that referenced this pull request Apr 17, 2020
Fix some overflows in Prometheus histogram rate calculations.
They could be caused by:
* New buckets added to existing histograms at runtime; this happens at
  least with CockroachDB (see elastic#17736).
* Buckets with bigger upper limits reporting lower counters than buckets
  below them. This should not happen and has only been reproduced in tests,
  but it is handled just in case, to avoid losing other data if it happens
  with some service.

Rate calculation methods now also return a boolean, to differentiate whether
a zero value is due to this being the first call or to the rate actually being zero.

(cherry picked from commit 0afffa8)
jsoriano added a commit to jsoriano/beats that referenced this pull request Apr 17, 2020

(cherry picked from commit 0afffa8)
jsoriano added a commit that referenced this pull request Apr 20, 2020
…17783)


(cherry picked from commit 0afffa8)
jsoriano added a commit that referenced this pull request Apr 20, 2020
…17784)


(cherry picked from commit 0afffa8)
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…lows (elastic#17784)


(cherry picked from commit 2d6e0ca)
Labels
bug, Metricbeat, review, Team:Platforms, v7.7.0, v7.7.1, v7.8.0