
promtail: ratelimiting by label #7597

Merged
2 commits merged into grafana:main on Dec 12, 2022

Conversation

@redbaron (Contributor) commented Nov 4, 2022

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

The limit stage can now take an optional by_label_name param, which tracks distinct values of that label and ratelimits entries for each value independently.

Log lines without the expected label are not considered for ratelimiting and are passed to the next pipeline stage.

To avoid stalling the whole stage because of just a few label values, it requires 'drop' to be set to true.
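For illustration, here is a minimal Go sketch of the per-line decision described above, assuming golang.org/x/time/rate for the token bucket; the identifiers are hypothetical, not the PR's actual code:

```go
import "golang.org/x/time/rate"

// limiters holds one token bucket per distinct value of the configured label
// (the PR caps this map; see the generational map discussion below).
var limiters = map[string]*rate.Limiter{}

// keepLine reports whether a log line should pass the limit stage.
func keepLine(labels map[string]string, byLabelName string, r float64, burst int) bool {
	val, ok := labels[byLabelName]
	if !ok {
		// Lines without the expected label are not ratelimited.
		return true
	}
	lim, ok := limiters[val]
	if !ok {
		lim = rate.NewLimiter(rate.Limit(r), burst)
		limiters[val] = lim
	}
	// With drop: true, a false return drops the line, so one noisy
	// label value cannot stall the whole stage.
	return lim.Allow()
}
```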

Checklist

  • Reviewed the CONTRIBUTING.md guide
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md

@redbaron requested a review from a team as a code owner on November 4, 2022 11:29
@CLAassistant commented Nov 4, 2022

CLA assistant check
All committers have signed the CLA.

@github-actions bot added the type/docs label on Nov 4, 2022
@redbaron force-pushed the ratelimit-by-label branch 3 times, most recently from 763c685 to 8851e03, on November 4, 2022 12:04
@grafanabot (Collaborator):

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@liguozhong (Contributor) commented Nov 4, 2022

Hi, thanks for your contribution.
I have a question.
The configuration of this PR is probably like this:

- limit:
    rate: 10
    burst: 10
    drop: true
    by_label_name: "namespace"

Is it possible to achieve the same result with a match + limit stage in such a scenario?

- match:
    selector: '{namespace="foo"}'
    stages:
    - limit:
        rate: 5
        drop: true
- match:
    selector: '{namespace!="foo"}'
    stages:
    - limit:
        rate: 100
        drop: false

@redbaron (Contributor, Author) commented Nov 4, 2022

No, because in your case "namespace=abc" and "namespace=xyz" will be ratelimited together; only "namespace=foo" will have its own ratelimiter.

@liguozhong (Contributor) commented Nov 7, 2022

If a user configures a rate limit based on a traceID label, will this PR make promtail OOM?

- limit:
    rate: 10
    burst: 10
    drop: true
    by_label_name: "traceID"

@redbaron (Contributor, Author) commented Nov 7, 2022

High-cardinality labels are already on the naughty list of Loki administrators.

@redbaron (Contributor, Author) commented Nov 9, 2022

@liguozhong, added a max size cap. The implementation is completely free until the size cap is reached, at which point the additional cost is just one lookup plus one insert per in-use label value.

@grafanabot (Collaborator):

./tools/diff_coverage.sh ../loki-target-branch/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
+        distributor	0%
+            querier	0%
- querier/queryrange	-0.1%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0.4%
+               loki	0%

@liguozhong (Contributor):

> @liguozhong, added a max size cap. The implementation is completely free until the size cap is reached, at which point the additional cost is just one lookup plus one insert per in-use label value.

👍, thanks for your contribution.

Reimplementing a map with limited expiration requires more consideration, such as whether the map needs to be thread-safe. Why not use an existing map?

import lru "github.com/hashicorp/golang-lru"
*lru.Cache

@redbaron (Contributor, Author):

I considered using LRU but decided against it, because LRU is not what we need. We have no requirement to evict an entry on insert beyond capacity, and our map values are small enough (a pointer) that we can afford to keep 2x of them with a negligible increase in memory use.

The core logic of the capped map ended up being just 13 lines, so I couldn't justify pulling in a much more complicated dependency, especially given that we don't need LRU.

Keep in mind that the end goal is to prevent a crash on misconfiguration, like your trace ID example, nothing else. The normal mode of operation of a properly configured system will not reach capacity, so we should optimize for that. The proposed solution in a steady state is indistinguishable from a simple map (which it is) in both CPU and memory utilization.

As for multithreading, my understanding is that Stage.Run never runs concurrently, so thread safety is a non-requirement in this particular case.

@liguozhong (Contributor):

[bugfix] promtail: Concurrent map read and map write in promtail pod #7206

No. The promtail stage is not always thread-safe. See, for example, issue #7180.

@redbaron (Contributor, Author):

These are pipelines, right? Within a pipeline instance, every stage's Run is called once. The limit stage spawns a single goroutine in its Run, so there is no concurrent access to the same stage instance.

@liguozhong (Contributor):

[bugfix] promtail: Concurrent map read and map write in promtail pod #7206

From this issue, in extreme cases the pipeline may not be thread-safe. I'm still analyzing the issue. Why does concurrent map access occur?

@liguozhong (Contributor) commented Nov 16, 2022

Today I carefully checked the pipeline code, and the 2.7.0 code should be thread-safe.

After #2996 introduced a channel, it seems that the concurrent-access problem is solved.

@redbaron (Contributor, Author):

good to merge, then? :)

@liguozhong (Contributor):

This new metric may cause Prometheus OOM due to high-cardinality label values.
Wouldn't it be better to replace the metric with a log line?

@liguozhong (Contributor):

If we want to introduce a new generationalMap, there are many things to consider, such as introducing a callback interface and triggering close hooks to recycle resources when a key expires. This is not an easy thing.

@liguozhong (Contributor):

And add test cases for generationalMap.

@redbaron (Contributor, Author):

> such as introducing a callback interface and triggering close hooks to recycle resources when a key expires. This is not an easy thing.

I don't understand what that would be for. None of it is required for the current use of the map.

@redbaron (Contributor, Author):

> this new metric may cause Prometheus OOM due to high-cardinality label values.

I am not familiar with Prometheus metrics instrumentation; how is the unbounded-cardinality problem resolved elsewhere?


dropCount := prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "logentry",
	Name:      "dropped_lines_total",
Contributor:
It would be better if this PR did not modify match.go. Other parts LGTM.

Contributor (Author):
Removed any changes from match.go.

@redbaron requested a review from liguozhong on November 21, 2022 12:20
@liguozhong (Contributor) left a comment:

LGTM, thanks for your contribution 👍

@redbaron (Contributor, Author):

@kavirajk , would you kindly have a look?

@kavirajk (Contributor) left a comment:

Thanks @redbaron for the contribution and @liguozhong for the review.

I added a few comments of my own. My main concern is what advantage this by_label_name serves over using a match stage with arbitrary labels.

}

// NewGenMap created which maintains at most maxSize recently used entries
func NewGenMap[K comparable, V any](maxSize int, newV func() V, gcCb func()) GenerationalMap[K, V] {
Contributor:

This would be our first generic code in Loki, I believe. I'll wait for a few more eyes from contributors.

Can we make it a bit more readable, maybe? I'm failing to understand what newV and gcCb mean here. Could we have better naming and some descriptive comments on those?

If we're going with this generic map, then having a few tests for it would be good.

Contributor:

Just started to wonder why we need this map in the first place. I see it's used in only one place; could it be a simple map[string]*rate.Limiter? Trying to understand the rationale here.

Contributor (Author):

newV: new value
gcCb: garbage collector callback

I am happy to use any names you suggest for these or any other identifiers.

The purpose of this map is to keep a maximum number of recently used entries and GC the rest. The difference from LRU is that it doesn't pessimize the common case of NOT reaching the limit.
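For readers following the thread, here is a minimal sketch of the two-generation idea, reconstructed from the fragments quoted in this review (the PR's actual code may differ in detail):

```go
// GenerationalMap keeps at most ~2*maxSize entries: a "new" generation that
// absorbs writes and an "old" generation kept around for lookups. When newgen
// fills up it becomes oldgen and the previous oldgen is dropped, GCing any
// entry not touched during the last generation.
type GenerationalMap[K comparable, V any] struct {
	oldgen  map[K]V
	newgen  map[K]V
	maxSize int
	newV    func() V // constructor for missing values
	gcCb    func()   // callback invoked on each generation swap
}

// NewGenMap creates a map which maintains at most maxSize recently used entries.
func NewGenMap[K comparable, V any](maxSize int, newV func() V, gcCb func()) GenerationalMap[K, V] {
	return GenerationalMap[K, V]{
		newgen:  make(map[K]V),
		maxSize: maxSize,
		newV:    newV,
		gcCb:    gcCb,
	}
}

// GetOrCreate returns the value for key, promoting it from oldgen or creating
// it if absent. Until the cap is hit this is a plain map lookup.
func (m *GenerationalMap[K, V]) GetOrCreate(key K) V {
	v, ok := m.newgen[key]
	if !ok {
		if v, ok = m.oldgen[key]; !ok {
			v = m.newV()
		}
		m.newgen[key] = v
		if len(m.newgen) == m.maxSize {
			m.oldgen = m.newgen
			m.newgen = make(map[K]V)
			if m.gcCb != nil {
				m.gcCb()
			}
		}
	}
	return v
}
```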

if cfg.ByLabelName != "" && cfg.MaxDistinctLabels < MinReasonableMaxDistinctLabels {
	level.Warn(logger).Log(
		"msg",
		fmt.Sprintf("max_distinct_labels was adjusted up to the minimal reasonable value of %d", MinReasonableMaxDistinctLabels),
Contributor:

Any rationale behind not allowing values below MinReasonableMaxDistinctLabels?

Contributor (Author):

max_distinct_labels is a safeguard to prevent OOM in case of misconfiguration. There is no preallocation happening, so setting it lower doesn't give you any benefit. The current limit "costs" ~1MiB, by my non-scientific calculations, when the limit is hit, and users need to increase it should they need more.

Contributor (Author):

> and users need to increase it should they need more.

Just to clarify this bit: users need to increase the limit if the number of actively hit ratelimiters exceeds it, or they'll get inaccurate ratelimiting (more entries will go through than should have). Should some label values be transient and NOT in active use (say, pod_name), then it is perfectly fine for the system to reach the max limit and have them GCed; there is no need to increase the limit to accommodate them.


// RegisterCounterVec registers new CounterVec with given name,namespace and labels.
// If metric was already registered it returns existing instance.
func RegisterCounterVec(registerer prometheus.Registerer, namespace, name, help string, labels []string) *prometheus.CounterVec {
Contributor:

Q: what's the problem with using the existing prometheus.MustRegister helper here? I would let it panic in case of duplicate registration rather than handling it via prometheus.AlreadyRegisteredError. Am I missing any specific use case here?

@redbaron (Contributor, Author) commented Dec 6, 2022

This function mimics getDropcountMetric from match.go, which this stage used to use. Now that we need two metrics, I created this helper function, keeping the old behaviour exactly as is.
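For context, the register-or-reuse pattern under discussion typically looks like the following sketch (an assumed shape, not necessarily the PR's exact code); it leans on prometheus.AlreadyRegisteredError, which the client library returns on duplicate registration:

```go
import "github.com/prometheus/client_golang/prometheus"

// RegisterCounterVec registers a new CounterVec with the given name, namespace,
// and labels. If a metric with the same name was already registered, it
// returns the existing instance instead of panicking.
func RegisterCounterVec(registerer prometheus.Registerer, namespace, name, help string, labels []string) *prometheus.CounterVec {
	vec := prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: namespace,
		Name:      name,
		Help:      help,
	}, labels)
	if err := registerer.Register(vec); err != nil {
		if existing, ok := err.(prometheus.AlreadyRegisteredError); ok {
			// Reuse the collector registered earlier.
			vec = existing.ExistingCollector.(*prometheus.CounterVec)
		} else {
			panic(err)
		}
	}
	return vec
}
```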

@@ -56,3 +63,18 @@ Given the pipeline:
```

Would throttle any log line and drop logs when rate limit.

#### Ratelimit by a label
Contributor:

I think it's worth mentioning what the difference is between these two:

- limit:
    rate: 10
    burst: 10
    drop: true
    by_label_name: "namespace"

with this

- match:
    selector: '{namespace="foo"}'
    stages:
    - limit:
        rate: 5
        drop: true
- match:
    selector: '{namespace!="foo"}'
    stages:
    - limit:
        rate: 100
        drop: false

Like @liguozhong already mentioned, my first thought would also be to use the latter config instead of this new one.

I see you explained that with by_label_name we get a separate ratelimiter per label value.

Wondering what advantage that serves?

Contributor (Author):

It allows you to have a ratelimit per label value without knowing the values upfront. Let's say you set a ratelimit per namespace or app: you don't need to reconfigure Loki every time a new app comes into the cluster.

@vlad-diachenko (Contributor) left a comment:

I definitely like this feature but I think we could improve the implementation before merging it ;)

PS: generics are one love ;)

m.newgen[key] = v

if len(m.newgen) == m.maxSize {
	m.oldgen = m.newgen
Contributor:

Can we add some mechanism to clean up oldgen after some time if it's untouched?
I think there might be a case where we get a spike of values, so we populate oldgen, and after the values stabilise the rateLimiters will live in oldgen forever...

Contributor (Author):

Potentially I could trigger GC on a timer, but it doesn't feel right to do, because:

  • it affects ratelimiting accuracy
  • because of the above, the GC period would likely need to be exposed as a configuration variable for users to tune
  • the net win is 1-2MiB of memory at most, with no win in CPU utilization

Overall it isn't worth it, IMHO.

@vlad-diachenko (Contributor):

hey @redbaron, thanks for the rapid reply ;)
In general, I misunderstood the goal of this implementation. Now it's much clearer. I will double-check the PR and clean up my irrelevant comments.

@vlad-diachenko (Contributor):

@redbaron, could you please fix the CI (https://drone.grafana.net/grafana/loki/17966/3/9) so we can merge it once all the discussions are resolved?

@redbaron (Contributor, Author) commented Dec 7, 2022

> hey @redbaron, thanks for the rapid reply ;) In general, I misunderstood the goal of this implementation. Now it's much clearer. I will double-check the PR and clean up my irrelevant comments.

If I read the output correctly, the CI reports files I didn't change in this PR.

@vlad-diachenko (Contributor):

> hey @redbaron, thanks for the rapid reply ;) In general, I misunderstood the goal of this implementation. Now it's much clearer. I will double-check the PR and clean up my irrelevant comments.
>
> If I read the output correctly, the CI reports files I didn't change in this PR.

level=info msg="Fix issue &result.Issue{FromLinter:"goimports", Text:"File is not goimports-ed with -local github.com/grafana/loki/pkg,github.com/grafana/loki/tools", Severity:"", SourceLines:[]string{""}, Replacement:(*result.Replacement)(0xc098805850), Pkg:(*packages.Package)(0xc001b4f380), LineRange:(*result.Range)(nil), Pos:token.Position{Filename:"clients/pkg/logentry/stages/limit.go", Offset:0, Line:9, Column:0}, HunkPos:0, ExpectNoLint:false, ExpectedNoLintLinter:""} with range {9 9}"

clients/pkg/logentry/stages/limit.go

Imports must be sorted and split into 3 groups:

  1. standard libraries
  2. packages from this/grafana repo
  3. external modules

Try moving the imports and running the make lint command; see the sketch below.
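A grouping along those lines might look like this sketch (package paths other than the Loki one are placeholders, and the exact group order is whatever goimports -local enforces):

```go
import (
	// 1. standard libraries
	"fmt"
	"time"

	// 2. packages from this/grafana repo
	"github.com/grafana/loki/pkg/logproto"

	// 3. external modules
	"golang.org/x/time/rate"
)
```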

I will be partially unavailable during the next few days... might be slow in replies... let me know if you need any help

@redbaron force-pushed the ratelimit-by-label branch 2 times, most recently from d16ab2f to f1190de, on December 7, 2022 16:03
The limit stage can now take an optional `by_label_name` param, which
tracks distinct values of that label and ratelimits entries for each
value independently.

Log lines without the expected label are not considered for ratelimiting
and are passed to the next pipeline stage.

To avoid stalling the whole stage because of just a few label values,
it requires 'drop' to be set to true.
@redbaron (Contributor, Author) commented Dec 7, 2022

@vlad-diachenko, there were unrelated errors reported, so I squashed and rebased on top of the latest master.

Signed-off-by: Vladyslav Diachenko <vlad.diachenko@grafana.com>
@vlad-diachenko (Contributor) left a comment:

LGTM. Thanks @redbaron 🚀

@vlad-diachenko merged commit b773a7a into grafana:main on Dec 12, 2022
@redbaron deleted the ratelimit-by-label branch on December 12, 2022 13:03
Labels: size/L, type/docs
7 participants