promtail: ratelimiting by label #7597
Conversation
Force-pushed from 763c685 to 8851e03.
CI coverage check: `./tools/diff_coverage.sh` reports no change in test coverage for any package (ingester, distributor, querier, querier/queryrange, iter, storage, chunkenc, logql, loki all at 0%).
Hi, thanks for your contribution.

```yaml
- limit:
    rate: 10
    burst: 10
    drop: true
    by_label_name: "namespace"
```

Is it possible to achieve the same result with match + limit stages in such a scenario?

```yaml
- match:
    selector: '{namespace="foo"}'
    stages:
      - limit:
          rate: 5
          drop: true
- match:
    selector: '{namespace!="foo"}'
    stages:
      - limit:
          rate: 100
          drop: false
```
No, because in your case `namespace=abc` and `namespace=xyz` will be rate-limited together; only `namespace=foo` will have its own ratelimiter.
If a user configures rate limiting based on a trace ID label, will this PR make promtail OOM?

```yaml
- limit:
    rate: 10
    burst: 10
    drop: true
    by_label_name: "traceID"
```
High-cardinality labels are already on the naughty list of Loki administrators.
@liguozhong, added a max size cap. The implementation is completely free until the size cap is reached, at which point the additional cost is just 1 lookup + 1 insert per in-use label value.
Force-pushed from 6bbb55c to a060796.
CI coverage check: `./tools/diff_coverage.sh` reports querier/queryrange down 0.1% and logql up 0.4%; all other packages unchanged.
👍, thanks for your contribution. Reimplementing a map with limited expiration requires more consideration, such as whether the map needs to be thread-safe. Why not use an existing map?
I considered using an LRU, but decided against it, because LRU is not what we need. We don't have a requirement to evict an entry on insert beyond capacity, and our map values are small enough (a pointer) that we can afford to keep 2x of them with a negligible increase in memory use. The core logic of the capped map ended up being just 13 lines, so I couldn't justify pulling in another, much more complicated dependency, especially given that we don't need LRU.

Keep in mind that the end goal is to prevent a crash on misconfiguration, like your trace ID example, nothing else. The normal mode of operation of a properly configured system will not reach capacity, so we should optimize for that. The proposed solution in a steady state is indistinguishable from a simple map (which it is) in both CPU and memory utilization.

As for multithreading, my understanding is that Stage.Run never runs concurrently, so thread safety is a non-requirement in this particular case.
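(For readers skimming the thread: a minimal sketch of the two-generation idea described above, written against the `NewGenMap` signature quoted later in this review. Field names, the `GetOrCreate` method, and the GC trigger shown here are illustrative, not the PR's exact code.)

```go
// GenerationalMap behaves like a plain map until maxSize entries are in use,
// then demotes the full generation to oldgen; entries not touched again are
// dropped at the next demotion. Sketch only.
type GenerationalMap[K comparable, V any] struct {
	oldgen, newgen map[K]V
	maxSize        int
	newV           func() V // constructs a value on first use
	gcCb           func()   // invoked every time a generation is discarded
}

func NewGenMap[K comparable, V any](maxSize int, newV func() V, gcCb func()) GenerationalMap[K, V] {
	return GenerationalMap[K, V]{
		newgen:  make(map[K]V),
		maxSize: maxSize,
		newV:    newV,
		gcCb:    gcCb,
	}
}

func (m *GenerationalMap[K, V]) GetOrCreate(key K) V {
	if v, ok := m.newgen[key]; ok {
		return v // steady state below the cap: a single map lookup
	}
	v, ok := m.oldgen[key]
	if !ok {
		v = m.newV()
	}
	m.newgen[key] = v // promote (or insert) into the current generation
	if len(m.newgen) == m.maxSize {
		m.oldgen = m.newgen // untouched entries are GCed with this generation
		m.newgen = make(map[K]V, m.maxSize)
		if m.gcCb != nil {
			m.gcCb()
		}
	}
	return v
}
```

In this sketch, below the cap every hit is one map lookup, which is the "indistinguishable from a simple map" property claimed above; only values still in active use survive a demotion.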
These are pipelines, right? Within a pipeline instance, every stage's…
Today I carefully checked the pipeline code, and the 2.7.0 code should be thread-safe. #2996

Good to merge, then? :)
These new metrics may cause Prometheus OOM due to high-cardinality label values.
If we want to introduce a new generationalMap, there are many things to consider, and test cases for generationalMap should be added.
I don't understand what it is for. None of that is required for the current use of it.

I am not familiar with the Prometheus metrics instrumentation; how is the unbounded cardinality problem resolved elsewhere?
clients/pkg/logentry/stages/match.go (outdated)

```go
dropCount := prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "logentry",
	Name:      "dropped_lines_total",
```
It would be better if this PR did not modify match.go. Other parts LGTM.
removed any changes from match.go
LGTM, thanks for your contribution 👍

@kavirajk, would you kindly have a look?
Thanks @redbaron for the contribution and @liguozhong for the review.

Added a few comments of my own. My main concern is what advantage `by_label_name` serves over using a `match` stage with label selectors.
```go
}

// NewGenMap creates a map which maintains at most maxSize recently used entries
func NewGenMap[K comparable, V any](maxSize int, newV func() V, gcCb func()) GenerationalMap[K, V] {
```
This would be our first generic code in Loki, I believe. Will wait for a few more eyes from contributors.

Can we make it a bit more readable, maybe? I'm failing to understand what `newV` and `gcCb` mean here. Can we have better naming and some descriptive comments on those?

If we're going with this generic map, then having a few tests for it would be good.
Just started to think: why do we need this map in the first place? I see it's getting used in only one place and could be a simple `map[string]*rate.Limiter`? Trying to understand the rationale here.
`newV` - new value constructor
`gcCb` - garbage collection callback

I am happy to use any names you suggest for these or any other identifiers.

The purpose of this map is to keep a maximum number of last-used entries and GC the rest. The difference to LRU is that it doesn't pessimize the common case of NOT reaching the limit.
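(A hypothetical usage example of the quoted `NewGenMap` signature, assuming golang.org/x/time/rate; the numbers, `gcCount`, and the `GetOrCreate` method name are placeholders, shown only to illustrate what `newV` and `gcCb` are for:)

```go
// newV builds a fresh ratelimiter the first time a label value is seen;
// gcCb lets the caller observe generation drops, e.g. bump a debug metric.
limiters := NewGenMap[string, *rate.Limiter](
	10000, // cap on distinct label values tracked at once (placeholder)
	func() *rate.Limiter { return rate.NewLimiter(rate.Limit(10), 10) },
	func() { gcCount.Inc() }, // gcCount is a hypothetical prometheus.Counter
)
lim := limiters.GetOrCreate("some-namespace")
_ = lim.Allow()
```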
```go
if cfg.ByLabelName != "" && cfg.MaxDistinctLabels < MinReasonableMaxDistinctLabels {
	level.Warn(logger).Log(
		"msg",
		fmt.Sprintf("max_distinct_labels was adjusted up to the minimal reasonable value of %d", MinReasonableMaxDistinctLabels),
```
Any rationale behind not allowing values < `MinReasonableMaxDistinctLabels`?
`max_distinct_labels` is a safeguard to prevent OOM in case of misconfiguration. There is no preallocation happening, so setting it lower doesn't give you any benefit. The current limit "costs" ~1MiB by my non-scientific calculations when the limit is hit, and users need to increase it should they need more.
> and users need to increase it should they need more.

Just to clarify this bit: users need to increase the limit if the number of actively hit ratelimiters exceeds it, or they'll have inaccurate ratelimiting (more entries will go through than should have). Should some label values be transient and NOT in active use (let's say `pod_name`), then it is perfectly fine for the system to reach the max limit and have them GCed; there is no need to increase the limit to accommodate them.
```go
// RegisterCounterVec registers a new CounterVec with the given name, namespace and labels.
// If the metric was already registered it returns the existing instance.
func RegisterCounterVec(registerer prometheus.Registerer, namespace, name, help string, labels []string) *prometheus.CounterVec {
```
Q: what's the problem with using the existing `prometheus.MustRegister` helper here? I would let it panic in case of duplicate registration rather than handling it via `prometheus.AlreadyRegisteredError`. Am I missing any specific use case here?
This function mimics `getDropcountMetric` from match.go, which this stage used to use. Now that we need 2 metrics, I created this helper function, keeping the old behaviour exactly as is.
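(For context, a sketch of what such an idempotent-registration helper generally looks like with client_golang; this is the assumed pattern, not necessarily the PR's exact code:)

```go
import (
	"errors"

	"github.com/prometheus/client_golang/prometheus"
)

// registerCounterVec registers a CounterVec, or returns the instance that
// was already registered under the same name, instead of panicking the way
// MustRegister would on duplicate registration.
func registerCounterVec(reg prometheus.Registerer, namespace, name, help string, labels []string) *prometheus.CounterVec {
	vec := prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: namespace,
		Name:      name,
		Help:      help,
	}, labels)
	if err := reg.Register(vec); err != nil {
		var are prometheus.AlreadyRegisteredError
		if errors.As(err, &are) {
			// Reuse the collector registered earlier.
			return are.ExistingCollector.(*prometheus.CounterVec)
		}
		panic(err) // a genuinely incompatible registration
	}
	return vec
}
```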
docs hunk (@@ -56,3 +63,18 @@), following "Given the pipeline:":

```
Would throttle any log line and drop logs when rate limited.

#### Ratelimit by a label
```
I think it's worth mentioning what the difference is between these two:

```yaml
- limit:
    rate: 10
    burst: 10
    drop: true
    by_label_name: "namespace"
```

versus

```yaml
- match:
    selector: '{namespace="foo"}'
    stages:
      - limit:
          rate: 5
          drop: true
- match:
    selector: '{namespace!="foo"}'
    stages:
      - limit:
          rate: 100
          drop: false
```

Like @liguozhong already mentioned, even my first thought would be to use the latter config instead of this new one. I see you explained that with `by_label_name` we get a separate ratelimiter per label value. Wondering what advantage that serves?
It allows you to have a ratelimit per label value without knowing the label values upfront. Let's say you can set a ratelimit per namespace or app, but don't need to reconfigure Loki every time a new app comes into the cluster.
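(To illustrate the point, a runnable sketch of per-label-value ratelimiting with golang.org/x/time/rate; the function, variable names, and values here are illustrative, not promtail's actual code:)

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

// limiters holds one token bucket per label value seen so far; the values
// never need to be known when the config is written.
var limiters = map[string]*rate.Limiter{}

func allow(labelValue string, r rate.Limit, burst int) bool {
	l, ok := limiters[labelValue]
	if !ok {
		l = rate.NewLimiter(r, burst) // lazily created on first sight
		limiters[labelValue] = l
	}
	return l.Allow()
}

func main() {
	// A brand-new namespace gets its own budget with zero reconfiguration.
	for i := 0; i < 12; i++ {
		fmt.Println("team-a:", allow("team-a", 10, 10), "team-b:", allow("team-b", 10, 10))
	}
}
```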
I definitely like this feature but I think we could improve the implementation before merging it ;)
PS: generics are one love ;)
```go
m.newgen[key] = v

if len(m.newgen) == m.maxSize {
	m.oldgen = m.newgen
```
Can we add some mechanism to clean up `oldgen` after some time if it's untouched? I think there might be a case where we have a spike of values, so we populate `oldgen`, and after that our values stabilise, but the rate limiters will live in `oldgen` forever...
Potentially I could trigger GC on a timer, but it doesn't feel right to do, because:
- it affects ratelimiting accuracy
- because of the above, the GC period would likely need to be exposed as a configuration variable for users to tune
- the net win is 1-2MiB tops in memory, with no win in CPU utilization

Overall it isn't worth it, IMHO.
Hey @redbaron, thanks for the rapid reply ;)

@redbaron could you please fix the CI https://drone.grafana.net/grafana/loki/17966/3/9 so we can merge it when all the discussions are resolved?

If I read the output correctly, it reports files I didn't change in this PR.
The `goimports` linter reports that imports in clients/pkg/logentry/stages/limit.go must be sorted and split into 3 groups:

```
level=info msg="Fix issue &result.Issue{FromLinter:"goimports", Text:"File is not ...
```

Try to move the imports and re-run the linter.

I will be partially unavailable during the next few days... might be slow in replies... let me know if you need any help.
Force-pushed from d16ab2f to f1190de.
Limit stage now takes an optional `by_label_name` param, which tracks distinct values of that label and ratelimits entries independently. Log lines without the expected label are not considered for ratelimiting and are passed to the next pipeline stage. To avoid stalling the whole stage because of just a few labels, it requires 'drop' to be set to true.
Force-pushed from f1190de to 18cf4f9.
@vlad-diachenko, there were unrelated errors reported, so I squashed and rebased on top of the latest master.
Signed-off-by: Vladyslav Diachenko <vlad.diachenko@grafana.com>
LGTM. Thanks @redbaron 🚀
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
The limit stage now takes an optional `by_label_name` param, which tracks distinct values of that label and ratelimits entries independently. Log lines without the expected label are not considered for ratelimiting and are passed to the next pipeline stage. To avoid stalling the whole stage because of just a few labels, it requires `drop` to be set to true.
Checklist
- CONTRIBUTING.md guide
- CHANGELOG.md updated
- docs/sources/upgrading/_index.md