Stats cache can be configured independently #9535

salvacorts · 2023-05-26T12:36:35Z

What this PR does / why we need it:

Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since:

We would not be able to point to a different cache for storing the index stats if needed.
We would not be able to add specific settings for this cache, without adding it to the results cache.

In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings.
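The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `CacheConfig`, `isConfigured`, and `effectiveStatsCacheConfig` are hypothetical names, and "configured" is assumed to mean "a backend was set".

```go
package main

import "fmt"

// CacheConfig is a hypothetical, heavily simplified stand-in for a cache
// configuration block. The field names are illustrative only.
type CacheConfig struct {
	Backend     string // e.g. a memcached address; empty means "not configured"
	Compression string // "snappy" or "" (no compression)
}

// isConfigured reports whether anything was set for this cache.
// Assumption: an empty Backend means the block was left out entirely.
func (c CacheConfig) isConfigured() bool {
	return c.Backend != ""
}

// effectiveStatsCacheConfig returns the stats cache config when it is set,
// and otherwise falls back to the results cache config.
func effectiveStatsCacheConfig(stats, results CacheConfig) CacheConfig {
	if stats.isConfigured() {
		return stats
	}
	return results
}

func main() {
	results := CacheConfig{Backend: "memcached-results", Compression: "snappy"}

	// No stats cache config: the results cache settings are reused.
	fmt.Println(effectiveStatsCacheConfig(CacheConfig{}, results).Backend)

	// Dedicated stats cache config: it takes precedence.
	dedicated := CacheConfig{Backend: "memcached-stats"}
	fmt.Println(effectiveStatsCacheConfig(dedicated, results).Backend)
}
```

Resolving the fallback in one place keeps the rest of the code unaware of whether the stats cache has its own settings.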

Which issue(s) this PR fixes:
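This is needed for:

Add config to avoid caching stats for recent data #9537
Add summary stats and metrics for stats cache #9536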

Special notes for your reviewer:

This PR also refactors all the tripperwares in roundtrip.go to reuse the same stats tripperware instead of each one creating its own.
Configuring a new cache in roundtrip.go is a requirement for Add summary stats and metrics for stats cache #9536 so the stats summary can distinguish between the stats cache and the results cache.

Checklist

Reviewed the CONTRIBUTING.md guide (required)
Documentation added
Tests updated
CHANGELOG.md updated
Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md
For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

…erences between results cache and stats cache

JStickler

[docs team] LGTM, one question for clarification.

JStickler · 2023-06-05T14:36:27Z

docs/sources/configuration/_index.md

+  [cache: <cache_config>]
+
+  # Use compression in results cache. Supported values are: 'snappy' and ''
+  # (disable compression).


Supported values are: 'snappy' and ' '

Does ' ' disable the compression? Would it be clearer to say that "A null value will disable compression."?

This is inherited from the docs for the results cache (we reuse the structure and therefore generate the same docs).

Does ' ' disable the compression?

Yes

Would it be clearer to say that "A null value will disable compression."?

I think the use of "null" may be confusing here. Someone may write "null" to disable it, but that would yield an error. Instead, I think we should say: An empty value (i.e. "") will disable compression.

Having said that, we should probably apply that change on a separate PR so we can update both the docs for this and the results cache. Wdyt?

Separate PR is fine. I figure if I was confused by what value we meant there, then customers could be confused too. (And good point about someone entering "null" as the value.)

Created:

Improve docs for empty value in cache compression config #9649
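To make the point of this thread concrete, here is a small sketch of how a compression setting that accepts only 'snappy' and the empty string might be validated. `validateCompression` is a hypothetical helper written for illustration, not the function used in Loki; it only shows why the literal string "null" would be rejected while an empty value disables compression.

```go
package main

import "fmt"

// validateCompression is a hypothetical validator for the compression setting.
// It accepts "snappy" and the empty string (compression disabled) and rejects
// everything else, including the literal string "null".
func validateCompression(value string) error {
	switch value {
	case "snappy", "":
		return nil
	default:
		return fmt.Errorf("unsupported compression type: %q", value)
	}
}

func main() {
	for _, v := range []string{"snappy", "", "null"} {
		fmt.Printf("%q -> %v\n", v, validateCompression(v))
	}
}
```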

…rics

**What this PR does / why we need it**:

When a query finishes, we return (and log) the following stats:

```go
Cache.Chunk.Requests 0
Cache.Chunk.EntriesRequested 0
Cache.Chunk.EntriesFound 0
Cache.Chunk.EntriesStored 0
Cache.Chunk.BytesSent 0 B
Cache.Chunk.BytesReceived 0 B
Cache.Chunk.DownloadTime 0s
Cache.Index.Requests 0
Cache.Index.EntriesRequested 0
Cache.Index.EntriesFound 0
Cache.Index.EntriesStored 0
Cache.Index.BytesSent 0 B
Cache.Index.BytesReceived 0 B
Cache.Index.DownloadTime 0s
Cache.Result.Requests 13
Cache.Result.EntriesRequested 13
Cache.Result.EntriesFound 13
Cache.Result.EntriesStored 0
Cache.Result.BytesSent 0 B
Cache.Result.BytesReceived 2.5 kB
Cache.Result.DownloadTime 4.600266ms
```

In addition to that, we log the following in metrics.go:

```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 limit=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```

To better monitor how the stats cache is performing, this PR adds stats for the index stats cache, similar to how it's done for the results cache.

Here's an example of the new stats being returned and printed:

```go
...
Cache.StatsResult.Requests 180
Cache.StatsResult.EntriesRequested 129
Cache.StatsResult.EntriesFound 129
Cache.StatsResult.EntriesStored 51
Cache.StatsResult.BytesSent 0 B
Cache.StatsResult.BytesReceived 75 kB
...
```

And the new stats from metrics.go:

```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_time=156.864429ms ...
```

**Special notes for your reviewer**:

- Blocked by #9535
- Note the new `stats.GetOrCreateContext` func. It's used inside the `query.Exec` method so we don't overwrite the stats added in the stats middleware.

**Checklist**

- [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)
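The "get or create" idea behind `stats.GetOrCreateContext` can be illustrated with a generic Go context pattern. The sketch below is an assumption about the general shape only: the `Stats` type, the `ctxKey` key, `getOrCreateContext`, and `demo` are hypothetical and much simpler than anything in Loki's `stats` package.

```go
package main

import (
	"context"
	"fmt"
)

// ctxKey is an unexported key type so other packages cannot collide with it.
type ctxKey struct{}

// Stats is a hypothetical, simplified stats holder.
type Stats struct {
	StatsResultRequests int
}

// getOrCreateContext returns the Stats already attached to ctx, if any.
// Only when none is present does it create a fresh one and attach it to a
// derived context. Values recorded earlier are therefore never overwritten.
func getOrCreateContext(ctx context.Context) (*Stats, context.Context) {
	if s, ok := ctx.Value(ctxKey{}).(*Stats); ok && s != nil {
		return s, ctx
	}
	s := &Stats{}
	return s, context.WithValue(ctx, ctxKey{}, s)
}

// demo mimics a middleware recording stats before the query executes.
func demo() (sameInstance bool, requests int) {
	// The middleware runs first and records cache stats.
	mw, ctx := getOrCreateContext(context.Background())
	mw.StatsResultRequests = 180

	// The execution path asks again and receives the same holder.
	exec, _ := getOrCreateContext(ctx)
	return exec == mw, exec.StatsResultRequests
}

func main() {
	fmt.Println(demo())
}
```

A plain "always create" helper at the second call site would replace the holder in the context and drop the 180 requests recorded by the middleware, which is the overwrite the note above is about.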

**What this PR does / why we need it**:

When we query the stats for recent data, we query both the ingesters and the index gateways for the stats.

https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L112-L114
https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L126-L127

Then we merge all the responses, which means summing up all the stats.

https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L157-L158
https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/stores/index/stats/stats.go#L23-L26

Because we have a replication factor of 3, we will get the stats from the ingesters repeated up to 3 times, hence inflating the stats.

In the stats cache, we store the stats for a given matcher set for the whole day. We then extract the stats from the cache, scaling them by the fraction of the cached time range that the request covers:

https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L33
https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L40

Inflated stats for recent data will be cached, so subsequent stats extracted from the cache will be inflated regardless of the time.

This PR adds a new per-tenant limit `max_stats_cache_freshness` to not cache requests with an end time that falls within Now minus this duration.

Here's a scenario illustrating this. The graphs below show the bytes stats queried in the sharding middleware. Every 5 seconds, we run a log filter query with a length of 3h that won't match any log line.

![image](https://github.com/grafana/loki/assets/8354290/45c2e6e9-185c-4a18-b290-47da27fc3e39)

As can be seen, after enabling the stats cache and configuring `do_not_cache_request_within` to not cache stats for requests within 30m, the bytes stats used in the sharding middleware stopped increasing. In both cases the stats cache hit ratio was 100%.

![image](https://github.com/grafana/loki/assets/8354290/cd35bcb8-0c77-4693-a06b-502741fd6e23)

**Special notes for your reviewer**:

- Blocked by #9535
- Note that this PR doesn't fix the root issue of inflated stats from the ingesters, but rather buys us some time to work on that.

**Checklist**

- [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)
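Two ideas from this description can be sketched in a few lines: summing replicated ingester responses inflates the stats, and a freshness limit keeps recent requests out of the cache. The code below is an illustrative sketch under stated assumptions, not the code in this PR. `Stats`, `mergeStats`, `cacheable`, and `shouldCacheStats` are hypothetical names, and treating a request that ends exactly at the limit as cacheable is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Stats is a hypothetical, simplified index stats response.
type Stats struct {
	Bytes uint64
}

// mergeStats sums all responses. If the same data is reported by several
// replicas, it is counted once per replica.
func mergeStats(responses ...Stats) Stats {
	var merged Stats
	for _, r := range responses {
		merged.Bytes += r.Bytes
	}
	return merged
}

// cacheable reports whether a request whose end time lies endAge in the past
// is old enough to be cached under the given freshness limit.
func cacheable(endAge, maxFreshness time.Duration) bool {
	return endAge >= maxFreshness
}

// shouldCacheStats applies the freshness limit to concrete timestamps:
// requests ending within (now - maxFreshness, now] are not cached.
func shouldCacheStats(now, end time.Time, maxFreshness time.Duration) bool {
	return cacheable(now.Sub(end), maxFreshness)
}

func main() {
	// With a replication factor of 3, up to three ingesters report the same
	// 100 bytes, so the merged value is 3x the real one.
	reply := Stats{Bytes: 100}
	fmt.Println(mergeStats(reply, reply, reply).Bytes)

	now := time.Now()
	// A request ending 10m ago falls within a 30m freshness limit: not cached.
	fmt.Println(shouldCacheStats(now, now.Add(-10*time.Minute), 30*time.Minute))
	// A request ending 2h ago is outside the limit: cached.
	fmt.Println(shouldCacheStats(now, now.Add(-2*time.Hour), 30*time.Minute))
}
```

Skipping the cache for recent requests does not remove the inflation itself. It only prevents the inflated values from being stored for the whole day and reused by later requests.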

**What this PR does / why we need it**:

Follow up PR for #9535 (comment)

**Checklist**

- [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [x] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

**What this PR does / why we need it**:

Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since:

1. We would not be able to point to a different cache for storing the index stats if needed.
2. We would not be able to add specific settings for this cache, without adding it to the results cache.

In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings.

**Which issue(s) this PR fixes**:

This is needed for:

- #9537
- #9536

**Special notes for your reviewer**:

- This PR also refactors all the tripperwares in roundtrip.go to reuse the same stats tripperware instead of each one creating its own.
- Configuring a new cache in roundtrip.go is a requirement for #9536 so the stats summary can distinguish between the stats cache and the results cache.

**Checklist**

- [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)


commit 065bee7 Author: Travis Patterson <travis.patterson@grafana.com> Date: Mon Jun 12 10:21:58 2023 -0600 Label Volume Endpoint (#9588) For a given set of matchers, returns the top N associated label/value pairs by volume. A query for `{cluster=prod}` will return ``` cluster=prod: size (total logs matching this matcher) . . . nth-label=nth-value ``` This is to service use cases where users want to understand where their log volume has come from by label without making multiple requests to the stats endpoint. Note: This PR is a monster but it's mostly plumbing. I've pointed out the most interesting bits that actually get the volumes from ingesters/indexs commit 4d997a5 Author: Piotr <17101802+thampiotr@users.noreply.github.com> Date: Mon Jun 12 16:24:26 2023 +0100 Fix promtail cluster template not finding all clusters. (#9684) **What this PR does / why we need it**: In promtail-mixin, the dropdown template for clusters would only include clusters that run loki. So if a cluster only run promtail and not loki, it doesn't appear. 
commit 57f9452 Author: Kaviraj Kanagaraj <kavirajkanagaraj@gmail.com> Date: Mon Jun 12 15:21:08 2023 +0200 Revert 9217 chaudum/tsdb chunkrefs pool (#9685) Revert #9217 (potential bug in query result) --------- Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com> commit 73ac208 Author: Salva Corts <salva.corts@grafana.com> Date: Mon Jun 12 10:46:30 2023 +0200 Improve docs for empty value in cache compression config (#9649) **What this PR does / why we need it**: Follow up PR for #9535 (comment) **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e) commit f239435 Author: Christophe Collot <52134228+CCOLLOT@users.noreply.github.com> Date: Fri Jun 9 15:00:31 2023 +0200 feat(lambda-promtail): add cloudfront log file ingestion support (#9573) **What this PR does / why we need it**: This PR enables ingesting logs from Cloudfront log files stored in s3 (batch). The current setup only supports streaming Cloudfront logs through AWS Kinesis, whereas this PR implements the same flow as for VPC Flow logs, Load Balancer logs, and Cloudtrail logs (s3 --> SQS (optional) --> Lambda Promtail --> Loki) **Special notes for your reviewer**: + The Cloudfront log file format is different from the already implemented services, meaning we had to build yet another regex. AWS never bothered making all services follow the same log file naming convention but the "good" thing is that it's now very unlikely they will change it in the future. 
+ The Cloudfront file name does not have any mention of the AWS account or the time of log it contains, it means we have to infer the log type from the filename format instead of finding the exact string "cloudfront" in the filename. This is why in `getLabels`, if no `type` parameter is found in the regex, we use the key corresponding to the name of the matching parser. + I introduced a new `parser` struct to group together several parameters specific to a type of log (and avoid relying too much on map key string matching and / or if statements for specific use cases) + I've been successfully running this code in several AWS environments for days. + I corrected a typo from my previous PR #9497 (wrong PR number in Changelog.md) **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [x] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [x] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e) --------- Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com> commit c6fbff2 Author: Salva Corts <salva.corts@grafana.com> Date: Fri Jun 9 14:40:36 2023 +0200 Add config to avoid caching stats for recent data (#9537) **What this PR does / why we need it**: When we query the stats for recent data, we query both the ingesters and the index gateways for the stats. 
https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L112-L114 https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L126-L127 Then we merge all the responses, which means summing up all the stats https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L157-L158 https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/stores/index/stats/stats.go#L23-L26 Because we have a replication factor of 3, this means that we will get the stats from the ingesters repeated up to 3 times, hence inflating the stats. In the stats cache, we store the stats for a given matcher set for the whole day, then we extract the stats from the cache by the factor of time from the request that is stored in the cache: https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L33 https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L40 Inflated stats for recent data will be cached, so subsequent stats extracted from the cache will be inflated regardless of the time. This PR adds a new per-tenant limit `max_stats_cache_freshness` to not cache requests with an end time that falls within Now minus this duration. Here's a scenario illustrating this. The graphs below show the bytes stats queried in the sharding middleware. We are running a log filter query that won't match any log, every 5 seconds with a length of 3h. ![image](https://github.com/grafana/loki/assets/8354290/45c2e6e9-185c-4a18-b290-47da27fc3e39) As can be seen, after enabling the stats cache and configuring`do_not_cache_request_within` to not cache stats for requests within 30m, the bytes stats used in the sharding middleware stopped increasing. In both cases the stats cache hit ration was 100%. 
![image](https://github.com/grafana/loki/assets/8354290/cd35bcb8-0c77-4693-a06b-502741fd6e23) **Special notes for your reviewer**: - Blocked by #9535 - Note that this PR doesn't fix the root issue of inflated stats form the ingesters, but rather buys us some time to work on that. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e) commit 22779e1 Author: Michel Hollands <42814411+MichelHollands@users.noreply.github.com> Date: Fri Jun 9 13:33:15 2023 +0100 Fix date template function with epoch times (#8886) **What this PR does / why we need it**: Adds new toUnixEpoch... functions to convert from a string with a Unix/Epoch time to an integer that can be used in the existing `toDate` function. Note that these are the opposites of some of the functions introduced in #8774. **Which issue(s) this PR fixes**: Fixes #8624. 
**Special notes for your reviewer**: **Checklist** - [ ] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [X] Documentation added - [X] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` --------- Signed-off-by: Michel Hollands <michel.hollands@grafana.com> commit 1b410db Author: Bruno FERNANDO <bruno.fernando@jobteaser.com> Date: Fri Jun 9 13:48:42 2023 +0200 feat(promtail): add CF ClientRequestSource field (#9669) **What this PR does / why we need it**: Hey folks 👋 Little contribution here to add a useful log field for cloudflare users. Indeed I add the [ClientRequestSource field](https://developers.cloudflare.com/logs/reference/clientrequestsource/ ) which is pretty useful when debugging some specific traffic handled by cloudflare Extra: Since I was on the documentation I fixed an indentation issue that I spotted Don't hesitate to reach me if you have any questions Cheers 😉 **Which issue(s) this PR fixes**: Fixes #<issue number> **Special notes for your reviewer**: Loki rocks 🚀 **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [ ] Tests updated - [ ] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. 
[Example PR](d10549e) commit b1917a6 Author: Gregor Zeitlinger <gregor.zeitlinger@grafana.com> Date: Fri Jun 9 13:38:21 2023 +0200 add "alignLeft" and "alignRight" functions (#9672) Fixes #9667 commit 98d1307 Author: Ashwanth <iamashwanth@gmail.com> Date: Fri Jun 9 12:46:38 2023 +0530 config: ensure storage config defaults apply to named stores (#9650) **What this PR does / why we need it**: Since named store config does not register any flags, storage configs defined under it do not get the defaults. For example [aws_storage_config](https://grafana.com/docs/loki/latest/configuration/#aws_storage_config) sets the default `storage_class` to `STANDARD`, but the same doesn't get applied by default when using named stores. This PR ensures that named storage configs are always assigned default values when they are unmarshalled by implementing `yaml.Unmarshaler` interface commit 4cebc2d Author: Pepe Cano <825430+ppcano@users.noreply.github.com> Date: Thu Jun 8 21:44:00 2023 +0200 Docs: replace `k6 Cloud` mention (#9599) k6 is now available as a managed service on Grafana Cloud. This is a small doc changes to remove the mention of `k6 Cloud`. --------- Co-authored-by: J Stickler <julie.stickler@grafana.com> commit 1db560f Author: Danny Kopping <danny.kopping@grafana.com> Date: Thu Jun 8 14:19:58 2023 +0200 Adding background cache (en|de)queue counters (#9665) **What this PR does / why we need it**: The background writeback cache exposes gauge metric currently for the current queue size. Gauges can be useful, but they are susceptible to sample errors because they only represent the point in time as the time of the scrape. Exposing counters for the bytes (en|de)queued to/from the cache will be more useful because they can be aggregated. 
Signed-off-by: Danny Kopping <danny.kopping@grafana.com> commit 609bc22 Author: Dylan Guedes <djmgguedes@gmail.com> Date: Thu Jun 8 09:04:45 2023 -0300 Distributor: Make key configurable when logging failures (#9659) **What this PR does / why we need it**: Make appending `insight=true` key-value pair to log failures configurable. **Which issue(s) this PR fixes**: N/A commit d581258 Author: Nils Griebner <nils@nils-griebner.de> Date: Thu Jun 8 11:30:52 2023 +0200 Make table manager retention options configurable in helm chart values (#9647) **What this PR does / why we need it**: Configuration options for table manager retention are hard-coded in the helm chart values at the moment so that it's not possible to enable retention deletes. **Which issue(s) this PR fixes**: Fixes #8676 **Special notes for your reviewer**: - **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [x] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [x] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e) Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>

**What this PR does / why we need it**: Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since: 1. We would not be able to point to a different cache for storing the index stats if needed. 2. We would not be able to add specific settings for this cache, without adding it to the results cache. In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings. **Which issue(s) this PR fixes**: This is needed for: - #9537 - #9536 **Special notes for your reviewer**: - This PR also refactors all the tripperwares in rountrip.go to reuse the same stats tripperware instead of each one creating their own. - Configuring a new cache in rountrip.go is a requirement for #9536 so the stats summary can distinguish before the stats cache and the results cache. **Checklist** - [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**) - [x] Documentation added - [x] Tests updated - [x] `CHANGELOG.md` updated - [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md` - [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

**What this PR does / why we need it**:

When a query finishes, we return (and log) the following stats:

```go
Cache.Chunk.Requests 0
Cache.Chunk.EntriesRequested 0
Cache.Chunk.EntriesFound 0
Cache.Chunk.EntriesStored 0
Cache.Chunk.BytesSent 0 B
Cache.Chunk.BytesReceived 0 B
Cache.Chunk.DownloadTime 0s
Cache.Index.Requests 0
Cache.Index.EntriesRequested 0
Cache.Index.EntriesFound 0
Cache.Index.EntriesStored 0
Cache.Index.BytesSent 0 B
Cache.Index.BytesReceived 0 B
Cache.Index.DownloadTime 0s
Cache.Result.Requests 13
Cache.Result.EntriesRequested 13
Cache.Result.EntriesFound 13
Cache.Result.EntriesStored 0
Cache.Result.BytesSent 0 B
Cache.Result.BytesReceived 2.5 kB
Cache.Result.DownloadTime 4.600266ms
```

In addition to that, we log the following in metrics.go:

```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 limit=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```

With the goal of being able to better monitor how the stats cache is performing, this PR adds stats for the index stats cache, similarly to how it's done for the results cache. Here's an example of the new stats being returned and printed:

```go
...
Cache.StatsResult.Requests 180
Cache.StatsResult.EntriesRequested 129
Cache.StatsResult.EntriesFound 129
Cache.StatsResult.EntriesStored 51
Cache.StatsResult.BytesSent 0 B
Cache.StatsResult.BytesReceived 75 kB
...
```

And the new stats from metrics.go:

```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_time=156.864429ms ...
```

**Special notes for your reviewer**:

- Blocked by #9535
- Note the new `stats.GetOrCreateContext` func. It's used inside the `query.Exec` method so we don't overwrite the stats added in the stats middleware.

**Checklist**

- [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

**What this PR does / why we need it**:

When we query the stats for recent data, we query both the ingesters and the index gateways for the stats.

https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L112-L114

https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L126-L127

Then we merge all the responses, which means summing up all the stats.

https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L157-L158

https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/stores/index/stats/stats.go#L23-L26

Because we have a replication factor of 3, this means that we will get the stats from the ingesters repeated up to 3 times, hence inflating the stats.

In the stats cache, we store the stats for a given matcher set for the whole day. We then extract the stats for a request by scaling the cached stats by the fraction of the cached time range that the request covers:

https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L33

https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L40

Inflated stats for recent data will be cached, so subsequent stats extracted from the cache will be inflated regardless of the time.

This PR adds a new per-tenant limit `max_stats_cache_freshness` so that requests whose end time is more recent than now minus this duration are not cached.

Here's a scenario illustrating this. The graphs below show the bytes stats queried in the sharding middleware. We are running a log filter query that won't match any log, every 5 seconds with a length of 3h.

![image](https://github.com/grafana/loki/assets/8354290/45c2e6e9-185c-4a18-b290-47da27fc3e39)

As can be seen, after enabling the stats cache and configuring `do_not_cache_request_within` to not cache stats for requests within 30m, the bytes stats used in the sharding middleware stopped increasing. In both cases the stats cache hit ratio was 100%.

![image](https://github.com/grafana/loki/assets/8354290/cd35bcb8-0c77-4693-a06b-502741fd6e23)

**Special notes for your reviewer**:

- Blocked by #9535
- Note that this PR doesn't fix the root issue of inflated stats from the ingesters, but rather buys us some time to work on that.

**Checklist**

- [x] Reviewed the [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md) guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in `production/helm/loki/Chart.yaml` and update `production/helm/loki/CHANGELOG.md` and `production/helm/loki/README.md`. [Example PR](d10549e)

Adds changes from:

- #9535
- #9536
- #9537
- #9529
- #9552

So we can use k150 with newer config.


Stats cache can be configured independently. Query summary stats differences between results cache and stats cache

3b58b61

pull-request-size bot added size/XL size/L and removed size/XL labels May 26, 2023

Remove stats and metrics for index stats cache

ccf3fab

salvacorts force-pushed the salvacorts/stats-cache-summary-stats-and-metrics branch from e3475a7 to ccf3fab Compare May 26, 2023 14:04

salvacorts added 2 commits May 26, 2023 16:26

Update docs

8da8fc3

Fix fmt issues

545b63b

github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label May 26, 2023

This was referenced May 26, 2023

Add summary stats and metrics for stats cache #9536

Merged

Add config to avoid caching stats for recent data #9537

Merged

salvacorts changed the title ~~Stats cache can be configured independently and reports metrics/stats for the cache~~ Stats cache can be configured independently May 26, 2023

Update changelog

1215fe4

salvacorts marked this pull request as ready for review June 2, 2023 07:04

salvacorts requested review from JStickler and a team as code owners June 2, 2023 07:04

JStickler approved these changes Jun 5, 2023


owen-d approved these changes Jun 6, 2023


Merge branch 'main' into salvacorts/stats-cache-summary-stats-and-metrics

e17d239

salvacorts merged commit 1694ad0 into main Jun 7, 2023

salvacorts deleted the salvacorts/stats-cache-summary-stats-and-metrics branch June 7, 2023 09:00

salvacorts mentioned this pull request Jun 7, 2023

Improve docs for empty value in cache compression config #9649

Merged

6 tasks

salvacorts mentioned this pull request Jun 13, 2023

k150 with stats cache changes #9680

Merged

salvacorts added a commit that referenced this pull request Jun 13, 2023

k150 with stats cache changes (#9680)

ff001a3

