Stats cache can be configured independently #9535

Merged
merged 6 commits into from
Jun 7, 2023

Conversation

@salvacorts salvacorts commented May 26, 2023

What this PR does / why we need it:

Before this PR, the index stats cache would use the same config as the query results cache. This was a limitation since:

  1. We would not be able to point to a different cache for storing the index stats if needed.
  2. We would not be able to add specific settings for this cache, without adding it to the results cache.

In this PR, we refactor the index stats cache config to be independently configurable. Note that if it's not configured, it will try to use the results cache settings.
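
The fallback described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `CacheConfig`, `isSet`, and `effectiveStatsCache` are hypothetical names, and the real config has many more fields.

```go
package main

import "fmt"

// CacheConfig is a hypothetical, heavily simplified stand-in for a cache
// configuration block. It is not Loki's real config type.
type CacheConfig struct {
	Address     string // e.g. a memcached endpoint
	Compression string // "snappy" or "" (disabled)
}

// isSet reports whether any field was explicitly configured.
func (c CacheConfig) isSet() bool {
	return c != CacheConfig{}
}

// effectiveStatsCache returns the stats cache config when it was set, and
// otherwise falls back to the results cache config.
func effectiveStatsCache(results, stats CacheConfig) CacheConfig {
	if stats.isSet() {
		return stats
	}
	return results
}

func main() {
	results := CacheConfig{Address: "results-cache:11211", Compression: "snappy"}

	// No dedicated stats cache configured: the results cache settings are used.
	fmt.Println(effectiveStatsCache(results, CacheConfig{}).Address)

	// Dedicated stats cache configured: it takes precedence.
	stats := CacheConfig{Address: "stats-cache:11211"}
	fmt.Println(effectiveStatsCache(results, stats).Address)
}
```

The sketch treats "not configured" as "all fields left at their zero value", which is an assumption made for brevity.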

Which issue(s) this PR fixes:
This is needed for #9537 and #9536.

Special notes for your reviewer:

  • This PR also refactors all the tripperwares in roundtrip.go to reuse the same stats tripperware instead of each one creating its own.
  • Configuring a new cache in roundtrip.go is a requirement for Add summary stats and metrics for stats cache #9536 so the stats summary can distinguish between the stats cache and the results cache.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

…erences between results cache and stats cache
@salvacorts salvacorts force-pushed the salvacorts/stats-cache-summary-stats-and-metrics branch from e3475a7 to ccf3fab Compare May 26, 2023 14:04
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label May 26, 2023
@salvacorts salvacorts changed the title Stats cache can be configured independently and reports metrics/stats for the cache Stats cache can be configured independently May 26, 2023
@salvacorts salvacorts marked this pull request as ready for review June 2, 2023 07:04
@salvacorts salvacorts requested review from JStickler and a team as code owners June 2, 2023 07:04
@JStickler JStickler left a comment

[docs team] LGTM, one question for clarification.

[cache: <cache_config>]

# Use compression in results cache. Supported values are: 'snappy' and ''
# (disable compression).

Contributor

Supported values are: 'snappy' and ' '

Does ' ' disable the compression? Would it be clearer to say that "A null value will disable compression."?

Contributor Author

This comes inherited from the docs in the results cache (we reuse the structure and therefore generate the same docs).

Does ' ' disable the compression?

Yes

Would it be clearer to say that "A null value will disable compression."?

I think the use of "null" may be confusing here. Someone may write "null" to disable it, but that would yield an error. Instead, I think we'd be better off with: An empty (i.e. "") value will disable compression.

Having said that, we should probably apply that change on a separate PR so we can update both the docs for this and the results cache. Wdyt?

Contributor

Separate PR is fine. I figure if I was confused by what value we meant there, then customers could be confused too. (And good point about someone entering "null" as the value.)
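
To make the point about "null" concrete, here is a minimal sketch of how such a setting could be validated. `validateCompression` is a hypothetical helper written for this illustration and is not taken from the Loki codebase. It only assumes what the thread states: 'snappy' and '' are the supported values.

```go
package main

import "fmt"

// validateCompression accepts the two values named in the docs comment:
// "snappy" enables compression and "" disables it. Anything else, including
// the literal string "null", is rejected.
func validateCompression(v string) error {
	switch v {
	case "", "snappy":
		return nil
	default:
		return fmt.Errorf("unsupported compression %q: use \"snappy\" or \"\"", v)
	}
}

func main() {
	for _, v := range []string{"snappy", "", "null"} {
		fmt.Printf("%q -> %v\n", v, validateCompression(v))
	}
}
```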

@salvacorts salvacorts merged commit 1694ad0 into main Jun 7, 2023
@salvacorts salvacorts deleted the salvacorts/stats-cache-summary-stats-and-metrics branch June 7, 2023 09:00
salvacorts added a commit that referenced this pull request Jun 7, 2023
**What this PR does / why we need it**:
When a query finishes, we return (and log) the following stats:
```go
Cache.Chunk.Requests             0
Cache.Chunk.EntriesRequested     0
Cache.Chunk.EntriesFound         0
Cache.Chunk.EntriesStored        0
Cache.Chunk.BytesSent            0 B
Cache.Chunk.BytesReceived        0 B
Cache.Chunk.DownloadTime         0s
Cache.Index.Requests             0
Cache.Index.EntriesRequested     0
Cache.Index.EntriesFound         0
Cache.Index.EntriesStored        0
Cache.Index.BytesSent            0 B
Cache.Index.BytesReceived        0 B
Cache.Index.DownloadTime         0s
Cache.Result.Requests            13
Cache.Result.EntriesRequested    13
Cache.Result.EntriesFound        13
Cache.Result.EntriesStored       0
Cache.Result.BytesSent           0 B
Cache.Result.BytesReceived       2.5 kB
Cache.Result.DownloadTime        4.600266ms
```

In addition to that, we log the following in metrics.go:
```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 limit=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```

To better monitor how the stats cache is performing, this PR adds
stats for the index stats cache, similar to how it's done for the
results cache.

Here's an example of the new stats being returned and printed:
```go
...
Cache.StatsResult.Requests               180
Cache.StatsResult.EntriesRequested       129
Cache.StatsResult.EntriesFound           129
Cache.StatsResult.EntriesStored          51
Cache.StatsResult.BytesSent              0 B
Cache.StatsResult.BytesReceived          75 kB
...
```

And the new stats from metrics.go:
```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_time=156.864429ms ...
```

**Special notes for your reviewer**:
- Blocked by #9535
- Note the new `stats.GetOrCreateContext` func. It's used inside the
`query.Exec` method so we don't overwrite the stats added in the stats
middleware.
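
The get-or-create idea in the last note can be sketched like this. It is an illustration only: `Stats`, `statsKey`, `getOrCreateContext`, and `middlewareThenExec` are hypothetical, simplified names and do not mirror the real `stats` package.

```go
package main

import (
	"context"
	"fmt"
)

// statsKey is an unexported key type, so values stored under it cannot
// collide with context values set by other packages.
type statsKey struct{}

// Stats is a hypothetical, simplified stats holder.
type Stats struct {
	CacheRequests int
}

// getOrCreateContext returns the Stats already attached to ctx, if any.
// Otherwise it attaches a fresh Stats and returns it with the derived context.
func getOrCreateContext(ctx context.Context) (*Stats, context.Context) {
	if s, ok := ctx.Value(statsKey{}).(*Stats); ok {
		return s, ctx
	}
	s := &Stats{}
	return s, context.WithValue(ctx, statsKey{}, s)
}

// middlewareThenExec mimics a stats middleware followed by query execution.
func middlewareThenExec() (sameStats bool, cacheRequests int) {
	// The middleware runs first and records one cache request.
	mw, ctx := getOrCreateContext(context.Background())
	mw.CacheRequests++

	// Query execution reuses the existing Stats instead of replacing it.
	exec, _ := getOrCreateContext(ctx)
	return exec == mw, exec.CacheRequests
}

func main() {
	fmt.Println(middlewareThenExec()) // true 1
}
```

Had the second call always created a new `Stats`, the request recorded by the middleware would be lost, which is the overwrite the note refers to.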

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 9, 2023
**What this PR does / why we need it**:

When we query the stats for recent data, we query both the ingesters and
the index gateways for the stats.

https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L112-L114


https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L126-L127

Then we merge all the responses, which means summing up all the stats:


https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L157-L158


https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/stores/index/stats/stats.go#L23-L26

Because we have a replication factor of 3, this means that we will get
the stats from the ingesters repeated up to 3 times, hence inflating the
stats.
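
A small sketch of why summing inflates the result. The `Stats` fields and the `merge` helper are simplified assumptions for illustration, not the linked Loki code.

```go
package main

import "fmt"

// Stats is a simplified stand-in for index stats.
type Stats struct {
	Streams, Chunks, Entries, Bytes uint64
}

// merge sums all responses field by field.
func merge(responses ...Stats) Stats {
	var out Stats
	for _, r := range responses {
		out.Streams += r.Streams
		out.Chunks += r.Chunks
		out.Entries += r.Entries
		out.Bytes += r.Bytes
	}
	return out
}

func main() {
	// The same recent data is held by 3 ingesters (replication factor 3),
	// so each of them reports it.
	replica := Stats{Streams: 1, Chunks: 2, Entries: 100, Bytes: 4096}

	merged := merge(replica, replica, replica)
	fmt.Println(merged.Bytes) // 12288, three times the real 4096 bytes
}
```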

In the stats cache, we store the stats for a given matcher set for the
whole day. We then extract the stats for a request by scaling the cached
stats by the fraction of the cached time range that the request covers:

https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L33

https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L40
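
The extraction step can be sketched as a proportional scaling. This is an assumption-laden illustration: `extractBytes` is a hypothetical helper, timestamps are plain milliseconds, and data is assumed to be spread uniformly over the cached range.

```go
package main

import "fmt"

// extractBytes scales a byte count cached for [cachedStart, cachedEnd) down to
// the requested [start, end) range. Timestamps are in milliseconds.
func extractBytes(cachedBytes uint64, cachedStart, cachedEnd, start, end int64) uint64 {
	if cachedEnd <= cachedStart {
		return 0 // empty cached range: nothing to scale
	}
	factor := float64(end-start) / float64(cachedEnd-cachedStart)
	return uint64(float64(cachedBytes) * factor)
}

func main() {
	const hour = int64(3600 * 1000)
	day := 24 * hour

	// 2400 bytes cached for the whole day; a 3h request gets 3/24 of them.
	fmt.Println(extractBytes(2400, 0, day, 0, 3*hour)) // 300

	// If the cached value was inflated 3x, every window inherits the
	// inflation, including one far from the recent data.
	fmt.Println(extractBytes(7200, 0, day, 0, 3*hour)) // 900
}
```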

Inflated stats for recent data will be cached, so subsequent stats
extracted from the cache will be inflated regardless of the time.

This PR adds a new per-tenant limit `max_stats_cache_freshness`:
requests whose end time is more recent than now minus this duration are
not cached.
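
A minimal sketch of such a freshness check. `shouldCacheStats` is a hypothetical helper, not the function added by this PR. Treating a non-positive limit as "disabled" and the exact boundary handling are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// shouldCacheStats reports whether a stats response may be cached. All values
// are in milliseconds. A request whose end time is more recent than
// now - maxFreshness is not cached.
func shouldCacheStats(reqEndMs, nowMs, maxFreshnessMs int64) bool {
	if maxFreshnessMs <= 0 {
		return true // assumption: a non-positive limit disables the check
	}
	return reqEndMs < nowMs-maxFreshnessMs
}

func main() {
	now := time.Now().UnixMilli()
	freshness := (30 * time.Minute).Milliseconds()

	tenMinAgo := now - (10 * time.Minute).Milliseconds()
	twoHoursAgo := now - (2 * time.Hour).Milliseconds()

	fmt.Println(shouldCacheStats(tenMinAgo, now, freshness))   // false
	fmt.Println(shouldCacheStats(twoHoursAgo, now, freshness)) // true
}
```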

Here's a scenario illustrating this. The graphs below show the bytes
stats queried in the sharding middleware. We are running a log filter
query that won't match any log, every 5 seconds with a length of 3h.


![image](https://github.com/grafana/loki/assets/8354290/45c2e6e9-185c-4a18-b290-47da27fc3e39)

As can be seen, after enabling the stats cache and configuring
`do_not_cache_request_within` to not cache stats for requests
within 30m, the bytes stats used in the sharding middleware stopped
increasing.

In both cases the stats cache hit ratio was 100%.

![image](https://github.com/grafana/loki/assets/8354290/cd35bcb8-0c77-4693-a06b-502741fd6e23)

**Special notes for your reviewer**:
- Blocked by #9535
- Note that this PR doesn't fix the root issue of inflated stats from
the ingesters, but rather buys us some time to work on that.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 12, 2023
**What this PR does / why we need it**:
Follow up PR for
#9535 (comment)

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [ ] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 12, 2023
**What this PR does / why we need it**:

Before this PR, the index stats cache would use the same config as the
query results cache. This was a limitation since:

1. We would not be able to point to a different cache for storing the
index stats if needed.
2. We would not be able to add specific settings for this cache, without
adding it to the results cache.

In this PR, we refactor the index stats cache config to be independently
configurable. Note that if it's not configured, it will try to use the
results cache settings.

**Which issue(s) this PR fixes**:
This is needed for:
- #9537
- #9536

**Special notes for your reviewer**:

- This PR also refactors all the tripperwares in roundtrip.go to reuse
the same stats tripperware instead of each one creating its own.
- Configuring a new cache in roundtrip.go is a requirement for
#9536 so the stats summary can
distinguish between the stats cache and the results cache.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
trevorwhitney added a commit that referenced this pull request Jun 12, 2023
commit 065bee7
Author: Travis Patterson <travis.patterson@grafana.com>
Date:   Mon Jun 12 10:21:58 2023 -0600

    Label Volume Endpoint (#9588)

    For a given set of matchers, returns the top N associated label/value
    pairs by volume. A query for `{cluster=prod}` will return

    ```
    cluster=prod: size (total logs matching this matcher)
     .
     .
     .
    nth-label=nth-value
    ```

    This is to service use cases where users want to understand where their
    log volume has come from by label without making multiple requests to
    the stats endpoint.

    Note: This PR is a monster but it's mostly plumbing. I've pointed out
    the most interesting bits that actually get the volumes from
    ingesters/indexes.

commit 4d997a5
Author: Piotr <17101802+thampiotr@users.noreply.github.com>
Date:   Mon Jun 12 16:24:26 2023 +0100

    Fix promtail cluster template not finding all clusters. (#9684)

    **What this PR does / why we need it**:
    In promtail-mixin, the dropdown template for clusters would only include
    clusters that run loki. So if a cluster only runs promtail and not loki,
    it doesn't appear.

commit 57f9452
Author: Kaviraj Kanagaraj <kavirajkanagaraj@gmail.com>
Date:   Mon Jun 12 15:21:08 2023 +0200

    Revert 9217 chaudum/tsdb chunkrefs pool (#9685)

    Revert #9217 (potential bug in query result)

    ---------

    Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

commit 73ac208
Author: Salva Corts <salva.corts@grafana.com>
Date:   Mon Jun 12 10:46:30 2023 +0200

    Improve docs for empty value in cache compression config (#9649)

    **What this PR does / why we need it**:
    Follow up PR for
    #9535 (comment)

    **Checklist**
    - [x] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [x] Documentation added
    - [ ] Tests updated
    - [ ] `CHANGELOG.md` updated
    - [ ] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`
    - [ ] For Helm chart changes bump the Helm chart version in
    `production/helm/loki/Chart.yaml` and update
    `production/helm/loki/CHANGELOG.md` and
    `production/helm/loki/README.md`. [Example
    PR](d10549e)

commit f239435
Author: Christophe Collot <52134228+CCOLLOT@users.noreply.github.com>
Date:   Fri Jun 9 15:00:31 2023 +0200

    feat(lambda-promtail): add cloudfront log file ingestion support (#9573)

    **What this PR does / why we need it**:

    This PR enables ingesting logs from Cloudfront log files stored in s3
    (batch).
    The current setup only supports streaming Cloudfront logs through AWS
    Kinesis, whereas this PR implements the same flow as for VPC Flow logs,
    Load Balancer logs, and Cloudtrail logs (s3 --> SQS (optional) -->
    Lambda Promtail --> Loki)

    **Special notes for your reviewer**:
    + The Cloudfront log file format is different from the already
    implemented services, meaning we had to build yet another regex. AWS
    never bothered making all services follow the same log file naming
    convention but the "good" thing is that it's now very unlikely they will
    change it in the future.
    + The Cloudfront file name does not have any mention of the AWS account
    or the type of log it contains, which means we have to infer the log type
    from the filename format instead of finding the exact string
    "cloudfront" in the filename. This is why in `getLabels`, if no `type`
    parameter is found in the regex, we use the key corresponding to the
    name of the matching parser.
    + I introduced a new `parser` struct to group together several
    parameters specific to a type of log (and avoid relying too much on map
    key string matching and / or if statements for specific use cases)
    + I've been successfully running this code in several AWS environments
    for days.
    + I corrected a typo from my previous PR #9497 (wrong PR number in
    Changelog.md)
    **Checklist**
    - [x] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [x] Documentation added
    - [x] Tests updated
    - [x] `CHANGELOG.md` updated
    - [x] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`
    - [x] For Helm chart changes bump the Helm chart version in
    `production/helm/loki/Chart.yaml` and update
    `production/helm/loki/CHANGELOG.md` and
    `production/helm/loki/README.md`. [Example
    PR](d10549e)

    ---------

    Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>

commit c6fbff2
Author: Salva Corts <salva.corts@grafana.com>
Date:   Fri Jun 9 14:40:36 2023 +0200

    Add config to avoid caching stats for recent data (#9537)

    **What this PR does / why we need it**:

    When we query the stats for recent data, we query both the ingesters and
    the index gateways for the stats.

    https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L112-L114

    https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L126-L127

    Then we merge all the responses, which means summing up all the stats

    https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L157-L158

    https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/stores/index/stats/stats.go#L23-L26

    Because we have a replication factor of 3, this means that we will get
    the stats from the ingesters repeated up to 3 times, hence inflating the
    stats.

    In the stats cache, we store the stats for a given matcher set for the
    whole day, then we extract the stats from the cache by the factor of
    time from the request that is stored in the cache:

    https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L33

    https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L40

    Inflated stats for recent data will be cached, so subsequent stats
    extracted from the cache will be inflated regardless of the time.

    This PR adds a new per-tenant limit `max_stats_cache_freshness` to not
    cache requests with an end time that falls within Now minus this
    duration.

    Here's a scenario illustrating this. The graphs below show the bytes
    stats queried in the sharding middleware. We are running a log filter
    query that won't match any log, every 5 seconds with a length of 3h.

    ![image](https://github.com/grafana/loki/assets/8354290/45c2e6e9-185c-4a18-b290-47da27fc3e39)

    As can be seen, after enabling the stats cache and configuring
    `do_not_cache_request_within` to not cache stats for requests
    within 30m, the bytes stats used in the sharding middleware stopped
    increasing.

    In both cases the stats cache hit ratio was 100%.

    ![image](https://github.com/grafana/loki/assets/8354290/cd35bcb8-0c77-4693-a06b-502741fd6e23)

    **Special notes for your reviewer**:
    - Blocked by #9535
    - Note that this PR doesn't fix the root issue of inflated stats from
    the ingesters, but rather buys us some time to work on that.

    **Checklist**
    - [x] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [x] Documentation added
    - [x] Tests updated
    - [ ] `CHANGELOG.md` updated
    - [ ] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`
    - [ ] For Helm chart changes bump the Helm chart version in
    `production/helm/loki/Chart.yaml` and update
    `production/helm/loki/CHANGELOG.md` and
    `production/helm/loki/README.md`. [Example
    PR](d10549e)

commit 22779e1
Author: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>
Date:   Fri Jun 9 13:33:15 2023 +0100

    Fix date template function with epoch times (#8886)

    **What this PR does / why we need it**:

    Adds new toUnixEpoch... functions to convert from a string with a
    Unix/Epoch time to an integer that can be used in the existing `toDate`
    function.

    Note that these are the opposites of some of the functions introduced in
    #8774.

    **Which issue(s) this PR fixes**:
    Fixes #8624.

    **Special notes for your reviewer**:

    **Checklist**
    - [ ] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [X] Documentation added
    - [X] Tests updated
    - [ ] `CHANGELOG.md` updated
    - [ ] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`

    ---------

    Signed-off-by: Michel Hollands <michel.hollands@grafana.com>

commit 1b410db
Author: Bruno FERNANDO <bruno.fernando@jobteaser.com>
Date:   Fri Jun 9 13:48:42 2023 +0200

    feat(promtail): add CF ClientRequestSource field (#9669)

    **What this PR does / why we need it**:

    Hey folks 👋

    Little contribution here to add a useful log field for cloudflare users.
    Indeed I add the [ClientRequestSource
    field](https://developers.cloudflare.com/logs/reference/clientrequestsource/
    ) which is pretty useful when debugging some specific traffic handled by
    cloudflare

    Extra: Since I was on the documentation I fixed an indentation issue
    that I spotted

    Don't hesitate to reach me if you have any questions

    Cheers 😉

    **Which issue(s) this PR fixes**:
    Fixes #<issue number>

    **Special notes for your reviewer**:

    Loki rocks 🚀

    **Checklist**
    - [x] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [x] Documentation added
    - [ ] Tests updated
    - [ ] `CHANGELOG.md` updated
    - [ ] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`
    - [ ] For Helm chart changes bump the Helm chart version in
    `production/helm/loki/Chart.yaml` and update
    `production/helm/loki/CHANGELOG.md` and
    `production/helm/loki/README.md`. [Example
    PR](d10549e)

commit b1917a6
Author: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
Date:   Fri Jun 9 13:38:21 2023 +0200

    add "alignLeft" and "alignRight" functions (#9672)

    Fixes #9667

commit 98d1307
Author: Ashwanth <iamashwanth@gmail.com>
Date:   Fri Jun 9 12:46:38 2023 +0530

    config: ensure storage config defaults apply to named stores (#9650)

    **What this PR does / why we need it**:
    Since named store config does not register any flags, storage configs
    defined under it do not get the defaults.
    For example
    [aws_storage_config](https://grafana.com/docs/loki/latest/configuration/#aws_storage_config)
    sets the default `storage_class` to `STANDARD`, but the same doesn't get
    applied by default when using named stores.

    This PR ensures that named storage configs are always assigned default
    values when they are unmarshalled by implementing `yaml.Unmarshaler`
    interface

commit 4cebc2d
Author: Pepe Cano <825430+ppcano@users.noreply.github.com>
Date:   Thu Jun 8 21:44:00 2023 +0200

    Docs: replace `k6 Cloud` mention (#9599)

    k6 is now available as a managed service on Grafana Cloud.

    This is a small doc change to remove the mention of `k6 Cloud`.

    ---------

    Co-authored-by: J Stickler <julie.stickler@grafana.com>

commit 1db560f
Author: Danny Kopping <danny.kopping@grafana.com>
Date:   Thu Jun 8 14:19:58 2023 +0200

    Adding background cache (en|de)queue counters (#9665)

    **What this PR does / why we need it**:
    The background writeback cache currently exposes a gauge metric for the
    current queue size. Gauges can be useful, but they are susceptible to
    sampling errors because they only represent the point in time at which
    the scrape happens.

    Exposing counters for the bytes (en|de)queued to/from the cache will be
    more useful because they can be aggregated.

    Signed-off-by: Danny Kopping <danny.kopping@grafana.com>

commit 609bc22
Author: Dylan Guedes <djmgguedes@gmail.com>
Date:   Thu Jun 8 09:04:45 2023 -0300

    Distributor: Make  key configurable when logging failures (#9659)

    **What this PR does / why we need it**:
    Make appending `insight=true` key-value pair to log failures
    configurable.

    **Which issue(s) this PR fixes**:
    N/A

commit d581258
Author: Nils Griebner <nils@nils-griebner.de>
Date:   Thu Jun 8 11:30:52 2023 +0200

    Make table manager retention options configurable in helm chart values (#9647)

    **What this PR does / why we need it**:

    Configuration options for table manager retention are hard-coded in the
    helm chart values at the moment so that it's not possible to enable
    retention deletes.

    **Which issue(s) this PR fixes**:
    Fixes #8676

    **Special notes for your reviewer**:
    -

    **Checklist**
    - [x] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [x] Documentation added
    - [x] Tests updated
    - [x] `CHANGELOG.md` updated
    - [x] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`
    - [x] For Helm chart changes bump the Helm chart version in
    `production/helm/loki/Chart.yaml` and update
    `production/helm/loki/CHANGELOG.md` and
    `production/helm/loki/README.md`. [Example
    PR](d10549e)

    Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>
trevorwhitney added a commit that referenced this pull request Jun 12, 2023
commit 065bee7
Author: Travis Patterson <travis.patterson@grafana.com>
Date:   Mon Jun 12 10:21:58 2023 -0600

    Label Volume Endpoint (#9588)

    For a given set of matchers, returns the top N associated label/value
    pairs by volume. A query for `{cluster=prod}` will return

    ```
    cluster=prod: size (total logs matching this matcher)
     .
     .
     .
    nth-label=nth-value
    ```

    This is to service use cases where users want to understand where their
    log volume has come from by label without making multiple requests to
    the stats endpoint.

    Note: This PR is a monster but it's mostly plumbing. I've pointed out
    the most interesting bits that actually get the volumes from
    ingesters/indexes

commit 4d997a5
Author: Piotr <17101802+thampiotr@users.noreply.github.com>
Date:   Mon Jun 12 16:24:26 2023 +0100

    Fix promtail cluster template not finding all clusters. (#9684)

    **What this PR does / why we need it**:
    In promtail-mixin, the dropdown template for clusters would only include
    clusters that run loki. So if a cluster only runs promtail and not loki,
    it doesn't appear.

commit 57f9452
Author: Kaviraj Kanagaraj <kavirajkanagaraj@gmail.com>
Date:   Mon Jun 12 15:21:08 2023 +0200

    Revert 9217 chaudum/tsdb chunkrefs pool (#9685)

    Revert #9217 (potential bug in query result)

    ---------

    Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

commit 73ac208
Author: Salva Corts <salva.corts@grafana.com>
Date:   Mon Jun 12 10:46:30 2023 +0200

    Improve docs for empty value in cache compression config (#9649)

    **What this PR does / why we need it**:
    Follow up PR for
    #9535 (comment)

    **Checklist**
    - [x] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [x] Documentation added
    - [ ] Tests updated
    - [ ] `CHANGELOG.md` updated
    - [ ] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`
    - [ ] For Helm chart changes bump the Helm chart version in
    `production/helm/loki/Chart.yaml` and update
    `production/helm/loki/CHANGELOG.md` and
    `production/helm/loki/README.md`. [Example
    PR](d10549e)

commit f239435
Author: Christophe Collot <52134228+CCOLLOT@users.noreply.github.com>
Date:   Fri Jun 9 15:00:31 2023 +0200

    feat(lambda-promtail): add cloudfront log file ingestion support (#9573)

    **What this PR does / why we need it**:

    This PR enables ingesting logs from Cloudfront log files stored in s3
    (batch).
    The current setup only supports streaming Cloudfront logs through AWS
    Kinesis, whereas this PR implements the same flow as for VPC Flow logs,
    Load Balancer logs, and Cloudtrail logs (s3 --> SQS (optional) -->
    Lambda Promtail --> Loki)

    **Special notes for your reviewer**:
    + The Cloudfront log file format is different from the already
    implemented services, meaning we had to build yet another regex. AWS
    never bothered making all services follow the same log file naming
    convention but the "good" thing is that it's now very unlikely they will
    change it in the future.
    + The Cloudfront file name does not have any mention of the AWS account
    or the type of log it contains, so we have to infer the log type
    from the filename format instead of finding the exact string
    "cloudfront" in the filename. This is why in `getLabels`, if no `type`
    parameter is found in the regex, we use the key corresponding to the
    name of the matching parser.
    + I introduced a new `parser` struct to group together several
    parameters specific to a type of log (and avoid relying too much on map
    key string matching and / or if statements for specific use cases)
    + I've been successfully running this code in several AWS environments
    for days.
    + I corrected a typo from my previous PR #9497 (wrong PR number in
    Changelog.md)
    **Checklist**
    - [x] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [x] Documentation added
    - [x] Tests updated
    - [x] `CHANGELOG.md` updated
    - [x] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`
    - [x] For Helm chart changes bump the Helm chart version in
    `production/helm/loki/Chart.yaml` and update
    `production/helm/loki/CHANGELOG.md` and
    `production/helm/loki/README.md`. [Example
    PR](d10549e)

    ---------

    Co-authored-by: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>

commit c6fbff2
Author: Salva Corts <salva.corts@grafana.com>
Date:   Fri Jun 9 14:40:36 2023 +0200

    Add config to avoid caching stats for recent data (#9537)

    **What this PR does / why we need it**:

    When we query the stats for recent data, we query both the ingesters and
    the index gateways for the stats.

    https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L112-L114

    https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L126-L127

    Then we merge all the responses, which means summing up all the stats

    https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L157-L158

    https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/stores/index/stats/stats.go#L23-L26

    Because we have a replication factor of 3, this means that we will get
    the stats from the ingesters repeated up to 3 times, hence inflating the
    stats.

    In the stats cache, we store the stats for a given matcher set for the
    whole day, then we extract the stats for a request by scaling the cached
    value by the fraction of the cached time range that the request covers:

    https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L33

    https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L40

    Inflated stats for recent data will be cached, so subsequent stats
    extracted from the cache will be inflated regardless of the time.

    This PR adds a new per-tenant limit `max_stats_cache_freshness` to not
    cache requests with an end time that falls within Now minus this
    duration.

    Here's a scenario illustrating this. The graphs below show the bytes
    stats queried in the sharding middleware. We are running a log filter
    query that won't match any log, every 5 seconds with a length of 3h.

    ![image](https://github.com/grafana/loki/assets/8354290/45c2e6e9-185c-4a18-b290-47da27fc3e39)

    As can be seen, after enabling the stats cache and
    configuring `do_not_cache_request_within` to not cache stats for requests
    within 30m, the bytes stats used in the sharding middleware stopped
    increasing.

    In both cases the stats cache hit ratio was 100%.

    ![image](https://github.com/grafana/loki/assets/8354290/cd35bcb8-0c77-4693-a06b-502741fd6e23)

    **Special notes for your reviewer**:
    - Blocked by #9535
    - Note that this PR doesn't fix the root issue of inflated stats from
    the ingesters, but rather buys us some time to work on that.

    **Checklist**
    - [x] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [x] Documentation added
    - [x] Tests updated
    - [ ] `CHANGELOG.md` updated
    - [ ] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`
    - [ ] For Helm chart changes bump the Helm chart version in
    `production/helm/loki/Chart.yaml` and update
    `production/helm/loki/CHANGELOG.md` and
    `production/helm/loki/README.md`. [Example
    PR](d10549e)

commit 22779e1
Author: Michel Hollands <42814411+MichelHollands@users.noreply.github.com>
Date:   Fri Jun 9 13:33:15 2023 +0100

    Fix date template function with epoch times (#8886)

    **What this PR does / why we need it**:

    Adds new toUnixEpoch... functions to convert from a string with a
    Unix/Epoch time to an integer that can be used in the existing `toDate`
    function.

    Note that these are the opposites of some of the functions introduced in
    #8774.

    **Which issue(s) this PR fixes**:
    Fixes #8624.

    **Special notes for your reviewer**:

    **Checklist**
    - [ ] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [X] Documentation added
    - [X] Tests updated
    - [ ] `CHANGELOG.md` updated
    - [ ] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`

    ---------

    Signed-off-by: Michel Hollands <michel.hollands@grafana.com>

commit 1b410db
Author: Bruno FERNANDO <bruno.fernando@jobteaser.com>
Date:   Fri Jun 9 13:48:42 2023 +0200

    feat(promtail): add CF ClientRequestSource field (#9669)

    **What this PR does / why we need it**:

    Hey folks 👋

    Little contribution here to add a useful log field for cloudflare users.
    I added the [ClientRequestSource
    field](https://developers.cloudflare.com/logs/reference/clientrequestsource/),
    which is pretty useful when debugging some specific traffic handled by
    cloudflare.

    Extra: Since I was on the documentation I fixed an indentation issue
    that I spotted

    Don't hesitate to reach me if you have any questions

    Cheers 😉

    **Which issue(s) this PR fixes**:
    Fixes #<issue number>

    **Special notes for your reviewer**:

    Loki rocks 🚀

    **Checklist**
    - [x] Reviewed the
    [`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
    guide (**required**)
    - [x] Documentation added
    - [ ] Tests updated
    - [ ] `CHANGELOG.md` updated
    - [ ] Changes that require user attention or interaction to upgrade are
    documented in `docs/sources/upgrading/_index.md`
    - [ ] For Helm chart changes bump the Helm chart version in
    `production/helm/loki/Chart.yaml` and update
    `production/helm/loki/CHANGELOG.md` and
    `production/helm/loki/README.md`. [Example
    PR](d10549e)

salvacorts added a commit that referenced this pull request Jun 13, 2023
**What this PR does / why we need it**:

Before this PR, the index stats cache would use the same config as the
query results cache. This was a limitation since:

1. We would not be able to point to a different cache for storing the
index stats if needed.
2. We would not be able to add specific settings for this cache, without
adding it to the results cache.

In this PR, we refactor the index stats cache config to be independently
configurable. Note that if it's not configured, it will try to use the
results cache settings.
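
The fallback described above can be sketched roughly as follows. This is only an illustration: `CacheConfig`, `IsConfigured`, and `effectiveStatsCache` are hypothetical names, not the types or functions used in this PR, and the "is configured" check is a simplifying assumption.

```go
package main

import "fmt"

// CacheConfig is a hypothetical, simplified stand-in for a cache config.
// It is not the actual Loki type.
type CacheConfig struct {
	Backend string // e.g. "memcached"; empty is assumed to mean "not configured"
	Address string
}

// IsConfigured reports whether a backend has been set (an assumption made
// for this sketch).
func (c CacheConfig) IsConfigured() bool { return c.Backend != "" }

// effectiveStatsCache returns the dedicated stats cache config when it is
// set, and falls back to the results cache config otherwise.
func effectiveStatsCache(stats, results CacheConfig) CacheConfig {
	if stats.IsConfigured() {
		return stats
	}
	return results
}

func main() {
	results := CacheConfig{Backend: "memcached", Address: "results-cache:11211"}

	// No dedicated stats cache: the results cache settings are used.
	fmt.Println(effectiveStatsCache(CacheConfig{}, results).Address)

	// Dedicated stats cache: it takes precedence.
	stats := CacheConfig{Backend: "memcached", Address: "stats-cache:11211"}
	fmt.Println(effectiveStatsCache(stats, results).Address)
}
```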

**Which issue(s) this PR fixes**:
This is needed for:
- #9537
- #9536

**Special notes for your reviewer**:

- This PR also refactors all the tripperwares in roundtrip.go to reuse
the same stats tripperware instead of each one creating their own.
- Configuring a new cache in roundtrip.go is a requirement for
#9536 so the stats summary can
distinguish between the stats cache and the results cache.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [x] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 13, 2023
**What this PR does / why we need it**:
When a query finishes, we return (and log) the following stats:
```go
Cache.Chunk.Requests             0
Cache.Chunk.EntriesRequested     0
Cache.Chunk.EntriesFound         0
Cache.Chunk.EntriesStored        0
Cache.Chunk.BytesSent            0 B
Cache.Chunk.BytesReceived        0 B
Cache.Chunk.DownloadTime         0s
Cache.Index.Requests             0
Cache.Index.EntriesRequested     0
Cache.Index.EntriesFound         0
Cache.Index.EntriesStored        0
Cache.Index.BytesSent            0 B
Cache.Index.BytesReceived        0 B
Cache.Index.DownloadTime         0s
Cache.Result.Requests            13
Cache.Result.EntriesRequested    13
Cache.Result.EntriesFound        13
Cache.Result.EntriesStored       0
Cache.Result.BytesSent           0 B
Cache.Result.BytesReceived       2.5 kB
Cache.Result.DownloadTime        4.600266ms
```

In addition to that, we log the following in metrics.go:
```
level=info ts=2023-05-29T09:17:10.93029945Z caller=metrics.go:152 component=frontend org_id=145265 traceID=52d59b78fe6b9221 sampled=true latency=fast query="{cluster=\"dev-us-central-0\", namespace=~\"loki.*\", container=~\"distributor|ingester|promtail|index-gateway|compactor\"} |= \"thislinewillnotexist\"" query_hash=1194136170 query_type=filter range_type=range length=3h0m0s start_delta=165h37m24.930289434s end_delta=162h37m24.930289612s step=43s duration=2.473055ms status=200 limit=30 returned_lines=0 throughput=0B total_bytes=0B lines_per_second=0 total_lines=0 total_entries=0 store_chunks_download_time=0s queue_time=0s splits=13 shards=0 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_result_req=13 cache_result_hit=13 cache_result_download_time=4.600266ms
```

With the goal of being able to better monitor how the stats cache is
performing, this PR adds stats for the index stats cache, similar to
how it's done for the results cache.

Here's an example of the new stats being returned and printed:
```go
...
Cache.StatsResult.Requests               180
Cache.StatsResult.EntriesRequested       129
Cache.StatsResult.EntriesFound           129
Cache.StatsResult.EntriesStored          51
Cache.StatsResult.BytesSent              0 B
Cache.StatsResult.BytesReceived          75 kB
...
```

And the new stats from metrics.go
```
... caller=metrics.go:155 ... cache_stats_results_req=129 cache_stats_results_hit=129 cache_stats_results_download_time=156.864429ms ...
```

**Special notes for your reviewer**:
- Blocked by #9535
- Note the new `stats.GetOrCreateContext` func. It's used inside the
`query.Exec` method so we don't overwrite the stats added in the stats
middleware.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [ ] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 13, 2023
**What this PR does / why we need it**:

When we query the stats for recent data, we query both the ingesters and
the index gateways for the stats.

https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L112-L114


https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L126-L127

Then we merge all the responses, which means summing up all the stats


https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/async_store.go#L157-L158


https://github.com/grafana/loki/blob/ebdb2b18007e024d56105afc5230383165ca1650/pkg/storage/stores/index/stats/stats.go#L23-L26

Because we have a replication factor of 3, this means that we will get
the stats from the ingesters repeated up to 3 times, hence inflating the
stats.
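
To make the inflation concrete, here is a small sketch of merging by summation. The `Stats` struct and `mergeStats` function are simplified, hypothetical stand-ins rather than the code at the links above, and the numbers are made up.

```go
package main

import "fmt"

// Stats is a simplified, hypothetical version of an index stats response.
type Stats struct {
	Streams, Chunks, Bytes, Entries uint64
}

// mergeStats merges responses by summing every field, with no deduplication.
func mergeStats(responses ...Stats) Stats {
	var out Stats
	for _, r := range responses {
		out.Streams += r.Streams
		out.Chunks += r.Chunks
		out.Bytes += r.Bytes
		out.Entries += r.Entries
	}
	return out
}

func main() {
	// With a replication factor of 3, up to three ingesters can report the
	// same recent data.
	replica := Stats{Streams: 1, Chunks: 4, Bytes: 1000, Entries: 50}
	merged := mergeStats(replica, replica, replica)

	// The merged value is up to 3x the real volume.
	fmt.Println(merged.Bytes) // prints: 3000
}
```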

In the stats cache, we store the stats for a given matcher set for the
whole day, then we extract the stats for a request by scaling the cached
value by the fraction of the cached time range that the request covers:

https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L33

https://github.com/grafana/loki/blob/336283acadb34f5fda9abce4e6fcef1dca9965d8/pkg/querier/queryrange/index_stats_cache.go#L40
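
As a rough illustration of this kind of proportional extraction (a sketch under the assumption that volume is spread uniformly over the cached range; `extractBytes` is a hypothetical helper, not the code at the linked lines):

```go
package main

import "fmt"

// extractBytes scales cachedBytes, which covers [cachedStart, cachedEnd),
// down to the portion of [reqStart, reqEnd) that overlaps the cached range.
// Timestamps are in milliseconds.
func extractBytes(cachedBytes uint64, cachedStart, cachedEnd, reqStart, reqEnd int64) uint64 {
	if cachedEnd <= cachedStart {
		return 0
	}
	// Clamp the request to the cached range.
	start, end := reqStart, reqEnd
	if start < cachedStart {
		start = cachedStart
	}
	if end > cachedEnd {
		end = cachedEnd
	}
	if end <= start {
		return 0
	}
	factor := float64(end-start) / float64(cachedEnd-cachedStart)
	return uint64(float64(cachedBytes) * factor)
}

func main() {
	day := int64(24 * 60 * 60 * 1000)
	threeHours := int64(3 * 60 * 60 * 1000)

	// A 3h request takes 1/8 of the cached daily value.
	fmt.Println(extractBytes(2400, 0, day, 0, threeHours))

	// If the cached daily value is inflated 3x, every extracted value is too.
	fmt.Println(extractBytes(7200, 0, day, 0, threeHours))
}
```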

Inflated stats for recent data will be cached, so subsequent stats
extracted from the cache will be inflated regardless of the time.

This PR adds a new per-tenant limit `max_stats_cache_freshness` to not
cache requests with an end time that falls within Now minus this
duration.

Here's a scenario illustrating this. The graphs below show the bytes
stats queried in the sharding middleware. We are running a log filter
query that won't match any log, every 5 seconds with a length of 3h.


![image](https://github.com/grafana/loki/assets/8354290/45c2e6e9-185c-4a18-b290-47da27fc3e39)

As can be seen, after enabling the stats cache and
configuring `do_not_cache_request_within` to not cache stats for requests
within 30m, the bytes stats used in the sharding middleware stopped
increasing.

In both cases the stats cache hit ratio was 100%.

![image](https://github.com/grafana/loki/assets/8354290/cd35bcb8-0c77-4693-a06b-502741fd6e23)

**Special notes for your reviewer**:
- Blocked by #9535
- Note that this PR doesn't fix the root issue of inflated stats from
the ingesters, but rather buys us some time to work on that.

**Checklist**
- [x] Reviewed the
[`CONTRIBUTING.md`](https://github.com/grafana/loki/blob/main/CONTRIBUTING.md)
guide (**required**)
- [x] Documentation added
- [x] Tests updated
- [ ] `CHANGELOG.md` updated
- [ ] Changes that require user attention or interaction to upgrade are
documented in `docs/sources/upgrading/_index.md`
- [ ] For Helm chart changes bump the Helm chart version in
`production/helm/loki/Chart.yaml` and update
`production/helm/loki/CHANGELOG.md` and
`production/helm/loki/README.md`. [Example
PR](d10549e)
salvacorts added a commit that referenced this pull request Jun 13, 2023
Adds changes from:

- #9535
- #9536
- #9537
- #9529
- #9552

So we can use k150 with newer config.
salvacorts added a commit that referenced this pull request Jun 28, 2023
Labels
size/L type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories