Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(storage): GCS backend using thanos.io/objstore #11132

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

JoaoBraveCoding
Copy link
Collaborator

@JoaoBraveCoding JoaoBraveCoding commented Nov 3, 2023

What this PR does / why we need it:

This PR adds support to use the thanos.io/objstore backend for the GCS provider for all components including the Ruler.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

  • We tried to the best of our efforts to preserve feature parity so the new backend still uses the ObjectClient interface
  • Thanos.io/objstore doesn't support all the CLI flags Loki supports nowadays (see snippet below)
  • Hedging support was added in GCS: Adds HTTP Config similar to S3 thanos-io/objstore#86 and 32a9e4c
  • This PR was inspired by the way Mimir uses thanos.io/objstore intending to make this part of the codebase as similar as possible

CLI Table

gcs.bucketname
gcs.service-account

ADDED
gcs.http.idle_conn_timeout
gcs.http.response_header_timeout
gcs.http.insecure_skip_verify
gcs.http.tls_handshake_timeout
gcs.http.expect_continue_timeout
gcs.http.max_idle_connections
gcs.http.max_idle_connections_per_host
gcs.http.ca_file

NOT PRESENT AT THE MOMENT
gcs.chunk-buffer-size (Would need objstore patch)
gcs.enable-opencensus (Would need objstore patch)
gcs.enable-retries (Would need objstore patch)
gcs.request-timeout (Once HTTP Config its possible)
gcs.enable-http2 (Once HTTP Config its possible)

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@JoaoBraveCoding JoaoBraveCoding changed the title WIP: GCP implementation using thanos.io/objstore GCP implementation using thanos.io/objstore Nov 14, 2023
Signed-off-by: Joao Marcal <jmarcal@redhat.com>
@JoaoBraveCoding JoaoBraveCoding marked this pull request as ready for review November 14, 2023 10:30
@JoaoBraveCoding JoaoBraveCoding requested a review from a team as a code owner November 14, 2023 10:30
@JoaoBraveCoding JoaoBraveCoding changed the title GCP implementation using thanos.io/objstore storage: GCS backend using thanos.io/objstore Nov 14, 2023
Copy link
Contributor

github-actions bot commented Nov 21, 2023

Trivy scan found the following vulnerabilities:

@chaudum
Copy link
Contributor

chaudum commented Nov 23, 2023

This PR was inspired by the way Mimir uses thanos.io/objstore with the goal of making this part of the codebase as similar as possible

Thank you! ❤️

@JoaoBraveCoding JoaoBraveCoding changed the title storage: GCS backend using thanos.io/objstore feat(storage): GCS backend using thanos.io/objstore Mar 14, 2024
@github-actions github-actions bot added the type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories label Mar 14, 2024
@kavirajk
Copy link
Contributor

kavirajk commented Jul 3, 2024

@JoaoBraveCoding thanks for the PR :)

Update:
I'm testing GCS changes on two things internally.

  1. On our internal cluster to make sure nothing breaks
  2. Our Cloud logs archiver project uses Loki's storage layer to copy chunks between buckets. I'm making sure this PR doesn't break anything there (particularly authentication side of things).

I will keep it posted here. Once those testing is done, we can approve and merge this PR

@kavirajk kavirajk self-assigned this Jul 3, 2024
@kavirajk
Copy link
Contributor

kavirajk commented Jul 3, 2024

@JoaoBraveCoding I was testing this PR and found it is breaking couple of our internal services.

not enough arguments in call to storage.NewObjectClient
	have (string, "github.com/grafana/loki/v3/pkg/storage".Config, "github.com/grafana/loki/v3/pkg/storage".ClientMetrics)
	want (string, string, "github.com/grafana/loki/v3/pkg/storage".Config, "github.com/grafana/loki/v3/pkg/storage".ClientMetrics, "github.com/prometheus/client_golang/prometheus".Registerer)

Basically, looks like we cannot afford to change the signature of storage.NewObjectClient atomically in single PR. These two internal services consume those heavily and I think we need to take multi step approach.

  1. Keep `storage.NewObjectClient as is
  2. expose something like storage.NewObjectClientV2 (or similar) with thanos object store support
  3. I will migrate our internal services to V2 separately.
    wdyt?

Signed-off-by: Kaviraj Kanagaraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj Kanagaraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj Kanagaraj <kavirajkanagaraj@gmail.com>
Signed-off-by: Kaviraj Kanagaraj <kavirajkanagaraj@gmail.com>
@kavirajk
Copy link
Contributor

kavirajk commented Jul 30, 2024

I'm planning to merge this PR. Let me know if anyone have any conflicts!

There is one edge case when we enable thanos_objstore: true in some components (compactor, ruler, idx-gw and ingester) thanos metrics are getting registered more than once, causing duplicate panic. But this never happens with thanos_objstore: false which is default. Basically it's no-op if not enabled, tested in on some internal dev cells

The rationale for that is mainly if loki config have multiple period_configs, we create one object client per period_config and each will try to register the metrics collector in thanos currently. But in Loki currently we create those metrics once and make it part of transport so that all the object storage calls are tracked in single set of metrics.

I have raised an upstream PR to decouple metrics creation and instrumenting the bucket.

With this, we can create thanos metrics once and use it multiple instances of object clients. I have a changes ready to fix it in compactor, I will raise as separate PR (we can do similar things on other components as well)

All these can be done iteratively, meaning we can merge this PR without keeping it open.

Another important rationale is, this PR often get's outdated and needing some manual work. With RF=1 feature, people also touch object client API's so keeping this PR open going to make only worse with more frequent rebase & conflicts.

Looping some people for visibility @JoaoBraveCoding @slim-bean

@kavirajk kavirajk removed their assignment Sep 3, 2024
@JoaoBraveCoding JoaoBraveCoding force-pushed the log-4550-gcp branch 2 times, most recently from ec00fec to d51bbd5 Compare September 30, 2024 14:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/XXL type/docs Issues related to technical documentation; the Docs Squad uses this label across many repositories
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants