Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement stateless mode for Thanos Ruler (continue #4250) #4731

Merged
merged 26 commits into from
Nov 8, 2021

Conversation

yeya24
Copy link
Contributor

@yeya24 yeya24 commented Oct 3, 2021

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

This pr is based on top of the awesome pr #4250 by @idoqo and fixes E2E tests.

Verification

E2E tests should pass.

@yeya24
Copy link
Contributor Author

yeya24 commented Oct 3, 2021

prometheus/prometheus#9109 requires some modification on the vendored grafana agent wal package.

/home/runner/work/thanos/thanos/pkg/rules/remotewrite/wal_test.go:75:29: cannot use &collector (value of type *walDataCollector) as wal.WriteTo value in struct literal: missing method UpdateSeriesSegment

Maybe let's wait for the upgrade from the agent side? @rfratto

@rfratto
Copy link

rfratto commented Oct 4, 2021

Maybe let's wait for the upgrade from the agent side? @rfratto

I'm going to be bumping the Prometheus dependency in grafana/agent ideally this week and expect to also run into this. I'll tag you @yeya24 once we have the PR to update the code to resolve the error.

@rfratto
Copy link

rfratto commented Oct 8, 2021

I'm going to be bumping the Prometheus dependency in grafana/agent ideally this week and expect to also run into this. I'll tag you @yeya24 once we have the PR to update the code to resolve the error.

Hey y'all, I wanted to give a quick update on this. I've put effort into updating the Grafana Agent's Prometheus dependency this week, but it's going to take some more time. The bump also requires a bump to github.com/prometheus/common, which is currently blocked by a few of the exporters we embed still using the (since removed) github.com/prometheus/common/log package. Such is the life of a project that has a lot of dependencies :)

We've definitely made progress here, but it might take another week before I can update the WAL code. I'll still tag y'all when we have something.

@yeya24 yeya24 force-pushed the remote-write-rule branch from 3a79087 to 2390e96 Compare October 28, 2021 20:06
@yeya24
Copy link
Contributor Author

yeya24 commented Oct 28, 2021

This pr is ready for review!

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits and questions. Good work! Do we still need pkg/rules/remotewrite/{wal,series}.go with prometheus/prometheus#8785 merged? ;P It says:

// TODO(idoqo): Migrate to prometheus package when https://github.com/prometheus/prometheus/pull/8785 is ready.

CHANGELOG.md Outdated Show resolved Hide resolved
pkg/rules/remotewrite/wal_test.go Outdated Show resolved Hide resolved
pkg/rules/remotewrite/wal_test.go Outdated Show resolved Hide resolved
// TestRule_CanPersistWALData checks that in stateless mode, Thanos Ruler can persist rule evaluations
// which couldn't be sent to the remote write endpoint (e.g because receiver isn't available).
func TestRule_CanPersistWALData(t *testing.T) {
//TODO: Implement test with unavailable remote-write endpoint(receiver)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to implement this before merging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is not a blocker as we are using the upstream prom agent code already so data should be persisted. Not sure how to implement this right now though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@yeya24 yeya24 force-pushed the remote-write-rule branch 2 times, most recently from 5979b10 to 23988c5 Compare November 2, 2021 05:40
@yeya24
Copy link
Contributor Author

yeya24 commented Nov 2, 2021

Just some nits and questions. Good work! Do we still need pkg/rules/remotewrite/{wal,series}.go with prometheus/prometheus#8785 merged? ;P It says:

It seems that we can totally remove the pkg/rules/remotewrite package after using the upstream Prometheus agent code 😄

@yeya24 yeya24 force-pushed the remote-write-rule branch 2 times, most recently from be3ab7e to ae51c39 Compare November 2, 2021 06:10
idoqo and others added 19 commits November 7, 2021 00:37
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
This removes the need to pass a separate `remote-write` flag to ruler
to enable stateless mode. Instead, we now check if a remote-write config is provided
and automatically enables stateless mode based off that.

Ruler test is also cleaned up to remove unnecessary
tests (i.e those that have been performed by other e2e suites).

Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Michael Okoko <okokomichaels@outlook.com>
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
yeya24 and others added 6 commits November 7, 2021 00:37
Signed-off-by: yeya24 <yb532204897@gmail.com>
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
Signed-off-by: Ben Ye <ben.ye@bytedance.com>
@yeya24 yeya24 force-pushed the remote-write-rule branch from ae51c39 to fb0dfba Compare November 7, 2021 07:38
@yeya24
Copy link
Contributor Author

yeya24 commented Nov 8, 2021

@GiedriusS Let me know if there is anything left in this pr. If it is small nits then I guess we can clean up later. 😄

GiedriusS
GiedriusS previously approved these changes Nov 8, 2021
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few very small nits but after that, we can merge this IMO 👍

}, func(error) {
close(done)
})
db, err := agent.Open(logger, reg, remoteStore, walDir, agentOpts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we please call this something other than db to avoid shadowing the outer variable? This is confusing and I had to double check this place 😂 I suggest agentDB or something like that, and to rename the outer db variable into tsdbDB.

grpcserver.WithTLSConfig(tlsCfg),
}
if db != nil {
tsdbStore := store.NewTSDBStore(logger, db, component.Rule, conf.lset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the future maybe we could also answer StoreAPI queries by using the data in the WAL? Perhaps worth adding a TODO here 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this. If the agent is able to query head data, then what's the difference between the agent and a prometheus with very low retention? I am not sure whether this is in their roadmap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point 👍

cmd/thanos/rule.go Outdated Show resolved Hide resolved
test/e2e/rule_test.go Outdated Show resolved Hide resolved
test/e2e/rule_test.go Show resolved Hide resolved
// TestRule_CanPersistWALData checks that in stateless mode, Thanos Ruler can persist rule evaluations
// which couldn't be sent to the remote write endpoint (e.g because receiver isn't available).
func TestRule_CanPersistWALData(t *testing.T) {
//TODO: Implement test with unavailable remote-write endpoint(receiver)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Signed-off-by: Ben Ye <ben.ye@bytedance.com>
@yeya24
Copy link
Contributor Author

yeya24 commented Nov 8, 2021

Hi @GiedriusS, I have addressed the comments in the latest commit.

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end, it became a very small pull request with all of the agent code in upstream Prometheus 💪 awesome!

@onprem
Copy link
Member

onprem commented Nov 9, 2021

Great work! 🎉

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @yeya24 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants