
Ruler: optimised <prefix>/api/v1/rules and <prefix>/api/v1/alerts #3916

Merged

Conversation

Contributor

@pracucci pracucci commented Mar 5, 2021

What this PR does:
The ruler /api/prom/api/v1/rules endpoint is pretty slow in our clusters:
[Screenshot: 2021-03-05 at 17:03:15]

The /api/prom/api/v1/rules endpoint is handled by the ruler's API.PrometheusRules(). When sharding is enabled, rules are fetched sequentially from the rulers, so in this PR I'm proposing to fetch them concurrently from all rulers. Along the way, I've also introduced a client pool (like we already have in other places) and an integration test.
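
As an illustration of the fan-out described above, here is a minimal sketch of fetching rules concurrently from every ruler. This is not the PR's actual code: rulesClient, RulesRequest, RulesResponse and fetchRulesConcurrently are stand-ins for the real gRPC types and functions, and the real implementation may coordinate the goroutines differently.

package rulerexample

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Stand-ins for the generated gRPC request/response messages.
type RulesRequest struct{}
type RulesResponse struct{ Groups []string }

// rulesClient is a stand-in for the generated ruler gRPC client interface.
type rulesClient interface {
	Rules(ctx context.Context, in *RulesRequest) (*RulesResponse, error)
}

// fetchRulesConcurrently issues one Rules() call per ruler in parallel instead
// of querying rulers one by one, and fails fast if any call returns an error.
func fetchRulesConcurrently(ctx context.Context, clients []rulesClient) ([]*RulesResponse, error) {
	g, ctx := errgroup.WithContext(ctx)
	responses := make([]*RulesResponse, len(clients))

	for i, c := range clients {
		i, c := i, c // capture loop variables for the goroutine
		g.Go(func() error {
			resp, err := c.Rules(ctx, &RulesRequest{})
			if err != nil {
				return err
			}
			responses[i] = resp
			return nil
		})
	}

	if err := g.Wait(); err != nil {
		return nil, err
	}
	return responses, nil
}

Compared to a sequential loop, the total latency becomes roughly that of the slowest ruler instead of the sum over all rulers.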

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Contributor

@gotjosh gotjosh left a comment

LGTM! I've also found a bug today that I believe is worth fixing as part of this.

integration/e2ecortex/client.go (outdated, resolved)
integration/ruler_test.go (outdated, resolved)
pkg/ruler/client_pool.go (outdated, resolved)
pkg/ruler/client_pool.go (outdated, resolved)
}
conn, err := grpc.DialContext(ctx, rlr.Addr, dialOpts...)

newGrps, err := grpcClient.(RulerClient).Rules(ctx, &RulesRequest{})
Contributor

I believe there's one more bug we need to fix, and I think we can test it by having the gRPC method Rules fail via escape failure.

It seems gRPC methods don't like it when you return nil as the pointer to the response struct they expect, because they call new() on it, e.g. *RulesResponse for Rules(ctx context.Context, in *RulesRequest) (*RulesResponse, error)

So, we need to do:

func (r *Ruler) Rules(ctx context.Context, in *RulesRequest) (*RulesResponse, error) {
	userID, err := tenant.TenantID(ctx)
	if err != nil {
		return &RulesResponse{}, fmt.Errorf("no user id found in context")
	}
	groupDescs, err := r.getLocalRules(userID)
	if err != nil {
		return &RulesResponse{}, err
	}
	return &RulesResponse{Groups: groupDescs}, nil
}

Contributor

An example error is:

 msg="GET /api/prom/api/v1/rules (500) 4.012183ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"data\\\":null,\\\"errorType\\\":\\\"server_error\\\",\\\"error\\\":\\\"unable to retrieve rules from other rulers, rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil\\\"}\"

Contributor Author

I don't think the problem is the return value. We return nil, err in every gRPC function, so why should it be different here?

Contributor Author

error while marshaling: proto: Marshal called with nil

I've been able to reproduce it on the request side, by calling grpcClient.(RulerClient).Rules(ctx, nil) instead of the actual grpcClient.(RulerClient).Rules(ctx, &RulesRequest{}). On master we're calling .Rules(ctx, nil), and it's something we broke with the gRPC upgrade.
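
In other words, the fix on the request side is roughly the following (a sketch of the call site, not the exact diff from this PR):

// Broken after the gRPC upgrade: marshaling the nil request fails with
// "proto: Marshal called with nil".
// newGrps, err := grpcClient.(RulerClient).Rules(ctx, nil)

// Fixed: always pass a non-nil (possibly empty) request message.
newGrps, err := grpcClient.(RulerClient).Rules(ctx, &RulesRequest{})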

Contributor Author

See here:
#3917

Contributor

Good catch! I originally thought the problem was intermittent, so my hunch was on the return side.

CHANGELOG.md (outdated, resolved)
@pracucci pracucci force-pushed the optimise-ruler-getshardedrules branch from c00d1df to c8951cf on March 5, 2021 16:53
@pracucci pracucci mentioned this pull request Mar 5, 2021
Contributor

@pstibrany pstibrany left a comment

LGTM, thanks!

CHANGELOG.md Outdated
@@ -91,6 +91,10 @@
* `cortex_bucket_store_chunk_pool_returned_bytes_total`
* [ENHANCEMENT] Alertmanager: load alertmanager configurations from object storage concurrently, and only load necessary configurations, speeding configuration synchronization process and executing fewer "GET object" operations to the storage when sharding is enabled. #3898
* [ENHANCEMENT] Blocks storage: Ingester can now stream entire chunks instead of individual samples to the querier. At the moment this feature must be explicitly enabled either by using `-ingester.stream-chunks-when-using-blocks` flag or `ingester_stream_chunks_when_using_blocks` (boolean) field in runtime config file, but these configuration options are temporary and will be removed when feature is stable. #3889
* [ENHANCEMENT] Ruler: optimized `<prefix>/api/v1/rules` and `<prefix>/api/v1/alerts` when ruler sharding is enabled. #3916
Contributor

Cortex release 1.8.0 is now in progress. Could you please rebase on master and move the CHANGELOG entry under the master / unreleased section?

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the optimise-ruler-getshardedrules branch from c8951cf to c1f634a on March 8, 2021 08:05
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci merged commit ebef1e1 into cortexproject:master Mar 8, 2021
@pracucci pracucci deleted the optimise-ruler-getshardedrules branch March 8, 2021 08:38
roystchiang pushed a commit to roystchiang/cortex that referenced this pull request Apr 6, 2022
Ruler: optimised <prefix>/api/v1/rules and <prefix>/api/v1/alerts (cortexproject#3916)

* Use a grpc clients pool in the ruler

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Concurrently fetch rules from all rulers

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added subservices manager

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed Rules() grpc call

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added integration test

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Addressed review comments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed CHANGELOG

Signed-off-by: Marco Pracucci <marco@pracucci.com>