
Ruler: optimised <prefix>/api/v1/rules and <prefix>/api/v1/alerts #3916

Merged

Conversation

Contributor

@pracucci pracucci commented Mar 5, 2021

What this PR does:
The ruler /api/prom/api/v1/rules endpoint is pretty slow in our clusters:
[Screenshot: 2021-03-05 at 17:03:15]

The /api/prom/api/v1/rules endpoint is handled by the ruler's API.PrometheusRules(). When sharding is enabled, rules are fetched sequentially from the rulers, so in this PR I'm proposing to fetch them concurrently from all rulers. Along the way, I've also introduced a client pool (like we already have in other places) and an integration test.
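
As an illustration of the fan-out described above, here is a minimal sketch of fetching rules concurrently from every ruler. This is not the PR's actual code: rulesClient, RulesRequest, RulesResponse and fetchRulesConcurrently are stand-ins for the real gRPC types and functions, and the real implementation may coordinate the goroutines differently.

package rulerexample

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// Stand-ins for the generated gRPC request/response messages.
type RulesRequest struct{}
type RulesResponse struct{ Groups []string }

// rulesClient is a stand-in for the generated ruler gRPC client interface.
type rulesClient interface {
	Rules(ctx context.Context, in *RulesRequest) (*RulesResponse, error)
}

// fetchRulesConcurrently issues one Rules() call per ruler in parallel instead
// of querying rulers one by one, and fails fast if any call returns an error.
func fetchRulesConcurrently(ctx context.Context, clients []rulesClient) ([]*RulesResponse, error) {
	g, ctx := errgroup.WithContext(ctx)
	responses := make([]*RulesResponse, len(clients))

	for i, c := range clients {
		i, c := i, c // capture loop variables for the goroutine
		g.Go(func() error {
			resp, err := c.Rules(ctx, &RulesRequest{})
			if err != nil {
				return err
			}
			responses[i] = resp
			return nil
		})
	}

	if err := g.Wait(); err != nil {
		return nil, err
	}
	return responses, nil
}

Compared to a sequential loop, the total latency becomes roughly that of the slowest ruler instead of the sum over all rulers.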

Which issue(s) this PR fixes:
N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Contributor

@gotjosh gotjosh left a comment

LGTM! I've also found a bug today that I believe is worth fixing as part of this.

integration/e2ecortex/client.go (outdated, resolved)
integration/ruler_test.go (outdated, resolved)
pkg/ruler/client_pool.go (outdated, resolved)
pkg/ruler/client_pool.go (outdated, resolved)
}
conn, err := grpc.DialContext(ctx, rlr.Addr, dialOpts...)

newGrps, err := grpcClient.(RulerClient).Rules(ctx, &RulesRequest{})
Contributor

I believe there's one more bug we need to fix, and I think we can test it by having the gRPC method Rules fail via escape failure.

It seems gRPC methods don't like it when you return nil as the pointer to the response struct they expect, because they call new() on it, e.g. *RulesResponse for Rules(ctx context.Context, in *RulesRequest) (*RulesResponse, error)

So, we need to do:

func (r *Ruler) Rules(ctx context.Context, in *RulesRequest) (*RulesResponse, error) {
	userID, err := tenant.TenantID(ctx)
	if err != nil {
		return &RulesResponse{}, fmt.Errorf("no user id found in context")
	}
	groupDescs, err := r.getLocalRules(userID)
	if err != nil {
		return &RulesResponse{}, err
	}
	return &RulesResponse{Groups: groupDescs}, nil
}

Contributor

An example error is:

 msg="GET /api/prom/api/v1/rules (500) 4.012183ms Response: \"{\\\"status\\\":\\\"error\\\",\\\"data\\\":null,\\\"errorType\\\":\\\"server_error\\\",\\\"error\\\":\\\"unable to retrieve rules from other rulers, rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil\\\"}\"

Contributor Author

I don't think the problem is the return value. We return nil, err in every gRPC function, so why should it be different here?

Contributor Author

error while marshaling: proto: Marshal called with nil

I've been able to reproduce it on the request side, by calling grpcClient.(RulerClient).Rules(ctx, nil) instead of the actual grpcClient.(RulerClient).Rules(ctx, &RulesRequest{}). On master we're calling .Rules(ctx, nil), and it's something we broke with the gRPC upgrade.
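
In other words, the fix on the request side is roughly the following (a sketch of the call site, not the exact diff from this PR):

// Broken after the gRPC upgrade: marshaling the nil request fails with
// "proto: Marshal called with nil".
// newGrps, err := grpcClient.(RulerClient).Rules(ctx, nil)

// Fixed: always pass a non-nil (possibly empty) request message.
newGrps, err := grpcClient.(RulerClient).Rules(ctx, &RulesRequest{})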

Contributor Author

See here:
#3917

Contributor

Good catch! I originally thought the problem was intermittent, so my hunch was on the return side.

CHANGELOG.md (outdated, resolved)
@pracucci pracucci force-pushed the optimise-ruler-getshardedrules branch from c00d1df to c8951cf on March 5, 2021 16:53
@pracucci pracucci mentioned this pull request Mar 5, 2021
Contributor

@pstibrany pstibrany left a comment

LGTM, thanks!

CHANGELOG.md Outdated
@@ -91,6 +91,10 @@
* `cortex_bucket_store_chunk_pool_returned_bytes_total`
* [ENHANCEMENT] Alertmanager: load alertmanager configurations from object storage concurrently, and only load necessary configurations, speeding configuration synchronization process and executing fewer "GET object" operations to the storage when sharding is enabled. #3898
* [ENHANCEMENT] Blocks storage: Ingester can now stream entire chunks instead of individual samples to the querier. At the moment this feature must be explicitly enabled either by using `-ingester.stream-chunks-when-using-blocks` flag or `ingester_stream_chunks_when_using_blocks` (boolean) field in runtime config file, but these configuration options are temporary and will be removed when feature is stable. #3889
* [ENHANCEMENT] Ruler: optimized `<prefix>/api/v1/rules` and `<prefix>/api/v1/alerts` when ruler sharding is enabled. #3916
Contributor

Cortex release 1.8.0 is now in progress. Could you please rebase on master and move the CHANGELOG entry under the master / unreleased section?

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the optimise-ruler-getshardedrules branch from c8951cf to c1f634a on March 8, 2021 08:05
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci merged commit ebef1e1 into cortexproject:master Mar 8, 2021
@pracucci pracucci deleted the optimise-ruler-getshardedrules branch March 8, 2021 08:38
roystchiang pushed a commit to roystchiang/cortex that referenced this pull request Apr 6, 2022
Ruler: optimised <prefix>/api/v1/rules and <prefix>/api/v1/alerts (cortexproject#3916)

* Use a grpc clients pool in the ruler

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Concurrently fetch rules from all rulers

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added subservices manager

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed Rules() grpc call

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added integration test

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Added CHANGELOG entry

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Addressed review comments

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Fixed CHANGELOG

Signed-off-by: Marco Pracucci <marco@pracucci.com>