Indexing flow enhancements for admission controller #8911

ajaymovva · 2023-07-27T06:34:39Z

Overview

The parent RFC #8910 discusses the admission control framework, which limits and restricts the incoming requests early when a node begins to go under stress.

OpenSearch today provides few gating mechanisms to protect a node when under duress, via the concepts of queue rejections and circuit breakers. However, queue sizes are fixed and isolated and do not effectively represent the total work required to be done. Similarly, Circuit breakers act as a last line of defence, are mostly too late to act upon, and do not offer fairness. Some of these gaps create availability issues with the cluster when under duress due to hardware failures, node performance degradation, or traffic bursts.

Shard Indexing Pressure today tries to address this to some extent by rejecting indexing requests based on the memory accounting at the shard level along with other key performance factors like throughput and last successful requests.

Challenges in the Shard Indexing BackPressure

Shard Indexing Backpressure mechanisms only account for memory usage and throughput degradation of the shards / nodes to reject requests. They don’t consider other resource utilisation parameters such as CPU , I/O etc. Refer to similar [RFC] Shard Indexing backpressure mechanism should also protect from any CPU contention on nodes #7638.
The coordinator rejects requests based on the local view it constructs for downstream nodes and rejects if memory utilisation for the shard is breached. But it does not consider the collective resource contention on the target node based on requests from other coordinator nodes.

Proposed Solution

The admission controller framework we are suggesting will track the resource utilisation for all the downstream nodes and maintain these stats at the coordinator. Along with the shard indexing backpressure, which tracks the shard level view for downstream nodes, we are proposing a new solution that will give the node-level performance stats for all the downstream nodes.

In the admission controller framework, we will support multiple levels of rejection, such as rejection at the coordinator based on the coordinator's or downstream nodes resource utilisation and rejection at the target node based on the target node's resource utilisation. As part of it, we will enhance the indexing flow to proactively reject requests if the data nodes are stressed.

Reject the Incoming Indexing request based on the resource utilization of downstream nodes

Proposed Indexing Flow

For every bulk shard request, we will evaluate the performance stats of all nodes that are associated with primary or replica shards of the indexing operation and proceed with the request either by allowing it or rejecting it.

Primary Shard Node is in Stress

When the primary shard node is in stress, we will reject the whole bulk shard request at the coordinator only.

Replica Shard Node is in Stress

When all the replica nodes are in stress, we will reject the whole bulk shard reject at the coordinator only.
When few replica nodes are in stress we will allow the request to proceed from the coordinator as the primary and more than one replica nodes are not in stress.
While processing replica request at target node we will evaluate the resource utilization and proceed with replication action.

This will be a further extension to node-level rejections based on nodeLimitBreached at the coordinator which now consider rejections based on the CPU/JVM/IO usage of the downstream nodes.

Rejection of Incoming Indexing requests at target node

The coordinator will allow the requests to the stressed target nodes, when it doesn’t have the latest stats for the downstream nodes. In such cases, we will reject requests at target/data nodes.

Primary Shard Node is in Stress

We will extend existing shard indexing backpressure and add CPU/JVM limits to isPrimaryNodeLimitBreached method.

Replica Shard Node is in Stress

Similar to the shard indexing backpressure, we will have a higher threshold for replica requests compared to primary requests and reject the requests at the target node based on the CPU/JVM/IO usage of that node. This mechanism is to make sure to fail fast the indexing requests (rejecting primary) and to reduce shard failures as well (high threshold to replica). This will also be an extension to the node-level rejections based on the replicaLimitBreached at the target node based on resource usage.
Replica shard failure - If we fail the replica request on any one of the nodes, there will be shard failures, so it will be a configurable option to fail the replica or not. All the threshold limits for the requests are dynamically configurable, so as a default option, we will try to make sure there won’t be any data loss and nodes are not under duress. Even today, the shard indexing backpressure can reject the replica action at the replica node when the threshold limit breaches.

Co-authored by @bharath-techie

ajaymovva added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 27, 2023

bharath-techie mentioned this issue Jul 27, 2023

[RFC] Admission Controller framework for OpenSearch #8910

Open

Xtansia added the Indexing Indexing, Bulk Indexing and anything related to indexing label Aug 14, 2023

anasalkouz removed the untriaged label Aug 29, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Indexing flow enhancements for admission controller #8911

Indexing flow enhancements for admission controller #8911

ajaymovva commented Jul 27, 2023

Indexing flow enhancements for admission controller #8911

Indexing flow enhancements for admission controller #8911

Comments

ajaymovva commented Jul 27, 2023

Overview

Challenges in the Shard Indexing BackPressure

Proposed Solution

Reject the Incoming Indexing request based on the resource utilization of downstream nodes

Proposed Indexing Flow

Rejection of Incoming Indexing requests at target node