Expected Behavior
The es-operator should keep retrying a node drain that has not yet succeeded, but a newer operation on the same ElasticsearchDataSet (EDS) must be able to override those retries so that critical scaling actions are not blocked.
Actual Behavior
Currently, the es-operator retries a node drain up to 999 times, waiting 10-30 seconds between attempts. The retry itself is useful, but it takes no account of other operations the es-operator needs to perform on the same EDS. Essential actions such as scaling out or in are therefore delayed, which can lead to downtime or unnecessary cost. With these retry parameters, a single failing drain can lock the EDS for approximately 6 hours.
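As a rough check on the 6-hour figure, the sketch below multiplies the retry count by the wait between attempts. It assumes the time spent in each drain request is negligible compared with the wait; `lockHours` is a hypothetical helper written for this issue, not part of the es-operator.

```go
package main

import "fmt"

// lockHours estimates how long a failing drain holds the EDS, in hours,
// assuming every retry is followed by a wait of intervalSeconds.
// Hypothetical helper for illustration only.
func lockHours(retries int, intervalSeconds float64) float64 {
	return float64(retries) * intervalSeconds / 3600
}

func main() {
	// 10 s and 30 s are the bounds quoted above; 20 s is their midpoint.
	for _, interval := range []float64{10, 20, 30} {
		fmt.Printf("%2.0f s between retries: about %.2f hours\n",
			interval, lockHours(999, interval))
	}
}
```

The bounds come out at roughly 2.8 and 8.3 hours, with about 5.5 hours at the 20-second midpoint, which is consistent with the "approximately 6 hours" observed.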
Steps to Reproduce the Problem
1. Create a cluster with the following characteristics:
   - minReplicas=1, maxReplicas=2
   - minIndexReplicas=0
   - One index with two shards, no replicas, and routing.allocation.total_shards_per_node: 1
2. Observe the es-operator attempting to drain the second node, which fails because ES rejects placing more than one shard of the same index onto a single node.
3. Initiate a scale-out event by modifying the EDS.
4. Note that the scale-out is stalled because the es-operator does not handle context cancellation during node drain retries.
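For the index in the reproduction steps, the settings could be expressed as the body of an Elasticsearch create-index request. The sketch below only builds that JSON body; `reproIndexBody` is a hypothetical helper, and the index name and the way the request is sent are left to whoever reproduces the issue.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// reproIndexBody returns a create-index request body with two primary shards,
// no replicas, and at most one shard of the index per node.
// Hypothetical helper for illustration only.
func reproIndexBody() (string, error) {
	body := map[string]map[string]int{
		"settings": {
			"index.number_of_shards":                         2,
			"index.number_of_replicas":                       0,
			"index.routing.allocation.total_shards_per_node": 1,
		},
	}
	raw, err := json.Marshal(body)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	body, err := reproIndexBody()
	if err != nil {
		panic(err)
	}
	// Send this as the body of PUT /<index-name> against the cluster.
	fmt.Println(body)
}
```

With these settings, both shards cannot fit on a single node, so the drain of the second node can never complete.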
Proposed Solution
Make the waitForEmptyEsNode function check for cancellation on the context it already receives but does not use. With the context honoured, a drain retry can be interrupted so that higher-priority operations (such as scaling out) proceed immediately, and essential modifications to the EDS are no longer delayed by lower-priority drain retries.
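A minimal sketch of the idea, using only the Go standard library, might look like the following. It is not the operator's actual code: `waitForEmpty`, `errNotEmpty`, `demo`, and the millisecond timings are all hypothetical stand-ins chosen so the example runs quickly. The point is the `select` on `ctx.Done()` between attempts, which lets a cancelled context end the retries early.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotEmpty is a hypothetical sentinel meaning "the node still holds shards".
var errNotEmpty = errors.New("node still holds shards")

// waitForEmpty calls check until it succeeds, maxAttempts (at least 1) is
// reached, or ctx is cancelled. It returns the number of attempts made.
// Hypothetical stand-in for the drain retry loop, for illustration only.
func waitForEmpty(ctx context.Context, maxAttempts int, interval time.Duration,
	check func(context.Context) error) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		// Do not start another attempt once the context is cancelled.
		if err := ctx.Err(); err != nil {
			return attempt - 1, err
		}
		if lastErr = check(ctx); lastErr == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break
		}
		// Wait for the next retry, but wake up immediately on cancellation.
		timer := time.NewTimer(interval)
		select {
		case <-ctx.Done():
			timer.Stop()
			return attempt, ctx.Err()
		case <-timer.C:
		}
	}
	return maxAttempts, fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

// demo simulates a drain that can never succeed and a newer EDS operation
// that cancels it shortly after it starts.
func demo() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	time.AfterFunc(35*time.Millisecond, cancel)
	neverEmpty := func(context.Context) error { return errNotEmpty }
	return waitForEmpty(ctx, 999, 10*time.Millisecond, neverEmpty)
}

func main() {
	attempts, err := demo()
	fmt.Printf("stopped after %d attempts: %v\n", attempts, err)
}
```

In this sketch the loop stops after a handful of attempts instead of exhausting all 999. The caller can distinguish cancellation from a genuine failure with `errors.Is(err, context.Canceled)` and then move on to the newer operation.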
Specifications