zalando-incubator / es-operator

Kubernetes Operator for Elasticsearch

Enable Interruption of Node Drain Retries by New Operations in es-operator #404

Closed A-Kamaee closed 4 months ago

A-Kamaee commented 5 months ago

Expected Behavior

The es-operator should keep retrying a node drain that has not yet succeeded. However, a new operation must be able to interrupt these retries so that they do not block critical scaling actions on the same ElasticsearchDataSet (EDS).

Actual Behavior

Currently, the es-operator retries a node drain up to 999 times, waiting 10-30 seconds between attempts. While this retry mechanism is generally beneficial, it does not take into account other operations that the es-operator needs to perform on the same EDS. As a result, essential actions such as scaling out or in are delayed, potentially leading to downtime or unnecessary cost. With these retry parameters the EDS can stay locked for approximately 6 hours (999 attempts at an average of about 20 seconds is roughly 5.5 hours), which significantly reduces operational flexibility and responsiveness.

Proposed Solution

Make the waitForEmptyEsNode function check for cancellation of the context that is already passed to it but currently not used. Once the context is honored, a drain retry can be interrupted, allowing higher-priority operations (such as scaling out) to proceed immediately. This ensures that essential modifications to the EDS are not delayed by lower-priority drain retries.
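
A minimal sketch of the idea, not the operator's actual code: the names drainWithRetry, errNotDrained and simulateInterruptedDrain are hypothetical, and the retry loop is hand-written here only to show where the context check would go. The loop checks the context before each attempt and also while waiting between attempts, so a cancellation takes effect without waiting for the remaining retries.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotDrained is a hypothetical sentinel: the node still holds shards.
var errNotDrained = errors.New("node still holds shards")

// drainWithRetry calls check until it succeeds, maxRetries is exhausted,
// or ctx is cancelled. It returns the number of attempts actually made.
func drainWithRetry(ctx context.Context, maxRetries int, wait time.Duration, check func() error) (int, error) {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		// Stop immediately if a newer operation has cancelled this one.
		if err := ctx.Err(); err != nil {
			return attempt - 1, err
		}
		if err := check(); err == nil {
			return attempt, nil
		}
		// Wait between attempts, but wake up early on cancellation.
		select {
		case <-ctx.Done():
			return attempt, ctx.Err()
		case <-time.After(wait):
		}
	}
	return maxRetries, errNotDrained
}

// simulateInterruptedDrain runs a drain that can never succeed and cancels
// it shortly afterwards, standing in for a new scaling operation on the EDS.
func simulateInterruptedDrain() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(50*time.Millisecond, cancel)

	neverDrains := func() error { return errNotDrained }
	// Intervals are shortened from 10-30 s to 10 ms for the demonstration.
	return drainWithRetry(ctx, 999, 10*time.Millisecond, neverDrains)
}

func main() {
	attempts, err := simulateInterruptedDrain()
	fmt.Printf("stopped after %d attempts: %v\n", attempts, err)
}
```

In this sketch the caller owns the cancel function, so whatever schedules a new operation for the same EDS would be responsible for cancelling the context of the operation in progress.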

Steps to Reproduce the Problem

  1. Create a cluster with the following characteristics:
    • minReplicas=1, maxReplicas=2
    • minIndexReplicas=0
    • One index with two shards, no replicas, and routing.allocation.total_shards_per_node: 1
  2. Observe the es-operator attempting to drain the second node, which fails because the total_shards_per_node limit prevents ES from placing both shards of the index on a single node.
  3. Initiate a scale-out event by modifying the EDS.
  4. Note that the scale-out is stalled because the es-operator does not handle context cancellation during node drain retries.
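
For step 1, the index settings can be expressed as a create-index request body. The sketch below only builds and prints that JSON; the function name reproIndexBody is hypothetical, and sending the body to the cluster (for example as a PUT to the index) is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// reproIndexBody returns a create-index body with two primary shards,
// no replicas, and at most one shard of the index per node.
func reproIndexBody() ([]byte, error) {
	body := map[string]map[string]int{
		"settings": {
			"index.number_of_shards":                         2,
			"index.number_of_replicas":                       0,
			"index.routing.allocation.total_shards_per_node": 1,
		},
	}
	// json.Marshal sorts map keys, so the output is deterministic.
	return json.Marshal(body)
}

func main() {
	body, err := reproIndexBody()
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

With both shards allocated on two nodes, no single remaining node may hold the second shard, which is why the drain in step 2 cannot complete.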

Specifications

otrosien commented 4 months ago

done.