Feature request: Deregister terminating nodes from ALB to avoid 5xx errors

adri commented 1 year ago

Is your feature request related to a problem? Please describe.

We've noticed in our production environment that we need something to deregister nodes from load balancers as part of the draining procedure, before the instance is terminated.

We were using cluster-autoscaler (based on auto-scaling groups) in combination with lifecycle-manager for this. Recently we switched to Karpenter, which doesn't require auto-scaling groups (ASG). Since there is no ASG lifecycle hook to wait for the ALB to drain connections to the node before terminating it, we see 5xx errors when nodes terminate before they are deregistered from an ALB.

Describe the solution you would like

Since this PR, Karpenter adds the node label node.kubernetes.io/exclude-from-external-load-balancers when a node is going to shut down.

The behavior we're looking for is that kube-ingress-aws-controller reacts to this label being added, sends a deregistration request, and then waits for the deregistration to finish before marking the instance as ready to terminate (if possible).
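To make the requested flow concrete, here is a rough Python sketch of the idea, not how kube-ingress-aws-controller actually works. The node dictionaries, the function names, and the injected `elbv2` client are assumptions made for illustration; the client is expected to look like a boto3 `elbv2` client (`deregister_targets` and the `target_deregistered` waiter). The sketch also assumes that the mere presence of the label, whatever its value, means the node should be excluded.

```python
EXCLUDE_LABEL = "node.kubernetes.io/exclude-from-external-load-balancers"


def instances_to_deregister(nodes, registered_ids):
    """Return instance IDs that are still registered but whose node has the label.

    `nodes` is a hypothetical list of dicts shaped like
    {"instance_id": "i-...", "labels": {...}}.
    """
    result = []
    for node in nodes:
        labels = node.get("labels", {})
        instance_id = node.get("instance_id")
        # Assumption: label presence alone marks the node as excluded.
        if EXCLUDE_LABEL in labels and instance_id in registered_ids:
            result.append(instance_id)
    return result


def deregister_and_wait(elbv2, target_group_arn, instance_ids):
    """Deregister the instances, then block until deregistration has finished."""
    if not instance_ids:
        return
    targets = [{"Id": instance_id} for instance_id in instance_ids]
    elbv2.deregister_targets(TargetGroupArn=target_group_arn, Targets=targets)
    # The waiter polls the target health until the targets are gone.
    waiter = elbv2.get_waiter("target_deregistered")
    waiter.wait(TargetGroupArn=target_group_arn, Targets=targets)
```

How the controller would then signal "ready to terminate" back to Karpenter is left open here, since that part depends on what Karpenter supports.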

Describe alternatives you've considered (optional)

We're considering writing our own script to remove instance tags from the EC2 instance that kube-ingress-aws-controller uses to filter nodes that are added to the ALB.
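Such a script could be as small as the following Python sketch. The function name and the `tag_keys` parameter are hypothetical, and the tag keys the controller filters on depend on its configuration, so they are passed in rather than hard-coded. The `ec2` argument is assumed to behave like a boto3 `ec2` client, whose `delete_tags` call removes a tag regardless of its value when only the key is given.

```python
def remove_filter_tags(ec2, instance_id, tag_keys):
    """Remove the given tag keys from one EC2 instance.

    `tag_keys` must be the keys the controller is configured to filter on;
    they are an input here because this sketch does not know them.
    """
    tags = [{"Key": key} for key in tag_keys]
    if not tags:
        return tags
    ec2.delete_tags(Resources=[instance_id], Tags=tags)
    return tags
```

This only stops the controller from selecting the instance on its next reconciliation; it does not by itself wait for the ALB to finish draining connections.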

mikkeloscar commented 1 year ago

Hi @adri, thanks for writing up the feature request. We're currently also experimenting with Karpenter at Zalando, so this could be interesting for us in the future. I can't set any expectation on when we would look into this, but PRs in this direction are of course welcome.

adri commented 1 year ago

@mikkeloscar Using Karpenter turned out to be a big cost-saver, especially when using spot instances. No idea if it's equally good for your infra, but I'd highly recommend it.

adri commented 10 months ago

We switched to AWS CNI Mode and now use a Deployment instead of a DaemonSet. This way pods get deregistered from the target group when Karpenter decides to shut down a node. Great feature! Thanks for that 👏

461

szuecs commented 10 months ago

Maybe we should also switch :)

@universam1 ^ A happy user.

universam1 commented 10 months ago

We run that very setup exclusively and consider it stable. 👍

zalando-incubator / kube-ingress-aws-controller

Feature request: Deregister terminating nodes from ALB to avoid 5xx errors #604
