zalando-incubator / kube-ingress-aws-controller

Configures AWS Load Balancers according to Kubernetes Ingress resources
MIT License

CrashLoopBackOff due to EC2MetadataError: failed to make EC2Metadata request, status code: 401 #455

Open SarumObjects opened 2 years ago

SarumObjects commented 2 years ago

I've followed the guidelines here: https://github.com/zalando-incubator/kube-ingress-aws-controller/blob/master/deploy/kops.md, but kube-ingress-aws-controller restarts every 2-10 minutes. When I follow the log of the pod I get this error: "EC2MetadataError: failed to make EC2Metadata request". I have rebuilt and deleted the cluster several times and cannot create the load balancer or target groups, although I have in the past. One of our clusters is still running, so I have compared it in detail and found no differences except in the name.

We are blocked. This is our development environment. The instances have public & private IPs and the VPCs & SGs have been generated correctly.

Where should I look now please? John

AlexanderYastrebov commented 2 years ago

Hello. What is the controller version you are using? Could you provide a more detailed error log message?

szuecs commented 2 years ago

@SarumObjects What kind of controller version and AWS integration do you use? Kube2iam and all the others had issues with https://github.com/jtblin/kube2iam/pull/130. The error message looks like https://github.com/aws/aws-sdk-go/issues/870, which is quite old and should be fixed by recent Kubernetes AWS IAM integrations.

SarumObjects commented 2 years ago

@szuecs v0.12 (I downloaded :latest), and I created the cluster with kops (1.22.2). I've built several similar clusters in the last 24 months (we're running one as prod) and I have burned and rebuilt a QA cluster (same script) some 4 times. The cluster validates successfully, but when I install kube-ingress-aws-controller/skipper (same manifest as our prod cluster, different name) I get this error: "EC2MetadataError: failed to make EC2Metadata request". @AlexanderYastrebov: this is the total log! I don't know how to debug this controller. I've searched the documentation for 'debug' and 'verbose', and I have been stuck for over a week.

szuecs commented 2 years ago

@SarumObjects I think just pasting the logs here up until the crash would be great!

Latest version meaning v0.12.12? We just merged updates to the aws-sdk; maybe you want to try v0.12.14 when it's released in a few minutes (automated process).

It would also be interesting if you could paste the output of kubectl describe pods kube-ingress-aws-controller-....

We don't really have knowledge about kops. Is the version you are referring to the same as the Kubernetes version?

SarumObjects commented 2 years ago

@szuecs the 'latest' still restarts. Here's the output from kubectl describe pods kube-ingress-aws-controller-..: kiac-describe.txt

kops is version 1.22.2.

kubectl version:

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:41:42Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:42:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}

AlexanderYastrebov commented 2 years ago

@SarumObjects Could you get ingress controller logs as well (kubectl logs kube-ingress-aws-controller-...)?

SarumObjects commented 2 years ago

@AlexanderYastrebov this is the command and the complete log:

kubectl -n kube-system logs -f kube-ingress-aws-controller-5fbcd9fff8-vqrvg
time="2021-12-03T11:52:45Z" level=info msg="starting /kube-ingress-aws-controller v0.12.14"
time="2021-12-03T11:54:48Z" level=fatal msg="EC2MetadataError: failed to make EC2Metadata request\n\n\tstatus code: 401, request id:

AlexanderYastrebov commented 2 years ago

Could you try to run with the --debug option (it would print more details in the logs)? 401 suggests some kind of problem with AWS credentials.

SarumObjects commented 2 years ago

there's no --debug at the command line.

AlexanderYastrebov commented 2 years ago

there's no --debug at the command line.

~$ docker run -it --rm registry.opensource.zalan.do/teapot/kube-ingress-aws-controller:latest --help
INFO[0000] starting /kube-ingress-aws-controller v0.12.14 
usage: kube-ingress-aws-controller [<flags>]

Flags:
  --help                         Show context-sensitive help (also try --help-long and --help-man).
  --version                      Print version and exit
  --debug                        Enables debug logging level
...

https://github.com/zalando-incubator/kube-ingress-aws-controller/blob/5c661370ed2aa4cf2a0f4164f5a9114c80fe2c84/controller.go#L191
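
For reference, one way to turn the flag on is to append --debug to the controller container's arguments. The patch below is only a sketch: it assumes the deployment is named kube-ingress-aws-controller in kube-system and already defines an args array, so adjust names and paths to your manifest.

# Hypothetical example: append --debug to the existing container args
kubectl -n kube-system patch deployment kube-ingress-aws-controller \
  --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--debug"}]'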

SarumObjects commented 2 years ago

kubectl -n kube-system logs -f pod/kube-ingress-aws-controller-65775b947-dx9tl --ignore-errors=false

time="2021-12-03T14:50:47Z" level=debug msg=aws.NewAdapter
time="2021-12-03T14:50:47Z" level=debug msg=aws.ec2metadata.GetMetadata
2021/12/03 14:50:47 DEBUG: Request ec2metadata/GetToken Details:
---[ REQUEST POST-SIGN ]-----------------------------
PUT /latest/api/token HTTP/1.1
Host: 169.254.169.254
User-Agent: aws-sdk-go/1.42.16 (go1.17.1; linux; amd64)
Content-Length: 0
X-Aws-Ec2-Metadata-Token-Ttl-Seconds: 21600
Accept-Encoding: gzip

time="2021-12-03T14:50:47Z" level=info msg="starting /kube-ingress-aws-controller v0.12.14"
2021/12/03 14:52:50 DEBUG: Send Request ec2metadata/GetToken failed, attempt 0/3, error RequestError: send request failed
caused by: Put "http://169.254.169.254/latest/api/token": read tcp 100.96.4.21:34662->169.254.169.254:80: read: connection reset by peer
2021/12/03 14:52:50 DEBUG: Request ec2metadata/GetMetadata Details:
---[ REQUEST POST-SIGN ]-----------------------------
GET /latest/meta-data/instance-id HTTP/1.1
Host: 169.254.169.254
User-Agent: aws-sdk-go/1.42.16 (go1.17.1; linux; amd64)
Accept-Encoding: gzip

2021/12/03 14:52:50 DEBUG: Response ec2metadata/GetMetadata Details:
---[ RESPONSE ]--------------------------------------
HTTP/1.1 401 Unauthorized
Connection: close
Content-Type: text/plain
Date: Fri, 03 Dec 2021 14:52:50 GMT
Server: EC2ws
Content-Length: 0

2021/12/03 14:52:50 DEBUG: Validate Response ec2metadata/GetMetadata failed, attempt 0/3, error EC2MetadataError: failed to make EC2Metadata request

	status code: 401, request id:

time="2021-12-03T14:52:50Z" level=fatal msg="EC2MetadataError: failed to make EC2Metadata request\n\n\tstatus code: 401, request id: "

szuecs commented 2 years ago

This log here:

 caused by: Put "http://169.254.169.254/latest/api/token": read tcp 100.96.4.21:34662->169.254.169.254:80: read: connection reset by peer

169.254.169.254 is the metadata service by AWS. It sent a TCP RST packet, instead of sending us the data required to access AWS APIs.

What Kubernetes IAM integration do you use? To me this does not look like an issue in the controller, but rather in AWS or in the IAM integration that should help with getting the IAM credentials. Maybe your EC2 nodes also don't have the right permissions to access the metadata service and to call AWS APIs with sts::assumeRole, which is required for all Kubernetes AWS IAM integrations.
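
One way to check this (an assumption on my part, not something stated in the thread) is to look at the instance metadata options of the worker node with the AWS CLI: HttpTokens and HttpPutResponseHopLimit show whether IMDSv2 tokens are enforced and how many network hops a metadata response may travel, which matters for pods reaching IMDS through the node.

# Hypothetical check: replace the instance id with the node the pod runs on
aws ec2 describe-instances \
    --instance-ids i-1234567898abcdef0 \
    --query 'Reservations[].Instances[].MetadataOptions' \
    --output json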

SarumObjects commented 2 years ago

That's helpful. I'll look into the IAM permissions.

szuecs commented 2 years ago

@SarumObjects let us know what the error was to share with other folks that might find this issue. After that we can close it.

SarumObjects commented 2 years ago

still investigating: https://kops.sigs.k8s.io/releases/1.22-notes/

SarumObjects commented 2 years ago

In the end, I simply had to change the Nodes.instanceMetadata from httpPutResponseHopLimit: 1 to httpPutResponseHopLimit: 3, and then the metadata query can run. But I'm blocked again (failed to get ingress list). Closing this one with thanks.
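
For readers hitting the same thing, here is a sketch of that kops workflow under my assumptions: the worker instance group is named nodes, and the hop limit lives under the instance group's instanceMetadata field (the exact spec path may differ between kops versions, so treat the field names as illustrative).

# Inspect the current instance metadata settings of the worker instance group
kops get ig nodes -o yaml | grep -A3 instanceMetadata

# Edit the instance group and set, as described above:
#   spec:
#     instanceMetadata:
#       httpPutResponseHopLimit: 3
kops edit ig nodes

# Roll the nodes so the new launch settings take effect
kops update cluster --yes
kops rolling-update cluster --yes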

jbilliau-rcd commented 1 year ago

I'm having this exact same issue, out of nowhere, on ONE out of 80 clusters... it makes no sense. Where exactly did you change that setting, @SarumObjects? Did you get it to work?

SarumObjects commented 1 year ago

@jbilliau-rcd I had to "kops edit cluster" the changes (httpPutResponseHopLimit: 3) rather than update them with a script (I have only 4 clusters of 3 nodes each). They continue to work, but if I upgrade the clusters I now have to terminate the nodes, which I do with a script, giving the replacement nodes time to start. It's very odd behaviour, but I haven't got enough time to explore it. (If it ain't broke, don't fix it.)

szuecs commented 1 year ago

@SarumObjects @jbilliau-rcd can you create a docs PR for the kops update to highlight that the Kubernetes version update can trigger this?

Our current cluster setup is Kubernetes 1.21 and not kops, so I cannot test on our side whether it's kops related or not. We have been migrating from CRD v1beta1 and Ingress v1beta1 for more than half a year, and soon we will update to 1.22.

jbilliau-rcd commented 1 year ago

@szuecs apologies, I don't quite understand what you are asking. You want me to put in a PR to update the docs for what exactly? That this can happen if you go to 1.22? Do we know that for sure? I have plenty of clusters running EKS 1.22 just fine with 0.14.0 of this controller, with the following argument set in the pod spec: --ingress-api-version=networking.k8s.io/v1.

So we are already on 1.22, already using the new v1 ingress API, and it works on all clusters except one. Mind you... that one isn't even on 1.22! It's on 1.21, so I don't think this has anything to do with 1.22; it looks more OIDC/IAM related.

szuecs commented 1 year ago

@jbilliau-rcd oh, interesting, so we need to investigate more. Right now we have to rely on you, the contributors.

jbilliau-rcd commented 1 year ago

So I ended up running this command:

aws ec2 modify-instance-metadata-options \
    --instance-id i-1234567898abcdef0 \
    --http-put-response-hop-limit 3 \
    --http-endpoint enabled

With the instance-id being the EC2 node that the Zalando pod was running on, and that fixed it! How this (so far) has only happened on one node is still puzzling to me, but that is the issue. It seems like the fix would need to be that the pod should never contact (or at least have a configuration option to never contact) the EC2 instance metadata service, instead only ever using OIDC to use its own IAM role and not the role of the worker node. We give our Zalando ingress its own role, so the fact that it broke due to not being able to call the worker node's metadata URL (presumably to use its own if it needed to) kinda sucked :(
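
If it helps anyone else, here is one way (my own sketch, not from the thread) to find the EC2 instance id of the node the controller pod is scheduled on, so it can be passed to aws ec2 modify-instance-metadata-options. The pod label is an assumption; adjust it to your manifest.

# Hypothetical lookup: node name of the controller pod, then its AWS instance id
NODE=$(kubectl -n kube-system get pod -l application=kube-ingress-aws-controller \
    -o jsonpath='{.items[0].spec.nodeName}')
kubectl get node "$NODE" -o jsonpath='{.spec.providerID}'
# prints something like aws:///eu-central-1a/i-1234567898abcdef0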

mikkeloscar commented 1 year ago

@jbilliau-rcd With this it should be possible to run the controller without needing to contact the ec2 instance metadata service: https://github.com/zalando-incubator/kube-ingress-aws-controller/pull/376

jbilliau-rcd commented 1 year ago

Ah interesting... looks like that was merged 2 years ago!? Has this hidden argument always been available? I don't see it in any documentation anywhere.

mikkeloscar commented 1 year ago

Yeah, we should get this documented so it's more clear.