zarf-dev / zarf

DevSecOps for Air Gap & Limited-Connection Systems. https://zarf.dev/
Apache License 2.0

Intermittent Hangs at crane.Push() on Registry Push #2104

Open ranimbal opened 1 year ago

ranimbal commented 1 year ago

Environment

Device and OS: Rocky 8 EC2
App version: 0.29.2
Kubernetes distro being used: RKE2 v1.26.9+rke2r1
Other: Bigbang v2.11.1

Steps to reproduce

  1. zarf package deploy zarf-package-mvp-cluster-amd64-v5.0.0-alpha.7.tar.zst --confirm -l=debug
  2. About 80% of the time or so, the above command gets stuck at crane.Push(). A retry usually works.

Expected result

That the zarf package deploy... command would not hang and would continue to completion.

Actual Result

The zarf package deploy... command hangs.

Visual Proof (screenshots, videos, text, etc)

  DEBUG   2023-10-23T18:37:19Z  -  Pushing ...1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3
  DEBUG   2023-10-23T18:37:19Z  -  crane.Push() /tmp/zarf-3272389118/images:registry1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3 -> 127.0.0.1:39357/ironbank/neuvector/neuvector/manager:5.1.3-zarf-487612511)
section_end:1698087620:step_script
ERROR: Job failed: execution took longer than 35m0s seconds

Severity/Priority

There is a workaround: keep retrying until the process succeeds.

Additional Context

This looks exactly like https://github.com/defenseunicorns/zarf/issues/1568, which was closed.

We have a multi-node cluster on AWS EC2, and our package size is about 2.9 GB. Here are a few things we noticed after some extensive testing:

AbrohamLincoln commented 12 months ago

I did some testing on this and here's what I found:

While I have not found a smoking gun for this, the testing I've done seems to indicate it might be related to the default RKE2 CNI.

Racer159 commented 12 months ago

Yeah, that is what we are leaning toward after some internal testing as well. A potentially interesting data point - do you ever see this issue with zarf package mirror-resources?

https://docs.zarf.dev/docs/the-zarf-cli/cli-commands/zarf_package_mirror-resources#examples

(for the internal registry you can take the first example and swap the passwords and the package - if you don't have git configured just omit that set of flags)
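
A hedged sketch of what that could look like against the internal registry's default NodePort (31999) when run from a cluster node, using the package name from this issue; the push password placeholder comes from zarf tools get-creds, and the --git-* flags can be omitted if git isn't configured:

zarf package mirror-resources zarf-package-mvp-cluster-amd64-v5.0.0-alpha.7.tar.zst \
  --registry-url 127.0.0.1:31999 \
  --registry-push-username zarf-push \
  --registry-push-password <zarf-push-password>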

Racer159 commented 12 months ago

(a potential addition to the theory is that other things in the cluster may be stressing it as well)

Racer159 commented 12 months ago

Also, what is the node role layout for your clusters? I have heard reports that if all nodes are control plane nodes, the issue is also not seen.

ranimbal commented 12 months ago

Also, what is the node role layout for your clusters? I have heard reports that if all nodes are control plane nodes, the issue is also not seen.

We've always had agent nodes when we saw this issue, whether with 1 or 3 control plane nodes. We've never seen this issue on single node clusters. Haven't tried a cluster with only 3 control plane nodes and no agent nodes.

docandrew commented 12 months ago

Just to add another data point from what we've seen - we can deploy OK with multi-node clusters but only if the nodes are all RKE2 servers. As soon as we make one an agent, the Zarf registry runs there and we see this behavior as well.

docandrew commented 12 months ago

Additional agent nodes are OK but we've tainted those so the Zarf registry doesn't run there.

AbrohamLincoln commented 11 months ago

I can confirm that adding a nodeSelector and taint/toleration to schedule the zarf registry pod(s) on the RKE2 control plane node(s) does ~resolve~ work around this issue:

kubectl patch deployment -n zarf zarf-docker-registry --patch-file=/dev/stdin <<-EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zarf-docker-registry
  namespace: zarf
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: "true"
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
EOF
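
After applying the patch, a quick sanity check that the rollout finished and the registry pod actually landed on a control-plane node (standard kubectl, nothing zarf-specific):

kubectl -n zarf rollout status deployment/zarf-docker-registry
kubectl -n zarf get pods -o wide | grep zarf-docker-registry
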
AbrohamLincoln commented 10 months ago

Just wanted to chime in and say that this problem is still reproducible with the changes in #2190.
It appears that there isn't an error, so the retry does not occur.

mjnagel commented 9 months ago

Just noting that we are still encountering this on RKE2 with EBS-backed PVCs. Not really any additional details on how/why we encountered it, but we were able to work around it by pushing the image that was hanging "manually"/via a small zarf package.

EDIT: To clarify, this was a zarf package we built with a single component containing the single image that commonly stalled on deploy. We then created and deployed it, and once that finished, we deployed our "real" zarf package, which sped past the image push. Not sure why this worked better, but it seemed to consistently help when we hit stalling images.
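
A minimal sketch of what such a single-image package could look like, assuming (purely for illustration) that the stalled image were the neuvector manager image from the debug log at the top of this issue; substitute whichever image stalls for you:

kind: ZarfPackageConfig
metadata:
  name: single-image-push
  version: 0.0.1
components:
  - name: stalled-image
    required: true
    images:
      # the image that commonly stalls on deploy (example taken from the log above)
      - registry1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3

Then zarf package create . --confirm and deploying the resulting tarball before the main package reproduces the workaround described above.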

eddiezane commented 8 months ago

This is a super longstanding issue upstream that we've been band-aiding for a few years (in Kubernetes land). The root of the issue is that SPDY is long dead but is still used for all streaming functionality in Kubernetes. The current port-forward logic depends on SPDY and an implementation that is overdue for a rewrite.

KEP 4006 should be an actual fix as we replace SPDY.

We are currently building mitigations into Zarf to try and address this.

What we really need is an environment where we can replicate the issue and test different fixes. If anyone has any ideas... Historically we've been unable to reproduce this.

Racer159 commented 8 months ago

This should be mitigated now in https://github.com/defenseunicorns/zarf/releases/tag/v0.32.4 - leaving this open until we get more community feedback though (and again this is a mitigation not a true fix, that will have to happen upstream).

Racer159 commented 8 months ago

(also thanks to @benmountjoy111 and @docandrew for the .pcap files!)

YrrepNoj commented 7 months ago

This should be mitigated now in https://github.com/defenseunicorns/zarf/releases/tag/v0.32.4 - leaving this open until we get more community feedback though (and again this is a mitigation not a true fix, that will have to happen upstream).

Sadly, I do not think this solves the issue. I am still experiencing timeouts when publishing images. I am noticing that Zarf is now explicitly timing out instead of just hanging forever though.

Screenshot 2024-03-22 at 4 33 58 PM
eddiezane commented 7 months ago

https://github.com/kubernetes/kubernetes/pull/117493 should fix this upstream. Hopefully we can get it merged and backported.

mjnagel commented 6 months ago

Following up here to see if there's any more clarity on the exact issue we're facing... based on the above comments it seems like the current suspicion is that the issue originates from the kubectl port-forward/tunneling? Is that accurate @eddiezane?

In some testing on our environment we've consistently had failures with large image pushes. This is happening in the context of a UDS bundle, so not directly zarf but it's effectively just looping through each package to deploy. Our common error looks like the one above with timeouts currently.

We have, however, had success pushing the images two different ways:

I think where I'm confused in all this is that I'd assume either of these workarounds would hit the same limitations with port-forwarding/tunneling. Is there anything to glean from this experience that might help explain the issue better or why these methods seem to work far more consistently? As @YrrepNoj mentioned above, we're able to hit this pretty much 100% consistently with our bundle deploy containing the Leapfrog images and haven't found any success outside of these workarounds.

RyanTepera1 commented 6 months ago

A workaround that has worked for me consistently to get past this particular issue is to use zarf package mirror-resources concurrently with a zarf connect git tunnel open, and mirror the package's internal resources to the specified image registry and git repository. For the --registry-url I use the IP address of the node that the zarf-docker-registry is running on and the NodePort the zarf-docker-registry service is using. Authentication is also required with --registry-push-username/password and --git-push-username/password, which can be obtained by running zarf tools get-creds. For example:

zarf package mirror-resources zarf-package-structsure-enterprise-amd64-v5.9.0.tar.zst --registry-url <IP address of node zarf-docker-registry is running on>:31999 --registry-push-username zarf-push --registry-push-password <zarf-push-password> --git-url http://127.0.0.1:<tunnel port from zarf connect git> --git-push-username zarf-git-user --git-push-password <git-user-password>

Proof: running zarf package deploy zarf-package-structsure-enterprise-amd64-v5.0.0.tar.zst --confirm has errored every time, getting stuck on a specific blob and unable to push the image: Screenshot 2024-05-02 at 4 16 35 PM

The zarf package mirror-resources command working to push the same image to the zarf-docker-registry, with a zarf connect git tunnel open, that previously always got stuck during a zarf package deploy command: Screenshot 2024-05-02 at 4 27 45 PM

After the zarf package's internal resources are mirrored to the specified registry and git repository, a zarf package deploy zarf-package-structsure-enterprise-amd64-v5.0.0.tar.zst --confirm is successful.
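
For anyone trying this, a rough sketch of how to look up the values referenced above (node IP, NodePort, and push credentials), assuming the default zarf namespace and the zarf-docker-registry service name seen earlier in this thread:

# which node is the registry pod running on?
kubectl -n zarf get pods -o wide | grep zarf-docker-registry
# what NodePort is the registry service exposed on? (31999 by default)
kubectl -n zarf get svc zarf-docker-registry -o jsonpath='{.spec.ports[0].nodePort}'
# push credentials for --registry-push-username/password and --git-push-username/password
zarf tools get-creds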

philiversen commented 6 months ago

I am seeing the same behavior as @RyanTepera1 using RKE2 v1.28.3+rke2r2 on EC2 instances with an NFS-based storage class. I have not seen this on EKS clusters using an EBS-based storage class. I also haven't tried the zarf connect git trick yet, but I'll be trying that soon!

One additional thing I've noticed is that using zarf package mirror-resources --registry-url <ip>:31999 ... doesn't seem to completely hang, but it slows to a crawl, taking hours to make a small amount of progress. However, if I kill the zarf-docker-registry-* pod, progress seems to resume at normal speed. I was able to get through a large package with multiple 2+ GB images in a single run by monitoring and occasionally killing zarf-docker-registry pods to get things moving again.

For example, pushing this sonarqube image took nearly 5 minutes to get from 39% to 41%, but after killing the zarf-docker-registry pod, it pushed the image in less than a minute.

image
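
If others want to try the same nudge without hunting for the generated pod name, restarting the deployment should have the same effect as killing the pod (a sketch, assuming the default zarf namespace and deployment name shown earlier in this thread):

# bounce the registry; the Deployment recreates the pod and pushes resume
kubectl -n zarf rollout restart deployment/zarf-docker-registry
kubectl -n zarf rollout status deployment/zarf-docker-registry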

philiversen commented 6 months ago

Moving the zarf-docker-registry pods to one of the RKE2 master nodes as suggested here did not improve performance for my deployment. I tried this with zarf package deploy... and zarf package mirror-resources.... In both cases, when image pushes slowed way down, killing zarf-docker-registry pods would get things moving again. This was much easier using the zarf package mirror-resources... approach.

schristoff commented 3 months ago

@YrrepNoj provided the following workaround using a manual docker push instead.

# Run the following connect command in its own terminal. It stays open (it will appear to hang).
# NOTE: You will need to reference the port that this connect command uses later
zarf connect registry # This command hangs, leave it running in the background

# In a new terminal session
zarf tools get-creds registry

docker login 127.0.0.1:{PORT_FROM_CONNECT_COMMAND} -u zarf-push -p {OUTPUT_OF_GETCREDS}

docker pull {original_image}    # pull all the images you are going to need down to your local docker
docker tag {original_image} 127.0.0.1:{CONNECT_PORT}/{image_ref+tag}
docker push {new_image}
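
For concreteness, here is what the last three steps might look like using the image and destination ref from the debug log at the top of this issue (the 127.0.0.1 port will differ per zarf connect registry session, and the -zarf-<checksum> tag suffix should match whatever your own deploy log shows):

docker pull registry1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3
docker tag registry1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3 \
  127.0.0.1:39357/ironbank/neuvector/neuvector/manager:5.1.3-zarf-487612511
docker push 127.0.0.1:39357/ironbank/neuvector/neuvector/manager:5.1.3-zarf-487612511
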
a1994sc commented 2 months ago

I'm running into the same issue... I'm trying to load in the rook/ceph images, i.e. the uds-capability-rook-ceph zarf init package, and cannot get it to play nice...

I also tried with vanilla images, as uds is baking in the repo1 images, and am still getting the same error...

Images I am using:

REPOSITORY                                             TAG         IMAGE ID      CREATED       SIZE
docker.io/rook/ceph                                    v1.15.0     b53803133db7  2 weeks ago   1.37 GB
quay.io/cephcsi/cephcsi                                v3.12.0     72d897831441  3 weeks ago   1.54 GB
quay.io/ceph/ceph                                      v18.2.4     2bc0b0f4375d  6 weeks ago   1.25 GB
registry.k8s.io/sig-storage/csi-node-driver-registrar  v2.11.1     1201ed6e40fa  7 weeks ago   31.7 MB
registry.k8s.io/sig-storage/csi-snapshotter            v8.0.1      a011c41a1df0  3 months ago  66.9 MB
registry.k8s.io/sig-storage/csi-resizer                v1.11.1     95ba1a4c52f0  3 months ago  68 MB
registry.k8s.io/sig-storage/csi-provisioner            v5.0.1      427403f00b9e  3 months ago  71.2 MB
registry.k8s.io/sig-storage/csi-attacher               v4.6.1      11aa3c05dd35  3 months ago  67.8 MB
AustinAbro321 commented 1 month ago

Related to #2864

philiversen commented 1 month ago

Performance has significantly improved for me using zarf package mirror-resources... as of Zarf 0.40.1. I was seeing hang-ups every time, and after upgrading to that Zarf version I haven't seen a hang-up in a couple of attempts. Very promising so far 🤞

a1994sc commented 1 month ago

@philiversen is correct: using zarf package mirror-resources has been working for me. But I do want to add it to the lifecycle of the zarf init package when deploying rook-ceph, so below is part of my yaml to help things along:

variables:
- name: CONTROL_PLANE_ONE_ADDRESS
  prompt: true
  description: This is the IP/Hostname of a control-plane
components:
...
    actions:
      onDeploy:
        before:
          - cmd: |-
              ./zarf package mirror-resources zarf-init-amd64-*.tar.zst \
                --registry-url ${ZARF_VAR_CONTROL_PLANE_ONE_ADDRESS}:${ZARF_NODEPORT} \
                --registry-push-username zarf-push \
                --registry-push-password ${ZARF_REGISTRY_AUTH_PUSH}
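
For completeness, a hedged example of supplying the prompted variable above non-interactively at deploy time with --set (the package filename is a placeholder for whatever package carries this component):

zarf package deploy <your-package>.tar.zst --confirm \
  --set CONTROL_PLANE_ONE_ADDRESS=<control-plane-ip-or-hostname>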