ranimbal opened this issue 1 year ago · Status: Open
I did some testing on this and here's what I found:
While I have not found a smoking gun for this, the testing I've done seems to indicate it might be related to the default RKE2 CNI.
Yeah, that is what we are leaning toward after some internal testing as well. A potentially interesting data point: do you ever see this issue with zarf package mirror-resources?
https://docs.zarf.dev/docs/the-zarf-cli/cli-commands/zarf_package_mirror-resources#examples
(for the internal registry you can take the first example and swap the passwords and the package - if you don't have git configured just omit that set of flags)
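A hedged sketch of what that first example might look like against the internal registry (the package filename and password below are placeholders; zarf tools get-creds supplies the real push password, and the flags are the same ones used elsewhere in this thread):

# placeholder package filename and password - swap in your package and the zarf-push password
zarf package mirror-resources zarf-package-<your-package>-amd64.tar.zst \
  --registry-url 127.0.0.1:31999 \
  --registry-push-username zarf-push \
  --registry-push-password <zarf-push-password>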
(a potential addition to the theory is that other things in the cluster may be stressing it as well)
Also what is the node role layout for your clusters - I have heard reports that if all nodes are control plane nodes that the issue is also not seen.
We've always had agent nodes when we saw this issue, whether with 1 or 3 control plane nodes. We've never seen this issue on single node clusters. Haven't tried a cluster with only 3 control plane nodes and no agent nodes.
Just to add another data point from what we've seen - we can deploy OK with multi-node clusters but only if the nodes are all RKE2 servers. As soon as we make one an agent, the Zarf registry runs there and we see this behavior as well.
Additional agent nodes are OK but we've tainted those so the Zarf registry doesn't run there.
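For reference (not from the original comment), a minimal sketch of the kind of taint that keeps untolerated workloads such as the Zarf registry off an agent node; the node name and the taint key/value are placeholders:

# placeholder node name and taint key/value - adjust for your cluster
kubectl taint nodes agent-node-1 zarf-registry=keep-off:NoSchedule
# remove the taint later by appending a trailing dash
kubectl taint nodes agent-node-1 zarf-registry=keep-off:NoSchedule-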
I can confirm that adding a nodeSelector and taint/toleration to schedule the zarf registry pod(s) on the RKE2 control plane node(s) does ~resolve~ work around this issue:
kubectl patch deployment -n zarf zarf-docker-registry --patch-file=/dev/stdin <<-EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zarf-docker-registry
  namespace: zarf
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: "true"
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
EOF
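(Hedged addition, not from the original comment: after patching, you can confirm where the registry pod landed with something like the following.)

# verify the registry pod is now scheduled on a control plane node
kubectl get pods -n zarf -o wide | grep zarf-docker-registry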
Just wanted to chime in and say that this problem is still reproducible with the changes in #2190.
It appears that there isn't an error, so the retry does not occur.
Just noting that we're still encountering this on RKE2 with EBS-backed PVCs. Not really any additional details on how/why we hit it, but we were able to work around it by pushing the image that was hanging "manually", via a small Zarf package.
EDIT: To clarify, this was a Zarf package that we built with a single component containing the single image that commonly stalled on deploy. We then created and deployed it, and once that finished, we deployed our "real" Zarf package, which sped past the image push. Not sure why this worked better, but it seemed to consistently help when we hit stalling images.
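A minimal sketch of what such a single-image package might look like (not from the original comment; the package name and image are illustrative placeholders):

cat > zarf.yaml <<'EOF'
kind: ZarfPackageConfig
metadata:
  name: single-stalling-image
components:
  - name: stalling-image
    required: true
    images:
      - quay.io/ceph/ceph:v18.2.4   # placeholder: the image that stalls for you
EOF
zarf package create . --confirm
zarf package deploy zarf-package-single-stalling-image-*.tar.zst --confirm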
This is a super longstanding issue upstream that we've been band-aiding for a few years (in Kubernetes land). The root of the issue is that SPDY is long dead but is still used for all streaming functionality in Kubernetes. The current port-forward logic depends on SPDY and an implementation that is overdue for a rewrite.
KEP 4006 should be an actual fix as we replace SPDY.
We are currently building mitigations into Zarf to try and address this.
What we really need is an environment where we can replicate the issue and test different fixes. If anyone has any ideas... Historically we've been unable to reproduce this.
This should be mitigated now in https://github.com/defenseunicorns/zarf/releases/tag/v0.32.4 - leaving this open until we get more community feedback though (and again this is a mitigation not a true fix, that will have to happen upstream).
(also thanks to @benmountjoy111 and @docandrew for the .pcap files!)
Sadly, I do not think this solves the issue. I am still experiencing timeouts when publishing images. I am noticing that Zarf is now explicitly timing out instead of just hanging forever though.
https://github.com/kubernetes/kubernetes/pull/117493 should fix this upstream. Hopefully we can get it merged and backported.
Following up here to see if there's any more clarity on the exact issue we're facing... based on the above comments, it seems like the current suspicion is that the issue originates from the kubectl port-forward/tunneling? Is that accurate @eddiezane?
In some testing on our environment we've consistently had failures with large image pushes. This is happening in the context of a UDS bundle, so not directly zarf but it's effectively just looping through each package to deploy. Our common error looks like the one above with timeouts currently.
We have, however, had success pushing the images two different ways:
I think where I'm confused in all this is that I'd assume either of these workarounds would hit the same limitations with port-forwarding/tunneling. Is there anything to glean from this experience that might help explain the issue better or why these methods seem to work far more consistently? As @YrrepNoj mentioned above, we're able to hit this pretty much 100% consistently with our bundle deploy containing the Leapfrog images and haven't found any success outside of these workarounds.
A workaround that has consistently worked for me to get past this particular issue is to use zarf package mirror-resources concurrently with a zarf connect git tunnel open, and mirror the package's internal resources to the specified image registry and git repository. For --registry-url I use the IP address of the node that the zarf-docker-registry is running on and the NodePort the zarf-docker-registry service is using. Authentication is also required with --registry-push-username/password and --git-push-username/password, which you can get from running zarf tools get-creds. For example:
zarf package mirror-resources zarf-package-structsure-enterprise-amd64-v5.9.0.tar.zst --registry-url <IP address of node zarf-docker-registry is running on>:31999 --registry-push-username zarf-push --registry-push-password <zarf-push-password> --git-url http://127.0.0.1:<tunnel port from zarf connect git> --git-push-username zarf-git-user --git-push-password <git-user-password>
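(Not part of the original comment, just a hedged sketch of how you might look up those values; the service name and namespace assume a default zarf init.)

# NodePort of the internal registry service
kubectl get svc -n zarf zarf-docker-registry -o jsonpath='{.spec.ports[0].nodePort}'
# which node the registry pod is scheduled on
kubectl get pods -n zarf -o wide | grep zarf-docker-registry
# push credentials for the registry and git server
zarf tools get-creds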
Proof:
Running zarf package deploy zarf-package-structsure-enterprise-amd64-v5.0.0.tar.zst --confirm has errored every time, getting stuck on a specific blob and unable to push the image:
The zarf package mirror-resources command (with a zarf connect git tunnel open) working to push the same image to the zarf-docker-registry that previously always got stuck during a zarf package deploy command:
After the zarf package's internal resources are mirrored to the specified registry and git repository, a zarf package deploy zarf-package-structsure-enterprise-amd64-v5.0.0.tar.zst --confirm is successful.
I am seeing the same behavior as @RyanTepera1 using RKE2 v1.28.3+rke2r2 on EC2 instances with an NFS-based storage class. I have not seen this on EKS clusters using an EBS-based storage class. I also haven't tried the zarf connect git trick yet, but I'll be trying that soon!
One additional thing I've noticed is that using zarf package mirror-resources --zarf-url <ip>:31999 ... doesn't seem to completely hang, but it slows to a crawl, taking hours to make a small amount of progress. However, if I kill the zarf-docker-registry-* pod, progress seems to resume at normal speed. I was able to get through a large package with multiple 2+ GB images in a single run by monitoring and occasionally killing zarf-docker-registry pods to get things moving again.
For example, pushing this sonarqube image took nearly 5 minutes to get from 39% to 41%, but after killing the zarf-docker-registry pod, it pushed the image in less than a minute.
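(Hedged aside, not from the original comments: one way to bounce the registry pod is to restart its deployment, which is named zarf-docker-registry in a default init.)

# force the registry pod to be recreated; pushes resume once the new pod is Ready
kubectl rollout restart deployment -n zarf zarf-docker-registry
kubectl rollout status deployment -n zarf zarf-docker-registry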
Moving the zarf-docker-registry pods to one of the RKE2 master nodes as suggested here did not improve performance for my deployment. I tried this with zarf package deploy ... and zarf package mirror-resources .... In both cases, when image pushes slowed way down, killing zarf-docker-registry pods would get things moving again. This was much easier using the zarf package mirror-resources approach.
@YrrepNoj provided the following workaround using Docker Push instead.
# Run the following connect command in its own terminal and leave it running; it blocks while the tunnel is open.
# NOTE: You will need to reference the port that this connect command uses later
zarf connect registry # This command hangs; leave it running in the background
# In a new terminal session
zarf tools get-creds registry
docker login 127.0.0.1:{PORT_FROM_CONNECT_COMMAND} -u zarf-push -p {OUTPUT_OF_GETCREDS}
docker pull {original_image} # pull each image you are going to need down to your local docker
docker tag {original_image} 127.0.0.1:{CONNECT_PORT}/{image_ref+tag}
docker push {new_image} # i.e. the newly tagged 127.0.0.1:{CONNECT_PORT}/{image_ref+tag}
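(A concrete, hedged instantiation of the above with made-up values: the tunnel port 45177 and the rook/ceph image are placeholders only.)

docker login 127.0.0.1:45177 -u zarf-push -p {OUTPUT_OF_GETCREDS}
docker pull docker.io/rook/ceph:v1.15.0
docker tag docker.io/rook/ceph:v1.15.0 127.0.0.1:45177/rook/ceph:v1.15.0
docker push 127.0.0.1:45177/rook/ceph:v1.15.0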
I'm running into the same issue... I'm trying to load in the rook/ceph images, i.e. the uds-capability-rook-ceph zarf init package, and cannot get it to play nice...
I also tried with vanilla images, since UDS is baking in the repo1 images, and I still get the same error.
Images I am using:
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/rook/ceph v1.15.0 b53803133db7 2 weeks ago 1.37 GB
quay.io/cephcsi/cephcsi v3.12.0 72d897831441 3 weeks ago 1.54 GB
quay.io/ceph/ceph v18.2.4 2bc0b0f4375d 6 weeks ago 1.25 GB
registry.k8s.io/sig-storage/csi-node-driver-registrar v2.11.1 1201ed6e40fa 7 weeks ago 31.7 MB
registry.k8s.io/sig-storage/csi-snapshotter v8.0.1 a011c41a1df0 3 months ago 66.9 MB
registry.k8s.io/sig-storage/csi-resizer v1.11.1 95ba1a4c52f0 3 months ago 68 MB
registry.k8s.io/sig-storage/csi-provisioner v5.0.1 427403f00b9e 3 months ago 71.2 MB
registry.k8s.io/sig-storage/csi-attacher v4.6.1 11aa3c05dd35 3 months ago 67.8 MB
Related to #2864
Performance has significantly improved for me using zarf package mirror-resources... as of Zarf 0.40.1. I was seeing hang-ups every time, and after upgrading to that Zarf version I haven't seen a hang-up in a couple of attempts. Very promising so far 🤞
@philiversen is correct, using zarf package mirror-resources has been working for me, but I do want to add it to the lifecycle of the zarf init package when deploying rook-ceph, so below is part of my YAML to help things along:
variables:
  - name: CONTROL_PLANE_ONE_ADDRESS
    prompt: true
    description: This is the IP/Hostname of a control-plane
components:
  - ...
    actions:
      onDeploy:
        before:
          - cmd: |-
              ./zarf package mirror-resources zarf-init-amd64-*.tar.zst \
                --registry-url ${ZARF_VAR_CONTROL_PLANE_ONE_ADDRESS}:${ZARF_NODEPORT} \
                --registry-push-username zarf-push \
                --registry-push-password ${ZARF_REGISTRY_AUTH_PUSH}
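(Hedged follow-up, not from the original comment: when deploying a package that declares this variable, the address can be supplied non-interactively with --set; the package filename and IP below are placeholders.)

zarf package deploy zarf-package-<your-package>-amd64.tar.zst \
  --set CONTROL_PLANE_ONE_ADDRESS=10.0.0.10 \
  --confirm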
Environment
Device and OS: Rocky 8 EC2
App version: 0.29.2
Kubernetes distro being used: RKE2 v1.26.9+rke2r1
Other: Bigbang v2.11.1
Steps to reproduce
1. Run zarf package deploy zarf-package-mvp-cluster-amd64-v5.0.0-alpha.7.tar.zst --confirm -l=debug
2. The deploy gets hung up during the image push, in crane.Push(). A retry usually works.

Expected result
That the zarf package deploy... command wouldn't get hung up, and would continue along.

Actual Result
The zarf package deploy... command gets hung up.

Visual Proof (screenshots, videos, text, etc)
Severity/Priority
There is a workaround: keep retrying until the process succeeds.
Additional Context
This looks exactly like https://github.com/defenseunicorns/zarf/issues/1568, which was closed.
We have a multi-node cluster on AWS EC2; our package size is about 2.9 GB. Here are a few things that we noticed after some extensive testing: