Open anthonywendt opened 1 month ago
Thank you for the issue @anthonywendt. This can be easily solved by updating the component to use health checks after #2718 is introduced. You are correct that the root of the issue is that the wait check on the permanent registry succeeds immediately. Health checks will use kstatus rather than kubectl wait under the hood. Kstatus waits for all of the pods to be in the updated state before evaluating to ready.
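To illustrate the difference, here is a minimal Python sketch of a rollout-style readiness check. It is not the kstatus implementation and not Zarf code; the function name `deployment_rolled_out` is hypothetical, and the logic is a simplified approximation that only looks at the standard Deployment status fields (`observedGeneration`, `updatedReplicas`, `replicas`, `readyReplicas`, `availableReplicas`).

```python
def deployment_rolled_out(deployment: dict) -> bool:
    """Simplified rollout check (hypothetical helper, not kstatus itself).

    Returns True only when the controller has observed the latest spec and
    every desired replica is updated, ready and available.
    """
    meta = deployment.get("metadata", {})
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})

    desired = spec.get("replicas", 1)

    # If the controller has not yet observed the newest spec, the status may
    # still describe the previous revision (for example the seed registry).
    if status.get("observedGeneration", 0) < meta.get("generation", 0):
        return False

    return (
        status.get("updatedReplicas", 0) == desired
        and status.get("replicas", 0) == desired
        and status.get("readyReplicas", 0) == desired
        and status.get("availableReplicas", 0) == desired
    )


# Status still reflects the old revision: looks healthy, but is stale.
stale = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 1},
    "status": {"observedGeneration": 1, "updatedReplicas": 1,
               "replicas": 1, "readyReplicas": 1, "availableReplicas": 1},
}

# New revision observed, but only one of three replicas is ready.
partial = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "updatedReplicas": 3,
               "replicas": 3, "readyReplicas": 1, "availableReplicas": 1},
}

# New revision observed and all three replicas are ready.
done = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "updatedReplicas": 3,
               "replicas": 3, "readyReplicas": 3, "availableReplicas": 3},
}
```

Under this sketch, only `done` counts as ready. A check that looks at a single availability condition could pass for `stale` and `partial` as well, which matches the behavior described in this issue.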
Environment
Device and OS: Nutanix VM RHEL8
App version: 0.36.1
Kubernetes distro being used: RKE2
Steps to reproduce
Expected result
Zarf to successfully initialize in the cluster.
Actual Result
Image pull backoff on some of the permanent registry pods, for the reasons explained under Additional Context below.
Visual Proof (screenshots, videos, text, etc)
Need to produce... standby
Severity/Priority
Additional Context
We don't have a reproducible test case outside a Nutanix deployment of ours right now. We believe that when using multiple replicas of the registry and persistent volume provisioning is slow, this can happen.
We have the below custom zarf init package for our Nutanix CSI driver that imports the seed registry and permanent registry from the zarf project.
What we believe happens is the wait on the permanent registry immediately succeeds and moves on because the seed registry is using the same name as the permanent registry. This can cause the permanent registry to not actually be ready with all the images pushed to it in its persistent volume.
When the permanent registry replicas start to come up, some of the initial pods succeed because they pulled images from the temporary seed registry. Because the wait didn't actually wait for the permanent registry, the seed registry can be removed too early. Once it is gone, if the permanent registry's persistent volumes were not quite ready, or not yet filled with all the images the seed registry had, any remaining registry pods can't pull images and remain in image pull backoff. We can scale down to 1 registry pod, and that allows the zarf init to continue and finish successfully.
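The suspected race can be sketched with a toy Python model. Everything here is hypothetical and for illustration only: the `Deployment` class, `apply`, and `naive_wait` are not Zarf or Kubernetes APIs. The one assumption taken from real behavior is that updating an existing Deployment does not reset its reported availability, so a wait keyed on the shared name can pass before the new pods are ready.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    revision: str    # which component last wrote the spec (hypothetical field)
    available: bool  # availability as last reported by the controller


def apply(cluster: dict, name: str, revision: str) -> None:
    """Create or update a deployment; an update keeps the old reported status."""
    existing = cluster.get(name)
    carried_over = existing.available if existing else False
    cluster[name] = Deployment(name, revision, carried_over)


def naive_wait(cluster: dict, name: str) -> bool:
    """Toy wait that only asks: is something with this name available?"""
    deployment = cluster.get(name)
    return deployment is not None and deployment.available


cluster = {}

# The seed registry comes up first and becomes available.
apply(cluster, "registry", "seed")
cluster["registry"].available = True

# The permanent registry is applied under the same name.
apply(cluster, "registry", "permanent")
same_name_result = naive_wait(cluster, "registry")

# For contrast: with a distinct name there is no stale status to inherit.
apply(cluster, "registry-permanent", "permanent")
fresh_name_result = naive_wait(cluster, "registry-permanent")
```

In this model `same_name_result` is `True` even though no permanent registry pod has been checked, while `fresh_name_result` is `False` until the new deployment reports availability itself.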
Custom zarf init: