zarf-dev / zarf

DevSecOps for Air Gap & Limited-Connection Systems. https://zarf.dev/

Image Race Condition Between Zarf Seed Registry and Zarf Permanent Registry With Multiple Registry Replicas #2855


anthonywendt commented 1 month ago

Environment

Device and OS: Nutanix VM, RHEL 8
App version: 0.36.1
Kubernetes distro being used: RKE2

Steps to reproduce

  1. Run a zarf init with the registry deployment's replicas set to more than one, or with HPA enabled, as in the sketch below.
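
A minimal sketch of one way to do that, assuming the upstream init package exposes the REGISTRY_HPA_ENABLE deploy variable and using zarf's package.deploy.set config mechanism:

# zarf-config.yaml (sketch)
package:
  deploy:
    set:
      # let the registry HPA scale the deployment past a single replica
      REGISTRY_HPA_ENABLE: "true"

Then run zarf init (or the custom init package below) as usual.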

Expected result

Zarf initializes successfully in the cluster.

Actual Result

Image pull backoff on some of the permanent registry pods, for the reasons explained in Additional Context below.

Visual Proof (screenshots, videos, text, etc)

Need to produce... standby

Severity/Priority

Additional Context

We don't have a reproducible test case outside one of our Nutanix deployments right now. We believe this can happen when the registry runs multiple replicas and persistent volume provisioning is slow.

We use the custom zarf init package below for our Nutanix CSI driver; it imports the seed registry and permanent registry from the zarf project.

What we believe happens is that the wait on the permanent registry succeeds immediately and moves on, because the seed registry uses the same name as the permanent registry. As a result, the permanent registry may not actually be ready, with all of the images pushed into its persistent volume.

When the permanent registry replicas start to come up, some of the initial pods succeed because they pull images from the temporary seed registry. But since the wait never actually waited on the permanent registry, the seed registry can be torn down while the permanent registry's persistent volumes are not yet fully populated with all the images the seed registry held. Any remaining registry pods then can't pull images and sit in image pull backoff. Scaling down to 1 registry pod lets the zarf init continue and finish successfully (see the command sketch below).
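
The workaround, roughly (this assumes the registry Deployment keeps the default zarf-docker-registry name in the zarf namespace):

# scale the permanent registry down to a single pod so init can finish
./zarf tools kubectl scale deployment zarf-docker-registry -n zarf --replicas=1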

Custom zarf init:

kind: ZarfInitConfig
metadata:
  name: init
  description: "Nutanix CSI Driver Custom Zarf Init Package"
  architecture: amd64
  version: "0.0.1" # This version is not used by zarf, but is used for tracking with the published versions

variables:
  - name: DYNAMIC_FILE_STORE_NAME
    description: "Name of Nutanix File Server to use for Dynamic File storageclass. Should match the name value for the file server in Prism."
  - name: PRISM_ENDPOINT
    description: "IP or hostname of Prism Element."
  - name: PRISM_USERNAME
    description: "Username of prism user to use for Nutanix CSI driver."
  - name: PRISM_PASSWORD
    description: "Password for prism user to use for Nutanix CSI driver."
  - name: STORAGE_CONTAINER
    description: "Name of Nutanix Storage Container for CSI driver to create volumes in."

components:
  # (Optional) Deploys a k3s cluster
  - name: k3s
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1

  # This package moves the injector & registries binaries
  - name: zarf-injector
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1

  # Creates the temporary seed-registry
  - name: zarf-seed-registry
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1
    charts:
      - name: docker-registry
        valuesFiles:
          - values/registry-values.yaml
    # On upgrades ensure we retain the existing PV
    actions:
      onDeploy:
        before:
          - description: Set persistence for upgrade seed registry
            cmd: ./zarf tools kubectl get pvc zarf-docker-registry -n zarf >/dev/null 2>&1 && echo true || echo false
            mute: true
            setVariables:
              - name: UPGRADE_PERSISTENCE
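          # On a fresh install point registry storage at /var/lib/registry; if the
          # PVC already exists (upgrade), emit nothing so the existing storage config is kept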
          - description: Set env vars for upgrade seed registry
            mute: true
            cmd: |
              ./zarf tools kubectl get pvc zarf-docker-registry -n zarf >/dev/null 2>&1 && \
              echo "" || \
              echo "- name: REGISTRY_STORAGE_FILESYSTEM_ROOTDIRECTORY
                value: \"/var/lib/registry\""
            setVariables:
              - name: UPGRADE_ENV_VARS
                autoIndent: true

  # Push nutanix csi images to seed-registry
  - name: nutanix-csi-images-initial
    required: true
    description: Push nutanix images to the zarf registry
    images:
      - registry.k8s.io/sig-storage/snapshot-controller:v8.0.1
      - registry.k8s.io/sig-storage/snapshot-validation-webhook:v8.0.1
      - quay.io/karbon/ntnx-csi:v2.6.10
      - registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.11.1
      - registry.k8s.io/sig-storage/csi-provisioner:v5.0.1
      - registry.k8s.io/sig-storage/csi-snapshotter:v8.0.1
      - registry.k8s.io/sig-storage/csi-resizer:v1.11.2
      - registry.k8s.io/sig-storage/livenessprobe:v2.13.1
      - registry1.dso.mil/ironbank/opensource/velero/velero-plugin-for-csi:v0.7.1
      - registry1.dso.mil/ironbank/opensource/velero/velero-plugin-for-aws:v1.10.0

  - name: nutanix-csi-storage
    required: true
    charts:
      # renovate: datasource=helm
      - name: nutanix-csi-storage
        url: https://github.com/defenseunicorns/nutanix-helm.git # fork containing fix for imagepullsecrets needed for pods to pull images from zarf registry
        version: v2.6.10-modified
        gitPath: charts/nutanix-csi-storage
        namespace: ntnx-system
        valuesFiles:
          - values/nutanix-storage-values.yaml
    actions:
      onDeploy:
        before:
          - description: Delete Storage Classes
            cmd: ./zarf tools kubectl delete sc nutanix-volume --ignore-not-found=true

  - name: nutanix-dynamicfile-manifests
    required: true
    manifests:
      - name: nutanix-dynamicfile-manifests
        namespace: ntnx-system
        files:
          - nutanix-dynamicfile.yaml
    actions:
      onDeploy:
        before:
          - description: Delete Storage Classes
            cmd: ./zarf tools kubectl delete sc nutanix-dynamicfile --ignore-not-found=true

  # Creates the permanent registry
  - name: zarf-registry
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1

  # Push nutanix csi (and registry) images to permanent registry
  - name: nutanix-csi-images
    required: true
    description: Push nutanix csi images to the zarf registry
    images:
      - registry.k8s.io/sig-storage/snapshot-controller:v8.0.1
      - registry.k8s.io/sig-storage/snapshot-validation-webhook:v8.0.1
      - quay.io/karbon/ntnx-csi:v2.6.10
      - registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.11.1
      - registry.k8s.io/sig-storage/csi-provisioner:v5.0.1
      - registry.k8s.io/sig-storage/csi-snapshotter:v8.0.1
      - registry.k8s.io/sig-storage/csi-resizer:v1.11.2
      - registry.k8s.io/sig-storage/livenessprobe:v2.13.1
      - registry1.dso.mil/ironbank/opensource/velero/velero-plugin-for-csi:v0.7.1
      - registry1.dso.mil/ironbank/opensource/velero/velero-plugin-for-aws:v1.10.0
      - "###ZARF_PKG_TMPL_REGISTRY_IMAGE_DOMAIN######ZARF_PKG_TMPL_REGISTRY_IMAGE###:###ZARF_PKG_TMPL_REGISTRY_IMAGE_TAG###"

  # Creates the pod+git mutating webhook
  - name: zarf-agent
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1

  # (Optional) Adds a git server to the cluster
  - name: git-server
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1
AustinAbro321 commented 2 weeks ago

Thank you for the issue @anthonywendt. This can be easily solved by updating the component to use health checks once #2718 is introduced. You are correct that the root of the issue is that the wait check on the permanent registry succeeds immediately. Health checks will use kstatus rather than kubectl wait under the hood; kstatus waits for all of the pods to be in the updated state before evaluating the resource as ready. A sketch of what that could look like is below.
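
For illustration only, assuming the healthChecks schema lands roughly as proposed in #2718, the registry component override could look something like:

components:
  - name: zarf-registry
    required: true
    import:
      url: oci://ghcr.io/defenseunicorns/packages/init:v0.36.1
    # block the deploy until kstatus reports the permanent registry Deployment ready
    healthChecks:
      - apiVersion: apps/v1
        kind: Deployment
        namespace: zarf
        name: zarf-docker-registry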