xlab-uiuc / acto

Push-Button End-to-End Testing of Kubernetes Operators and Controllers
Apache License 2.0

Starter Project #4 Differentiate between misconfiguration and bugs #221

Open Spedoske opened 1 year ago

Spedoske commented 1 year ago

What we encountered

We found that some test cases generated by Acto contain misconfigurations. Here is an example of a mutation from state 0 to state 1. In the following example (see CRD Definition), Acto adds a livenessProbe override to the custom resource. The override is invalid because RabbitMQ does not listen on port 8500. As a result, Kubernetes repeatedly kills the pod because it never passes the liveness check.

The alarm report contains many similar cases, such as an invalid image name or a missing field. This issue aims to solve the problem, or at least mitigate it.

What we could do

Improve the test cases generated by Acto

TBD

Collect events (and logs) from Kubernetes and classify the alarms

The event below indicates that the pod has an invalid config and could not be created, which is different from a crash event. We think this kind of event may indicate a misconfiguration.

  Warning  FailedCreate      50s (x19 over 5m40s)  statefulset-controller  create Pod test-cluster-server-2 in StatefulSet test-cluster-server failed error: Pod "test-cluster-server-2" is invalid: spec.containers[0].image: Required value
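The classification idea above could be sketched roughly as follows. This is not Acto's implementation: the `Event` class, the `classify_event` function, and the pattern lists are hypothetical names chosen for illustration. The sketch assumes that API validation failures can be recognized by phrases such as "is invalid:" and "Required value" in the event message, as in the event shown above.

```python
import re
from dataclasses import dataclass


@dataclass
class Event:
    """Minimal stand-in for a Kubernetes event (hypothetical, not Acto's type)."""
    type: str     # "Normal" or "Warning"
    reason: str   # e.g. "FailedCreate", "BackOff"
    message: str


# Assumption: these phrases, emitted by API validation, hint at a bad config.
MISCONFIG_PATTERNS = [
    re.compile(r"is invalid:"),
    re.compile(r"Required value"),
    re.compile(r"Invalid value"),
    re.compile(r"Unsupported value"),
]

# Assumption: a restart back-off means the container started and then failed.
CRASH_PATTERNS = [
    re.compile(r"Back-off restarting failed container"),
]


def classify_event(event: Event) -> str:
    """Return "misconfiguration", "crash", or "unknown" for one event."""
    if event.type != "Warning":
        return "unknown"
    if any(p.search(event.message) for p in MISCONFIG_PATTERNS):
        return "misconfiguration"
    if any(p.search(event.message) for p in CRASH_PATTERNS):
        return "crash"
    return "unknown"


if __name__ == "__main__":
    failed_create = Event(
        type="Warning",
        reason="FailedCreate",
        message=(
            'create Pod test-cluster-server-2 in StatefulSet test-cluster-server '
            'failed error: Pod "test-cluster-server-2" is invalid: '
            "spec.containers[0].image: Required value"
        ),
    )
    print(classify_event(failed_create))
```

Message matching like this is only a heuristic. A failed liveness probe, for instance, looks the same in the events whether the probe is misconfigured or the application is genuinely unhealthy, so such events would fall into "unknown" here and need further inspection.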

CRD Definition

Mutation:

$ diff mutated-0.yaml mutated-1.yaml 
>   override:
>     statefulSet:
>       spec:
>         template:
>           spec:
>             containers:
>             - livenessProbe:
>                 httpGet:
>                   port: 8500
>                 initialDelaySeconds: 10
>               name: b
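One possible way to flag a mutation like this before it reaches the cluster is to compare probe ports against the ports a container declares. The sketch below is an assumption-laden illustration, not part of Acto: `probe_port_mismatches` is a hypothetical helper, and the example container (its name and port list) is made up. Declared `containerPort` entries are informational in Kubernetes, so a mismatch is a hint rather than proof of a misconfiguration.

```python
from typing import List


def probe_port_mismatches(container: dict) -> List[str]:
    """List probes whose target port is not among the container's declared ports.

    Hypothetical helper: `container` is a dict shaped like a Pod container spec.
    """
    declared = set()
    for port in container.get("ports") or []:
        if "containerPort" in port:
            declared.add(port["containerPort"])
        if port.get("name"):
            declared.add(port["name"])  # probes may reference a port by name

    problems = []
    for probe_name in ("livenessProbe", "readinessProbe", "startupProbe"):
        probe = container.get(probe_name) or {}
        for handler in ("httpGet", "tcpSocket", "grpc"):
            target = (probe.get(handler) or {}).get("port")
            if target is not None and target not in declared:
                problems.append(
                    f"{probe_name}.{handler}.port {target} is not a declared port"
                )
    return problems


if __name__ == "__main__":
    # Illustrative container spec; the name and ports are assumptions.
    container = {
        "name": "rabbitmq",
        "ports": [
            {"name": "amqp", "containerPort": 5672},
            {"name": "management", "containerPort": 15672},
        ],
        "livenessProbe": {"httpGet": {"port": 8500}, "initialDelaySeconds": 10},
    }
    for problem in probe_port_mismatches(container):
        print(problem)
```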

Use the following custom resource to demonstrate. State 0:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: test-cluster
  namespace: rabbitmq-system
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: null
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - test-cluster
        topologyKey: kubernetes.io/hostname
  image: null
  imagePullSecrets: null
  persistence:
    storage: 50Gi
  rabbitmq:
    additionalConfig: 'cluster_partition_handling = pause_minority

      vm_memory_high_watermark_paging_ratio = 0.99

      disk_free_limit.relative = 1.0

      collect_statistics_interval = 10000

      '
  replicas: 3
  resources:
    limits:
      cpu: 1
      memory: 4Gi
    requests:
      cpu: 1
      memory: 4Gi
  secretBackend: null
  service:
    type: ClusterIP
  skipPostDeploySteps: false
  terminationGracePeriodSeconds: 1024
  tls:
    caSecretName: null
    disableNonTLSListeners: false
    secretName: null
  tolerations: null

State 1:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: test-cluster
  namespace: rabbitmq-system
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: null
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/name
            operator: In
            values:
            - test-cluster
        topologyKey: kubernetes.io/hostname
  image: null
  imagePullSecrets: null
  override:
    statefulSet:
      spec:
        template:
          spec:
            containers:
            - livenessProbe:
                httpGet:
                  port: 8500
                initialDelaySeconds: 10
              name: b
  persistence:
    storage: 50Gi
  rabbitmq:
    additionalConfig: 'cluster_partition_handling = pause_minority

      vm_memory_high_watermark_paging_ratio = 0.99

      disk_free_limit.relative = 1.0

      collect_statistics_interval = 10000

      '
  replicas: 3
  resources:
    limits:
      cpu: 1
      memory: 4Gi
    requests:
      cpu: 1
      memory: 4Gi
  secretBackend: null
  service:
    type: ClusterIP
  skipPostDeploySteps: false
  terminationGracePeriodSeconds: 1024
  tls:
    caSecretName: null
    disableNonTLSListeners: false
    secretName: null
  tolerations: null

Spedoske commented 1 year ago

To-do list as of 6/9/2023