ContainersNotReady delay

deimosfr commented 6 years ago

Hi,

I often have this kind of message: ContainersNotReady. Even if I add really long delay on initialDelaySeconds. What is the recommended stuff to avoid this? In addition,I do not see any strange behavhior while the pods are booting.

Thanks

whs commented 6 years ago

Are your pod showing as 0/1 in kubectl get pods? (or something like 2/3 if you use multi containers pod)

Usually this is caused by having a readiness check on a pod that is slow to boot. We're having this issue as well (our monolithic can take like 5 min to boot) It's on my radar to search for a solution to this but it could be around the second half of the year as we're busy with upcoming product launches.

If your container is showing as 1/1 and you still get ContainersNotReady then it's probably a bug. Please comment with details if this is the case.

whs commented 6 years ago

I think I know the problem. Should be a quick fix. I'll try testing it.

deimosfr commented 6 years ago

Hi,

I confirm I get 1/1 or 2/2 or 3/3. It happens while booting. The readiness check is always working, I tried to delay it as max as possible but this get the message in the early boot. It's like if the Container Not Ready was catch before the readiness check happened

whs commented 6 years ago

I added ContainersNotReady to the blacklist (plus Longnotready bugfix). Could you please test that the patch is working? The image is tagged latest on Docker Hub.

deimosfr commented 6 years ago

good for me ! thanks

whs commented 6 years ago

Released as v3.2.3

deimosfr commented 6 years ago

Sorry but I still got the issue

whs commented 6 years ago

https://github.com/wongnai/kube-slack/blob/master/src/monitors/waitingpods.js#L11 ContainersNotReady is blacklisted. Could you please give

Docker tag you are using
Slack message (feel free to censor the pod/container name) as I need to know whether it is trigged by waitingpods or longnotready.

deimosfr commented 6 years ago

I'm going to test with v3.2.3, I tested with latest when you asked for test. I'll keep you updated

deimosfr commented 6 years ago

Hi,

I confirm I still got the issue:

$ kubectl describe pod/kube-slack-78b78b89cf-jh9dx
Name:           kube-slack-78b78b89cf-jh9dx
Namespace:      kube-system
Node:           node6/1.1.1.1
Start Time:     Sun, 11 Mar 2018 18:57:01 +0100
Labels:         app=kube-slack
                pod-template-hash=3463464579
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"kube-system","name":"kube-slack-78b78b89cf","uid":"5c5727a2-2555-11e8-9b5d-0007cb...
                scheduler.alpha.kubernetes.io/critical-pod=
Status:         Running
IP:             10.233.68.25
Controlled By:  ReplicaSet/kube-slack-78b78b89cf
Containers:
  kube-slack:
    Container ID:   docker://ab1f8a71bc1e303327e4c8edfd38a0bde66fbd7ce18e7e175c70d36547706ce1
    Image:          willwill/kube-slack:v3.2.3
    Image ID:       docker-pullable://willwill/kube-slack@sha256:dbfc705ba68b7079ada1e913250d76c50754ef3c211339e900b3a2dacb2c2a0b
    Port:           <none>
    State:          Running
      Started:      Sun, 11 Mar 2018 18:57:11 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      SLACK_URL:  https://hooks.slack.com/services/xxx/yyy
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kg9rw (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  default-token-kg9rw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kg9rw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/node=true
Tolerations:     <none>
Events:          <none>

My message when deleting a pod from a daemonset for example:

containers with unready status: [traefik]
kube-system/traefik-2qzq7: ContainersNotReady
containers with unready status: [traefik]

If you want to know more about what the daemonset looks like: https://github.com/MySocialApp/kubernetes-helm-chart-traefik/blob/master/kubernetes/templates/daemonset.yaml

Thanks

whs commented 6 years ago

So it is indeed triggered by LongNotReady: https://github.com/wongnai/kube-slack/blob/abfb14a39a677fd0c1195d806df511eb9048e470/src/monitors/longnotready.js#L74 and not where I added the status ignore. I'll revert afbd9699d7e9e94cb495cc0f6cde5cb544f1c9d4.

I'm taking vacation next week, so I might be able to work on this around end of the month. Sorry for the wait.

whs commented 6 years ago

Could you please try the following?

Set initialDelaySeconds back to default value. Setting this value cause pods to be not ready longer.
Set the NOT_READY_MIN_TIME environment variable to kube-slack. The default is 60000 which is 60s. In our production system we use 300000 as we have a Java application that is very slow to boot.

deimosfr commented 6 years ago

Hi, that doesn't change any thing. Even very small software that runs in a few seconds have the same issue.

whs commented 6 years ago

I just tested this in our cluster and I still can't reproduce:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - image: kitematic/hello-world-nginx
    name: hello-world-nginx
    ports:
    - containerPort: 80
    readinessProbe:
        httpGet:
          path: /
          port: 80

NOT_READY_MIN_TIME=300000
kube-slack 3.2.2 (image ID docker-pullable://willwill/kube-slack@sha256:6213f0e4bf3258b4265bea1a5706f87c2477dc8ba29905ed865efa18dbdbfa9f)
Kubernetes v1.7.8 (I haven't test, but we haven't get false positive on our v1.9.4-gke.1 cluster for some times as well)

deimosfr commented 6 years ago

Hi,

Sorry for the late answer, I'm currently testing with the latest version and your suggestion on NOT_READY_MIN_TIME.

Thanks

deimosfr commented 6 years ago

Looks good ! Thanks

wongnai / kube-slack

ContainersNotReady delay #27