weaviate / weaviate-helm

Helm charts to deploy Weaviate to k8s
https://weaviate.io/developers/weaviate/current/
BSD 3-Clause "New" or "Revised" License

Weaviate Replica not restarting #164

Open · jaspersorrio opened this issue 1 year ago

jaspersorrio commented 1 year ago

Hi Team,

Not sure if you are also observing this behaviour, and whether it is expected.

One of the pods randomly went into the Completed status, and Kubernetes is not attempting to restart it.

kubectl version --output=yaml

clientVersion:
  buildDate: "2023-07-19T12:20:54Z"
  compiler: gc
  gitCommit: fa3d7990104d7c1f16943a67f11b154b71f6a132
  gitTreeState: clean
  gitVersion: v1.27.4
  goVersion: go1.20.6
  major: "1"
  minor: "27"
  platform: linux/amd64
kustomizeVersion: v5.0.1
serverVersion:
  buildDate: "2023-06-01T19:54:16Z"
  compiler: gc
  gitCommit: 5319597f0ffe6e93e83a51e280d81fb2028bf4a0
  gitTreeState: clean
  gitVersion: v1.27.2-gke.1200
  goVersion: go1.20.4 X:boringcrypto
  major: "1"
  minor: "27"
  platform: linux/amd64

kubectl get all -n weaviate-prod

NAME             READY   STATUS      RESTARTS       AGE
pod/weaviate-0   1/1     Running     2 (4d5h ago)   6d11h
pod/weaviate-1   1/1     Running     2 (4d5h ago)   6d11h
pod/weaviate-2   0/1     Completed   0              4d5h
pod/weaviate-3   1/1     Running     3 (4d5h ago)   6d11h
pod/weaviate-4   1/1     Running     2 (4d5h ago)   6d11h

NAME                        TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)        AGE
service/weaviate            LoadBalancer   10.76.28.71   10.148.0.88   80:31446/TCP   6d11h
service/weaviate-headless   ClusterIP      None          <none>        80/TCP         6d11h

NAME                        READY   AGE
statefulset.apps/weaviate   4/5     6d11h

kubectl describe pod weaviate-2 -n weaviate-prod

Name:             weaviate-2
Namespace:        weaviate-prod
Priority:         0
Service Account:  default
Node:             gke-weaviate-cluster-5-n-default-pool-9b8a9b6b-4sp9/10.148.15.219
Start Time:       Fri, 18 Aug 2023 17:41:41 +0800
Labels:           app=weaviate
                  app.kubernetes.io/managed-by=helm
                  app.kubernetes.io/name=weaviate
                  controller-revision-hash=weaviate-9b6bddc44
                  statefulset.kubernetes.io/pod-name=weaviate-2
Annotations:      checksum/config: 7dbc8971b0afaca83ce4d8fa3ad9eec00e89bfea8a90eb2e4fc5d72a1f4f19a5
Status:           Succeeded
IP:               10.68.2.22
IPs:
  IP:           10.68.2.22
Controlled By:  StatefulSet/weaviate
Init Containers:
  configure-sysctl:
    Container ID:  containerd://9c318ad3fcbe23f8592960031da760da7e0cfee6e9450b022f2be7ef528293ec
    Image:         docker.io/alpine:latest
    Image ID:      docker.io/library/alpine@sha256:82d1e9d7ed48a7523bdebc18cf6290bdb97b82302a8a9c27d4fe885949ea94d1
    Port:          <none>
    Host Port:     <none>
    Command:
      sysctl
      -w
      vm.max_map_count=524288
      vm.overcommit_memory=1
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 18 Aug 2023 17:41:46 +0800
      Finished:     Fri, 18 Aug 2023 17:41:46 +0800
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qcvz2 (ro)
Containers:
  weaviate:
    Container ID:  containerd://7cd39f31ebcde54c39ba3e956a617273bde26a45af86eec538199788ac158b77
    Image:         docker.io/semitechnologies/weaviate:1.20.0
    Image ID:      docker.io/semitechnologies/weaviate@sha256:473d094b99f4f045831cc6fa227e5b838aeddb8c89df1355db4d4a4526a43e4e
    Ports:         8080/TCP, 2112/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /bin/weaviate
    Args:
      --host
      0.0.0.0
      --port
      8080
      --scheme
      http
      --config-file
      /weaviate-config/conf.yaml
      --read-timeout=60s
      --write-timeout=60s
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 18 Aug 2023 17:41:47 +0800
      Finished:     Sun, 20 Aug 2023 18:04:29 +0800
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:8080/v1/.well-known/live delay=900s timeout=3s period=10s #success=1 #failure=30
    Readiness:      http-get http://:8080/v1/.well-known/ready delay=3s timeout=3s period=10s #success=1 #failure=3
    Environment:
      AUTHENTICATION_APIKEY_ALLOWED_KEYS:    xxx
      AUTHENTICATION_APIKEY_ENABLED:         true
      AUTHENTICATION_APIKEY_USERS:           eezee-system
      CLUSTER_DATA_BIND_PORT:                7001
      CLUSTER_GOSSIP_BIND_PORT:              7000
      GOGC:                                  100
      PROMETHEUS_MONITORING_ENABLED:         true
      QUERY_MAXIMUM_RESULTS:                 100000
      REINDEX_VECTOR_DIMENSIONS_AT_STARTUP:  false
      TRACK_VECTOR_DIMENSIONS:               false
      STANDALONE_MODE:                       true
      PERSISTENCE_DATA_PATH:                 /var/lib/weaviate
      DEFAULT_VECTORIZER_MODULE:             none
      CLUSTER_JOIN:                          weaviate-headless.weaviate-prod.svc.cluster.local
    Mounts:
      /var/lib/weaviate from weaviate-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qcvz2 (ro)
      /weaviate-config from weaviate-config (rw)
Conditions:
  Type               Status
  DisruptionTarget   True
  Initialized        True
  Ready              False
  ContainersReady    False
  PodScheduled       True
Volumes:
  weaviate-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  weaviate-data-weaviate-2
    ReadOnly:   false
  weaviate-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      weaviate-config
    Optional:  false
  kube-api-access-qcvz2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
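
For anyone hitting this in the meantime: deleting the stuck Pod makes the StatefulSet controller schedule a replacement. This is a generic StatefulSet recovery step, not a fix for the underlying cause:

kubectl delete pod weaviate-2 -n weaviate-prod
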
StefanBogdan commented 1 year ago

Hi @jaspersorrio , this is the first time I have seen this. Do you have the logs from the Completed Pod?
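
While the terminated Pod object still exists, its container logs should be retrievable with plain kubectl logs (using the pod name and namespace from the output above):

kubectl logs weaviate-2 -n weaviate-prod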

mohit-sarvam commented 11 months ago

What is the solution to this problem? I am seeing the same behaviour.

jaspersorrio commented 11 months ago

Hi Mohit,

I managed to work around this temporarily by increasing the CPU and RAM.

What does your workload look like?
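
If you have metrics-server installed in the cluster, you can check how close the pods are running to capacity (using the namespace from the output above):

kubectl top pods -n weaviate-prod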

StefanBogdan commented 11 months ago

Hi @mohit-sarvam , we have noticed this as well and as @jaspersorrio mentioned increasing the resources helps. It seems to be an issue when running out of Memory, in this case either the pod crashes or the kube-system decided to kill it by sending signal to the pod to gracefully shut down. This causes the Pod to be in a Completed state. We have not figured out how to overcome this yet in a nice manner. You can try and set requests and limits for the Weaviate Pods so the Pod crashes properly on OOM and not be Completed.

mohit-sarvam commented 11 months ago

Thanks @StefanBogdan @jaspersorrio, I am no longer seeing the issue after increasing the memory and the number of nodes.