wireapp / wire-server-deploy

Code to install/deploy wire-server (on kubernetes)
https://docs.wire.com
GNU Affero General Public License v3.0

brig error on deploy wire-server #261

Open maaaaaaav opened 4 years ago

maaaaaaav commented 4 years ago

Hi there,

thanks again for all the help and assistance.

I'm currently trying to deploy wire-server using Helm; everything is working fine except that the brig pods keep failing with CrashLoopBackOff.

When I pull the logs, this is all I get:

wireadmin@wire-controller:~/wire-server-deploy/ansible$ kubectl logs brig-8674744bc7-ccbtf
{"logger":"cassandra.brig","msgs":["I","Known hosts: [datacenter1:rack1:172.16.32.31:9042,datacenter1:rack1:172.16.32.32:9042,datacenter1:rack1:172.16.32.33:9042]"]}
{"logger":"cassandra.brig","msgs":["I","New control connection: datacenter1:rack1:172.16.32.33:9042#<socket: 11>"]}
NAME                                  READY   STATUS             RESTARTS   AGE
brig-8674744bc7-ccbtf                 0/1     CrashLoopBackOff   6          7m58s
brig-8674744bc7-jlpgn                 0/1     CrashLoopBackOff   7          7m58s
brig-8674744bc7-mbh5m                 0/1     CrashLoopBackOff   7          7m58s
cannon-0                              1/1     Running            0          7m58s
cannon-1                              1/1     Running            0          7m58s
cannon-2                              1/1     Running            0          7m58s
cargohold-d474c7847-mpj7w             1/1     Running            0          7m58s
cargohold-d474c7847-phms7             1/1     Running            0          7m58s
cargohold-d474c7847-r4j8b             1/1     Running            0          7m58s
cassandra-migrations-g667z            0/1     Completed          0          8m7s
demo-smtp-84b7b85ff6-k2djh            1/1     Running            0          9h
elasticsearch-index-create-xnzwm      0/1     Completed          0          8m1s
fake-aws-dynamodb-84f87cd86b-dsz2v    2/2     Running            0          9h
fake-aws-s3-5468cdf989-fccm9          1/1     Running            0          9h
fake-aws-s3-reaper-7c6d9cddd6-ff8fn   1/1     Running            0          9h
fake-aws-sns-5c56774d95-dwcsw         2/2     Running            0          9h
fake-aws-sqs-554bbc684d-cqxzl         2/2     Running            0          9h
galley-87df7b65f-kp588                1/1     Running            0          7m58s
galley-87df7b65f-t7wtd                1/1     Running            0          7m58s
galley-87df7b65f-vhzpg                1/1     Running            0          7m58s
gundeck-f9bf469f9-b9rxt               1/1     Running            0          7m58s
gundeck-f9bf469f9-clff6               1/1     Running            0          7m58s
gundeck-f9bf469f9-gx8d4               1/1     Running            0          7m57s
nginz-77f7ff6f5d-5m94p                2/2     Running            1          7m58s
nginz-77f7ff6f5d-h7w5n                2/2     Running            1          7m58s
nginz-77f7ff6f5d-pwbzl                2/2     Running            1          7m58s
redis-ephemeral-69bb4885bb-qbmdw      1/1     Running            0          8h
spar-59fd5db594-gbsbz                 1/1     Running            0          7m58s
spar-59fd5db594-jclmh                 1/1     Running            0          7m58s
spar-59fd5db594-zvbl6                 1/1     Running            0          7m58s
webapp-6cb84759d9-wfhc9               1/1     Running            0          7m58s
wireadmin@wire-controller:~/wire-server-deploy/ansible$

Those are the correct IPs for my three Cassandra nodes, and they seem to be up fine. I'm using cassandra-external to point brig at them.

Any guidance as to what I should upload to help with this would be much appreciated too.

Thanks!

akshaymankar commented 4 years ago

Hello @maaaaaaav, sometimes while pods are in a crash loop, the logs shown can be from just before the crash. The logs you've added don't look like a failure. Can you please check the logs a couple more times and see if there is anything new?
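
For reference, the logs of the previous, crashed container instance can be pulled with `kubectl logs --previous` (pod name taken from the listing above) — that is usually where the actual failure message ends up when a pod is in CrashLoopBackOff:

```shell
# Logs of the container currently running (may still be starting up):
kubectl logs brig-8674744bc7-ccbtf

# Logs of the previous, crashed container instance:
kubectl logs --previous brig-8674744bc7-ccbtf

# Follow the logs live to watch what happens right before the restart:
kubectl logs -f brig-8674744bc7-ccbtf
```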

ramesh8830 commented 4 years ago

I had the same issue, especially when trying to configure my own SMTP server using https://github.com/wireapp/wire-server-deploy/issues/266. Below are the warning and failure messages from kubectl describe pod.

Normal   Scheduled  4m22s                  default-scheduler    Successfully assigned production/brig-69969b5bdc-ndn8b to kubenode02
Warning  Unhealthy  3m29s (x5 over 4m9s)   kubelet, kubenode02  Readiness probe failed: Get http://10.233.65.172:8080/i/status: dial tcp 10.233.65.172:8080: connect: connection refused
Normal   Pulling    3m23s (x3 over 4m20s)  kubelet, kubenode02  Pulling image "quay.io/wire/brig:latest"
Warning  Unhealthy  3m23s (x6 over 4m13s)  kubelet, kubenode02  Liveness probe failed: Get http://10.233.65.172:8080/i/status: dial tcp 10.233.65.172:8080: connect: connection refused
Normal   Killing    3m23s (x2 over 3m53s)  kubelet, kubenode02  Container brig failed liveness probe, will be restarted
Normal   Pulled     3m22s (x3 over 4m16s)  kubelet, kubenode02  Successfully pulled image "quay.io/wire/brig:latest"
Normal   Created    3m22s (x3 over 4m16s)  kubelet, kubenode02  Created container brig
Normal   Started    3m22s (x3 over 4m15s)  kubelet, kubenode02  Started container brig

akshaymankar commented 4 years ago

@ramesh8830 Do you also see nothing interesting in kubectl logs for the brig pods?

ramesh8830 commented 4 years ago

Hi @akshaymankar

There is nothing in the kubectl logs for the brig pods. The brig pods keep waiting in the Ready status and eventually fall into CrashLoopBackOff.

I am also getting the same log @maaaaaaav reported in his post, showing all the Cassandra nodes.

wireadmin@wire-controller:~/wire-server-deploy/ansible$ kubectl logs brig-8674744bc7-ccbtf
{"logger":"cassandra.brig","msgs":["I","Known hosts: [datacenter1:rack1:172.16.32.31:9042,datacenter1:rack1:172.16.32.32:9042,datacenter1:rack1:172.16.32.33:9042]"]}
{"logger":"cassandra.brig","msgs":["I","New control connection: datacenter1:rack1:172.16.32.33:9042#<socket: 11>"]}
akshaymankar commented 4 years ago
 Warning Unhealthy 3m23s (x6 over 4m13s) kubelet, kubenode02 Liveness probe failed: Get http://10.233.65.172:8080/i/status: dial tcp 10.233.65.172:8080: connect: connection refused

This indicates that brig is taking some time to come up and K8s is not patient enough for that. Usually brig prints a line like this when it starts listening on the port:

I, Listening on 0.0.0.0:8080

I would make sure that the pod is getting enough CPU/RAM. If that is the case, I would bump the logging in brig up to Debug or even Trace and see if there is anything in the logs. Hope this helps!
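
As a sketch of what that override might look like, assuming the brig chart exposes a `config.logLevel` value and standard Kubernetes resource requests (check charts/brig/values.yaml in your wire-server-deploy checkout for the actual keys):

```yaml
# values override passed to helm for the wire-server chart (keys are
# assumptions — verify against charts/brig/values.yaml before use):
brig:
  config:
    logLevel: Debug      # or Trace; the default is usually Info
  resources:
    requests:
      cpu: "500m"        # give brig headroom so it starts before the
      memory: "512Mi"    # liveness probe gives up and restarts it
```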

ramesh8830 commented 4 years ago

It has enough CPU/RAM. It happens only when we use a username and password for the SMTP configuration other than the demo credentials. If I use the demo credentials for SMTP, then the brig pods run successfully.
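
Since the crash only appears with real SMTP credentials, it may help to verify those credentials independently of brig. A minimal sketch, assuming a STARTTLS-capable server on the submission port (host, port, and credentials below are placeholders — substitute your own):

```python
# Sanity-check SMTP credentials outside of brig: connect, upgrade to TLS
# if the server offers STARTTLS, and attempt to authenticate.
import smtplib


def check_smtp(host: str, port: int, user: str, password: str,
               timeout: float = 10.0) -> bool:
    """Return True if we can connect, negotiate TLS, and log in."""
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.ehlo()
            if smtp.has_extn("starttls"):
                smtp.starttls()
                smtp.ehlo()  # re-identify over the encrypted channel
            smtp.login(user, password)
        return True
    except (smtplib.SMTPException, OSError) as exc:
        # Covers auth rejections, protocol errors, DNS failures, timeouts.
        print(f"SMTP check failed: {exc}")
        return False


if __name__ == "__main__":
    # Placeholder values — replace with the credentials you gave brig.
    ok = check_smtp("smtp.example.com", 587, "wire@example.com", "secret")
    print("credentials OK" if ok else "credentials rejected/unreachable")
```

If this fails from the same network the cluster runs in, the problem is the SMTP setup rather than brig itself; if it succeeds, the brig SMTP configuration keys are the next thing to compare against the demo values.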