zulip / docker-zulip

Container configurations, images, and examples for Zulip.
https://zulip.com/

Add configuration to prevent docker-swarm from killing TCP connections to database #426

Open stratosgear opened 4 years ago

stratosgear commented 4 years ago

On the latest 2.1.3 (but also on 2.1.1 and 2.1.2) I am frequently (multiple times per day, 10-20?) getting this error emailed to me:

---------- Forwarded message ---------
From: myemail
Date: Fri, Apr 3, 2020 at 9:04 PM
Subject: [Django] a111cc8a0340: server closed the connection unexpectedly\n This probably means the server terminated abnormally\n  before or while processing the request.\n
To: myemail

Logger root, from module zerver.worker.queue_processors line 151:
Error generated by Anonymous user (not logged in) on a111cc8a0340 deployment

Traceback (most recent call last):
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/db.py", line 31, in execute
    return wrapper_execute(self, super().execute, query, vars)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/db.py", line 18, in wrapper_execute
    return action(sql, params)
psycopg2.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/worker/queue_processors.py", line 134, in consume_wrapper
    self.consume(data)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/worker/queue_processors.py", line 310, in consume
    user_profile = get_user_profile_by_id(event["user_profile_id"])
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/cache.py", line 186, in func_with_caching
    val = func(*args, **kwargs)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/models.py", line 2072, in get_user_profile_by_id
    return UserProfile.objects.select_related().get(id=uid)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/query.py", line 374, in get
    num = len(clone)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/query.py", line 232, in __len__
    self._fetch_all()
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/query.py", line 1121, in _fetch_all
    self._result_cache = list(self._iterable_class(self))
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/query.py", line 53, in __iter__
    results = compiler.execute_sql(chunked_fetch=self.chunked_fetch)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 899, in execute_sql
    raise original_exception
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/models/sql/compiler.py", line 889, in execute_sql
    cursor.execute(sql, params)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/utils.py", line 94, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/utils/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zulip-py3-venv/lib/python3.6/site-packages/django/db/backends/utils.py", line 64, in execute
    return self.cursor.execute(sql, params)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/db.py", line 31, in execute
    return wrapper_execute(self, super().execute, query, vars)
  File "/home/zulip/deployments/2020-04-01-21-55-56/zerver/lib/db.py", line 18, in wrapper_execute
    return action(sql, params)
django.db.utils.OperationalError: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

Deployed code:
- ZULIP_VERSION: 2.1.3
- version: docker

Request info: none

The installed version appears to work correctly though.

In other issues I have posted here, I have been advised that this is caused by restarted services, but that is not something I am doing: the Docker Swarm stack is not being restarted, and I see the services with uptimes of more than a day while still receiving these error messages.

I am running a single-node Docker Swarm with a docker-compose file identical to the one described in the docker-zulip repo.

For reference, here it is, with the secrets redacted:

version: '3.3'
services:
  database:
    image: zulip/zulip-postgresql:10
    environment:
      POSTGRES_DB: zulip
      POSTGRES_PASSWORD: xxx
      POSTGRES_USER: zulip
    volumes:
     - zulip_psql_data:/var/lib/postgresql/data
    networks:
     - default
    logging:
      driver: json-file
  memcached:
    image: memcached:alpine
    networks:
     - default
    logging:
      driver: json-file
    command:
      - 'sh'
      - '-euc'
      - |
        echo 'mech_list: plain' > "$$SASL_CONF_PATH"
        echo "zulip@$$HOSTNAME:$$MEMCACHED_PASSWORD" > "$$MEMCACHED_SASL_PWDB"
        exec memcached -S
    environment:
      SASL_CONF_PATH: '/home/memcache/memcached.conf'
      MEMCACHED_SASL_PWDB: '/home/memcache/memcached-sasl-db'
      MEMCACHED_PASSWORD: 'xxx'
  rabbitmq:
    image: rabbitmq:3.7.7
    environment:
      RABBITMQ_DEFAULT_PASS: xxx
      RABBITMQ_DEFAULT_USER: zulip
    volumes:
     - zulip_rabbitmq_data:/var/lib/rabbitmq
    networks:
     - default
    logging:
      driver: json-file
  redis:
    image: redis:alpine
    volumes:
     - zulip_redis_data:/data:rw
    networks:
     - default
    logging:
      driver: json-file
    command:
      - 'sh'
      - '-euc'
      - |
        echo "requirepass '$$REDIS_PASSWORD'" > /etc/redis.conf
        exec redis-server /etc/redis.conf
    environment:
      REDIS_PASSWORD: 'xxx'
  zulip:
    image: zulip/docker-zulip:2.1.3-0
    ports:
      - 80
    environment:
      DB_HOST: database
      DB_HOST_PORT: '5432'
      DB_USER: zulip
      DISABLE_HTTPS: 'True'
      SECRETS_email_password: xxx
      SECRETS_google_oauth2_client_secret: xxx
      SECRETS_postgres_password: xxx
      SECRETS_rabbitmq_password: xxx
      SECRETS_memcached_password: 'xxx'
      SECRETS_redis_password: 'xxx'
      SECRETS_secret_key: xxx
      SECRETS_social_auth_github_secret: xxx
      SETTING_EMAIL_HOST: smtp.gmail.com
      SETTING_EMAIL_HOST_USER: xxx
      SETTING_EMAIL_PORT: '587'
      SETTING_EMAIL_USE_SSL: 'False'
      SETTING_EMAIL_USE_TLS: 'True'
      SETTING_EXTERNAL_HOST: xxx.xxx.xxx
      SETTING_GOOGLE_OAUTH2_CLIENT_ID: xxxxx
      SETTING_MEMCACHED_LOCATION: memcached:11211
      SETTING_PUSH_NOTIFICATION_BOUNCER_URL: https://push.zulipchat.com
      SETTING_RABBITMQ_HOST: rabbitmq
      SETTING_REDIS_HOST: redis
      SETTING_SOCIAL_AUTH_GITHUB_KEY: xxx
      SETTING_ZULIP_ADMINISTRATOR: xxx
      SSL_CERTIFICATE_GENERATION: self-signed
      ZULIP_AUTH_BACKENDS: EmailAuthBackend,GoogleMobileOauth2Backend,GitHubAuthBackend
    volumes:
     - zulip_app_data:/data
    networks:
     - traefik-public
     - default
    logging:
      driver: json-file
    deploy:
      labels:
        traefik.docker.network: traefik-public
        traefik.enable: 'true'
        traefik.http.routers.zulip.entrypoints: websecure
        traefik.http.routers.zulip.rule: Host(`xxx.xxx.xxx`)
        traefik.http.routers.zulip.tls.certresolver: letsencryptresolver
        traefik.http.services.zulip.loadbalancer.server.port: '80'
networks:
  default:
    driver: overlay
  traefik-public:
    external: true
volumes:
  zulip_app_data:
    external: true
  zulip_psql_data:
    external: true
  zulip_rabbitmq_data:
    external: true
  zulip_redis_data:
    external: true

Some hopefully useful info:

docker version
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b7f0
 Built:             Wed Mar 11 01:25:46 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b7f0
  Built:            Wed Mar 11 01:24:19 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

uname -a
Linux xxx.com 4.15.0-91-generic #92-Ubuntu SMP Fri Feb 28 11:09:48 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

This is a "sister" issue as zulip/zulip#14456 that I also opened (but with another error message).

Anything else I can provide to help solve this...?

timabbott commented 4 years ago

OK, my theory is that the Docker Swarm networking stack is killing the open TCP connections between the Zulip server and the postgres/memcached servers. We had a much more fatal similar issue with RabbitMQ, fixed last year (b312001fd92dc36233e5a9f57cd9fada890880c4). The symptom is the same as if the service had been restarted -- the connections are killed, and each Zulip process re-establishes its connection when it discovers this (and sends an error email), resulting in this random distribution of error emails.

Googling suggests that other products have indeed had that sort of problem with Docker Swarm's aggressive killing of TCP connections. https://success.docker.com/article/ipvs-connection-timeout-issue seems to be their knowledge base article on the topic.

@stratosgear can you try playing with the diagnostic steps described in that article to see if they suggest this is what's happening? Based on that doc, it looks like Docker Swarm itself doesn't support configuring its behavior of killing idle TCP connections :(.

For memcached, https://github.com/lericson/pylibmc/issues/199, https://sendapatch.se/projects/pylibmc/behaviors.html, and https://pypi.org/project/pylibmc/1.3.0/ suggest pylibmc has an undocumented option for setting TCP keepalive behavior.
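If the IPVS idle timeout is indeed the culprit, the usual workaround is to have the containers send TCP keepalives more often than that timeout (900 seconds by default), so established connections never look idle. A minimal sketch against the compose file above -- assuming Docker 19.03+, which accepts per-service sysctls in Swarm (possibly requiring a newer compose file format than 3.3), and with purely illustrative values:

  zulip:
    # ... as above ...
    sysctls:
      # Start probing after 10 idle minutes (under the ~900s IPVS timeout),
      # then probe every 60 seconds.
      net.ipv4.tcp_keepalive_time: 600
      net.ipv4.tcp_keepalive_intvl: 60
      net.ipv4.tcp_keepalive_probes: 5

Note this only helps for connections whose client enables SO_KEEPALIVE (libpq, and therefore psycopg2, does by default); the memcached connection may still need the pylibmc keepalive behavior mentioned above.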

stratosgear commented 4 years ago

Wow, this is getting way too low level for me.

As I understand it, the connection drops and then re-establishes. I have never seen any issues on the actual running instance of Zulip, so this is not catastrophic.

So I would call this a self-healing auto-recovering issue.

Is there any way I can turn off these messages, though, so they stop cluttering my inbox?

stratosgear commented 4 years ago

Just mentioning another noteworthy bit of trivia...

Overnight, when no human users are active on our private Zulip instance, no error messages are emitted. But as soon as people start using it in the morning, and throughout the day, the messages resume. Does this fit with how you think this error exhibits itself...?

timabbott commented 4 years ago

Yes, because what's happening is the network connections between Zulip processes and the database/cache are dropped by Docker Swarm when idle, and when the first activity happens, the Zulip processes try to use the connection, get an error, and then retry (successfully).
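The same idea can also be applied from the database side, since PostgreSQL has its own tcp_keepalives_* settings. A sketch, assuming the zulip/zulip-postgresql image passes extra command arguments through to the postgres server the way the official postgres image does (values again only illustrative):

  database:
    image: zulip/zulip-postgresql:10
    # Probe idle client connections before the Swarm load balancer gives up on them.
    command: ['postgres', '-c', 'tcp_keepalives_idle=600', '-c', 'tcp_keepalives_interval=60']
    # ... rest of the service definition unchanged ...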

@andersk FYI; I wonder if it's worth trying to restructure the way we handle errors accessing the database to avoid sending an exception email when it only happens once and we succeed on retry?

stratosgear commented 4 years ago

I am not sure these are only timeouts...

Look at the frequency of these messages:

[screenshot: timestamps of the error emails]

From 11:17 to 11:20 there is a pair of identical error messages, which then immediately repeat at 11:20.

Are you keeping the connections open, or do you expect them to remain open for more than three minutes? I am not sure I follow why a timeout issue would cause this, or what kind of connections you expect to always stay open/connected. I have never seen similar connection problems with any other stack I have worked with.

Hopefully this timestamped error log helps a bit!

stratosgear commented 4 years ago

@timabbott Is there a way to turn off these emails please, please, please....?

I have been deleting 150-200 emails every day for the last 2-3 months.

Everything seems to be working fine in Zulip, but these insistent messages are driving me crazy!

Thanks!!!

timabbott commented 4 years ago

@stratosgear yeah, with that frequency it's clear that your memcached cache is just nonfunctional. Zulip is designed to work even if the cache is down, but it's definitely going to be a performance issue (and the Python memcached module has this terrible setup where the exception is different every time, so our rate-limiting of a given error email doesn't work with it).

The issue is almost certainly that the memcached authentication that Zulip sets up is broken, and your server cannot connect to memcached. I think https://github.com/zulip/zulip/pull/14925 may help.

stratosgear commented 4 years ago

Yes, I look forward to upgrading to something better.

But my request was for a way to stop these messages until then. Is there no way to temporarily silence them until the next upgrade...?

timabbott commented 4 years ago

Well, you can always change ADMINS in /etc/zulip/settings.py to make Zulip not send error emails at all. https://docs.djangoproject.com/en/3.0/ref/settings/#admins

stratosgear commented 4 years ago

That seems to have done the trick, thanks. Looking forward to the next release!

Cybernisk commented 4 years ago

Just want to bump this ticket. I get the message psycopg2.OperationalError: server closed the connection unexpectedly from time to time. My installation uses a simple docker-compose file, no Docker Swarm at all, running docker-zulip:2.1.7-0. Sometimes I get 10 messages a day, then weeks without a single one. Any ideas what could be the reason?

P.S. That Zulip instance is not used much; I guess that is why I don't get too many messages, but I'm afraid that once it gets a normal load I will see tons of the same errors.

Cybernisk commented 4 years ago

I'm continuing to investigate what could be causing the errors. I restarted Zulip on 2020-10-15 and it worked fine until yesterday. The backend error log looks much the same, with psycopg2 errors. Below is a PostgreSQL error log; it explains some of the errors in the backend log as the DB recovering and closing connections, but what is the initial cause? It had been working fine for about two weeks. =/ The other services (rabbitmq, redis) show no errors.

@timabbott maybe you have any new ideas?

**2020-10-15 09:07:26.021** UTC [1] LOG:  database system is ready to accept connections
**2020-11-02 04:58:15.594** UTC [1] LOG:  checkpointer process (PID 23) was terminated by signal 9
2020-11-02 04:58:15.594 UTC [1] LOG:  terminating any other active server processes
2020-11-02 04:58:15.594 UTC [38985] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38985] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38985] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [38983] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38983] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38983] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [38401] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38401] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38401] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [36889] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [36889] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [36889] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [36894] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [36894] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [36894] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [38991] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [38991] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [38991] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [38978] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38978] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38978] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [26] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [26] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [26] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [36893] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [36893] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [36893] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [38990] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [38990] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [38990] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [38961] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38961] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38961] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [38980] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [38980] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [38980] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [36891] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [36891] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [36891] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [36892] WARNING:  terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [36892] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [36892] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.644 UTC [1] LOG:  all server processes terminated; reinitializing
2020-11-02 04:58:15.683 UTC [38992] LOG:  database system was interrupted; last known up at 2020-11-02 04:54:32 UTC
2020-11-02 04:58:18.190 UTC [38992] LOG:  database system was not properly shut down; automatic recovery in progress
2020-11-02 04:58:18.315 UTC [38992] LOG:  redo starts at 1/8587C6A8
2020-11-02 04:58:18.406 UTC [38992] LOG:  invalid record length at 1/8588B1F0: wanted 24, got 0
2020-11-02 04:58:18.406 UTC [38992] LOG:  redo done at 1/8588B1C8
2020-11-02 04:58:18.406 UTC [38992] LOG:  last completed transaction was at log time 2020-11-02 04:58:13.697827+00
2020-11-02 04:58:18.527 UTC [1] LOG:  database system is ready to accept connections
2020-11-02 20:56:56.596 UTC [1] LOG:  server process (PID 39003) was terminated by signal 9
2020-11-02 20:56:56.596 UTC [1] DETAIL:  Failed process was running: COMMIT
2020-11-02 20:56:56.596 UTC [1] LOG:  terminating any other active server processes
2020-11-02 20:56:56.597 UTC [40473] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40473] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40473] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [39000] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [39000] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [39000] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40484] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40484] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40484] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40471] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40471] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40471] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40480] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40480] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40480] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [38999] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [38999] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [38999] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40476] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40476] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40476] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40477] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40477] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40477] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [39004] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [39004] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [39004] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [39270] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [39270] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [39270] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [38996] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [38996] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [38996] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [39001] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [39001] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [39001] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.619 UTC [1] LOG:  all server processes terminated; reinitializing
2020-11-02 20:56:56.647 UTC [40486] LOG:  database system was interrupted; last known up at 2020-11-02 20:53:29 UTC
2020-11-02 20:56:59.077 UTC [40486] LOG:  database system was not properly shut down; automatic recovery in progress
2020-11-02 20:56:59.090 UTC [40486] LOG:  redo starts at 1/86266C58
2020-11-02 20:56:59.107 UTC [40486] LOG:  invalid record length at 1/86277E58: wanted 24, got 0
2020-11-02 20:56:59.107 UTC [40486] LOG:  redo done at 1/86277E20
2020-11-02 20:56:59.107 UTC [40486] LOG:  last completed transaction was at log time 2020-11-02 20:56:40.840177+00
2020-11-02 20:56:59.140 UTC [1] LOG:  database system is ready to accept connections
2020-11-02 20:59:04.927 UTC [1] LOG:  server process (PID 40497) was terminated by signal 9
2020-11-02 20:59:04.927 UTC [1] DETAIL:  Failed process was running: SELECT /* get_messages */ anon_1.message_id, anon_1.flags 
    FROM (SELECT message_id, flags 
    FROM zerver_usermessage 
    WHERE user_profile_id = 38 AND message_id <= 28463 ORDER BY message_id DESC 
     LIMIT 1001) AS anon_1 ORDER BY message_id ASC
2020-11-02 20:59:04.927 UTC [1] LOG:  terminating any other active server processes
2020-11-02 20:59:04.928 UTC [40493] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40493] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40493] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40496] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40496] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40496] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40502] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40502] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40502] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40494] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40494] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40494] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40499] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40499] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40499] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40501] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40501] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40501] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40490] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40490] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40490] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40495] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40495] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40495] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40504] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40504] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40504] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.929 UTC [40500] WARNING:  terminating connection because of crash of another server process
2020-11-02 20:59:04.929 UTC [40500] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.929 UTC [40500] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.959 UTC [1] LOG:  all server processes terminated; reinitializing
2020-11-02 20:59:04.988 UTC [40505] LOG:  database system was interrupted; last known up at 2020-11-02 20:56:59 UTC
2020-11-02 20:59:07.418 UTC [40505] LOG:  database system was not properly shut down; automatic recovery in progress
2020-11-02 20:59:07.431 UTC [40505] LOG:  redo starts at 1/86277EC8
2020-11-02 20:59:07.451 UTC [40505] LOG:  invalid record length at 1/8628C9B8: wanted 24, got 0
2020-11-02 20:59:07.451 UTC [40505] LOG:  redo done at 1/8628C980
2020-11-02 20:59:07.451 UTC [40505] LOG:  last completed transaction was at log time 2020-11-02 20:58:54.39854+00
2020-11-02 20:59:07.477 UTC [40506] FATAL:  the database system is in recovery mode
2020-11-02 20:59:07.514 UTC [1] LOG:  database system is ready to accept connections
2020-11-03 07:55:16.140 UTC [1] LOG:  server process (PID 41482) was terminated by signal 9
2020-11-03 07:55:16.140 UTC [1] DETAIL:  Failed process was running: 
        SELECT
            subquery.recipient_id, MAX(subquery.message_id)
        FROM (
            (SELECT
                um.message_id AS message_id,
                m.recipient_id AS recipient_id
            FROM
                zerver_usermessage um
            JOIN
                zerver_message m
            ON
                um.message_id = m.id
            WHERE
                um.user_profile_id=20 AND
                um.flags & 2048 <> 0 AND
                m.recipient_id <> 20
            ORDER BY message_id DESC
            LIMIT 1000)
            UNION ALL
            (SELECT
                m.id AS message_id,
                sender_profile.recipient_id AS recipient_id
            FROM
                zerver_message m
            JOIN
                zerver_userprofile sender_profile
            ON
                m.sender_id = sender_profile.id
            WHERE
                m.recipient_id=20
            ORDER BY message_id DESC
            LIMIT 1000)
        ) AS subquery
        GROUP BY subquery.recipient_id

2020-11-03 07:55:16.140 UTC [1] LOG:  terminating any other active server processes
2020-11-03 07:55:16.140 UTC [40513] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40513] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40513] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40517] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40517] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40517] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41485] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41485] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41485] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40510] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40510] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40510] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40514] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40514] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40514] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40521] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40521] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40521] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41477] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41477] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41477] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40515] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40515] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40515] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41491] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41491] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41491] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41468] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41468] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41468] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41466] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41466] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41466] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.161 UTC [1] LOG:  all server processes terminated; reinitializing
2020-11-03 07:55:16.192 UTC [41492] LOG:  database system was interrupted; last known up at 2020-11-03 07:54:14 UTC
2020-11-03 07:55:16.314 UTC [41493] FATAL:  the database system is in recovery mode
2020-11-03 07:55:16.335 UTC [41494] FATAL:  the database system is in recovery mode
2020-11-03 07:55:18.835 UTC [41492] LOG:  database system was not properly shut down; automatic recovery in progress
2020-11-03 07:55:18.848 UTC [41492] LOG:  redo starts at 1/867C5AA8
2020-11-03 07:55:18.856 UTC [41492] LOG:  invalid record length at 1/867C9620: wanted 24, got 0
2020-11-03 07:55:18.856 UTC [41492] LOG:  redo done at 1/867C95F8
2020-11-03 07:55:18.856 UTC [41492] LOG:  last completed transaction was at log time 2020-11-03 07:55:15.52811+00
2020-11-03 07:55:18.888 UTC [1] LOG:  database system is ready to accept connections
2020-11-03 07:57:19.063 UTC [1] LOG:  server process (PID 41506) was terminated by signal 9
2020-11-03 07:57:19.063 UTC [1] DETAIL:  Failed process was running: SELECT "zerver_userprofile"."id", "zerver_userprofile"."full_name", "zerver_userprofile"."short_name", "zerver_userprofile"."delivery_email", "zerver_userprofile"."email", "zerver_realm"."string_id", "zerver_userprofile"."avatar_source", "zerver_userprofile"."avatar_version", "zerver_userprofile"."is_mirror_dummy" FROM "zerver_userprofile" INNER JOIN "zerver_realm" ON ("zerver_userprofile"."realm_id" = "zerver_realm"."id") WHERE (zerver_userprofile.id in (34, 11, 13, 16, 17, 18, 20, 23))
2020-11-03 07:57:19.063 UTC [1] LOG:  terminating any other active server processes
2020-11-03 07:57:19.064 UTC [41503] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41503] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41503] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41505] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41505] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41505] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41511] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41511] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41511] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41504] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41504] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41504] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41510] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41510] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41510] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41498] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41498] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41498] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41501] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41501] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41501] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41502] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41502] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41502] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41508] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41508] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41508] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41507] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41507] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41507] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41513] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41513] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41513] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41512] WARNING:  terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41512] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41512] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.101 UTC [1] LOG:  all server processes terminated; reinitializing
2020-11-03 07:57:19.126 UTC [41514] LOG:  database system was interrupted; last known up at 2020-11-03 07:55:18 UTC
2020-11-03 07:57:21.444 UTC [41514] LOG:  database system was not properly shut down; automatic recovery in progress
2020-11-03 07:57:21.459 UTC [41514] LOG:  redo starts at 1/867C9690
2020-11-03 07:57:21.481 UTC [41514] LOG:  invalid record length at 1/867DE9E0: wanted 24, got 0
2020-11-03 07:57:21.481 UTC [41514] LOG:  redo done at 1/867DE9B8
2020-11-03 07:57:21.481 UTC [41514] LOG:  last completed transaction was at log time 2020-11-03 07:57:19.022299+00
2020-11-03 07:57:21.538 UTC [1] LOG:  database system is ready to accept connections
timabbott commented 4 years ago

@Cybernisk I think your issue is unrelated to what @stratosgear originally reported. If I had to guess, you have some operational issue with your server (memory corruption, OOM kills due to insufficient memory, etc.) that resulted in postgres processes crashing; dmesg is the place to look for details on what is actually happening.

I'll also add that you should upgrade to current Zulip; we've made a lot of improvements that you're currently not benefitting from.

timabbott commented 3 years ago

I think the original issue with memcached authentication reported here was fixed, and we've also added https://zulip.readthedocs.io/en/latest/production/troubleshooting.html#restrict-unattended-upgrades to document why one might get a bunch of error emails after a postgres upgrade (one per process).

However, I think we may still have some follow-up to do related to the possibility that Docker Swarm's networking stack kills postgres connections in the original report: https://github.com/zulip/docker-zulip/issues/426.

@stratosgear can you confirm whether you're still seeing that symptom of postgres restarting daily in Docker Swarm?

stratosgear commented 3 years ago

Unfortunately, I will not be able to test this for you :(

Due to company policies, we were forced to switch to Teams, and I hate every minute of using it!!!

I look forward to using Zulip again in the future, though, as I believe it is definitely a superior product. Keep up the good work!

timabbott commented 9 months ago

It looks like the remaining problem was a Docker Swarm default configuration problem; a potential solution is suggested here:

https://github.com/vapor/postgres-kit/issues/164#issuecomment-738450518
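For anyone landing here: the workaround most often suggested for this class of problem (which may or may not be exactly what that comment describes) is to take the database service off the Swarm virtual IP entirely, so its traffic never passes through the IPVS load balancer and its idle timeout:

  database:
    # ... as above ...
    deploy:
      # DNS round-robin: the zulip container resolves the task's IP directly
      # instead of a VIP behind IPVS.
      endpoint_mode: dnsrr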

I'll transfer this to the docker-zulip repository.