stratosgear opened this issue 4 years ago
OK, my theory is that the Docker Swarm networking stack is killing the open TCP connections between the Zulip server and the postgres/memcached servers. We had a much more fatal, similar issue with RabbitMQ that was fixed last year (b312001fd92dc36233e5a9f57cd9fada890880c4). The symptom is the same as the service being restarted -- the connections are killed, and Zulip re-establishes them in each process when it discovers this (and sends an error email), resulting in this random distribution of error emails.
Googling suggests that other products have indeed had that sort of problem with Docker Swarm's aggressive killing of TCP connections. https://success.docker.com/article/ipvs-connection-timeout-issue seems to be their knowledge base article on the topic.
@stratosgear can you try playing with the diagnostic steps described in that article to see if they suggest this is what's happening? Based on that doc, it looks like Docker Swarm itself doesn't support configuring its networking behavior of killing idle TCP connections :(.
For memcached, https://github.com/lericson/pylibmc/issues/199, https://sendapatch.se/projects/pylibmc/behaviors.html, and https://pypi.org/project/pylibmc/1.3.0/ suggest they have an undocumented option to set the keepalive settings.
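For reference, here's a rough sketch of what enabling that might look like from the client side (assuming, per those links, that pylibmc 1.3.0+ accepts a `tcp_keepalive` key in its behaviors dict; that's an assumption to verify against your installed version, not Zulip's actual configuration):

```python
# Sketch only: "tcp_keepalive" is the (undocumented) behavior the links above
# discuss; confirm your pylibmc/libmemcached versions actually support it.
import pylibmc

client = pylibmc.Client(
    ["memcached:11211"],  # hypothetical host name, e.g. a docker-compose service
    binary=True,
    behaviors={
        "tcp_nodelay": True,
        "tcp_keepalive": True,  # keep idle connections from hitting Swarm's IPVS idle timeout
    },
)
client.set("connection-test", "ok")
print(client.get("connection-test"))
```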
Wow, this is getting way too low level for me.
As I understand it, the connection drops and then re-establishes. I have never seen any issues on the actual running Zulip instance, so this is not catastrophic.
So I would call this a self-healing, auto-recovering issue.
Is there any way to turn off these messages, though, so they stop cluttering my mailbox?
Just another noteworthy bit of trivia worth mentioning...
Overnight, when our private Zulip instance has no human users active, no error messages are emitted. But as soon as people start using it in the morning, and throughout the day, the messages resume. Does this fit with how you think this error exhibits itself...?
Yes, because what's happening is the network connections between Zulip processes and the database/cache are dropped by Docker Swarm when idle, and when the first activity happens, the Zulip processes try to use the connection, get an error, and then retry (successfully).
@andersk FYI; I wonder if it's worth trying to restructure the way we handle errors accessing the database to avoid sending an exception email when it only happens once and we succeed on retry?
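To illustrate the idea (this is not Zulip's actual error-handling code, just a minimal sketch of "retry once before reporting", with illustrative names):

```python
# Minimal sketch: only report an exception if the operation also fails on an
# immediate retry after reconnecting; a single dropped idle connection then
# never generates an error email.
import logging

def run_with_one_retry(operation, reconnect):
    try:
        return operation()
    except Exception:
        reconnect()  # e.g. re-establish the dropped database/cache connection
        try:
            return operation()
        except Exception:
            logging.exception("operation failed even after reconnecting")
            raise  # only at this point would an error email go out
```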
I am not sure these are only timeouts...
Look at the frequency of these messages:
From 11:17 to 11:20 there is a pair of identical error messages, and then they repeat again immediately at 11:20.
Are you keeping the connections open, or do you expect them to remain open for more than three minutes? I am not sure I follow why a timeout issue would cause this, or what kind of connections you expect to always be open/connected. I have never seen similar connection problems with any other stack I have worked with.
Hopefully this timestamped error log might help a bit!
@timabbott Is there a way to turn off these emails please, please, please....?
I have been deleting 150-200 emails every day for the last 2-3 months.
Everything seems to be working fine in Zulip, but these incessant messages are driving me crazy!
Thanks!!!
@stratosgear yeah, with that frequency it's clear that your memcached cache is just nonfunctional. Zulip is designed to work even if the cache is down, but it's definitely going to be a performance issue (and the Python memcached module has the unfortunate property that the exception is different every time, so our rate-limiting of a given error email doesn't work with it).
The issue is almost certainly that the memcached authentication that Zulip sets up is broken, and your server cannot connect to memcached. I think https://github.com/zulip/zulip/pull/14925 may help.
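If you want to check whether your server can reach memcached with the configured credentials at all, something along these lines should reproduce the failure outside of Zulip (a sketch; the host name is an assumption about the docker-zulip setup, and you'd substitute the username/password your Zulip server is configured to use):

```python
# Quick connectivity/auth check against memcached, outside of Zulip.
# SASL authentication requires the binary protocol.
import pylibmc

client = pylibmc.Client(
    ["memcached:11211"],             # assumed service name; adjust to your setup
    binary=True,
    username="zulip",                # assumption: use the credentials Zulip is configured with
    password="REPLACE_WITH_SECRET",
)
client.set("auth-test", "ok")
print(client.get("auth-test"))       # an exception here points at auth/networking being broken
```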
Yes, I look forward to upgrading to something better.
But my request was for a way to stop these messages in the meantime. Is there no way to temporarily suppress them until the next upgrade...?
Well, you can always change `ADMINS` in `/etc/zulip/settings.py` to make Zulip not send error emails at all: https://docs.djangoproject.com/en/3.0/ref/settings/#admins
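Concretely, something like this in `/etc/zulip/settings.py` (a sketch; with no addresses listed, Django has nobody to send error emails to, so they are effectively disabled):

```python
# In /etc/zulip/settings.py: disable error emails by leaving ADMINS empty.
ADMINS = ()

# Once the underlying issue is fixed, restore it to something like:
# ADMINS = (("Zulip Admin", "admin@example.com"),)
```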
That seems to have done the trick, thanks. Looking forward to the next release!
Just wanted to bump this ticket.
I get the message `psycopg2.OperationalError: server closed the connection unexpectedly`
from time to time. My installation uses a simple docker-compose file, no Docker Swarm at all. I'm using docker-zulip:2.1.7-0.
Sometimes I get 10 messages a day, then weeks without a single one. Any ideas what the reason could be?
P.S. That Zulip instance is not used much; I guess that is why I don't get too many messages, but I'm afraid that once it gets a normal load I will see tons of the same errors.
I'm continuing to investigate what could be causing these errors. I restarted Zulip on 2020-10-15 and it was working fine until yesterday. The backend error log looks much the same, with psycopg2 errors.
Below is the PostgreSQL error log. It explains some of the errors in the backend log as the database trying to recover, closing connections, etc., but what is the initial cause? It had been working fine for about two weeks. =/ The other services (RabbitMQ, Redis) show no errors.
@timabbott maybe you have any new ideas?
**2020-10-15 09:07:26.021** UTC [1] LOG: database system is ready to accept connections
**2020-11-02 04:58:15.594** UTC [1] LOG: checkpointer process (PID 23) was terminated by signal 9
2020-11-02 04:58:15.594 UTC [1] LOG: terminating any other active server processes
2020-11-02 04:58:15.594 UTC [38985] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38985] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38985] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [38983] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38983] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38983] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [38401] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38401] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38401] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [36889] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [36889] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [36889] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [36894] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [36894] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [36894] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [38991] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [38991] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [38991] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [38978] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38978] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38978] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [26] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [26] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [26] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [36893] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [36893] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [36893] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [38990] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [38990] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [38990] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [38961] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [38961] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [38961] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [38980] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [38980] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [38980] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.595 UTC [36891] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.595 UTC [36891] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.595 UTC [36891] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.594 UTC [36892] WARNING: terminating connection because of crash of another server process
2020-11-02 04:58:15.594 UTC [36892] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 04:58:15.594 UTC [36892] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 04:58:15.644 UTC [1] LOG: all server processes terminated; reinitializing
2020-11-02 04:58:15.683 UTC [38992] LOG: database system was interrupted; last known up at 2020-11-02 04:54:32 UTC
2020-11-02 04:58:18.190 UTC [38992] LOG: database system was not properly shut down; automatic recovery in progress
2020-11-02 04:58:18.315 UTC [38992] LOG: redo starts at 1/8587C6A8
2020-11-02 04:58:18.406 UTC [38992] LOG: invalid record length at 1/8588B1F0: wanted 24, got 0
2020-11-02 04:58:18.406 UTC [38992] LOG: redo done at 1/8588B1C8
2020-11-02 04:58:18.406 UTC [38992] LOG: last completed transaction was at log time 2020-11-02 04:58:13.697827+00
2020-11-02 04:58:18.527 UTC [1] LOG: database system is ready to accept connections
2020-11-02 20:56:56.596 UTC [1] LOG: server process (PID 39003) was terminated by signal 9
2020-11-02 20:56:56.596 UTC [1] DETAIL: Failed process was running: COMMIT
2020-11-02 20:56:56.596 UTC [1] LOG: terminating any other active server processes
2020-11-02 20:56:56.597 UTC [40473] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40473] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40473] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [39000] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [39000] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [39000] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40484] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40484] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40484] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40471] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40471] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40471] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40480] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40480] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40480] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [38999] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [38999] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [38999] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40476] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40476] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40476] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [40477] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [40477] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [40477] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [39004] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [39004] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [39004] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [39270] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [39270] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [39270] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [38996] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [38996] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [38996] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.597 UTC [39001] WARNING: terminating connection because of crash of another server process
2020-11-02 20:56:56.597 UTC [39001] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:56:56.597 UTC [39001] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:56:56.619 UTC [1] LOG: all server processes terminated; reinitializing
2020-11-02 20:56:56.647 UTC [40486] LOG: database system was interrupted; last known up at 2020-11-02 20:53:29 UTC
2020-11-02 20:56:59.077 UTC [40486] LOG: database system was not properly shut down; automatic recovery in progress
2020-11-02 20:56:59.090 UTC [40486] LOG: redo starts at 1/86266C58
2020-11-02 20:56:59.107 UTC [40486] LOG: invalid record length at 1/86277E58: wanted 24, got 0
2020-11-02 20:56:59.107 UTC [40486] LOG: redo done at 1/86277E20
2020-11-02 20:56:59.107 UTC [40486] LOG: last completed transaction was at log time 2020-11-02 20:56:40.840177+00
2020-11-02 20:56:59.140 UTC [1] LOG: database system is ready to accept connections
2020-11-02 20:59:04.927 UTC [1] LOG: server process (PID 40497) was terminated by signal 9
2020-11-02 20:59:04.927 UTC [1] DETAIL: Failed process was running: SELECT /* get_messages */ anon_1.message_id, anon_1.flags
FROM (SELECT message_id, flags
FROM zerver_usermessage
WHERE user_profile_id = 38 AND message_id <= 28463 ORDER BY message_id DESC
LIMIT 1001) AS anon_1 ORDER BY message_id ASC
2020-11-02 20:59:04.927 UTC [1] LOG: terminating any other active server processes
2020-11-02 20:59:04.928 UTC [40493] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40493] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40493] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40496] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40496] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40496] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40502] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40502] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40502] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40494] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40494] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40494] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40499] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40499] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40499] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40501] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40501] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40501] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40490] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40490] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40490] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40495] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40495] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40495] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.928 UTC [40504] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.928 UTC [40504] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.928 UTC [40504] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.929 UTC [40500] WARNING: terminating connection because of crash of another server process
2020-11-02 20:59:04.929 UTC [40500] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-02 20:59:04.929 UTC [40500] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-02 20:59:04.959 UTC [1] LOG: all server processes terminated; reinitializing
2020-11-02 20:59:04.988 UTC [40505] LOG: database system was interrupted; last known up at 2020-11-02 20:56:59 UTC
2020-11-02 20:59:07.418 UTC [40505] LOG: database system was not properly shut down; automatic recovery in progress
2020-11-02 20:59:07.431 UTC [40505] LOG: redo starts at 1/86277EC8
2020-11-02 20:59:07.451 UTC [40505] LOG: invalid record length at 1/8628C9B8: wanted 24, got 0
2020-11-02 20:59:07.451 UTC [40505] LOG: redo done at 1/8628C980
2020-11-02 20:59:07.451 UTC [40505] LOG: last completed transaction was at log time 2020-11-02 20:58:54.39854+00
2020-11-02 20:59:07.477 UTC [40506] FATAL: the database system is in recovery mode
2020-11-02 20:59:07.514 UTC [1] LOG: database system is ready to accept connections
2020-11-03 07:55:16.140 UTC [1] LOG: server process (PID 41482) was terminated by signal 9
2020-11-03 07:55:16.140 UTC [1] DETAIL: Failed process was running:
SELECT
subquery.recipient_id, MAX(subquery.message_id)
FROM (
(SELECT
um.message_id AS message_id,
m.recipient_id AS recipient_id
FROM
zerver_usermessage um
JOIN
zerver_message m
ON
um.message_id = m.id
WHERE
um.user_profile_id=20 AND
um.flags & 2048 <> 0 AND
m.recipient_id <> 20
ORDER BY message_id DESC
LIMIT 1000)
UNION ALL
(SELECT
m.id AS message_id,
sender_profile.recipient_id AS recipient_id
FROM
zerver_message m
JOIN
zerver_userprofile sender_profile
ON
m.sender_id = sender_profile.id
WHERE
m.recipient_id=20
ORDER BY message_id DESC
LIMIT 1000)
) AS subquery
GROUP BY subquery.recipient_id
2020-11-03 07:55:16.140 UTC [1] LOG: terminating any other active server processes
2020-11-03 07:55:16.140 UTC [40513] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40513] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40513] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40517] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40517] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40517] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41485] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41485] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41485] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40510] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40510] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40510] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40514] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40514] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40514] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40521] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40521] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40521] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41477] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41477] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41477] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [40515] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [40515] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [40515] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41491] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41491] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41491] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41468] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41468] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41468] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.140 UTC [41466] WARNING: terminating connection because of crash of another server process
2020-11-03 07:55:16.140 UTC [41466] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:55:16.140 UTC [41466] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:55:16.161 UTC [1] LOG: all server processes terminated; reinitializing
2020-11-03 07:55:16.192 UTC [41492] LOG: database system was interrupted; last known up at 2020-11-03 07:54:14 UTC
2020-11-03 07:55:16.314 UTC [41493] FATAL: the database system is in recovery mode
2020-11-03 07:55:16.335 UTC [41494] FATAL: the database system is in recovery mode
2020-11-03 07:55:18.835 UTC [41492] LOG: database system was not properly shut down; automatic recovery in progress
2020-11-03 07:55:18.848 UTC [41492] LOG: redo starts at 1/867C5AA8
2020-11-03 07:55:18.856 UTC [41492] LOG: invalid record length at 1/867C9620: wanted 24, got 0
2020-11-03 07:55:18.856 UTC [41492] LOG: redo done at 1/867C95F8
2020-11-03 07:55:18.856 UTC [41492] LOG: last completed transaction was at log time 2020-11-03 07:55:15.52811+00
2020-11-03 07:55:18.888 UTC [1] LOG: database system is ready to accept connections
2020-11-03 07:57:19.063 UTC [1] LOG: server process (PID 41506) was terminated by signal 9
2020-11-03 07:57:19.063 UTC [1] DETAIL: Failed process was running: SELECT "zerver_userprofile"."id", "zerver_userprofile"."full_name", "zerver_userprofile"."short_name", "zerver_userprofile"."delivery_email", "zerver_userprofile"."email", "zerver_realm"."string_id", "zerver_userprofile"."avatar_source", "zerver_userprofile"."avatar_version", "zerver_userprofile"."is_mirror_dummy" FROM "zerver_userprofile" INNER JOIN "zerver_realm" ON ("zerver_userprofile"."realm_id" = "zerver_realm"."id") WHERE (zerver_userprofile.id in (34, 11, 13, 16, 17, 18, 20, 23))
2020-11-03 07:57:19.063 UTC [1] LOG: terminating any other active server processes
2020-11-03 07:57:19.064 UTC [41503] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41503] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41503] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41505] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41505] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41505] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41511] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41511] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41511] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41504] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41504] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41504] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41510] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41510] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41510] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41498] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41498] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41498] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41501] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41501] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41501] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41502] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41502] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41502] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41508] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41508] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41508] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41507] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41507] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41507] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41513] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41513] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41513] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.064 UTC [41512] WARNING: terminating connection because of crash of another server process
2020-11-03 07:57:19.064 UTC [41512] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2020-11-03 07:57:19.064 UTC [41512] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2020-11-03 07:57:19.101 UTC [1] LOG: all server processes terminated; reinitializing
2020-11-03 07:57:19.126 UTC [41514] LOG: database system was interrupted; last known up at 2020-11-03 07:55:18 UTC
2020-11-03 07:57:21.444 UTC [41514] LOG: database system was not properly shut down; automatic recovery in progress
2020-11-03 07:57:21.459 UTC [41514] LOG: redo starts at 1/867C9690
2020-11-03 07:57:21.481 UTC [41514] LOG: invalid record length at 1/867DE9E0: wanted 24, got 0
2020-11-03 07:57:21.481 UTC [41514] LOG: redo done at 1/867DE9B8
2020-11-03 07:57:21.481 UTC [41514] LOG: last completed transaction was at log time 2020-11-03 07:57:19.022299+00
2020-11-03 07:57:21.538 UTC [1] LOG: database system is ready to accept connections
@Cybernisk I think your issue is unrelated to what @stratosgear originally reported. If I had to guess, you have some operational issue with your server (memory corruption, OOM kills due to insufficient memory, etc.) that resulted in postgres processes crashing; "terminated by signal 9" in your log is the typical signature of the Linux OOM killer. `dmesg` is the place to look for details about what is actually happening.
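For example, a quick way to scan for OOM-killer activity (a sketch; run it on the Docker host, where the kernel log is visible, and note that reading `dmesg` may require root):

```python
# Scan the kernel ring buffer for out-of-memory kills that would explain
# postgres backends dying with signal 9. Run on the Docker host, as root.
import subprocess

log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if any(marker in line for marker in ("Out of memory", "oom-kill", "Killed process")):
        print(line)
```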
I'll also add that you should upgrade to current Zulip; we've made a lot of improvements that you're currently not benefitting from.
I think the original issue with memcached authentication reported here was fixed, and we've also added https://zulip.readthedocs.io/en/latest/production/troubleshooting.html#restrict-unattended-upgrades to document why one might get a bunch of error emails after a postgres upgrade (one per process).
However, I think we may still have some follow-up to do related to the possibility that Docker Swarm's networking stack kills postgres connections in the original report: https://github.com/zulip/docker-zulip/issues/426.
@stratosgear can you confirm whether you're still seeing that symptom of postgres restarting daily in Docker Swarm?
Unfortunately, I will not be able to test this for you :(
Due to company policies, we were forced to switch to Teams, and I hate every minute of using it!!!
I look forward to using Zulip again in the future though, as I believe it is definitely a superior product. Keep up the good work!
It looks like the remaining problem was a Docker Swarm default configuration problem; a potential solution is suggested here:
https://github.com/vapor/postgres-kit/issues/164#issuecomment-738450518
I'll transfer this to the docker-zulip repository.
On the latest 2.1.3 (but also on 2.1.1 and 2.1.2) I am frequently (multiple times per day, 10-20?) getting this error emailed to me:
The installed version appears to work correctly though.
In other issues I have posted here, I have received the advice that this is caused by restarted services, but that is not something I do: the Docker Swarm stack is not restarted, and I see the services with uptimes of more than a day (during which I have still received these error messages).
I am running in a single node Docker Swarm with a docker-compose identical to the one described in docker-zulip repo.
For validation, here it is, with the secrets redacted:
Some hopefully useful info:
This is a "sister" issue as zulip/zulip#14456 that I also opened (but with another error message).
Anything else I can provide the help solve this...?