yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[DocDB] Postgres backends are terminated even when total memory usage was half of total available memory on the node. #13014

Closed: shantanugupta-yb closed this issue 9 months ago

shantanugupta-yb commented 2 years ago

Jira Link: DB-2731

Description

As part of testing the changes for [DB-2687 Make OOM killer prioritize PG backends with more memory], I observed that all of the Postgres backends were terminated even though total memory usage was only about half of the total available memory on the node.
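For anyone reproducing this, one way to see how the OOM killer would rank the backends is to read each process's oom_score and oom_score_adj from /proc. The sketch below is only illustrative (Linux-only; matching processes by comm name is an assumption, and this is not the harness used for DB-2687):

```python
import os
import re

def read_int(path):
    """Return the integer contents of a /proc file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def oom_priorities(name_pattern=r"postgres"):
    """List (pid, comm, oom_score, oom_score_adj) for matching processes,
    highest oom_score first (the kernel prefers to kill higher scores)."""
    rows = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited while we were scanning
        if not re.search(name_pattern, comm):
            continue
        rows.append((int(pid), comm,
                     read_int(f"/proc/{pid}/oom_score"),
                     read_int(f"/proc/{pid}/oom_score_adj")))
    return sorted(rows, key=lambda r: r[2] or 0, reverse=True)

if __name__ == "__main__":
    for pid, comm, score, adj in oom_priorities():
        print(f"{pid:>7} {comm:<20} oom_score={score} oom_score_adj={adj}")
```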

Test and instance details:

The total available memory on the node is 3.5 GB, and the actual memory usage was only 1.6 GB (RSS). The question is: if there was enough free memory available, why was the Postgres backend terminated with signal 9?
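A minimal way to capture the numbers quoted above (node memory vs. the combined RSS of the Postgres backends) is to read /proc/meminfo and sum VmRSS across the backends. This is a hedged sketch assuming a Linux node and matching processes by comm name; it is not how the figures in this report were collected:

```python
import os

def meminfo_kib():
    """Parse /proc/meminfo into a {field: value-in-kB} dict."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.strip().split()[0])
    return info

def postgres_rss_kib(name="postgres"):
    """Sum VmRSS (kB) over all processes whose comm contains `name`."""
    total = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if name not in f.read():
                    continue
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        total += int(line.split()[1])
                        break
        except OSError:
            continue  # process exited while we were scanning
    return total

if __name__ == "__main__":
    mem = meminfo_kib()
    rss = postgres_rss_kib()
    print(f"MemTotal     : {mem['MemTotal'] / 1024:9.1f} MiB")
    print(f"MemAvailable : {mem['MemAvailable'] / 1024:9.1f} MiB")
    print(f"postgres RSS : {rss / 1024:9.1f} MiB")
```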

Postgres logs:

2022-06-22 18:20:59.883 UTC [18282] LOG: server process (PID 7916) was terminated by signal 9: Killed
2022-06-22 18:20:59.885 UTC [18282] LOG: terminating any other active server processes
2022-06-22 18:20:59.926 UTC [7913] WARNING: terminating connection because of crash of another server process
2022-06-22 18:20:59.926 UTC [7913] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-06-22 18:21:00.687 UTC [18282] LOG: all server processes terminated; reinitializing
2022-06-22 18:21:01.208 UTC [8982] LOG: database system was interrupted; last known up at 2022-06-22 17:18:34 UTC
2022-06-22 18:21:01.325 UTC [8982] LOG: database system was not properly shut down; automatic recovery in progress
2022-06-22 18:21:01.337 UTC [8982] LOG: redo starts at 0/1000108
2022-06-22 18:21:01.337 UTC [8982] LOG: invalid record length at 0/10001B0: wanted 24, got 0
2022-06-22 18:21:01.337 UTC [8982] LOG: redo done at 0/1000140

Dmesg log:

[Wed Jun 22 18:21:00 2022] Killed process 7916 (postgres), UID 997, total-vm:650712kB, anon-rss:299340kB, file-rss:0kB, shmem-rss:52kB
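Converting that kill record from kB to MiB (a quick back-of-the-envelope check, not additional data from the report):

```python
# Convert the dmesg kill record for PID 7916 from kB to MiB (1 MiB = 1024 kB).
record = {"total-vm": 650712, "anon-rss": 299340, "file-rss": 0, "shmem-rss": 52}
for field, kib in record.items():
    print(f"{field:>9}: {kib / 1024:7.1f} MiB")
# anon-rss is ~292 MiB, so the killed backend itself held far less than the
# node's 3.5 GiB of RAM.
```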

[image attachment]

Below is an excerpt from the dmesg log showing the OOM killer message dump:

[image: dmesg OOM killer message dump]

OOM score of postgres/tserver/yb-master processes

[image: OOM scores of the postgres, tserver, and yb-master processes]
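For reading those scores: the kernel's OOM badness heuristic (the exact formula varies by kernel version and also counts swap and page-table pages) is roughly proportional to the share of RAM a task holds, with oom_score_adj added on top. A rough sketch using this node's numbers, purely as an illustration:

```python
def approx_badness(rss_kib, total_ram_kib, oom_score_adj=0):
    """Very rough sketch of the kernel's OOM badness score: points scale with
    the task's share of RAM, then oom_score_adj (-1000..1000) is added.
    The real heuristic also counts swap and page-table pages."""
    points = int(1000 * rss_kib / total_ram_kib) + oom_score_adj
    return max(points, 0)

# Illustrative numbers from this node (3.5 GiB RAM, killed backend RSS ~292 MiB):
node_ram_kib = 3.5 * 1024 * 1024
print(approx_badness(299340, node_ram_kib))                     # ~81 points
print(approx_badness(299340, node_ram_kib, oom_score_adj=500))  # ~581 points
```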
sushantrmishra commented 2 years ago

Based on the OOM killer prints:

Was the top command output captured before the OOM happened? @shantanugupta-yb
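If top output was not captured, one low-effort option for future runs is a small sampler that periodically appends MemAvailable to a log file, so there is memory history to inspect after an OOM kill. A minimal sketch (Linux-only; the file name and interval are arbitrary, and this is not part of any YugabyteDB tooling):

```python
import time

def sample_memory(logfile="mem_samples.log", interval_s=30):
    """Append a timestamped MemAvailable reading every interval_s seconds."""
    while True:
        with open("/proc/meminfo") as f:
            fields = dict(line.split(":", 1) for line in f)
        avail_kib = int(fields["MemAvailable"].strip().split()[0])
        with open(logfile, "a") as out:
            out.write(time.strftime("%Y-%m-%d %H:%M:%S")
                      + f" MemAvailable={avail_kib} kB\n")
        time.sleep(interval_s)

if __name__ == "__main__":
    sample_memory()
```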