[YSQL] OOM killer triggers in test with node restarts

qvad commented 2 years ago

Description

SUT: AWS c5.xlarge, 3 nodes. Slightly modified SqlDataLoad workload from sample-apps.

        "master_gflags": {
            "tablet_split_low_phase_size_threshold_bytes": "2097152",
            "tablet_split_limit_per_table": "8172",
            "enable_automatic_tablet_splitting": "false",
            "tablet_split_low_phase_shard_count_per_node": "134217728",
            "ysql_num_shards_per_tserver": "1"
        },
        "tserver_gflags": {
            "ysql_num_shards_per_tserver": "1",
            "memstore_size_mb": "1"
        }
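
As an illustration only: the issue does not say how the harness applies these maps, but assuming they reach the yb-master and yb-tserver processes as gflags-style `--name=value` arguments, the rendering could look like the sketch below. `render_gflags` is a hypothetical helper, not part of any YugabyteDB tooling.

```python
def render_gflags(flags):
    """Turn a {name: value} mapping into gflags-style --name=value arguments."""
    return [f"--{name}={value}" for name, value in flags.items()]


# The tserver flags from the description above.
tserver_gflags = {
    "ysql_num_shards_per_tserver": "1",
    "memstore_size_mb": "1",
}

# Assumption: these would be appended to the tserver command line.
tserver_args = render_gflags(tserver_gflags)
```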

The scenario focuses on setting intensive tablet splitting flags and running a simple workload. In this case we also restart nodes in parallel.

1. Start cluster
2. Start workload
3. Randomly restart nodes for some time (10 minutes)
4. Stop workload, check logs.
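
The random-restart step could be sketched roughly as follows. This is a minimal sketch, not the actual test code: `restart_node` is a hypothetical callback supplied by the harness, the 30-second pause between restarts is an assumption (the issue only gives the 10-minute total), and the workload is assumed to run separately in parallel.

```python
import random
import time


def restart_nodes_randomly(nodes, restart_node, duration_s=600.0, pause_s=30.0,
                           rng=None, clock=time.monotonic, sleep=time.sleep):
    """Restart randomly chosen nodes until duration_s has elapsed.

    restart_node is a hypothetical harness callback taking a node name.
    clock and sleep are injectable so the loop can be tested without waiting.
    Returns the nodes that were restarted, in order.
    """
    rng = rng or random.Random()
    restarted = []
    deadline = clock() + duration_s
    while clock() < deadline:
        node = rng.choice(nodes)   # pick any node, repeats allowed
        restart_node(node)
        restarted.append(node)
        sleep(pause_s)             # assumed pause between restarts
    return restarted
```

Keeping the returned restart order makes it easier to correlate a later OOM event with the node that had been restarted most recently.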

During the log-check stage, one of the nodes may become unavailable due to OOM:

Apr 26 18:36:46 localhost kernel: Out of memory: Kill process 6415 (postgres) score 45 or sacrifice child
Apr 26 18:36:46 localhost kernel: Killed process 6415 (postgres) total-vm:690868kB, anon-rss:347480kB, file-rss:0kB, shmem-rss:120kB
Apr 26 18:36:47 localhost kernel: postgres[6711]: segfault at 28 ip 00007fbcb11fa664 sp 00007ffec65ce230 error 4
Apr 26 18:36:47 localhost kernel: postgres[6734]: segfault at 28 ip 00007fbcb11fa664 sp 00007ffec65ce230 error 4
Apr 26 18:36:47 localhost kernel: in libpthread-2.23.so[7fbcb11f0000+17000]
Apr 26 18:36:47 localhost kernel:
Apr 26 18:36:47 localhost kernel: postgres[6635]: segfault at 28 ip 00007fbcb11fa664 sp 00007ffec65ce230 error 4
Apr 26 18:36:47 localhost kernel: in libpthread-2.23.so[7fbcb11f0000+17000]
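
To compare the OOM kill with the segfaults that follow it, the kernel lines above can be parsed with a small script. This is a sketch only: the regular expressions are written against the exact line shapes shown in this excerpt and may need adjusting for other kernel versions.

```python
import re

# Matches the "Killed process ..." line emitted after the OOM killer acts.
OOM_KILL_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB"
)
# Matches "<name>[<pid>]: segfault at <addr> ip <ip> ..." lines.
SEGFAULT_RE = re.compile(
    r"(?P<name>\w+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+)"
)


def summarize_kernel_log(lines):
    """Return (kills, segfaults) extracted from kernel log lines."""
    kills, segfaults = [], []
    for line in lines:
        m = OOM_KILL_RE.search(line)
        if m:
            kills.append({"pid": int(m["pid"]), "name": m["name"],
                          "anon_rss_kb": int(m["anon_rss_kb"])})
            continue
        m = SEGFAULT_RE.search(line)
        if m:
            segfaults.append({"pid": int(m["pid"]), "addr": int(m["addr"], 16),
                              "ip": m["ip"]})
    return kills, segfaults
```

Applied to the excerpt above, this separates the killed process (pid 6415) from the segfaulting ones (pids 6711, 6734, 6635), which all report the same fault address and instruction pointer.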

qvad commented 2 years ago

Got the same behaviour with tablet splitting disabled; changed the description and fixed the text.

mbautin commented 2 years ago

Are segfaults a direct effect of the OOM killer, or do they indicate an additional bug (e.g. an incorrect memory access)?

Issue: yugabyte/yugabyte-db #12304