yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] high memory usage observed for YSQL webserver process #19327

Closed sushantrmishra closed 10 months ago

sushantrmishra commented 1 year ago

Jira Link: DB-8133

Description

OOM logs: PID 14355 is consuming 39791 pages * 4 KiB ≈ 160 MB of memory. @cdavid looked at this and identified it as the YSQL webserver process. It is not clear why memory usage is so high in the webserver. I tried loading pg_stat_statements via the tserver /statements page, but that did not change the memory usage.

Sep 19 17:34:09 ip-10-8-11-43 kernel: Memory cgroup stats for /ysql: cache:24160KB rss:180640KB rss_huge:14336KB mapped_file:9212KB swap:0KB inactive_anon:23712KB active_anon:180944KB inactive_file:0KB active_file:4KB unevictable:0KB
Sep 19 17:34:09 ip-10-8-11-43 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Sep 19 17:34:09 ip-10-8-11-43 kernel: [14337]   995 14337    69586     6531      59        0             0 postgres
Sep 19 17:34:10 ip-10-8-11-43 kernel: [14352]   995 14352    33434     2806      47        0             0 postgres
Sep 19 17:34:10 ip-10-8-11-43 kernel: [14355]   995 14355   110292    39791     122        0             0 postgres
Sep 19 17:34:10 ip-10-8-11-43 kernel: [14358]   995 14358    69586     3213      52        0             0 postgres
Sep 19 17:34:10 ip-10-8-11-43 kernel: [14359]   995 14359    33964     2940      48        0             0 postgres
Sep 19 17:34:10 ip-10-8-11-43 kernel: [ 9498]   995  9498    87970     5920      65        0           900 postgres
Sep 19 17:34:10 ip-10-8-11-43 kernel: Memory cgroup out of memory: Kill process 9506 (rpc_tp_pggate_y) score 1013 or sacrifice child
Sep 19 17:34:10 ip-10-8-11-43 kernel: Killed process 9498 (postgres), UID 995, total-vm:351880kB, anon-rss:16376kB, file-rss:7256kB, shmem-rss:48kB
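For reference, the rss column in the kernel's OOM dump is in pages; with the 4 KiB page size quoted above, the webserver's footprint works out to roughly the figure mentioned:

$ echo $(( 39791 * 4096 / 1024 / 1024 )) MB
155 MB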


sushantrmishra commented 11 months ago

How to dump the memory usage of the webserver:

Attach to the process with gdb and run the commands below. This is just an example, not from the node where the problem happened.

(gdb) call (void*)(malloc(8192))
$1 = (void *) 0x6a37fd05040
(gdb) call YBCGetHeapConsumption(0x6a37fd05040)
$2 = (struct YBCStatusStruct *) 0x0
(gdb) p (YbTcmallocStats*)(0x6a37fd05040)
$3 = (struct YbTcmallocStats *) 0x6a37fd05040
(gdb) p *(YbTcmallocStats*)(0x6a37fd05040)
$4 = {total_physical_bytes = 18106982, heap_size_bytes = 4194304, current_allocated_bytes = 1067904,
  pageheap_free_bytes = 2187264, pageheap_unmapped_bytes = 0}
(gdb) p PgMemTracker.pg_cur_mem_bytes
$5 = 255552
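The same dump can also be scripted non-interactively with gdb's batch mode. This is only a rough sketch: it assumes gdb can attach to the webserver PID (WEBSERVER_PID below is a placeholder) and that the YbTcmallocStats / YBCGetHeapConsumption symbols used above are visible in the attached binary.

# Attach to the YSQL webserver and print its tcmalloc stats, mirroring the session above.
gdb -p "$WEBSERVER_PID" --batch \
  -ex 'set $buf = (void *) malloc(8192)' \
  -ex 'call YBCGetHeapConsumption($buf)' \
  -ex 'print *(YbTcmallocStats *) $buf' \
  -ex 'print PgMemTracker.pg_cur_mem_bytes'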
sushantrmishra commented 11 months ago

The webserver is accessed via the following endpoints:

qvad commented 11 months ago

@sushantrmishra Looks like we are caching the last result, which looks like a memory leak:

This is a SQLancer test (lots of random queries).

Start test:

[yugabyte@ip-172-151-19-221 ~]$ ps -eo pid,pcpu,pmem,vsz,rss,command | grep YSQL
  45448  0.0  0.4 2404900 31760 postgres: YSQL webserver
  1. After accessing 172.151.19.221:9000/rpcz and 172.151.19.221:13000/statements:
    [yugabyte@ip-172-151-19-221 ~]$ ps -eo pid,pcpu,pmem,vsz,rss,command | grep YSQL
    45448  0.0  0.7 2404900 56712 postgres: YSQL webserver
  2. After some more time running the test (not sure why not much more is taken):
    [yugabyte@ip-172-151-19-221 ~]$ ps -eo pid,pcpu,pmem,vsz,rss,command | grep YSQL
    45448  0.0  0.7 2404900 56808 postgres: YSQL webserver

Memory grew several times over and was not freed.
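To confirm that the growth is retained rather than transient, the webserver's RSS can be kept under observation while the endpoints are hit (a small sketch; the bracketed grep pattern only keeps grep itself out of the output):

watch -n 5 "ps -eo pid,pcpu,pmem,vsz,rss,command | grep '[Y]SQL webserver'"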

qvad commented 11 months ago

SQLancer reproducer (since there are a lot of unique queries)

  1. Get sqlancer from here https://github.com/yugabyte/sqlancer/releases/tag/sqlancer_2.0.0-yb
  2. Start it with java -Xmx12G -jar /PATH/TO/sqlancer-2.0.0-yb.jar --host 127.0.0.1 --port 5433 --username yugabyte --password yugabyte --num-threads 24 --timeout-seconds 28800 ysql --oracle HAVING
  3. Wait 10-15 minutes so there are a few databases and queries, then access the /statements endpoint. After that, the memory usage of the process should increase.
  4. Access the /statements-reset endpoint.
  5. Check memory usage again (a minimal check script is sketched after this list).
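A minimal check script for steps 3-5, assuming the YSQL webserver serves /statements and /statements-reset on port 13000 as in the earlier comment (the pgrep pattern and host are illustrative):

# Baseline RSS, hit /statements, re-check, hit /statements-reset, check again.
WEBSERVER_PID=$(pgrep -f 'YSQL webserver')
ps -o pid,rss,command -p "$WEBSERVER_PID"
curl -s http://127.0.0.1:13000/statements > /dev/null        # step 3
ps -o pid,rss,command -p "$WEBSERVER_PID"                     # RSS should have grown
curl -s http://127.0.0.1:13000/statements-reset > /dev/null   # step 4
ps -o pid,rss,command -p "$WEBSERVER_PID"                     # step 5: check memory usage again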
sushantrmishra commented 10 months ago

This issue is fixed as part of the fix for the following issues.