yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] Crash in 'postgres' with Unhandled Exception in 'libglog' Logging (Caught in one of the runs in CDC) #19298

Open shamanthchandra-yb opened 1 year ago

shamanthchandra-yb commented 1 year ago

Jira Link: DB-8108

Description

This is a stress case that runs with nemesis, so the crash may only show up in one of the random combinations.

Please find the stress report attached.

(lldb) target create "/home/yugabyte/yb-software/yugabyte-2.19.4.0-b4-centos-x86_64/postgres/bin/postgres" --core "/home/yugabyte/cores/core_7025_1695718136_!home!yugabyte!yb-software!yugabyte-2.19.4.0-b4-centos-x86_64!postgres!bin!postgres"
Core file '/home/yugabyte/cores/core_7025_1695718136_!home!yugabyte!yb-software!yugabyte-2.19.4.0-b4-centos-x86_64!postgres!bin!postgres' (x86_64) was loaded.
(lldb) bt all
warning: This version of LLDB has no plugin for the language "assembler". Inspection of frame variables will be limited.
* thread #1, name = 'postgres', stop reason = signal SIGABRT
  * frame #0: 0x00007fddf4e0c0a7 libc.so.6`__GI_raise(sig=6) at raise.c:54
    frame #1: 0x00007fddf4e0d4aa libc.so.6`__GI_abort at abort.c:89
    frame #2: 0x00007fddf05f8e3a libglog.so.0`google::LogMessage::Flush() + 346
    frame #3: 0x00007fddf05f8be3 libglog.so.0`google::LogMessage::~LogMessage() + 19
    frame #4: 0x00007fddf25999e9 libserver_process.so`yb::Webserver::Impl::LogMessageCallbackStatic(connection=<unavailable>, message="Failed to enter worker thread") at webserver.cc:523:5
    frame #5: 0x00007fddf25a2e88 libserver_process.so`cry + 200
    frame #6: 0x00007fddf25a854e libserver_process.so`worker_thread + 94
    frame #7: 0x00007fddf5782694 libpthread.so.0`start_thread(arg=0x00007fdde518b700) at pthread_create.c:333
    frame #8: 0x00007fddf4ebf41d libc.so.6`__clone at clone.S:109
  thread #2, stop reason = signal 0
    frame #0: 0x00007fddf6cf20dd ld.so`_dl_fixup(l=0x00007fddf6db7000, reloc_arg=<unavailable>) at dl-runtime.c:73
    frame #1: 0x00007fddf6cf8887 ld.so`_dl_runtime_resolve_avx512 at dl-trampoline.h:112
  thread #3, stop reason = signal 0
    frame #0: 0x00007fddf4eb65cd libc.so.6`poll at syscall-template.S:84
    frame #1: 0x00007fddf25a754d libserver_process.so`master_thread + 573
    frame #2: 0x00007fddf5782694 libpthread.so.0`start_thread(arg=0x00007fdde598c700) at pthread_create.c:333
    frame #3: 0x00007fddf4ebf41d libc.so.6`__clone at clone.S:109

The last sample app I was running was:

java -jar /tmp/tests/artifacts/stress-sample-app-tool/yb-stress-sample-apps-1.1.18.jar --workload SqlDataLoad --default_postgres_database cdc_6b7aa5 --num_writes 334000 --num_threads_write 26 --num_threads_read 0 --num_reads 0 --num_unique_keys 100000000000 --batch_size 195 --num_value_columns 17 --create_table_name test_cdc_3c1703 --skip_ddl --uuid_column --uuid 3eecdbb6-317c-4dbe-9b32-fae687fdfcd3 --large_key_multiplier 3 --large_value_multiplier 3 --uuid_marker 41a4db05-c5c0-4838-a0d6-e2730f1283ec --nodes 172.151.26.173:5433,172.151.21.231:5433,172.151.17.255:5433


zlareb1-yb commented 10 months ago

Observed a similar error in the Packed YCQL tests (which don't have any YSQL workload) as well:

(lldb) target create "/home/yugabyte/yb-software/yugabyte-2.18.5.0-b70-centos-x86_64/postgres/bin/postgres" --core "/home/yugabyte/cores/core_248834_1700740904_!home!yugabyte!yb-software!yugabyte-2.18.5.0-b70-centos-x86_64!postgres!bin!postgres"
Core file '/home/yugabyte/cores/core_248834_1700740904_!home!yugabyte!yb-software!yugabyte-2.18.5.0-b70-centos-x86_64!postgres!bin!postgres' (x86_64) was loaded.
(lldb) bt all
warning: This version of LLDB has no plugin for the language "assembler". Inspection of frame variables will be limited.
* thread #1, name = 'postgres', stop reason = signal SIGABRT
  * frame #0: 0x00007f7647d380a7 libc.so.6`__GI_raise(sig=6) at raise.c:54
    frame #1: 0x00007f7647d394aa libc.so.6`__GI_abort at abort.c:89
    frame #2: 0x00007f7643cab88a libglog.so.0`google::LogMessage::Flush() + 346
    frame #3: 0x00007f7643cab632 libglog.so.0`google::LogMessage::~LogMessage() + 18
    frame #4: 0x00007f7645764e89 libserver_process.so`yb::Webserver::Impl::LogMessageCallbackStatic(connection=<unavailable>, message="Failed to enter worker thread") at webserver.cc:523:5
    frame #5: 0x00007f764576c5b8 libserver_process.so`cry + 200
    frame #6: 0x00007f764577197e libserver_process.so`worker_thread + 94
    frame #7: 0x00007f76486ae694 libpthread.so.0`start_thread(arg=0x00007f7638856700) at pthread_create.c:333
    frame #8: 0x00007f7647deb41d libc.so.6`__clone at clone.S:109
  thread #2, stop reason = signal 0
    frame #0: 0x00007f7643237e10 pg_hint_plan.so`__do_fini
    frame #1: 0x00007f7649c4332a ld.so`_dl_fini at dl-fini.c:235
    frame #2: 0x00007f7647d3a969 libc.so.6`__run_exit_handlers(status=1, listp=0x00007f764809b5c0, run_list_atexit=true) at exit.c:82
    frame #3: 0x00007f7647d3a9b5 libc.so.6`__GI_exit(status=<unavailable>) at exit.c:104
    frame #4: 0x0000558674adba62 postgres`proc_exit(code=1) at ipc.c:157:2
    frame #5: 0x0000558674cbe8af postgres`errfinish(dummy=<unavailable>) at elog.c:801:3
    frame #6: 0x0000558674a3b91d postgres`bgworker_die(postgres_signal_arg=<unavailable>) at bgworker.c:672:2
    frame #7: 0x00007f76486b6ba0 libpthread.so.0`__restore_rt
    frame #8: 0x00007f7647deb9f3 libc.so.6`epoll_wait at syscall-template.S:84
    frame #9: 0x0000558674adda85 postgres`WaitEventSetWait [inlined] WaitEventSetWaitBlock(set=0x000055867870de88, cur_timeout=-1, occurred_events=0x00007ffe3605a0a8, nevents=1) at latch.c:1062:7
    frame #10: 0x0000558674adda77 postgres`WaitEventSetWait(set=0x000055867870de88, timeout=-1, occurred_events=<unavailable>, nevents=1, wait_event_info=<unavailable>) at latch.c:1014:8
    frame #11: 0x0000558674add503 postgres`WaitLatchOrSocket(latch=0x00007f76423e788c, wakeEvents=<unavailable>, sock=-1, timeout=<unavailable>, wait_event_info=117440512) at latch.c:399:7
    frame #12: 0x0000558674add400 postgres`WaitLatch(latch=<unavailable>, wakeEvents=<unavailable>, timeout=<unavailable>, wait_event_info=<unavailable>) at latch.c:353:9 [artificial]
    frame #13: 0x00007f76432595d0 yb_pg_metrics.so`webserver_worker_main(unused=<unavailable>) at yb_pg_metrics.c:376:3
    frame #14: 0x0000558674a3b7ac postgres`StartBackgroundWorker at bgworker.c:841:2
    frame #15: 0x0000558674a5411c postgres`maybe_start_bgworkers [inlined] do_start_bgworker(rw=0x000055867873e000) at postmaster.c:6033:4
    frame #16: 0x0000558674a540bf postgres`maybe_start_bgworkers at postmaster.c:6259:9
    frame #17: 0x0000558674a50980 postgres`PostmasterMain(argc=25, argv=0x00005586786d6000) at postmaster.c:1429:2
    frame #18: 0x000055867495623f postgres`PostgresServerProcessMain(argc=25, argv=0x00005586786d6000) at main.c:234:3
    frame #19: 0x000055867461f062 postgres`main + 34
    frame #20: 0x00007f7647d25825 libc.so.6`__libc_start_main(main=(postgres`main), argc=25, argv=0x00007ffe3605a9f8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffe3605a9e8) at libc-start.c:289
    frame #21: 0x000055867461ef79 postgres`_start at start.S:108
  thread #3, stop reason = signal 0
    frame #0: 0x00007f7647de25cd libc.so.6`poll at syscall-template.S:84
    frame #1: 0x00007f76457715ac libserver_process.so`master_thread + 572
    frame #2: 0x00007f76486ae694 libpthread.so.0`start_thread(arg=0x00007f7639057700) at pthread_create.c:333
    frame #3: 0x00007f7647deb41d libc.so.6`__clone at clone.S:109

Stress test link - http://stress.dev.yugabyte.com/stress_test/09501ea0-5786-4e0c-b996-7e3dbae0d5b6
Version - 2.18.5.0-b70

cc: @SergeyPotachev @rthallamko3 @renjith-yb @kripasreenivasan @shamanthchandra-yb

shamanthchandra-yb commented 8 months ago

Observed again. The run was on 2.20.2.0-b109, in test_cdc_with_consistency_bank_tx_before_image.

Run link is in the JIRA comments: https://yugabyte.atlassian.net/browse/DB-8108?focusedCommentId=95970

shamanthchandra-yb commented 7 months ago

@m-iancu last week we discussed that a few fixes went in around the webserver recently, and you had asked for runs where this issue hits. Here is the latest occurrence: https://yugabyte.atlassian.net/browse/DB-8108?focusedCommentId=97607