yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.98k stars 1.07k forks source link

[YSQL] Segmentation Fault in PrometheusWriter::WriteSingleEntry within PostgreSQL Process #19426

Open agsh-yb opened 1 year ago

agsh-yb commented 1 year ago

Jira Link: DB-8220

Description

Core dump is observed in test_ysql_snapshots_daily_test. Observed this failure in on 2.21.0.0-b42

Steps

  1. Create 20 YSQL tables
  2. Delete existing schedules if any
  3. Load data - using a large key multiplier-500 and Create a snapshot schedule
  4. Execute Nemesis - Failure moment- After Schedule-RESTART_VM, ADD_DROP_COLUMN
  5. Capture time for restore point
  6. Execute 20 restores in a row
  7. Evaluate nemesis after restore- RESTART_TSERVER, RESTART_MASTER, PARTITION_NETWORK, SLOW_NETWORK,
  8. RESTART_VM
  9. Assert no. of rows of data
  10. Check for FATALs and CORRUPT files

Environment setup

instance_type="c5.2xlarge",
master_gflags={
"ysql_num_shards_per_tserver": "10",
"yb_num_shards_per_tserver": "10",
},
tserver_gflags={
"ysql_num_shards_per_tserver": "10",
"yb_num_shards_per_tserver": "10",
'client_read_write_timeout_ms': str(30 * 60 * 1000),
'yb_client_admin_operation_timeout_sec': str(30 * 60),
}

Observations:

  1. Error:Found schema version mismatch for table 000033f3000030008000000000004000: expected 2, got 1
  2. Core dumps are found
  3. SST File size spiked echarts
2023-10-06 00:02:13,040:DEBUG: > 172.151.20.168 > /home/yugabyte/tserver/bin/ysqlsh --c "select count(k) as count from postgresqlkeyvalue0 where yb_hash_code(k) >= 26212 and yb_hash_code(k) <= 32765 " -h 172.151.17.9,172.151.17.20,172.151.20.168 -d postgres -A 2>&1
2023-10-06 00:02:13,040:DEBUG: < 172.151.20.168 < 
2023-10-06 00:02:13,041:DEBUG: > 172.151.20.168 > /home/yugabyte/tserver/bin/ysqlsh --c "select count(k) as count from postgresqlkeyvalue0 where yb_hash_code(k) >= 39318 and yb_hash_code(k) <= 45871 " -h 172.151.17.20,172.151.20.168,172.151.17.9 -d postgres -A 2>&1
2023-10-06 00:02:13,041:DEBUG: > 172.151.17.9 > /home/yugabyte/tserver/bin/ysqlsh --c "select count(k) as count from postgresqlkeyvalue0 where yb_hash_code(k) >= 45871 and yb_hash_code(k) <= 52424 " -h 172.151.17.20,172.151.20.168,172.151.17.9 -d postgres -A 2>&1
2023-10-06 00:02:13,041:DEBUG: > 172.151.17.9 > /home/yugabyte/tserver/bin/ysqlsh --c "select count(k) as count from postgresqlkeyvalue0 where yb_hash_code(k) >= 32765 and yb_hash_code(k) <= 39318 " -h 172.151.17.9,172.151.20.168,172.151.17.20 -d postgres -A 2>&1
2023-10-06 00:02:35,822:DEBUG: < 172.151.20.168 < ERROR:  schema version mismatch for table 000033f3000030008000000000004000: expected 2, got 1

2023-10-06 00:02:35,823:DEBUG: > 172.151.20.168 > /home/yugabyte/tserver/bin/ysqlsh --c "select count(k) as count from postgresqlkeyvalue0 where yb_hash_code(k) >= 39318 and yb_hash_code(k) <= 45871 " -h 172.151.17.20,172.151.20.168,172.151.17.9 -d postgres -A 2>&1
2023-10-06 00:02:36,057:DEBUG: < 172.151.20.168 < ERROR:  schema version mismatch for table 000033f3000030008000000000004000: expected 2, got 1

2023-10-06 00:02:36,058:DEBUG: > 172.151.17.9 > /home/yugabyte/tserver/bin/ysqlsh --c "select count(k) as count from postgresqlkeyvalue0 where yb_hash_code(k) >= 39318 and yb_hash_code(k) <= 45871 " -h 172.151.17.20,172.151.20.168,172.151.17.9 -d postgres -A 2>&1
2023-10-06 00:04:24,111:DEBUG: < 172.151.17.9 < 
2023-10-06 00:04:24,111:DEBUG: > 172.151.17.20 > /home/yugabyte/tserver/bin/ysqlsh --c "select count(k) as count from postgresqlkeyvalue0 where yb_hash_code(k) >= 32765 and yb_hash_code(k) <= 39318 " -h 172.151.17.20,172.151.17.9,172.151.20.168 -d postgres -A 2>&1
2023-10-06 00:04:24,111:DEBUG: < 172.151.17.9 < 
2023-10-06 00:04:24,112:DEBUG: > 172.151.20.168 > /home/yugabyte/tserver/bin/ysqlsh --c "select count(k) as count from postgresqlkeyvalue0 where yb_hash_code(k) >= 45871 and yb_hash_code(k) <= 52424 " -h 172.151.17.20,172.151.20.168,172.151.17.9 -d postgres -A 2>&1
2023-10-06 00:04:27,296:DEBUG: < 172.151.20.168 < count
25535
(1 row)

Logs Stack traceback:

(lldb) target create "/home/yugabyte/yb-software/yugabyte-2.21.0.0-b42-centos-x86_64/postgres/bin/postgres" --core "/home/yugabyte/cores/core_12771_1696550474_!home!yugabyte!yb-software!yugabyte-2.21.0.0-b42-centos-x86_64!postgres!bin!postgres"
Core file '/home/yugabyte/cores/core_12771_1696550474_!home!yugabyte!yb-software!yugabyte-2.21.0.0-b42-centos-x86_64!postgres!bin!postgres' (x86_64) was loaded.
(lldb) bt all
* thread #1, name = 'postgres', stop reason = signal SIGSEGV
  * frame #0: 0x00007fca39a97b67 libyb_util.so`std::__1::__hash_const_iterator<std::__1::__hash_node<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, void*>*> std::__1::__hash_table<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::__unordered_map_hasher<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::__unordered_map_equal<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, true>, std::__1::allocator<std::__1::__hash_value_type<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>(this=<unavailable>, __k=<unavailable>) const at __hash_table:2343:31
    frame #1: 0x00007fca39a977f6 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, long, yb::AggregationFunction, char const*, char const*) [inlined] std::__1::unordered_map<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::pair<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>>::find[abi:v160006](this=0x00007fca38ed6f68, __k="table_id") const at unordered_map:1445:69
    frame #2: 0x00007fca39a977e0 libyb_util.so`yb::PrometheusWriter::WriteSingleEntry(this=0x00007fca2e4a17a8, attr=0x00007fca38ed6f68, name="handler_latency_yb_ysqlserver_SQLProcessor_SelectStmt_count", value=32326, aggregation_function=kSum, type="", description="") at metrics_writer.cc:125:18
    frame #3: 0x00007fca38ecf2d2 libyb_pggate_webserver.so`yb::pggate::PgPrometheusMetricsHandler(req=<unavailable>, resp=<unavailable>) at pgsql_webserver_wrapper.cc:452:5
    frame #4: 0x00007fca3af67430 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::__function::__value_func<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator(this=0x000007d77fdc2c80, __args=0x00007fca2e4a3b58, __args=0x00007fca2e4a3a30)[abi:v160006](yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*&&) const at function.h:510:16
    frame #5: 0x00007fca3af67417 libserver_process.so`yb::Webserver::Impl::RunPathHandler(yb::Webserver::Impl::PathHandler const&, sq_connection*, sq_request_info*) [inlined] std::__1::function<void (yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*)>::operator(this= Function = yb::pggate::PgPrometheusMetricsHandler(yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*) , __arg=0x00007fca2e4a3b58, __arg=0x00007fca2e4a1a30)(yb::WebCallbackRegistry::WebRequest const&, yb::WebCallbackRegistry::WebResponse*) const at function.h:1156:12
    frame #6: 0x00007fca3af67417 libserver_process.so`yb::Webserver::Impl::RunPathHandler(this=0x000007d77fd7ac80, handler=0x000007d77fddc730, connection=0x000007d77f4c2000, request_info=<unavailable>) at webserver.cc:625:5
    frame #7: 0x00007fca3af66f9b libserver_process.so`yb::Webserver::Impl::BeginRequestCallback(this=0x000007d77fd7ac80, connection=0x000007d77f4c2000, request_info=0x000007d77f4c2000) at webserver.cc:557:10
    frame #8: 0x00007fca3af73c88 libserver_process.so`worker_thread + 5400
    frame #9: 0x00007fca3d7a01cf libpthread.so.0`start_thread + 239
    frame #10: 0x00007fca3d40be73 libc.so.6`__clone + 67
  thread #2, stop reason = signal 0
    frame #0: 0x00007fca3d7a645c libpthread.so.0`pthread_cond_wait@@GLIBC_2.3.2 + 508
    frame #1: 0x00007fca3af726e5 libserver_process.so`master_thread + 1301
    frame #2: 0x00007fca3d7a01cf libpthread.so.0`start_thread + 239
    frame #3: 0x00007fca3d40be73 libc.so.6`__clone + 67
  thread #3, stop reason = signal 0
    frame #0: 0x00007fca3ed76d08 ld-linux-x86-64.so.2`.annobin_dl_fini.c at dl-fini.c:84:12
    frame #1: 0x00007fca3d42329c libc.so.6`__run_exit_handlers + 316
    frame #2: 0x00007fca3d4233d0 libc.so.6`exit + 32
    frame #3: 0x00005592c2aada82 postgres`proc_exit(code=1) at ipc.c:157:2
    frame #4: 0x00005592c2c96271 postgres`errfinish(dummy=<unavailable>) at elog.c:801:3
    frame #5: 0x00005592c2a0d31d postgres`bgworker_die(postgres_signal_arg=<unavailable>) at bgworker.c:672:2
    frame #6: 0x00007fca3d7aacf0 libpthread.so.0`__restore_rt
    frame #7: 0x00007fca3d502027 libc.so.6`epoll_wait + 87
    frame #8: 0x00005592c2aafa66 postgres`WaitEventSetWait [inlined] WaitEventSetWaitBlock(set=0x000007d77fd95e88, cur_timeout=-1, occurred_events=0x00007ffdd14a05a8, nevents=1) at latch.c:1062:7
    frame #9: 0x00005592c2aafa56 postgres`WaitEventSetWait(set=0x000007d77fd95e88, timeout=-1, occurred_events=<unavailable>, nevents=1, wait_event_info=<unavailable>) at latch.c:1014:8
    frame #10: 0x00005592c2aaf503 postgres`WaitLatchOrSocket(latch=0x00007fca3804988c, wakeEvents=<unavailable>, sock=-1, timeout=<unavailable>, wait_event_info=117440512) at latch.c:399:7
    frame #11: 0x00005592c2aaf400 postgres`WaitLatch(latch=<unavailable>, wakeEvents=<unavailable>, timeout=<unavailable>, wait_event_info=<unavailable>) at latch.c:353:9 [artificial]
    frame #12: 0x00007fca38eb71e0 yb_pg_metrics.so`webserver_worker_main(unused=<unavailable>) at yb_pg_metrics.c:375:3
    frame #13: 0x00005592c2a0d1b2 postgres`StartBackgroundWorker at bgworker.c:841:2
    frame #14: 0x00005592c2a260eb postgres`maybe_start_bgworkers [inlined] do_start_bgworker(rw=0x000007d77fd7a500) at postmaster.c:6005:4
    frame #15: 0x00005592c2a2608e postgres`maybe_start_bgworkers at postmaster.c:6231:9
    frame #16: 0x00005592c2a228ad postgres`PostmasterMain(argc=25, argv=0x000007d77fd061a0) at postmaster.c:1429:2
    frame #17: 0x00005592c2925fb6 postgres`PostgresServerProcessMain(argc=25, argv=0x000007d77fd061a0) at main.c:234:3
    frame #18: 0x00005592c25e7952 postgres`main + 34
    frame #19: 0x00007fca3d40cd85 libc.so.6`__libc_start_main + 229
    frame #20: 0x00005592c25e7869 postgres`_start + 41

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

sushantrmishra commented 1 year ago

Possible duplicate of https://github.com/yugabyte/yugabyte-db/issues/18948 . We should reverify once fix for https://github.com/yugabyte/yugabyte-db/issues/18948 has landed.