yugabyte / yugabyte-db


[YSQL] [Upgrade] Segmentation Fault in PostgreSQL Process during epoll_wait #18779

Open rjalan-yb opened 1 year ago

rjalan-yb commented 1 year ago

Jira Link: DB-7660

Description

During upgrade automation for a geo-partitioned setup, a core file was generated while upgrading from 2.19.0.0-b142 to 2.19.2.0-b92.

Part of the core trace:

Core was generated by `postgres: yugabyte mv_upgrade_db2_19_0_0 10.150.0.189(39570) REFRESH MATERIALIZ'.
Program terminated with signal 6, Aborted.
#0  0x00007fd8aa1100a7 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

Thread 4 (Thread 0x7fd89ad13700 (LWP 14423)):
#0  0x00007fd8aa1c39f3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1  0x00007fd8a664ae5e in boost::asio::detail::epoll_reactor::run(long, boost::asio::detail::op_queue<boost::asio::detail::scheduler_operation>&) () from /home/yugabyte/yb-software/yugabyte-2.19.0.0-b142-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#2  0x00007fd8a664821d in boost::asio::detail::scheduler::run(boost::system::error_code&) () from /home/yugabyte/yb-software/yugabyte-2.19.0.0-b142-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#3  0x00007fd8a66478b7 in yb::rpc::IoThreadPool::Impl::Execute() () from /home/yugabyte/yb-software/yugabyte-2.19.0.0-b142-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#4  0x00007fd8a64e1d7c in yb::Thread::SuperviseThread(void*) () from /home/yugabyte/yb-software/yugabyte-2.19.0.0-b142-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#5  0x00007fd8aaa86694 in start_thread (arg=0x7fd89ad13700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7fd89ad13700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140568287065856, 4969676843575277779, 0, 140737229806335, 94299084827072, 140568287065856, -4952020724462520109, -4952126073107908397}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6  0x00007fd8aa1c341d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Steps:

testupgrade-aws-rf3-upgrade-geo-partition-2.19.0.0-b142: Start
    (     0.532s) User Login : Success
    (     0.164s) Refresh YB Version : Success
    (   120.468s) Setup Provider : Success
    (     0.075s) Updating Health Check Interval to 60000 sec : Success
    (  1667.727s) Create universe rjal-isd8154-165b2c9ac9-20230820-051349 : Success
    (    83.355s) Start sample workloads : Success
    (  4131.797s) Create geo-partitioned schema : Success
    (   707.368s) Bulk copy data : Success
    (    95.716s) Validation YSQL objects created : Success
    (    25.144s) Bulk copy verification : Success
    (   245.651s) Creating YBC backups in S3 : Success
    (    60.058s) Universe Upgrade failure : https://yugabyte.atlassian.net/browse/PLAT-8945
    (    76.710s) Validation YSQL objects created : Success
    (   108.046s) Upgrade Software to 2.19.2.0-b92 : Success
    (    90.597s) Validation YSQL objects created : Success
    (    89.154s) Validation YSQL objects created : Success
    (   100.236s) Validation YSQL objects created : Success
    (    89.154s) Validation YSQL objects created : Success
    (    93.848s) Validation YSQL objects created : Success
    (   100.622s) Validation YSQL objects created : Success
    (    59.357s) Validation YSQL objects created : Success
    (   217.296s) Validation YSQL objects created : Success
    (   123.334s) Validation YSQL objects created : Success
    (    64.533s) Validation YSQL objects created : Success
    (   204.892s) Validation YSQL objects created : Success
    (    82.181s) Check for fatals : >>> Integration Test Failed <<< 
Found at least one core file in 172.151.24.133.

    (     1.115s) Saved server log files and keys  : Success
    (    43.387s) Validation YSQL objects created : Success
    (    94.656s) Validation YSQL objects created : Success
    (    64.809s) Validation YSQL objects created : Success
    (    94.451s) Validation YSQL objects created : Success
    (    16.010s) Validation YSQL objects createdat /share/jenkins/workspace/itest-system-developer/logs/2.19.2.0_testupgrade-aws-rf3-upgrade-geo-partition-2.19.0.0-b142_20230820_091424 : Success
    (    59.284s) Destroy universe : Success
    (     0.282s) Check and stop workloads : Success
testupgrade-aws-rf3-upgrade-geo-partition-2.19.0.0-b142: End

Complete logs: https://drive.google.com/drive/folders/1Q6OhuCMPfHFUlSgWSIVI1dC3eOcW2S52?usp=sharing

rajapriya371 commented 10 months ago

Observing this issue on version 2.21.0.0-b270.

agsh-yb commented 4 months ago

Observed this issue in 2.20.4.0-b48 after upgrading from the base versions 2.20.3.0-b67 and 2.18.7.0-b30.

Core was generated by `postgres: yugabyte mv_upgrade_db2_18_7_0 10.150.1.103(44344) REFRESH MATERIALIZ'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007feb300330a7 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7feb2b6eb0c0 (LWP 90187))]

Thread 8 (Thread 0x7feb20af1700 (LWP 90189)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1  0x00007feb2c4164ac in boost::asio::detail::scheduler::run(boost::system::error_code&) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#2  0x00007feb2c415bb5 in yb::rpc::IoThreadPool::Impl::Execute() () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#3  0x00007feb2c29d339 in yb::Thread::SuperviseThread(void*) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#4  0x00007feb309a9694 in start_thread (arg=0x7feb20af1700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7feb20af1700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647842387712, -8729785608899740838, 0, 140721831396991, 94269506872736, 140647842387712, 8723423414984293210, 8723458221417479002}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#5  0x00007feb300e641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 7 (Thread 0x7feb212f2700 (LWP 90188)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1  0x00007feb2c4164ac in boost::asio::detail::scheduler::run(boost::system::error_code&) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#2  0x00007feb2c415bb5 in yb::rpc::IoThreadPool::Impl::Execute() () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#3  0x00007feb2c29d339 in yb::Thread::SuperviseThread(void*) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#4  0x00007feb309a9694 in start_thread (arg=0x7feb212f2700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7feb212f2700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647850780416, -8729785608899740838, 0, 140721831396991, 94269506871296, 140647850780416, 8723420118060022618, 8723458221417479002}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#5  0x00007feb300e641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 6 (Thread 0x7feb1eaed700 (LWP 90193)):
#0  0x00007feb300e69f3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1  0x00007feb2bf0b187 in epoll_poll () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/postgres/../lib/yb-thirdparty/libev.so.4
No symbol table info available.
#2  0x00007feb2bf0e489 in ev_run () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/postgres/../lib/yb-thirdparty/libev.so.4
No symbol table info available.
#3  0x00007feb2c43bff9 in yb::rpc::Reactor::RunThread() () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#4  0x00007feb2c29d339 in yb::Thread::SuperviseThread(void*) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#5  0x00007feb309a9694 in start_thread (arg=0x7feb1eaed700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7feb1eaed700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647808816896, -8729785608899740838, 0, 140721831397103, 94269506874176, 140647808816896, 8723383834713176922, 8723458221417479002}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6  0x00007feb300e641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 5 (Thread 0x7feb1dad7700 (LWP 90842)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
No locals.
#1  0x00007feb30c8f2bb in std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/postgres/../lib/yb-thirdparty/libc++.so.1
No symbol table info available.
#2  0x00007feb2c180d2d in yb::(anonymous namespace)::LongOperationTrackerHelper::Execute() () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#3  0x00007feb2c29d339 in yb::Thread::SuperviseThread(void*) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#4  0x00007feb309a9694 in start_thread (arg=0x7feb1dad7700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7feb1dad7700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647791949568, -8729785608899740838, 0, 140647800306239, 25, 140647791949568, 8723377251602053978, 8723458221417479002}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#5  0x00007feb300e641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 4 (Thread 0x7feb1faef700 (LWP 90191)):
#0  0x00007feb300e69f3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
No locals.
#1  0x00007feb2c41901e in boost::asio::detail::epoll_reactor::run(long, boost::asio::detail::op_queue<boost::asio::detail::scheduler_operation>&) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#2  0x00007feb2c41651d in boost::asio::detail::scheduler::run(boost::system::error_code&) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#3  0x00007feb2c415bb5 in yb::rpc::IoThreadPool::Impl::Execute() () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#4  0x00007feb2c29d339 in yb::Thread::SuperviseThread(void*) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#5  0x00007feb309a9694 in start_thread (arg=0x7feb1faef700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7feb1faef700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647825602304, -8729785608899740838, 0, 140721831396991, 94269506872448, 140647825602304, 8723381636763663194, 8723458221417479002}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#6  0x00007feb300e641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 3 (Thread 0x7feb202f0700 (LWP 90190)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
No locals.
#1  0x00007feb2c4164ac in boost::asio::detail::scheduler::run(boost::system::error_code&) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#2  0x00007feb2c415bb5 in yb::rpc::IoThreadPool::Impl::Execute() () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#3  0x00007feb2c29d339 in yb::Thread::SuperviseThread(void*) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#4  0x00007feb309a9694 in start_thread (arg=0x7feb202f0700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7feb202f0700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647833995008, -8729785608899740838, 0, 140721831396991, 94269506873024, 140647833995008, 8723422316009536346, 8723458221417479002}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#5  0x00007feb300e641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 2 (Thread 0x7feb1e2d8700 (LWP 90194)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
No locals.
#1  0x00007feb2c29a943 in yb::ThreadJoiner::Join() () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#2  0x00007feb2c4419de in yb::rpc::Reactor::Join() () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#3  0x00007feb2c41fcd8 in yb::rpc::Messenger::Shutdown() () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyrpc.so
No symbol table info available.
#4  0x00007feb30d91cb2 in void ev::base<ev_async, ev::async>::method_thunk<yb::pggate::PgApiImpl::Interrupter, &yb::pggate::PgApiImpl::Interrupter::AsyncHandler>(ev_loop*, ev_async*, int) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_pggate.so
No symbol table info available.
#5  0x00007feb2bf0af5b in ev_invoke_pending () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/postgres/../lib/yb-thirdparty/libev.so.4
No symbol table info available.
#6  0x00007feb2bf0eaee in ev_run () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/postgres/../lib/yb-thirdparty/libev.so.4
No symbol table info available.
#7  0x00007feb2c29d339 in yb::Thread::SuperviseThread(void*) () from /home/yugabyte/yb-software/yugabyte-2.18.7.0-b30-centos-x86_64/lib/yb/libyb_util.so
No symbol table info available.
#8  0x00007feb309a9694 in start_thread (arg=0x7feb1e2d8700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7feb1e2d8700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140647800342272, -8729785608899740838, 0, 140721831397775, 94269506874464, 140647800342272, 8723382763655707482, 8723458221417479002}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#9  0x00007feb300e641d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.

Thread 1 (Thread 0x7feb2b6eb0c0 (LWP 90187)):
#0  0x00007feb300330a7 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
        resultvar = 0
        pd = <optimized out>
        pid = 90187
        selftid = 90187
#1  0x00007feb300344aa in __GI_abort () at abort.c:89
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0, 0, 94269507428352, 1, 94269453683760, 0, 140648099557177, 94269453683760, 94269507435192, 0, 0, 0, 140648103118688, 140648103112832, 140648022716608, 0}}, sa_flags = 0, sa_restorer = 0x0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x000055bcd130b2da in errfinish ()
No symbol table info available.
#3  0x000055bcd1313749 in elog_start ()
No symbol table info available.
#4  0x000055bcd0d17c21 in AbortTransaction ()
No symbol table info available.
#5  0x000055bcd0d1b0aa in AbortCurrentTransaction ()
No symbol table info available.
#6  0x000055bcd11607fe in PostgresMain ()
No symbol table info available.
#7  0x000055bcd10a30de in BackendRun ()
No symbol table info available.
#8  0x000055bcd10a219c in ServerLoop ()
No symbol table info available.
#9  0x000055bcd109d3d5 in PostmasterMain ()
No symbol table info available.
#10 0x000055bcd0fa1cef in PostgresServerProcessMain ()
No symbol table info available.
#11 0x000055bcd0c66c12 in main ()
No symbol table info available.

cc: @kripasreenivasan

agsh-yb commented 2 months ago

We still see this kind of error; these look like similar issues: