yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[CDCSDK] Postgres core dump in 'YBCDestroyVirtualWalForCDC' while running CDC master case with nemesis, with pg connector #21651

Closed shamanthchandra-yb closed 6 months ago

shamanthchandra-yb commented 7 months ago

Jira Link: DB-10545

Description

The stress report is in the JIRA description. Stack trace from the core dump:

* thread #1, name = 'postgres', stop reason = signal SIGABRT
  * frame #0: 0x00007f05067daacf libc.so.6`raise + 271
    frame #1: 0x00007f05067adea5 libc.so.6`abort + 295
    frame #2: 0x000055af582eeb8c postgres`errfinish(dummy=<unavailable>) at elog.c:817:3
    frame #3: 0x000055af582ee4d5 postgres`errstart(elevel=20, filename="", lineno=121, funcname="", domain=0x0000000000000000) at elog.c:605:3
    frame #4: 0x000055af57ed6d07 postgres`YBCDestroyVirtualWalForCDC at ybccmds.c:1943:2
    frame #5: 0x000055af580b9acf postgres`WalSndErrorCleanup [inlined] YBCDestroyVirtualWal at yb_virtual_wal_client.c:122:2
    frame #6: 0x000055af580b9aca postgres`WalSndErrorCleanup at walsender.c:313:3
    frame #7: 0x000055af5813e205 postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, dbname=<unavailable>, username=<unavailable>) at postgres.c:5171:4
    frame #8: 0x000055af5807ea40 postgres`BackendRun(port=0x000016feffde4780) at postmaster.c:4736:2
    frame #9: 0x000055af5807dc1d postgres`ServerLoop [inlined] BackendStartup(port=0x000016feffde4780) at postmaster.c:4400:3
    frame #10: 0x000055af5807db7e postgres`ServerLoop at postmaster.c:1778:7
    frame #11: 0x000055af58078d26 postgres`PostmasterMain(argc=25, argv=0x000016feffd0c000) at postmaster.c:1434:11
    frame #12: 0x000055af57f78d0a postgres`PostgresServerProcessMain(argc=25, argv=0x000016feffd0c000) at main.c:234:3
    frame #13: 0x000055af57c2bd12 postgres`main + 34
    frame #14: 0x00007f05067c6d85 libc.so.6`__libc_start_main + 229
    frame #15: 0x000055af57c2bc2e postgres`_start + 46

Source connector version

Postgres connector

Connector configuration

add connector connector_name='ybconnector_cdc_f1ea24_test_cdc_262b7c' stream_id='rs_cdc_f1ea24_b921' db_name='cdc_f1ea24' connector_host='172.151.18.201' table_list=['test_cdc_262b7c']

{'name': 'ybconnector_cdc_f1ea24_test_cdc_262b7c',
 'config': {
   'database.master.addresses': '172.151.28.172:7100,172.151.27.243:7100,172.151.31.13:7100',
   'database.port': 5433,
   'database.masterhost': '172.151.31.13',
   'database.masterport': '7100',
   'database.user': 'yugabyte',
   'database.password': 'yugabyte',
   'database.dbname': 'cdc_f1ea24',
   'snapshot.mode': 'initial',
   'admin.operation.timeout.ms': 600000,
   'socket.read.timeout.ms': 300000,
   'max.connector.retries': '10',
   'operation.timeout.ms': 600000,
   'topic.creation.default.compression.type': 'lz4',
   'topic.creation.default.cleanup.policy': 'delete',
   'topic.creation.default.partitions': 2,
   'topic.creation.default.replication.factor': '1',
   'tasks.max': '5',
   'connector.class': 'io.debezium.connector.postgresql.PostgresConnector',
   'topic.prefix': 'ybconnector_cdc_f1ea24_test_cdc_262b7c',
   'database.hostname': '172.151.31.13',
   'plugin.name': 'pgoutput',
   'slot.name': 'rs_cdc_f1ea24_b921_from_con',
   'publication.autocreate.mode': 'filtered',
   'publication.name': 'pn_ybconnector_cdc_f1ea24_test_cdc_262b7c',
   'table.include.list': 'public.test_cdc_262b7c'}}

YugabyteDB version

No response

Issue Type

kind/bug


shamanthchandra-yb commented 7 months ago

cc: @m-iancu @tverona1 @sushantrmishra

dr0pdb commented 7 months ago

This stack trace is just PG aborting because it received an error from the CDC service.

Two things to do:

  1. Figure out why the CDC service returned an error. This is difficult and will need an active universe.
  2. Update the walsender to emit a warning instead of crashing in this case, since this logic is on the error-handling path. I'll send a fix for it shortly.
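
To illustrate item 2, here is a minimal, self-contained sketch of the idea. It is not the actual fix and does not use PostgreSQL's or YugabyteDB's real APIs. Every name in it (`CdcStatus`, `report_status`, `destroy_virtual_wal`, `error_cleanup`) is hypothetical. It only shows the pattern: on the error-cleanup path, a failed teardown call is reported at warning level so the process keeps running, rather than at a level that aborts it.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical severity levels, loosely modelled on elog-style levels. */
typedef enum { LOG_WARNING, LOG_ERROR } LogLevel;

/* Hypothetical status returned by a call to the CDC service. */
typedef struct {
    bool ok;
    const char *message;
} CdcStatus;

/* Counts warnings so callers (and tests) can observe what happened. */
int warnings_logged = 0;

/* Reports a failed status: LOG_ERROR aborts, LOG_WARNING only logs. */
void report_status(CdcStatus status, LogLevel level)
{
    if (status.ok)
        return;
    fprintf(stderr, "%s: %s\n",
            level == LOG_ERROR ? "ERROR" : "WARNING", status.message);
    if (level == LOG_ERROR)
        abort();
    warnings_logged++;
}

/* Stand-in for the RPC that destroys the virtual WAL (assumption). */
CdcStatus destroy_virtual_wal(bool service_reachable)
{
    CdcStatus status = {
        service_reachable,
        service_reachable ? "" : "CDC service unreachable"
    };
    return status;
}

/*
 * Error-cleanup path: teardown is best effort. A failure here is
 * downgraded to a warning because we are already handling an error.
 * Returns true if the teardown call succeeded.
 */
bool error_cleanup(bool service_reachable)
{
    CdcStatus status = destroy_virtual_wal(service_reachable);
    report_status(status, LOG_WARNING);
    return status.ok;
}
```

With this shape, `error_cleanup(false)` logs one warning and returns `false` instead of terminating the process, which is the behaviour item 2 asks for when the CDC service is unreachable during cleanup.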

dr0pdb commented 7 months ago

For 1, it happened because of the nemesis in the stress test. Specifically, node restarts and tserver crashes left the walsender unable to reach the CDC service.