yugabyte / yugabyte-db

[YSQL] SIGSEGV in GetMemoryChunkContext (YBRefreshCache) similar to #6317 #10846

Closed ionthegeek closed 1 year ago

ionthegeek commented 2 years ago

Jira Link: DB-925

Description

This crash is similar to issue #6317, but it was found in version 2.4.4. That version already includes the fix for #6317, so there is likely some other code path where we're not setting attrmiss = NULL.

The postgres process on the affected system crashed after a series of DDLs.

version: Application fingerprint: version 2.4.4.0 build 7 revision 5089f374d19429108c505d1b0fb89ec9e068f965 build_type RELEASE built at 20 May 2021 00:40:26 UTC

Core backtrace:

Core was generated by `postgres: ecperson person_core 10.35.202.8(53242) idle                        '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  FreeTupleDesc (tupdesc=0x2a2fc38) at ../../../../../../../src/postgres/src/backend/access/common/tupdesc.c:340
340                                     if (attrmiss[i].am_present
[Current thread is 1 (LWP 31509)]
(gdb) bt
#0  FreeTupleDesc (tupdesc=0x2a2fc38) at ../../../../../../../src/postgres/src/backend/access/common/tupdesc.c:340
#1  0x00000000009d8c39 in RelationDestroyRelation (relation=0x2a2ff98, remember_tupdesc=<optimized out>) at ../../../../../../../src/postgres/src/backend/utils/cache/relcache.c:2999
#2  0x00000000009da90e in YBPreloadRelCache () at ../../../../../../../src/postgres/src/backend/utils/cache/relcache.c:1453
#3  0x0000000000868ac8 in YBRefreshCache () at ../../../../../../src/postgres/src/backend/tcop/postgres.c:3713
#4  0x000000000086e2ff in YBCheckSharedCatalogCacheVersion () at ../../../../../../src/postgres/src/backend/tcop/postgres.c:3954
#5  PostgresMain (argc=<optimized out>, argv=argv@entry=0x1e40c88, dbname=0x1e40c68 "person_core", username=0x1e40c48 "ecperson") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4905
#6  0x000000000049a82f in BackendRun (port=0x1e982a0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4428
#7  BackendStartup (port=0x1e982a0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:4094
#8  ServerLoop () at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1737
#9  0x00000000007d6921 in PostmasterMain (argc=argc@entry=25, argv=argv@entry=0x1e3f2d0) at ../../../../../../src/postgres/src/backend/postmaster/postmaster.c:1400
#10 0x000000000072288a in PostgresServerProcessMain (argc=25, argv=0x1e3f2d0) at ../../../../../../src/postgres/src/backend/main/main.c:234
#11 0x0000000000722a89 in main ()

Postgres log from this session:

grep "31509" 10.35.222.22_postgresql-2021-12-09_000000.log
I1209 15:25:51.105139 31509 pggate.cc:149] Reset YSQL bind address to 10.35.222.22:5432
I1209 15:25:51.105393 31509 server_base_options.cc:137] Updating master addrs to {10.35.154.13:7100},{10.35.222.19:7100},{10.35.222.21:7100}
I1209 15:25:51.105712 31509 mem_tracker.cc:249] MemTracker: hard memory limit is 53.480705 GB
I1209 15:25:51.105731 31509 mem_tracker.cc:251] MemTracker: soft memory limit is 45.458599 GB
I1209 15:25:51.105782 31509 secure.cc:118] Certs directory: /home/yugabyte/yugabyte-tls-config, name:
I1209 15:25:51.107362 31509 thread_pool.cc:171] Starting thread pool { name: pggate_ybclient queue_limit: 10000 max_workers: 1024 }
2021-12-09 15:26:03.292 UTC [31509] LOG:  duration: 548.327 ms  parse <unnamed>: SELECT id, person_id FROM emails WHERE email_address IN ($1)
I1209 15:26:04.211752 31509 pg_txn_manager.cc:193] Using TServer endpoint: 10.35.222.22:9100
I1209 15:26:04.212568 31509 thread_pool.cc:171] Starting thread pool { name: TransactionManager queue_limit: 500 max_workers: 50 }
2021-12-09 15:42:52.692 UTC [31509] ERROR:  Timed out: Read RPC (request call id 2510) to 10.35.222.19:7100 timed out after 3.000s
2021-12-09 15:42:57.508 UTC [7495] LOG:  server process (PID 31509) was terminated by signal 11: Segmentation fault

The complete Postgres log contains sensitive data, but I can make it available directly. A number of statement timeouts and RPC timeouts were occurring at the time, and the system was also under heavy I/O load. It's not clear whether these were contributing factors.
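
To relate the timeouts to the crash, it can help to put both log formats in the excerpt above (glog-style lines and Postgres-style lines) on one timeline. The following is a minimal sketch, not an official tool: the regular expressions are inferred only from the lines shown in this issue, and parse_line is a hypothetical helper name. glog lines carry no year, so the year is passed in as an assumption.

```python
import re
from datetime import datetime

# Two line formats appear in the excerpt:
#   glog:     I1209 15:25:51.105139 31509 pggate.cc:149] message
#   postgres: 2021-12-09 15:26:03.292 UTC [31509] LOG:  message
GLOG = re.compile(r"^[IWEF](\d{2})(\d{2}) (\d{2}:\d{2}:\d{2}\.\d+) (\d+) ")
PGLOG = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) UTC \[(\d+)\] (\w+):")


def parse_line(line, year=2021):
    """Return (timestamp, pid, kind) for a recognized line, else None.

    The year default is an assumption taken from the log file name.
    """
    m = GLOG.match(line)
    if m:
        month, day, clock, pid = m.groups()
        ts = datetime.strptime(f"{year}-{month}-{day} {clock}", "%Y-%m-%d %H:%M:%S.%f")
        return ts, int(pid), "glog"
    m = PGLOG.match(line)
    if m:
        stamp, pid, level = m.groups()
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
        return ts, int(pid), level
    return None


# The last two lines of the excerpt: the RPC timeout in the backend,
# then the postmaster reporting that the backend died.
lines = [
    "2021-12-09 15:42:52.692 UTC [31509] ERROR:  Timed out: Read RPC (request call id 2510) to 10.35.222.19:7100 timed out after 3.000s",
    "2021-12-09 15:42:57.508 UTC [7495] LOG:  server process (PID 31509) was terminated by signal 11: Segmentation fault",
]
error_ts = parse_line(lines[0])[0]
crash_ts = parse_line(lines[1])[0]
gap = (crash_ts - error_ts).total_seconds()
print(f"{gap:.3f} s between the RPC timeout and the crash report")  # 4.816 s
```

The gap of under five seconds shows only that the two events were close in time; it does not establish that the timeout caused the crash.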

The following DDLs were run just prior to the crash (some sensitive info redacted):

CREATE ROLE readaccess;
GRANT CONNECT ON DATABASE <db1> TO readaccess;
GRANT CONNECT ON DATABASE <db2> TO readaccess;
GRANT CONNECT ON DATABASE <db3> TO readaccess;
GRANT USAGE ON SCHEMA public TO readaccess;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readaccess;
CREATE USER <user> WITH PASSWORD 'pass';
GRANT readaccess TO <user>;
ALTER ROLE <user> SET statement_timeout = 0;

Additional core info: the tupdesc->constr structure itself is readable, but its check and missing pointers are invalid, and num_check is an implausible 30309:

(gdb) pgprint tupdesc->constr
not a node type
running experimental dump...
TupleConstr [defval=0x7f7d4835efc8 check=0x6c6c6f637261763a missing=0x61763a2030206469 num_defval=2 num_check=30309 has_not_null=true]
(gdb) pgprint tupdesc->constr->check
not a node type
running experimental dump...
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x6c6c6f637261763a:
Error occurred in Python: Cannot access memory at address 0x6c6c6f637261763a
(gdb) pgprint tupdesc->constr->missing
not a node type
running experimental dump...
Python Exception <class 'gdb.MemoryError'> Cannot access memory at address 0x61763a2030206469:
Error occurred in Python: Cannot access memory at address 0x61763a2030206469
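
The two bad pointer values in the dump above consist entirely of printable ASCII bytes, which is often a sign that the memory was overwritten with string data. The following is a minimal sketch for decoding them, assuming a little-endian (x86-64) host; pointer_as_ascii is a hypothetical helper, not part of gdb or pgprint.

```python
def pointer_as_ascii(value, width=8):
    """Render an integer as the bytes it occupies in little-endian memory.

    Non-printable bytes are shown as '.'.
    """
    raw = value.to_bytes(width, byteorder="little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


# Values copied from the pgprint output of tupdesc->constr.
check_ptr = 0x6c6c6f637261763a
missing_ptr = 0x61763a2030206469
num_check = 30309  # printed as an integer; decoded here as a 2-byte value (assumption)

print(repr(pointer_as_ascii(check_ptr)))             # ':varcoll'
print(repr(pointer_as_ascii(missing_ptr)))           # 'id 0 :va'
print(repr(pointer_as_ascii(num_check, width=2)))    # 'ev'
```

Read in the order pgprint lists the fields, check and missing spell ":varcollid 0 :va", which resembles the text form of a serialized Postgres node (":varcollid" is a field label in a Var node's output). This is only an observation from the dump: it suggests the TupleConstr memory may have been freed and reused for a string, but it does not confirm the root cause.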

sushantrmishra commented 1 year ago

I was not able to reproduce this issue with the same DDL sequence:

localhost:5433 yugabyte@yugabyte=# CREATE ROLE readaccess;
CREATE ROLE
Time: 109.123 ms
localhost:5433 yugabyte@yugabyte=# GRANT CONNECT ON DATABASE db1 TO readaccess;
GRANT
Time: 67.200 ms
localhost:5433 yugabyte@yugabyte=# GRANT CONNECT ON DATABASE db2 TO readaccess;
GRANT
Time: 72.415 ms
localhost:5433 yugabyte@yugabyte=# GRANT CONNECT ON DATABASE db3 TO readaccess;
GRANT
Time: 108.266 ms
localhost:5433 yugabyte@yugabyte=# GRANT USAGE ON SCHEMA public TO readaccess;
GRANT
Time: 66.294 ms
localhost:5433 yugabyte@yugabyte=# GRANT SELECT ON ALL TABLES IN SCHEMA public TO readaccess;
GRANT
Time: 388.310 ms
localhost:5433 yugabyte@yugabyte=# CREATE USER test WITH PASSWORD 'pass';
CREATE ROLE
Time: 83.435 ms
localhost:5433 yugabyte@yugabyte=# GRANT readaccess TO test;
GRANT ROLE
Time: 83.039 ms
localhost:5433 yugabyte@yugabyte=# ALTER ROLE test set statement_timeout = 0;
ALTER ROLE
Time: 94.247 ms