yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com

[YSQL] flaky test: org.yb.pgsql.TestPgRegressIndex.testPgRegressIndex #16408

Closed bmatican closed 2 weeks ago

bmatican commented 1 year ago

Jira Link: DB-5819

Description

Keeps popping up on per-diff Detective for 2-3 build types.

report


m-iancu commented 1 year ago

Re-assigning to @amitanandaiyer since, based on some of the error messages, this looks to be caused by #17342. Will review the test status after the fix for that lands to confirm (and close or reassign as needed).

jasonyb commented 9 months ago

As of 2023-10-30, the frequently failing tests are yb_pg_indexing (sometimes) and yb_reindex. At the time of the original report, this test was flaky for other reasons (I believe it was transaction-related).

Stack trace of thread 3071708:
#0  0x00007f80a9770acf raise (libc.so.6)
#1  0x00007f80a9743ea5 abort (libc.so.6)
#2  0x0000000000a4344f ExceptionalCondition (postgres)
#3  0x0000000000a4311a YbPgInheritsCacheDelete (postgres)
#4  0x0000000000a4330a YbPgInheritsCacheInvalidate (postgres)
#5  0x0000000000a433ab YbPgInheritsCacheRelCallback (postgres)
#6  0x0000000000a22e5d LocalExecuteInvalidationMessage (postgres)
#7  0x0000000000a21c59 ProcessInvalidationMessages (postgres)
#8  0x0000000000a22537 CommandEndInvalidationMessages (postgres)
#9  0x000000000055c05e AtCCI_LocalCache (postgres)
#10 0x00000000005a0024 deleteOneObject (postgres)
#11 0x00000000005a00dc deleteObjectsInList (postgres)
#12 0x00000000005a02a4 performMultipleDeletions (postgres)
#13 0x00000000006c5604 RemoveRelations (postgres)
#14 0x00000000008f95ef ExecDropStmt (postgres)
#15 0x00000000008fcb98 ProcessUtilitySlow (postgres)
#16 0x00000000008fb41d standard_ProcessUtility (postgres)
#17 0x00000000008fb6db YBProcessUtilityDefaultHook (postgres)
#18 0x00007f80981fb681 pgss_ProcessUtility (pg_stat_statements.so)
#19 0x00007f80ab900540 ybpgm_ProcessUtility (/PATH/TO/REPO/build/fastdebug-gcc11-dynamic-ninja/postgres/lib/yb_pg_metrics.so)
#20 0x00007f80981d4755 pgaudit_NextProcessUtility_hook (pgaudit.so)
#21 0x00007f80981d5e7a pgaudit_ProcessUtility_hook (pgaudit.so)
#22 0x00007f80981c0b07 pg_hint_plan_ProcessUtility (pg_hint_plan.so)
#23 0x0000000000a7e7a6 YBTxnDdlProcessUtility (postgres)
#24 0x00000000008fb71a ProcessUtility (postgres)
#25 0x00000000008f72ec PortalRunUtility (postgres)
#26 0x00000000008f7d6e PortalRunMulti (postgres)
#27 0x00000000008f8bde PortalRun (postgres)
#28 0x00000000008f34eb exec_simple_query (postgres)
#29 0x00000000008f075a yb_exec_query_wrapper_one_attempt (postgres)
#30 0x00000000008f1f74 yb_exec_query_wrapper (postgres)
#31 0x00000000008f1fc8 yb_exec_simple_query (postgres)
#32 0x00000000008f49a7 PostgresMain (postgres)
#33 0x0000000000854a2b BackendRun (postgres)
#34 0x0000000000856eb3 PostmasterMain (postgres)
#35 0x000000000079fb34 PostgresServerProcessMain (postgres)
#36 0x000000000079fb54 main (postgres)
#37 0x00007f80a975cd85 __libc_start_main (libc.so.6)
#38 0x00000000004a88be _start (postgres)

jasonyb commented 8 months ago

The reindex test failure might be an actual regression.

Here is a test case derived from yb_reindex.sql:

CREATE TEMP TABLE tmp (i int PRIMARY KEY, j int);
CREATE INDEX ON tmp (j);
INSERT INTO tmp SELECT g, -g FROM generate_series(1, 10) g;
-- Disable reads/writes to the index.
UPDATE pg_index SET indislive = false, indisready = false, indisvalid = false
    WHERE indexrelid = 'tmp_j_idx'::regclass;
--- Force cache refresh.
SELECT * from pg_yb_catalog_version;
SET yb_non_ddl_txn_for_sys_tables_allowed TO on;
UPDATE pg_yb_catalog_version SET current_version = current_version + 1;
UPDATE pg_yb_catalog_version SET last_breaking_version = current_version;
RESET yb_non_ddl_txn_for_sys_tables_allowed;
SELECT * from pg_yb_catalog_version;

Then, with a gdb breakpoint set (b nodeModifyTable.c:1731), run: UPDATE tmp SET i = 11 WHERE j = -5;

For recent master cd69a84ac0a5bf4293fd26a014aef78bff61e28d,

Thread 1 "postgres" hit Breakpoint 1, ExecUpdate (mtstate=mtstate@entry=0x4a77e4d14a0, tupleid=tupleid@entry=0x7ffc63ccf7ca, oldtuple=oldtuple@entry=0x0, slot=0x4a77d968dd8, 
    planSlot=planSlot@entry=0x4a77e4d1df0, epqstate=epqstate@entry=0x4a77e4d1568, estate=0x4a77e4d0120, canSetTag=true)
    at ../../../../../../src/postgres/src/backend/executor/nodeModifyTable.c:1731
1731                    if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
(gdb) p *tuple
$5 = {t_len = 32, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 15}, t_tableOid = 16460, t_ybctid = 0, t_data = 0x4a77f1fd078}
(gdb) p *tuple->t_data
$6 = {t_choice = {t_heap = {t_xmin = 15, t_xmax = 0, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 15, datum_typmod = 0, datum_typeid = 0}}, t_ctid = {ip_blkid = {
      bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, t_infomask2 = 32770, t_infomask = 10240, t_hoff = 24 '\030', t_bits = 0x4a77f1fd08f ""}

For recent 2.18 7c9798fa546311ed8910928ed3c2f0fb6a0341f0,

Thread 1 "postgres" hit Breakpoint 1, ExecUpdate (mtstate=mtstate@entry=0x24c83f9c86a0, tupleid=tupleid@entry=0x7ffe232ce9ba, oldtuple=oldtuple@entry=0x0, slot=0x24c83f482040, 
    planSlot=planSlot@entry=0x24c83f9c8ff0, epqstate=epqstate@entry=0x24c83f9c8768, estate=0x24c83f9c8120, canSetTag=true)
    at ../../../../../../src/postgres/src/backend/executor/nodeModifyTable.c:1731
1731            if (resultRelInfo->ri_NumIndices > 0 && !HeapTupleIsHeapOnly(tuple))
(gdb) p *tuple
$2 = {t_len = 32, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 11}, t_tableOid = 16386, t_ybctid = 0, t_data = 0x24c83f482968}
(gdb) p *tuple->t_data
$3 = {t_choice = {t_heap = {t_xmin = 4, t_xmax = 0, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 4, datum_typmod = 0, datum_typeid = 0}}, t_ctid = {ip_blkid = {
      bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, t_infomask2 = 2, t_infomask = 10240, t_hoff = 24 '\030', t_bits = 0x24c83f48297f ""}

Notice that t_infomask2 differs: 32770 on master versus 2 on 2.18.

The HeapTupleIsHeapOnly call diverges because of that difference: the extra 32768 (0x8000) in master's value is the HEAP_ONLY_TUPLE bit, so on master execution no longer goes inside the if. I don't see much difference between the two paths here, but I suspect some other area depending on HeapTupleIsHeapOnly does make a difference.
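
To make the bit arithmetic concrete, here is a minimal sketch that decodes the t_infomask2 values printed in the gdb sessions. The flag values are the ones in upstream PostgreSQL's access/htup_details.h. The two helper functions are hypothetical names introduced only for illustration; they apply the same bit test as HeapTupleIsHeapOnly to a raw integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in upstream PostgreSQL's access/htup_details.h. */
#define HEAP_NATTS_MASK   0x07FF  /* low 11 bits: number of attributes */
#define HEAP_KEYS_UPDATED 0x2000
#define HEAP_HOT_UPDATED  0x4000
#define HEAP_ONLY_TUPLE   0x8000  /* the bit HeapTupleIsHeapOnly tests */

static_assert((HEAP_NATTS_MASK & HEAP_ONLY_TUPLE) == 0,
              "attribute count and flag bits must not overlap");

/* Hypothetical helper: the HeapTupleIsHeapOnly bit test on a raw value. */
static bool infomask2_is_heap_only(uint16_t infomask2)
{
    return (infomask2 & HEAP_ONLY_TUPLE) != 0;
}

/* Hypothetical helper: attribute count stored in the low bits. */
static unsigned infomask2_natts(uint16_t infomask2)
{
    return infomask2 & HEAP_NATTS_MASK;
}
```

With these, infomask2_is_heap_only(32770) is true and infomask2_is_heap_only(2) is false, while infomask2_natts returns 2 for both values. That matches the two columns of tmp, so the only difference between the branches is the heap-only flag.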

In upstream PG 15.2, the line moved somewhere else: b heapam_handler.c:339

Breakpoint 1, heapam_tuple_update (relation=0x7f3ddff5ed28, otid=0x7ffce8253e92, slot=0x1d47a60, cid=0, snapshot=<optimized out>, crosscheck=0x0, wait=true, tmfd=0x7ffce8253ef0, lockmode=0x7ffce8253dec, update_indexes=0x7ffce8253de9) at heapam_handler.c:339
warning: Source file is more recent than executable.
339             *update_indexes = result == TM_Ok && !HeapTupleIsHeapOnly(tuple);
(gdb) p *tuple
$3 = {t_len = 32, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 0}, ip_posid = 11}, t_tableOid = 16850, t_data = 0x1d370c0}
(gdb) p *tuple->t_data
$4 = {t_choice = {t_heap = {t_xmin = 902, t_xmax = 0, t_field3 = {t_cid = 0, t_xvac = 0}}, t_datum = {datum_len_ = 902, datum_typmod = 0, datum_typeid = 0}}, t_ctid = {ip_blkid = {bi_hi = 65535, bi_lo = 65535}, ip_posid = 0}, t_infomask2 = 2, t_infomask = 10240, t_hoff = 24 '\030', t_bits = 0x1d370d7 ""}

Notice t_infomask2 is 2, matching 2.18. So something happened in master that likely messed up this field.

jasonyb commented 8 months ago

yb_reindex failure is a catalog_version and cache issue.

Ever since commit 6fec2ecda4240c633d0a3820495cd2f803a3033b, the condition for calling YBCPgSetCatalogCacheVersion has meant that ybc_fdw sets the catalog version in the request, while other scans such as index scan and index only scan (and, later, ybc_remote_scan) do not set the catalog version for system relation requests.

ybcBeginScan:

/*
 * Set the current syscatalog version (will check that we are up to date).
 * Avoid it for syscatalog tables so that we can still use this for
 * refreshing the caches when we are behind.
 * Note: This works because we do not allow modifying schemas (alter/drop)
 * for system catalog tables.
 */
if (!IsSystemRelation(rel))

ybcBeginForeignScan:

/* Set the current syscatalog version (will check that we are up to date) */

Since Andrei's commit removes foreign scan, direct system table reads no longer use foreign scan and instead use yb seq scan. So they don't send the catalog version and don't notice a catalog version mismatch. Here is the key snippet:

SET yb_non_ddl_txn_for_sys_tables_allowed TO on;
UPDATE pg_yb_catalog_version SET current_version = current_version + 1;
UPDATE pg_yb_catalog_version SET last_breaking_version = current_version;
RESET yb_non_ddl_txn_for_sys_tables_allowed;
SELECT distinct(current_version = last_breaking_version) from pg_yb_catalog_version;
-- Show the corruption.
/*+SeqScan(tmp) */
SELECT i FROM tmp WHERE j = -5;
/*+IndexScan(tmp_j_idx) */
SELECT i FROM tmp WHERE j = -5;

Before, the SELECT from pg_yb_catalog_version got a catalog version mismatch, which caused the remaining scans to operate on an up-to-date cache. After, it doesn't notice the mismatch (except for the rare timing where the catalog version propagates through heartbeat fast enough), and the temp table scans don't notice either, since they don't reach out to master/tserver, which is where catalog version mismatch checks happen. Putting a sleep before the cache-dependent select (the last select) causes the issue to go away. Running that select under EXPLAIN instead almost always shows a sequential scan being chosen instead of an index scan because, operating off an old cache, it thinks indislive, indisvalid, and indisready are still false.

One fix in this case is to send the catalog version in system relation requests (it seems to me the justification in the comment is outdated). But if the SELECT from pg_yb_catalog_version had never been there, this would still be a problem even with that fix. If we accept that catalog changes can propagate slowly over heartbeat, then having a command to explicitly clear caches would be nice. We have to be careful about the tserver response cache, though, so this command should either clear both caches or clear just the pg cache and also get the latest catalog version from master. Now that I think about it, a more direct approach may be a command that forces a recheck against master's catalog version, which can hook up with the existing refresh logic. If we do not accept cases where queries are completely local and can get away with not checking the catalog version, then a different, more proper solution is needed (there are other similar cases besides select from temp table; cc: @deeps1991).