[YSQL] Intermittent occurrence of "Used read time is not set" and "Used read time already set to" errors

gauravk-in commented 9 months ago

Description

Two different errors were observed intermittently while running the TAQO framework for performance testing with the PG Parity configuration. They were reproduced on different queries in the test workload, but could not be reproduced manually.

> /*+  Leading ( ( ts3 ts2 ) ) NestLoop(ts3 ts2) */ SELECT ts2.k1, ts2.k2, ts3.v1, ts3.v2 FROM ts2 JOIN ts3 on ts2.k1 = ts3.k1 WHERE ts2.k1 >= 300 AND ts2.k1 < 3100 AND ts3.k1 >= 300 AND ts3.k1 < 3100 GROUP BY ts2.k1, ts2.k2, ts3.v1, ts3.v2

ERROR: INTERNAL ERROR Used read time is not set
CONTEXT:  parallel worker

> /*+  Leading ( ( ts2 ts3 ) ) NestLoop(ts2 ts3) */ SELECT ts2.k1, ts2.k2, ts3.v1, ts3.v2 FROM ts2 JOIN ts3 on ts2.k1 = ts3.k1 WHERE ts2.k1 >= 300 AND ts2.k1 < 3100 AND ts3.k1 >= 300 AND ts3.k1 < 3100 GROUP BY ts2.k1, ts2.k2, ts3.v1, ts3.v2

Used read time already set to { read: { physical: 1707727630782199 } local_limit: { physical: 1707727630782199 } global_limit: { physical: 1707727631282201 } in_txn_limit: <max> serial_no: 0 }. Received new used read time is { read: { physical: 1707727630782250 } local_limit: { physical: 1707727630782250 } global_limit: { physical: 1707727631282252 } in_txn_limit: <max> serial_no: 0 }

The above queries are from the basic workload; the DDL can be found at https://github.com/yugabyte/taqo/blob/main/sql/basic/create.sql
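For reference, the two read times in the second error message can be compared with a short script. This is only a sketch: it assumes the `physical` values are microseconds since the Unix epoch, which the message itself does not state, and the helper name `parse_read_times` is made up for illustration.

```python
import re

# Error text copied from the report above.
ERROR = (
    "Used read time already set to { read: { physical: 1707727630782199 } "
    "local_limit: { physical: 1707727630782199 } "
    "global_limit: { physical: 1707727631282201 } in_txn_limit: <max> serial_no: 0 }. "
    "Received new used read time is { read: { physical: 1707727630782250 } "
    "local_limit: { physical: 1707727630782250 } "
    "global_limit: { physical: 1707727631282252 } in_txn_limit: <max> serial_no: 0 }"
)


def parse_read_times(message):
    """Return one {field: physical value} dict per read time in the message."""
    result = []
    # The message holds the existing read time first, then the newly received one.
    for part in message.split("Received new used read time is"):
        pairs = re.findall(r"(\w+): \{ physical: (\d+) \}", part)
        result.append({name: int(value) for name, value in pairs})
    return result


old, new = parse_read_times(ERROR)

# Assumption: units are microseconds.
drift = new["read"] - old["read"]
window_old = old["global_limit"] - old["read"]
window_new = new["global_limit"] - new["read"]

print("read time drift:", drift)
print("global_limit - read (old, new):", window_old, window_new)
```

Under that assumption the two read times are 51 microseconds apart, and each carries the same `global_limit` offset of 500002 microseconds, which suggests two independently picked read times rather than one corrupted value.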

The flags used are as follows:

# Enable required preview flags
allowed_preview_flags_csv=ysql_ddl_rollback_enabled
# Enable RC + WoC
yb_enable_read_committed_isolation=true

# Enable DDL Atomicity
ysql_ddl_rollback_enabled=true
report_ysql_ddl_txn_status_to_master=true

# Enable pg to tserver shared memory
pg_client_use_shared_memory=true

# Enable tserver catalog request caching
ysql_enable_read_request_caching=true

# GUCs: Enable cost model, BNL, parallel query plans, fetch based limit
ysql_pg_conf_csv="yb_enable_base_scans_cost_model=true,yb_enable_optimizer_statistics=true,yb_bnl_batch_size=1024,yb_parallel_range_rows=10000,yb_fetch_row_limit=0,yb_fetch_size_limit='1MB',yb_use_hash_splitting_by_default=false"
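When trying to reproduce this in a single session, it may help to turn the `ysql_pg_conf_csv` value into individual `SET` statements. The sketch below is an illustration only: it assumes no value contains a comma, that every listed GUC can be set at session level, and the helper names are hypothetical.

```python
# Value copied from the ysql_pg_conf_csv flag above.
PG_CONF_CSV = (
    "yb_enable_base_scans_cost_model=true,yb_enable_optimizer_statistics=true,"
    "yb_bnl_batch_size=1024,yb_parallel_range_rows=10000,yb_fetch_row_limit=0,"
    "yb_fetch_size_limit='1MB',yb_use_hash_splitting_by_default=false"
)


def parse_pg_conf_csv(value):
    """Split a name=value CSV into a dict (assumes no commas inside values)."""
    gucs = {}
    for item in value.split(","):
        name, _, setting = item.partition("=")
        # Drop the optional single quotes around values such as '1MB'.
        gucs[name.strip()] = setting.strip().strip("'")
    return gucs


def to_set_statements(gucs):
    """Render each GUC as a SET statement with a quoted literal value."""
    return [f"SET {name} = '{setting}';" for name, setting in gucs.items()]


gucs = parse_pg_conf_csv(PG_CONF_CSV)
statements = to_set_statements(gucs)
for statement in statements:
    print(statement)
```

The tserver flags listed before `ysql_pg_conf_csv` are not GUCs and still have to be set on the cluster itself.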

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

[X] I confirm this issue does not contain any sensitive information.

mtakahar commented 9 months ago

This error also frequently shows up on RQG runs. cc: @sushantrmishra

andrei-mart commented 9 months ago

The problem is likely #20126. The symptoms match, and one prerequisite is met: the plan has a Gather node in the middle.

co=# /*+  Leading ( ( ts3 ts2 ) ) NestLoop(ts3 ts2) */ explain SELECT ts2.k1, ts2.k2, ts3.v1, ts3.v2 FROM ts2 JOIN ts3 on ts2.k1 = ts3.k1 WHERE ts2.k1 >= 300 AND ts2.k1 < 3100 AND ts3.k1 >= 300 AND ts3.k1 < 3100 GROUP BY ts2.k1, ts2.k2, ts3.v1, ts3.v2;
                                                QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=1238.11..1248.11 rows=1000 width=72)
   Group Key: ts2.k1, ts2.k2, ts3.v1, ts3.v2
   ->  Nested Loop  (cost=9.41..1228.11 rows=1000 width=72)
         Join Filter: (ts2.k1 = ts3.k1)
         ->  Seq Scan on ts3  (cost=4.71..987.47 rows=5 width=40)
               Storage Filter: ((k1 >= 300) AND (k1 < 3100))
         ->  Materialize  (cost=4.71..168.15 rows=1000 width=36)
               ->  Gather  (cost=4.71..163.15 rows=1000 width=36)
                     Workers Planned: 2
                     ->  Parallel Index Scan using ts2_pkey on ts2  (cost=4.71..163.15 rows=417 width=36)
                           Index Cond: ((k1 >= 300) AND (k1 < 3100))
(11 rows)

It is not clear, though, whether there is a dependency on the transaction isolation level. Issue #20126 is fixed, so let's see if the problem occurs again.
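To check other failing queries for the same prerequisite, the text output of `EXPLAIN` can be scanned for a Gather node that is not the plan root. This is a rough sketch based only on the plan layout shown above; the function names are hypothetical, and it has not been tried on plans with InitPlans, SubPlans, or other layouts.

```python
# Plan text copied from the EXPLAIN output above.
PLAN = """\
 HashAggregate  (cost=1238.11..1248.11 rows=1000 width=72)
   Group Key: ts2.k1, ts2.k2, ts3.v1, ts3.v2
   ->  Nested Loop  (cost=9.41..1228.11 rows=1000 width=72)
         Join Filter: (ts2.k1 = ts3.k1)
         ->  Seq Scan on ts3  (cost=4.71..987.47 rows=5 width=40)
               Storage Filter: ((k1 >= 300) AND (k1 < 3100))
         ->  Materialize  (cost=4.71..168.15 rows=1000 width=36)
               ->  Gather  (cost=4.71..163.15 rows=1000 width=36)
                     Workers Planned: 2
                     ->  Parallel Index Scan using ts2_pkey on ts2  (cost=4.71..163.15 rows=417 width=36)
                           Index Cond: ((k1 >= 300) AND (k1 < 3100))
"""


def plan_nodes(plan_text):
    """Return (indent, node name) for each plan node, root first."""
    nodes = []
    for line in plan_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("->"):
            # Child nodes are introduced by an arrow.
            indent = len(line) - len(stripped)
            name = stripped[2:].strip()
        elif not nodes and "(cost=" in stripped:
            # The first line carrying a cost estimate is the root node.
            indent, name = 0, stripped
        else:
            # Detail lines such as "Group Key:" or "Workers Planned:".
            continue
        nodes.append((indent, name.split("  (cost=")[0].strip()))
    return nodes


def has_nested_gather(plan_text):
    """True if a Gather (or Gather Merge) node appears below the plan root."""
    return any(name.startswith("Gather") for _, name in plan_nodes(plan_text)[1:])


print(has_nested_gather(PLAN))
```

A plan whose only Gather is the root node returns `False`, so the check separates the pattern shown above from ordinary top-level parallel plans.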

mtakahar commented 9 months ago

Thanks @andrei-mart. ~~Stopped seeing this error in the RQG runs with the latest master.~~ ~~Update: The problem still occurs in RQG runs, though much less frequently. I'll try to see if the RQG test cases can reproduce it reasonably frequently.~~

Update2: created #21320
