Open · def- opened 2 years ago
[dfelsing@dev-server-dfelsing ~]$ ps aux|grep postgres
dfelsing 3418204 0.0 0.1 267180 43264 ? S Aug24 0:02 /nfusr/dev-server/dfelsing/code/yugabyte-db/build/release-clang13-linuxbrew-dynamic-ninja/postgres/bin/postgres -D /nfusr/dev-server/dfelsing/yugabyte-data/node-3/disk-1/pg_data -p 5433 -h 127.0.0.3 -k /tmp/.yb.896641872422930295 -c unix_socket_permissions=0700 -c logging_collector=on -c log_directory=/nfusr/dev-server/dfelsing/yugabyte-data/node-3/disk-1/yb-data/tserver/logs -c yb_pg_metrics.node_name=dev-server-dfelsing:9000 -c yb_pg_metrics.port=13000 -c config_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-3/disk-1/pg_data/ysql_pg.conf -c hba_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-3/disk-1/pg_data/ysql_hba.conf
dfelsing 3418234 0.0 0.1 267180 42632 ? S Aug24 0:02 /nfusr/dev-server/dfelsing/code/yugabyte-db/build/release-clang13-linuxbrew-dynamic-ninja/postgres/bin/postgres -D /nfusr/dev-server/dfelsing/yugabyte-data/node-2/disk-1/pg_data -p 5433 -h 127.0.0.2 -k /tmp/.yb.1895829667009633833 -c unix_socket_permissions=0700 -c logging_collector=on -c log_directory=/nfusr/dev-server/dfelsing/yugabyte-data/node-2/disk-1/yb-data/tserver/logs -c yb_pg_metrics.node_name=dev-server-dfelsing:9000 -c yb_pg_metrics.port=13000 -c config_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-2/disk-1/pg_data/ysql_pg.conf -c hba_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-2/disk-1/pg_data/ysql_hba.conf
dfelsing 3418264 0.0 0.1 267180 43992 ? S Aug24 0:02 /nfusr/dev-server/dfelsing/code/yugabyte-db/build/release-clang13-linuxbrew-dynamic-ninja/postgres/bin/postgres -D /nfusr/dev-server/dfelsing/yugabyte-data/node-1/disk-1/pg_data -p 5433 -h 127.0.0.1 -k /tmp/.yb.9427142463175692966 -c unix_socket_permissions=0700 -c logging_collector=on -c log_directory=/nfusr/dev-server/dfelsing/yugabyte-data/node-1/disk-1/yb-data/tserver/logs -c yb_pg_metrics.node_name=dev-server-dfelsing:9000 -c yb_pg_metrics.port=13000 -c config_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-1/disk-1/pg_data/ysql_pg.conf -c hba_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-1/disk-1/pg_data/ysql_hba.conf
dfelsing 3418294 0.3 0.0 122584 13756 ? Ss Aug24 3:24 postgres: logger
dfelsing 3418296 0.0 0.0 275376 20972 ? Ssl Aug24 0:04 postgres: YSQL webserver
dfelsing 3418298 0.0 0.0 267180 14952 ? Ss Aug24 0:00 postgres: checkpointer
dfelsing 3418299 0.0 0.0 124700 13864 ? Ss Aug24 0:04 postgres: stats collector
dfelsing 3418300 0.4 0.0 122584 13764 ? Ss Aug24 4:20 postgres: logger
dfelsing 3418302 0.0 0.0 275376 20912 ? Ssl Aug24 0:05 postgres: YSQL webserver
dfelsing 3418304 0.3 0.0 122584 14128 ? Ss Aug24 3:29 postgres: logger
dfelsing 3418305 0.0 0.0 267180 14952 ? Ss Aug24 0:00 postgres: checkpointer
dfelsing 3418306 0.0 0.0 124700 13728 ? Ss Aug24 0:05 postgres: stats collector
dfelsing 3418308 0.0 0.0 275376 21660 ? Ssl Aug24 0:04 postgres: YSQL webserver
dfelsing 3418310 0.0 0.0 267180 15764 ? Ss Aug24 0:00 postgres: checkpointer
dfelsing 3418311 0.0 0.0 124700 14092 ? Ss Aug24 0:05 postgres: stats collector
dfelsing 3590709 0.9 0.3 390712 95552 ? Ssl 07:37 0:45 postgres: yugabyte db_lst_719764 127.0.0.1(33030) idle
dfelsing 3590713 6.5 0.8 552504 249820 ? Ssl 07:37 5:27 postgres: yugabyte db_lst_719764 127.0.0.1(33036) idle
dfelsing 3590716 1.1 0.3 401976 99892 ? Ssl 07:37 0:57 postgres: yugabyte db_lst_719764 127.0.0.1(50036) CREATE INDEX
dfelsing 3590721 1.5 0.3 394808 100472 ? Ssl 07:37 1:15 postgres: yugabyte db_lst_719764 127.0.0.1(42778) idle
dfelsing 3590738 1.0 0.3 390712 94476 ? Ssl 07:37 0:52 postgres: yugabyte db_lst_719764 127.0.0.1(33060) idle
dfelsing 3590739 6.7 0.7 532664 225832 ? Ssl 07:37 5:34 postgres: yugabyte db_lst_719764 127.0.0.1(50060) idle in transaction
dfelsing 3590742 5.9 0.7 533328 231976 ? Ssl 07:37 4:55 postgres: yugabyte db_lst_719764 127.0.0.1(33066) idle
dfelsing 3613572 0.3 1.0 655992 334204 ? Ssl 08:17 0:07 postgres: postgres db_lst_719764 [local] BACKFILL INDEX
dfelsing 3613666 0.1 0.3 431056 120532 ? Ssl 08:18 0:04 postgres: postgres db_lst_719764 [local] BACKFILL INDEX
dfelsing 3613791 0.1 0.3 429008 116640 ? Ssl 08:19 0:04 postgres: postgres db_lst_719764 [local] BACKFILL INDEX
dfelsing 3613875 0.1 0.3 426960 115028 ? Ssl 08:20 0:04 postgres: postgres db_lst_719764 [local] BACKFILL INDEX
dfelsing 3616910 0.0 0.0 12140 1164 pts/2 S+ 09:00 0:00 grep --color=auto postgres
In the postgres logs there are some occurrences of catalog cache misses:
2022-08-25 08:17:03.335 UTC [3590716] LOG: Catalog cache miss on cache with id 50:
Target rel: pg_class (oid : 1259), index oid 2663
Search keys: idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_num, 2200
@ 0x9b5748 errmsg
@ 0x98beb2 SearchCatCacheMiss
@ 0x98ae85 SearchCatCacheInternal
@ 0x9adf69 GetSysCacheOid
@ 0x582e53 index_create
@ 0x648ef7 DefineIndex
@ 0x85a938 ProcessUtilitySlow
@ 0x858ec0 standard_ProcessUtility
@ 0x8584ae YBProcessUtilityDefaultHook
@ 0x7f953e09fe70 pgss_ProcessUtility
@ 0x7f953e096ad7 ybpgm_ProcessUtility
@ 0x7f953e07a72e pg_hint_plan_ProcessUtility
@ 0x9e4b07 YBTxnDdlProcessUtility
@ 0x8580fa PortalRunUtility
@ 0x8577e7 PortalRunMulti
@ 0x8570a6 PortalRun
@ 0x854826 yb_exec_simple_query_impl
@ 0x854dc6 yb_exec_query_wrapper_one_attempt
@ 0x85193b PostgresMain
@ 0x7bda3c BackendRun
@ 0x7bce90 ServerLoop
@ 0x7b944b PostmasterMain
@ 0x719c1d PostgresServerProcessMain
@ 0x71a0c2 main
@ 0x7f954293a825 __libc_start_main
@ 0x4c6ed9 _start
2022-08-25 08:17:03.335 UTC [3590716] STATEMENT: CREATE INDEX CONCURRENTLY idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_numeric ON tg0_0 USING lsm (c0_varchar DESC NULLS LAST, c22_int ASC NULLS LAST, c15_date ASC NULLS FIRST, c1_smallint ASC NULLS LAST, c2_int ASC NULLS LAST, c3_numeric DESC NULLS LAST);
Tserver logs around this time:
I0825 08:16:59.135116 3445811 libpq_utils.cc:207] Connected to PG (host=/tmp/.yb.9427142463175692966 port=5433 dbname='db_lst_719764' connect_timeout=58 user='postgres' password=<REDACTED>), time taken: 2.400s
[...]
I0825 08:17:03.362573 3418004 pg_create_table.cc:90] PgCreateTable: creating a transactional index: db_lst_719764.idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_num/{ database_oid: 20034 object_oid: 20522 }
I0825 08:17:03.478680 3424869 tablet.cc:1965] T 27ca2aad50c945f5be8bdf65cdc63f54 P b544631ca71c4083b396dd88b3535098: Alter schema from Schema [
0:ybrowid[binary NOT NULL NOT A PARTITION KEY],
1:c0_varchar[string NULLABLE NOT A PARTITION KEY],
2:c1_smallint[int16 NULLABLE NOT A PARTITION KEY],
3:c2_int[int32 NULLABLE NOT A PARTITION KEY],
4:c3_numeric[decimal NULLABLE NOT A PARTITION KEY],
5:c4_daterange[binary NULLABLE NOT A PARTITION KEY],
6:c5_date[int32 NULLABLE NOT A PARTITION KEY],
7:c6_varchar[string NULLABLE NOT A PARTITION KEY],
8:c7_int8range[binary NULLABLE NOT A PARTITION KEY],
9:c8_daterange[binary NULLABLE NOT A PARTITION KEY],
10:c9_int4range[binary NULLABLE NOT A PARTITION KEY],
11:c10_int[int32 NULLABLE NOT A PARTITION KEY],
12:c11_json[binary NULLABLE NOT A PARTITION KEY],
13:c12_float4[float NULLABLE NOT A PARTITION KEY],
14:c13_jsonb[binary NULLABLE NOT A PARTITION KEY],
15:c14_smallint[int16 NULLABLE NOT A PARTITION KEY],
16:c15_date[int32 NULLABLE NOT A PARTITION KEY],
17:c16_date[int32 NULLABLE NOT A PARTITION KEY],
18:c17_int4range[binary NULLABLE NOT A PARTITION KEY],
19:c18_int4range[binary NULLABLE NOT A PARTITION KEY],
20:c19_float4[float NULLABLE NOT A PARTITION KEY],
21:c20_numeric[decimal NULLABLE NOT A PARTITION KEY],
22:c21_int8range[binary NULLABLE NOT A PARTITION KEY],
23:c22_int[int32 NULLABLE NOT A PARTITION KEY],
24:c23_numrange[binary NULLABLE NOT A PARTITION KEY],
25:c24_smallint[int16 NULLABLE NOT A PARTITION KEY],
26:c25_varchar[string NULLABLE NOT A PARTITION KEY],
27:c26_date[int32 NULLABLE NOT A PARTITION KEY],
28:c27_date[int32 NULLABLE NOT A PARTITION KEY]
]
properties: contain_counters: false is_transactional: true consistency_level: STRONG use_mangled_column_name: false is_ysql_catalog_table: false retain_delete_markers: false partitioning_version: 1
colocation_id: 814432576 version 122 to Schema [
0:ybrowid[binary NOT NULL NOT A PARTITION KEY],
1:c0_varchar[string NULLABLE NOT A PARTITION KEY],
2:c1_smallint[int16 NULLABLE NOT A PARTITION KEY],
3:c2_int[int32 NULLABLE NOT A PARTITION KEY],
4:c3_numeric[decimal NULLABLE NOT A PARTITION KEY],
5:c4_daterange[binary NULLABLE NOT A PARTITION KEY],
6:c5_date[int32 NULLABLE NOT A PARTITION KEY],
7:c6_varchar[string NULLABLE NOT A PARTITION KEY],
8:c7_int8range[binary NULLABLE NOT A PARTITION KEY],
9:c8_daterange[binary NULLABLE NOT A PARTITION KEY],
10:c9_int4range[binary NULLABLE NOT A PARTITION KEY],
11:c10_int[int32 NULLABLE NOT A PARTITION KEY],
12:c11_json[binary NULLABLE NOT A PARTITION KEY],
13:c12_float4[float NULLABLE NOT A PARTITION KEY],
14:c13_jsonb[binary NULLABLE NOT A PARTITION KEY],
15:c14_smallint[int16 NULLABLE NOT A PARTITION KEY],
16:c15_date[int32 NULLABLE NOT A PARTITION KEY],
17:c16_date[int32 NULLABLE NOT A PARTITION KEY],
18:c17_int4range[binary NULLABLE NOT A PARTITION KEY],
19:c18_int4range[binary NULLABLE NOT A PARTITION KEY],
20:c19_float4[float NULLABLE NOT A PARTITION KEY],
21:c20_numeric[decimal NULLABLE NOT A PARTITION KEY],
22:c21_int8range[binary NULLABLE NOT A PARTITION KEY],
23:c22_int[int32 NULLABLE NOT A PARTITION KEY],
24:c23_numrange[binary NULLABLE NOT A PARTITION KEY],
25:c24_smallint[int16 NULLABLE NOT A PARTITION KEY],
26:c25_varchar[string NULLABLE NOT A PARTITION KEY],
27:c26_date[int32 NULLABLE NOT A PARTITION KEY],
28:c27_date[int32 NULLABLE NOT A PARTITION KEY]
]
properties: contain_counters: false is_transactional: true consistency_level: STRONG use_mangled_column_name: false is_ysql_catalog_table: false retain_delete_markers: false partitioning_version: 1
colocation_id: 814432576 version 123
I0825 08:17:03.479979 3424869 meta_data_cache.cc:77] Creating a metadata cache without a permissions cache
I0825 08:17:03.636530 3418004 table_creator.cc:363] Created index db_lst_719764.idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_num of type PGSQL_TABLE_TYPE
I0825 08:17:06.251971 3420376 tablet.cc:2175] Begin BackfillIndexesForYsql at { physical: 1661415426219233 } from <start-of-the-tablet> for [table_id: "00004e4200003000800000000000502a" version: 0 is_local: false hash_column_count: 0 range_column_count: 0 is_unique: false indexed_table_id: "00004e42000030008000000000004e82" use_mangled_column_name: false index_permissions: INDEX_PERM_WRITE_AND_DELETE backfill_error_message: ""]
I0825 08:17:08.091812 3420376 libpq_utils.cc:207] Connected to PG (host=/tmp/.yb.9427142463175692966 port=5433 dbname='db_lst_719764' connect_timeout=59 user='postgres' password=<REDACTED>), time taken: 1.840s
According to Dennis, the hang has happened more than once in the past week. This means that, even without reduced max_connections, the hang can still occur.
I think the `CREATE INDEX CONCURRENTLY idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_numeric` was followed by `BACKFILL INDEX 20522 WITH`, meaning the backfill had been running.
cc @jason-yb
Dennis: it would help if you could detect the hanging LST (by looking at the `query_start` column of `pg_stat_activity`) and capture `/threadz?group=all` from the tservers.
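The suggestion above amounts to scanning `pg_stat_activity` for active backends whose `query_start` is far in the past. A minimal sketch of that check in Python (the helper and the sample rows are hypothetical, with pids modeled on the captures in this thread; in practice the rows would be fetched from the cluster):

```python
from datetime import datetime, timedelta, timezone

def find_hung_backends(rows, now, threshold=timedelta(minutes=30)):
    """Return pids of active backends whose current query started more than
    `threshold` ago. Idle backends are skipped: for them query_start refers
    to the last finished query, not a running one."""
    return [pid for pid, state, query_start in rows
            if state == "active"
            and query_start is not None
            and now - query_start > threshold]

# Sample rows shaped like (pid, state, query_start); values are illustrative.
now = datetime(2022, 9, 2, 16, 10, tzinfo=timezone.utc)
rows = [
    (1757651, "active", datetime(2022, 9, 2, 14, 58, 34, tzinfo=timezone.utc)),
    (1739026, "idle",   datetime(2022, 9, 2, 14, 58, 10, tzinfo=timezone.utc)),
    (1762894, "active", datetime(2022, 9, 2, 16, 9, 55, tzinfo=timezone.utc)),
]
print(find_hung_backends(rows, now))  # -> [1757651]
```

A backend stuck on `BACKFILL INDEX` for over an hour would be flagged, while short-lived queries and idle connections would not.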
Got it again. After 30 mins of hanging:
yugabyte=# SELECT * FROM pg_stat_activity;
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type
-------+---------------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+------------------------------------------------------------------------------------------+----------------
13288 | yugabyte | 1762894 | 13287 | yugabyte | ysqlsh | 127.0.0.1 | | 37658 | 2022-09-02 16:10:10.266513+00 | 2022-09-02 16:10:54.14204+00 | 2022-09-02 16:10:54.14204+00 | 2022-09-02 16:10:54.142131+00 | | | active | | 3 | SELECT * FROM pg_stat_activity; | client backend
20281 | db_lst_850715 | 1739026 | 13287 | yugabyte | lst_850715_i | 127.0.0.1 | | 37548 | 2022-09-02 14:38:51.910724+00 | | 2022-09-02 14:58:10.409321+00 | 2022-09-02 14:58:10.464552+00 | Client | ClientRead | idle | | | ROLLBACK | client backend
20281 | db_lst_850715 | 1739051 | 13287 | yugabyte | lst_850715_i | 127.0.0.1 | | 37572 | 2022-09-02 14:38:51.928679+00 | | 2022-09-02 14:58:18.211339+00 | 2022-09-02 14:58:18.229574+00 | Client | ClientRead | idle | | | ROLLBACK | client backend
20281 | db_lst_850715 | 1739056 | 13287 | yugabyte | lst_850715_i | 127.0.0.1 | | 37578 | 2022-09-02 14:38:51.924037+00 | | 2022-09-02 15:23:55.498421+00 | 2022-09-02 15:23:55.525019+00 | Client | ClientRead | idle | | | ROLLBACK | client backend
20281 | db_lst_850715 | 1757651 | 10 | postgres | | | | -1 | 2022-09-02 14:58:33.464523+00 | 2022-09-02 14:58:34.800631+00 | 2022-09-02 14:58:34.800879+00 | 2022-09-02 16:10:53.240109+00 | | | active | | 3 | BACKFILL INDEX 19700 WITH x'0880011a00' READ TIME 6808087401954246656 PARTITION x'aaa8'; | client backend
20281 | db_lst_850715 | 1757788 | 10 | postgres | | | | -1 | 2022-09-02 14:59:33.49601+00 | 2022-09-02 14:59:33.650278+00 | 2022-09-02 14:59:33.673502+00 | 2022-09-02 16:10:53.174117+00 | | | active | | | BACKFILL INDEX 19700 WITH x'0880011a00' READ TIME 6808087401954246656 PARTITION x'aaa8'; | client backend
20281 | db_lst_850715 | 1757898 | 10 | postgres | | | | -1 | 2022-09-02 15:00:33.55396+00 | 2022-09-02 15:00:33.626707+00 | 2022-09-02 15:00:33.642816+00 | 2022-09-02 16:10:53.639117+00 | | | active | | | BACKFILL INDEX 19700 WITH x'0880011a00' READ TIME 6808087401954246656 PARTITION x'aaa8'; | client backend
20281 | db_lst_850715 | 1758006 | 10 | postgres | | | | -1 | 2022-09-02 15:01:33.630768+00 | 2022-09-02 15:01:33.701976+00 | 2022-09-02 15:01:33.720206+00 | 2022-09-02 16:10:54.043188+00 | | | active | | | BACKFILL INDEX 19700 WITH x'0880011a00' READ TIME 6808087401954246656 PARTITION x'aaa8'; | client backend
| | 1617073 | | | | | | | 2022-09-02 10:42:06.998491+00 | | | | Activity | CheckpointerMain | | | | | checkpointer
(9 rows)
This time it's `BACKFILL INDEX` and `ROLLBACK`, so the common element appears to be the index backfill. @tedyu The requested captures are attached: threadz.zip
In threadz.2.html, I notice the following:
<tr><td>rpc_tp_TabletServer_29-1617254</td><td>21.120s</td><td>0.000s</td><td>0.000s</td><td rowspan="1"><pre> @ 0x7fb396ea711f (unknown)
@ 0x7fb39722046c __GI___nanosleep
@ 0x7fb3977e9dda std::__1::this_thread::sleep_for()
@ 0x7fb39bbe49d8 yb::GenericBackoffWaiter<>::Wait()
@ 0x7fb39a85ffc1 yb::client::RetryFunc()
@ 0x7fb39a865c50 yb::client::YBClient::Data::WaitForBackfillIndexToFinish()
@ 0x7fb39a865976 yb::client::YBClient::Data::BackfillIndex()
@ 0x7fb39a8435b0 yb::client::YBClient::BackfillIndex()
@ 0x7fb39c17e7f4 yb::tserver::PgClientSession::BackfillIndex()
@ 0x7fb39c178850 yb::tserver::PgClientServiceImpl::Impl::BackfillIndex()
@ 0x7fb39c176748 yb::tserver::PgClientServiceImpl::BackfillIndex()
@ 0x7fb39b8ce5b9 std::__1::__function::__func<>::operator()()
@ 0x7fb39b8ccf03 yb::tserver::PgClientServiceIf::Handle()
@ 0x7fb398fc8829 yb::rpc::ServicePoolImpl::Handle()
@ 0x7fb398f726e2 yb::rpc::InboundCall::InboundCallTask::Run()
@ 0x7fb398fd47df yb::rpc::(anonymous namespace)::Worker::Execute()
Total number of threads: 1</pre></td></tr>
<tr><td>rpc_tp_TabletServer_55-1617995</td><td>19.450s</td><td>0.000s</td><td>0.000s</td><td rowspan="4"><pre> @ 0x7fb396ea711f (unknown)
@ 0x7fb396f515cc __GI___poll
@ 0x7fb39bc3ecf1 pqSocketCheck
@ 0x7fb39bc3eaeb pqWait
@ 0x7fb39bc3ad80 PQgetResult
@ 0x7fb39bc3b395 PQexecFinish
@ 0x7fb39c4ba390 yb::pgwrapper::PGConn::Fetch()
@ 0x7fb39be89263 yb::tablet::Tablet::BackfillIndexesForYsql()
@ 0x7fb39c1dd8f7 yb::tserver::TabletServiceAdminImpl::BackfillIndex()
In the other two files, I saw `BackfillIndex()` as well, which means index backfilling got stuck across the tservers. It hangs in `PQexecFinish()`:

```c
while ((result = PQgetResult(conn)) != NULL)
```

Can we get the tserver logs corresponding to threadz.2.html?
Thanks
I don't have them anymore, but it happened again ;) New try:
[deen@devp yugabyte-db]$ bin/ysqlsh -h 127.0.0.1
yugabyte=# SELECT * FROM pg_stat_activity;
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type
-------+----------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+---------------------------------+----------------
13288 | yugabyte | 2838317 | 13287 | yugabyte | ysqlsh | 127.0.0.1 | | 60720 | 2022-09-02 16:49:04.682505+00 | 2022-09-02 16:49:12.483181+00 | 2022-09-02 16:49:12.483181+00 | 2022-09-02 16:49:12.483248+00 | | | active | | 3 | SELECT * FROM pg_stat_activity; | client backend
| | 2602430 | | | | | | | 2022-09-02 07:55:42.438895+00 | | | | Activity | CheckpointerMain | | | | | checkpointer
(2 rows)
[deen@devp yugabyte-db]$ bin/ysqlsh -h 127.0.0.2
ysqlsh (11.2-YB-2.15.3.0-b0)
Type "help" for help.
yugabyte=# SELECT * FROM pg_stat_activity;
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type
-------+----------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+------------------------------+------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+---------------------------------+----------------
13288 | yugabyte | 2838356 | 13287 | yugabyte | ysqlsh | 127.0.0.1 | | 49566 | 2022-09-02 16:49:46.047175+00 | 2022-09-02 16:49:47.76328+00 | 2022-09-02 16:49:47.76328+00 | 2022-09-02 16:49:47.763399+00 | | | active | | 3 | SELECT * FROM pg_stat_activity; | client backend
| | 2602444 | | | | | | | 2022-09-02 07:55:42.457823+00 | | | | Activity | CheckpointerMain | | | | | checkpointer
(2 rows)
[deen@devp yugabyte-db]$ bin/ysqlsh -h 127.0.0.3
ysqlsh (11.2-YB-2.15.3.0-b0)
Type "help" for help.
yugabyte=# SELECT * FROM pg_stat_activity;
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query
| backend_type
-------+----------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------
22661 | | 2792520 | 13287 | yugabyte | lst_626892_i | 127.0.0.1 | | 44708 | 2022-09-02 15:34:09.109658+00 | 2022-09-02 16:34:11.269827+00 | 2022-09-02 16:34:11.269827+00 | 2022-09-02 16:34:11.269911+00 | | | active | | | CREATE INDEX CONCURRENTLY idx_t0_c6_int_c4_date_c15_bigint_c20_float4_c34_boolean_c32_float8_c7_numeric_c27_int_c31_varchar_c19_date_c22_float4 ON t0 USING btree (c6_int ASC NULLS FIRST, c4_date DESC NULLS FIRST, c15_bigint DESC NULLS LAST, c20_float4 DESC NULLS LAST, c34_boolean A
SC NULLS FIRST, c32_float8 ASC NULLS FIRST, c7_numeric ASC NULLS LAST, c27_int ASC NULLS LAST, c31_varchar DESC NULLS LAST, c19_date DESC NULLS FIRST, c22_float4 ASC NULLS FIRST) WHERE FALSE; | client backend
13288 | yugabyte | 2838388 | 13287 | yugabyte | ysqlsh | 127.0.0.1 | | 44736 | 2022-09-02 16:49:55.200908+00 | 2022-09-02 16:49:57.227399+00 | 2022-09-02 16:49:57.227399+00 | 2022-09-02 16:49:57.227518+00 | | | active | | 3 | SELECT * FROM pg_stat_activity;
| client backend
| | 2602450 | | | | | | | 2022-09-02 07:55:42.475315+00 | | | | Activity | CheckpointerMain | | | |
| checkpointer
(3 rows)
From threadz.3.html:
<tr><td>rpc_tp_TabletServer_26-2602745</td><td>20.950s</td><td>0.000s</td><td>0.000s</td><td rowspan="1"><pre> @ 0x7f59bcac911f (unknown)
@ 0x7f59bce4246c __GI___nanosleep
@ 0x7f59bd40bdda std::__1::this_thread::sleep_for()
@ 0x7f59c1801b58 yb::GenericBackoffWaiter<>::Wait()
@ 0x7f59c047e101 yb::client::RetryFunc()
@ 0x7f59c0483d90 yb::client::YBClient::Data::WaitForBackfillIndexToFinish()
@ 0x7f59c0483ab6 yb::client::YBClient::Data::BackfillIndex()
@ 0x7f59c04616f0 yb::client::YBClient::BackfillIndex()
@ 0x7f59c1d9b954 yb::tserver::PgClientSession::BackfillIndex()
@ 0x7f59c1d959b0 yb::tserver::PgClientServiceImpl::Impl::BackfillIndex()
@ 0x7f59c1d938a8 yb::tserver::PgClientServiceImpl::BackfillIndex()
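For context, the `RetryFunc()` / `GenericBackoffWaiter<>::Wait()` frames in these dumps are a generic retry-with-backoff loop around "is the backfill finished yet?". A rough Python sketch of that pattern (names, delays, and structure are hypothetical, not YugabyteDB's actual implementation):

```python
def retry_until(check_done, deadline, now, sleep,
                initial_delay=0.1, max_delay=5.0):
    """Poll check_done() with capped exponential backoff until it returns
    True or now() passes the deadline. The clock and sleep functions are
    injected so the loop is easy to reason about (and to test)."""
    delay = initial_delay
    while True:
        if check_done():
            return True      # e.g. the master reports the backfill finished
        if now() >= deadline:
            return False     # caller surfaces the timeout
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

If `check_done` never becomes true and the deadline is effectively unbounded, a loop like this spends all its time sleeping, which matches the `nanosleep` frames at the top of these stacks.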
Please get server 3's logs.
Too large to upload here, accessible from within Yugabyte org: https://drive.google.com/file/d/1z-zPXJxQZku0zcWIO2PhnRqbwCcEx1jX/view?usp=sharing
cc @jason-yb
I took a look at yb-tserver.devp.deen.log.INFO.20220902-075533.2601900 but haven't found which backfill request hung.
Looking at `YBClient::Data::BackfillIndex`, there is only a log line for the completion of the backfill:

```cpp
LOG(INFO) << "Backfilled index " << req.index_identifier().ShortDebugString();
```

There is no log line for the start of the backfill.
The client side of the backfill showed `YBClient::Data::WaitForBackfillIndexToFinish()` in the thread dump.
We would need the master log to get a better idea of what might have gone wrong.
From yb-tserver.devp.deen.log.INFO.20220902-075533.2601900:
I0902 07:56:36.254930 2602785 tablet.cc:2174] Begin BackfillIndexesForYsql at { physical: 1662105396035999 } from <start-of-the-tablet> for [table_id: "00004000000030008000000000004105" version: 0 is_local: false hash_column_count: 0 range_column_count: 0 is_unique: false indexed_table_id: "0000400000003000800000000000402a" use_mangled_column_name: false index_permissions: INDEX_PERM_WRITE_AND_DELETE backfill_error_message: ""]
I0902 07:56:36.255028 2602785 tablet.cc:2097] Using grace margin of 1000ms, original deadline: 59.787s
I0902 07:56:36.390646 2602765 tablet.cc:2174] Begin BackfillIndexesForYsql at { physical: 1662105396035999 } from <start-of-the-tablet> for [table_id: "00004000000030008000000000004105" version: 0 is_local: false hash_column_count: 0 range_column_count: 0 is_unique: false indexed_table_id: "0000400000003000800000000000402a" use_mangled_column_name: false index_permissions: INDEX_PERM_WRITE_AND_DELETE backfill_error_message: ""]
Not sure why the same start message was logged twice.
I'll just upload the full yugabyte-data directory: https://drive.google.com/file/d/1aBOkQ-NSfJ565nHUPYP8jIA8Pmn_3hhP/view?usp=sharing
Happened again on an RF1:
yugabyte=# SELECT * FROM pg_stat_activity;
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type
-------+---------------+--------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+---------------------+-------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------
22995 | db_lst_370325 | 869853 | 13287 | yugabyte | lst_370325_i | 127.0.0.1 | | 41722 | 2022-09-08 13:47:50.336027+00 | | 2022-09-08 14:09:26.657113+00 | 2022-09-08 14:09:26.661356+00 | Client | ClientRead | idle | | | COMMIT | client backend
22995 | db_lst_370325 | 869855 | 13287 | yugabyte | lst_370325_i | 127.0.0.1 | | 41724 | 2022-09-08 13:47:50.337525+00 | 2022-09-08 14:09:28.58866+00 | 2022-09-08 14:09:28.58866+00 | 2022-09-08 14:09:28.588743+00 | | | active | | | CREATE INDEX CONCURRENTLY idx_tg4_0_c0_float8_c9_boolean_c6_float4_c27_varchar_c1_numeric_c12_varchar_c17_numeric_c22_float8 ON tg4_0 USING lsm (c0_float8 DESC NULLS FIRST, c9_boolean ASC NULLS LAST, c6_float4 ASC NULLS FIRST, c27_varchar ASC NULLS FIRST, c1_numeric ASC NULLS LAST, c12_varchar ASC NULLS FIRST, c17_numeric DESC NULLS FIRST, c22_float8 ASC NULLS LAST) WHERE TRUE; | client backend
13288 | yugabyte | 890510 | 13287 | yugabyte | ysqlsh | 127.0.0.1 | | 41846 | 2022-09-08 16:10:18.660385+00 | 2022-09-08 16:10:41.149582+00 | 2022-09-08 16:10:41.149582+00 | 2022-09-08 16:10:41.14962+00 | | | active | | 3 | SELECT * FROM pg_stat_activity; | client backend
22995 | db_lst_370325 | 869861 | 13287 | yugabyte | lst_370325_i | 127.0.0.1 | | 41728 | 2022-09-08 13:47:50.340063+00 | | 2022-09-08 14:09:25.541895+00 | 2022-09-08 14:09:25.543896+00 | Client | ClientRead | idle | | | COMMIT | client backend
22995 | db_lst_370325 | 869866 | 13287 | yugabyte | lst_370325_i | 127.0.0.1 | | 41734 | 2022-09-08 13:47:50.343454+00 | 2022-09-08 14:09:32.010303+00 | 2022-09-08 14:09:32.039314+00 | 2022-09-08 14:09:32.039886+00 | Client | ClientRead | idle in transaction | | | savepoint sp_1; | client backend
22995 | db_lst_370325 | 869868 | 13287 | yugabyte | lst_370325_i | 127.0.0.1 | | 41738 | 2022-09-08 13:47:50.34557+00 | | 2022-09-08 14:09:25.961501+00 | 2022-09-08 14:09:26.155547+00 | Client | ClientRead | idle | | | DROP INDEX idx_tg2_0_c14_date_c5_date_c13_numeric_c11_bigint_c12_float8; | client backend
| | 829823 | | | | | | | 2022-09-08 11:58:08.433195+00 | | | | Activity | CheckpointerMain | | | | | checkpointer
(7 rows)
Jira Link: DB-3296
Description
Usually at the end of an LST run all queries will have finished or been aborted by timeout, so the LST can finish too. In the last few days I noticed that sometimes the LST just stays stuck at the end, which probably means some queries are still stuck:
`max_connections` is set to 20 to help figure out https://github.com/yugabyte/yugabyte-db/issues/13546; maybe this bug is related to that as well. LST run log:
After waiting for 10 minutes nothing else happens. All queries from the LST run with a timeout (`'options': '-c statement_timeout=300000'} # in ms`), so it is unexpected for them to just get stuck like this. Both the `CREATE INDEX` and `UPDATE` queries appear to have been stuck for >40 minutes already. I'll provide the yugabyte-data directory in case it helps, accessible from within the Yugabyte org: https://drive.google.com/file/d/19c17iCJ7X86PD9tqJpkC8n4r3Y2ZIQXD/view?usp=sharing CC @tedyu The fact that CREATE INDEX is involved might point to a relation to the recent DDL changes.