yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.88k stars 1.05k forks source link

[YSQL][LST] Hanging "CREATE INDEX" and "UPDATE" queries during LST run, don't respect timeout #13771

Open def- opened 2 years ago

def- commented 2 years ago

Jira Link: DB-3296

Description

Usually at the end of an LST run all queries will be finished or aborted by timeout, and thus LST can finish too. In the last few days I noticed that sometimes LST just stays stuck at the end, which means some queries are still stuck probably:

$ cd ~/code/yugabyte-db
$ git checkout fabc178ef5ad0de723abc13d5d90c68773fd6c36
$ arc patch D18465
$ arc patch D18509
$ ./yb_build.sh release
$ bin/yb-ctl --replication_factor 3 create --tserver_flags=ysql_max_connections=20,ysql_enable_packed_row=true,yb_enable_read_committed_isolation=true,ysql_num_shards_per_tserver=2,enable_stream_compression=true,stream_compression_algo=2,yb_num_shards_per_tserver=2 --master_flags=yb_enable_read_committed_isolation=true,enable_stream_compression=true,stream_compression_algo=2,enable_automatic_tablet_splitting=true,tablet_split_low_phase_shard_count_per_node=1,tablet_split_high_phase_shard_count_per_node=5,ysql_enable_packed_row=true
$ cd ~/code/yb-long-system-test
$ git checkout 1016af6a
$ while true; do python3.9 ./long_system_test.py --nodes=127.0.0.1:5433,127.0.0.2:5433,127.0.0.3:5433 --threads=20 --runtime=60 --max-columns=20 --complexity=full; sleep 10; done

max_connections is set to 20 to figure out https://github.com/yugabyte/yugabyte-db/issues/13546. Maybe this bug is also related to it. LST run log:

2022-08-25 07:37:03,124 MainThread INFO     --------------------------------------------------------------------------------
2022-08-25 07:37:03,124 MainThread INFO     Running Long System Test 0.1 - https://github.com/yugabyte/yb-long-system-test/
2022-08-25 07:37:03,124 MainThread INFO     --------------------------------------------------------------------------------
2022-08-25 07:37:03,124 MainThread INFO
2022-08-25 07:37:03,172 MainThread INFO     Reproduce with: git checkout 1016af6a && ./long_system_test.py --nodes=127.0.0.1:5433,127.0.0.2:5433,127.0.0.3:5433 --threads=20 --runtime=60 --max-columns=20 --complexity=full --seed=719764
2022-08-25 07:37:03,842 MainThread INFO     Database version: PostgreSQL 11.2-YB-2.15.3.0-b0 on x86_64-pc-linux-gnu, compiled by clang version 13.0.1 (https://github.com/yugabyte/llvm-project.git 0131d8e1a20f72a0f374979985db4c2871be71bc), 64-bit
2022-08-25 07:37:03,845 MainThread INFO     Creating tables for database db_lst_719764
2022-08-25 07:37:32,297 MainThread INFO     Starting worker_0: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,299 MainThread INFO     Starting worker_1: CreateIndexAction, DropIndexAction, SetConfigAction, AddColumnAction
2022-08-25 07:37:32,299 MainThread INFO     Starting worker_2: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,300 MainThread INFO     Starting worker_3: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,301 MainThread INFO     Starting worker_4: SingleInsertAction, SingleUpdateAction, SingleDeleteAction, BulkInsertAction, BulkUpdateAction, SetConfigAction
2022-08-25 07:37:32,303 MainThread INFO     Starting worker_5: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,310 MainThread INFO     Starting worker_6: CreateIndexAction, DropIndexAction, SetConfigAction, AddColumnAction
2022-08-25 07:37:32,320 MainThread INFO     Starting worker_7: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,326 MainThread INFO     Starting worker_8: CreateIndexAction, DropIndexAction, SetConfigAction, AddColumnAction
2022-08-25 07:37:32,326 MainThread INFO     Starting worker_9: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,329 MainThread INFO     Starting worker_10: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,330 MainThread INFO     Starting worker_11: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,335 MainThread INFO     Starting worker_12: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,336 MainThread INFO     Starting worker_13: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,337 MainThread INFO     Starting worker_14: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,341 MainThread INFO     Starting worker_15: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,341 MainThread INFO     Starting worker_16: CreateIndexAction, DropIndexAction, SetConfigAction, AddColumnAction
2022-08-25 07:37:32,342 MainThread INFO     Starting worker_17: RandomSelectAction, SetConfigAction
2022-08-25 07:37:32,343 MainThread INFO     Starting worker_18: SingleInsertAction, SingleUpdateAction, SingleDeleteAction, BulkInsertAction, BulkUpdateAction, SetConfigAction
2022-08-25 07:37:32,346 MainThread INFO     Starting worker_19: SingleInsertAction, SingleUpdateAction, SingleDeleteAction, BulkInsertAction, BulkUpdateAction, SetConfigAction
2022-08-25 07:37:42,373 MainThread INFO     Worker queries/s: [000.9][000.2][000.3][000.3][000.8][001.2][000.0][000.5][000.0][000.2][000.3][001.3][000.1][000.1][000.2][001.0][000.0][000.2][001.1][002.9]
[...]
2022-08-25 08:37:36,383 MainThread INFO     Worker queries/s: [004.7][000.0][004.0][003.6][000.0][000.8][000.0][003.0][000.0][007.0][003.4][004.4][002.9][005.3][005.0][004.8][000.0][004.7][000.0][000.0]
2022-08-25 08:37:36,383 MainThread INFO
2022-08-25 08:37:36,383 MainThread INFO     --------------------------------------------------------------------------------
2022-08-25 08:37:36,383 MainThread INFO     LST finished successfully in 3633.2 sec
2022-08-25 08:37:36,383 MainThread INFO     --------------------------------------------------------------------------------
2022-08-25 08:37:36,383 MainThread INFO
2022-08-25 08:37:36,384 MainThread INFO     Expected error statistics: 9% failures in 140284 queries
2022-08-25 08:37:36,385 MainThread INFO     XX000 ERROR:  Query error: schema version mismatch for table: https://github.com/yugabyte/yugabyte-db/issues/9973
2022-08-25 08:37:36,385 MainThread INFO       RandomSelectAction: 451, SingleUpdateAction: 12, SingleInsertAction: 19, SingleDeleteAction: 10
2022-08-25 08:37:36,385 MainThread INFO     XX000 ERROR:  Not found: The object: Maybe https://github.com/yugabyte/yugabyte-db/issues/2224
2022-08-25 08:37:36,385 MainThread INFO       RandomSelectAction: 64, SingleInsertAction: 7, SingleDeleteAction: 2
2022-08-25 08:37:36,385 MainThread INFO     22003 re.compile('ERROR:  .* out of range'): Overflow can currently happen with many of our queries
2022-08-25 08:37:36,385 MainThread INFO       RandomSelectAction: 7219, SingleUpdateAction: 20, SingleDeleteAction: 6
2022-08-25 08:37:36,385 MainThread INFO     40001 ERROR:  Query error: schema version mismatch for table: https://github.com/yugabyte/yugabyte-db/issues/9973
2022-08-25 08:37:36,385 MainThread INFO       RandomSelectAction: 198, SingleUpdateAction: 21, SingleDeleteAction: 8, SingleInsertAction: 3
2022-08-25 08:37:36,385 MainThread INFO     22011 ERROR:  negative substring length not allowed: https://g2022-08-25 08:37:36,386 MainThread INFO       SingleUpdateAction: 5, SingleInsertAction: 8
2022-08-25 08:37:36,386 MainThread INFO     40001 re.compile('.*Errors occurred while reaching out to the tablet servers.*'): https://github.com/yugabyte/yugabyte-db/issues/12070
2022-08-25 08:37:36,386 MainThread INFO       SingleInsertAction: 44, SingleUpdateAction: 3, SingleDeleteAction: 2
2022-08-25 08:37:36,387 MainThread INFO     XX000 re.compile('.*Tablet deleted: Table deleted.*'): https://github.com/yugabyte/yugabyte-db/issues/12070
2022-08-25 08:37:36,387 MainThread INFO       SingleInsertAction: 1, SingleUpdateAction: 1, SingleDeleteAction: 1
2022-08-25 08:37:36,387 MainThread INFO     XX000 re.compile('.*Error when decoding range components of a document key:.*'): https://github.com/yugabyte/yugabyte-db/issues/11992
2022-08-25 08:37:36,387 MainThread INFO       SingleDeleteAction: 3, SingleInsertAction: 5
2022-08-25 08:37:36,387 MainThread INFO     40001 ERROR:  Operation failed. Try again: Delete interfered
2022-08-25 08:37:36,387 MainThread INFO       SingleUpdateAction: 2, SingleInsertAction: 2
2022-08-25 08:37:36,387 MainThread INFO     40001 re.compile('.* Conflicts with higher priority transaction:.*'): https://github.com/yugabyte/yugabyte-db/issues/10283
2022-08-25 08:37:36,387 MainThread INFO       DropIndexAction: 1
2022-08-25 08:37:36,387 MainThread INFO     XX000 ERROR:  Not implemented: Not supported to restart DDL transaction: https://github.com/yugabyte/yugabyte-db/issues/13270
2022-08-25 08:37:36,387 MainThread INFO       CreateIndexAction: 2
2022-08-25 08:37:36,387 MainThread INFO     42P17 ERROR:  functions in index predicate must be marked IMMUTABLE: https://github.com/yugabyte/yugabyte-db/issues/12468
2022-08-25 08:37:36,387 MainThread INFO       CreateIndexAction: 1
2022-08-25 08:37:36,387 MainThread INFO     57014 ERROR:  canceling statement due to statement timeout: Queries can time out, but there shouldn't be too many
2022-08-25 08:37:36,387 MainThread INFO       SingleDeleteAction: 2, SingleUpdateAction: 1
ithub.com/yugabyte/yugabyte-db/issues/12288
2022-08-25 08:37:36,385 MainThread INFO       RandomSelectAction: 3316, SingleDeleteAction: 1, SingleUpdateAction: 4
2022-08-25 08:37:36,386 MainThread INFO     XX000 ERROR:  cache lookup failed for attribute: https://github.com/yugabyte/yugabyte-db/issues/11999
2022-08-25 08:37:36,386 MainThread INFO       SingleDeleteAction: 584, SingleUpdateAction: 571
2022-08-25 08:37:36,386 MainThread INFO     XX000 ERROR:  Operation failed. Try again: Delete interfered
2022-08-25 08:37:36,386 MainThread INFO       SingleUpdateAction: 5, SingleInsertAction: 8
2022-08-25 08:37:36,386 MainThread INFO     40001 re.compile('.*Errors occurred while reaching out to the tablet servers.*'): https://github.com/yugabyte/yugabyte-db/issues/12070
2022-08-25 08:37:36,386 MainThread INFO       SingleInsertAction: 44, SingleUpdateAction: 3, SingleDeleteAction: 2
2022-08-25 08:37:36,387 MainThread INFO     XX000 re.compile('.*Tablet deleted: Table deleted.*'): https://github.com/yugabyte/yugabyte-db/issues/12070
2022-08-25 08:37:36,387 MainThread INFO       SingleInsertAction: 1, SingleUpdateAction: 1, SingleDeleteAction: 1
2022-08-25 08:37:36,387 MainThread INFO     XX000 re.compile('.*Error when decoding range components of a document key:.*'): https://github.com/yugabyte/yugabyte-db/issues/11992
2022-08-25 08:37:36,387 MainThread INFO       SingleDeleteAction: 3, SingleInsertAction: 5
2022-08-25 08:37:36,387 MainThread INFO     40001 ERROR:  Operation failed. Try again: Delete interfered
2022-08-25 08:37:36,387 MainThread INFO       SingleUpdateAction: 2, SingleInsertAction: 2
2022-08-25 08:37:36,387 MainThread INFO     40001 re.compile('.* Conflicts with higher priority transaction:.*'): https://github.com/yugabyte/yugabyte-db/issues/10283
2022-08-25 08:37:36,387 MainThread INFO       DropIndexAction: 1
2022-08-25 08:37:36,387 MainThread INFO     XX000 ERROR:  Not implemented: Not supported to restart DDL transaction: https://github.com/yugabyte/yugabyte-db/issues/13270
2022-08-25 08:37:36,387 MainThread INFO       CreateIndexAction: 2
2022-08-25 08:37:36,387 MainThread INFO     42P17 ERROR:  functions in index predicate must be marked IMMUTABLE: https://github.com/yugabyte/yugabyte-db/issues/12468
2022-08-25 08:37:36,387 MainThread INFO       CreateIndexAction: 1
2022-08-25 08:37:36,387 MainThread INFO     57014 ERROR:  canceling statement due to statement timeout: Queries can time out, but there shouldn't be too many
2022-08-25 08:37:36,387 MainThread INFO       SingleDeleteAction: 2, SingleUpdateAction: 1
tablet servers.*'): https://github.com/yugabyte/yugabyte-db/issues/12070
2022-08-25 08:37:36,386 MainThread INFO       SingleInsertAction: 44, SingleUpdateAction: 3, SingleDeleteAction: 2
2022-08-25 08:37:36,387 MainThread INFO     XX000 re.compile('.*Tablet deleted: Table deleted.*'): https://github.com/yugabyte/yugabyte-db/issues/12070
2022-08-25 08:37:36,387 MainThread INFO       SingleInsertAction: 1, SingleUpdateAction: 1, SingleDeleteAction: 1
2022-08-25 08:37:36,387 MainThread INFO     XX000 re.compile('.*Error when decoding range components of a document key:.*'): https://github.com/yugabyte/yugabyte-db/issues/11992
2022-08-25 08:37:36,387 MainThread INFO       SingleDeleteAction: 3, SingleInsertAction: 5
2022-08-25 08:37:36,387 MainThread INFO     40001 ERROR:  Operation failed. Try again: Delete interfered
2022-08-25 08:37:36,387 MainThread INFO       SingleUpdateAction: 2, SingleInsertAction: 2
2022-08-25 08:37:36,387 MainThread INFO     40001 re.compile('.* Conflicts with higher priority transaction:.*'): https://github.com/yugabyte/yugabyte-db/issues/10283
2022-08-25 08:37:36,387 MainThread INFO       DropIndexAction: 1
2022-08-25 08:37:36,387 MainThread INFO     XX000 ERROR:  Not implemented: Not supported to restart DDL transaction: https://github.com/yugabyte/yugabyte-db/issues/13270
2022-08-25 08:37:36,387 MainThread INFO       CreateIndexAction: 2
2022-08-25 08:37:36,387 MainThread INFO     42P17 ERROR:  functions in index predicate must be marked IMMUTABLE: https://github.com/yugabyte/yugabyte-db/issues/12468
2022-08-25 08:37:36,387 MainThread INFO       CreateIndexAction: 1
2022-08-25 08:37:36,387 MainThread INFO     57014 ERROR:  canceling statement due to statement timeout: Queries can time out, but there shouldn't be too many
2022-08-25 08:37:36,387 MainThread INFO       SingleDeleteAction: 2, SingleUpdateAction: 1
2022-08-25 08:37:36,386 MainThread INFO       SingleInsertAction: 44, SingleUpdateAction: 3, SingleDeleteAction: 2
2022-08-25 08:37:36,387 MainThread INFO     XX000 re.compile('.*Tablet deleted: Table deleted.*'): https://github.com/yugabyte/yugabyte-db/issues/12070
2022-08-25 08:37:36,387 MainThread INFO       SingleInsertAction: 1, SingleUpdateAction: 1, SingleDeleteAction: 1
2022-08-25 08:37:36,387 MainThread INFO     XX000 re.compile('.*Error when decoding range components of a document key:.*'): https://github.com/yugabyte/yugabyte-db/issues/11992
2022-08-25 08:37:36,387 MainThread INFO       SingleDeleteAction: 3, SingleInsertAction: 5
2022-08-25 08:37:36,387 MainThread INFO     40001 ERROR:  Operation failed. Try again: Delete interfered
2022-08-25 08:37:36,387 MainThread INFO       SingleUpdateAction: 2, SingleInsertAction: 2
2022-08-25 08:37:36,387 MainThread INFO     40001 re.compile('.* Conflicts with higher priority transaction:.*'): https://github.com/yugabyte/yugabyte-db/issues/10283
2022-08-25 08:37:36,387 MainThread INFO       DropIndexAction: 1
2022-08-25 08:37:36,387 MainThread INFO     XX000 ERROR:  Not implemented: Not supported to restart DDL transaction: https://github.com/yugabyte/yugabyte-db/issues/13270
2022-08-25 08:37:36,387 MainThread INFO       CreateIndexAction: 2
2022-08-25 08:37:36,387 MainThread INFO     42P17 ERROR:  functions in index predicate must be marked IMMUTABLE: https://github.com/yugabyte/yugabyte-db/issues/12468
2022-08-25 08:37:36,387 MainThread INFO       CreateIndexAction: 1
2022-08-25 08:37:36,387 MainThread INFO     57014 ERROR:  canceling statement due to statement timeout: Queries can time out, but there shouldn't be too many
2022-08-25 08:37:36,387 MainThread INFO       SingleDeleteAction: 2, SingleUpdateAction: 1

After waiting for 10 minutes nothing else happens. All queries from LST are running with a timeout ('options': '-c statement_timeout=300000'} # in ms), so it is unexpected for them to just get stuck like this:

$ bin/ysqlsh
ysqlsh (11.2-YB-2.15.3.0-b0)
Type "help" for help.

yugabyte=# SELECT * FROM pg_stat_activity;
 datid |    datname    |   pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |    wait_event    |        state        | backend_xid | backend_xmin |                                                                                                                                      query                                                                                                                                      |  backend_type
-------+---------------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+---------------------+-------------+--------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------
 13288 | yugabyte      | 3616031 |    13287 | yugabyte | ysqlsh           | 127.0.0.1   |                 |       50122 | 2022-08-25 08:49:22.852015+00 | 2022-08-25 08:50:56.697739+00 | 2022-08-25 08:50:56.697739+00 | 2022-08-25 08:50:56.697782+00 |                 |                  | active              |             |            3 | SELECT * FROM pg_stat_activity;                                                                                                                                                                                                                                                 | client backend
 20034 | db_lst_719764 | 3590716 |    13287 | yugabyte | lst_719764_i     | 127.0.0.1   |                 |       50036 | 2022-08-25 07:37:32.320845+00 | 2022-08-25 08:17:03.154782+00 | 2022-08-25 08:17:03.154782+00 | 2022-08-25 08:17:03.154855+00 |                 |                  | active              |             |              | CREATE INDEX CONCURRENTLY idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_numeric ON tg0_0 USING lsm (c0_varchar DESC NULLS LAST, c22_int ASC NULLS LAST, c15_date ASC NULLS FIRST, c1_smallint ASC NULLS LAST, c2_int ASC NULLS LAST, c3_numeric DESC NULLS LAST); | client backend
 20034 | db_lst_719764 | 3590739 |    13287 | yugabyte | lst_719764_i     | 127.0.0.1   |                 |       50060 | 2022-08-25 07:37:32.347365+00 | 2022-08-25 08:17:06.460012+00 | 2022-08-25 08:17:07.786626+00 | 2022-08-25 08:17:07.810246+00 | Client          | ClientRead       | idle in transaction |             |              | UPDATE tg1_0 SET c5_numrange = '(-15.060555688130208,53.926491790237776]'::NUMRANGE WHERE FALSE;                                                                                                                                                                                | client backend
 20034 | db_lst_719764 | 3613572 |       10 | postgres |                  |             |                 |          -1 | 2022-08-25 08:17:06.257753+00 | 2022-08-25 08:17:08.09195+00  | 2022-08-25 08:17:08.092165+00 | 2022-08-25 08:50:56.348113+00 |                 |                  | active              |             |            3 | BACKFILL INDEX 20522 WITH x'0880011a00' READ TIME 6805157585793978368 PARTITION x'';                                                                                                                                                                                            | client backend
 20034 | db_lst_719764 | 3613666 |       10 | postgres |                  |             |                 |          -1 | 2022-08-25 08:18:06.347421+00 | 2022-08-25 08:18:06.711396+00 | 2022-08-25 08:18:06.745392+00 | 2022-08-25 08:50:56.204116+00 |                 |                  | active              |             |              | BACKFILL INDEX 20522 WITH x'0880011a00' READ TIME 6805157585793978368 PARTITION x'';                                                                                                                                                                                            | client backend
 20034 | db_lst_719764 | 3613791 |       10 | postgres |                  |             |                 |          -1 | 2022-08-25 08:19:06.386257+00 | 2022-08-25 08:19:06.518662+00 | 2022-08-25 08:19:06.572306+00 | 2022-08-25 08:50:56.023823+00 |                 |                  | active              |             |              | BACKFILL INDEX 20522 WITH x'0880011a00' READ TIME 6805157585793978368 PARTITION x'';                                                                                                                                                                                            | client backend
 20034 | db_lst_719764 | 3613875 |       10 | postgres |                  |             |                 |          -1 | 2022-08-25 08:20:06.467979+00 | 2022-08-25 08:20:06.680382+00 | 2022-08-25 08:20:06.716008+00 | 2022-08-25 08:50:56.023479+00 |                 |                  | active              |             |              | BACKFILL INDEX 20522 WITH x'0880011a00' READ TIME 6805157585793978368 PARTITION x'';                                                                                                                                                                                            | client backend
       |               | 3418310 |          |          |                  |             |                 |             | 2022-08-24 17:30:55.361004+00 |                               |                               |                               | Activity        | CheckpointerMain |                     |             |              |                                                                                                                                                                                                                                                                                 | checkpointer
(8 rows)

Both the CREATE INDEX and UPDATE queries appear to be stuck for >40 minutes already. I'll provide yugabyte-data directory in case it helps, accessible from within Yugabyte org: https://drive.google.com/file/d/19c17iCJ7X86PD9tqJpkC8n4r3Y2ZIQXD/view?usp=sharing CC @tedyu The create index being involved might point to a relation to recent DDL changes.

def- commented 2 years ago
[dfelsing@dev-server-dfelsing ~]$ ps aux|grep postgres
dfelsing 3418204  0.0  0.1 267180 43264 ?        S    Aug24   0:02 /nfusr/dev-server/dfelsing/code/yugabyte-db/build/release-clang13-linuxbrew-dynamic-ninja/postgres/bin/postgres -D /nfusr/dev-server/dfelsing/yugabyte-data/node-3/disk-1/pg_data -p 5433 -h 127.0.0.3 -k /tmp/.yb.896641872422930295 -c unix_socket_permissions=0700 -c logging_collector=on -c log_directory=/nfusr/dev-server/dfelsing/yugabyte-data/node-3/disk-1/yb-data/tserver/logs -c yb_pg_metrics.node_name=dev-server-dfelsing:9000 -c yb_pg_metrics.port=13000 -c config_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-3/disk-1/pg_data/ysql_pg.conf -c hba_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-3/disk-1/pg_data/ysql_hba.conf
dfelsing 3418234  0.0  0.1 267180 42632 ?        S    Aug24   0:02 /nfusr/dev-server/dfelsing/code/yugabyte-db/build/release-clang13-linuxbrew-dynamic-ninja/postgres/bin/postgres -D /nfusr/dev-server/dfelsing/yugabyte-data/node-2/disk-1/pg_data -p 5433 -h 127.0.0.2 -k /tmp/.yb.1895829667009633833 -c unix_socket_permissions=0700 -c logging_collector=on -c log_directory=/nfusr/dev-server/dfelsing/yugabyte-data/node-2/disk-1/yb-data/tserver/logs -c yb_pg_metrics.node_name=dev-server-dfelsing:9000 -c yb_pg_metrics.port=13000 -c config_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-2/disk-1/pg_data/ysql_pg.conf -c hba_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-2/disk-1/pg_data/ysql_hba.conf
dfelsing 3418264  0.0  0.1 267180 43992 ?        S    Aug24   0:02 /nfusr/dev-server/dfelsing/code/yugabyte-db/build/release-clang13-linuxbrew-dynamic-ninja/postgres/bin/postgres -D /nfusr/dev-server/dfelsing/yugabyte-data/node-1/disk-1/pg_data -p 5433 -h 127.0.0.1 -k /tmp/.yb.9427142463175692966 -c unix_socket_permissions=0700 -c logging_collector=on -c log_directory=/nfusr/dev-server/dfelsing/yugabyte-data/node-1/disk-1/yb-data/tserver/logs -c yb_pg_metrics.node_name=dev-server-dfelsing:9000 -c yb_pg_metrics.port=13000 -c config_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-1/disk-1/pg_data/ysql_pg.conf -c hba_file=/nfusr/dev-server/dfelsing/yugabyte-data/node-1/disk-1/pg_data/ysql_hba.conf
dfelsing 3418294  0.3  0.0 122584 13756 ?        Ss   Aug24   3:24 postgres: logger
dfelsing 3418296  0.0  0.0 275376 20972 ?        Ssl  Aug24   0:04 postgres: YSQL webserver
dfelsing 3418298  0.0  0.0 267180 14952 ?        Ss   Aug24   0:00 postgres: checkpointer
dfelsing 3418299  0.0  0.0 124700 13864 ?        Ss   Aug24   0:04 postgres: stats collector
dfelsing 3418300  0.4  0.0 122584 13764 ?        Ss   Aug24   4:20 postgres: logger
dfelsing 3418302  0.0  0.0 275376 20912 ?        Ssl  Aug24   0:05 postgres: YSQL webserver
dfelsing 3418304  0.3  0.0 122584 14128 ?        Ss   Aug24   3:29 postgres: logger
dfelsing 3418305  0.0  0.0 267180 14952 ?        Ss   Aug24   0:00 postgres: checkpointer
dfelsing 3418306  0.0  0.0 124700 13728 ?        Ss   Aug24   0:05 postgres: stats collector
dfelsing 3418308  0.0  0.0 275376 21660 ?        Ssl  Aug24   0:04 postgres: YSQL webserver
dfelsing 3418310  0.0  0.0 267180 15764 ?        Ss   Aug24   0:00 postgres: checkpointer
dfelsing 3418311  0.0  0.0 124700 14092 ?        Ss   Aug24   0:05 postgres: stats collector
dfelsing 3590709  0.9  0.3 390712 95552 ?        Ssl  07:37   0:45 postgres: yugabyte db_lst_719764 127.0.0.1(33030) idle
dfelsing 3590713  6.5  0.8 552504 249820 ?       Ssl  07:37   5:27 postgres: yugabyte db_lst_719764 127.0.0.1(33036) idle
dfelsing 3590716  1.1  0.3 401976 99892 ?        Ssl  07:37   0:57 postgres: yugabyte db_lst_719764 127.0.0.1(50036) CREATE INDEX
dfelsing 3590721  1.5  0.3 394808 100472 ?       Ssl  07:37   1:15 postgres: yugabyte db_lst_719764 127.0.0.1(42778) idle
dfelsing 3590738  1.0  0.3 390712 94476 ?        Ssl  07:37   0:52 postgres: yugabyte db_lst_719764 127.0.0.1(33060) idle
dfelsing 3590739  6.7  0.7 532664 225832 ?       Ssl  07:37   5:34 postgres: yugabyte db_lst_719764 127.0.0.1(50060) idle in transaction
dfelsing 3590742  5.9  0.7 533328 231976 ?       Ssl  07:37   4:55 postgres: yugabyte db_lst_719764 127.0.0.1(33066) idle
dfelsing 3613572  0.3  1.0 655992 334204 ?       Ssl  08:17   0:07 postgres: postgres db_lst_719764 [local] BACKFILL INDEX
dfelsing 3613666  0.1  0.3 431056 120532 ?       Ssl  08:18   0:04 postgres: postgres db_lst_719764 [local] BACKFILL INDEX
dfelsing 3613791  0.1  0.3 429008 116640 ?       Ssl  08:19   0:04 postgres: postgres db_lst_719764 [local] BACKFILL INDEX
dfelsing 3613875  0.1  0.3 426960 115028 ?       Ssl  08:20   0:04 postgres: postgres db_lst_719764 [local] BACKFILL INDEX
dfelsing 3616910  0.0  0.0  12140  1164 pts/2    S+   09:00   0:00 grep --color=auto postgres

In the postgres logs there are some occurences of catalog cache misses:

2022-08-25 08:17:03.335 UTC [3590716] LOG:  Catalog cache miss on cache with id 50:
        Target rel: pg_class (oid : 1259), index oid 2663
        Search keys: idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_num, 2200
            @           0x9b5748  errmsg
            @           0x98beb2  SearchCatCacheMiss
            @           0x98ae85  SearchCatCacheInternal
            @           0x9adf69  GetSysCacheOid
            @           0x582e53  index_create
            @           0x648ef7  DefineIndex
            @           0x85a938  ProcessUtilitySlow
            @           0x858ec0  standard_ProcessUtility
            @           0x8584ae  YBProcessUtilityDefaultHook
            @     0x7f953e09fe70  pgss_ProcessUtility
            @     0x7f953e096ad7  ybpgm_ProcessUtility
            @     0x7f953e07a72e  pg_hint_plan_ProcessUtility
            @           0x9e4b07  YBTxnDdlProcessUtility
            @           0x8580fa  PortalRunUtility
            @           0x8577e7  PortalRunMulti
            @           0x8570a6  PortalRun
            @           0x854826  yb_exec_simple_query_impl
            @           0x854dc6  yb_exec_query_wrapper_one_attempt
            @           0x85193b  PostgresMain
            @           0x7bda3c  BackendRun
            @           0x7bce90  ServerLoop
            @           0x7b944b  PostmasterMain
            @           0x719c1d  PostgresServerProcessMain
            @           0x71a0c2  main
            @     0x7f954293a825  __libc_start_main
            @           0x4c6ed9  _start

2022-08-25 08:17:03.335 UTC [3590716] STATEMENT:  CREATE INDEX CONCURRENTLY idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_numeric ON tg0_0 USING lsm (c0_varchar DESC NULLS LAST, c22_int ASC NULLS LAST, c15_date ASC NULLS FIRST, c1_smallint ASC NULLS LAST, c2_int ASC NULLS LAST, c3_numeric DESC NULLS LAST);

Tserver logs around this time:

I0825 08:16:59.135116 3445811 libpq_utils.cc:207] Connected to PG (host=/tmp/.yb.9427142463175692966 port=5433 dbname='db_lst_719764' connect_timeout=58 user='postgres' password=<REDACTED>), time taken: 2.400s
[...]
I0825 08:17:03.362573 3418004 pg_create_table.cc:90] PgCreateTable: creating a transactional index: db_lst_719764.idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_num/{ database_oid: 20034 object_oid: 2
0522 }
I0825 08:17:03.478680 3424869 tablet.cc:1965] T 27ca2aad50c945f5be8bdf65cdc63f54 P b544631ca71c4083b396dd88b3535098: Alter schema from Schema [
        0:ybrowid[binary NOT NULL NOT A PARTITION KEY],
        1:c0_varchar[string NULLABLE NOT A PARTITION KEY],
        2:c1_smallint[int16 NULLABLE NOT A PARTITION KEY],
        3:c2_int[int32 NULLABLE NOT A PARTITION KEY],
        4:c3_numeric[decimal NULLABLE NOT A PARTITION KEY],
        5:c4_daterange[binary NULLABLE NOT A PARTITION KEY],
        6:c5_date[int32 NULLABLE NOT A PARTITION KEY],
        7:c6_varchar[string NULLABLE NOT A PARTITION KEY],
        8:c7_int8range[binary NULLABLE NOT A PARTITION KEY],
        9:c8_daterange[binary NULLABLE NOT A PARTITION KEY],
        10:c9_int4range[binary NULLABLE NOT A PARTITION KEY],
        11:c10_int[int32 NULLABLE NOT A PARTITION KEY],
        12:c11_json[binary NULLABLE NOT A PARTITION KEY],
        13:c12_float4[float NULLABLE NOT A PARTITION KEY],
        14:c13_jsonb[binary NULLABLE NOT A PARTITION KEY],
        15:c14_smallint[int16 NULLABLE NOT A PARTITION KEY],
        16:c15_date[int32 NULLABLE NOT A PARTITION KEY],
        17:c16_date[int32 NULLABLE NOT A PARTITION KEY],
        18:c17_int4range[binary NULLABLE NOT A PARTITION KEY],
        19:c18_int4range[binary NULLABLE NOT A PARTITION KEY],
        20:c19_float4[float NULLABLE NOT A PARTITION KEY],
        21:c20_numeric[decimal NULLABLE NOT A PARTITION KEY],
        22:c21_int8range[binary NULLABLE NOT A PARTITION KEY],
        23:c22_int[int32 NULLABLE NOT A PARTITION KEY],
        24:c23_numrange[binary NULLABLE NOT A PARTITION KEY],
        25:c24_smallint[int16 NULLABLE NOT A PARTITION KEY],
        26:c25_varchar[string NULLABLE NOT A PARTITION KEY],
        27:c26_date[int32 NULLABLE NOT A PARTITION KEY],
        28:c27_date[int32 NULLABLE NOT A PARTITION KEY]
]
properties: contain_counters: false is_transactional: true consistency_level: STRONG use_mangled_column_name: false is_ysql_catalog_table: false retain_delete_markers: false partitioning_version: 1
colocation_id: 814432576 version 122 to Schema [
        0:ybrowid[binary NOT NULL NOT A PARTITION KEY],
        1:c0_varchar[string NULLABLE NOT A PARTITION KEY],
        2:c1_smallint[int16 NULLABLE NOT A PARTITION KEY],
        3:c2_int[int32 NULLABLE NOT A PARTITION KEY],
        4:c3_numeric[decimal NULLABLE NOT A PARTITION KEY],
        5:c4_daterange[binary NULLABLE NOT A PARTITION KEY],
        6:c5_date[int32 NULLABLE NOT A PARTITION KEY],
        7:c6_varchar[string NULLABLE NOT A PARTITION KEY],
        8:c7_int8range[binary NULLABLE NOT A PARTITION KEY],
        9:c8_daterange[binary NULLABLE NOT A PARTITION KEY],
        10:c9_int4range[binary NULLABLE NOT A PARTITION KEY],
        11:c10_int[int32 NULLABLE NOT A PARTITION KEY],
        12:c11_json[binary NULLABLE NOT A PARTITION KEY],
        13:c12_float4[float NULLABLE NOT A PARTITION KEY],
        14:c13_jsonb[binary NULLABLE NOT A PARTITION KEY],
        15:c14_smallint[int16 NULLABLE NOT A PARTITION KEY],
        16:c15_date[int32 NULLABLE NOT A PARTITION KEY],
        17:c16_date[int32 NULLABLE NOT A PARTITION KEY],
        18:c17_int4range[binary NULLABLE NOT A PARTITION KEY],
        19:c18_int4range[binary NULLABLE NOT A PARTITION KEY],
        20:c19_float4[float NULLABLE NOT A PARTITION KEY],
        21:c20_numeric[decimal NULLABLE NOT A PARTITION KEY],
        22:c21_int8range[binary NULLABLE NOT A PARTITION KEY],
        23:c22_int[int32 NULLABLE NOT A PARTITION KEY],
        24:c23_numrange[binary NULLABLE NOT A PARTITION KEY],
        25:c24_smallint[int16 NULLABLE NOT A PARTITION KEY],
        26:c25_varchar[string NULLABLE NOT A PARTITION KEY],
        27:c26_date[int32 NULLABLE NOT A PARTITION KEY],
        28:c27_date[int32 NULLABLE NOT A PARTITION KEY]
]
properties: contain_counters: false is_transactional: true consistency_level: STRONG use_mangled_column_name: false is_ysql_catalog_table: false retain_delete_markers: false partitioning_version: 1
colocation_id: 814432576 version 123
I0825 08:17:03.479979 3424869 meta_data_cache.cc:77] Creating a metadata cache without a permissions cache
I0825 08:17:03.636530 3418004 table_creator.cc:363] Created index db_lst_719764.idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_num of type PGSQL_TABLE_TYPE
I0825 08:17:06.251971 3420376 tablet.cc:2175] Begin BackfillIndexesForYsql at { physical: 1661415426219233 } from <start-of-the-tablet> for [table_id: "00004e4200003000800000000000502a" version: 0 is_local: false hash_column_count: 0 range_column_count: 0 is_unique: false indexed_table_id: "00004e42000030008000000000004e82" use_mangled_column_name: false index_permissions: INDEX_PERM_WRITE_AND_DELETE backfill_error_message: ""]
I0825 08:17:08.091812 3420376 libpq_utils.cc:207] Connected to PG (host=/tmp/.yb.9427142463175692966 port=5433 dbname='db_lst_719764' connect_timeout=59 user='postgres' password=<REDACTED>), time taken: 1.840s
tedyu commented 2 years ago

According to Dennis, the hanging happened more than once in the past week. This means without reduced max_connections, the hanging can still happen.

tedyu commented 2 years ago

I think the CREATE INDEX CONCURRENTLY idx_tg0_0_c0_varchar_c22_int_c15_date_c1_smallint_c2_int_c3_numeric was followed by BACKFILL INDEX 20522 WITH. Meaning backfill had been running.

cc @jason-yb

tedyu commented 2 years ago

Dennis: It would help if you can detect the hanging LST (by looking at query_start column of pg_stat_activity) and capture /threadz?group=all from the tservers.

def- commented 2 years ago

Got it again. After 30 mins of hanging:

yugabyte=# SELECT * FROM pg_stat_activity;
 datid |    datname    |   pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |    wait_event    | state  | backend_xid | backend_xmin |                                          query                                           |  backend_type
-------+---------------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+------------------------------------------------------------------------------------------+----------------
 13288 | yugabyte      | 1762894 |    13287 | yugabyte | ysqlsh           | 127.0.0.1   |                 |       37658 | 2022-09-02 16:10:10.266513+00 | 2022-09-02 16:10:54.14204+00  | 2022-09-02 16:10:54.14204+00  | 2022-09-02 16:10:54.142131+00 |                 |                  | active |             |            3 | SELECT * FROM pg_stat_activity;                                                          | client backend
 20281 | db_lst_850715 | 1739026 |    13287 | yugabyte | lst_850715_i     | 127.0.0.1   |                 |       37548 | 2022-09-02 14:38:51.910724+00 |                               | 2022-09-02 14:58:10.409321+00 | 2022-09-02 14:58:10.464552+00 | Client          | ClientRead       | idle   |             |              | ROLLBACK                                                                                 | client backend
 20281 | db_lst_850715 | 1739051 |    13287 | yugabyte | lst_850715_i     | 127.0.0.1   |                 |       37572 | 2022-09-02 14:38:51.928679+00 |                               | 2022-09-02 14:58:18.211339+00 | 2022-09-02 14:58:18.229574+00 | Client          | ClientRead       | idle   |             |              | ROLLBACK                                                                                 | client backend
 20281 | db_lst_850715 | 1739056 |    13287 | yugabyte | lst_850715_i     | 127.0.0.1   |                 |       37578 | 2022-09-02 14:38:51.924037+00 |                               | 2022-09-02 15:23:55.498421+00 | 2022-09-02 15:23:55.525019+00 | Client          | ClientRead       | idle   |             |              | ROLLBACK                                                                                 | client backend
 20281 | db_lst_850715 | 1757651 |       10 | postgres |                  |             |                 |          -1 | 2022-09-02 14:58:33.464523+00 | 2022-09-02 14:58:34.800631+00 | 2022-09-02 14:58:34.800879+00 | 2022-09-02 16:10:53.240109+00 |                 |                  | active |             |            3 | BACKFILL INDEX 19700 WITH x'0880011a00' READ TIME 6808087401954246656 PARTITION x'aaa8'; | client backend
 20281 | db_lst_850715 | 1757788 |       10 | postgres |                  |             |                 |          -1 | 2022-09-02 14:59:33.49601+00  | 2022-09-02 14:59:33.650278+00 | 2022-09-02 14:59:33.673502+00 | 2022-09-02 16:10:53.174117+00 |                 |                  | active |             |              | BACKFILL INDEX 19700 WITH x'0880011a00' READ TIME 6808087401954246656 PARTITION x'aaa8'; | client backend
 20281 | db_lst_850715 | 1757898 |       10 | postgres |                  |             |                 |          -1 | 2022-09-02 15:00:33.55396+00  | 2022-09-02 15:00:33.626707+00 | 2022-09-02 15:00:33.642816+00 | 2022-09-02 16:10:53.639117+00 |                 |                  | active |             |              | BACKFILL INDEX 19700 WITH x'0880011a00' READ TIME 6808087401954246656 PARTITION x'aaa8'; | client backend
 20281 | db_lst_850715 | 1758006 |       10 | postgres |                  |             |                 |          -1 | 2022-09-02 15:01:33.630768+00 | 2022-09-02 15:01:33.701976+00 | 2022-09-02 15:01:33.720206+00 | 2022-09-02 16:10:54.043188+00 |                 |                  | active |             |              | BACKFILL INDEX 19700 WITH x'0880011a00' READ TIME 6808087401954246656 PARTITION x'aaa8'; | client backend
       |               | 1617073 |          |          |                  |             |                 |             | 2022-09-02 10:42:06.998491+00 |                               |                               |                               | Activity        | CheckpointerMain |        |             |              |                                                                                          | checkpointer
(9 rows)

This time BACKFILL INDEX and ROLLBACK. So the common part appears to be backfill index. @tedyu Requested captures are attached: threadz.zip

tedyu commented 2 years ago

In threadz.2.html , I notice the following:

<tr><td>rpc_tp_TabletServer_29-1617254</td><td>21.120s</td><td>0.000s</td><td>0.000s</td><td rowspan="1"><pre>    @     0x7fb396ea711f  (unknown)
    @     0x7fb39722046c  __GI___nanosleep
    @     0x7fb3977e9dda  std::__1::this_thread::sleep_for()
    @     0x7fb39bbe49d8  yb::GenericBackoffWaiter<>::Wait()
    @     0x7fb39a85ffc1  yb::client::RetryFunc()
    @     0x7fb39a865c50  yb::client::YBClient::Data::WaitForBackfillIndexToFinish()
    @     0x7fb39a865976  yb::client::YBClient::Data::BackfillIndex()
    @     0x7fb39a8435b0  yb::client::YBClient::BackfillIndex()
    @     0x7fb39c17e7f4  yb::tserver::PgClientSession::BackfillIndex()
    @     0x7fb39c178850  yb::tserver::PgClientServiceImpl::Impl::BackfillIndex()
    @     0x7fb39c176748  yb::tserver::PgClientServiceImpl::BackfillIndex()
    @     0x7fb39b8ce5b9  std::__1::__function::__func<>::operator()()
    @     0x7fb39b8ccf03  yb::tserver::PgClientServiceIf::Handle()
    @     0x7fb398fc8829  yb::rpc::ServicePoolImpl::Handle()
    @     0x7fb398f726e2  yb::rpc::InboundCall::InboundCallTask::Run()
    @     0x7fb398fd47df  yb::rpc::(anonymous namespace)::Worker::Execute()

Total number of threads: 1</pre></td></tr>
<tr><td>rpc_tp_TabletServer_55-1617995</td><td>19.450s</td><td>0.000s</td><td>0.000s</td><td rowspan="4"><pre>    @     0x7fb396ea711f  (unknown)
    @     0x7fb396f515cc  __GI___poll
    @     0x7fb39bc3ecf1  pqSocketCheck
    @     0x7fb39bc3eaeb  pqWait
    @     0x7fb39bc3ad80  PQgetResult
    @     0x7fb39bc3b395  PQexecFinish
    @     0x7fb39c4ba390  yb::pgwrapper::PGConn::Fetch()
    @     0x7fb39be89263  yb::tablet::Tablet::BackfillIndexesForYsql()
    @     0x7fb39c1dd8f7  yb::tserver::TabletServiceAdminImpl::BackfillIndex()

In the other two files, I saw BackfillIndex(). This means index backfilling got stuck across the tservers.

It hangs in PQexecFinish():

    while ((result = PQgetResult(conn)) != NULL)
tedyu commented 2 years ago

Can we get tserver logs corresponding to threadz.2.html ?

Thanks

def- commented 2 years ago

I don't have them anymore, but happened again ;) New try:

[deen@devp yugabyte-db]$ bin/ysqlsh -h 127.0.0.1
yugabyte=# SELECT * FROM pg_stat_activity;
 datid | datname  |   pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |    wait_event    | state  | backend_xid | backend_xmin |              query              |  backend_type
-------+----------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+---------------------------------+----------------
 13288 | yugabyte | 2838317 |    13287 | yugabyte | ysqlsh           | 127.0.0.1   |                 |       60720 | 2022-09-02 16:49:04.682505+00 | 2022-09-02 16:49:12.483181+00 | 2022-09-02 16:49:12.483181+00 | 2022-09-02 16:49:12.483248+00 |                 |                  | active |             |            3 | SELECT * FROM pg_stat_activity; | client backend
       |          | 2602430 |          |          |                  |             |                 |             | 2022-09-02 07:55:42.438895+00 |                               |                               |                               | Activity        | CheckpointerMain |        |             |              |                                 | checkpointer
(2 rows)
[deen@devp yugabyte-db]$ bin/ysqlsh -h 127.0.0.2
ysqlsh (11.2-YB-2.15.3.0-b0)
Type "help" for help.

yugabyte=# SELECT * FROM pg_stat_activity;
 datid | datname  |   pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start          |         query_start          |         state_change          | wait_event_type |    wait_event    | state  | backend_xid | backend_xmin |              query              |  backend_type
-------+----------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+------------------------------+------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+---------------------------------+----------------
 13288 | yugabyte | 2838356 |    13287 | yugabyte | ysqlsh           | 127.0.0.1   |                 |       49566 | 2022-09-02 16:49:46.047175+00 | 2022-09-02 16:49:47.76328+00 | 2022-09-02 16:49:47.76328+00 | 2022-09-02 16:49:47.763399+00 |                 |                  | active |             |            3 | SELECT * FROM pg_stat_activity; | client backend
       |          | 2602444 |          |          |                  |             |                 |             | 2022-09-02 07:55:42.457823+00 |                              |                              |                               | Activity        | CheckpointerMain |        |             |              |                                 | checkpointer
(2 rows)
[deen@devp yugabyte-db]$ bin/ysqlsh -h 127.0.0.3
ysqlsh (11.2-YB-2.15.3.0-b0)
Type "help" for help.

yugabyte=# SELECT * FROM pg_stat_activity;
 datid | datname  |   pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |    wait_event    | state  | backend_xid | backend_xmin |                                                                                                                                                                                                                                           query
                                                                                                                                                                                                |  backend_type
-------+----------+---------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+--------+-------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------
 22661 |          | 2792520 |    13287 | yugabyte | lst_626892_i     | 127.0.0.1   |                 |       44708 | 2022-09-02 15:34:09.109658+00 | 2022-09-02 16:34:11.269827+00 | 2022-09-02 16:34:11.269827+00 | 2022-09-02 16:34:11.269911+00 |                 |                  | active |             |              | CREATE INDEX CONCURRENTLY idx_t0_c6_int_c4_date_c15_bigint_c20_float4_c34_boolean_c32_float8_c7_numeric_c27_int_c31_varchar_c19_date_c22_float4 ON t0 USING btree (c6_int ASC NULLS FIRST, c4_date DESC NULLS FIRST, c15_bigint DESC NULLS LAST, c20_float4 DESC NULLS LAST, c34_boolean A
SC NULLS FIRST, c32_float8 ASC NULLS FIRST, c7_numeric ASC NULLS LAST, c27_int ASC NULLS LAST, c31_varchar DESC NULLS LAST, c19_date DESC NULLS FIRST, c22_float4 ASC NULLS FIRST) WHERE FALSE; | client backend
 13288 | yugabyte | 2838388 |    13287 | yugabyte | ysqlsh           | 127.0.0.1   |                 |       44736 | 2022-09-02 16:49:55.200908+00 | 2022-09-02 16:49:57.227399+00 | 2022-09-02 16:49:57.227399+00 | 2022-09-02 16:49:57.227518+00 |                 |                  | active |             |            3 | SELECT * FROM pg_stat_activity;
                                                                                                                                                                                                | client backend
       |          | 2602450 |          |          |                  |             |                 |             | 2022-09-02 07:55:42.475315+00 |                               |                               |                               | Activity        | CheckpointerMain |        |             |              |
                                                                                                                                                                                                | checkpointer
(3 rows)

threadz.zip

tedyu commented 2 years ago

From threadz.3.html

<tr><td>rpc_tp_TabletServer_26-2602745</td><td>20.950s</td><td>0.000s</td><td>0.000s</td><td rowspan="1"><pre>    @     0x7f59bcac911f  (unknown)
    @     0x7f59bce4246c  __GI___nanosleep
    @     0x7f59bd40bdda  std::__1::this_thread::sleep_for()
    @     0x7f59c1801b58  yb::GenericBackoffWaiter<>::Wait()
    @     0x7f59c047e101  yb::client::RetryFunc()
    @     0x7f59c0483d90  yb::client::YBClient::Data::WaitForBackfillIndexToFinish()
    @     0x7f59c0483ab6  yb::client::YBClient::Data::BackfillIndex()
    @     0x7f59c04616f0  yb::client::YBClient::BackfillIndex()
    @     0x7f59c1d9b954  yb::tserver::PgClientSession::BackfillIndex()
    @     0x7f59c1d959b0  yb::tserver::PgClientServiceImpl::Impl::BackfillIndex()
    @     0x7f59c1d938a8  yb::tserver::PgClientServiceImpl::BackfillIndex()

Please get server 3's logs.

def- commented 2 years ago

Too large to upload here, accessible from within Yugabyte org: https://drive.google.com/file/d/1z-zPXJxQZku0zcWIO2PhnRqbwCcEx1jX/view?usp=sharing

tedyu commented 2 years ago

cc @jason-yb

tedyu commented 2 years ago

I took a look at yb-tserver.devp.deen.log.INFO.20220902-075533.2601900 but haven't found which backfill request hung.

Looking at YBClient::Data::BackfillIndex, there is only log for the completion of backfill:

  LOG(INFO) << "Backfilled index " << req.index_identifier().ShortDebugString();

There is no log for backfill start.

tedyu commented 2 years ago

The client of backfill showed YBClient::Data::WaitForBackfillIndexToFinish() in thread dump.

We would need master log to get more idea of what might have gone wrong.

tedyu commented 2 years ago

From yb-tserver.devp.deen.log.INFO.20220902-075533.2601900

I0902 07:56:36.254930 2602785 tablet.cc:2174] Begin BackfillIndexesForYsql at { physical: 1662105396035999 } from <start-of-the-tablet> for [table_id: "00004000000030008000000000004105" version: 0 is_local: false hash_column_count: 0 range_column_count: 0 is_unique: false indexed_table_id: "0000400000003000800000000000402a" use_mangled_column_name: false index_permissions: INDEX_PERM_WRITE_AND_DELETE backfill_error_message: ""]
I0902 07:56:36.255028 2602785 tablet.cc:2097] Using grace margin of 1000ms, original deadline: 59.787s
I0902 07:56:36.390646 2602765 tablet.cc:2174] Begin BackfillIndexesForYsql at { physical: 1662105396035999 } from <start-of-the-tablet> for [table_id: "00004000000030008000000000004105" version: 0 is_local: false hash_column_count: 0 range_column_count: 0 is_unique: false indexed_table_id: "0000400000003000800000000000402a" use_mangled_column_name: false index_permissions: INDEX_PERM_WRITE_AND_DELETE backfill_error_message: ""]

Not sure why the same start message was logged twice.

def- commented 2 years ago

I'll just upload the full yugabyte-data directory: https://drive.google.com/file/d/1aBOkQ-NSfJ565nHUPYP8jIA8Pmn_3hhP/view?usp=sharing

def- commented 2 years ago

Happened again on an RF1:

yugabyte=# SELECT * FROM pg_stat_activity;
 datid |    datname    |  pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type |    wait_event    |        state        | backend_xid | backend_xmin |                                                                                                                                                                                            query                                                                                                                                                                                             |  backend_type
-------+---------------+--------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------------+---------------------+-------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------
 22995 | db_lst_370325 | 869853 |    13287 | yugabyte | lst_370325_i     | 127.0.0.1   |                 |       41722 | 2022-09-08 13:47:50.336027+00 |                               | 2022-09-08 14:09:26.657113+00 | 2022-09-08 14:09:26.661356+00 | Client          | ClientRead       | idle                |             |              | COMMIT                                                                                                                                                                                                                                                                                                                                                                                       | client backend
 22995 | db_lst_370325 | 869855 |    13287 | yugabyte | lst_370325_i     | 127.0.0.1   |                 |       41724 | 2022-09-08 13:47:50.337525+00 | 2022-09-08 14:09:28.58866+00  | 2022-09-08 14:09:28.58866+00  | 2022-09-08 14:09:28.588743+00 |                 |                  | active              |             |              | CREATE INDEX CONCURRENTLY idx_tg4_0_c0_float8_c9_boolean_c6_float4_c27_varchar_c1_numeric_c12_varchar_c17_numeric_c22_float8 ON tg4_0 USING lsm (c0_float8 DESC NULLS FIRST, c9_boolean ASC NULLS LAST, c6_float4 ASC NULLS FIRST, c27_varchar ASC NULLS FIRST, c1_numeric ASC NULLS LAST, c12_varchar ASC NULLS FIRST, c17_numeric DESC NULLS FIRST, c22_float8 ASC NULLS LAST) WHERE TRUE; | client backend
 13288 | yugabyte      | 890510 |    13287 | yugabyte | ysqlsh           | 127.0.0.1   |                 |       41846 | 2022-09-08 16:10:18.660385+00 | 2022-09-08 16:10:41.149582+00 | 2022-09-08 16:10:41.149582+00 | 2022-09-08 16:10:41.14962+00  |                 |                  | active              |             |            3 | SELECT * FROM pg_stat_activity;                                                                                                                                                                                                                                                                                                                                                              | client backend
 22995 | db_lst_370325 | 869861 |    13287 | yugabyte | lst_370325_i     | 127.0.0.1   |                 |       41728 | 2022-09-08 13:47:50.340063+00 |                               | 2022-09-08 14:09:25.541895+00 | 2022-09-08 14:09:25.543896+00 | Client          | ClientRead       | idle                |             |              | COMMIT                                                                                                                                                                                                                                                                                                                                                                                       | client backend
 22995 | db_lst_370325 | 869866 |    13287 | yugabyte | lst_370325_i     | 127.0.0.1   |                 |       41734 | 2022-09-08 13:47:50.343454+00 | 2022-09-08 14:09:32.010303+00 | 2022-09-08 14:09:32.039314+00 | 2022-09-08 14:09:32.039886+00 | Client          | ClientRead       | idle in transaction |             |              | savepoint sp_1;                                                                                                                                                                                                                                                                                                                                                                              | client backend
 22995 | db_lst_370325 | 869868 |    13287 | yugabyte | lst_370325_i     | 127.0.0.1   |                 |       41738 | 2022-09-08 13:47:50.34557+00  |                               | 2022-09-08 14:09:25.961501+00 | 2022-09-08 14:09:26.155547+00 | Client          | ClientRead       | idle                |             |              | DROP INDEX idx_tg2_0_c14_date_c5_date_c13_numeric_c11_bigint_c12_float8;                                                                                                                                                                                                                                                                                                                     | client backend
       |               | 829823 |          |          |                  |             |                 |             | 2022-09-08 11:58:08.433195+00 |                               |                               |                               | Activity        | CheckpointerMain |                     |             |              |                                                                                                                                                                                                                                                                                                                                                                                              | checkpointer
(7 rows)