yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.

[DocDB] Observed compaction thread utilising large amount of CPU causing failure of LOAD phase on 2 core machines #16031

Open mangesh-at-yb opened 1 year ago

mangesh-at-yb commented 1 year ago

Jira Link: DB-5433

Description

While comparing table limits between a YB colocated database and a YB non-colocated database, we observed high CPU utilisation on the leader node for the colocated database with 250 tables. Flame graphs show the "Priority-worker" thread kicking in and consuming the CPU, so the sysbench LOAD phase fails with an RPC timeout. This behaviour was not observed with the non-colocated database: the sysbench LOAD phase completes without errors and the whole workload finishes.

Setup:

yugabyte=# create database colodb with colocated=true;
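To confirm the database was actually created as colocated before running the workload, a quick check from the client machine can help. This is a minimal sketch; the ysqlsh path and the `yb_is_database_colocated()` YSQL helper are assumptions about this build:

```sh
# Hypothetical sanity check: connect to the new database and ask whether it is colocated.
# The ysqlsh location and the helper function are assumptions, not taken from the report.
./bin/ysqlsh -h <YB_NODE> -p 5433 -U yugabyte -d colodb \
  -c "SELECT yb_is_database_colocated();"
```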

- **Execute the commands below from the client machine:**

- Cleanup (if the cluster has been used previously):

./src/sysbench ./src/lua/oltp_multi_insert.lua --table_size=500000 --manual_range_splitting=false --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=250 --pgsql-db=colodb --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=60 --time=180 --pgsql-password=Password321! --db-driver=pgsql --pgsql-port=5433 --pgsql-host= --threads=1 cleanup;

- CREATE phase:  

./src/sysbench ./src/lua/oltp_multi_insert.lua --table_size=500000 --manual_range_splitting=false --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=250 --pgsql-db=colodb --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=60 --time=180 --pgsql-password=Password321! --db-driver=pgsql --pgsql-port=5433 --pgsql-host= --threads=1 create;


- Get the leader, using the "yb_stats" command from the latest yb_stats release: https://github.com/fritshoogland-yugabyte/yb_stats/releases/download/v0.9.3/yb_stats-0.9.3-1.el7.x86_64.rpm

yb_stats --hosts --get-coloc-leader-host colodb
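If yb_stats is not at hand, a rough alternative is to look for the colocated parent tablet and its leader on each tablet server's web UI. This is only a sketch; the port, the /tablets page, and the grep patterns are assumptions, not verified against this build:

```sh
# Hypothetical lookup: scan each tserver's /tablets page for the colocated parent
# tablet and check which node reports the LEADER role for it.
for host in <YB_NODE1> <YB_NODE2> <YB_NODE3>; do
  echo "== $host =="
  curl -s "http://$host:9000/tablets" | grep -i "colocated.parent" | grep -i "leader"
done
```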

- LOAD phase, which consumes CPU and fails with an RPC timeout:

./src/sysbench ./src/lua/oltp_multi_insert.lua --table_size=500000 --manual_range_splitting=false --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=250 --pgsql-db=colodb --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=60 --time=180 --pgsql-password=Password321! --db-driver=pgsql --pgsql-port=5433 --pgsql-host= --threads=2 load;


- Repeat the same steps with a non-colocated database, specifying all three YB_NODES and setting LOAD threads to 6 (see the sketch below).
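For the non-colocated comparison run, the database is created without the colocated option; the database name below is illustrative only:

```sh
# Hypothetical counterpart of the colocated setup above; "normdb" is an assumed name.
./bin/ysqlsh -h <YB_NODE> -p 5433 -U yugabyte -c "create database normdb;"
```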

### Observed: 

- 100% CPU utilisation on the YB leader node; after loading some tables, sysbench reports the RPC timeout error below for one of the INSERT queries in the LOAD phase.

"FATAL: PQexec() failed: 7 Timed out: Perform RPC (request call id 3636) to 172.151.21.120:9100 timed out after 120.000s",

- The flame graph shows that, in the colocated-db case, the "Priority-worker" thread kicks in and consumes 100% CPU.

- Also tried setting "rocksdb_level0_file_num_compaction_trigger=20"; apart from a decrease in CPU% for "Priority-worker", no major difference was observed (see the sketch after this list).

- With the non-colocated db, sysbench completes the LOAD phase and executes the RUN phase successfully. The LOAD phase command used for the non-colocated db is below; note "--pgsql-host=<YB_NODES>" and "--threads=6", and specify all three YB nodes.

./src/sysbench ./src/lua/oltp_multi_insert.lua --table_size=500000 --manual_range_splitting=false --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=250 --pgsql-db=colodb --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=60 --time=180 --pgsql-password=Password321! --db-driver=pgsql --pgsql-port=5433 --pgsql-host= --threads=6 load;
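For the `rocksdb_level0_file_num_compaction_trigger=20` experiment mentioned above, the flag can be set either at tserver startup or at runtime. A rough sketch; the yb-ts-cli location and the need for --force on this particular flag are assumptions:

```sh
# Hypothetical runtime change on the leader tserver (RPC port 9100, as in the logs below).
./bin/yb-ts-cli --server_address=<LEADER_NODE>:9100 \
  set_flag --force rocksdb_level0_file_num_compaction_trigger 20

# Or as a startup gflag on every tserver:
#   --rocksdb_level0_file_num_compaction_trigger=20
```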



### Expected 

- Compaction needs to be tuned for colocated databases so that it does not consume excessive CPU and cause these failures.
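Since the flame graphs point at the "Priority-worker" pool (the shared compaction/flush priority thread pool), one direction on small instances is to cap that pool explicitly. This is a sketch under the assumption that the `priority_thread_pool_size` gflag is available in this build; the value of 1 is only illustrative:

```sh
# Hypothetical additions to the yb-tserver flags (e.g. in the tserver flags file);
# flag names and values are assumptions to illustrate the tuning direction only.
--priority_thread_pool_size=1
--rocksdb_level0_file_num_compaction_trigger=20
```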

### Extra info:
- Flame-graph analysis:

<img width="1125" alt="Screenshot 2023-02-08 at 4 13 32 PM" src="https://user-images.githubusercontent.com/98325393/217507611-36a27fc1-521a-4d1d-aaa0-4ab24f5e45db.png">

- Flame graphs:
**colo-db-compaction-setto-20** 
![colo-db-compaction-setto-20](https://user-images.githubusercontent.com/98325393/217507788-d5970d59-bdcf-4404-a6bd-c8ec62439e85.svg)
**colo-db-default-compaction-setting**
![colo-db-default-compaction-setting](https://user-images.githubusercontent.com/98325393/217507797-693ef490-ba03-4bb8-b204-4125541dd4a8.svg)
**non-colo-db-default-compaction-setting**
![non-colo-db-default-compaction-setting](https://user-images.githubusercontent.com/98325393/217507804-11b0f475-daf1-4101-ad1d-6f0e9ec14b66.svg)

[DB-5433]: https://yugabyte.atlassian.net/browse/DB-5433?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
mangesh-at-yb commented 1 year ago

Just to update: started the same test with m6i.xlarge, performing LOAD against the leader node with 2 loader threads; it fails with the RPC timeout below.

FATAL: PQexec() failed: 7 Timed out: Perform RPC (request call id 34991) to 172.151.31.99:9100 timed out after 120.000s
FATAL: failed query was: INSERT INTO sbtest80(k, c, pad) VALUES(81362, '67233889824-69552176477-28869115654-24051879237-73600010468-34709815261-74547841183-29152143741-75316507750-35270468914',
 '28039815892-10506914127-01177189312-92370775153-24182095649'),(281116, '76517898126-99154701948-93433972356-51134860849-96140225563-80068134487-03296725878-04260751091-57963222982-32820740264
', '08832037379-06236232545-98682650486-38366692060-27387748652'),(256067, '26841799804-09101657156-46679981132-01313592986-23120962785-32351882912-29372291556-16579651989-70843672242-672223954
91', '09794392307-02377482891-72609903204-57652824248-31448813842'),(164596, '41155833876-54833679367-12688509004-95476315873-88863607955-21568341205-70296703406-75843533786-49657038772-8731628
9786', '14841117821-46154343544-15131649708-80987487852-72344841042'),(141826, '55769726639-27865016028-26318146463-77112472586-43911179340-52399438115-73309179042-75375546631-17862281396-44708
041265', '63516899867-73320175478-21616642148-31566020766-00428438765'),(32938, '38536602463-72291221577-78922577067-03958150007-08925666153-86839059383-16814878924-40629702547-61664445361-2681
9459891', '75919606341-25419791426-54451169620-60491850723-47554203608'),(188135, '01240105384-38610285546-28895885862-39750459200-79430589681-63311337989-45609739762-67896798035-65429861883-03
420788253', '14276143083-93670758046-14212357199-45825934330-29689532355'),(479095, '93956438651-82064706422-42968544493-48255885026-59541580268-11152516587-07752272949-10687205781-91716333614-
39914820209', '40715264434-66776861582-83397599007-54144208742-80876793638'),(12462, '04484011383-81255326637-12119672876-48063185255-43775827402-25740400337-16394201242-52223088290-89082225215-30832774225', '01770117462-77491255066-26617067206-87965138888-58256337050'),(132946, '49710695896-76238473030-32257718616-89648342587-83664369331-95636977557-63868342863-68033720305-30503438787-77095568569', '88457956182-76776658069-54189131138-21121020654-29032977247'),(169953, '91137050746-82986102292-42281078695-72011392800-33352075149-15110055965-63868065975-30470935228-35974196028-82890579770', '12979479049-96801354504-24311885580-77707510210-25523178872'),(142779, '25265438300-55509039232-96462224174-46042551945-85119999707-43627078509-99229972172-37643985747-36027400398-64844019450', '52287234036-99509074031-71142601529-74214564501-77740432407'),(20506, '02939666602-20967964594-51747522536-09067765334-55059068255-53321227748-09919284493-78720481614-93118329722-07519608550', '79445299424-44503442261-08996185269-56126983643-08332744610'),(11038, '31063839953-93153902682-20499750632-97469043055-16665037992-63754007156-39866923196-66114037391-70467041544-82733601737', '52689430005-70504946589-19774721461-15147623550-76743112897'),(101894, '92817842865-26283438836-52920303781-23963025680-70280723346-74673818912-47805883925-18979410879-38183026010-82298690630', '45161285441-10530725330-20517640891-11770433334-95423171571'),(423002, '10946512329-84162598761-53782501567-98872669053-61928799200-05439958623-87512146283-40534859483-22157214533-44343943287', '17443897903-92117678752-55241837758-89139023199-20391652876'),(96489, '74789880154-78236679376-56035627962-02138884704-20982077484-84406102171-47776061046-94883288080-56518599335-05907758423', '62474093857-02067724911-53006902156-04514990808-58250808152'),(16718, '77926413585-55729478194-74333585317-16761055585-16370766336-11187231121-94920094246-94397622209-94755002183-98432804587', '67505174974-17720891574-42489241463-63358142148-37493021416'),(258850, '61189486970-01932114051-53565742880-01312536006-09864506906-62938842469-72130121967-23362558954-91546531283-18876423944', '42068530261-61702009618-55001767010-14412772530-50194555877'),(370459, '20845682159-77783845950-94635203435-48500785246-88283015886-70019312457-40760859054-77411081065-12953821399-22024288233', '95047801012-16280468408-20028413139-88362970966-34911694803'),(334184, '75571875744-79360116657-50801186766-28582138169-39408087165-81696748545-67714320148-62375739668-29340183384-09744699221', '30718981692-0FATAL: `sysbench.cmdline.call_command' function failed: ./src/lua/oltp_common.lua:307: db_bulk_insert_next() failed

sysbench commands used during the CLEANUP, CREATE and LOAD phases; 172.151.31.99 is the leader node for "colodb".

echo -e " \n ===================== `date`: CLEANUP ========================== \n";
./src/sysbench ./src/lua/oltp_read_only.lua --table_size=500000 --manual_range_splitting=false --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=500 --pgsql-db=colodb --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=60 --time=180 --pgsql-password=Password321! --db-driver=pgsql --pgsql-port=5433 --pgsql-host=172.151.28.9,172.151.31.99,172.151.17.63 --threads=1 cleanup;
echo -e " \n ===================== `date`: CREATE ========================== \n";
./src/sysbench ./src/lua/oltp_read_only.lua --table_size=500000 --manual_range_splitting=false --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=500 --pgsql-db=colodb --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=60 --time=180 --pgsql-password=Password321! --db-driver=pgsql --pgsql-port=5433 --pgsql-host=172.151.28.9,172.151.31.99,172.151.17.63 --threads=1 create;
echo -e " \n ===================== `date`: LOAD ========================== \n";
./src/sysbench ./src/lua/oltp_read_only.lua --table_size=500000 --manual_range_splitting=false --range_selects=false --pgsql-user=yugabyte --range_size=1000 --index_updates=10 --tables=500 --pgsql-db=colodb --warmup-time=600 --point_selects=10 --serial_cache_size=1000 --create_secondary=false --range_key_partitioning=true --non_index_updates=10 --thread-init-timeout=60 --time=180 --pgsql-password=Password321! --db-driver=pgsql --pgsql-port=5433 --pgsql-host=172.151.31.99 --threads=2 load;

Universe: http://10.9.131.126/universes/c99c433f-3707-4afb-b6e0-c3c539d13cc3/overview

mangesh-at-yb commented 1 year ago

After looking into the logs from the leader node, below are the observations, with "m6i.xlarge" and 2 LOAD threads running against the leader node.

W0210 03:31:01.916704 9077 leader_election.cc:283] T c5753442ed2944d1b44d17ae0a6508f1 P fd78f0a4f99b4e178c97c72c20886a3d [CANDIDATE]: Term 3 pre-election: Tablet error from VoteRequest() call to peer 7f62325eac984c249eb8e9c86e3ec3ca: Illegal state (yb/tserver/service_util.cc:285): Tablet c5753442ed2944d1b44d17ae0a6508f1 not RUNNING: BOOTSTRAPPING (tablet server error 12) (raft group state error 0)

W0210 03:31:08.152658 9076 leader_election.cc:277] T c5753442ed2944d1b44d17ae0a6508f1 P fd78f0a4f99b4e178c97c72c20886a3d [CANDIDATE]: Term 3 pre-election: RPC error from VoteRequest() call to peer d960bcf92a2644f097f726bc7ad5b999: Network error (yb/util/net/socket.cc:540): recvmsg error: Connection refused (system error 111)

W0210 03:32:36.502919 9772 column_family.cc:652] T c5753442ed2944d1b44d17ae0a6508f1 P fd78f0a4f99b4e178c97c72c20886a3d [R]: [default] Increasing compaction threads because we have 14 level-0 files

W0210 03:36:30.510851 9047 remote_bootstrap_service.cc:134] Error handling RemoteBootstrapService RPC request from 172.151.28.9:34681: Not found (yb/tserver/remote_bootstrap_session.cc:496): Unable to get piece of log segment: Already sent active log segment, don't send 1080

- The query timeout occurs with the sysbench error message below:

FATAL: PQexec() failed: 7 Timed out: Perform RPC (request call id 11618) to 172.151.17.63:9100 timed out after 120.000s

mangesh-at-yb commented 1 year ago

Below is the status of further perf tests with different instance types:

Screenshot 2023-02-10 at 5 48 16 PM
mangesh-at-yb commented 1 year ago

The last one has also failed; the updated matrix is as follows:

Screenshot 2023-02-10 at 6 31 14 PM

Last messages logged in yb-tserver.WARNING on the leader node against which LOAD was running with 3 threads:

W0210 12:53:13.346904 18964 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 12 level-0 files
W0210 12:53:20.795806 18964 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 10 level-0 files
W0210 12:55:37.713780 18925 yb_rpc.cc:336] Call yb.tserver.PgClientService.Perform 172.151.24.247:52360 => 172.151.24.247:9100 (request call id 78428) took 147417ms (client timeout 120000ms).
W0210 12:55:38.246479 18920 yb_rpc.cc:336] Call yb.tserver.PgClientService.Perform 172.151.24.247:52362 => 172.151.24.247:9100 (request call id 88142) took 144076ms (client timeout 120000ms).
W0210 12:55:38.266140 18963 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 10 level-0 files
W0210 12:55:38.274075 18810 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 10 level-0 files
W0210 12:55:38.285270 18964 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 11 level-0 files
W0210 12:55:41.356426 18963 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 10 level-0 files
W0210 12:55:47.190238 18963 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 10 level-0 files
W0210 12:55:47.510357 18964 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 11 level-0 files
W0210 12:55:47.549558 18928 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 11 level-0 files
W0210 12:55:47.583182 18963 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 12 level-0 files
W0210 12:55:53.746605 18963 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 10 level-0 files
W0210 12:55:53.797255 18923 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 10 level-0 files
W0210 12:55:53.826100 18964 column_family.cc:652] T eeeb82da457e45b58024750ef0dc85ea P 5e55d85c8aad4df9b4bf4c41405d375e [R]: [default] Increasing compaction threads because we have 11 level-0 files
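To quantify how badly these warnings flood the log, a simple count over the warning file can help; the log path below is an assumption based on the data directory seen in other messages, so adjust it to the actual --log_dir:

```sh
# Count the compaction-thread warnings and the slow Perform RPCs on the leader node.
grep -c "Increasing compaction threads" /mnt/d0/yb-data/tserver/logs/yb-tserver.WARNING
grep -c "(client timeout 120000ms)" /mnt/d0/yb-data/tserver/logs/yb-tserver.WARNING
```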
mangesh-at-yb commented 1 year ago

Attaching the yb-tserver.WARNING log from the leader node (m6i.2xlarge), which is flooded with the messages below when we saw the RPC timeout:

Huqicheng commented 1 year ago

From my repro, in the following time frame, CPU usage is close to 100% and the load command times out:

I0222 05:14:34.648023 185686 write_operation.cc:91] Apply operation failed: Operation failed. Try again (yb/tablet/transaction_participant.cc:216): Transaction was recently aborted: 43cdfe3a-74cb-4ea9-822f-b9baaa9ca551: 40001 (pgsql error 40001) (transaction error 1)
I0222 05:14:34.648877 185686 tablet_service.cc:351] T c1fc56c31e084345b715875912ffee86 P 26dfb2c83b3a411ca7fac8b259cfc07d [state=RUNNING]: Write failed: Operation failed. Try again (yb/tablet/transaction_participant.cc:216): Transaction was recently aborted: 43cdfe3a-74cb-4ea9-822f-b9baaa9ca551: 40001 (pgsql error 40001) (transaction error 1)
I0222 05:14:34.648927 185686 tablet_rpc.cc:458] Operation failed. Try again (yb/tablet/transaction_participant.cc:216): Failed Write(tablet: c1fc56c31e084345b715875912ffee86, num_ops: 2691, num_attempts: 1, txn: 43cdfe3a-74cb-4ea9-822f-b9baaa9ca551, subtxn: [none]) to tablet c1fc56c31e084345b715875912ffee86 on tablet server { uuid: 26dfb2c83b3a411ca7fac8b259cfc07d private: [host: "10.9.203.53" port: 9100] cloud_info: placement_cloud: "aws" placement_region: "us-west-2" placement_zone: "us-west-2c" after 1 attempt(s): Transaction was recently aborted: 43cdfe3a-74cb-4ea9-822f-b9baaa9ca551: 40001 (pgsql error 40001) (transaction error 1)
W0222 05:14:34.651682 185686 yb_rpc.cc:336] Call yb.tserver.PgClientService.Perform 10.9.203.53:37962 => 10.9.203.53:9100 (request call id 76944) took 197094ms (client timeout 120000ms).
I0222 05:14:34.656157 185686 tablet.cc:930] T c1fc56c31e084345b715875912ffee86 P 26dfb2c83b3a411ca7fac8b259cfc07d: DoCleanupIntentFiles: Intents SST file will be deleted: { total_size: 64426657 base_size: 1580856 uncompressed_size: 170806164 name_id: 2553 db_path: /mnt/d0/yb-data/tserver/data/rocksdb/table-00004000000030008000000000000000.colocated.parent.uuid/tablet-c1fc56c31e084345b715875912ffee86.intents imported: 0 being_compacted: 0 column_family_name: default level: 0 smallest: { seqno: 1125903930969602 user_frontier: 0x000055678cd17dc0 -> { op_id: 12.294529 hybrid_time: { physical: 1677042673579379 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <invalid> primary_schema_version: <NULL> cotable_schema_versions: [{66730000-0000-0080-0030-000000400000, 0}, {5b730000-0000-0080-0030-000000400000, 0}] } } largest: { seqno: 1125903933795466 user_frontier: 0x000055678cd17b90 -> { op_id: 12.294735 hybrid_time: { physical: 1677042696592967 } history_cutoff: <invalid> hybrid_time_filter: <invalid> max_value_level_ttl_expiration_time: <initial> primary_schema_version: <NULL> cotable_schema_versions: [{66730000-0000-0080-0030-000000400000, 0}, {5b730000-0000-0080-0030-000000400000, 0}] } } }, max ht: { physical: 1677042696592967 }, min running transaction start ht: <max>
I0222 05:14:34.656200 185686 db_impl.cc:5435] T c1fc56c31e084345b715875912ffee86 P 26dfb2c83b3a411ca7fac8b259cfc07d [R]: [default] New memtable created with log file: #5683
Screenshot 2023-02-22 1:45:12 PM

Also, no compactions happen in this time frame:

Screenshot 2023-02-22 1:44:00 PM

Even the number of network packets dropped a lot:

Screenshot 2023-02-22 1:45:43 PM

It seems the high CPU usage is not related to compaction.

bmatican commented 1 year ago

@mangesh-at-yb @hbhanawat Is CPU % usage not something that we track for these experiments? I'd imagine that, had we seen we hit 100% CPU usage with this experiment, we'd start from there and aim not to hit 100% during the load phase by reducing concurrency?

@Huqicheng in your repro, could you check /threadz or run perf, during the window of 100% CPU usage / no compactions, etc, to see where the CPU is going?
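For reference, a minimal sketch of capturing that window: grab /threadz from the tserver web UI and sample the process with perf (the web UI port and the FlameGraph script locations are assumptions):

```sh
# Snapshot thread states during the 100% CPU window.
curl -s "http://<LEADER_NODE>:9000/threadz" -o threadz.html

# Sample the yb-tserver process for 60s and fold the stacks into a flame graph.
perf record -F 99 -g -p "$(pgrep -x yb-tserver)" -- sleep 60
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > tserver.svg
```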

hbhanawat commented 1 year ago

@bmatican The gist of this issue is that we are hitting 100% CPU with a single loader thread on 2 core machines. We cannot reduce it further :-)