yugabyte / yugabyte-db

YugabyteDB - the cloud native distributed SQL database for mission-critical applications.
https://www.yugabyte.com
Other
8.98k stars 1.07k forks source link

[DocDB][AutoTS][Upgrade] upgrade_ysql times out with automatic tablet splitting #15344

Closed def- closed 1 year ago

def- commented 1 year ago

Jira Link: DB-4481

Description

This is reproducible now in upgrade from 2.8.10.0-b21 to 2.17.1.0-b290 with automatic tablet splitting flag enabled. It should get turned on during upgrade since the new version supports automatic tablet splitting, but instead the upgrade_ysql command times out: https://phabricator.dev.yugabyte.com/D21514

testupgrade-aws-rf3-upgrade-2.8.10.0_21: Start
    (     0.487s) User Login : Success
    (     0.143s) Refresh YB Version : Success
    (    90.453s) Setup Provider : Success
    (     0.068s) Updating Health Check Interval to 60000 sec : Success
    (   451.200s) Create universe dfel-isd2501-d2fed16cf1-20221214-150042 : Success
    (    40.293s) Start sample workloads : Success
    (    11.574s) Collect tablets count : Success
    (     5.800s) Collect tablets count : Success
    (     6.023s) Collect tablets count : Success
    (    14.197s) Collect tablets count : Success
    (   474.790s) Creating V2 backups in S3 : Success
    (   430.621s) Verify geo-partitioned objects : Success
    (   540.686s) Executing yb-admin upgrade_ysql : >>> Integration Test Failed <<< 
Remote command timed out.
    (    21.911s) Saved server log files and keys at /share/jenkins/workspace/itest-system-developer/logs/2.17.1.0_testupgrade-aws-rf3-upgrade-2.8.10.0_21_20221214_153853 : Success
    (    95.543s) Destroy universe : Success
    (     0.219s) Check and stop workloads : Success
testupgrade-aws-rf3-upgrade-2.8.10.0_21: End

itest logs show that this is just a standard yb-admin upgrade_ysql failing:

2022-12-14 15:35:19,644 test_universe_base.py:528 INFO testupgrade-aws-rf3-upgrade-2.8.10.0_21 Command:=/home/yugabyte/master/bin/yb-admin --timeout_ms 120000 --master_addresses 10.9.99.151:7100,10.9.133.111:7100,10.9.194.73:7100 upgrade_ysql
2022-12-14 15:35:19,645 ssh_utils.py:272 INFO testupgrade-aws-rf3-upgrade-2.8.10.0_21 Running cmd: /home/yugabyte/master/bin/yb-admin --timeout_ms 120000 --master_addresses 10.9.99.151:7100,10.9.133.111:7100,10.9.194.73:7100 upgrade_ysql
2022-12-14 15:37:19,738 ssh_utils.py:291 INFO testupgrade-aws-rf3-upgrade-2.8.10.0_21 Exception for /home/yugabyte/master/bin/yb-admin --timeout_ms 120000 --master_addresses 10.9.99.151:7100,10.9.133.111:7100,10.9.194.73:7100 upgrade_ysql
2022-12-14 15:37:19,738 ssh_utils.py:292 INFO testupgrade-aws-rf3-upgrade-2.8.10.0_21 Time-out Error, time limit setting: 120 seconds
2022-12-14 15:37:19,739 ssh_utils.py:293 INFO testupgrade-aws-rf3-upgrade-2.8.10.0_21 OUT=
2022-12-14 15:37:19,739 test_universe_base.py:536 ERROR testupgrade-aws-rf3-upgrade-2.8.10.0_21 Remote command timed out.
2022-12-14 15:37:19,739 test_base.py:165 ERROR testupgrade-aws-rf3-upgrade-2.8.10.0_21 ITEST FAILED testupgrade-aws-rf3-upgrade-2.8.10.0_21 : RuntimeError(RuntimeError('Remote command timed out.'))
2022-12-14 15:37:19,742 test_base.py:166 INFO testupgrade-aws-rf3-upgrade-2.8.10.0_21 Traceback (most recent call last):
  File "/var/lib/jenkins/code/devops/venv/lib/python3.7/site-packages/paramiko/channel.py", line 699, in recv
    out = self.in_buffer.read(nbytes, self.timeout)
  File "/var/lib/jenkins/code/devops/venv/lib/python3.7/site-packages/paramiko/buffered_pipe.py", line 164, in read
    raise PipeTimeout()
paramiko.buffered_pipe.PipeTimeout

Tserver logs might indicate the source of the problem:

E1214 15:27:59.173772 32354 meta_cache.cc:502] T 7bc18df3f77548d1a86b941aa12c525c: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1837): Tablet 7bc18df3f77548d1a86b941aa12c525c not found
E1214 15:27:59.173954 32354 meta_cache.cc:502] T 7bc18df3f77548d1a86b941aa12c525c: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1837): Tablet 7bc18df3f77548d1a86b941aa12c525c not found
E1214 15:27:59.175027 32354 meta_cache.cc:502] T 7bc18df3f77548d1a86b941aa12c525c: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1837): Tablet 7bc18df3f77548d1a86b941aa12c525c not found
E1214 15:27:59.179129 32354 meta_cache.cc:502] T 7bc18df3f77548d1a86b941aa12c525c: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1837): Tablet 7bc18df3f77548d1a86b941aa12c525c not found
E1214 15:27:59.180006 32355 meta_cache.cc:502] T 7bc18df3f77548d1a86b941aa12c525c: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1837): Tablet 7bc18df3f77548d1a86b941aa12c525c not found

Full logs: https://drive.google.com/file/d/1dRNWRgExTOvlSysyrA5Y0yDAlt0PEMTt/view?usp=share_link

rthallamko3 commented 1 year ago

@kripasreenivasan , @Arjun-yb , Can you folks confirm if this issue repros in the newer builds? Wanted to understand if this is an issue or not.

rthallamko3 commented 1 year ago

Per @Arjun-yb , it works fine now.

Verified
test: testupgrade-aws-rf3-upgrade-2.12.4.2_1 and testupgrade-aws-rf3-upgrade-2.8.11.0_6
in 2.17.2.0-b209 and 2.17.3.0-b72, it is working fine.

Closing the issue for now.

rthallamko3 commented 1 year ago

@Arjun-yb to clarify if this repros with updated automation.

rjalan-yb commented 1 year ago

Hey @rthallamko3, this issue is reproducible on master too and is seen while upgrade from a lot of older versions: https://jenkins.dev.yugabyte.com/job/itest-system-developer/4379/

rthallamko3 commented 1 year ago

@rjalan-yb , Are the recent failures related to tablet splitting failures too? Can you provide the link to the logs? The above Jenkins run points to a number of tests.

rjalan-yb commented 1 year ago

Hey @rthallamko3 @Arjun-yb, we do not see the exact remote timeout error, but we do see a core file generated and also the tablet not found error during upgrade and upgrade fails.

https://jenkins.dev.yugabyte.com/job/itest-system-developer/4586/artifact/logs/2.17.3.0_testupgrade-aws-rf3-upgrade-2.11.2.0_89_20230328_211020/10.9.130.67/

yb-tserver.ERROR

E0328 22:19:29.341523 20641 meta_cache.cc:502] T 8b14e126a0eb492a81f5a7dff4f87660: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1848): Tablet 8b14e126a0eb492a81f5a7dff4f87660 not found
E0328 22:19:29.365556 20642 meta_cache.cc:502] T 8b14e126a0eb492a81f5a7dff4f87660: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1848): Tablet 8b14e126a0eb492a81f5a7dff4f87660 not found

Core file generated: https://jenkins.dev.yugabyte.com/job/itest-system-developer/4586/artifact/logs/2.17.3.0_testupgrade-aws-rf3-upgrade-2.11.2.0_89_20230328_211020/10.9.130.67/

[New LWP 54160]
Core was generated by `/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --'.
Program terminated with signal SIGABRT, Aborted.
#0  0x0000000000476ca9 in raise ()

Thread 1 (LWP 54160):
#0  0x0000000000476ca9 in raise ()
No symbol table info available.
#1  0x0000000000437f20 in abort ()
No symbol table info available.
#2  0x00000000004012d6 in readFile(std::string) ()
No symbol table info available.
#3  0x00000000004017f6 in patchElf() ()
No symbol table info available.
#4  0x00000000004023c6 in main ()
No symbol table info available.

Error in automation log: https://jenkins.dev.yugabyte.com/job/itest-system-developer/4586/artifact/logs/2.17.3.0_testupgrade-aws-rf3-upgrade-2.11.2.0_89_20230328_211020/testupgrade-aws-rf3-upgrade-2.11.2.0_89myk7homf.log

interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so yb-server\n    /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/post_install.sh: line 23: 55646 Aborted                 (core dumped) "$patchelf_path" --set-interpreter "$ld_path" "$f"\n  stderr_lines: <omitted>\n  stdout: \'\'\n  stdout_lines: <omitted>.')
2023-03-28 22:16:01,590 test_base.py:180 INFO testupgrade-aws-rf3-upgrade-2.11.2.0_89 Traceback (most recent call last):
  File "/var/lib/jenkins/code/internal-services/itest/src/test_base.py", line 175, in execute_steps
    step.call()
  File "/var/lib/jenkins/code/internal-services/itest/src/test_base.py", line 54, in call
    ret = self.function()
  File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/system_tests/test_upgrade.py", line 260, in do_custom_work
    universe = self.do_software_upgrade(universe)
  File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/test_universe_base.py", line 2505, in do_software_upgrade
    self.yw_proxy.do_upgrade(universe, upgrade_json)
  File "/var/lib/jenkins/code/internal-services/itest/src/yw_proxy.py", line 476, in do_upgrade
    self.wait_for_task(response_json['taskUUID'], sys._getframe().f_code.co_name)
  File "/var/lib/jenkins/code/internal-services/itest/src/yw_proxy.py", line 629, in wait_for_task
    raise RuntimeError(error_msg)
RuntimeError: wait_for_task: Failed task with errors in 211.24973511695862s:
Failed to execute task {"platformVersion":"2.17.3.0-b112","sleepAfterMasterRestartMillis":180000,"sleepAfterTServerRestartMillis":180000,"nodeExporterUser":"prometheus","deviceInfo":{"volumeSize":50,"numVolumes":1,"storageClass":"","storageType":"GP2"},"universeUUID":"aa1df494-2e65-4cde-8951-4c5015dee73f","enableYbc":false,"installYbc":false,"ybcInstalled":false,"encryptionAtRestConfig":{"encryptionAtRestEnabled":false,"opType":"UNDEFINED","type":"DATA_KEY"},"communicationPorts":{"masterHttpPort":7000,"masterRpcPor..., hit error:

Error occurred. Code: 1. Output: ybops.common.exceptions.YBOpsRuntimeError: Runtime error: Playbook run of configure-cluster-server.yml against 10.9.130.67, failed with return code 2
TASK [install_package : Download package | Run the post install script] ********
Tuesday 28 March 2023  22:15:50 +0000 (0:00:01.215)       0:01:28.255 ********* 
fatal: [10.9.130.67]: FAILED! => changed=true 
  cmd:
  - /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/post_install.sh
  delta: '0:00:00.911277'
  end: '2023-03-28 22:15:51.970985'
  msg: non-zero return code
  rc: 134
  start: '2023-03-28 22:15:51.059708'
  stderr: |-
    + /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so ldb
    warning: working around a Linux kernel bug by creating a hole of 1245184 bytes in ldb
    + /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-rpath /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/postgres:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb-thirdparty:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/linuxbrew/lib ldb
    + /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so log-dump
 ...by creating a hole of 1245184 bytes in yb-ysck
    + /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-rpath /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/postgres:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb-thirdparty:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/linuxbrew/lib yb-ysck
    + /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so redis-cli
    + /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-rpath /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/postgres:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb-thirdparty:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/linuxbrew/lib redis-cli
    + /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so yb-server
    /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/post_install.sh: line 23: 55646 Aborted                 (core dumped) "$patchelf_path" --set-interpreter "$ld_path" "$f"
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>.
arybochkin commented 1 year ago

Core was generated by `/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --'

at least this part is a know issue with patchelf: "patchelf v0.9 needs 2 x file size + 32MB". So, it is insufficient space either before the upgrade has been started or during the upgrate. Additionally, If automatic tablet splitting is turned on and it may work during upgrade (it requires a quorum for split operation), then post-split compaction may start consume some space to compact tablet's files, in worst case it may need additional total size of tablet SST free space (with defaults) -- should be confirmed by some test. Maybe it makes sense to check there's enough space to have a copy of N largest tablets (where N is 1 by default).

rthallamko3 commented 1 year ago

@rjalan-yb , Based on @arybochkin 's comments , it seems like a better way to avoid upgrades from running out of memory and disk is to add checks in advance before upgrade and that is tracked by https://yugabyte.atlassian.net/browse/PLAT-5561. Seems like this issue was hit before too, Can we add the 2.18 blocker tag to the platform item instead?

bmatican commented 1 year ago

@rjalan-yb what's the amount of size on disk available for this? Is this just a matter of the nodes being under-sized? Trying to asses if this should really be a 2.18 blocker @rthallamko3

rthallamko3 commented 1 year ago

Tagging @Arjun-yb , who is going to try running the test on larger instances.

rjalan-yb commented 1 year ago

Hey @rthallamko3, with a larger instance, we do not see this issue. However, we do see some fatals and error files generated for 2.14 and later versions:

E0331 07:30:35.525130 45074 tablet_metadata.cc:669] T da3743351a4f4e29987363852dd9ac0c P 1e7cb3896ac948a0bb8a6d7b975518f9: Failed to destroy regular DB at: /mnt/d0/yb-data/tserver/data/rocksdb/table-0000404d000030008000000000004147/tablet-da3743351a4f4e29987363852dd9ac0c: IO error (yb/rocksdb/util/env_posix.cc:320): /mnt/d0/yb-data/tserver/data/rocksdb/table-0000404d000030008000000000004147/tablet-da3743351a4f4e29987363852dd9ac0c/LOCK: No such file or directory
E0331 07:30:35.529448 45074 tablet_metadata.cc:669] T dc87ce9f9fb441c295424ac82f126af7 P 1e7cb3896ac948a0bb8a6d7b975518f9: Failed to destroy regular DB at: /mnt/d0/yb-data/tserver/data/rocksdb/table-0000404d000030008000000000004147/tablet-dc87ce9f9fb441c295424ac82f126af7: IO error (yb/rocksdb/util/env_posix.cc:320): /mnt/d0/yb-data/tserver/data/rocksdb/table-0000404d000030008000000000004147/tablet-dc87ce9f9fb441c295424ac82f126af7/LOCK: No such file or directory
E0331 07:31:53.023344 45213 tablet_peer.cc:1580] Peer 909f4d2eaf5748ecabdea20a0930788b is a VOTER Not changing its role after remote bootstrap

Fatal:

F20230331 07:29:58 ../../src/yb/yql/pggate/pggate.cc:1726] Check failed: _s.ok() Bad status: Aborted (yb/rpc/reactor.cc:94): Shutdown connection (system error 108)
    @     0x7fead8077e1c  google::LogDestination::LogToSinks()
    @     0x7fead8077844  google::LogMessage::SendToLog()
    @     0x7fead8078fd8  google::LogMessage::Flush()
    @     0x7fead80797bf  google::LogMessageFatal::~LogMessageFatal()
    @     0x7feadd0ba397  yb::pggate::PgApiImpl::ClearSeparateDdlTxnMode()
    @           0xb2d20f  YBTxnDdlProcessUtility
    @           0x963ad4  PortalRunUtility
    @           0x9638a7  PortalRunMulti
    @           0x962e63  PortalRun
    @           0x95ec06  yb_exec_simple_query_impl
    @           0x95f3c4  yb_exec_query_wrapper_one_attempt
    @           0x95b97b  PostgresMain
    @           0x8b6fbb  BackendRun
    @           0x8b66ba  ServerLoop
    @           0x8b2a31  PostmasterMain
    @           0x7d2153  PostgresServerProcessMain
    @           0x4fb042  main
    @     0x7feadc088825  __libc_start_main
    @           0x4faf49  _start
    @              (nil)  (unknown)

We can constantly see this reproduced: https://jenkins.dev.yugabyte.com/job/itest-system-developer/4690/ The setup details can be found in Jenkins run. We can see the node is upgraded but the Jenkins run is failing because FATALS are present.

Complete Logs: https://drive.google.com/drive/folders/1GdXW6OZGejVt3N8ZlqBRiWoUVvtSM0lg?usp=sharing

rthallamko3 commented 1 year ago

@rjalan-yb , We have seen similar issues due to schema change operations and that issue is being fixed by #12926 (Pending backport to 2.14 branch, it should be done in the next couple of days).

Can you confirm if the master branch looks clean? If so, I can move the bug as DUP of #12926

rthallamko3 commented 1 year ago

@rjalan-yb , Ping on above.

rjalan-yb commented 1 year ago

Hey @rthallamko3, things with tablet splitting are working as expected in upgrade cases: Master Upgrade Jenkins run

All cases passed except 2. 2.0.11 -> CQLSH issue we know, 2.2 -> For some reason, it created same table twice in metadata and thus this command fails: '/home/yugabyte/master/bin/yb-admin --master_addresses 10.9.131.74:7100,10.9.195.151:7100,10.9.92.40:7100 list_tablets ysql.splitdb upg_copy_tab_2_2_7_0_0 0' returned error code 1.

It fails with this error:

Error: Invalid argument (yb/tools/yb-admin_client.cc:371): Namespace 'ysql.splitdb' has multiple tables named 'upg_copy_tab_2_2_7_0_0', specify table id instead

And it runs with table id on manual try and tablets were split. So not an issue with tablet splitting.

This is a known issue it seems as we have an article for that too: https://support.yugabyte.com/hc/en-us/articles/360061078832-Namespace-foo-of-type-ycql-not-found

It is only seen for 2.2 version and not any other version. @Arjun-yb is also checking this once.

rthallamko3 commented 1 year ago

Since the timeout issue is not happening anymore. Will close this issue. The invalid argument issue on 2.2 seems different from what this GH is trying to capture. Please create a separate issue for that if needed. cc @rjalan-yb