Closed def- closed 1 year ago
@kripasreenivasan , @Arjun-yb , Can you folks confirm if this issue repros in the newer builds? Wanted to understand if this is an issue or not.
Per @Arjun-yb , it works fine now.
Verified
test: testupgrade-aws-rf3-upgrade-2.12.4.2_1 and testupgrade-aws-rf3-upgrade-2.8.11.0_6
in 2.17.2.0-b209 and 2.17.3.0-b72, it is working fine.
Closing the issue for now.
@Arjun-yb to clarify if this repros with updated automation.
Hey @rthallamko3, this issue is reproducible on master too and is seen while upgrade from a lot of older versions: https://jenkins.dev.yugabyte.com/job/itest-system-developer/4379/
@rjalan-yb , Are the recent failures related to tablet splitting failures too? Can you provide the link to the logs? The above Jenkins run points to a number of tests.
Hey @rthallamko3 @Arjun-yb, we do not see the exact remote timeout error, but we do see a core file generated and also the tablet not found error during upgrade and upgrade fails.
yb-tserver.ERROR
E0328 22:19:29.341523 20641 meta_cache.cc:502] T 8b14e126a0eb492a81f5a7dff4f87660: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1848): Tablet 8b14e126a0eb492a81f5a7dff4f87660 not found
E0328 22:19:29.365556 20642 meta_cache.cc:502] T 8b14e126a0eb492a81f5a7dff4f87660: Received error from GetTabletStatus: Not found (yb/tserver/ts_tablet_manager.cc:1848): Tablet 8b14e126a0eb492a81f5a7dff4f87660 not found
Core file generated: https://jenkins.dev.yugabyte.com/job/itest-system-developer/4586/artifact/logs/2.17.3.0_testupgrade-aws-rf3-upgrade-2.11.2.0_89_20230328_211020/10.9.130.67/
[New LWP 54160]
Core was generated by `/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --'.
Program terminated with signal SIGABRT, Aborted.
#0 0x0000000000476ca9 in raise ()
Thread 1 (LWP 54160):
#0 0x0000000000476ca9 in raise ()
No symbol table info available.
#1 0x0000000000437f20 in abort ()
No symbol table info available.
#2 0x00000000004012d6 in readFile(std::string) ()
No symbol table info available.
#3 0x00000000004017f6 in patchElf() ()
No symbol table info available.
#4 0x00000000004023c6 in main ()
No symbol table info available.
Error in automation log: https://jenkins.dev.yugabyte.com/job/itest-system-developer/4586/artifact/logs/2.17.3.0_testupgrade-aws-rf3-upgrade-2.11.2.0_89_20230328_211020/testupgrade-aws-rf3-upgrade-2.11.2.0_89myk7homf.log
interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so yb-server\n /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/post_install.sh: line 23: 55646 Aborted (core dumped) "$patchelf_path" --set-interpreter "$ld_path" "$f"\n stderr_lines: <omitted>\n stdout: \'\'\n stdout_lines: <omitted>.')
2023-03-28 22:16:01,590 test_base.py:180 INFO testupgrade-aws-rf3-upgrade-2.11.2.0_89 Traceback (most recent call last):
File "/var/lib/jenkins/code/internal-services/itest/src/test_base.py", line 175, in execute_steps
step.call()
File "/var/lib/jenkins/code/internal-services/itest/src/test_base.py", line 54, in call
ret = self.function()
File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/system_tests/test_upgrade.py", line 260, in do_custom_work
universe = self.do_software_upgrade(universe)
File "/var/lib/jenkins/code/internal-services/itest/src/universe_tests/test_universe_base.py", line 2505, in do_software_upgrade
self.yw_proxy.do_upgrade(universe, upgrade_json)
File "/var/lib/jenkins/code/internal-services/itest/src/yw_proxy.py", line 476, in do_upgrade
self.wait_for_task(response_json['taskUUID'], sys._getframe().f_code.co_name)
File "/var/lib/jenkins/code/internal-services/itest/src/yw_proxy.py", line 629, in wait_for_task
raise RuntimeError(error_msg)
RuntimeError: wait_for_task: Failed task with errors in 211.24973511695862s:
Failed to execute task {"platformVersion":"2.17.3.0-b112","sleepAfterMasterRestartMillis":180000,"sleepAfterTServerRestartMillis":180000,"nodeExporterUser":"prometheus","deviceInfo":{"volumeSize":50,"numVolumes":1,"storageClass":"","storageType":"GP2"},"universeUUID":"aa1df494-2e65-4cde-8951-4c5015dee73f","enableYbc":false,"installYbc":false,"ybcInstalled":false,"encryptionAtRestConfig":{"encryptionAtRestEnabled":false,"opType":"UNDEFINED","type":"DATA_KEY"},"communicationPorts":{"masterHttpPort":7000,"masterRpcPor..., hit error:
Error occurred. Code: 1. Output: ybops.common.exceptions.YBOpsRuntimeError: Runtime error: Playbook run of configure-cluster-server.yml against 10.9.130.67, failed with return code 2
TASK [install_package : Download package | Run the post install script] ********
Tuesday 28 March 2023 22:15:50 +0000 (0:00:01.215) 0:01:28.255 *********
fatal: [10.9.130.67]: FAILED! => changed=true
cmd:
- /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/post_install.sh
delta: '0:00:00.911277'
end: '2023-03-28 22:15:51.970985'
msg: non-zero return code
rc: 134
start: '2023-03-28 22:15:51.059708'
stderr: |-
+ /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so ldb
warning: working around a Linux kernel bug by creating a hole of 1245184 bytes in ldb
+ /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-rpath /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/postgres:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb-thirdparty:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/linuxbrew/lib ldb
+ /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so log-dump
...by creating a hole of 1245184 bytes in yb-ysck
+ /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-rpath /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/postgres:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb-thirdparty:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/linuxbrew/lib yb-ysck
+ /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so redis-cli
+ /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-rpath /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/postgres:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/yb-thirdparty:/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/linuxbrew/lib redis-cli
+ /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --set-interpreter /home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/lib/ld.so yb-server
/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/post_install.sh: line 23: 55646 Aborted (core dumped) "$patchelf_path" --set-interpreter "$ld_path" "$f"
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>.
Core was generated by `/home/yugabyte/yb-software/yugabyte-2.17.3.0-b112-centos-x86_64/bin/patchelf --'
at least this part is a know issue with patchelf: "patchelf v0.9 needs 2 x file size + 32MB". So, it is insufficient space either before the upgrade has been started or during the upgrate. Additionally, If automatic tablet splitting is turned on and it may work during upgrade (it requires a quorum for split operation), then post-split compaction may start consume some space to compact tablet's files, in worst case it may need additional total size of tablet SST free space (with defaults) -- should be confirmed by some test. Maybe it makes sense to check there's enough space to have a copy of N largest tablets (where N is 1 by default).
@rjalan-yb , Based on @arybochkin 's comments , it seems like a better way to avoid upgrades from running out of memory and disk is to add checks in advance before upgrade and that is tracked by https://yugabyte.atlassian.net/browse/PLAT-5561. Seems like this issue was hit before too, Can we add the 2.18 blocker tag to the platform item instead?
@rjalan-yb what's the amount of size on disk available for this? Is this just a matter of the nodes being under-sized? Trying to asses if this should really be a 2.18 blocker @rthallamko3
Tagging @Arjun-yb , who is going to try running the test on larger instances.
Hey @rthallamko3, with a larger instance, we do not see this issue. However, we do see some fatals and error files generated for 2.14 and later versions:
E0331 07:30:35.525130 45074 tablet_metadata.cc:669] T da3743351a4f4e29987363852dd9ac0c P 1e7cb3896ac948a0bb8a6d7b975518f9: Failed to destroy regular DB at: /mnt/d0/yb-data/tserver/data/rocksdb/table-0000404d000030008000000000004147/tablet-da3743351a4f4e29987363852dd9ac0c: IO error (yb/rocksdb/util/env_posix.cc:320): /mnt/d0/yb-data/tserver/data/rocksdb/table-0000404d000030008000000000004147/tablet-da3743351a4f4e29987363852dd9ac0c/LOCK: No such file or directory
E0331 07:30:35.529448 45074 tablet_metadata.cc:669] T dc87ce9f9fb441c295424ac82f126af7 P 1e7cb3896ac948a0bb8a6d7b975518f9: Failed to destroy regular DB at: /mnt/d0/yb-data/tserver/data/rocksdb/table-0000404d000030008000000000004147/tablet-dc87ce9f9fb441c295424ac82f126af7: IO error (yb/rocksdb/util/env_posix.cc:320): /mnt/d0/yb-data/tserver/data/rocksdb/table-0000404d000030008000000000004147/tablet-dc87ce9f9fb441c295424ac82f126af7/LOCK: No such file or directory
E0331 07:31:53.023344 45213 tablet_peer.cc:1580] Peer 909f4d2eaf5748ecabdea20a0930788b is a VOTER Not changing its role after remote bootstrap
Fatal:
F20230331 07:29:58 ../../src/yb/yql/pggate/pggate.cc:1726] Check failed: _s.ok() Bad status: Aborted (yb/rpc/reactor.cc:94): Shutdown connection (system error 108)
@ 0x7fead8077e1c google::LogDestination::LogToSinks()
@ 0x7fead8077844 google::LogMessage::SendToLog()
@ 0x7fead8078fd8 google::LogMessage::Flush()
@ 0x7fead80797bf google::LogMessageFatal::~LogMessageFatal()
@ 0x7feadd0ba397 yb::pggate::PgApiImpl::ClearSeparateDdlTxnMode()
@ 0xb2d20f YBTxnDdlProcessUtility
@ 0x963ad4 PortalRunUtility
@ 0x9638a7 PortalRunMulti
@ 0x962e63 PortalRun
@ 0x95ec06 yb_exec_simple_query_impl
@ 0x95f3c4 yb_exec_query_wrapper_one_attempt
@ 0x95b97b PostgresMain
@ 0x8b6fbb BackendRun
@ 0x8b66ba ServerLoop
@ 0x8b2a31 PostmasterMain
@ 0x7d2153 PostgresServerProcessMain
@ 0x4fb042 main
@ 0x7feadc088825 __libc_start_main
@ 0x4faf49 _start
@ (nil) (unknown)
We can constantly see this reproduced: https://jenkins.dev.yugabyte.com/job/itest-system-developer/4690/ The setup details can be found in Jenkins run. We can see the node is upgraded but the Jenkins run is failing because FATALS are present.
Complete Logs: https://drive.google.com/drive/folders/1GdXW6OZGejVt3N8ZlqBRiWoUVvtSM0lg?usp=sharing
@rjalan-yb , We have seen similar issues due to schema change operations and that issue is being fixed by #12926 (Pending backport to 2.14 branch, it should be done in the next couple of days).
Can you confirm if the master branch looks clean? If so, I can move the bug as DUP of #12926
@rjalan-yb , Ping on above.
Hey @rthallamko3, things with tablet splitting are working as expected in upgrade cases: Master Upgrade Jenkins run
All cases passed except 2. 2.0.11 -> CQLSH issue we know, 2.2 -> For some reason, it created same table twice in metadata and thus this command fails:
'/home/yugabyte/master/bin/yb-admin --master_addresses 10.9.131.74:7100,10.9.195.151:7100,10.9.92.40:7100 list_tablets ysql.splitdb upg_copy_tab_2_2_7_0_0 0' returned error code 1.
It fails with this error:
Error: Invalid argument (yb/tools/yb-admin_client.cc:371): Namespace 'ysql.splitdb' has multiple tables named 'upg_copy_tab_2_2_7_0_0', specify table id instead
And it runs with table id on manual try and tablets were split. So not an issue with tablet splitting.
This is a known issue it seems as we have an article for that too: https://support.yugabyte.com/hc/en-us/articles/360061078832-Namespace-foo-of-type-ycql-not-found
It is only seen for 2.2 version and not any other version. @Arjun-yb is also checking this once.
Since the timeout issue is not happening anymore. Will close this issue. The invalid argument issue on 2.2 seems different from what this GH is trying to capture. Please create a separate issue for that if needed. cc @rjalan-yb
Jira Link: DB-4481
Description
This is reproducible now in upgrade from 2.8.10.0-b21 to 2.17.1.0-b290 with automatic tablet splitting flag enabled. It should get turned on during upgrade since the new version supports automatic tablet splitting, but instead the upgrade_ysql command times out: https://phabricator.dev.yugabyte.com/D21514
itest logs show that this is just a standard yb-admin upgrade_ysql failing:
Tserver logs might indicate the source of the problem:
Full logs: https://drive.google.com/file/d/1dRNWRgExTOvlSysyrA5Y0yDAlt0PEMTt/view?usp=share_link