yunionio / cloudpods

A cloud-native open-source unified multi-cloud and hybrid-cloud platform.
https://www.cloudpods.org
Apache License 2.0
2.57k stars, 525 forks

Backup machine stays stuck in the starting state [Help] #18642

Open shengxianli opened 10 months ago

shengxianli commented 10 months ago

backup_stopping=>backup_starting image

wanyaoqi commented 10 months ago

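@shengxianli Check whether the host log on the node where the backup machine runs shows any errors: kubectl logs -n onecloud -c host <pod name> | less. The qemu startup logs are in /opt/cloud/workspace/servers/logs/; check those for errors as well. Also, please post the version you deployed.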

shengxianli commented 10 months ago

@wanyaoqi v3.10.3

kubectllogs.txt qemu.txt

wanyaoqi commented 10 months ago

[info 2023-11-13 02:46:53 appsrv.(*Application).ServeHTTP(appsrv.go:284)] kMbJpGF6XQFxOj2mDAK2ebwlGWw= 200 4afb7e-b55261-2224b0 POST /servers/c50fb291-7f53-4a78-8762-79f377efb4b5/start (10.0.254.17:65165:compute_v2) 1.83ms
[info 2023-11-13 02:46:53 guestman.(*SKVMGuestInstance).asyncScriptStart(qemu-kvm.go:548)] Use vnc port 7
[error 2023-11-13 02:46:53 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference

@shengxianli According to the logs, starting the machine again after it was shut down failed. This looks like a bug; we will verify it.

wanyaoqi commented 10 months ago

@shengxianli We tested create, restart, and similar operations on the latest version, 3.10.7, and found no problems. We recommend upgrading to 3.10.7.

shengxianli commented 10 months ago

@wanyaoqi Got it. I'll upgrade and see.

shengxianli commented 10 months ago

@wanyaoqi It is still stuck in starting. Does something also need to be done on the host machine? kubectllogs.txt qemu.txt

shengxianli commented 10 months ago

image

wanyaoqi commented 10 months ago

@shengxianli Did you actually upgrade? It still appears to panic:

[error 2023-11-14 10:06:47 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference

shengxianli commented 10 months ago

@wanyaoqi Yes, I upgraded. I'll reinstall and redeploy the physical machine and try again.

shengxianli commented 10 months ago

@wanyaoqi On my side it shows as already upgraded, and this host is not visible on the backup machine.
[root@IT-Cloudpods ocboot]# kubectl -n onecloud get onecloudclusters default -o=jsonpath='{.spec.version}'
v3.10.7[root@IT-Cloudpods ocboot]#

However, there appear to have been some errors during the upgrade:
[root@IT-Cloudpods ocboot]# ./ocboot.py upgrade 10.0.254.17 v3.10.7
INFO:lib.ssh:exec_command: [ -s /etc/kubernetes/admin.conf ] && export KUBECONFIG=/etc/kubernetes/admin.conf || :; kubectl -n onecloud get onecloudclusters default -o json
ssh -p 22 -o LogLevel=error -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ForwardX11=no -i /root/.ssh/id_rsa root@10.0.254.17 '[ -s /etc/kubernetes/admin.conf ] && export KUBECONFIG=/etc/kubernetes/admin.conf || :; kubectl -n onecloud get onecloudclusters default -o json'
INFO:lib.ssh:exec_command: [ -s /etc/kubernetes/admin.conf ] && export KUBECONFIG=/etc/kubernetes/admin.conf || :; kubectl get nodes -o json
ssh -p 22 -o LogLevel=error -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ForwardX11=no -i /root/.ssh/id_rsa root@10.0.254.17 '[ -s /etc/kubernetes/admin.conf ] && export KUBECONFIG=/etc/kubernetes/admin.conf || :; kubectl get nodes -o json'
ansible-playbook -e @/tmp/oc_vars.yml -i /tmp/test-hosts.ini ./onecloud/upgrade-cluster.yml

PLAY [all] *****

TASK [Gathering Facts] ***** /usr/local/lib/python3.6/site-packages/ansible/parsing/vault/__init__.py:44: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography. The next release of cryptography will remove support for Python 3.6. from cryptography.exceptions import InvalidSignature ok: [test02] ok: [test01] ok: [it-cloudpods]

TASK [utils/detect-os : gather os specific variables] ** ok: [it-cloudpods] => (item=/root/ocboot/onecloud/roles/utils/detect-os/vars/../vars/centos-x86_64.yml) ok: [test01] => (item=/root/ocboot/onecloud/roles/utils/detect-os/vars/../vars/centos-x86_64.yml) ok: [test02] => (item=/root/ocboot/onecloud/roles/utils/detect-os/vars/../vars/centos-x86_64.yml)

TASK [utils/detect-os : Lookup offline data path] ** ok: [it-cloudpods] ok: [test02] ok: [test01]

TASK [utils/detect-os : Set offline data path var] ***** ok: [it-cloudpods] ok: [test01] ok: [test02]

TASK [utils/detect-os : Set online status] ***** ok: [it-cloudpods] ok: [test01] ok: [test02]

TASK [utils/detect-os : Set offline deploy] **** ok: [it-cloudpods] ok: [test01] ok: [test02]

TASK [utils/detect-os : Check if /usr/bin/python3 exists] ** ok: [test01] ok: [test02] ok: [it-cloudpods]

TASK [utils/detect-os : Set python interpreter] **** ok: [it-cloudpods] ok: [test01] ok: [test02]

TASK [utils/detect-os : debug] ***** ok: [it-cloudpods] => { "online_status": "online" } ok: [test01] => { "online_status": "online" } ok: [test02] => { "online_status": "online" }

TASK [utils/detect-os : set default fact is_running_on_vm] ***** ok: [it-cloudpods] ok: [test01] ok: [test02]

TASK [utils/detect-os : detect if running on VM] *** ok: [it-cloudpods] ok: [test01] ok: [test02]

TASK [utils/detect-os : join as host condition] **** ok: [it-cloudpods] ok: [test01] ok: [test02]

TASK [utils/detect-os : vm node join as host agent] **** skipping: [it-cloudpods] skipping: [test01] skipping: [test02]

TASK [utils/detect-os : physical node join as host agent] ** skipping: [it-cloudpods] skipping: [test01] skipping: [test02]

TASK [disable telegraf for host service] *** skipping: [it-cloudpods] skipping: [test01] skipping: [test02]

TASK [upgrade/common : set others] ***** ok: [it-cloudpods] ok: [test01] ok: [test02]

TASK [Include cronjobs] ****

TASK [utils/cronjobs : Ensure a job that runs every minute] **** ok: [it-cloudpods] ok: [test01] ok: [test02]

TASK [utils/cronjobs : ensure /opt/yunion/scripts path exists] ***** ok: [test02] ok: [test01] ok: [it-cloudpods]

TASK [enable auto backup] ** skipping: [it-cloudpods] skipping: [test01] skipping: [test02]

TASK [Include utils/k8s/kubelet/extra-args tasks] **

TASK [utils/k8s/kubelet/extra-args : Check kubelet if init] **** ok: [test02] ok: [test01] ok: [it-cloudpods]

TASK [utils/k8s/kubelet/extra-args : Sync /etc/sysconfig/kubelet] ** ok: [test01] ok: [test02] ok: [it-cloudpods]

TASK [utils/k8s/kubelet/extra-args : Restart kubelet] ** changed: [test01] changed: [test02] changed: [it-cloudpods]

TASK [upgrade/common : common upgrade | Import major version upgrade task] ***** skipping: [it-cloudpods] skipping: [test01] skipping: [test02]

TASK [checking ports] **

TASK [utils/kernel-modules : install service to reload kernel modules] ***** changed: [test01] changed: [test02] changed: [it-cloudpods]

TASK [utils/kernel-modules : prepare load module scripts] ** changed: [test01] changed: [test02] changed: [it-cloudpods]

TASK [utils/kernel-modules : execute load module scripts] ** changed: [test01] changed: [test02] changed: [it-cloudpods]

TASK [utils/kernel-modules : enable load modules service] ** changed: [test01] changed: [test02] changed: [it-cloudpods]

PLAY [primary_master_node] *****

TASK [Gathering Facts] ***** ok: [it-cloudpods]

TASK [Include utils/controlplane tasks] ****

TASK [utils/controlplane : Remove kubeadm cronjob that renews certificates] **** ok: [it-cloudpods]

TASK [utils/controlplane : Ensure a cronjob that renews k8s certificates] ** ok: [it-cloudpods]

TASK [upgrade/primary_master_node : Upgrade ocadm packages] **** changed: [it-cloudpods]

TASK [upgrade/primary_master_node : Copy clean_k8s_obj.sed script to /tmp] ***** changed: [it-cloudpods]

TASK [upgrade/primary_master_node : Backup current onecloud cluster and operator resource to /opt/yunion/ocboot/_upgrade/.*] *** changed: [it-cloudpods]

TASK [upgrade/primary_master_node : update operator only] ** changed: [it-cloudpods]

TASK [upgrade/primary_master_node : Copy turn-on-operator-clear-component patch to /tmp/turn-on-operator-clear-component.patch.yml] *** changed: [it-cloudpods]

TASK [upgrade/primary_master_node : Turn on operator -clear-component option] *** changed: [it-cloudpods]

TASK [upgrade/primary_master_node : test if version file contains cloudpods-ee image] *** skipping: [it-cloudpods]

TASK [upgrade/primary_master_node : patch hyper image for ee mode] ***** skipping: [it-cloudpods]

TASK [upgrade/primary_master_node : primary master node | Use ocadm upgrade version "v3.10.3" to "v3.10.7"] *** changed: [it-cloudpods]

TASK [upgrade/primary_master_node : primary master node | Check ocadm upgrade async task. To watch upgrade process, SSH login host "IT-Cloudpods" execute: kubectl get pods -n onecloud -w] *** FAILED - RETRYING: primary master node | Check ocadm upgrade async task. To watch upgrade process, SSH login host "IT-Cloudpods" execute: kubectl get pods -n onecloud -w (30 retries left). FAILED - RETRYING: primary master node | Check ocadm upgrade async task. To watch upgrade process, SSH login host "IT-Cloudpods" execute: kubectl get pods -n onecloud -w (29 retries left). changed: [it-cloudpods]

TASK [upgrade/primary_master_node : Copy turn-off-operator-clear-component patch to /tmp/turn-off-operator-clear-component.patch.yml] *** changed: [it-cloudpods]

TASK [upgrade/primary_master_node : Turn off operator -clear-component option] *** changed: [it-cloudpods]

TASK [patch coredns deployment] ****

TASK [utils/k8s/addons/patch : copy deployment kube-system/coredns patch to /tmp/coredns.patch.yml] *** changed: [it-cloudpods]

TASK [utils/k8s/addons/patch : patch deployment kube-system/coredns onecloud.yunion.io/controller node selector] *** changed: [it-cloudpods]

TASK [patch calico-kube-controllers deployment] ****

TASK [utils/k8s/addons/patch : copy deployment kube-system/calico-kube-controllers patch to /tmp/calico-kube-controllers.patch.yml] *** changed: [it-cloudpods]

TASK [utils/k8s/addons/patch : patch deployment kube-system/calico-kube-controllers onecloud.yunion.io/controller node selector] *** changed: [it-cloudpods]

PLAY [master_nodes] **** skipping: no hosts matched

PLAY [worker_nodes] ****

TASK [Gathering Facts] ***** ok: [test02] ok: [test01]

PLAY [primary_master_node:master_nodes:worker_nodes] ***

TASK [Gathering Facts] ***** ok: [test01] ok: [test02] ok: [it-cloudpods]

TASK [utils/gpu-init : ensure gpu init dir exists] ***** ok: [test01] => (item=/usr/local/gpu-init) ok: [test02] => (item=/usr/local/gpu-init) ok: [test01] => (item=/usr/share/hwdata) ok: [test02] => (item=/usr/share/hwdata) ok: [it-cloudpods] => (item=/usr/local/gpu-init) ok: [it-cloudpods] => (item=/usr/share/hwdata)

TASK [utils/gpu-init : Update pciids] ** ok: [test01] ok: [test02] ok: [it-cloudpods]

TASK [utils/gpu-init : cp gpu related files] *** ok: [test01] => (item=functions) ok: [test02] => (item=functions) ok: [it-cloudpods] => (item=functions) changed: [test01] => (item=gpu_setup.sh) changed: [test02] => (item=gpu_setup.sh) changed: [it-cloudpods] => (item=gpu_setup.sh)

TASK [utils/gpu-init : init gpus] ** changed: [it-cloudpods] changed: [test01] changed: [test02]

PLAY RECAP ***** it-cloudpods : ok=44 changed=21 unreachable=0 failed=0 skipped=7 rescued=0 ignored=0 test01 : ok=28 changed=7 unreachable=0 failed=0 skipped=5 rescued=0 ignored=0 test02 : ok=28 changed=7 unreachable=0 failed=0 skipped=5 rescued=0 ignored=0

INFO:lib.ssh:exec_command: [ -s /etc/kubernetes/admin.conf ] && export KUBECONFIG=/etc/kubernetes/admin.conf || :; kubectl -n onecloud annotate --overwrite=true onecloudclusters default upgrade.ocboot.yunion.io/current-version=v3.10.7
ssh -p 22 -o LogLevel=error -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ForwardX11=no -i /root/.ssh/id_rsa root@10.0.254.17 '[ -s /etc/kubernetes/admin.conf ] && export KUBECONFIG=/etc/kubernetes/admin.conf || :; kubectl -n onecloud annotate --overwrite=true onecloudclusters default upgrade.ocboot.yunion.io/current-version=v3.10.7'

┌─────────────────────────────────────────────────────┐
│                                                     │
│ The system has been upgraded to the latest version. │
│                                                     │
└─────────────────────────────────────────────────────┘

wanyaoqi commented 10 months ago

@shengxianli Try deleting the backup machine and adding it again. Also, the kubectl logs you provided earlier were all incomplete; try adding -p when you run logs.
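For reference, a minimal sketch of collecting both the current and the previous container's host logs in one step. The function name collect_host_logs and the output file names are hypothetical; the namespace (onecloud) and container name (host) come from the command earlier in this thread, and -p is the short form of kubectl's --previous flag.

```shell
#!/usr/bin/env bash
# Hypothetical helper: save the host container logs of one pod to files.
# Usage: collect_host_logs <pod name> [output directory]
collect_host_logs() {
    local pod="$1"
    local out_dir="${2:-.}"

    # Logs of the currently running host container.
    kubectl logs -n onecloud -c host "$pod" > "$out_dir/host-current.log"

    # Logs of the previous container instance (the one that existed before
    # the last restart). kubectl returns an error if there is none.
    kubectl logs -n onecloud -c host "$pod" --previous > "$out_dir/host-previous.log" \
        || echo "no previous container logs for $pod" >&2
}
```

Called as collect_host_logs <pod name> /tmp, this leaves two files that can be attached to the issue. If the host container restarted after the panic, the previous-instance log is the one more likely to contain the error.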

shengxianli commented 10 months ago

@wanyaoqi Could you take a look at this log? kubectllogs.txt

swordqiu commented 10 months ago

@shengxianli [error 2023-11-14 10:06:47 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference
Could you post the backtrace for this error?

shengxianli commented 10 months ago

@swordqiu Sorry, I'm not very familiar with the code-level details. Also, what do you mean by backtrace?

swordqiu commented 10 months ago

The kubectllogs.txt above contains this log entry: [error 2023-11-14 10:06:47 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference. The log should also print the corresponding call stack. Would you be able to post that call stack?
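One way to locate that call stack in a saved log file is sketched below. It assumes the stack is printed in the lines immediately following the error line, which has not been confirmed in this thread; the function name extract_backtrace and the default of 40 following lines are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every "nil pointer dereference" error line in a
# log file together with the lines that follow it, where the call stack is
# assumed to appear.
# Usage: extract_backtrace <log file> [number of following lines]
extract_backtrace() {
    local log_file="$1"
    local context="${2:-40}"

    # -n prefixes each line with its line number in the file;
    # -A prints the given number of lines after each match.
    grep -n -A "$context" \
        'invalid memory address or nil pointer dereference' "$log_file"
}
```

For example, extract_backtrace kubectllogs.txt 60 prints each matching error line and the 60 lines after it. If nothing resembling a stack shows up, the stack may be in the previous container's log rather than in this file.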

shengxianli commented 10 months ago

@swordqiu Sure. How should I go about finding it?

dengju2020 commented 10 months ago

@shengxianli We have confirmed that the bug exists and are scheduling a fix. Any progress will be posted in this issue. Thanks.

shengxianli commented 10 months ago

@dengju2020 OK. I had reinstalled many times and it was still the same, which seemed quite strange.