yunionio / cloudpods

A cloud-native open-source unified multi-cloud and hybrid-cloud platform.
https://www.cloudpods.org
Apache License 2.0
2.6k stars, 534 forks

[Help] Installing 3.11 on Ubuntu hangs #19700

Closed Belkuy closed 8 months ago

Belkuy commented 8 months ago

OS version: Ubuntu 22.04. Cloudpods version: 3.11, deployed as a private cloud in All in One mode using the ocboot deployment tool.

The installation stays stuck at: TASK [primary-master-node/setup_cloud : Create essential services, wait for a few minutes. You can open another terminal and execute kubectl get pods -n onecloud -w to watch the process.] ***

Pod status: image

The cluster events show no errors either: image

zhasm commented 8 months ago

@Belkuy

# List all pods
kubectl get pods -o wide -A
# Show the kernel boot parameters
cat /proc/cmdline

Belkuy commented 8 months ago

image

cat /proc/cmdline

BOOT_IMAGE=/boot/vmlinuz-5.15.0-100-generic root=UUID=2346bb46-0e8c-416a-9afd-f678255cd407 ro systemd.unified_cgroup_hierarchy=0
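The boot parameters above can also be inspected programmatically. The following is a minimal sketch, not part of ocboot: the helper name `parse_cmdline` is hypothetical. It splits a kernel command line into name/value pairs so that a single parameter, such as the `systemd.unified_cgroup_hierarchy=0` setting that appears in every command line posted in this thread, can be looked up directly.

```python
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}.

    Bare flags such as "ro" map to None. Only the first "=" separates
    name from value, so "root=UUID=..." keeps "UUID=..." as the value.
    This is a simplification: quoted values containing spaces are not handled.
    """
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Command line copied from the comment above.
SAMPLE = (
    "BOOT_IMAGE=/boot/vmlinuz-5.15.0-100-generic "
    "root=UUID=2346bb46-0e8c-416a-9afd-f678255cd407 ro "
    "systemd.unified_cgroup_hierarchy=0"
)

if __name__ == "__main__":
    params = parse_cmdline(SAMPLE)
    print(params.get("systemd.unified_cgroup_hierarchy"))
```

On a live host the same function could be fed the contents of `/proc/cmdline` instead of the sample string.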

zhasm commented 8 months ago

@Belkuy

image

Please check the logs of the operator pod in the bottom row.

Belkuy commented 8 months ago

[info 240311 06:15:08 component.(etcdManager).fixEtcdSize(etcd.go:133)] Master node count 1
[info 240311 06:15:11 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[info 240311 06:15:19 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[info 240311 06:15:27 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[info 240311 06:15:35 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[error 240311 06:15:41 component.(etcdManager).membersDefrag(etcd.go:229)] members defrag failed: creating etcd client failed etcdclient: no available endpoints
[info 240311 06:15:43 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
Post "https://default-keystone:30357/v3/auth/tokens": dial tcp: lookup default-keystone on 10.96.0.10:53: read udp 10.40.175.134:58831->10.96.0.10:53: i/o timeout
E0311 06:15:48.918024 1 onecloud_cluster_controller.go:210] OnecloudCluster: onecloud/default, sync failed sync component: sync phase control "keystone": get mcclient session: {"error":{"class":"DNSError","code":499,"details":"Post \"https://default-keystone:30357/v3/auth/tokens\": dial tcp: lookup default-keystone on 10.96.0.10:53: read udp 10.40.175.134:58831->10.96.0.10:53: i/o timeout","request":{"body":"{\"auth\":{\"context\":{\"source\":\"operator\"},\"identity\":{\"methods\":[\"password\"],\"password\":{\"user\":{\"dom...ult\"},\"name\":\"system\"}}}}","headers":{"Content-Length":"242","Content-Type":"application/json","User-Agent":"yunioncloud-go/201708"},"method":"POST","url":"https://default-keystone:30357/v3/auth/tokens"}}}, requeuing
I0311 06:15:48.930825 1 configmap_control.go:73] update ConfigMap: [onecloud/default-cluster-config] successfully, cluster: default
[info 240311 06:15:48 controller.recordResourceEvent(utils.go:162)] update ConfigMap default-cluster-config in OnecloudCluster default successful
I0311 06:15:48.931175 1 event.go:282] Event(v1.ObjectReference{Kind:"OnecloudCluster", Namespace:"onecloud", Name:"default", UID:"77e333fa-d16c-4caa-a0a5-a897ab42c821", APIVersion:"onecloud.yunion.io/v1alpha1", ResourceVersion:"1059", FieldPath:""}): type: 'Normal' reason: 'SuccessfulUpdate' update ConfigMap default-cluster-config in OnecloudCluster default successful
[info 240311 06:15:48 component.(etcdManager).fixEtcdSize(etcd.go:133)] Master node count 1
[info 240311 06:15:51 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[info 240311 06:15:59 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[info 240311 06:16:07 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[info 240311 06:16:15 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[info 240311 06:16:24 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
Post "https://default-keystone:30357/v3/auth/tokens": dial tcp: lookup default-keystone on 10.96.0.10:53: read udp 10.40.175.134:47245->10.96.0.10:53: i/o timeout
E0311 06:16:29.050305 1 onecloud_cluster_controller.go:210] OnecloudCluster: onecloud/default, sync failed sync component: sync phase control "keystone": get mcclient session: {"error":{"class":"DNSError","code":499,"details":"Post \"https://default-keystone:30357/v3/auth/tokens\": dial tcp: lookup default-keystone on 10.96.0.10:53: read udp 10.40.175.134:47245->10.96.0.10:53: i/o timeout","request":{"body":"{\"auth\":{\"context\":{\"source\":\"operator\"},\"identity\":{\"methods\":[\"password\"],\"password\":{\"user\":{\"dom...ult\"},\"name\":\"system\"}}}}","headers":{"Content-Length":"242","Content-Type":"application/json","User-Agent":"yunioncloud-go/201708"},"method":"POST","url":"https://default-keystone:30357/v3/auth/tokens"}}}, requeuing
I0311 06:16:29.063124 1 configmap_control.go:73] update ConfigMap: [onecloud/default-cluster-config] successfully, cluster: default
[info 240311 06:16:29 controller.recordResourceEvent(utils.go:162)] update ConfigMap default-cluster-config in OnecloudCluster default successful
I0311 06:16:29.063338 1 event.go:282] Event(v1.ObjectReference{Kind:"OnecloudCluster", Namespace:"onecloud", Name:"default", UID:"77e333fa-d16c-4caa-a0a5-a897ab42c821", APIVersion:"onecloud.yunion.io/v1alpha1", ResourceVersion:"1059", FieldPath:""}): type: 'Normal' reason: 'SuccessfulUpdate' update ConfigMap default-cluster-config in OnecloudCluster default successful
[info 240311 06:16:29 component.(etcdManager).fixEtcdSize(etcd.go:133)] Master node count 1
[info 240311 06:16:32 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[info 240311 06:16:40 component.(etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])
[info 240311 06:16:48 component.(*etcdManager).run(etcd.go:762)] skip reconciliation: running ([]), pending ([default-etcd-6stvfvw54j])

zhasm commented 8 months ago

@Belkuy Thanks for the report. We are investigating.

zexi commented 8 months ago

@Belkuy Check the logs of the calico-node pod in the kube-system namespace. If they contain ipset-related errors, try changing that daemonset's image to registry.cn-beijing.aliyuncs.com/yunionio/calico-node:v3.12.1-ipset-6.

Belkuy commented 8 months ago

There are no such errors in the calico-node logs.

I looked into it. The DNS configuration inside the pods does not seem to have been changed: service names are still being resolved through the host's DNS servers (192.168.10.1 and 192.168.10.2 are the host's DNS servers).

2024-03-11T06:38:25.688Z [ERROR] plugin/errors: 2 default-etcd-q28dhc5qnf.default-etcd.onecloud.svc. A: read udp 10.40.175.131:51393->192.168.10.2:53: i/o timeout
2024-03-11T06:38:25.688Z [ERROR] plugin/errors: 2 default-etcd-q28dhc5qnf.default-etcd.onecloud.svc. AAAA: read udp 10.40.175.131:51693->192.168.10.1:53: i/o timeout
2024-03-11T06:38:29.999Z [ERROR] plugin/errors: 2 default-keystone. A: read udp 10.40.175.131:36199->192.168.10.2:53: i/o timeout
2024-03-11T06:38:30.691Z [ERROR] plugin/errors: 2 default-etcd-q28dhc5qnf.default-etcd.onecloud.svc. AAAA: read udp 10.40.175.131:51824->192.168.10.2:53: i/o timeout
2024-03-11T06:38:30.691Z [ERROR] plugin/errors: 2 default-etcd-q28dhc5qnf.default-etcd.onecloud.svc. A: read udp 10.40.175.131:33314->192.168.10.2:53: i/o timeout
2024-03-11T06:38:35.696Z [ERROR] plugin/errors: 2 default-etcd-q28dhc5qnf.default-etcd.onecloud.svc. A: read udp 10.40.175.131:58309->192.168.10.1:53: i/o timeout
2024-03-11T06:38:35.696Z [ERROR] plugin/errors: 2 default-etcd-q28dhc5qnf.default-etcd.onecloud.svc. AAAA: read udp 10.40.175.131:58629->192.168.10.1:53: i/o timeout
2024-03-11T06:39:05.134Z [ERROR] plugin/errors: 2 default-keystone. A: read udp 10.40.175.131:58045->192.168.10.1:53: i/o timeout
2024-03-11T06:39:10.134Z [ERROR] plugin/errors: 2 default-keystone. A: read udp 10.40.175.131:34435->192.168.10.2:53: i/o timeout
2024-03-11T06:39:45.254Z [ERROR] plugin/errors: 2 default-keystone. AAAA: read udp 10.40.175.131:56712->192.168.10.1:53: i/o timeout
2024-03-11T06:39:50.254Z [ERROR] plugin/errors: 2 default-keystone. A: read udp 10.40.175.131:44761->192.168.10.1:53: i/o timeout
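To see at a glance which upstream servers those timeouts point at, log lines in this `plugin/errors` format can be tallied with a short script. This is only a sketch: the regular expression is written against the lines posted in this comment, the name `upstream_timeouts` is hypothetical, and the script summarizes the log without diagnosing the cause.

```python
import re
from collections import Counter

# Matches entries such as:
#   ... [ERROR] plugin/errors: 2 default-keystone. A: read udp
#   10.40.175.131:36199->192.168.10.2:53: i/o timeout
LINE_RE = re.compile(
    r"\[ERROR\] plugin/errors: \d+ (?P<name>\S+) (?P<qtype>A|AAAA): "
    r"read udp (?P<src>\S+)->(?P<upstream>[\d.]+):(?P<port>\d+): i/o timeout"
)


def upstream_timeouts(log_text: str) -> Counter:
    """Count timed-out queries per upstream DNS server address."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            counts[match.group("upstream")] += 1
    return counts


# Three entries copied from the log in this comment.
SAMPLE_LOG = """\
2024-03-11T06:38:25.688Z [ERROR] plugin/errors: 2 default-etcd-q28dhc5qnf.default-etcd.onecloud.svc. A: read udp 10.40.175.131:51393->192.168.10.2:53: i/o timeout
2024-03-11T06:38:29.999Z [ERROR] plugin/errors: 2 default-keystone. A: read udp 10.40.175.131:36199->192.168.10.2:53: i/o timeout
2024-03-11T06:39:05.134Z [ERROR] plugin/errors: 2 default-keystone. A: read udp 10.40.175.131:58045->192.168.10.1:53: i/o timeout
"""

if __name__ == "__main__":
    for server, count in upstream_timeouts(SAMPLE_LOG).most_common():
        print(server, count)
```

If every counted address belongs to the host's resolvers rather than the cluster DNS service, that is consistent with the observation that service names are being forwarded to the host's DNS.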

zhasm commented 8 months ago

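Try registry.cn-beijing.aliyuncs.com/yunionio/calico-node:v3.12.1-ipset-6

@Belkuy

  1. Check whether the kernel version is 5.15 or later.
  2. Switch directly to the calico-node:v3.12.1-ipset-6 image, restart the related components, and see whether that resolves the problem.

Belkuy commented 8 months ago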

Still the same.

zhasm commented 8 months ago

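@Belkuy I see that you closed this issue. Did the problem resolve itself, or have you given up waiting for now? (My colleagues are still tracking down the problem.)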

Belkuy commented 8 months ago

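I rebooted once and ran run.py again, after which the machine rebooted automatically one more time. After that, many containers had started under onecloud, and I could log in to the web page normally. I will try it out further when I have time.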

qfdk commented 6 months ago

Same problem on Ubuntu 22.04.

Linux cloudpods 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

image After the first manual reboot, the installation was able to complete. image Trying a second manual reboot.

root@cloudpods:~# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-105-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro systemd.unified_cgroup_hierarchy=0 nouveau.modeset=0 iommu=pt rdblacklist=nouveau mgag200.modeset=0 intel_iommu=on crashkernel=auto vfio_iommu_type1.allow_unsafe_interrupts=1

I have given up for now. Too many problems came up, and the time cost of solving them is rather high.

zhasm commented 5 months ago

uname -a
Linux ubuntu-vm 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.15.0-52-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro systemd.unified_cgroup_hierarchy=0
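One troubleshooting step in this thread asks whether the kernel is 5.15 or later. A minimal sketch of that comparison for `uname -r` style release strings follows; the helper name `kernel_at_least` is hypothetical, and it compares only the major and minor numbers.

```python
import re
from typing import Tuple


def kernel_at_least(release: str, minimum: Tuple[int, int] = (5, 15)) -> bool:
    """Return True if a release string like "5.15.0-52-generic" is >= minimum.

    Only the leading "major.minor" pair is compared; the patch level and
    the distribution suffix are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = (int(match.group(1)), int(match.group(2)))
    return version >= minimum


if __name__ == "__main__":
    # Release strings taken from the uname output posted in this thread.
    for release in ("5.15.0-52-generic", "5.15.0-105-generic"):
        print(release, kernel_at_least(release))
```

On a running Linux host, `platform.release()` from the standard library returns the same string as `uname -r` and could be passed in directly.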