yunionio / cloudpods

A cloud-native, open-source, unified multi-cloud and hybrid-cloud platform.
https://www.cloudpods.org
Apache License 2.0
2.5k stars 497 forks

[Help] VM deployment fails with a "file does not exist" error, but my files are all there and there is enough disk space #18954

Closed zhuhedong closed 7 months ago

zhuhedong commented 7 months ago

{ "reason": "Deploy guest fs: request deploy guest fs: rpc error: code = Unknown desc = Connect: failed start guest qemu-kvm: -drive file=/home/system_path/aa5c6997-1d49-470d-8dd2-35fb0daecad3,if=none,id=drive_0,cache=none: Could not open '/home/system_path/aa5c6997-1d49-470d-8dd2-35fb0daecad3': No such file or directory\n: exit status 1", "stage": "OnDeployGuestComplete", "status": "error" }

[image] [image] Is this because tmp is full? [image]

swordqiu commented 7 months ago

@zhuhedong It looks like you put the disk path under /home/system_path. I suggest migrating it to a directory under /opt/cloud/workspace/ (you can rebind it there) and using a path under /opt/cloud/workspace in local_image_paths.

zhuhedong commented 7 months ago

[image] I configured this directory in the local_image_path property of the host.conf configuration file, and I have already created 6 virtual machines under it.

swordqiu commented 7 months ago

@zhuhedong Which Cloudpods version are you running?

zhuhedong commented 7 months ago

I had upgraded to 3.10.8. After rolling back to 3.10.7, it works again.

Is this a bug in 3.10.8?

swordqiu commented 7 months ago

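@zhuhedong Yes, this is a bug introduced in 3.10.8, which switched VM deployment to use a lightweight virtual machine. Thanks for the feedback. We will fix it as soon as possible.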

zhuhedong commented 7 months ago

I don't know whether this is a bug or a problem with my services: the service suddenly returned 503 and then recovered on its own. [image]

swordqiu commented 7 months ago

@zhuhedong Did the 503 appear while you were downgrading the version?

zhuhedong commented 7 months ago

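I can't get in again. [image] { "class": "ServiceAbnormal", "code": 499, "details": "compute服务异常,请检查服务状态", "time": "2023-12-11T00:52:19+08:00" } (The details field reads: "compute service is abnormal, please check the service status".)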

zhuhedong commented 7 months ago

@swordqiu No. It happened suddenly, twice in a row. The first time it recovered on its own. Right now I can't perform any operations.

zhuhedong commented 7 months ago

Now it's working again...

swordqiu commented 7 months ago

@zhuhedong Run kubectl -n onecloud get pods -l app=region to check the status of the pod for the region service. If it is abnormal, check the pod logs with: kubectl -n onecloud logs <pod-name>
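For anyone who wants to script the pod check above, here is a minimal sketch in Python. It assumes the default table layout printed by kubectl get pods (NAME, READY, STATUS as the first three columns). The function name unhealthy_pods and the sample pod names are hypothetical and not part of Cloudpods.

```python
from typing import List, Tuple


def unhealthy_pods(kubectl_output: str) -> List[Tuple[str, str]]:
    """Return (name, status) for pods that are not Running or not fully Ready.

    Expects the plain-text table printed by `kubectl get pods`.
    """
    flagged = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        columns = line.split()
        if len(columns) < 3:
            continue
        name, ready, status = columns[0], columns[1], columns[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            flagged.append((name, status))
    return flagged


# Hypothetical sample output; real pod names will differ.
SAMPLE = """\
NAME                   READY   STATUS             RESTARTS   AGE
default-region-aaaa    1/1     Running            0          2d
default-region-bbbb    0/1     CrashLoopBackOff   12         2d
"""
print(unhealthy_pods(SAMPLE))
```

In practice the text would come from running kubectl -n onecloud get pods -l app=region on the control node and passing its output to the function.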

zhuhedong commented 7 months ago

[warning 2023-12-10 17:06:58 appsrv.do_worker_watchdog(workers_watchdog.go:64)] WorkerManager LogClientWorkerManager has been busy for 2 cycles...
Post "https://default-logger:30999/actions": dial tcp: lookup default-logger on 10.96.0.10:53: read udp 10.40.45.54:55567->10.96.0.10:53: read: connection refused
[error 2023-12-10 17:06:59 logclient.(*logTask).Run(logclient.go:249)] create action log {"action":"update_status","domain":"Default","domain_id":"default","ip":"10.107.211.133","notes":"running=>unknown: host offline","obj_id":"77303677-01f7-4adc-8ff8-7a9a522a58c2","obj_name":"h601","obj_type":"server","owner_domain_id":"default","owner_tenant_id":"a386cf653a9e4f728344294ca20e2b6f","project_domain":"Default","project_domain_id":"default","roles":"admin","service":"compute","severity":"ERROR","success":false,"tenant":"system","tenant_id":"a386cf653a9e4f728344294ca20e2b6f","user":"regionadmin","user_id":"8870f888a78446e18ffe8021b68625b4"} failed {"error":{"class":"ServiceAbnormal","code":499,"data":{"fields":["log"],"id":"%s service dns resolve error, please check dns setting"},"details":"log service dns resolve error, please check dns setting"}}
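The log above points at a failed DNS lookup of default-logger against 10.96.0.10:53. As a rough sketch (not a Cloudpods tool), the Python below pulls the failing name and DNS server out of a Go-style "lookup X on IP:PORT" error and offers a simple resolution check. The helper names failed_lookup and can_resolve are made up for illustration, and the resolution check only means something when run from inside a pod that uses the cluster DNS.

```python
import re
import socket
from typing import Optional, Tuple

# Matches the Go net error fragment "lookup <name> on <ip>:<port>".
LOOKUP_ERROR = re.compile(r"lookup (\S+) on (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def failed_lookup(log_line: str) -> Optional[Tuple[str, str, int]]:
    """Return (hostname, dns_server_ip, dns_port) if the line reports a failed lookup."""
    match = LOOKUP_ERROR.search(log_line)
    if match is None:
        return None
    return match.group(1), match.group(2), int(match.group(3))


def can_resolve(hostname: str) -> bool:
    """Return True if the local resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return False
    return True


# Line copied from the log in this thread.
SAMPLE_LINE = (
    'Post "https://default-logger:30999/actions": dial tcp: lookup default-logger '
    "on 10.96.0.10:53: read udp 10.40.45.54:55567->10.96.0.10:53: read: connection refused"
)
print(failed_lookup(SAMPLE_LINE))
```

A "connection refused" from the DNS server address suggests looking at the cluster DNS pods themselves rather than at the region service, though the thread does not confirm the root cause.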

zhuhedong commented 7 months ago

@swordqiu Now the VMs I create no longer have any key information... [image]

swordqiu commented 7 months ago

@zhuhedong Please paste the host-deployer logs.

zhuhedong commented 7 months ago

/dev/nbd13p1: is this disk one of yours?

zhuhedong commented 7 months ago

[info 2023-12-11 08:13:51 deployserver.(*DeployerServer).ResizeFs(deployserver.go:135)] ***** Resize fs on qemuimg.SImageInfo{Path:"/home/system_path/7b252524-04aa-4c6a-86ac-44c73c61beb2", Format:"", IoLevel:0, Password:"", EncryptFormat:"", EncryptAlg:"", secId:""}
[info 2023-12-11 08:13:51 nbd.(*NBDDriver).Connect(driver.go:68)] lock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[error 2023-12-11 08:13:53 nbd.(*NBDDriver).setupLVMS(driver.go:148)] unable to find vg from /dev/nbd13p1: unable to find vg, output is " No volume groups found.\n"
[info 2023-12-11 08:13:53 nbd.(*NBDDriver).Connect(driver.go:88)] /dev/nbd13 hasLVM false err
[info 2023-12-11 08:13:53 nbd.(*NBDDriver).Connect(driver.go:92)] unlock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[info 2023-12-11 08:13:53 diskutils.(*SKVMGuestDisk).mountKvmRootfs(kvm.go:138)] detect partition /dev/nbd13p1
[info 2023-12-11 08:13:53 xfsutils.LockXfsPartition(lock.go:24)] xfs lock f956f023-a2a0-4fd5-8e89-0b44b0848ab3
[error 2023-12-11 08:13:53 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[error 2023-12-11 08:13:54 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[error 2023-12-11 08:13:56 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[info 2023-12-11 08:13:57 xfsutils.UnlockXfsPartition(lock.go:43)] xfs unlock f956f023-a2a0-4fd5-8e89-0b44b0848ab3
[error 2023-12-11 08:13:57 kvmpart.(*SKVMGuestDiskPartition).Mount(kvmpart.go:125)] SKVMGuestDiskPartition mount error: mount failed: tried 3 times, exceeding 3, last err : exit status 32
[info 2023-12-11 08:13:57 guestfs.IsPartitionReadonly(core.go:219)] File system /tmp/_dev_nbd13p1 is not readonly
[info 2023-12-11 08:13:57 kvmpart.(*SKVMGuestDiskPartition).Mount(kvmpart.go:139)] mount fs xfs on /dev/nbd13p1 successfully
[error 2023-12-11 08:13:57 kvmpart.(*SKVMGuestDiskPartition).IsMounted(kvmpart.go:245)] /tmp/_dev_nbd13p1 is not a mountpoint: /tmp/_dev_nbd13p1 is not a mountpoint
[info 2023-12-11 08:13:57 nbd.(*NBDDriver).Disconnect(driver.go:200)] Disconnect lock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[info 2023-12-11 08:13:57 nbd.(*NBDDriver).Disconnect(driver.go:207)] Disconnect unlock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af

[info 2023-12-11 08:13:58 deployserver.(*DeployerServer).DeployGuestFs(deployserver.go:83)] ***** Deploy guest fs on qemuimg.SImageInfo{Path:"/home/system_path/7b252524-04aa-4c6a-86ac-44c73c61beb2", Format:"", IoLevel:0, Password:"", EncryptFormat:"", EncryptAlg:"", secId:""}
[info 2023-12-11 08:13:58 nbd.(*NBDDriver).Connect(driver.go:68)] lock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[error 2023-12-11 08:14:00 nbd.(*NBDDriver).setupLVMS(driver.go:148)] unable to find vg from /dev/nbd13p1: unable to find vg, output is " No volume groups found.\n"
[info 2023-12-11 08:14:00 nbd.(*NBDDriver).Connect(driver.go:88)] /dev/nbd13 hasLVM false err
[info 2023-12-11 08:14:00 nbd.(*NBDDriver).Connect(driver.go:92)] unlock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[info 2023-12-11 08:14:00 diskutils.(*SKVMGuestDisk).mountKvmRootfs(kvm.go:138)] detect partition /dev/nbd13p1
[info 2023-12-11 08:14:00 xfsutils.LockXfsPartition(lock.go:24)] xfs lock f956f023-a2a0-4fd5-8e89-0b44b0848ab3
[error 2023-12-11 08:14:00 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[error 2023-12-11 08:14:01 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[error 2023-12-11 08:14:03 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[info 2023-12-11 08:14:04 xfsutils.UnlockXfsPartition(lock.go:43)] xfs unlock f956f023-a2a0-4fd5-8e89-0b44b0848ab3
[error 2023-12-11 08:14:04 kvmpart.(*SKVMGuestDiskPartition).Mount(kvmpart.go:125)] SKVMGuestDiskPartition mount error: mount failed: tried 3 times, exceeding 3, last err : exit status 32
[info 2023-12-11 08:14:04 guestfs.IsPartitionReadonly(core.go:219)] File system /tmp/_dev_nbd13p1 is not readonly
[info 2023-12-11 08:14:04 kvmpart.(*SKVMGuestDiskPartition).Mount(kvmpart.go:139)] mount fs xfs on /dev/nbd13p1 successfully
[error 2023-12-11 08:14:04 kvmpart.(*SKVMGuestDiskPartition).IsMounted(kvmpart.go:245)] /tmp/_dev_nbd13p1 is not a mountpoint: /tmp/_dev_nbd13p1 is not a mountpoint
[error 2023-12-11 08:14:04 deployserver.(*DeployerServer).DeployGuestFs(deployserver.go:104)] disk.MountRootfs not found partition, not init, quit
[info 2023-12-11 08:14:04 nbd.(*NBDDriver).Disconnect(driver.go:200)] Disconnect lock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[info 2023-12-11 08:14:04 nbd.(*NBDDriver).Disconnect(driver.go:207)] Disconnect unlock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af

@swordqiu These are the logs.
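A log this long is easier to read once the repeated mount failures are tallied. A small Python sketch follows, assuming the "mount fail: exit status N mount: <mountpoint>: ..." wording seen in the host-deployer output above. The function name count_mount_failures is hypothetical.

```python
import re
from collections import Counter

# Matches e.g. "mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, ..."
MOUNT_FAIL = re.compile(r"mount fail: exit status (\d+) mount: (\S+): ")


def count_mount_failures(log_text: str) -> Counter:
    """Count 'mount fail' entries, keyed by (mount point, exit status)."""
    return Counter(
        (match.group(2), int(match.group(1)))
        for match in MOUNT_FAIL.finditer(log_text)
    )


# Three entries taken from the log in this thread.
SAMPLE_LOG = """\
[error 2023-12-11 08:13:53 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[error 2023-12-11 08:13:54 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[info 2023-12-11 08:13:57 xfsutils.UnlockXfsPartition(lock.go:43)] xfs unlock f956f023-a2a0-4fd5-8e89-0b44b0848ab3
"""
print(count_mount_failures(SAMPLE_LOG))
```

The function works the same whether the log has one entry per line or arrives as a single pasted paragraph, since it scans the whole text.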

swordqiu commented 7 months ago

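@zhuhedong In 3.10.7, the VM password is set by mounting the VM's disk through the host's nbd module. Judging from the logs, the VM's xfs filesystem apparently could not be recognized. A common cause is that the host kernel is older than the VM's kernel, so the host cannot mount the VM's xfs filesystem. What are the operating system types and kernel versions of the host and the VM?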

zhuhedong commented 7 months ago

@swordqiu 5.4.130-1.yn20230805.el7.x86_64

swordqiu commented 7 months ago

@zhuhedong And the VM's?

zhuhedong commented 7 months ago

@swordqiu I can't get into the VM, so I can't see its kernel. The image file is CentOS-7-x86_64-GenericCloud-2211.qcow2
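The host-versus-guest kernel comparison asked about here can be sketched in a few lines of Python. This is only a rough heuristic, since comparing version numbers says nothing definite about which xfs features a given kernel build supports. The host string is the one posted in this thread. The guest string is an illustrative assumption, because the guest kernel was never reported. The helper names are hypothetical.

```python
import re
from typing import Tuple


def kernel_version(release: str) -> Tuple[int, int, int]:
    """Parse the leading major.minor[.patch] of a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    major, minor, patch = (int(part) for part in match.groups(default="0"))
    return major, minor, patch


def host_is_older(host_release: str, guest_release: str) -> bool:
    """Return True if the host kernel version is lower than the guest's."""
    return kernel_version(host_release) < kernel_version(guest_release)


HOST = "5.4.130-1.yn20230805.el7.x86_64"  # posted in this thread
GUEST = "3.10.0-1160.el7.x86_64"          # assumption for illustration only
print(kernel_version(HOST), host_is_older(HOST, GUEST))
```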

swordqiu commented 7 months ago

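@zhuhedong This is a problem we run into often with nbd. I suggest switching to 3.10.8, but do not store local disks under /home/system_path. Instead, use mount bind to mount that directory onto a directory under /opt/cloud/workspace.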

zhuhedong commented 7 months ago

But 3.10.8 currently has a bug, the one from last night.

zhuhedong commented 7 months ago

@swordqiu

swordqiu commented 7 months ago

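@zhuhedong For now, 3.10.8 cannot support local disk directories outside /opt/cloud. You can use mount bind to mount the disk directory under /opt/cloud/workspace and use that as the disk storage directory.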

zhuhedong commented 7 months ago

So if I want to use 3.10.8, I have to do it by mounting the disk directory this way, right? @swordqiu

swordqiu commented 7 months ago

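@zhuhedong Yes. This is a limitation of the current mechanism. Before 3.10.8, VM disks were mounted through nbd for initialization, which caused the unreliable password initialization described above. Starting with 3.10.8, a lightweight VM mounts the disk for initialization, but because of container restrictions it cannot access disk files in directories outside /opt/cloud. I suggest using mount bind to work around this limitation for now.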
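To make the workaround concrete, here is a hedged Python sketch that checks whether a disk path falls under /opt/cloud and builds the corresponding bind-mount command and /etc/fstab line. The target directory /opt/cloud/workspace/system_path is a made-up example and the helper names are hypothetical. The path check is purely lexical and does not resolve symlinks or existing mounts.

```python
from pathlib import PurePosixPath
from typing import List

# Per the comment above, 3.10.8 can only reach disk files under this root.
ALLOWED_ROOT = PurePosixPath("/opt/cloud")


def is_under_allowed_root(disk_path: str, root: PurePosixPath = ALLOWED_ROOT) -> bool:
    """Return True if disk_path is root itself or lies below it (lexical check only)."""
    path = PurePosixPath(disk_path)
    return path == root or root in path.parents


def bind_mount_command(source: str, target: str) -> List[str]:
    """Build the argv for `mount --bind source target` (must be run as root)."""
    return ["mount", "--bind", source, target]


def fstab_entry(source: str, target: str) -> str:
    """Build an /etc/fstab line so the bind mount is restored after a reboot."""
    return f"{source} {target} none bind 0 0"


SOURCE = "/home/system_path"                 # directory used in this thread
TARGET = "/opt/cloud/workspace/system_path"  # hypothetical target directory

print(is_under_allowed_root(SOURCE + "/aa5c6997-1d49-470d-8dd2-35fb0daecad3"))
print(" ".join(bind_mount_command(SOURCE, TARGET)))
print(fstab_entry(SOURCE, TARGET))
```

The target directory has to exist before mounting, and the path configured in host.conf would then point at the location under /opt/cloud/workspace rather than the original one. How existing disks behave after such a move is not answered in this thread.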

zhuhedong commented 7 months ago

If I move my existing data over, will that cause any problems? @swordqiu

zhuhedong commented 7 months ago

@swordqiu It suddenly stopped working again... [image]

zexi commented 7 months ago

@zhuhedong First run docker ps -a | grep apiserver to check the state of the kube-apiserver container. If it has exited, run docker logs <container-id> to view its logs. Also run journalctl -u kubelet --no-pager --since '2 hours ago' to view the kubelet service logs.

zhuhedong commented 7 months ago

[image] One of them is down. @zexi

zexi commented 7 months ago

@zhuhedong Check the logs of the apiserver container that is down with docker logs f8b433, and paste them here.

zhuhedong commented 7 months ago

[image] @zexi

zhuhedong commented 7 months ago

[image]

zhuhedong commented 7 months ago

[image] This one is from k8s. @zexi

zexi commented 7 months ago

@zhuhedong Is 192.168.1.10 also a control node? Also run docker ps -a | grep etcd |grep adver to check the state of the etcd container and whether it has exited. If it has, post its logs.

zhuhedong commented 7 months ago

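192.168.1.10 used to be a compute node, and I will reset that node today. 192.168.1.60 was newly installed yesterday: because resetting the key failed to generate one and the directory could not be mounted, I reinstalled and upgraded to 3.10.8. [image]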

@zexi

zhuhedong commented 7 months ago

It's back to normal now and I can operate again. Every time this happens, it recovers on its own after about half an hour. @zexi

zexi commented 7 months ago

@zhuhedong First, tell me about your current deployment architecture: is it a 3-node high-availability setup or a single control node?

zhuhedong commented 7 months ago

@zexi I'm currently running a single node.

zexi commented 7 months ago

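@zhuhedong The only suspicious thing visible so far is that the apiserver was restarted, yet nothing abnormal shows up in the container logs. It looks like the operating system is killing the container, or something along those lines. Is the node's operating system 'centos 7' or another distribution? If it is centos 7, check the system log in /var/log/messages for anything abnormal.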

zhuhedong commented 7 months ago

It is 'centos 7'.

zexi commented 7 months ago

@zhuhedong Is there anything abnormal in /var/log/messages?

zhuhedong commented 7 months ago

[image] @zexi

zhuhedong commented 7 months ago

[image] It has started again. @zexi

zexi commented 7 months ago

[image] @zexi

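@zhuhedong That BTRFS kernel error looks problematic. We have never used this filesystem on centos7, and I recommend ext4. I suggest fixing the BTRFS problem first and then checking whether the container still restarts.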
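For sifting through /var/log/messages, a filter along these lines may help. It is only a sketch: the keyword list (kernel OOM-killer wording and BTRFS error, critical, or warning prefixes) is an assumption about what counts as suspicious, the function name is hypothetical, and the sample lines are invented for illustration rather than taken from the screenshots in this thread.

```python
import re
from typing import Iterable, List

# Assumed keywords: kernel OOM-killer messages and BTRFS problem reports.
SUSPICIOUS = re.compile(
    r"oom-killer|Out of memory|Killed process|BTRFS (?:error|critical|warning)",
    re.IGNORECASE,
)


def suspicious_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that match any of the assumed keywords."""
    return [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]


# Invented sample lines, for illustration only.
SAMPLE = [
    "Dec 11 09:00:01 host kernel: BTRFS error (device sdb1): example message\n",
    "Dec 11 09:00:02 host kernel: Out of memory: Kill process 1234 (example) score 900\n",
    "Dec 11 09:00:03 host systemd: Started Session 1 of user root.\n",
]
for hit in suspicious_lines(SAMPLE):
    print(hit)
```

On a real node the input would be an open file handle, for example the result of open("/var/log/messages", errors="replace"), which the function can iterate line by line.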

zhuhedong commented 7 months ago

Would your platform be more stable on Ubuntu? @zexi

zexi commented 7 months ago

@zhuhedong centos7 with the ext4 filesystem is the most tested and most stable combination.

zhuhedong commented 7 months ago

OK, then what about ubuntu? @zexi

zexi commented 7 months ago

@zhuhedong I haven't used the btrfs filesystem myself. Whatever the distribution, I have always used ext4.

zexi commented 7 months ago

OK, then what about ubuntu? @zexi

ubuntu support was only added in the previous release, so it has been tested less than centos 7.