Closed zhuhedong closed 7 months ago
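@zhuhedong It looks like you placed the disk path under /home/system_path. I suggest migrating it to a directory under /opt/cloud/workspace/ (you can bind-mount it there) and using a path under /opt/cloud/workspace in local_image_paths.
I configured that directory in the local_image_path property of the host.conf configuration file, and I have already created 6 virtual machines under it.
@zhuhedong Which Cloudpods version are you running?
I had upgraded to 3.10.8. After I rolled back to 3.10.7, it worked again.
Is there a bug in 3.10.8?
@zhuhedong Yes, this is a bug in 3.10.8, introduced when VM deployment was switched to use a lightweight virtual machine. Thanks for the report. We will fix it as soon as possible.
I don't know whether this is a bug or a problem with my services: the service suddenly returned 503, and then it recovered on its own.
@zhuhedong Did the 503 appear while you were downgrading?
Now I can't get in again.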
{
"class": "ServiceAbnormal",
"code": 499,
"details": "compute服务异常,请检查服务状态",
"time": "2023-12-11T00:52:19+08:00"
}
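@swordqiu No. It appeared suddenly, twice in a row. The first time it recovered on its own; right now I cannot perform any operations.
It's working again now...
@zhuhedong Run kubectl -n onecloud get pods -l app=region to check the status of the pods for the region service. If a pod is abnormal, check its logs with: kubectl -n onecloud logs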
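The check above can be sketched as a small script. This is only an illustration: it assumes kubectl on the control node can reach the cluster, and the loop over pods and the --tail=200 limit are my own choices rather than anything prescribed in this thread.

```shell
#!/usr/bin/env bash
# Sketch: list the region pods and dump recent logs for each of them.
# Assumption: kubectl on this node can reach the cluster running the onecloud namespace.

region_pod_logs() {
  local ns="onecloud" pod
  # Show pod status first (the STATUS and RESTARTS columns are the interesting ones).
  kubectl -n "$ns" get pods -l app=region
  # -o name prints entries such as pod/<name>, which "kubectl logs" accepts directly.
  for pod in $(kubectl -n "$ns" get pods -l app=region -o name); do
    echo "===== $pod ====="
    kubectl -n "$ns" logs "$pod" --tail=200
  done
}

# Only run when kubectl is actually installed.
if command -v kubectl >/dev/null 2>&1; then
  region_pod_logs
fi
```

If a pod has already restarted, adding --previous to the logs command shows the output of the previous container instance.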
[warning 2023-12-10 17:06:58 appsrv.do_worker_watchdog(workers_watchdog.go:64)] WorkerManager LogClientWorkerManager has been busy for 2 cycles... Post "https://default-logger:30999/actions": dial tcp: lookup default-logger on 10.96.0.10:53: read udp 10.40.45.54:55567->10.96.0.10:53: read: connection refused
[error 2023-12-10 17:06:59 logclient.(*logTask).Run(logclient.go:249)] create action log {"action":"update_status","domain":"Default","domain_id":"default","ip":"10.107.211.133","notes":"running=>unknown: host offline","obj_id":"77303677-01f7-4adc-8ff8-7a9a522a58c2","obj_name":"h601","obj_type":"server","owner_domain_id":"default","owner_tenant_id":"a386cf653a9e4f728344294ca20e2b6f","project_domain":"Default","project_domain_id":"default","roles":"admin","service":"compute","severity":"ERROR","success":false,"tenant":"system","tenant_id":"a386cf653a9e4f728344294ca20e2b6f","user":"regionadmin","user_id":"8870f888a78446e18ffe8021b68625b4"} failed {"error":{"class":"ServiceAbnormal","code":499,"data":{"fields":["log"],"id":"%s service dns resolve error, please check dns setting"},"details":"log service dns resolve error, please check dns setting"}}
@swordqiu The VMs I create now no longer have any key information..
@zhuhedong Please paste the host-deployer logs.
Is the disk /dev/nbd13p1 one of yours?
[info 2023-12-11 08:13:51 deployserver.(*DeployerServer).ResizeFs(deployserver.go:135)] ***** Resize fs on qemuimg.SImageInfo{Path:"/home/system_path/7b252524-04aa-4c6a-86ac-44c73c61beb2", Format:"", IoLevel:0, Password:"", EncryptFormat:"", EncryptAlg:"", secId:""}
[info 2023-12-11 08:13:51 nbd.(NBDDriver).Connect(driver.go:68)] lock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[error 2023-12-11 08:13:53 nbd.(NBDDriver).setupLVMS(driver.go:148)] unable to find vg from /dev/nbd13p1: unable to find vg, output is " No volume groups found.\n"
[info 2023-12-11 08:13:53 nbd.(NBDDriver).Connect(driver.go:88)] /dev/nbd13 hasLVM false err
[error 2023-12-11 08:13:54 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[error 2023-12-11 08:13:56 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[info 2023-12-11 08:13:57 xfsutils.UnlockXfsPartition(lock.go:43)] xfs unlock f956f023-a2a0-4fd5-8e89-0b44b0848ab3
[error 2023-12-11 08:13:57 kvmpart.(SKVMGuestDiskPartition).Mount(kvmpart.go:125)] SKVMGuestDiskPartition mount error: mount failed: tried 3 times, exceeding 3, last err : exit status 32
[info 2023-12-11 08:13:57 guestfs.IsPartitionReadonly(core.go:219)] File system /tmp/_dev_nbd13p1 is not readonly
[info 2023-12-11 08:13:57 kvmpart.(SKVMGuestDiskPartition).Mount(kvmpart.go:139)] mount fs xfs on /dev/nbd13p1 successfully
[error 2023-12-11 08:13:57 kvmpart.(*SKVMGuestDiskPartition).IsMounted(kvmpart.go:245)] /tmp/_dev_nbd13p1 is not a mountpoint: /tmp/_dev_nbd13p1 is not a mountpoint
[info 2023-12-11 08:13:57 nbd.(NBDDriver).Disconnect(driver.go:200)] Disconnect lock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[info 2023-12-11 08:13:57 nbd.(NBDDriver).Disconnect(driver.go:207)] Disconnect unlock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[info 2023-12-11 08:13:58 deployserver.(*DeployerServer).DeployGuestFs(deployserver.go:83)] ***** Deploy guest fs on qemuimg.SImageInfo{Path:"/home/system_path/7b252524-04aa-4c6a-86ac-44c73c61beb2", Format:"", IoLevel:0, Password:"", EncryptFormat:"", EncryptAlg:"", secId:""}
[info 2023-12-11 08:13:58 nbd.(NBDDriver).Connect(driver.go:68)] lock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[error 2023-12-11 08:14:00 nbd.(NBDDriver).setupLVMS(driver.go:148)] unable to find vg from /dev/nbd13p1: unable to find vg, output is " No volume groups found.\n"
[info 2023-12-11 08:14:00 nbd.(NBDDriver).Connect(driver.go:88)] /dev/nbd13 hasLVM false err
[error 2023-12-11 08:14:01 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[error 2023-12-11 08:14:03 kvmpart.(*SKVMGuestDiskPartition).mount.func2(kvmpart.go:186)] mount fail: exit status 32 mount: /tmp/_dev_nbd13p1: wrong fs type, bad option, bad superblock on /dev/nbd13p1, missing codepage or helper program, or other error.
[info 2023-12-11 08:14:04 xfsutils.UnlockXfsPartition(lock.go:43)] xfs unlock f956f023-a2a0-4fd5-8e89-0b44b0848ab3
[error 2023-12-11 08:14:04 kvmpart.(SKVMGuestDiskPartition).Mount(kvmpart.go:125)] SKVMGuestDiskPartition mount error: mount failed: tried 3 times, exceeding 3, last err : exit status 32
[info 2023-12-11 08:14:04 guestfs.IsPartitionReadonly(core.go:219)] File system /tmp/_dev_nbd13p1 is not readonly
[info 2023-12-11 08:14:04 kvmpart.(SKVMGuestDiskPartition).Mount(kvmpart.go:139)] mount fs xfs on /dev/nbd13p1 successfully
[error 2023-12-11 08:14:04 kvmpart.(*SKVMGuestDiskPartition).IsMounted(kvmpart.go:245)] /tmp/_dev_nbd13p1 is not a mountpoint: /tmp/_dev_nbd13p1 is not a mountpoint
[error 2023-12-11 08:14:04 deployserver.(DeployerServer).DeployGuestFs(deployserver.go:104)] disk.MountRootfs not found partition, not init, quit
[info 2023-12-11 08:14:04 nbd.(NBDDriver).Disconnect(driver.go:200)] Disconnect lock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
[info 2023-12-11 08:14:04 nbd.(*NBDDriver).Disconnect(driver.go:207)] Disconnect unlock root image path /opt/cloud/workspace/disks/image_cache/93fa0998-f72a-4a25-8bb5-8d248bb5c0af
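@swordqiu This is the log.
@zhuhedong In 3.10.7, the VM password is set by mounting the VM's disk through the host's nbd module. Judging from the logs, the host apparently failed to recognize the VM's xfs file system. A common cause is that the host kernel is older than the VM's kernel, so the host cannot mount the VM's xfs file system. What are the operating system type and kernel version of the host and of the VM?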
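To see the kernel's own reason for the failed mount, the nbd attach and mount can be reproduced by hand. The following is a rough sketch and not the code path host-deployer uses: IMAGE, NBD_DEV and MNT are hypothetical placeholders, and the script only prints the commands unless DRY_RUN=0 is set and it is run as root.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the nbd mount by hand to see the kernel's own error message.
# Assumptions (hypothetical values): IMAGE, NBD_DEV and MNT below; run as root on the host.
IMAGE="${IMAGE:-/path/to/guest-disk.qcow2}"
NBD_DEV="${NBD_DEV:-/dev/nbd0}"
MNT="${MNT:-/mnt/guest-root}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

try_nbd_mount() {
  run modprobe nbd max_part=16
  run qemu-nbd --read-only --connect="$NBD_DEV" "$IMAGE"
  run mkdir -p "$MNT"
  # ro,norecovery avoids replaying the xfs log on a read-only device.
  run mount -t xfs -o ro,norecovery "${NBD_DEV}p1" "$MNT"
  # The last kernel messages usually explain why an xfs mount was rejected.
  run sh -c 'dmesg | tail -n 20'
  run umount "$MNT"
  run qemu-nbd --disconnect "$NBD_DEV"
}

try_nbd_mount
```

If the mount step fails, the dmesg output printed right after it is the part worth pasting into the issue.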
@swordqiu 5.4.130-1.yn20230805.el7.x86_64
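@zhuhedong And the VM's?
@swordqiu I can't log in to the VM, so I can't check its kernel. The image file is CentOS-7-x86_64-GenericCloud-2211.qcow2
@zhuhedong This is a problem that comes up often with nbd. I suggest switching to 3.10.8, but do not store local disks under /home/system_path; instead, bind-mount that directory to a directory under /opt/cloud/workspace.
But your 3.10.8 currently has a bug, the one from last night.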
@swordqiu
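@zhuhedong For now, 3.10.8 cannot support local disk directories outside /opt/cloud. You can use a bind mount to mount the disk directory under /opt/cloud/workspace and use that as the disk storage directory.
So if I want to use 3.10.8, I have to go through a mount like that, right? @swordqiu
@zhuhedong Yes. This is a limitation of the current mechanism. Before 3.10.8, the VM disk was mounted through nbd for initialization, which suffers from the unreliable password initialization described above. Starting with 3.10.8, a lightweight virtual machine mounts the disk for initialization, but because of container restrictions it cannot access disk files in directories outside /opt/cloud. I suggest using a bind mount to work around this limitation for now.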
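A minimal sketch of the bind-mount workaround follows. The source path is the one from this thread; the target directory name under /opt/cloud/workspace is a hypothetical example, and the script only prints the commands so they can be reviewed before being run as root.

```shell
#!/usr/bin/env bash
# Sketch only: SRC comes from this thread, DST is a hypothetical target directory name.
SRC="${SRC:-/home/system_path}"
DST="${DST:-/opt/cloud/workspace/system_path}"

# Returns 0 if the path is already under /opt/cloud, 1 otherwise.
is_under_opt_cloud() {
  case "$1" in
    /opt/cloud|/opt/cloud/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the commands for a bind mount instead of running them (dry run).
print_bind_plan() {
  local src="$1" dst="$2"
  if is_under_opt_cloud "$src"; then
    echo "# $src is already under /opt/cloud, nothing to do"
    return 0
  fi
  echo "mkdir -p $dst"
  echo "mount --bind $src $dst"
  # An fstab entry with the "bind" option makes the mount survive a reboot.
  echo "echo '$src $dst none bind 0 0' >> /etc/fstab"
}

print_bind_plan "$SRC" "$DST"
```

After the bind mount is in place, the local image path in host.conf would need to point at the directory under /opt/cloud/workspace rather than at the original location.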
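If I move my existing data over, will there be any problems? @swordqiu
@swordqiu It suddenly stopped working again...
@zhuhedong First run docker ps -a | grep apiserver
to check the state of the kube-apiserver container. If it has exited, run docker logs $container_id
to view its logs.
Also run journalctl -u kubelet --no-pager --since '2 hours ago'
to view the kubelet service logs.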
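These checks can be combined into one script. This is a sketch under the assumption that the control node runs Kubernetes on Docker, as it appears to in this thread; the function names and the line limits are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: show logs of exited kube-apiserver containers plus recent kubelet logs.
# Assumption: the control node runs Kubernetes on Docker.

exited_apiserver_ids() {
  # The name filter matches by substring, so related pause containers may be listed too.
  docker ps -a --filter "name=apiserver" --filter "status=exited" --format '{{.ID}}'
}

dump_apiserver_logs() {
  local id
  for id in $(exited_apiserver_ids); do
    echo "===== apiserver container $id ====="
    docker logs --tail 100 "$id" 2>&1
  done
}

# Only run when docker is actually installed.
if command -v docker >/dev/null 2>&1; then
  dump_apiserver_logs
  journalctl -u kubelet --no-pager --since '2 hours ago' | tail -n 200
fi
```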
One of them is down.
@zexi
@zhuhedong Check the logs of the apiserver container that went down: docker logs f8b433
and paste them here.
@zexi
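This one is from k8s @zexi
@zhuhedong Is 192.168.1.10 also a control node? Also run docker ps -a | grep etcd |grep adver
to check the state of the etcd container and whether it has exited. If it has, post its logs here.
192.168.1.10 used to be a compute node; I am going to reset that node today. 192.168.1.60 was newly installed yesterday. Because resetting the key failed to generate one and the directory could not be mounted, I reinstalled and upgraded to 3.10.8.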
@zexi
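It's back to normal now and I can operate again. Every time this happens, it recovers on its own after about half an hour. @zexi
@zhuhedong First, describe your current deployment architecture: is it a 3-node high-availability setup or a single control node?
@zexi I'm running a single node right now.
@zhuhedong The only suspicious thing visible so far is that apiserver was restarted, but the container logs do not appear to show anything abnormal.
It looks like the operating system may be killing the container or something similar. Is the node's operating system centos 7 or another distribution? If it is centos 7, check the /var/log/messages
file for anything abnormal in the system logs.
It is centos 7.
@zhuhedong Is there anything abnormal in /var/log/messages?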
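One way to narrow down a large /var/log/messages is to filter for typical signs of the OS killing a process or hitting kernel errors. This is a sketch: the keyword list is an assumption and only a starting point, not an exhaustive filter.

```shell
#!/usr/bin/env bash
# Sketch: look for signs that the OS killed a process or reported kernel errors.
# Assumption: the keyword list below is illustrative, not exhaustive.

scan_syslog() {
  local file="${1:-/var/log/messages}"
  # -i: case-insensitive, -n: show line numbers, -E: extended regular expression
  grep -inE 'out of memory|oom-killer|killed process|segfault|i/o error|call trace' "$file"
}

# Only run when the system log is present and readable.
if [ -r /var/log/messages ]; then
  scan_syslog /var/log/messages | tail -n 50
fi
```

Comparing the timestamps of any matches with the times the service became unavailable helps tell related events from noise.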
@zexi
It has started again @zexi
@zexi
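@zhuhedong That BTRFS kernel error looks problematic. We have never used this file system on centos7; we recommend ext4. I suggest fixing the BTRFS problem first and then checking whether the containers still restart.
Would this be more stable on Ubuntu? @zexi
@zhuhedong centos7 with the ext4 file system is the most tested and most stable combination.
OK, and what about ubuntu? @zexi
@zhuhedong I have not used the btrfs file system myself; regardless of distribution, I have always used ext4.
OK, and what about ubuntu? @zexi
ubuntu has only been supported since the previous release and has been tested less than centos 7.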
{ "reason": "Deploy guest fs: request deploy guest fs: rpc error: code = Unknown desc = Connect: failed start guest qemu-kvm: -drive file=/home/system_path/aa5c6997-1d49-470d-8dd2-35fb0daecad3,if=none,id=drive_0,cache=none: Could not open '/home/system_path/aa5c6997-1d49-470d-8dd2-35fb0daecad3': No such file or directory\n: exit status 1", "stage": "OnDeployGuestComplete", "status": "error" }
Is this because tmp is full?
![image](https://github.com/yunionio/cloudpods/assets/44915307/d0c82018-fd74-4255-bcea-3f95a16b7c49)