yunionio / cloudpods

A cloud-native open-source unified multi-cloud and hybrid-cloud platform.
https://www.cloudpods.org
Apache License 2.0
2.61k stars, 535 forks

[BUG] Creating a VM while the host is handling a probe-isolated-devices request fails with an error and times out #21610

Open yulongz opened 1 week ago

yulongz commented 1 week ago

What happened: Two VM creations failed. The host service log on the corresponding host shows the following:

[info 2024-11-14 07:19:34 isolated_device.getPassthroughGPUS(gpu.go:75)] filter address []
[info 2024-11-14 07:19:35 isolated_device.(*PCIDevice).IsBootVGA(gpu.go:321)] PCI address 03:00.0 is boot_vga: /sys/devices/pci0000:00/0000:00:1c.2/0000:02:00.0/0000:03:00.0/boot_vga
[info 2024-11-14 07:19:35 isolated_device.getPassthroughGPUS(gpu.go:98)] skip boot vga device 03:00.0
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 4f:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 4f:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 52:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 52:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 workmanager.(*workerTask).Run(manager.go:95)] DelayTask complete: {"telegraf_deployed":false}
[info 2024-11-14 07:19:37 modules.TaskComplete(task.go:34)] Sync task a8fb0d5c-f5a6-415f-84da-585d48be5f7f complete succ
[info 2024-11-14 07:19:37 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7-U9zE9ymrQmA= 200 882365-bbf8c2 POST /servers/cb6eb842-c430-409f-8e7a-6ccd0914b192/start (10.x.x.x:52693:compute_v2) 6.17ms
[error 2024-11-14 07:19:37 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference
goroutine 57750 [running]:
runtime/debug.Stack()
    /usr/lib/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
    /usr/lib/go/src/runtime/debug/stack.go:16 +0x19
yunion.io/x/onecloud/pkg/appsrv.execCallback.func1()
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:246 +0xdd
panic({0x29a2920, 0x54c7ac0})
    /usr/lib/go/src/runtime/panic.go:838 +0x207
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).hasGPU(0xc000f54380)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:870 +0x9f
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).HideKVM(0xc000f54380)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:153 +0xfd
yunion.io/x/onecloud/pkg/hostman/guestman/arch.(*X86).GenerateCpuDesc(0xc000f54380?, 0x10, 0xf0, {0x3565cf8, 0xc000f54380})
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/arch/x86.go:131 +0x52
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initCpuDesc(0xc000f54380, 0x0)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:886 +0x7a
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initGuestDesc(0xc000f54380)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/pci.go:53 +0x25
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).updateGuestDesc(0xc000f54380)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:159 +0x1f3
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).asyncScriptStart(0xc000f54380, {0x355a300, 0xc00184e660}, {0x2e6a6a0?, 0xc00228aa60})
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:816 +0x1a5
yunion.io/x/onecloud/pkg/hostman/guestman.(*guestStartTask).Run(0xc00228afa0)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:2032 +0x3b
yunion.io/x/onecloud/pkg/appsrv.execCallback(0xc001286f50?)
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:249 +0x58
yunion.io/x/onecloud/pkg/appsrv.(*SWorker).run(0xc0028089f0)
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:92 +0x70
created by yunion.io/x/onecloud/pkg/appsrv.(*SWorkerManager).scheduleWithLock
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:268 +0x165
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 56:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 56:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 57:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 57:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] ce:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] ce:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 workmanager.(*workerTask).Run(manager.go:95)] DelayTask complete: {"telegraf_deployed":false}
[info 2024-11-14 07:19:38 modules.TaskComplete(task.go:34)] Sync task 6dc8a284-d3e4-4882-894a-54d72d4c8be3 complete succ
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d1:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d1:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:39 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7-U9zE9ymrQmA= 200 5b5cde-d3e11d POST /servers/c3d8d60a-c311-47e8-8c00-4e84707893aa/start (10.x.x.x:26790:compute_v2) 4.28ms
[error 2024-11-14 07:19:39 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference
goroutine 57857 [running]:
runtime/debug.Stack()
    /usr/lib/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
    /usr/lib/go/src/runtime/debug/stack.go:16 +0x19
yunion.io/x/onecloud/pkg/appsrv.execCallback.func1()
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:246 +0xdd
panic({0x29a2920, 0x54c7ac0})
    /usr/lib/go/src/runtime/panic.go:838 +0x207
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).hasGPU(0xc000eba460)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:870 +0x9f
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).HideKVM(0xc000eba460)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:153 +0xfd
yunion.io/x/onecloud/pkg/hostman/guestman/arch.(*X86).GenerateCpuDesc(0xc000eba460?, 0x10, 0xf0, {0x3565cf8, 0xc000eba460})
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/arch/x86.go:131 +0x52
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initCpuDesc(0xc000eba460, 0x0)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:886 +0x7a
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).initGuestDesc(0xc000eba460)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/pci.go:53 +0x25
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).updateGuestDesc(0xc000eba460)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:159 +0x1f3
yunion.io/x/onecloud/pkg/hostman/guestman.(*SKVMGuestInstance).asyncScriptStart(0xc000eba460, {0x355a300, 0xc00234ad20}, {0x2e6a6a0?, 0xc0003f7e80})
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:816 +0x1a5
yunion.io/x/onecloud/pkg/hostman/guestman.(*guestStartTask).Run(0xc0017603e0)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:2032 +0x3b
yunion.io/x/onecloud/pkg/appsrv.execCallback(0xc002118780?)
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:249 +0x58
yunion.io/x/onecloud/pkg/appsrv.(*SWorker).run(0xc000ba79b0)
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:92 +0x70
created by yunion.io/x/onecloud/pkg/appsrv.(*SWorkerManager).scheduleWithLock
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:268 +0x165
[info 2024-11-14 07:19:39 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d5:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:39 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d5:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:40 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d6:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:40 isolated_device.(*PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d6:00.0 already use vfio-pci driver
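
Both traces panic inside `hasGPU` while the host is still logging device probe activity. The source of `hasGPU` is not included in this report, so the following is only a guess at the shape of the failure, not the actual onecloud code. It assumes a device lookup can return nil while the device table is being rebuilt by a probe. Every name in it (`deviceRegistry`, `lookup`, `hasGPUUnguarded`, `hasGPUGuarded`) is hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// isolatedDevice is a hypothetical stand-in for a passthrough device record.
type isolatedDevice struct {
	addr  string
	isGPU bool
}

// deviceRegistry is a hypothetical host-side device table that a probe may rebuild.
type deviceRegistry struct {
	mu      sync.RWMutex
	devices map[string]*isolatedDevice
}

// lookup returns nil when addr is not currently registered,
// for example while a probe is repopulating the table.
func (r *deviceRegistry) lookup(addr string) *isolatedDevice {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.devices[addr]
}

// hasGPUUnguarded dereferences the lookup result without a nil check,
// so a missing device causes a nil pointer dereference panic.
func hasGPUUnguarded(r *deviceRegistry, addrs []string) bool {
	for _, a := range addrs {
		if r.lookup(a).isGPU {
			return true
		}
	}
	return false
}

// hasGPUGuarded reports a missing device as an error instead of panicking.
func hasGPUGuarded(r *deviceRegistry, addrs []string) (bool, error) {
	for _, a := range addrs {
		dev := r.lookup(a)
		if dev == nil {
			return false, fmt.Errorf("device %s not found, probe may be in progress", a)
		}
		if dev.isGPU {
			return true, nil
		}
	}
	return false, nil
}

// panics reports whether f panicked.
func panics(f func()) (p bool) {
	defer func() {
		if recover() != nil {
			p = true
		}
	}()
	f()
	return false
}

func main() {
	// An empty table models the assumed state in the middle of a probe.
	reg := &deviceRegistry{devices: map[string]*isolatedDevice{}}
	addrs := []string{"4f:00.0"}

	fmt.Println("unguarded panics:", panics(func() { hasGPUUnguarded(reg, addrs) }))
	_, err := hasGPUGuarded(reg, addrs)
	fmt.Println("guarded error:", err)
}
```

If the real cause is similar, returning an error from the guarded path would let the start task fail with a clear message, or be retried, rather than panicking in the worker.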

Environment: