Open yulongz opened 1 week ago
What happened: Two VMs failed to be created. The host service on the corresponding host machine logged the following:

[info 2024-11-14 07:19:34 isolated_device.getPassthroughGPUS(gpu.go:75)] filter address []
[info 2024-11-14 07:19:35 isolated_device.(PCIDevice).IsBootVGA(gpu.go:321)] PCI address 03:00.0 is boot_vga: /sys/devices/pci0000:00/0000:00:1c.2/0000:02:00.0/0000:03:00.0/boot_vga
[info 2024-11-14 07:19:35 isolated_device.getPassthroughGPUS(gpu.go:98)] skip boot vga device 03:00.0
[info 2024-11-14 07:19:36 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 4f:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 4f:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 52:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:36 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 52:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 workmanager.(workerTask).Run(manager.go:95)] DelayTask complete: {"telegraf_deployed":false}
[info 2024-11-14 07:19:37 modules.TaskComplete(task.go:34)] Sync task a8fb0d5c-f5a6-415f-84da-585d48be5f7f complete succ
[info 2024-11-14 07:19:37 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7-U9zE9ymrQmA= 200 882365-bbf8c2 POST /servers/cb6eb842-c430-409f-8e7a-6ccd0914b192/start (10.x.x.x:52693:compute_v2) 6.17ms
[error 2024-11-14 07:19:37 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference
goroutine 57750 [running]:
runtime/debug.Stack()
    /usr/lib/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
    /usr/lib/go/src/runtime/debug/stack.go:16 +0x19
yunion.io/x/onecloud/pkg/appsrv.execCallback.func1()
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:246 +0xdd
panic({0x29a2920, 0x54c7ac0})
    /usr/lib/go/src/runtime/panic.go:838 +0x207
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).hasGPU(0xc000f54380)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:870 +0x9f
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).HideKVM(0xc000f54380)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:153 +0xfd
yunion.io/x/onecloud/pkg/hostman/guestman/arch.(X86).GenerateCpuDesc(0xc000f54380?, 0x10, 0xf0, {0x3565cf8, 0xc000f54380})
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/arch/x86.go:131 +0x52
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).initCpuDesc(0xc000f54380, 0x0)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:886 +0x7a
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).initGuestDesc(0xc000f54380)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/pci.go:53 +0x25
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).updateGuestDesc(0xc000f54380)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:159 +0x1f3
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).asyncScriptStart(0xc000f54380, {0x355a300, 0xc00184e660}, {0x2e6a6a0?, 0xc00228aa60})
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:816 +0x1a5
yunion.io/x/onecloud/pkg/hostman/guestman.(guestStartTask).Run(0xc00228afa0)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:2032 +0x3b
yunion.io/x/onecloud/pkg/appsrv.execCallback(0xc001286f50?)
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:249 +0x58
yunion.io/x/onecloud/pkg/appsrv.(SWorker).run(0xc0028089f0)
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:92 +0x70
created by yunion.io/x/onecloud/pkg/appsrv.(SWorkerManager).scheduleWithLock
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:268 +0x165
[info 2024-11-14 07:19:37 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 56:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 56:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 57:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:37 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] 57:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] ce:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] ce:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 workmanager.(workerTask).Run(manager.go:95)] DelayTask complete: {"telegraf_deployed":false}
[info 2024-11-14 07:19:38 modules.TaskComplete(task.go:34)] Sync task 6dc8a284-d3e4-4882-894a-54d72d4c8be3 complete succ
[info 2024-11-14 07:19:38 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d1:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:38 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d1:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:39 appsrv.(*Application).ServeHTTP(appsrv.go:289)] GBAjy6fWKymrO7-U9zE9ymrQmA= 200 5b5cde-d3e11d POST /servers/c3d8d60a-c311-47e8-8c00-4e84707893aa/start (10.x.x.x:26790:compute_v2) 4.28ms
[error 2024-11-14 07:19:39 appsrv.execCallback.func1(workers.go:242)] WorkerManager exec callback error: runtime error: invalid memory address or nil pointer dereference
goroutine 57857 [running]:
runtime/debug.Stack()
    /usr/lib/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
    /usr/lib/go/src/runtime/debug/stack.go:16 +0x19
yunion.io/x/onecloud/pkg/appsrv.execCallback.func1()
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:246 +0xdd
panic({0x29a2920, 0x54c7ac0})
    /usr/lib/go/src/runtime/panic.go:838 +0x207
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).hasGPU(0xc000eba460)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:870 +0x9f
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).HideKVM(0xc000eba460)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:153 +0xfd
yunion.io/x/onecloud/pkg/hostman/guestman/arch.(X86).GenerateCpuDesc(0xc000eba460?, 0x10, 0xf0, {0x3565cf8, 0xc000eba460})
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/arch/x86.go:131 +0x52
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).initCpuDesc(0xc000eba460, 0x0)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvmhelper.go:886 +0x7a
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).initGuestDesc(0xc000eba460)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/pci.go:53 +0x25
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).updateGuestDesc(0xc000eba460)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:159 +0x1f3
yunion.io/x/onecloud/pkg/hostman/guestman.(SKVMGuestInstance).asyncScriptStart(0xc000eba460, {0x355a300, 0xc00234ad20}, {0x2e6a6a0?, 0xc0003f7e80})
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:816 +0x1a5
yunion.io/x/onecloud/pkg/hostman/guestman.(guestStartTask).Run(0xc0017603e0)
    /root/go/src/yunion.io/x/onecloud/pkg/hostman/guestman/qemu-kvm.go:2032 +0x3b
yunion.io/x/onecloud/pkg/appsrv.execCallback(0xc002118780?)
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:249 +0x58
yunion.io/x/onecloud/pkg/appsrv.(SWorker).run(0xc000ba79b0)
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:92 +0x70
created by yunion.io/x/onecloud/pkg/appsrv.(SWorkerManager).scheduleWithLock
    /root/go/src/yunion.io/x/onecloud/pkg/appsrv/workers.go:268 +0x165
[info 2024-11-14 07:19:39 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d5:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:39 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d5:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:40 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d6:00.0 already use vfio-pci driver
[info 2024-11-14 07:19:40 isolated_device.(PCIDevice).forceBindVFIOPCIDriver(gpu.go:339)] d6:00.0 already use vfio-pci driver
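Both traces put the panic inside hasGPU (qemu-kvmhelper.go:870), reached from HideKVM during guest start. The log does not show which pointer was nil, so the following is only a minimal sketch of this class of failure and of a defensive guard. The types guestInstance, guestDesc and isolatedDevice, the Desc and DevType fields, and the "GPU" value are all hypothetical stand-ins for illustration, not the actual onecloud definitions.

```go
package main

import "fmt"

// isolatedDevice, guestDesc and guestInstance are hypothetical stand-ins,
// NOT the real onecloud types. They only model "a guest that may carry
// passthrough devices".
type isolatedDevice struct {
	DevType string
}

type guestDesc struct {
	IsolatedDevices []*isolatedDevice
}

type guestInstance struct {
	Desc *guestDesc // assumption: may be nil, or may contain nil entries
}

// hasGPUUnsafe dereferences every pointer without checking. It panics with
// "invalid memory address or nil pointer dereference" if Desc or an entry is nil.
func (g *guestInstance) hasGPUUnsafe() bool {
	for _, dev := range g.Desc.IsolatedDevices {
		if dev.DevType == "GPU" {
			return true
		}
	}
	return false
}

// hasGPU checks each pointer before use and treats missing data as "no GPU".
func (g *guestInstance) hasGPU() bool {
	if g == nil || g.Desc == nil {
		return false
	}
	for _, dev := range g.Desc.IsolatedDevices {
		if dev != nil && dev.DevType == "GPU" {
			return true
		}
	}
	return false
}

// panics reports whether calling f panicked.
func panics(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	// One nil entry followed by a GPU entry.
	g := &guestInstance{Desc: &guestDesc{
		IsolatedDevices: []*isolatedDevice{nil, {DevType: "GPU"}},
	}}
	fmt.Println("unsafe version panics:", panics(func() { g.hasGPUUnsafe() }))
	fmt.Println("guarded version result:", g.hasGPU())
}
```

Whether returning false on missing data is the right behavior for the real hasGPU is a design question for the maintainers; the sketch only shows that the guard avoids the crash.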
Environment:
OS (cat /etc/os-release): ubuntu2204
Kernel (uname -a): Linux cloud-node-0133 5.15.0-124-generic #134-Ubuntu SMP Fri Sep 27 20:20:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Host (dmidecode | egrep -i 'manufacturer|product' |sort -u):
    idProduct: 0x03ee
    Manufacturer: Intel(R) Corporation
    Manufacturer: NO DIMM
    Manufacturer: Samsung
    Manufacturer: Supermicro
    Manufacturer: SUPERMICRO
    Memory Subsystem Controller Manufacturer ID: Unknown
    Memory Subsystem Controller Product ID: Unknown
    Module Manufacturer ID: Bank 1, Hex 0xCE
    Module Manufacturer ID: Unknown
    Module Product ID: Unknown
    Product Name: SYS-420GP-TNR
    Product Name: X12DPG-OA6
Service version (kubectl exec -n onecloud $(kubectl get pods -n onecloud | grep climc | awk '{print $1}') -- climc version-list): 3.10.15