yunionio / cloudpods

A cloud-native open-source unified multi-cloud and hybrid-cloud platform.
https://www.cloudpods.org
Apache License 2.0
2.49k stars 497 forks

[Help] Where is the configuration file for the DHCP service of the baremetal physical machine management component? #20332

Closed NeckC closed 4 days ago

NeckC commented 1 month ago

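During physical machine registration, DHCP fails to assign an address, and I cannot find which image contains the DHCP service for physical machines. I need to see how DHCP allocation is configured in the configuration file of the baremetal image, and I need to check the DHCP allocation logs and service status. On the Basic Resources -> Physical Machines -> Add Physical Machine page, I selected PXE boot registration. I specified the IPMI address, username, and password; for the management port IP I chose a static IP from the DHCP subnet, and I also filled in the MAC address of the machine's eth0. The network is confirmed to be reachable in both directions, and the configuration follows the official documentation.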

zexi commented 1 month ago

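@NeckC The DHCP service starts together with the baremetal-agent service. You need to configure a DHCP relay. You can capture packets with tcpdump to check whether the requests reach the node where baremetal-agent runs.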
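A minimal sketch of the capture suggested above, to be run as root on the node hosting baremetal-agent. The function name `capture_dhcp` and the default interface `any` are assumptions for illustration; only standard tcpdump flags are used. The filter covers both DHCP ports, since a relayed request normally arrives as unicast UDP on port 67.

```shell
# Capture DHCP/BOOTP traffic (UDP ports 67 and 68) on the baremetal-agent node.
# Usage: capture_dhcp [interface]
# The interface defaults to "any" (assumption); pass the NIC that faces the relay.
capture_dhcp() {
  iface="${1:-any}"
  filter='udp and (port 67 or port 68)'
  # -nn: no name or port resolution
  # -e:  print link-layer (MAC) headers
  # -v:  decode the BOOTP/DHCP fields and options
  tcpdump -i "$iface" -nn -e -v "$filter"
}
```

For example, `capture_dhcp eth1`. If requests show up but no reply leaves the node, the relay path is probably fine and the agent side is the place to look.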

NeckC commented 1 month ago

@zexi The packet capture shows that the DHCP requests are received, but no reply is sent. So I would like to know where the DHCP configuration file of the baremetal-agent service is, and I also need to look at the detailed log messages inside the service.

zexi commented 1 month ago

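@NeckC While the physical machine is booting via PXE/DHCP, check the baremetal-agent log; it should contain an error. baremetal-agent selects the network based on the source of the DHCP relay, and selects the matching physical machine based on the MAC address in the DHCP request.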
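One way to look at that log, as a sketch only. It assumes a Kubernetes-based Cloudpods deployment in which the agent runs in a pod whose name contains `baremetal-agent`, in a namespace called `onecloud`; both are assumptions about the environment and not something stated in this thread. The function name `agent_dhcp_logs` is hypothetical.

```shell
# Print recent baremetal-agent log lines that mention DHCP or a given MAC.
# Usage: agent_dhcp_logs <client-mac>
# NAMESPACE defaults to "onecloud" (assumption about the deployment).
agent_dhcp_logs() {
  mac="${1:?usage: agent_dhcp_logs <client-mac>}"
  ns="${NAMESPACE:-onecloud}"
  # Pick the first pod whose name contains "baremetal-agent" (assumed naming).
  pod=$(kubectl -n "$ns" get pods -o name | grep baremetal-agent | head -n 1)
  if [ -z "$pod" ]; then
    echo "no baremetal-agent pod found in namespace $ns" >&2
    return 1
  fi
  # Keep only lines from the last 10 minutes that mention dhcp or the MAC.
  kubectl -n "$ns" logs --since=10m "$pod" | grep -i -e dhcp -e "$mac"
}
```

Run it right after power-cycling the machine into PXE, for example `agent_dhcp_logs a4:ae:12:65:18:75`, so the relevant lines fall inside the time window.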

NeckC commented 1 month ago

@zexi A question: which Cloudpods container holds this DHCP configuration file? (screenshot attached)

NOTE: "boot_file FILE" and "opt bootfile FILE" are conceptually the same,
but "boot_file" goes into BOOTP-defined fixed-size field in the packet,
whereas "opt bootfile" goes into DHCP option 0x43.
Same for "sname HOST" and "opt tftp HOST".

Static leases map

static_lease 00:60:08:11:CE:4E 192.168.0.54
static_lease 00:60:08:11:CE:3E 192.168.0.44 optional_hostname

The remainder of options are DHCP options and can be specified with the
keyword 'opt' or 'option'. If an option can take multiple items, such
as the dns option, they can be listed on the same line, or multiple
lines.

Examples:

opt dns 192.168.10.2 192.168.10.10
option subnet 255.255.255.0
opt router 192.168.10.2
opt wins 192.168.10.10
option dns 129.219.13.81 # appended to above DNS servers for a total of 3
option domain local
option lease 864000 # default: 10 days
option msstaticroutes 10.0.0.0/8 10.127.0.1 # single static route
option staticroutes 10.0.0.0/8 10.127.0.1, 10.11.12.0/24 10.11.12.1

zexi commented 1 month ago

@NeckC This does not appear to be the configuration of our DHCP service.

NeckC commented 1 month ago

@zexi Hello, then in which file can I view the Cloudpods DHCP service configuration?

NeckC commented 1 month ago

@zexi I would also like to see what a normal Cloudpods DHCP log looks like. The log I am looking at shows no response. (screenshot attached)

zexi commented 1 month ago

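@NeckC There is a warning in this log. It says that a DHCP request for MAC address a4:ae:12:65:18:75 arrived from the dhcp_relay 10.123.123.252, but no matching network was found. You need to create a subnet on the platform whose gateway is 10.123.123.252. If 10.123.123.252 is not the gateway in your actual environment, you can update the dhcp attribute of the corresponding subnet from the command line with the --dhcp parameter: climc network-update --dhcp 10.123.123.252 $SUBNET_NAME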
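To confirm which relay address the agent actually sees before creating or updating the subnet, the relay address and client MAC can be pulled out of a verbose tcpdump capture. This is a sketch that assumes your tcpdump version labels those fields `Gateway-IP` and `Client-Ethernet-Address` in `-v` output; the function name `relay_and_mac` is hypothetical.

```shell
# Read `tcpdump -v` BOOTP/DHCP output on stdin and print one
# "<relay-ip> <client-mac>" pair per decoded packet.
# A relay IP of "-" means no Gateway-IP field was seen for that packet.
relay_and_mac() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "Gateway-IP") gw = $(i + 1)
      if ($i == "Client-Ethernet-Address") {
        shown = (gw == "" ? "-" : gw)
        print shown, $(i + 1)
        gw = ""
      }
    }
  }'
}
```

A possible invocation is `tcpdump -i any -nn -v -l udp port 67 | relay_and_mac`. The first column is the address the subnet's gateway or `--dhcp` value would need to match.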

NeckC commented 1 month ago

@zexi Hello, what is the cause of this? (screenshot attached)

prepare=>prepare_fail: {"reason":"Do deploy: run /lib/mos/sysinit.sh: \"/lib/mos/sysinit.sh\" error: Process exited with status 255, cmd error: Huawei system initialization\nERROR:Can't get hole memory!!\n time 0.08 (ms).\nERROR:Can't get hole memory!!\n time 0.07 (ms).\nERROR:Can't get hole memory!!\n time 0.06 (ms).\nERROR:Can't get hole memory!!\n time 0.06 (ms).\nERROR:Can't get hole memory!!\n time 0.06 (ms).\nERROR:Can't get hole memory!!\n time 0.06 (ms).\n","stage":"OnSyncConfigComplete","status":"error"}

run /lib/mos/sysinit.sh: "/lib/mos/sysinit.sh" error: Process exited with status 255, cmd error: Huawei system initialization
ERROR:Can't get hole memory!!
time 0.08 (ms).
ERROR:Can't get hole memory!!
time 0.07 (ms).
ERROR:Can't get hole memory!!
time 0.06 (ms).
ERROR:Can't get hole memory!!
time 0.06 (ms).
ERROR:Can't get hole memory!!
time 0.06 (ms).
ERROR:Can't get hole memory!!
time 0.06 (ms).

{
  "reason": {
    "reason": {
      "reason": "Do deploy: run /lib/mos/sysinit.sh: \"/lib/mos/sysinit.sh\" error: Process exited with status 255, cmd error: Huawei system initialization\nERROR:Can't get hole memory!!\n time 0.08 (ms).\nERROR:Can't get hole memory!!\n time 0.07 (ms).\nERROR:Can't get hole memory!!\n time 0.06 (ms).\nERROR:Can't get hole memory!!\n time 0.06 (ms).\nERROR:Can't get hole memory!!\n time 0.06 (ms).\nERROR:Can't get hole memory!!\n time 0.06 (ms).\n",
      "stage": "OnSyncConfigComplete",
      "status": "error"
    },
    "stage": "OnSyncConfigComplete"
  },
  "stage": "OnPrepareComplete",
  "status": "error",
  "__task_name__": "BaremetalPrepareTask"
}

All three error blocks point to a memory (space) allocation problem; the initialization script /lib/mos/sysinit.sh failed to run.

zexi commented 1 month ago

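@NeckC I have not run into this problem before; it seems to be related to your server hardware. You can log in to the machine with the following command:

climc host-ssh $BAREMETAL_HOST_ID

Then run the following command to see which command reports the error:

sh -x /lib/mos/sysinit.sh

NeckC commented 1 month ago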

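@zexi (screenshot attached)

I ran sysinit.sh on the target host and found that all 6 commands it contains return "Can't get hole memory". Presumably the commands fail because the virtual system itself did not obtain memory space, or because it lacks permission. Running free -g shows my full memory (64G), and df -h shows 32G of space as normal. How should I troubleshoot this next?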

zexi commented 1 month ago

@NeckC Run sh -x /lib/mos/oem/huawei.sh and check which command call is failing.
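The `sh -x` runs suggested in this thread print every command before executing it, so the failing command is the trace line just above each error. A small sketch that pairs them up automatically follows; the function name `trace_errors` and the default pattern `ERROR` are assumptions for illustration, and it relies on the default `+ ` trace prefix and an `awk` being available in the environment.

```shell
# Run a script under `sh -x` and, for every output line matching a pattern,
# print the traced command that immediately preceded it.
# Usage: trace_errors <script> [pattern]
trace_errors() {
  script="$1"
  pattern="${2:-ERROR}"
  sh -x "$script" 2>&1 | awk -v pat="$pattern" '
    /^[+]+ / { last = $0; next }              # remember the latest trace line
    $0 ~ pat { print last; print "    " $0 }  # show it with the matching output
  '
}
```

For example, `trace_errors /lib/mos/oem/huawei.sh "hole memory"` would list each command that emitted the message, which is the information requested above.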

github-actions[bot] commented 4 days ago

If you do not provide feedback for more than 37 days, we will close the issue and you can either reopen it or submit a new issue.
