gurevichmark closed 5 years ago
btw we use the stock perl openbmc implementation and xCAT 2.14.3.
I have experienced perhaps something similar with f5u16 yesterday afternoon. After doing 7 or 8 rinstalls, the BMC was pingable but I could not ssh to it or issue commands to it:
[root@stratton01 ~]# rpower f5u16 bmcstate
f5u16: BMC Ready
[root@stratton01 ~]# rinstall f5u16 osimage=rhels7.5-alt-rhv4.2-ppc64le-netboot-compute
Provision node(s): f5u16
f5u16: BMC did not respond. Validate BMC configuration and retry the command. (timeout)
Error: [stratton01]: Failed to run 'rsetboot' against the following nodes: f5u16
[root@stratton01 ~]#
After about 24 hours, BMC was accessible again.
fyi: after a power reseat the BMC is reachable again. And I was just able to reproduce this again after some rinstall and rsetboot commands.
@Obihoernchen Can you show the firmware level on the box? rinv <node> firm
Saved BMC dump on stratton01 as /var/log/xcat/dump/20180924-1112_f5u16_dump_8.tar.xz after f5u16 recovered.
@whowutwut
BMC Firmware Product: ibm-v2.1-438-g0030304-r15-0-g19832d3 (Active)*
HOST Firmware Product: IBM-witherspoon-ibm-OP9-v2.0.8-2.2-prod (Active)*
HOST Firmware Product: -- additional info: buildroot-2018.02.1-6-ga8d1126
HOST Firmware Product: -- additional info: capp-ucode-p9-dd2-v4
HOST Firmware Product: -- additional info: hcode-hw080418a.op920
HOST Firmware Product: -- additional info: hostboot-binaries-hw080418a.op920
HOST Firmware Product: -- additional info: hostboot-d033213-pfb2e171
HOST Firmware Product: -- additional info: linux-4.16.13-openpower1-p328018f
HOST Firmware Product: -- additional info: machine-xml-7cd20a6
HOST Firmware Product: -- additional info: occ-084756c
HOST Firmware Product: -- additional info: op-build-v2.0.8-1-gc51594f
HOST Firmware Product: -- additional info: petitboot-v1.7.2-p8f11e93
HOST Firmware Product: -- additional info: sbe-55d6eb2
HOST Firmware Product: -- additional info: skiboot-v6.0.7
FW team thinks this is a side effect of the Broadcom NCSI bug on the shared port. The workaround is to power on the node via IPMI, but I'm not sure with what command (maybe ipmitool chassis power on?). Trying to get more information.
I'm kinda sure rsetboot <node> net is causing the issue, not rpower.
Every time this occurred rsetboot hung and I wasn't even able to run rpower, iirc.
I was able to recreate this on f5u16 again today. After about 10 reprovisions the BMC stopped responding to the REST API call that sets control/host0/boot/one_time/attr/BootSource:
[root@stratton01 tools]# rinstall f5u16 osimage=rhels7.5-alt-rhv4.2-ppc64le-netboot-compute
Provision node(s): f5u16
Mon Sep 24 16:42:22 2018 f5u16: [openbmc_debug] login curl -k -c cjar -b cjar -X POST -H "Content-Type: application/json" https://10.5.16.100/login -d '{"data": ["root", "xxxxxx"]}'
Mon Sep 24 16:42:22 2018 f5u16: [openbmc_debug] login 200 OK
Mon Sep 24 16:42:22 2018 f5u16: [openbmc_debug] set_one_time_boot_enable curl -k -c cjar -b cjar -X PUT -H "Content-Type: application/json" https://10.5.16.100/xyz/openbmc_project/control/host0/boot/one_time/attr/Enabled -d '{"data": 1}'
Mon Sep 24 16:42:23 2018 f5u16: [openbmc_debug] set_one_time_boot_enable 200 OK
Mon Sep 24 16:42:23 2018 f5u16: [openbmc_debug] set_one_time_boot_state curl -k -c cjar -b cjar -X PUT -H "Content-Type: application/json" https://10.5.16.100/xyz/openbmc_project/control/host0/boot/one_time/attr/BootSource -d '{"data": "xyz.openbmc_project.Control.Boot.Source.Sources.Network"}'
Mon Sep 24 16:42:53 2018 f5u16: [openbmc_debug] set_one_time_boot_state BMC did not respond. Validate BMC configuration and retry the command. (timeout)
f5u16: BMC did not respond. Validate BMC configuration and retry the command. (timeout)
Error: [stratton01]: Failed to run 'rsetboot' against the following nodes: f5u16
[root@stratton01 tools]#
Mon Sep 24 16:56:56 2018 f5u16: [openbmc_debug_perl] rflash_list_response 200 OK
f5u16: ID Purpose State Version
f5u16: -------------------------------------------------------
f5u16: 78d09908 BMC Active ibm-v2.0-0-r44-0-g843c2e1
f5u16: 1b3ffcf4 Host Active IBM-witherspoon-ibm-OP9_v1.19_1.173
f5u16: b04fff27 BMC Active(*) ibm-v2.0-0-r46-0-gbed584c
f5u16: e339c76d Host Active(*) IBM-witherspoon-ibm-OP9_v1.19_1.189
f5u16:
[root@stratton01 tools]#
Strange, if I run the command directly, it appears to work:
[root@stratton01 tools]# curl -k -c cjar -b cjar -X POST -H "Content-Type: application/json" https://10.5.16.100/login -d '{"data": ["root", "xxxxxxx"]}'
{
"data": "User 'root' logged in",
"message": "200 OK",
"status": "ok"
}
[root@stratton01 tools]# curl -k -c cjar -b cjar -X PUT -H "Content-Type: application/json" https://10.5.16.100/xyz/openbmc_project/control/host0/boot/one_time/attr/BootSource -d '{"data": "xyz.openbmc_project.Control.Boot.Source.Sources.Network"}'
{
"data": null,
"message": "200 OK",
"status": "ok"
}[root@stratton01 tools]#
Not sure why f5u16 was behaving that way, but I no longer think it is related to the originally reported issue. I just ran 35 rinstall commands overnight and they all appear to be successful.
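For anyone wanting to replay the REST sequence from the debug log by hand (login, enable one-time boot, set BootSource), here is a minimal bash sketch. It assumes curl is available on the management node. The function names boot_attr_url and set_netboot_once are made up for illustration, and the 30 second --max-time simply mirrors the 30 s gap before the timeout in the log above. A stalled BMC should then show up as a curl timeout (exit code 28) on the specific request that hangs.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of xCAT.

# Print the REST URL of an attribute on the one_time boot object.
boot_attr_url() {
    local bmc_ip=$1 attr=$2
    printf 'https://%s/xyz/openbmc_project/control/host0/boot/one_time/attr/%s\n' \
        "$bmc_ip" "$attr"
}

# Usage: set_netboot_once <bmc_ip> <user> <password>
# Replays the three calls seen in the debug log; stops at the first failure.
# Note: the password is pasted into JSON as-is, so quotes in it would break the body.
set_netboot_once() {
    local bmc_ip=$1 user=$2 pass=$3 jar
    jar=$(mktemp) || return 1   # private cookie jar instead of ./cjar
    local -a opts=(-k -s -S --max-time 30 -c "$jar" -b "$jar"
                   -H 'Content-Type: application/json')
    curl "${opts[@]}" -X POST "https://$bmc_ip/login" \
         -d "{\"data\": [\"$user\", \"$pass\"]}" &&
    curl "${opts[@]}" -X PUT "$(boot_attr_url "$bmc_ip" Enabled)" \
         -d '{"data": 1}' &&
    curl "${opts[@]}" -X PUT "$(boot_attr_url "$bmc_ip" BootSource)" \
         -d '{"data": "xyz.openbmc_project.Control.Boot.Source.Sources.Network"}'
    local rc=$?                 # exit status of the last curl that ran
    rm -f "$jar"
    return "$rc"
}

# Example (not run here): set_netboot_once 10.5.16.100 root 'xxxxxx'; echo "rc=$?"
```

Running it in a loop and logging the return code per iteration would show whether it is always the BootSource PUT that stalls.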
@Obihoernchen Can you recreate this at will? What are the steps to recreate?
@gurevichmark usually just rsetboot and rpower or rinstall. Yesterday I tried to recreate it and failed :-/ I'll try more...
The only real difference is: now all nodes have a proper OS installed and booted. Maybe this limits the occurrence of this shared port mode bug? Back when I hit this bug kinda often I did rsetboot and rpower while the node was still booting or installing. Maybe this makes a difference.
@gurevichmark Was able to reproduce it once more. This happened after ~5 rinstall commands during the installation or boot. Unfortunately I can't reproduce it consistently. Maybe the timing of rinstall is important, I don't know.
[root@xcat install]# rinstall <node> osimage=rh75-compute-install
Provision node(s): <node>
<node>: 504 Gateway Timeout
Error: [xcat]: Failed to run 'rpower' against the following nodes: <node>
[root@xcat install]# ping <node>-bmc
PING <node>-bmc (172.24.184.32) 56(84) bytes of data.
^C
--- <node>-bmc ping statistics ---
12 packets transmitted, 0 received, 100% packet loss, time 11028ms
Power reseat fixed the BMC.
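Since the failure only shows up after some number of repetitions, a loop that repeats a command and reports the iteration of the first failure may help pin down the count. This is a rough sketch: repeat_until_fail is a hypothetical helper, the PAUSE variable is invented here, and it assumes the xCAT command exits non-zero when it prints an error like the ones above, which should be checked first.

```shell
#!/usr/bin/env bash
# Sketch only: repeat_until_fail and PAUSE are illustrative names.

# Usage: repeat_until_fail <max> <command> [args...]
# Runs the command up to <max> times, sleeping PAUSE seconds (default 0)
# between runs. Prints the iteration of the first failure and returns 1,
# or reports that no failure occurred and returns 0.
repeat_until_fail() {
    local max=$1 i
    shift
    for ((i = 1; i <= max; i++)); do
        if ! "$@"; then
            echo "failed on iteration $i"
            return 1
        fi
        sleep "${PAUSE:-0}"
    done
    echo "no failure in $max iterations"
}

# Example (not run here):
#   PAUSE=60 repeat_until_fail 20 rsetboot <node> net
```

Varying PAUSE would be one way to test the guess that timing relative to boot or installation matters.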
Could we identify whether this is an xCAT error or a firmware issue?
It is hard to tell if this is xCAT or firmware since we cannot recreate it reliably.
My guess is that it is not xCAT, since I was able to rinstall one of our BMC machines (f5u16) 35 times in a row with no problem.
@Obihoernchen Next time this happens, can you check whether the host is still pingable?
Hey @gurevichmark just had this issue again. See below:
[root@xcat software]# rinstall taurusml5 osimage=rh75-compute-install
Provision node(s): taurusml5
taurusml5: 504 Gateway Timeout
Error: [xcat]: Failed to run 'rpower' against the following nodes: taurusml5
[root@xcat software]# rinstall taurusml5 osimage=rh75-compute-install
Provision node(s): taurusml5
taurusml5: BMC did not respond. Validate BMC configuration and retry the command.
Error: [xcat]: Failed to run 'rsetboot' against the following nodes: taurusml5
[root@xcat software]# ping taurusml5-bmc
PING taurusml5-bmc (172.24.184.21) 56(84) bytes of data.
^C
--- taurusml5-bmc ping statistics ---
22 packets transmitted, 0 received, 100% packet loss, time 21007ms
[root@xcat software]# ping taurusml5
PING taurusml5 (172.24.148.21) 56(84) bytes of data.
^C
--- taurusml5 ping statistics ---
12 packets transmitted, 0 received, 100% packet loss, time 11010ms
###############
# wait 10 min #
###############
[root@xcat software]# ping taurusml5-bmc
PING taurusml5-bmc (172.24.184.21) 56(84) bytes of data.
^C
--- taurusml5-bmc ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4000ms
[root@xcat software]# ping taurusml5
PING taurusml5 (172.24.148.21) 56(84) bytes of data.
^C
--- taurusml5 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000ms
###################
# PDU power reset #
###################
[root@xcat software]# rpower taurusml5 pduoff
taurusml5: pdu1 operational state for outlet 8 is off
taurusml5: pdu2 operational state for outlet 4 is off
[root@xcat software]# rpower taurusml5 pduon
taurusml5: pdu1 operational state for outlet 8 is on
taurusml5: pdu2 operational state for outlet 4 is on
###############
# wait 3 min #
###############
[root@xcat software]# ping taurusml5-bmc
PING taurusml5-bmc (172.24.184.21) 56(84) bytes of data.
64 bytes from taurusml5-bmc (172.24.184.21): icmp_seq=1 ttl=63 time=0.783 ms
64 bytes from taurusml5-bmc (172.24.184.21): icmp_seq=2 ttl=63 time=0.595 ms
64 bytes from taurusml5-bmc (172.24.184.21): icmp_seq=3 ttl=63 time=0.624 ms
64 bytes from taurusml5-bmc (172.24.184.21): icmp_seq=4 ttl=63 time=0.583 ms
^C
--- taurusml5-bmc ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3007ms
rtt min/avg/max/mdev = 0.583/0.646/0.783/0.082 ms
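The manual checks in this transcript (ping the BMC, ping the host, compare the loss figures) could be wrapped in a small helper so the same data is captured every time the problem occurs. A minimal sketch, assuming iputils ping and the <node>-bmc naming used above; all function names are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative, not part of xCAT.

# Read ping output on stdin and print the packet-loss percentage
# taken from the summary line (e.g. "100" for "100% packet loss").
ping_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Send 5 echo requests and print the loss percentage.
# Assumes iputils ping: -c is the count, -W the per-reply timeout in seconds.
probe() {
    ping -c 5 -W 2 "$1" 2>/dev/null | ping_loss_pct
}

# Print one report line; "?" means ping produced no summary (e.g. unknown host).
report() {
    local label=$1 host=$2 loss
    loss=$(probe "$host")
    echo "$label ($host): ${loss:-?}% packet loss"
}

# Usage: check_node <node> [bmc-hostname]
# The BMC name defaults to <node>-bmc, the convention seen in this thread.
check_node() {
    local node=$1 bmc=${2:-$1-bmc}
    report bmc "$bmc"
    report host "$node"
}

# Example (not run here): check_node taurusml5
```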
@Obihoernchen Can you run the following on the host to see the Broadcom firmware level:
for i in `lspci -d "14e4:1657:0200" | cut -f1 -d" "` ; do ethtool -i $(echo $(ls /sys/bus/pci/devices/$i/net)) | grep firmware-version ; done
@gurevichmark
firmware-version: 5719-v1.43 NCSI v1.4.22.0
firmware-version: 5719-v1.43 NCSI v1.4.22.0
Ok, looks like the latest level.
Next time this happens, after the power cycle, can you run this command on the xCAT MN to generate and download the dump:
rspconfig <node> dump
The dump file will be downloaded into /var/log/xcat/dump.
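To pick up the newest dump for a node afterwards, something like the following bash sketch might do. It assumes dump files follow the pattern of the one saved earlier in this thread (20180924-1112_f5u16_dump_8.tar.xz, that is, a sortable timestamp, the node name, then _dump_<n>.tar.xz). That pattern is inferred from a single filename, and latest_dump is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch only: latest_dump is hypothetical; the filename pattern is an assumption.

# Usage: latest_dump <node> [dump-dir]
# Prints the path of the dump whose name sorts last (newest timestamp prefix),
# or prints nothing and returns 1 if no dump for the node exists.
latest_dump() {
    local node=$1 dir=${2:-/var/log/xcat/dump} f latest=
    for f in "$dir"/*_"$node"_dump_*.tar.xz; do
        if [ ! -e "$f" ]; then
            continue            # the glob matched nothing
        fi
        if [[ $f > $latest ]]; then
            latest=$f
        fi
    done
    if [ -z "$latest" ]; then
        return 1
    fi
    printf '%s\n' "$latest"
}

# Example (not run here): latest_dump f5u16
```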
No updates since October. Closing this issue. Please open a new one if this problem reappears.
User reports in #hcp_xcat:
Does anyone have issues with openbmc (P9) and rsetboot <node> net? It works fine a couple of times and then, after using it too often (5-10 times), the BMC does not respond anymore (no ping, nothing). I don't know whether the BMC hangs, lost the network config or sth else. We experienced this behaviour 3 times in a row now on 3 different BMCs. Basically:
And this works fine a few times and then after X iterations it fails and the BMC is unreachable. btw: 8335-GTX (AC922) server with newest FW:
BMC and OS are in shared port mode! So just a single Ethernet cable. And the BMC has a VLAN configured.
BMC eventually comes back after about 24 hours.