xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

BMC becomes unreachable after several provision requests. #5656

Closed by gurevichmark 5 years ago

gurevichmark commented 6 years ago

User reports in #hcp_xcat:

Has anyone had issues with openbmc (P9) and rsetboot <node> net? It works fine a couple of times, and then after using it too often (5-10 times) the BMC does not respond anymore (no ping, nothing). I don't know whether the BMC hangs, loses its network config, or something else. We have experienced this behaviour 3 times in a row now on 3 different BMCs.

Basically:

nodeset <node> osimage=XXX
rsetboot <node> net
rpower <node> boot

This works fine a few times, and then after X iterations it fails and the BMC is unreachable. btw: this is an 8335-GTX (AC922) server with the newest FW:

service pack: OP920.02  
Scale-Out LC System Firmware (OP920.02/PNOR OP9_v2.0.8-2.2/BMC ibm-v2.1-438-r15)

The BMC and OS are in shared port mode, so there is just a single Ethernet cable. The BMC also has a VLAN configured.

BMC eventually comes back after about 24 hours.
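For anyone trying to reproduce this, the sequence above could be scripted roughly as follows. This is only a sketch under assumptions: the function names, the default iteration count and pause, and the `<node>-bmc` hostname convention are illustrative and not part of xCAT. It repeats the three commands from the report and stops at the first command failure or as soon as the BMC no longer answers ping.

```shell
# Sketch only: repeat nodeset / rsetboot / rpower until the BMC stops answering.
# Assumptions: the BMC resolves as "<node>-bmc"; defaults (20 runs, 300 s pause)
# are arbitrary and should be tuned to the site.

# Returns 0 if the host answers at least one of three pings.
bmc_reachable() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1
}

# Usage: repro_loop <node> <osimage> [max_iterations] [pause_seconds]
repro_loop() {
    rl_node=$1
    rl_image=$2
    rl_max=${3:-20}
    rl_pause=${4:-300}
    rl_i=1
    while [ "$rl_i" -le "$rl_max" ]; do
        echo "iteration $rl_i"
        nodeset "$rl_node" "osimage=$rl_image" || return 2
        rsetboot "$rl_node" net || { echo "rsetboot failed on iteration $rl_i"; return 1; }
        rpower "$rl_node" boot || { echo "rpower failed on iteration $rl_i"; return 1; }
        # Give the node time to start booting before checking the BMC.
        sleep "$rl_pause"
        if ! bmc_reachable "${rl_node}-bmc"; then
            echo "BMC unreachable after iteration $rl_i"
            return 1
        fi
        rl_i=$((rl_i + 1))
    done
    echo "no failure in $rl_max iterations"
}
```

It would be invoked as, for example, `repro_loop <node> XXX 15 120`; the last line printed shows which iteration failed, if any.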

Obihoernchen commented 6 years ago

btw we use the stock perl openbmc implementation and xCAT 2.14.3.

gurevichmark commented 6 years ago

I may have experienced something similar with f5u16 yesterday afternoon. After 7 or 8 rinstalls, the BMC was pingable, but I could not ssh to it or issue commands to it:

[root@stratton01 ~]# rpower f5u16 bmcstate
f5u16: BMC Ready

[root@stratton01 ~]# rinstall f5u16 osimage=rhels7.5-alt-rhv4.2-ppc64le-netboot-compute
Provision node(s): f5u16
f5u16: BMC did not respond. Validate BMC configuration and retry the command. (timeout)
Error: [stratton01]: Failed to run 'rsetboot' against the following nodes: f5u16
[root@stratton01 ~]#

After about 24 hours, BMC was accessible again.

Obihoernchen commented 6 years ago

fyi: after a power reseat the BMC is reachable again. And I was just able to reproduce this again after some rinstall and rsetboot commands.

whowutwut commented 6 years ago

@Obihoernchen Can you show the firmware level on the box? rinv <node> firm

gurevichmark commented 6 years ago

Saved a BMC dump on stratton01 after f5u16 recovered: /var/log/xcat/dump/20180924-1112_f5u16_dump_8.tar.xz

Obihoernchen commented 6 years ago

@whowutwut

BMC Firmware Product:   ibm-v2.1-438-g0030304-r15-0-g19832d3 (Active)*
HOST Firmware Product:   IBM-witherspoon-ibm-OP9-v2.0.8-2.2-prod (Active)*
HOST Firmware Product: -- additional info: buildroot-2018.02.1-6-ga8d1126
HOST Firmware Product: -- additional info: capp-ucode-p9-dd2-v4
HOST Firmware Product: -- additional info: hcode-hw080418a.op920
HOST Firmware Product: -- additional info: hostboot-binaries-hw080418a.op920
HOST Firmware Product: -- additional info: hostboot-d033213-pfb2e171
HOST Firmware Product: -- additional info: linux-4.16.13-openpower1-p328018f
HOST Firmware Product: -- additional info: machine-xml-7cd20a6
HOST Firmware Product: -- additional info: occ-084756c
HOST Firmware Product: -- additional info: op-build-v2.0.8-1-gc51594f
HOST Firmware Product: -- additional info: petitboot-v1.7.2-p8f11e93
HOST Firmware Product: -- additional info: sbe-55d6eb2
HOST Firmware Product: -- additional info: skiboot-v6.0.7

whowutwut commented 6 years ago

The FW team thinks this is a side effect of the Broadcom NCSI bug on the shared port. The workaround is to power on the node via IPMI, but I'm not sure with what command (maybe ipmitool chassis on?). Trying to get more information.

Obihoernchen commented 6 years ago

I'm kinda sure rsetboot <node> net is causing the issue, not rpower. Every time this occurred, rsetboot hung and I wasn't even able to run rpower, iirc.

gurevichmark commented 6 years ago

I was able to recreate this on f5u16 again today. After about 10 reprovisions, the BMC stops responding to the REST API call that sets control/host0/boot/one_time/attr/BootSource:

[root@stratton01 tools]# rinstall f5u16 osimage=rhels7.5-alt-rhv4.2-ppc64le-netboot-compute
Provision node(s): f5u16
Mon Sep 24 16:42:22 2018 f5u16: [openbmc_debug] login curl -k -c cjar -b cjar -X POST -H "Content-Type: application/json" https://10.5.16.100/login -d '{"data": ["root", "xxxxxx"]}'
Mon Sep 24 16:42:22 2018 f5u16: [openbmc_debug] login 200 OK
Mon Sep 24 16:42:22 2018 f5u16: [openbmc_debug] set_one_time_boot_enable curl -k -c cjar -b cjar -X PUT -H "Content-Type: application/json" https://10.5.16.100/xyz/openbmc_project/control/host0/boot/one_time/attr/Enabled -d '{"data": 1}'
Mon Sep 24 16:42:23 2018 f5u16: [openbmc_debug] set_one_time_boot_enable 200 OK
Mon Sep 24 16:42:23 2018 f5u16: [openbmc_debug] set_one_time_boot_state curl -k -c cjar -b cjar -X PUT -H "Content-Type: application/json" https://10.5.16.100/xyz/openbmc_project/control/host0/boot/one_time/attr/BootSource -d '{"data": "xyz.openbmc_project.Control.Boot.Source.Sources.Network"}'
Mon Sep 24 16:42:53 2018 f5u16: [openbmc_debug] set_one_time_boot_state BMC did not respond. Validate BMC configuration and retry the command. (timeout)
f5u16: BMC did not respond. Validate BMC configuration and retry the command. (timeout)
Error: [stratton01]: Failed to run 'rsetboot' against the following nodes: f5u16
[root@stratton01 tools]#
Mon Sep 24 16:56:56 2018 f5u16: [openbmc_debug_perl] rflash_list_response 200 OK
f5u16: ID       Purpose State      Version
f5u16: -------------------------------------------------------
f5u16: 78d09908 BMC     Active     ibm-v2.0-0-r44-0-g843c2e1
f5u16: 1b3ffcf4 Host    Active     IBM-witherspoon-ibm-OP9_v1.19_1.173
f5u16: b04fff27 BMC     Active(*)  ibm-v2.0-0-r46-0-gbed584c
f5u16: e339c76d Host    Active(*)  IBM-witherspoon-ibm-OP9_v1.19_1.189
f5u16:
[root@stratton01 tools]#

gurevichmark commented 6 years ago

Strange, if I run the command directly, it appears to work:

[root@stratton01 tools]# curl -k -c cjar -b cjar -X POST -H "Content-Type: application/json" https://10.5.16.100/login -d '{"data": ["root", "xxxxxxx"]}'
{
  "data": "User 'root' logged in",
  "message": "200 OK",
  "status": "ok"
}

[root@stratton01 tools]# curl -k -c cjar -b cjar -X PUT -H "Content-Type: application/json" https://10.5.16.100/xyz/openbmc_project/control/host0/boot/one_time/attr/BootSource -d '{"data": "xyz.openbmc_project.Control.Boot.Source.Sources.Network"}'
{
  "data": null,
  "message": "200 OK",
  "status": "ok"
}
[root@stratton01 tools]#
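To compare the manual call with what xCAT sees, the same PUT could be wrapped with an explicit timeout and a timing report. This is a hedged sketch, not xCAT's implementation: the function name is made up, the 30 second default is simply the gap between the 16:42:23 and 16:42:53 lines in the debug log above, and it assumes a prior login has already populated the `cjar` cookie file as in the commands shown.

```shell
# Sketch only: time the one-time BootSource PUT and report a timeout explicitly.
# Assumption: "cjar" already holds a valid session cookie from the /login call.
# Usage: set_boot_source_network <bmc_ip> [timeout_seconds]
set_boot_source_network() {
    sbn_url="https://$1/xyz/openbmc_project/control/host0/boot/one_time/attr/BootSource"
    sbn_timeout=${2:-30}
    # -w prints the HTTP status and total time; the response body is discarded.
    sbn_out=$(curl -k -s -c cjar -b cjar -o /dev/null \
        --max-time "$sbn_timeout" -w '%{http_code} %{time_total}' \
        -X PUT -H "Content-Type: application/json" \
        -d '{"data": "xyz.openbmc_project.Control.Boot.Source.Sources.Network"}' \
        "$sbn_url")
    sbn_rc=$?
    # curl exits with 28 when --max-time is exceeded.
    if [ "$sbn_rc" -eq 28 ]; then
        echo "timed out after ${sbn_timeout}s"
        return 1
    fi
    echo "curl rc=$sbn_rc, http_code and seconds: $sbn_out"
}
```

Running it a few times right after a failed rinstall would show whether the BMC is slow to answer this particular request or not answering at all.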
gurevichmark commented 6 years ago

Not sure why f5u16 was behaving that way, but I no longer think it is related to the original reported issue. I just ran 35 rinstall commands overnight and they all appear to be successful.

@Obihoernchen Can you recreate this at will? What are the steps to recreate it?

Obihoernchen commented 6 years ago

@gurevichmark usually just rsetboot and rpower, or rinstall. Yesterday I tried to recreate it and failed :-/ I'll try more...

The only real difference is: now all nodes have a proper OS installed and booted. Maybe this limits the occurrence of this shared port mode bug? Back when I hit this bug kinda often I did rsetboot and rpower while the node was still booting or installing. Maybe this makes a difference.

Obihoernchen commented 6 years ago

@gurevichmark I was able to reproduce it once more. This happened after ~5 rinstall commands issued during the installation or boot. Unfortunately I can't reproduce it consistently. Maybe the timing of rinstall is important, I don't know.

[root@xcat install]# rinstall <node> osimage=rh75-compute-install
Provision node(s): <node>
<node>: 504 Gateway Timeout
Error: [xcat]: Failed to run 'rpower' against the following nodes: <node>
[root@xcat install]# ping <node>-bmc
PING <node>-bmc (172.24.184.32) 56(84) bytes of data.
^C
--- <node>-bmc ping statistics ---
12 packets transmitted, 0 received, 100% packet loss, time 11028ms

Power reseat fixed the BMC.

robin2008 commented 6 years ago

Can we tell whether this is an xCAT error or a firmware issue?

gurevichmark commented 6 years ago

It is hard to tell whether this is xCAT or firmware, since we cannot recreate it reliably.

My guess is that it is not xCAT, since I was able to rinstall one of our BMC machines (f5u16) 35 times in a row with no problem.

gurevichmark commented 6 years ago

@Obihoernchen Next time this happens, can you check whether the host is still pingable?

Obihoernchen commented 6 years ago

Hey @gurevichmark just had this issue again. See below:

[root@xcat software]# rinstall taurusml5 osimage=rh75-compute-install
Provision node(s): taurusml5

taurusml5: 504 Gateway Timeout
Error: [xcat]: Failed to run 'rpower' against the following nodes: taurusml5

[root@xcat software]# rinstall taurusml5 osimage=rh75-compute-install
Provision node(s): taurusml5
taurusml5: BMC did not respond. Validate BMC configuration and retry the command.
Error: [xcat]: Failed to run 'rsetboot' against the following nodes: taurusml5

[root@xcat software]# ping taurusml5-bmc
PING taurusml5-bmc (172.24.184.21) 56(84) bytes of data.
^C
--- taurusml5-bmc ping statistics ---
22 packets transmitted, 0 received, 100% packet loss, time 21007ms

[root@xcat software]# ping taurusml5
PING taurusml5 (172.24.148.21) 56(84) bytes of data.
^C
--- taurusml5 ping statistics ---
12 packets transmitted, 0 received, 100% packet loss, time 11010ms

###############
# wait 10 min #
###############
[root@xcat software]# ping taurusml5-bmc
PING taurusml5-bmc (172.24.184.21) 56(84) bytes of data.
^C
--- taurusml5-bmc ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4000ms

[root@xcat software]# ping taurusml5
PING taurusml5 (172.24.148.21) 56(84) bytes of data.
^C
--- taurusml5 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000ms

###################
# PDU power reset #
###################
[root@xcat software]# rpower taurusml5 pduoff
taurusml5: pdu1 operational state for outlet 8 is off
taurusml5: pdu2 operational state for outlet 4 is off

[root@xcat software]# rpower taurusml5 pduon
taurusml5: pdu1 operational state for outlet 8 is on
taurusml5: pdu2 operational state for outlet 4 is on

###############
# wait 3 min #
###############
[root@xcat software]# ping taurusml5-bmc
PING taurusml5-bmc (172.24.184.21) 56(84) bytes of data.
64 bytes from taurusml5-bmc (172.24.184.21): icmp_seq=1 ttl=63 time=0.783 ms
64 bytes from taurusml5-bmc (172.24.184.21): icmp_seq=2 ttl=63 time=0.595 ms
64 bytes from taurusml5-bmc (172.24.184.21): icmp_seq=3 ttl=63 time=0.624 ms
64 bytes from taurusml5-bmc (172.24.184.21): icmp_seq=4 ttl=63 time=0.583 ms
^C
--- taurusml5-bmc ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3007ms
rtt min/avg/max/mdev = 0.583/0.646/0.783/0.082 ms
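The manual "wait, then ping" steps in this transcript could be replaced by a small polling helper that reports roughly how long the BMC or host took to come back. A minimal sketch, assuming Linux `ping` flags; the function name and the default timeout and interval are illustrative only.

```shell
# Sketch only: poll a host with ping until it answers or a deadline passes.
# The reported time counts only the sleeps, so it is approximate.
# Usage: wait_for_ping <host> [timeout_seconds] [interval_seconds]
wait_for_ping() {
    wfp_host=$1
    wfp_timeout=${2:-600}
    wfp_interval=${3:-10}
    wfp_waited=0
    while [ "$wfp_waited" -le "$wfp_timeout" ]; do
        if ping -c 1 -W 2 "$wfp_host" >/dev/null 2>&1; then
            echo "$wfp_host reachable after ~${wfp_waited}s"
            return 0
        fi
        sleep "$wfp_interval"
        wfp_waited=$((wfp_waited + wfp_interval))
    done
    echo "$wfp_host still unreachable after ${wfp_timeout}s"
    return 1
}
```

For example, `wait_for_ping taurusml5-bmc 600 10` after the `rpower taurusml5 pduon` step would poll for up to ten minutes.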
gurevichmark commented 6 years ago

@Obihoernchen Can you run the following on the host to see the Broadcom firmware level?

for i in $(lspci -d "14e4:1657:0200" | cut -f1 -d" "); do ethtool -i "$(ls /sys/bus/pci/devices/$i/net)" | grep firmware-version; done

Obihoernchen commented 6 years ago

@gurevichmark

firmware-version: 5719-v1.43 NCSI v1.4.22.0
firmware-version: 5719-v1.43 NCSI v1.4.22.0

gurevichmark commented 6 years ago

Ok, looks like the latest level. Next time this happens, after the power cycle, can you run rspconfig <node> dump on the xCAT MN to generate and download the dump? The dump file will be downloaded into /var/log/xcat/dump.

gurevichmark commented 5 years ago

No updates since October. Closing this issue. Please open a new one if this problem reappears.