Open belldi opened 6 years ago
@belldi I formatted your above issue.
We are seeing the 503 come back the first time that we call a REST command after rebooting the BMC. I logged into the cn06 BMC and it looks like it's been up for around 20 minutes.
I thought this was an openbmc issue... but today I saw it on earlier firmware:
[root@briggs01 910.1738.20170921r]# rinv mid05tor12cn17 firm
Tue Nov 7 16:55:41 2017 mid05tor12cn17: [openbmc_debug] LOGIN Failed using curl command
mid05tor12cn17: Error: 503 Service Unavailable
[root@briggs01 910.1738.20170921r]# rinv mid05tor12cn17 firm
Tue Nov 7 16:55:47 2017 mid05tor12cn17: [openbmc_debug] curl -k -c cjar -H "Content-Type: application/json" -d '{ "data": ["root", "xxxxxx"] }' https://172.12.139.117/login
Tue Nov 7 16:55:47 2017 mid05tor12cn17: [openbmc_debug] login_response 200 OK
Tue Nov 7 16:55:47 2017 mid05tor12cn17: [openbmc_debug] curl -k -b cjar -X GET -H "Content-Type: application/json" https://172.12.139.117/xyz/openbmc_project/software/enumerate
Tue Nov 7 16:55:48 2017 mid05tor12cn17: [openbmc_debug] rinv_firm_response 200 OK
mid05tor12cn17: HOST Firmware Product: IBM-witherspoon-redbud-ibm-OP9_v1.19_1.2 (Active)*
mid05tor12cn17: HOST Firmware Product: -- additional info: buildroot-2017.5-163-g8035745
mid05tor12cn17: HOST Firmware Product: -- additional info: capp-ucode-9c73e9f
mid05tor12cn17: HOST Firmware Product: -- additional info: hostboot-binaries-67a15fd
mid05tor12cn17: HOST Firmware Product: -- additional info: hostboot-c68be97
mid05tor12cn17: HOST Firmware Product: -- additional info: linux-4.12.8-openpower1-p5e99141
mid05tor12cn17: HOST Firmware Product: -- additional info: machine-xml-c1e49a6-p1c7cc59
mid05tor12cn17: HOST Firmware Product: -- additional info: occ-854999a
mid05tor12cn17: HOST Firmware Product: -- additional info: op-build-v1.7-1321-g53c41e3-dirty
mid05tor12cn17: HOST Firmware Product: -- additional info: petitboot-v1.5.1-p6b4d24f
mid05tor12cn17: HOST Firmware Product: -- additional info: sbe-4a62126-pd206ccb
mid05tor12cn17: HOST Firmware Product: -- additional info: skiboot-v5.8-rc1-p2f3f436
Let me try and make sure this is a FW issue by taking xCAT out of the equation...
Uh oh... After rebooting the BMC, we are able to log into it without a 503. This error seems to be in our code...
[root@briggs01 p9]# ./login.sh
==> IP=172.11.139.5
+ date
Tue Nov 7 17:29:23 EST 2017
+ curl -c cjar -k -X POST -H 'Content-Type: application/json' -d '{"data": [ "root", "0penBmc" ] }' https://172.11.139.5/login
{
"data": "User 'root' logged in",
"message": "200 OK",
"status": "ok"
}
real 0m13.744s
user 0m0.021s
sys 0m0.065s
@xuweibj I think perhaps #4207 caused this, so I installed the 10/30 build of xCAT:
[root@briggs01 xcat]# lsxcatd -v
Version 2.13.8 (git commit 0133577acf2482e30650fb87ce6f35a7dc3c2b4c, built Mon Oct 30 06:15:41 EDT 2017)
Rebooted the BMC; after it had been up 4 minutes...
[root@briggs01 openbmc]# ./run_cmd_bmc.sh mid05tor12cn05 uptime
==> mid05tor12cn05: Attempting to run the command: uptime
22:38:29 up 4 min, load average: 2.03, 2.50, 1.18
Run test:
[root@briggs01 xcat]# rpower mid05tor12cn05 state
Tue Nov 7 17:37:42 2017 mid05tor12cn05: [openbmc_debug] curl -k -c cjar -H "Content-Type: application/json" -d '{ "data": ["root", "xxxxxx"] }' https://172.11.139.5/login
Tue Nov 7 17:37:56 2017 mid05tor12cn05: [openbmc_debug] login_response 200 OK
Tue Nov 7 17:37:56 2017 mid05tor12cn05: [openbmc_debug] curl -k -b cjar -X GET -H "Content-Type: application/json" https://172.11.139.5/xyz/openbmc_project/state/enumerate
Tue Nov 7 17:37:56 2017 mid05tor12cn05: [openbmc_debug] rpower_status_response 200 OK
mid05tor12cn05: on
I think maybe we need to revert that pull request for now and make sure this does not happen.
Hi @whowutwut, yes, the code for the 503 was merged in #4207. The 503 means that curl ... https://<BMCIP>/login
failed or timed out. Without that code, we may get results like:
cn01: 503 Service Unavailable
cn02: 504 Gateway Timeout
Even though cn02 is OK, because cn01 hangs we cannot get the correct response from cn02. So I think we should not revert that pull request.
I think the problem here is that after bmcreboot, the first login needs over 10s, so the curl
command times out and we print the 503 error. But if we set the timeout to more than 10s, performance may be worse.
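The behavior described above can be sketched as a login with a bounded curl timeout plus a retry. This is a hypothetical illustration, not xCAT's actual code: the bmc_login name, the default timeout/retry values, and the credentials are placeholders; the curl invocation mirrors the login commands quoted earlier in this thread.

```shell
# Hypothetical sketch: log in to the BMC with a bounded timeout, retrying
# because the first login after bmcreboot can take over 10s.
# ip/timeout/retries and the credentials are illustrative placeholders.
bmc_login() {
  local ip="$1" timeout="${2:-20}" retries="${3:-2}"
  local i
  for i in $(seq 1 "$retries"); do
    if curl -sk -c cjar --max-time "$timeout" \
         -H "Content-Type: application/json" \
         -d '{ "data": ["root", "0penBmc"] }' \
         "https://${ip}/login" | grep -q '"status": "ok"'; then
      return 0   # login succeeded within the timeout
    fi
  done
  echo "BMC did not respond within ${timeout}s (tried ${retries} times)" >&2
  return 1
}
```

With a scheme like this, a slow first login after bmcreboot is absorbed by the retry instead of surfacing as a misleading 503.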
But I think we need to differentiate between the REST return code and our own return code. If it's our own error, we should not use REST return codes, because I expect only the BMC to send those back.
Also, I think it is OK to let the first call to the BMC be slow, because that's a BMC issue, not ours to fix. As long as we hit the endpoint in parallel, we are not the bottleneck.
@whowutwut OK, I think we could modify the error message and the timeout value. But how long do you think is better? I tested against 500 nodes with a timeout of 20s: if none of the nodes return a response within 20s (so every one waits out the timeout), rpower status takes more than 2m40s. And if set to 30s, it will be more than 4m.
I think we should bring this up to the openbmc firmware team to fix the slow first connect, though. After the first call connects, what is the normal performance of the commands?
Is this only if the nodes are not up? If all nodes are up, it's OK, right?
Perhaps we should revert the pull request causing this issue first, then take a look at this again.
If all nodes are OK, rpower state takes only 25s against 500 nodes. If nodes are not ping-able, almost 40s. If a node is ping-able but does not return a response until the timeout, it will take more than (8*timeout)s.
The problem if we revert it is that if one node is ping-able but does not return a response until the timeout (the 60s setting in async), all the other nodes will report 504 Timeout. It is easy to reproduce: run bmcreboot against one node, then while that BMC is powering on, run any other command against more than 2 nodes (one being the bmcreboot node).
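The timing figures quoted above fit a simple batch model. As a back-of-envelope sketch (assuming xCAT hits BMCs in parallel batches; the concurrency value of 63 is inferred from the 8*timeout figure for 500 nodes, not taken from xCAT source):

```shell
# Hypothetical worst-case wall time when every node waits out the timeout:
# roughly ceil(nodes / concurrency) * timeout, using integer arithmetic.
worst_case_seconds() {
  local nodes="$1" concurrency="$2" timeout="$3"
  echo $(( (nodes + concurrency - 1) / concurrency * timeout ))
}

worst_case_seconds 500 63 20   # 8 batches * 20s = 160s (~2m40s)
worst_case_seconds 500 63 30   # 8 batches * 30s = 240s (~4m)
```

This matches the measurements reported above: raising the timeout from 20s to 30s stretches the all-timeout case from about 2m40s to about 4m.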
@xuweibj I see better now what you are doing... Do you have the data for the above?
For this issue, we should urgently replace this error message to not use 503 because it's misleading.
I would also suggest that the fix for this issue adjust the timeout so the BMC works correctly for now... and we can discuss more on our midweek call.
Related to task #4352.
@whowutwut, for adding a test case: based on the discussion above, it is hard to create such a trigger. There is only one CN in the daily regression environment, so we cannot cover it there. For the simulator, based on xuwei's feedback, it is hard to simulate such a 503 error. So I have not found a way to test this issue yet. I will follow this issue.
@hu-weihua I think the issue here is that we (xCAT) should not use a 503 return code for an internal error, because it makes it look like the error is coming from the REST server when it's actually xCAT timing out early and reporting the same error.
I think to test this, we want to reboot the BMC, then wait 5+ minutes so we know the BMC should be ready. Then when we issue a command against the BMC, we should not time out early; we should allow the REST service to start and return the correct value...
Essentially we should not get a timeout from our code, which has now been changed to report a different error message: BMC did not respond. Verify BMC is in BMCReady state and retry the command
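The retest described above boils down to classifying the command's output. A minimal sketch (the classify_rpower_output helper is hypothetical; the strings it matches are the error messages quoted in this thread):

```shell
# Hypothetical classifier for the retest: after a bmcreboot plus a ~5 minute
# wait, a command should return a real state, never the early-timeout message.
classify_rpower_output() {
  case "$1" in
    *"BMC did not respond"*)         echo "early-timeout" ;;  # our own timeout
    *"503 Service Unavailable"*)     echo "rest-error" ;;     # REST server error
    *)                               echo "ok" ;;             # real state returned
  esac
}

# Usage after `rpower <node> bmcreboot` and a ~5 minute wait:
#   classify_rpower_output "$(rpower <node> state 2>&1)"
```

A result of "early-timeout" after the BMC has had ample time to come up would indicate our code is still giving up too soon.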
@xuweibj Is this one pending? I don't think there are any additional plans for it.
@whowutwut Yes, this is pending.
Ok @belldi I'm going to close this, please open a new issue or re-open if you see issues after upgrading to 2.13.9.
I'd like to relook at this please...
[root@xcat ~]# rinstall gpu003
Provision node(s): gpu003
gpu003: 503 Service Unavailable
Error: [xcat]: Failed to run 'rpower' against the following nodes: gpu003
Clearly this is still a bug.
[root@xcat ~]# rpm -qa | grep xCAT
xCAT-buildkit-2.15.1-snap202003041643.noarch
xCAT-genesis-scripts-x86_64-2.15.1-snap202003041643.noarch
xCAT-probe-2.15.1-snap202003041643.noarch
xCAT-server-2.15.1-snap202003041643.noarch
xCAT-genesis-base-x86_64-2.14.5-snap201811190037.noarch
xCAT-2.15.1-snap202003041643.x86_64
perl-xCAT-2.15.1-snap202003041643.noarch
xCAT-client-2.15.1-snap202003041643.noarch
xCAT-genesis-scripts-ppc64-2.15.1-snap202003041643.noarch
xCAT-genesis-base-ppc64-2.14.5-snap202005281327.noarch
Note, however, that I see this behaviour only on rinstall commands, it seems.
rpower commands work; rinstall commands come back with a 503 and fail to power on the node... yet sending the rpower command itself afterwards works fine...
@CJCShadowsan Have you figured out how to recreate this and cause the system to return a 503?
@gurevichmark yes: I run rinstall <node>.
With an IC922 this doesn't seem to occur, but with an AC922 on the current latest firmware it happens pretty reliably.
I am met with a 503; I check the power straight after with rpower <node> stat and am met with an off status... I immediately run rpower <node> on and it works.
For reference:
[root@xcat ~]# rinstall gpu003
Provision node(s): gpu003
gpu003: 503 Service Unavailable
Error: [xcat]: Failed to run 'rpower' against the following nodes: gpu003
[root@xcat ~]# rpower gpu003 stat
gpu003: off
[root@xcat ~]# rpower gpu003 on
gpu003: on
[root@xcat ~]#
Which version of firmware do you have installed on AC922 ?
Latest @gurevichmark - our first port of call was to upgrade to latest firmware just in case it was a firmware issue... Turned out it wasn't.
When you run rinstall and it fails with a 503, then you run rinstall again, will it fail again?
@gurevichmark - yes, a second rinstall seems to work.
@CJCShadowsan And can you then get the same node back to the state where rinstall fails again with 503?
Actually @gurevichmark :
[root@xcat network]# rinstall gpu002
Provision node(s): gpu002
gpu002: 503 Service Unavailable
Error: [xcat]: Failed to run 'rpower' against the following nodes: gpu002
[root@xcat network]# rinstall gpu002
Provision node(s): gpu002
[root@xcat network]# rpower gpu002 stat
gpu002: off
[root@xcat network]# rpower gpu002 stat
gpu002: off
[root@xcat network]# rinstall gpu002
Provision node(s): gpu002
[root@xcat network]# rpower gpu002 stat
gpu002: on
[root@xcat network]#
So it fails with a 503 the first time, then a second rinstall also doesn't work... It takes a third one to get it to turn on.
I mean, I've heard third time's the charm, but this is taking it a little far!
Ok, so now that gpu002 was able to rinstall, can you ever get it to the state where it fails again with 503? Or is it such that once rinstall successfully goes through, it will not fail again with 503?
Almost consistently.
It's a regular occurrence because we've been installing and reinstalling nodes...
I've reinstalled gpu002 4 times in the last week. Each time it's been the same.
Do you have to wait some minimum amount of time after a successful rinstall? For example, in your earlier post, the first and second rinstall failed and the third one succeeded. If you issue another, fourth, rinstall after rpower stat showed on, will it fail with 503?
No, that's the thing. This command history is almost instant... Note it's not just a 503 but a 504 too. The subsequent commands were run pretty much instantly after the return of the first.
[root@xcat slurm]# rinstall R1
Provision node(s): gpu001 gpu002 gpu003 gpu004 gpu005 gpu006 gpu007 gpu008 infer001
gpu001: 503 Service Unavailable
gpu007: 504 Gateway Timeout
gpu003: reset
gpu005: reset
infer001: reset
gpu004: reset
gpu002: reset
gpu008: reset
gpu006: reset
Error: [xcat]: Failed to run 'rpower' against the following nodes: gpu001 gpu002 gpu003 gpu004 gpu005 gpu006 gpu007 gpu008 infer001
[root@xcat slurm]# rinstall gpu001
Provision node(s): gpu001
[root@xcat slurm]# rpower gpu001 stat
gpu001: off
[root@xcat slurm]# rpower gpu001 stat
gpu001: off
[root@xcat slurm]# rinstall gpu001
Provision node(s): gpu001
[root@xcat slurm]# rpower gpu001 stat
gpu001: on
[root@xcat slurm]# rinstall gpu007
Provision node(s): gpu007
[root@xcat slurm]# rpower gpu007 stat
gpu007: off
[root@xcat slurm]# rpower gpu007 stat
gpu007: off
[root@xcat slurm]# rinstall gpu007
Provision node(s): gpu007
[root@xcat slurm]# rpower gpu007 stat
gpu007: on (Chassis)
[root@xcat slurm]# rpower gpu007 stat
gpu007: on
[root@xcat slurm]#
As you can see, some get on with it straight away. Others give a 503 or 504 and I have to issue two more rinstall commands to get them to power on at all.
Highly frustrating!
@CJCShadowsan I am not able to recreate the 503 on my system with the same BMC firmware level installed.
Can you run the following sequence of steps:
rpower R1 off
rpower R1 bmcstate
rpower R1 stat
rflash R1 -l | xcatcoll
rinstall R1 -V
rpower R1 bmcstate
rpower R1 stat
rpower R1 bmcstate
rpower R1 stat
rinstall R1 -V
Problem description
I intermittently see the error message "Error: 503 Service Unavailable" while attempting to power on a compute node. After the failure message, my next attempt is generally successful.
Attempt to power on compute node
xCAT version