xcat2 / xcat-core

Code repo for xCAT core packages

rpower command intermittently reports 503 Service Unavailable Error #4264

Status: Open. belldi opened this issue 6 years ago

belldi commented 6 years ago

Problem description: I intermittently see the error message "Error: 503 Service Unavailable" while attempting to power on a compute node. After the failure, my next attempt is generally successful.

Attempt to power on compute node

# rpower cn06 on
cn06: Error: 503 Service Unavailable
# echo $?
1
# rpower cn06 state
cn06: off
# rpower cn06 on
cn06: on
# rpower cn06 state
cn06: on

xCAT version

# lsxcatd -a
Version 2.13.8 (git commit e4f452a44d417e2f9c64b6dd064e3fc316046ac4, built Sat Nov  4 07:15:42 EDT 2017)
This is a Management Node
cfgloc=Pg:dbname=xcatdb;host=192.168.95.1|xcatadm
dbengine=Pg
dbname=xcatdb
dbhost=192.168.95.1
dbadmin=xcatadm

whowutwut commented 6 years ago

@belldi I formatted your above issue.

We are seeing the 503 come back the first time that we call a REST command after rebooting the BMC. I logged into the cn06 BMC and it looks like it's been up for around 20 minutes.

I thought this was an openbmc issue... but today I saw it on an earlier firmware level:

[root@briggs01 910.1738.20170921r]# rinv mid05tor12cn17 firm
Tue Nov  7 16:55:41 2017 mid05tor12cn17: [openbmc_debug] LOGIN Failed using curl command
mid05tor12cn17: Error: 503 Service Unavailable
[root@briggs01 910.1738.20170921r]# rinv mid05tor12cn17 firm
Tue Nov  7 16:55:47 2017 mid05tor12cn17: [openbmc_debug] curl -k -c cjar -H "Content-Type: application/json" -d '{ "data": ["root", "xxxxxx"] }' https://172.12.139.117/login
Tue Nov  7 16:55:47 2017 mid05tor12cn17: [openbmc_debug] login_response 200 OK
Tue Nov  7 16:55:47 2017 mid05tor12cn17: [openbmc_debug] curl -k -b cjar -X GET -H "Content-Type: application/json" https://172.12.139.117/xyz/openbmc_project/software/enumerate
Tue Nov  7 16:55:48 2017 mid05tor12cn17: [openbmc_debug] rinv_firm_response 200 OK
mid05tor12cn17: HOST Firmware Product: IBM-witherspoon-redbud-ibm-OP9_v1.19_1.2 (Active)*
mid05tor12cn17: HOST Firmware Product: -- additional info: buildroot-2017.5-163-g8035745
mid05tor12cn17: HOST Firmware Product: -- additional info: capp-ucode-9c73e9f
mid05tor12cn17: HOST Firmware Product: -- additional info: hostboot-binaries-67a15fd
mid05tor12cn17: HOST Firmware Product: -- additional info: hostboot-c68be97
mid05tor12cn17: HOST Firmware Product: -- additional info: linux-4.12.8-openpower1-p5e99141
mid05tor12cn17: HOST Firmware Product: -- additional info: machine-xml-c1e49a6-p1c7cc59
mid05tor12cn17: HOST Firmware Product: -- additional info: occ-854999a
mid05tor12cn17: HOST Firmware Product: -- additional info: op-build-v1.7-1321-g53c41e3-dirty
mid05tor12cn17: HOST Firmware Product: -- additional info: petitboot-v1.5.1-p6b4d24f
mid05tor12cn17: HOST Firmware Product: -- additional info: sbe-4a62126-pd206ccb
mid05tor12cn17: HOST Firmware Product: -- additional info: skiboot-v5.8-rc1-p2f3f436

Let me try and make sure this is a FW issue by taking xCAT out of the equation...

whowutwut commented 6 years ago

Uh oh... After rebooting the BMC, we are able to log into it without a 503. This error seems to be in our code...

[root@briggs01 p9]# ./login.sh 
==> IP=172.11.139.5
+ date
Tue Nov  7 17:29:23 EST 2017
+ curl -c cjar -k -X POST -H 'Content-Type: application/json' -d '{"data": [ "root", "0penBmc" ] }' https://172.11.139.5/login
{
  "data": "User 'root' logged in", 
  "message": "200 OK", 
  "status": "ok"
}
real    0m13.744s
user    0m0.021s
sys 0m0.065s
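
For reference, the actual login.sh is not shown in this thread. A minimal reconstruction that would produce a trace and timing like the above could look like the following (the IP and credentials are taken from the output above; everything else is an assumption):

#!/bin/bash
# Hypothetical reconstruction of login.sh: time a raw curl login against the BMC.
IP=172.11.139.5                     # BMC IP taken from the trace above
echo "==> IP=${IP}"
set -x                              # produces the '+ date' / '+ curl' trace lines
time {
    date
    curl -c cjar -k -X POST -H 'Content-Type: application/json' \
         -d '{"data": [ "root", "0penBmc" ] }' "https://${IP}/login"
}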

whowutwut commented 6 years ago

@xuweibj I think perhaps #4207 caused this, so I installed the 10/30 build of xCAT:

[root@briggs01 xcat]# lsxcatd -v
Version 2.13.8 (git commit 0133577acf2482e30650fb87ce6f35a7dc3c2b4c, built Mon Oct 30 06:15:41 EDT 2017)

Reboot the BMC; after it has been up for 4 minutes...

[root@briggs01 openbmc]# ./run_cmd_bmc.sh  mid05tor12cn05 uptime
==> mid05tor12cn05: Attempting to run the command: uptime
 22:38:29 up 4 min,  load average: 2.03, 2.50, 1.18
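
(run_cmd_bmc.sh is also not shown here; a plausible sketch, assuming it looks up the node's BMC address from the xCAT definition and runs the command over ssh as root, would be:)

#!/bin/bash
# Hypothetical helper: run a command on a node's BMC over ssh.
node=$1; shift
echo "==> ${node}: Attempting to run the command: $*"
bmc=$(lsdef "$node" -i bmc | awk -F= '/bmc=/ {print $2}')   # resolve the BMC address
ssh root@"$bmc" "$@"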

Run test:

[root@briggs01 xcat]# rpower mid05tor12cn05 state 
Tue Nov  7 17:37:42 2017 mid05tor12cn05: [openbmc_debug] curl -k -c cjar -H "Content-Type: application/json" -d '{ "data": ["root", "xxxxxx"] }' https://172.11.139.5/login
Tue Nov  7 17:37:56 2017 mid05tor12cn05: [openbmc_debug] login_response 200 OK
Tue Nov  7 17:37:56 2017 mid05tor12cn05: [openbmc_debug] curl -k -b cjar -X GET -H "Content-Type: application/json" https://172.11.139.5/xyz/openbmc_project/state/enumerate
Tue Nov  7 17:37:56 2017 mid05tor12cn05: [openbmc_debug] rpower_status_response 200 OK
mid05tor12cn05: on

I think maybe we need to revert that pull request for now and make sure this does not happen.

xuweibj commented 6 years ago

Hi @whowutwut, yes, the code for 503 was merged in #4207. The 503 means that curl .... https://<BMCIP>/login failed or timed out. Without that code, we may get results like:

cn01: 503 Service Unavailable
cn02: 504 Gateway Timeout

Even though cn02 is OK, because cn01 hangs we cannot get a correct response from cn02. So I think we should not revert that pull request.

I think the problem here is that after bmcreboot the first login needs over 10s, so the curl command times out and then prints the 503 error. But if we set the timeout to more than 10s, the performance may be worse.
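
To illustrate the failure mode described above (a sketch only, not the actual plugin code; the 10s value and the <BMCIP> placeholder are assumptions carried over from this discussion): a login attempted with a 10-second client-side timeout fails with curl's timeout exit code while the BMC's REST server is still starting, and the caller then surfaces that as a 503 message.

# Illustrative only: curl exits with code 28 (operation timed out) if the BMC
# does not answer within --max-time seconds; the plugin then reports
# "503 Service Unavailable" for that node.
curl -k -c cjar --max-time 10 \
     -H "Content-Type: application/json" \
     -d '{ "data": ["root", "0penBmc"] }' \
     https://<BMCIP>/login \
  || echo "login failed or timed out (curl exit code $?)"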

whowutwut commented 6 years ago

But I think we need to differentiate between REST return codes and our own return codes. If it's our own error, we should not use a REST return code, because I expect only the BMC to send those codes back.

Also, I think it is OK to let the first call to the BMC be slow, because that's a BMC issue, not ours to fix. As long as we hit the endpoints in parallel, we are not the bottleneck.

xuweibj commented 6 years ago

@whowutwut OK, I think we could modify the error msg and the timeout value. But how long do you think is better? I tested against 500 nodes with a timeout of 20s. If none of the nodes return a response within 20s (so we have to wait for the timeout), rpower status takes more than 2m40s. And if the timeout is set to 30s, it takes more than 4m.

whowutwut commented 6 years ago

I think we should bring this up to the openbmc firmware team to fix the slow first connect, though. After the first call connects, what is the normal performance of the commands?

Is this only if the nodes are not up? If all nodes are up it's OK, right?

whowutwut commented 6 years ago

Perhaps we should revert the pull request causing this issue first, then take a look at this again.

xuweibj commented 6 years ago

If all nodes are OK, rpower state against 500 nodes takes only 25s. If nodes are not ping-able, it takes almost 40s. If a node is ping-able but does not return a response until the timeout, it will take more than (8*timeout)s.
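
(For what it's worth, those numbers are roughly self-consistent if the nodes are handled in batches, which is an inference rather than something stated here: 500 nodes finishing in about 160s with a 20s per-node timeout implies about 160 / 20 = 8 sequential rounds, i.e. batches of roughly 500 / 8 ≈ 63 nodes, which also lines up with the (8*timeout)s worst case above.)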

The problem if we revert it is that if one node is ping-able but does not return a response until the timeout (the 60s setting in async), all the other nodes will report 504 Timeout. It is easy to reproduce: run bmcreboot against one node, and then, while that BMC is powering on, run any other command against two or more nodes (one of them being the bmcreboot node).

whowutwut commented 6 years ago

@xuweibj I see better what you are doing now. Do you have the data for the above?

For this issue, we should urgently replace this error message so it does not use 503, because it's misleading.

I would also suggest that the fix for this issue adjust the timeout so the BMC works correctly for now... and we can discuss more on our midweek call.

hu-weihua commented 6 years ago

Related to task #4352.

hu-weihua commented 6 years ago

@whowutwut, regarding adding a test case: based on the discussion above, it is hard to create such a trigger. There is only one CN in the daily regression environment, so we cannot cover it there. As for the simulator, based on xuwei's feedback, it is hard to simulate such a 503 error. So I have not found a way to test this issue up to now. I will follow this issue.

whowutwut commented 6 years ago

@hu-weihua I think the issue here is that we (xCAT) should not use the 503 return code for an internal error, because it makes it look like the error is coming from the REST server when it's actually xCAT timing out early and reporting the same error.

I think to test this, we want to reboot the BMC and then wait 5+ minutes, at which point we know the BMC should be ready. Then when we issue a command against the BMC, we should not time out early; it should allow the REST service to start and return the correct value...

Essentially we should not get a timeout from our own code, which has now been changed to report a different error msg: "BMC did not respond. Verify BMC is in BMCReady state and retry the command".
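
A minimal manual check along those lines might be the following (a sketch only; the node name and wait time are placeholders, and it assumes the rpower bmcreboot and bmcstate subcommands available for OpenBMC nodes):

rpower cn06 bmcreboot      # reboot the BMC
sleep 300                  # wait 5+ minutes so the BMC has time to come up
rpower cn06 bmcstate       # expect: cn06: BMCReady
rpower cn06 state          # should return a real power state, not the early
                           # "BMC did not respond ..." timeout message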

whowutwut commented 6 years ago

@xuweibj Is this one pending? I don't think there are any additional plans for it.

xuweibj commented 6 years ago

@whowutwut Yes, this is pending.

whowutwut commented 6 years ago

OK @belldi, I'm going to close this; please open a new issue or re-open this one if you see issues after upgrading to 2.13.9.

CJCShadowsan commented 4 years ago

I'd like to take another look at this, please...

[root@xcat ~]# rinstall gpu003
Provision node(s): gpu003
gpu003: 503 Service Unavailable
Error: [xcat]: Failed to run 'rpower' against the following nodes: gpu003

Clearly this is still a bug.

[root@xcat ~]# rpm -qa | grep xCAT
xCAT-buildkit-2.15.1-snap202003041643.noarch
xCAT-genesis-scripts-x86_64-2.15.1-snap202003041643.noarch
xCAT-probe-2.15.1-snap202003041643.noarch
xCAT-server-2.15.1-snap202003041643.noarch
xCAT-genesis-base-x86_64-2.14.5-snap201811190037.noarch
xCAT-2.15.1-snap202003041643.x86_64
perl-xCAT-2.15.1-snap202003041643.noarch
xCAT-client-2.15.1-snap202003041643.noarch
xCAT-genesis-scripts-ppc64-2.15.1-snap202003041643.noarch
xCAT-genesis-base-ppc64-2.14.5-snap202005281327.noarch

CJCShadowsan commented 4 years ago

Note, however, that I seem to see this behaviour only on rinstall commands.

rpower commands work; rinstall commands come back with a 503 and fail to power on the node... Yet sending the rpower command itself afterwards works fine...

gurevichmark commented 4 years ago

@CJCShadowsan Have you figured out how to recreate this and cause the system to return 503?

CJCShadowsan commented 4 years ago

@gurevichmark yes:

rinstall <node>

Where <node> is an AC922 set up with openbmc.

With an IC922 this doesn't seem to occur - but with an AC922 on the current latest firmware it happens pretty reliably.

I am met with a 503; I check the power straight after with rpower <node> stat and I'm met with an off status... I immediately run rpower <node> on and it works.

CJCShadowsan commented 4 years ago

For reference:

[root@xcat ~]# rinstall gpu003
Provision node(s): gpu003
gpu003: 503 Service Unavailable
Error: [xcat]: Failed to run 'rpower' against the following nodes: gpu003
[root@xcat ~]# rpower gpu003 stat
gpu003: off
[root@xcat ~]# rpower gpu003 on
gpu003: on
[root@xcat ~]#

gurevichmark commented 4 years ago

Which version of firmware do you have installed on AC922 ?

CJCShadowsan commented 4 years ago

(screenshot showing the installed firmware levels)

Latest, @gurevichmark - our first port of call was to upgrade to the latest firmware just in case it was a firmware issue... Turned out it wasn't.

gurevichmark commented 4 years ago

When you run rinstall and it fails with 503, then you run rinstall again, will it fail again?

CJCShadowsan commented 4 years ago

@gurevichmark - yes, a second rinstall seems to work.

gurevichmark commented 4 years ago

@CJCShadowsan And can you then get the same node back to the state where rinstall fails again with 503?

CJCShadowsan commented 4 years ago

Actually @gurevichmark :

[root@xcat network]# rinstall gpu002
Provision node(s): gpu002
gpu002: 503 Service Unavailable
Error: [xcat]: Failed to run 'rpower' against the following nodes: gpu002
[root@xcat network]# rinstall gpu002
Provision node(s): gpu002
[root@xcat network]# rpower gpu002 stat
gpu002: off
[root@xcat network]# rpower gpu002 stat
gpu002: off
[root@xcat network]# rinstall gpu002
Provision node(s): gpu002
[root@xcat network]# rpower gpu002 stat
gpu002: on
[root@xcat network]#

So it fails with a 503 the first time, then a second rinstall also doesn't work... It takes a third one to get it to turn on.

I mean, I've heard third time's the charm, but this is taking it a little far!

gurevichmark commented 4 years ago

Ok, so now that gpu002 was able to rinstall, can you ever get it to the state where it fails again with 503? Or is it such that once rinstall successfully goes through, it will not fail again with 503?

CJCShadowsan commented 4 years ago

Almost consistently.

It's a regular occurrence because we've been installing and reinstalling nodes...

I've reinstalled gpu002 4 times in the last week. Each time it's been the same.


gurevichmark commented 4 years ago

Do you have to wait some minimum amount of time after a successful rinstall?

For example, in your earlier post, the first and second rinstall failed and the third one succeeded. If you issue another (fourth) rinstall after rpower stat showed on, will it fail with 503?

CJCShadowsan commented 4 years ago

No, that's the thing. This command history is almost instant... Note it's not just a 503 but a 504 too. The subsequent commands are run pretty much instantly after the return of the first.

[root@xcat slurm]# rinstall R1
Provision node(s): gpu001 gpu002 gpu003 gpu004 gpu005 gpu006 gpu007 gpu008 infer001
gpu001: 503 Service Unavailable
gpu007: 504 Gateway Timeout
gpu003: reset
gpu005: reset
infer001: reset
gpu004: reset
gpu002: reset
gpu008: reset
gpu006: reset
Error: [xcat]: Failed to run 'rpower' against the following nodes: gpu001 gpu002 gpu003 gpu004 gpu005 gpu006 gpu007 gpu008 infer001
[root@xcat slurm]# rinstall gpu001
Provision node(s): gpu001
[root@xcat slurm]# rpower gpu001 stat
gpu001: off
[root@xcat slurm]# rpower gpu001 stat
gpu001: off
[root@xcat slurm]# rinstall gpu001
Provision node(s): gpu001
[root@xcat slurm]# rpower gpu001 stat
gpu001: on
[root@xcat slurm]# rinstall gpu007
Provision node(s): gpu007
[root@xcat slurm]# rpower gpu007 stat
gpu007: off
[root@xcat slurm]# rpower gpu007 stat
gpu007: off
[root@xcat slurm]# rinstall gpu007
Provision node(s): gpu007
[root@xcat slurm]# rpower gpu007 stat
gpu007: on (Chassis)
[root@xcat slurm]# rpower gpu007 stat
gpu007: on
[root@xcat slurm]#

CJCShadowsan commented 4 years ago

As you can see, some get on with it straight away. Others give a 503 or 504, and I have to issue two rinstall commands to get them to power on at all.

Highly frustrating!

gurevichmark commented 4 years ago

@CJCShadowsan I am not able to recreate the 503 on my system with the same BMC firmware level installed. Can you run the following sequence of steps: