xcat2 / xcat-core

Code repo for xCAT core packages
Eclipse Public License 1.0

conserver loses console output with ppc64le firestone (OpenPower S822LC) systems #1090

Closed ralphbellofatto closed 8 years ago

ralphbellofatto commented 8 years ago

We are running xcat with IBM openpower firestone systems.

xcat is configured with

 lsdef -t site clustersite -i consoleondemand
consoleondemand=no

If we execute

service conserver restart
rcons <nodeid>

or

service conserver stop
service conserver start
rcons <nodeid>

rcons will continue to work until 1 minute after we issued the "service conserver stop", or the "service conserver restart" commands.

We managed to duplicate the problem without conserver with the following script.

/opt/xcat/bin/ipmitool-xcat -v -e !  -I lanplus -U ADMIN -P xxxx  -H fstgb002-bmc sol activate &
sleep 5;
pid=$!;
kill $pid
/opt/xcat/bin/ipmitool-xcat -v -e !  -I lanplus -U ADMIN -P xxxx -H fstgb002-bmc sol deactivate
/opt/xcat/bin/ipmitool-xcat -v -e !  -I lanplus -U ADMIN -P xxxx -H fstgb002-bmc sol activate
Wait one minute and we see:
IPMI Request Match NOT FOUND

We believe this is what is happening with the conserver stop/start sequence, and a close approximation of what happens with the conserver restart sequence. NOTE: the -v option is needed to see the "IPMI Request Match NOT FOUND" message.

We observe that the deactivate and activate sequence is part of the "/opt/xcat/share/xcat/cons/ipmi" file.

if ($iface eq "lanplus") {
    system "$ipmitool -I lanplus $inteloption $user $pass -H $bmc $solcom deactivate"; #Stop any active session
}
exec "$ipmitool -I $iface $inteloption $user $pass -H $bmc $solcom activate";

We have looked inside the conserver source code and observed it does a killproc on all the child processes for "service conserver stop" and does a kill HUP for the "service conserver restart".

We believe the root of this problem is that the killproc of the /opt/xcat/bin/ipmitool-xcat invoked by conserver kills the process without closing the session with the BMC. At this point the BMC starts a 1 minute timeout to close the open session. Then the sol deactivate comes along, followed immediately by the sol activate. One minute after the kill of the ipmitool-xcat, the BMC's timer goes off and closes the session that was opened by the killed ipmitool process. The BMC's code seems to assume that any active sol sessions must also be terminated at this point, so it internally kills off the sol connection inside the BMC. However, it leaves the current ipmitool-xcat's connection connected, just without the SOL active any more.

The end result is a hung connection that can only be cleared by:

service conserver stop
sleep 61
service conserver start

Another workaround that we successfully tried is to modify the file "/opt/xcat/share/xcat/cons/ipmi" and insert a "sleep 60;" between the "sol deactivate" and the "sol activate".

This has the undesirable effect of making rcons unresponsive for 60 seconds after starting, when we have "consoleondemand=yes" for our xcat parameters.

So far our conversations with AMI indicate that their software is working as designed, in that the currently active sol session is automatically deactivated when any IPMI session times out.

The way we notice this in the field is that periodically we see that our "rcons" sessions are not working, and that the /var/log/console/ files don't have any output in them for a while.

This usually surfaces after we have some sort of error event in the system and we go looking to see if there is any evidence in the console logs of what happened, only to find that they stopped working days ago.
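Since the failure tends to go unnoticed until someone needs the logs, a periodic staleness check could flag it sooner. The following is only a rough sketch: the function name `stale_console_logs`, the default directory and the 60 minute threshold are all assumptions for illustration, and `-maxdepth` / `-mmin` are GNU/BSD find extensions rather than POSIX. A quiet node can legitimately produce no console output for a long time, so a hit here is a hint to check rcons, not proof of a hung SOL session.

```shell
# Hypothetical helper, not part of xCAT: print console log files that
# have not been modified for more than a given number of minutes.
stale_console_logs() {
    # Directory holding the per-node console logs (assumed default).
    local log_dir="${1:-/var/log/console}"
    # Age threshold in minutes (assumed default).
    local max_age_min="${2:-60}"
    # -mmin +N matches files last modified more than N minutes ago.
    find "$log_dir" -maxdepth 1 -type f -mmin +"$max_age_min" -print
}
```

For example, `stale_console_logs /var/log/console 120` would list the logs that have been silent for more than two hours.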

whowutwut commented 8 years ago

@zet809 I think the issue here may be in the conserver restart where we are issuing a kill to the sol processes. Can we change the stop/restart code in the conserver that we ship to do a sol deactivate to avoid this situation?

ralphbellofatto commented 8 years ago

Victor, that still won't help us here...

That is because the conserver will do the kill and the kill will cause a timeout 60 seconds after the kill and that timeout will take down any session that has an activate up at the time...

ralphbellofatto commented 8 years ago

A sol deactivate at this point will not fix the problem. This is because after it is issued, the conserver will kill the ipmi executable. Once that kill happens, the BMC will time out the connection, and after 1 minute the timeout will disconnect any sol session activated by any other session at the time the timeout happens.

conserver issues a sigTERM on "service conserver stop" and a sigHUP on "service conserver restart". Both of these, when received by ipmitool, immediately terminate the IPMI session without notifying the BMC that the connection is going down.

I think one way to prevent this is to modify conserver to process the sigTERM and sigHUP and treat them the same way a sigINT is handled by conserver.

Another thing that might work is to trap the sigTERM and sigHUP signals in the ipmi perl wrapper in xcat and send a "kill -2" to the ipmitool when receiving these, which will result in a clean shutdown in ipmitool.
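To make the signal-forwarding idea concrete, here is a minimal shell sketch of it (the real wrapper is Perl, so this is an illustration of the technique, not the xCAT code). The function name `run_with_forwarded_signal` is hypothetical. One caveat that applies to any shell version of this: a command started with `&` from a non-interactive shell begins with SIGINT ignored, so forwarding INT only helps if the child installs its own SIGINT handler, which is assumed here for ipmitool based on the observation above that "kill -2" shuts it down cleanly.

```shell
# Hypothetical helper, not part of xCAT: run a command and turn a
# SIGTERM or SIGHUP sent to this wrapper into a different signal
# (for example INT) delivered to the child.
run_with_forwarded_signal() {
    local forward_sig="$1"   # signal to pass on to the child, e.g. INT
    shift
    "$@" &                   # start the real console command
    local child=$!
    # Values are expanded now, so the handler does not depend on locals.
    trap "kill -s $forward_sig $child 2>/dev/null" TERM HUP
    # wait returns early when a trapped signal arrives, so loop until
    # the child has actually exited and been reaped.
    local status=0
    while kill -0 "$child" 2>/dev/null; do
        wait "$child"
        status=$?
    done
    trap - TERM HUP
    return "$status"
}
```

With this in place, something like `run_with_forwarded_signal INT /opt/xcat/bin/ipmitool-xcat -I lanplus -U ADMIN -P xxxx -H fstgb002-bmc sol activate` would give ipmitool a chance to close its BMC session when conserver sends TERM or HUP to the wrapper. A signal that arrives in the short window before the trap is installed would still kill the wrapper outright.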

robers97 commented 8 years ago

This may be a packaging issue rooted in some variance in the ipmitool, which may help us solve this.

Using ipmitool v. 1.8.13, I could not replicate the sequence above with the ipmitool (x86 to S822LC):

[robers@oc6056000657 tmp]$ ipmitool -V
ipmitool version 1.8.13
[robers@oc6056000657 tmp]$ ./ipmi_check.sh
Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20
Error Response 0xc1 from Get PICMG Properities
No PICMG Extenstion discovered
[SOL Session operational. Use !? for help]
tcgetattr: Inappropriate ioctl for device
IPMI Request Match NOT FOUND
./ipmi_check.sh: line 2: kill: (14660) - No such process
Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20
Error Response 0xc1 from Get PICMG Properities
No PICMG Extenstion discovered
Info: SOL payload already de-activated
Running Get PICMG Properties my_addr 0x20, transit 0, target 0x20
Error Response 0xc1 from Get PICMG Properities
No PICMG Extenstion discovered
[SOL Session operational. Use !? for help]

Using ipmitool v. 1.8.15 (S821LC to S822LC):

[root@hab10 tmp]# ./ipmi_check.sh
[SOL Session operational. Use !? for help]
tcgetattr: Inappropriate ioctl for device
./ipmi_check.sh: line 2: kill: (67315) - No such process
Info: SOL payload already de-activated
[SOL Session operational. Use !? for help]

ralphbellofatto commented 8 years ago

ipmitool 1.8.13 does not lose the console because when the BMC times out the old connection, it sends a message to the ipmitool holding the SOL connection that the tool does not understand, which in ipmitool 1.8.13 causes the tool to exit with the message "BMC closed the connection".

ipmitool 1.8.15 tolerates this message and keeps the session open. However, the BMC has internally shut down the SOL connection for that session, while the session itself remains open.

There are two problems here. The first is that when the ipmitool receives the kill from conserver, it does not cleanly shut down the connection with the BMC. (This could be accomplished by the ipmitool code handling the sigTERM signal in the same manner that it handles the sigINT.)

The second part of the problem is that when a connection times out, the BMC pulls the SOL session from all connections. It does this even if the connection that timed out does not "own" the SOL session.

ralphbellofatto commented 8 years ago

oops should not have hit close and comment, issue is still open...

jayeshmpatel commented 8 years ago

I have requested the AMI team (viswanathans@ami.com) to comment on this issue.

zet809 commented 8 years ago

With the patch mentioned in https://sourceforge.net/p/ipmitool/mailman/message/34370504/, the ipmitool sol session within rcons is terminated immediately after the BMC closes the sol session. Will upload the pkg with this patch soon.

whowutwut commented 8 years ago

@zet809 does this patch cause additional issues with rcons?

zet809 commented 8 years ago

@whowutwut No worse issue is introduced with this patch. With this patch, the client won't hang if the BMC closes the SOL session.

zet809 commented 8 years ago

Moving this defect to 2.12.2 to track the new build of ipmitool.

whowutwut commented 8 years ago

@jayeshmpatel Are you able to get anyone from AMI to comment on this issue? Also, are there any console issues that are planned to be fixed in the next firmware?

whowutwut commented 8 years ago

@ralphbellofatto We created a new ipmitool-xcat-1.8.15-2. Can you update your xCAT to the latest development version?

xcat-core: https://xcat.org/files/xcat/repos/yum/devel/core-snap/xCAT-core.repo
xcat-dep: https://xcat.org/files/xcat/repos/yum/xcat-dep/rh7/ppc64le/xCAT-dep.repo

Then run yum update '*xCAT*'. You should get the patched version. Please let us know if this resolves the issue.

whowutwut commented 8 years ago

@ralphbellofatto We updated the xCAT dependency package to contain a new ipmitool-xcat-1.8.15-3 version, which handles SIGHUP and SIGTERM, if you want to pull it down and give it a try.

whowutwut commented 8 years ago

Seems there are still some issues with the service conserver restart case...

daniceexi commented 8 years ago

@chenglch There was some discussion in the email. Could you post the issues you found? I am moving this issue to the next release so that we can continue fixing the potential issues.

chenglch commented 8 years ago

@whowutwut, I have tested on your c910f02c05p03 node again. I can't reproduce this issue on either c910f05c39 or c910f05c33 when running service conserver restart.

daniceexi commented 8 years ago

@chenglch @zet809 Please update the status for this issue to make sure all the problems mentioned in this issue have been fixed.

zet809 commented 8 years ago

No more issues encountered with ipmitool-xcat-1.8.15-3, so closing this defect.

whowutwut commented 8 years ago

@ralphbellofatto are you ok with the fix? Can we close this and if there are additional issues we can open another to track?

chenglch commented 8 years ago

Hi @ralphbellofatto, as we can't recreate this issue in our environment any more, I am closing this issue for now. If you encounter this problem again, feel free to reopen.