Fixing segmentation fault and hanging issues in Team when places fail.
Modified sleepUntil and waitForParentToReceive to return the team status, and used the return value to avoid moving data to a failed place and to terminate the collective when needed.
The fix is not complete: sleepUntil still hangs in some cases. The new class ResilientAllreduceTest includes a sample log for a hanging scenario.
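
For illustration only, here is a minimal Java sketch of the pattern described above: the wait routine returns a status instead of nothing, and the caller checks that status before copying data to its peer. This is not the actual Team code. The Status enum, the placeFailed check, the sendIfAlive helper, and the signature of sleepUntil are all assumptions made for this sketch.

```java
import java.util.function.BooleanSupplier;

public class TeamStatusSketch {
    // Hypothetical status values; the real Team code may encode status differently.
    enum Status { OK, PLACE_FAILED }

    // Polls until the condition holds or a failure is reported,
    // and returns which of the two ended the wait.
    static Status sleepUntil(BooleanSupplier condition, BooleanSupplier placeFailed) {
        while (!condition.getAsBoolean()) {
            if (placeFailed.getAsBoolean()) {
                return Status.PLACE_FAILED;
            }
            Thread.yield();
        }
        return Status.OK;
    }

    // Hypothetical caller: copies data to the peer only if the wait ended
    // without a failure; otherwise it reports that the collective should stop.
    static boolean sendIfAlive(int[] src, int[] dst,
                               BooleanSupplier ready, BooleanSupplier placeFailed) {
        if (sleepUntil(ready, placeFailed) != Status.OK) {
            return false; // do not touch the failed peer's buffer
        }
        System.arraycopy(src, 0, dst, 0, src.length);
        return true;
    }
}
```

In this sketch the wait can still spin forever if the condition never becomes true and no failure is ever reported.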