Some places may notice a failure a few steps later than others, so
places can disagree on the number of executed steps. The allreduce
calls that compute the min/max/sum of the stepTimes list then fail
because their buffers differ in size across places. To fix this bug,
places first agree on the minimum number of executed steps, then use
it to size the allreduce buffers.
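
The fix can be illustrated with a minimal single-process sketch. It is
not the code from this change: the allreduce helper below is a
hypothetical stand-in for a real collective, the name reduce_step_times
is invented for illustration, and keeping the first min_steps entries
of each list is an assumption about which steps are retained.

```python
from typing import Callable, Dict, List


def allreduce(buffers: List[List[float]],
              op: Callable[[List[float]], float]) -> List[float]:
    """Hypothetical stand-in for an element-wise allreduce.

    Like a real collective, it requires every place to contribute a
    buffer of the same length and fails otherwise.
    """
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("allreduce buffers differ in length across places")
    return [op([b[i] for b in buffers]) for i in range(length)]


def reduce_step_times(step_times_per_place: List[List[float]]) -> Dict[str, List[float]]:
    # Step 1: places agree on the minimum number of executed steps.
    # Each place contributes a one-element buffer, so this call is
    # safe even when the step counts disagree.
    counts = [[float(len(times))] for times in step_times_per_place]
    min_steps = int(allreduce(counts, min)[0])

    # Step 2: size every buffer to the agreed step count.
    # Assumption: the first min_steps entries are the ones to keep.
    buffers = [times[:min_steps] for times in step_times_per_place]

    # Step 3: the min/max/sum reductions now see equal-length buffers.
    return {
        "min": allreduce(buffers, min),
        "max": allreduce(buffers, max),
        "sum": allreduce(buffers, sum),
    }


if __name__ == "__main__":
    # The second place noticed the failure one step earlier than the others.
    step_times = [[1.0, 2.0, 3.0], [1.5, 2.5], [0.5, 2.0, 3.5]]
    print(reduce_step_times(step_times))
```

Calling allreduce directly on the three unequal lists raises the
length-mismatch error, which mirrors the original failure. After the
agreement step, every place contributes exactly two entries.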