shamouda closed this 7 years ago
@bherta could you review this?
Is there an issue somewhere that describes why this change was needed? If I remember correctly, the fatal_error() method just prints a message; it doesn't cause the program to exit. Places should already be marked dead in the buffer flushing step, so explicitly marking them dead slightly earlier, as this PR does, may help, but I'm not sure about that. I do see that some missing pthread_mutex_unlock calls were added in x10rt_net_send_get & put, so that's good by itself.
Sometimes when I kill places, the socket runtime hangs after printing fatal errors such as "sending GET x10rt_msg_pass.type" or "reading x10rt_msg_params.len". Marking the place dead at the points where these messages are printed seemed to fix the hang.
I can see that. The "sending GET" message was an example where the unlock was missing. And if there is an error partway through the read, as in "reading x10rt_msg_params.len", then there may be some data stuck in the read buffers. I could see it getting stuck re-reading partial data without the marking of the place as dead.
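To make the two failure modes above concrete, here is a minimal sketch, not the actual x10rt_sockets.cc code. `Peer`, `Io`, and `transferAll` are hypothetical names invented for illustration, and the real runtime uses its own structures and raw pthread locks. It assumes an I/O callback that returns the number of bytes moved, or a value <= 0 on error. The lock is released on every return path, and an error partway through a transfer marks the peer dead so later calls do not retry on a half-consumed message.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>

// Hypothetical stand-in for per-place connection state.
struct Peer {
    std::mutex lock;   // guards the socket for this place
    bool dead = false; // set once any read/write error is seen
};

// Hypothetical I/O callback: returns bytes transferred, or <= 0 on error/EOF.
using Io = std::function<long(unsigned char*, std::size_t)>;

// Transfers exactly `len` bytes. Returns false if the peer is (or becomes) dead.
bool transferAll(Peer& peer, const Io& io, unsigned char* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);

    // RAII guard: the lock is released on every return below, including
    // the error path where a manual unlock is easy to forget.
    std::lock_guard<std::mutex> guard(peer.lock);

    if (peer.dead) {
        return false; // never touch a connection already known to be broken
    }

    std::size_t done = 0;
    while (done < len) {
        long n = io(buf + done, len - done);
        if (n <= 0) {
            // A partial message may be stuck in the stream; mark the peer
            // dead instead of only reporting the error.
            peer.dead = true;
            return false;
        }
        done += static_cast<std::size_t>(n);
    }
    return true;
}
```

With a callback that delivers two bytes and then fails, `transferAll` returns false, leaves `peer.dead` set, and the mutex is free again, so a second call returns immediately without invoking the callback.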
While I think there are more changes here than are necessary to fix the bug, I don't think any of this is harmful, so I'm fine if we want to merge this in.
In x10rt_sockets.cc mark place dead when a read/write error is detected instead of throwing a fatal error