zrlio / disni

DiSNI: Direct Storage and Networking Interface
Apache License 2.0
186 stars 66 forks source link

Need proper handling for RDMA_CM_EVENT_ROUTE_ERROR #41

Open lynus opened 5 years ago

lynus commented 5 years ago

I encountered in a situation where client randomly receive RDMA_CM_EVENT_ROUTE_ERROR, when client connects the server at a relatively high frequence. Disni does not handle this case properly, as it would block indefinitely in RdmaEndpoint.connect(). I think a new 'CONN_STATE_ERROR' CM state should be introduced to RdmaEndpoint and properly handled in connect() and close().

lynus commented 5 years ago

Dealing with partially initialized enpoint seems requiring a lot of modification and tests to existing implementations.

patrickstuedi commented 5 years ago

@lynus So you're saying, a client gets RDMA_CM_EVENT_ROUTE_ERROR and after that blocks indefinitely, correct?

lynus commented 5 years ago

@patrickstuedi Yes,I have confirmed that. This is due to RdmaEndpoint.dispatchCmEvent does not notifyAll() when receiving unknow events.

patrickstuedi commented 5 years ago

Did you check if pulling notifyAll() out of the "if" statements helps?

https://github.com/zrlio/disni/blob/master/src/main/java/com/ibm/disni/RdmaEndpoint.java#L135

lynus commented 5 years ago

Pulling notifyAll() out of if statement can fix the indefinite loop problem mentioned above.

However, that's no enough. Say endpoint received route resolve error , then connect() gets notified and throws a IOException. Now the endpoint is in partial initialized state, which disni so far can not properly handle. For example, this route-resolve-failed endpoint can not simply recall connect, because it has already called rdma_resolve_addr, and bond to a device. On the other hand, if ep is of RdmaActiveEndpoint type, close() implementation of this type will call RdmaActiveEndpointGroup.close(), which would throw exception on https://github.com/zrlio/disni/blob/cfcee63cf054fb151f459357f601d7717f658b6c/src/main/java/com/ibm/disni/RdmaActiveEndpointGroup.java#L113 due to https://github.com/zrlio/disni/blob/cfcee63cf054fb151f459357f601d7717f658b6c/src/main/java/com/ibm/disni/verbs/impl/NatCmaIdPrivate.java#L55

In summary, disni lacks the consideration of handling partially initialized ep. I suggest:

  1. Introducing isResourceAllocated() to RdmaEndpoint. It should be checked in every derived ep class in close() so as to avoid release unallocated resources. For example, https://github.com/zrlio/darpc/blob/4d03031d7f118722f093e198965c394573dccd96/src/main/java/com/ibm/darpc/DaRPCEndpoint.java#L125 should check it before deregisterMemory(dataMr);
  2. Fix the RdmaEndpoint.connect() logic to support reconnect.
  3. Pulling notifyAll() out of if statements, like you proposed.
patrickstuedi commented 5 years ago

Ok, understand. But can we not solve this issue even simpler:

1) Make close() robust with regard to partially initialized state. For instance, by checking if RdmaCmId is open before calling getVerbs() in the close() call

https://github.com/zrlio/disni/blob/cfcee63cf054fb151f459357f601d7717f658b6c/src/main/java/com/ibm/disni/RdmaActiveEndpointGroup.java#L113

2) By rewriting the connect() call so that it remembers its state and starts from there, e.g., not binding if it is already bound, etc.

lynus commented 5 years ago

Make close() robust with regard to partially initialized state. For instance, by checking if RdmaCmId is open before calling getVerbs() in the close() call

This only makes RdmaEndpoint itself robust, not the derived class. I still think isResourceAllocated() is needed, to inform derived class whether init() has been successfully executed. For example, the close() of DaRPCEndpoint should be: super.close(); if (isResourceAllocated()) derigisterMemory(dataMr);

PepperJo commented 5 years ago

Why not reset the endpoint to initial state if it fails in the connect? Resetting can be easily handled in the connect method. Or is there any sensible use-case in retrying the connect from a specific state. If that is the case I would rather expose this in the interface, e.g. by having multiple methods to connect.

lynus commented 5 years ago

@PepperJo We can't simply set connState to CONN_STATE_INITIALIZED and then connect() again, because idPriv.resolveAddr() would fail since ep may have already bond to a device. My PR introduces a separate state isError to mark the prior error. connect() method then try to avoid unnecessary steps.

PepperJo commented 5 years ago

@lynus the problem is that it is hidden to the user that the connect() method performs multiple steps. I don't like users calling connect() multiple times with connect starting wherever the last error occurred. Because this can change the semantic of the connect call e.g. think about dynamic routeable environments. So I propose whenever we failed in the connect call we set the endpoint in failure state and throw an exception whenever a method is called other than close().