lynus opened this issue 5 years ago
@yuvaldeg you're the ODP expert, what do you think?
@lynus is correct, this should work and allow us to register >4GB blocks. We should change it to long and break the build if this is not a 64-bit system.
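For the native side, a minimal sketch of such a guard, assuming the check lives in DiSNI's C/JNI layer (the error message wording is made up):

```c
/* Hypothetical compile-time guard for the native (JNI) part of the build:
 * refuse to compile on platforms where size_t is narrower than 64 bits,
 * so that a Java long length always fits into ibv_reg_mr()'s size_t. */
#include <stddef.h>

_Static_assert(sizeof(size_t) == 8,
               "64-bit memory registration requires a 64-bit platform");
```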
@lynus, please report any other issues you may be experiencing with ODP - are these mostly functional or performance issues?
I have encountered the following pitfalls, which are not documented in any material I could find on the Internet:
The InfiniBand card in my experiments is an MT2770 (Mellanox ConnectX-4). So far, I have only tested send/recv functions. I will report more issues in further experiments. P.S. @yuvaldeg I spent tens of hours of trial and error to figure out the cause of IBV_WC_WR_FLUSH_ERR. Is there any technique for investigating the kernel modules and extracting the cause of the error?
Re #1: You are correct, rnr_retry_count should be non-zero. ODP relies on RNR for it to function as expected. E.g., if a user tries to read/write from/to remote memory that is registered with ODP, the remote side will respond with an RNR NAK so that the initiator backs off until the page fault handling is complete. This is why RNRs are a requirement for ODP.
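For reference, a rough libibverbs sketch of where that setting lives - the rnr_retry field passed to ibv_modify_qp() when moving the QP to RTS. The surrounding attribute values are only illustrative defaults, not DiSNI's actual configuration:

```c
#include <infiniband/verbs.h>

/* Sketch: configure a non-zero RNR retry count while moving the QP to RTS.
 * rnr_retry = 7 means "retry indefinitely", which is what ODP generally
 * needs so the initiator keeps backing off until the page fault resolves. */
static int move_qp_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr = {
        .qp_state      = IBV_QPS_RTS,
        .timeout       = 14,
        .retry_cnt     = 7,
        .rnr_retry     = 7,   /* non-zero: required for ODP page faults */
        .sq_psn        = 0,
        .max_rd_atomic = 1,
    };

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}
```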
Re #2: For the reasons above, a page fault is a lengthy process that should be avoided as much as possible. ODP is intended to be used with buffers that get reused frequently - that way they do not get evicted too often, which minimizes the occurrence of page faults. A better approach for ODP is to use it on larger data buffers, prefetch wisely, and make every effort to reuse those buffers rather than release them. Everything else should be kept registered without ODP (e.g. QP WR buffers, receive WQEs, etc.).
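As a rough illustration of that advice (not DiSNI's actual code), one could register the reusable buffer with IBV_ACCESS_ON_DEMAND and prefetch it via ibv_advise_mr(), assuming a recent enough rdma-core and device support; pd, buf and len stand in for the caller's own objects:

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: register a large reusable buffer with ODP and prefetch its pages
 * up front, so later accesses do not hit the page-fault path. */
static struct ibv_mr *register_odp_and_prefetch(struct ibv_pd *pd,
                                                void *buf, size_t len)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_ON_DEMAND);
    if (!mr)
        return NULL;

    /* Best-effort prefetch of the range; failure is not fatal. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    ibv_advise_mr(pd, IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE,
                  IBV_ADVISE_MR_FLAG_FLUSH, &sge, 1);

    return mr;
}
```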
Regarding the ConnectX-4 card - generally speaking, ConnectX-5 will get better performance with ODP, though ConnectX-4 also shows very nice improvements.
Re IBV_WC_WR_FLUSH_ERR - this is usually a symptom of some other issue. Whenever a QP gets an error, it will report the error in a WC and will then flush the rest of the posted WRs with IBV_WC_WR_FLUSH_ERR. So, you should fish out the first WR that fails and see what failure reason it reports - we can further assist if needed.
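A minimal polling sketch along those lines - drain the CQ and print the first non-success status, since everything after it will typically just be IBV_WC_WR_FLUSH_ERR:

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Sketch: drain completions and report the first real error.
 * Later entries usually show only IBV_WC_WR_FLUSH_ERR because the QP
 * has already moved to the error state. */
static void drain_and_report(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            fprintf(stderr,
                    "wr_id=%llu failed: %s (status %d, vendor_err 0x%x)\n",
                    (unsigned long long)wc.wr_id,
                    ibv_wc_status_str(wc.status),
                    wc.status, wc.vendor_err);
            break; /* the first failure is the interesting one */
        }
    }
}
```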
@yuvaldeg Thanks for the comments. Regarding IBV_WC_WR_FLUSH_ERR, here is a simplified description of my case: there are two receive WRs pending in the QP, A and B. Request A references only mmapped memory, while B references both mmapped and unmapped memory. The sending side starts to transfer data, and the receiving side quickly gets a successful completion event for A. Then there is a lengthy wait for B's event. If I eagerly read the unmapped buffer before getting B's completion event, I immediately get IBV_WC_WR_FLUSH_ERR on B.
So there is no error prior to the IBV_WC_WR_FLUSH_ERR, and the only failed WR is B, with the flush error. This kind of error is really hard to debug.
Thanks for the details, @lynus. I want to make sure I have the full picture - can you please share a code snippet of the mapping, registration, and post-recv? Also, if you can share the code where you get the error while reading before B's completion, that would be great. If you can share an entire code sample, that would be even better.
Right now, I'm working on an experimental project aimed at using DiSNI to write directly to a remote JVM-managed heap. The entire code base includes a customized JVM (https://github.com/lynus/JikesRVM/tree/RDMA) and my disni fork (https://github.com/lynus/disni), so it may take too much time to dive in. I will write a short C program to reproduce this exact situation. Thanks for your kindness :-)
@yuvaldeg @lynus anything come out of that discussion? Do 64-bit registrations make sense at all, given that ByteBuffer.allocateDirect(size) takes a signed 32-bit int?
I vote for 64-bit registration support. The target memory buffers may come from a user's own allocator rather than ByteBuffer.allocateDirect(), so I think a new interface such as RdmaEndpoint.registerMemory(long addr, long size) is also needed for some use cases.
@patrickstuedi I agree with @lynus, I think we should support it. It will be needed sooner or later, so we might as well do it now.
Recently, I extended disni to enable the On Demand Paging (ODP) feature. With this feature, it becomes practical to register large memory regions whose size should be held in a long instead of an int. In fact, the native call ibv_reg_mr() uses size_t, which is 64 bits wide on a 64-bit architecture. So what would you think of breaking the API and updating IbvMr.length to a long?
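For context, a hedged sketch of what such a registration looks like at the verbs level, where the length is already a size_t; the 6 GiB size is just an illustrative value and pd is assumed to be an existing protection domain:

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Sketch: the native side already takes a 64-bit length, so a >4GB
 * registration is blocked only by the int-typed length in the Java API. */
static struct ibv_mr *register_6gib(struct ibv_pd *pd)
{
    size_t len = 6ULL << 30;     /* 6 GiB: does not fit in a signed int */
    void *buf = malloc(len);     /* or an mmap'ed / ODP-backed range */
    if (!buf)
        return NULL;

    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```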
P.S. I encountered some pitfalls while enabling ODP; the API incompatibility is just one of them.