zrlio / disni

DiSNI: Direct Storage and Networking Interface
Apache License 2.0

register memory larger than 4GB #39

Open lynus opened 5 years ago

lynus commented 5 years ago

Recently, I extended disni to enable the On-Demand Paging (ODP) feature. With this feature, it becomes practical to register large memory regions whose size must be held in a long rather than an int. In fact, the native call ibv_reg_mr() takes the length as a size_t, which is 64 bits wide on 64-bit architectures. So what would you think of breaking the API and changing IbvMr.length to long?
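For reference, a minimal C sketch (not DiSNI code; device and PD setup omitted) of the native path being discussed - ibv_reg_mr() already takes the length as size_t, and ODP is just an extra access flag, so on a 64-bit system nothing below the Java layer caps the size at 4GB:

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Register a buffer that may be larger than 4 GiB, using ODP. */
static struct ibv_mr *reg_large_odp(struct ibv_context *ctx, struct ibv_pd *pd,
                                    void *buf, size_t len /* 64-bit on LP64 */)
{
    struct ibv_device_attr_ex attr = {0};

    /* Check that the device supports ODP before asking for it. */
    if (ibv_query_device_ex(ctx, NULL, &attr) ||
        !(attr.odp_caps.general_caps & IBV_ODP_SUPPORT)) {
        fprintf(stderr, "ODP not supported on this device\n");
        return NULL;
    }

    /* IBV_ACCESS_ON_DEMAND makes this an ODP registration: pages are
     * faulted in lazily instead of being pinned up front. */
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_ON_DEMAND);
}
```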

P.S. I encountered some pitfalls while enabling ODP; the API incompatibility is just one of them.

patrickstuedi commented 5 years ago

@yuvaldeg you're the ODP expert, what do you think?

yuvaldeg commented 5 years ago

@lynus is correct, this should work and allow us to register >4GB blocks. We should change it to long and break the build if this is not a 64-bit system.

yuvaldeg commented 5 years ago

@lynus, please report any other issues you may be experiencing with ODP - are these mostly functional or performance issues?

lynus commented 5 years ago

I have encountered the following pitfalls, which are not documented in any material I could find on the Internet:

  1. When connecting a QP, the connection parameter rnr_retry_count should be set to a nonzero value; otherwise the completion queue will report IBV_WC_RNR_RETRY_EXC_ERR when you send to ODP-registered memory. So I set the default RdmaConnParam.rnr_retry_count to 7 (see the sketch after this list).
  2. Page-fault handling can be very slow if no prefetch is applied; handling the first page fault can take as long as 5-6 seconds. If you post receive WRs on ODP-registered memory, you should not read/write the memory referenced by those WRs before receiving the WC notification, otherwise an IBV_WC_WR_FLUSH_ERR is raised on the receiver side. Interestingly, you can still read/write adjacent memory within the same page.
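For pitfall 1, a minimal sketch (assuming a librdmacm-based connection setup rather than DiSNI's Java path) of setting rnr_retry_count before connecting:

```c
#include <rdma/rdma_cma.h>

/* Connect with a nonzero RNR retry count so the initiator backs off while
 * the responder is still faulting in ODP pages, instead of failing with
 * IBV_WC_RNR_RETRY_EXC_ERR. */
static int connect_with_rnr_retry(struct rdma_cm_id *id)
{
    struct rdma_conn_param param = {0};

    param.rnr_retry_count = 7;   /* 7 means "retry indefinitely" */
    param.retry_count     = 7;

    return rdma_connect(id, &param);
}
```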

The InfiniBand card in my experiments is an MT2770 (Mellanox ConnectX-4). So far, I have only tested send/recv; I will report more issues as my experiments progress. P.S. @yuvaldeg, I spent tens of hours of trial and error to figure out the cause of the IBV_WC_WR_FLUSH_ERR. Is there any technique for investigating the kernel modules and extracting the cause of the error?

yuvaldeg commented 5 years ago

Re #1: You are correct, rnr_retry_count should be non-zero. ODP relies on RNR to function as expected. For example, if a user reads/writes remote memory that is registered with ODP, the remote side will respond with an RNR so that the initiator backs off until the page-fault handling is complete. This is why RNRs are a requirement for ODP.

Re #2: For the reasons above, a page fault is a lengthy process that should be avoided as much as possible. ODP is intended to be used with buffers that get reused frequently - that way they are not evicted too often, which minimizes the occurrence of page faults. A better approach for ODP is to use it on larger data buffers, use prefetch wisely, and make every effort to reuse those buffers rather than release them. Everything else should be kept registered without ODP (e.g. QP WR buffers, receive WQEs, etc.).
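A minimal sketch of the prefetch advice (assuming a recent rdma-core that provides ibv_advise_mr(); DiSNI itself would have to expose this through its own API): pre-faulting an ODP range so that the first RDMA access does not pay the multi-second page-fault cost described above.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Prefetch (pre-fault) a range inside an ODP-registered MR. */
static int prefetch_odp_range(struct ibv_pd *pd, struct ibv_mr *mr,
                              uint64_t addr, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = addr,       /* must lie within the ODP MR */
        .length = len,
        .lkey   = mr->lkey,
    };

    /* PREFETCH_WRITE also makes the pages writable, so a later RDMA write
     * or receive into this range does not fault again. */
    return ibv_advise_mr(pd, IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE,
                         IBV_ADVISE_MR_FLAG_FLUSH, &sge, 1);
}
```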

Regarding the ConnectX-4 card - generally speaking, ConnectX-5 will get better performance with ODP, though ConnectX-4 also shows very nice improvements with ODP.

Re IBV_WC_WR_FLUSH_ERR - this is usually a symptom of some other issue. Whenever a QP gets an error, it will report the error using a WC, and will then flush the rest of the posted WRs with an IBV_WC_WR_FLUSH_ERR. So, you should fish out the first WR that fails and see what failure reason it reports - we can further assist if needed.
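A minimal sketch of that debugging step in C (plain verbs API, not DiSNI): drain the CQ and report the first non-success completion, since everything after it is typically just flushed.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

/* Print the first failed completion; later IBV_WC_WR_FLUSH_ERR entries are
 * usually just collateral damage from that first error. */
static void report_first_error(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    while (ibv_poll_cq(cq, 1, &wc) > 0) {
        if (wc.status != IBV_WC_SUCCESS) {
            fprintf(stderr, "wr_id %llu failed: %s (status %d, vendor_err 0x%x)\n",
                    (unsigned long long)wc.wr_id,
                    ibv_wc_status_str(wc.status),
                    wc.status, wc.vendor_err);
            return;
        }
    }
}
```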

lynus commented 5 years ago

@yuvaldeg Thanks for the comments. Regarding IBV_WC_WR_FLUSH_ERR, here is a simplified description of my case: there are two receive WRs pending in the QP, A and B. Request A references only mmapped memory, while B references both mmapped and unmapped memory. The sending side starts to transfer data, and the receiving side quickly gets a successful completion event for A. Then there is a lengthy wait for B's event. If I eagerly read the unmapped buffer before getting B's completion event, I immediately get IBV_WC_WR_FLUSH_ERR on B.

So there is no error prior to the IBV_WC_WR_FLUSH_ERR, and the only failed WR is B, with the flush error. This kind of error is really hard to debug.

yuvaldeg commented 5 years ago

Thanks for the details, @lynus. I want to make sure I have the full picture - can you please share a code snippet of the mapping, registration, and post-recv? Also, if you can share the code where you get the error while reading before B's completion, that would be great. If you can share an entire code sample, that would be even better.

lynus commented 5 years ago

Right now, I'm working on an experimental project aimed at using DiSNI to write directly into a remote JVM-managed heap. The entire code includes a customized JVM (https://github.com/lynus/JikesRVM/tree/RDMA) and my disni fork (https://github.com/lynus/disni), so it may take too much time to dive in. I will write a short C program to reproduce this exact situation. Thanks for your kindness :-)

patrickstuedi commented 5 years ago

@yuvaldeg @lynus did anything come out of that discussion? Do 64-bit registrations make sense at all, given that ByteBuffer.allocateDirect(size) takes a signed 32-bit size?

lynus commented 5 years ago

I vote for 64-bit registration support. The target memory buffers may come from the user's own allocator rather than ByteBuffer.allocateDirect(), so I think a new interface like RdmaEndpoint.registerMemory(long addr, long size) is also needed for some use cases.
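To illustrate why a long-based interface would be enough, here is a hypothetical JNI bridge (the class and method names are illustrative only, not DiSNI's actual native layer): jlong maps directly onto void* and size_t on LP64 systems, so a registerMemory(long addr, long size) call can pass both values through unchanged.

```c
#include <jni.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Hypothetical native backing for a Java-side registerMemory(long, long). */
JNIEXPORT jlong JNICALL
Java_com_example_NativeVerbs_regMr(JNIEnv *env, jclass cls,
                                   jlong pd_handle, jlong addr,
                                   jlong length, jint access)
{
    (void)env; (void)cls;
    struct ibv_pd *pd = (struct ibv_pd *)(uintptr_t)pd_handle;

    /* jlong is 64-bit, as are void* and size_t on 64-bit platforms, so
     * lengths above 4 GiB cross the JNI boundary without truncation. */
    struct ibv_mr *mr = ibv_reg_mr(pd, (void *)(uintptr_t)addr,
                                   (size_t)length, (int)access);
    return (jlong)(uintptr_t)mr;
}
```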

yuvaldeg commented 5 years ago

@patrickstuedi I agree with @lynus, I think we should support it. It will be needed sooner or later, so we might as well do it now.