zrlio / disni

DiSNI: Direct Storage and Networking Interface
Apache License 2.0
186 stars 66 forks source link

work completion event received with wrong value #44

Open yaelna opened 5 years ago

yaelna commented 5 years ago

we tried implementing sever client on top of send recv operations and noticed that sometimes we get a receive completion event following a send operation (or vise versa)

I've managed to reproduce the issue on top of RDMAvsTcpBenchmark ( changed files attached) by saving the last completion event and compering the new event against the old one. we expect that events will always be send-recv-send-recv-... and found that we get two consecutive similar events.

server output: 2305 [Thread-0] INFO com.ibm.disni - got event type + RDMA_CM_EVENT_ESTABLISHED, srcAddress /192.168.33.137:8881, dstAddress /192.168.33.137:51085 RDMAvsTcpBenchmarkServer::client connection accepted 2308 [Thread-1] INFO com.ibm.disni - cq processing, caught exception but keep going *server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2000***** java.lang.RuntimeException: *server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2000***** at com.ibm.disni.examples.SendRecvServer$CustomServerEndpoint.dispatchCqEvent(SendRecvServer.java:212) at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:37) at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:29) at com.ibm.disni.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106) at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:136) at java.lang.Thread.run(Thread.java:745) 2309 [Thread-1] INFO com.ibm.disni - cq processing, caught exception but keep going *server got IBV_WC_RECV event twice in a row. last id = 2001, current id 2001***** java.lang.RuntimeException: *server got IBV_WC_RECV event twice in a row. last id = 2001, current id 2001***** at com.ibm.disni.examples.SendRecvServer$CustomServerEndpoint.dispatchCqEvent(SendRecvServer.java:212) at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:37) at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:29) at com.ibm.disni.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106) at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:136) at java.lang.Thread.run(Thread.java:745)

changedFiles.zip

patrickstuedi commented 5 years ago

Does the problem also occur with the SendRecv Benchmark (https://github.com/zrlio/disni/blob/master/src/test/java/com/ibm/disni/benchmarks/SendRecvClient.java, https://github.com/zrlio/disni/blob/master/src/test/java/com/ibm/disni/benchmarks/SendRecvServer.java)?

yaelna commented 5 years ago

I didn't try it, I have made changes in these classes: https://github.com/zrlio/disni/blob/master/src/test/java/com/ibm/disni/examples/SendRecvClient.java https://github.com/zrlio/disni/blob/master/src/test/java/com/ibm/disni/examples/SendRecvServer.java

petro-rudenko commented 5 years ago

The issue is that disni reuses wc object. In your code you need to do:

lastEvent.set(wc.clone());

After changing this got correct result SEND -> RECV -> SEND -> RECV, etc.

yaelna commented 5 years ago

I've changed my code as follow:

public synchronized void dispatchCqEvent(IbvWC wc) throws IOException { IbvWC clone = wc.clone(); int newOpCode = clone.getOpcode(); IbvWC old = lastEvent.get(); if(old == null){ lastEvent.set(clone); } else if (old.getOpcode() == newOpCode){ throw new RuntimeException("*server got "+ IbvWC.IbvWcOpcode.valueOf(newOpCode) +" event twice in a row. last id = "+old.getWr_id()+", current id "+wc.getWr_id()+"*****"); } lastEvent.set(clone); wcEvents.add(clone); } and still I get this error in my logs : 7814 [Thread-1] INFO com.ibm.disni - cq processing, caught exception but keep going *server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2001***** java.lang.RuntimeException: *server got IBV_WC_SEND event twice in a row. last id = 2000, current id 2001***** at com.ibm.disni.examples.SendRecvServer$CustomServerEndpoint.dispatchCqEvent(SendRecvServer.java:213) at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:37) at com.ibm.disni.RdmaActiveCqProcessor.dispatchCqEvent(RdmaActiveCqProcessor.java:29) at com.ibm.disni.RdmaCqProcessor.dispatchCqEvent(RdmaCqProcessor.java:106) at com.ibm.disni.RdmaCqProcessor.run(RdmaCqProcessor.java:136) at java.lang.Thread.run(Thread.java:745)

I am using Soft-RoCE to do this check , are you using the real hardware?

petro-rudenko commented 5 years ago

@yaelna Can you please run this function:

public void dispatchCqEvent(IbvWC wc) throws IOException {
                        System.out.println("Server got event " + wc.getWr_id() + " : " + IbvWC.IbvWcOpcode.valueOf(wc.getOpcode()) +
                                "Old " + lastEvent.get() + "\n\n");
                        int newOpCode = wc.getOpcode();
                        IbvWC old = lastEvent.get();
                        if(old == null){
                                lastEvent.set(wc.clone());
                                System.out.println(" Last event1 = " + lastEvent.get());
                        } else if (old.getOpcode() == newOpCode){
                                System.out.println( "Last event: " + old  + "current :" + wc);
                                throw new RuntimeException("*******server got "+ IbvWC.IbvWcOpcode.valueOf(newOpCode) +" event twice in a row. last id = "+old.getWr_id()+", current id "+old.getWr_id()+"***********");
                        }
                        lastEvent.set(wc.clone());
                        System.out.println("Last event1 = " + lastEvent.get());
                        wcEvents.add(wc);
                }

Yes i'm running on real Roce device. Regarding SoftRoce, maybe related to #37. Please see my comment: https://github.com/zrlio/disni/issues/37#issuecomment-458055469

venkatsc commented 5 years ago

I am testing Flink with DiSNI and noticed some sporadic duplicated receives. We are already cloning the WC in endpoint classes as suggested in this bug.

Receive complete: 406030 buff address: 139941385133904 seq:405014 
Receive complete: 406031 buff address: 139941385166688 seq:405016 <-- should be 405015
Receive complete: 406032 buff address: 139941385199472 seq:405016

All the receives starting from seq 0 till 405014 received without any repetition. This issue is happening only once in a while.

Server did not repeat the sequence during send, however, on client it is received twice. Any ideas why this could happen?

Additional information: it is tested on InfiniBand HCA mlx-4

venkatsc commented 5 years ago

Disclaimer: Following analysis is purely based on code walk through.

The issue could be with the way free() is used. For instance, the provided examples executes the command and frees the resources ,immediately, before WC event received.

This immediate free() of resources could tamper with the work request, as free() will return the direct buffer to which this command is written to the pool, before HCA executing the posted WR. The returned buffer could be allocated to the next WR and tampers the data associated with the already posted WR.

Imagining the HCA reading the WR from the command address and current WR being written to the same address as an interleaved operation, this could explain why we see duplicate WR ids, data in the receive or send completions.

The solution could be to hold command reference until WC is received and call free() on command only on WC receive.

patrickstuedi commented 5 years ago

Hi Venkatsc, thanks for posting this. Let me double check the code as well and find out if this is really an issue of the buffer being freed too early, that would be a major bug.

patrickstuedi commented 5 years ago

Can you point me to the line in the example with the problematic early free?

venkatsc commented 5 years ago

@patrickstuedi I have faced with this issue irregularly while testing code by posting many WRs at once. As per my reading of the code, it is more of documentation fix. Method free() should not be called until WR is completed. But the examples in the repository uses free() immediately after execute() and that may not cause issues when small number of WR posted.

The below line of code is taken from here

endpoint.postSend(endpoint.getWrList_send()).execute().free();

I think, below flow causing the issue when execute and free methods used in immediate succession.

  1. Post a WR with WR_ID: 100 by issuing exec() and the object is written to memory address.
    • Let us make some assumptions:
    • Command is written to the memory address 123
    • There are few hundred pending WRs queued at the HCA and the current WR is added to the queue.
  2. Since exec() is non-blocking, as soon as the WR is added to the queue, the free() will return the buffer to the memory pool by calling 1, 2.
  3. Post new WR with WR_ID: 101, this request tries to get the buffer from the pool.
    • Let us assume, the buffer with 123 from the pool is assigned to this WR .
  4. Now, following three scenarios can happen
    • WR with WR_ID: 100 is processed by HCA already, everything is fine. WR with WR_ID: 101 can override the data at memory address 123 without any issues.
      • HCA is currently processing WR with WR_ID: 100 and WR with WR_ID: 101 starts writing its request at the same location. Result: memory corruption and depending on how much HCA read, the result could be duplicate WR with same id, or different ids posting same memory address for data send/receive.
      • HCA has not reached the WR_ID:100 for processing. In this case, WR_ID: 101 would replace WR_ID: 100 and add duplicate WR to the queue.
patrickstuedi commented 5 years ago

in (2), when exec is called, the WR list is actually copied by the user driver to a dedicated memory area from where the NIC is fetching the WR list via DMA. The copied WR lits also needs to have the specific format matching the device so the user driver also creates a slightly different layout. Therefore, the application memory backing the WR list can be freed and re-used right away after exec().

The copying of the WR list can been seen in the default post_send call of libibverbs that eventuallyIcalls into the kernel:

https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/cmd.c#L1247

This is not the post_send call that typiclly gets executed if you run on IB or RoCE beause normally a specific user-driver for a particular NIC is directly memory mapping the NIC queues and populates WRs without kernel involvement, but the copying will be similar.

At least that's my understanding of how things work, happy to discuss.

patrickstuedi commented 5 years ago

Note that we are talking about copying WRs, not data, just to not create confusion here.

BernardMetzler commented 5 years ago

-----"Patrick Stuedi" notifications@github.com wrote: -----

To: "zrlio/disni" disni@noreply.github.com From: "Patrick Stuedi" notifications@github.com Date: 11/05/2019 04:42PM Cc: "Subscribed" subscribed@noreply.github.com Subject: [EXTERNAL] Re: [zrlio/disni] work completion event received with wrong value (#44)

in (2), when exec is called, the WR list is actually copied by the user driver to a dedicated memory area from where the NIC is fetching the WR list via DMA. The copied WR lits also needs to have the specific format matching the device so the user driver also creates a slightly different layout. Therefore, the application memory backing the WR list can be freed and re-used right away after exec(). The copying of the WR list can been seen in the default post_send call of libibverbs that eventuallyIcalls into the kernel:

https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/cmd.c# L1247 This is not the post_send call that typiclly gets executed if you run on IB or RoCE beause normally a specific user-driver for a particular NIC is directly memory mapping the NIC queues and populates WRs without kernel involvement, but the copying will be similar. At least that's my understanding of how things work, happy to discuss.

I agree. The work request list is only needed to create driver specific work queue elements. The post_send() or post_receive() call is synchronous. If the call finishes, appropriate work requests got instantiated within the work queue of the driver, or the call fails since the work queue was full. In any case, with the call return, the WR list is under control of the caller again and can be re-used/overwritten, or free'd.

Best regards, Bernard.

venkatsc commented 5 years ago

https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/cmd.c#L1247

Indeed, you are right. I was under assumption that, libibverbs does not copy the request from the given address. As I mentioned before, the issues is occurring rarely, but after not freeing resource until completion I didn't see the issue. Hence, thought free() could be the issue.

Recently, after posting the comment, I had faced the duplicate response again. Issue is due to application bug that is overriding header in the same memory location.

venkatsc commented 5 years ago

Should I remove the comments, to avoid misdirection. As you explained, free() is not the cause.

patrickstuedi commented 5 years ago

Please leave the comments, I think it's a good discussion. And it would be great to find the actual cause for this problem..