zrlio / disni

DiSNI: Direct Storage and Networking Interface
Apache License 2.0

performance #58

Open bburas opened 1 year ago

bburas commented 1 year ago

I am using the RDMA benchmark code to run a send/recv latency test between two directly connected 100 Gb/s Mellanox cards. I see 100-700 us for a small string (12 bytes), but qperf rc_lat reports about 4 us. Is this what I should expect?

    sendBuf.asCharBuffer().put(msg);
    sendBuf.clear();
    postSend.getWrMod(0).getSgeMod(0).setLength(msg.length());
    postSend.execute();
    endpoint.getWcEvents().take();
    postRecv.execute();
    endpoint.getWcEvents().take();
    recvBuf.clear();

PepperJo commented 1 year ago

You should expect latency slightly higher than native, with maybe 5-10 us of overhead at most. The code snippet you posted is problematic: you should post the receive request before the send is issued. In the worst case, the send from the other side is issued before the receive is posted, and you will run into retries or, on some transports, your connection will be aborted. I recommend running the benchmarks provided here: https://github.com/zrlio/disni/tree/master/src/test/java/com/ibm/disni/benchmarks
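To illustrate why the ordering matters, here is a minimal toy model in plain Java. It is not the DiSNI API and does not talk to any hardware; `ReceiveQueueModel`, `deliverSend`, and `receiverNotReadyCount` are hypothetical names invented for this sketch. It only assumes that an incoming send consumes one previously posted receive buffer, and that a send arriving when none is posted counts as a "receiver not ready" event.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: hypothetical names, no relation to the real DiSNI classes.
final class ReceiveQueueModel {
    // Buffers the application has posted and that are waiting for a send.
    private final Deque<byte[]> postedReceives = new ArrayDeque<>();
    private int receiverNotReadyCount = 0;

    // The application makes a buffer available for the next incoming send.
    void postRecv(byte[] buffer) {
        postedReceives.addLast(buffer);
    }

    // An incoming send arrives. Returns true if a posted buffer was
    // available and the payload was copied into it, false otherwise.
    boolean deliverSend(byte[] payload) {
        byte[] target = postedReceives.pollFirst();
        if (target == null) {
            // Real transports would retry or tear down the connection here.
            receiverNotReadyCount++;
            return false;
        }
        int n = Math.min(payload.length, target.length);
        System.arraycopy(payload, 0, target, 0, n);
        return true;
    }

    int receiverNotReadyCount() {
        return receiverNotReadyCount;
    }

    public static void main(String[] args) {
        ReceiveQueueModel rq = new ReceiveQueueModel();
        // Wrong order: the peer's send arrives before any receive is posted.
        System.out.println("send before recv delivered: " + rq.deliverSend(new byte[12]));
        // Right order: post the receive first, then let the send arrive.
        rq.postRecv(new byte[12]);
        System.out.println("send after recv delivered: " + rq.deliverSend(new byte[12]));
    }
}
```

In the snippet from the question, this corresponds to executing the receive post before `postSend.execute()` rather than after the send completion has been taken.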

bburas commented 1 year ago

I refactored the original benchmark code you reference into separate

  1. init and connect
  2. send msg to server and recv response from server

but I still use the CustomClientEndpoint. CustomClientEndpoint::init() pre-posts a recv request in its last lines:

    System.out.println("SimpleClient::initiated recv");
    this.postRecv(wrList_recv).execute().free();

But as you mentioned, if I run the benchmark code directly with the default 1000 loops, I see about 40 us average, which is not bad:

    java com.ibm.disni.benchmarks.RDMAvsTcpBenchmarkClient -a 192.168.0.37 -p 20886 -s 1024

SimpleClient::initiated recv
RDMAvsTcpBenchmarkClient::client channel set up
RDMA result:
Total time: 38.721645 ms
Bidirectional bandwidth: 0.04925794430511669 Gbytes/s
Bidirectional bandwidth: 0.3940635544409335 Gbits/s
Bidirectional average latency: 0.038721645 ms
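As a sanity check on these numbers, the sketch below recomputes them from the run parameters (1000 loops, `-s 1024`). The formulas are an assumption inferred from the output, not taken from the benchmark source: the reported figures are consistent with average latency = total time / loops, and bandwidth = 2 * size * loops bytes divided by the total time, expressed in units of 2^30 bytes. `PingPongMetrics` and its method names are hypothetical.

```java
// Hypothetical helper; the formulas are inferred from the printed output.
final class PingPongMetrics {
    // Average time per ping-pong iteration in milliseconds.
    static double averageLatencyMs(double totalMs, long loops) {
        return totalMs / loops;
    }

    // Bidirectional bandwidth in units of 2^30 bytes per second,
    // counting one message of `size` bytes in each direction per loop.
    static double bandwidthGiBPerSec(int size, long loops, double totalMs) {
        double bytes = 2.0 * size * loops;
        double seconds = totalMs / 1000.0;
        return bytes / seconds / (1L << 30);
    }

    public static void main(String[] args) {
        double totalMs = 38.721645; // "Total time" from the run above
        long loops = 1000;          // default loop count
        int size = 1024;            // value passed via -s

        double gib = bandwidthGiBPerSec(size, loops, totalMs);
        System.out.println("latency ms:  " + averageLatencyMs(totalMs, loops));
        System.out.println("Gbytes/s:    " + gib);
        System.out.println("Gbits/s:     " + gib * 8);
    }
}
```

Read this way, 0.0387 ms is roughly 39 us per iteration, which is the "about 40 us" figure quoted above.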

PepperJo commented 1 year ago

I recommend running more than just 1000 loops. A total runtime of 38 ms is probably not enough to get stable, representative performance. Keep in mind that in Java it takes a while until all code paths are JIT compiled, so initially there is a lot more overhead.
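One common way to account for JIT warm-up is to run a number of untimed iterations before starting the clock. The following is a minimal sketch in plain Java under that assumption; `WarmupTimer`, `averageLatencyMicros`, and the placeholder operation are hypothetical, and the loop counts are arbitrary. In a real test the `Runnable` would wrap one send/recv round trip.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical timing helper, not part of the DiSNI benchmarks.
final class WarmupTimer {
    // Runs `op` warmupLoops times without timing, then measuredLoops times
    // with timing, and returns the average time per call in microseconds.
    static double averageLatencyMicros(Runnable op, int warmupLoops, int measuredLoops) {
        if (measuredLoops <= 0) {
            throw new IllegalArgumentException("measuredLoops must be positive");
        }
        // Untimed phase: gives the JIT a chance to compile the hot path.
        for (int i = 0; i < warmupLoops; i++) {
            op.run();
        }
        // Timed phase: only these iterations contribute to the result.
        long start = System.nanoTime();
        for (int i = 0; i < measuredLoops; i++) {
            op.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000.0 / measuredLoops;
    }

    public static void main(String[] args) {
        AtomicLong sink = new AtomicLong();
        // Placeholder operation standing in for one send/recv round trip.
        Runnable op = () -> sink.addAndGet(System.nanoTime() & 1);
        double avg = averageLatencyMicros(op, 100_000, 1_000_000);
        System.out.printf("average latency: %.3f us (sink=%d)%n", avg, sink.get());
    }
}
```

For anything beyond a quick check, a dedicated harness such as JMH handles warm-up and dead-code elimination more carefully than a hand-written loop.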

ShiningChuang commented 1 year ago

In fact, I have increased the loop count to 1,000,000 and the buffer size to 32 * 64 in the RDMAvsTcpBenchmarkClient and RDMAvsTcpBenchmarkServer tests, but DiSNI's throughput and latency are not what I expected; they are close to those of TCP. Is this reasonable?

RDMA result:
Total time: 60238.736352 ms
Bidirectional bandwidth: 0.06332631619850285 Gb/s
Bidirectional average latency: 0.060238736352 ms
TCP result:
Total time: 63491.424003 ms
Bidirectional bandwidth: 0.060082087077535914 Gb/s
Bidirectional average latency: 0.063491424003 ms

PepperJo commented 1 year ago

I do see a difference when I run it:

RDMA result:
Total time: 2836.468479 ms
Bidirectional bandwidth: 0.13448756063631195 Gb/s
Bidirectional average latency: 0.02836468479 ms
TCP result:
Total time: 5699.473495 ms
Bidirectional bandwidth: 0.06693069577341723 Gb/s
Bidirectional average latency: 0.05699473495 ms

That said, this benchmark is not good for comparison, as it only uses one outstanding posted receive for RDMA (it's more of a ping-pong test than a good benchmark). I recommend using SendRecvClient/Server if you are interested in send/recv numbers. While it doesn't let you set the number of preposted receives independently from sends, it at least gives you an idea of what performance can be like with higher queue depths (QDs). If you want a "real" RDMA benchmark, i.e. one using one-sided operations like RDMA read, use ReadClient/Server instead of send/recv. I see around 3 us read latency with that benchmark.