patrickmacarthur closed 8 years ago
After some more digging, I found that this issue only occurs for message sizes of about 32 bytes or less.
So far, I cannot reproduce it. Could it be that the sender overruns the receiver with SENDs, and the receiver cannot keep up with pre-posting RECEIVEs? A SEND to an empty RQ would break the connection. Do you see any such 'RX ERROR' messages via dmesg?
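To make the suspected failure mode concrete, here is a toy model of a receive queue being drained faster than it is refilled. It is only a sketch: `ReceiveQueue`, `run`, and their parameters are hypothetical names made up for illustration, not part of any verbs library or of the driver, and the model ignores timing, completion queues, and everything else a real stack does.

```python
from collections import deque


class ReceiveQueue:
    """Toy model of an RQ (hypothetical, not real verbs code)."""

    def __init__(self):
        self.posted = deque()   # pre-posted receive buffers
        self.broken = False     # set once a SEND finds the RQ empty

    def post_recv(self, buf_id):
        self.posted.append(buf_id)

    def on_send_arrival(self, payload):
        # A SEND needs a pre-posted RECEIVE; with none available the
        # connection is treated as broken, as described above.
        if self.broken or not self.posted:
            self.broken = True
            return None
        return (self.posted.popleft(), payload)


def run(num_sends, initial_recvs, repost_every):
    """Deliver up to num_sends SENDs of 32 bytes each.

    The receiver starts with initial_recvs buffers and posts one new
    buffer after every repost_every arrivals.
    Returns (delivered_count, connection_broken).
    """
    rq = ReceiveQueue()
    for i in range(initial_recvs):
        rq.post_recv(i)

    delivered = 0
    for n in range(1, num_sends + 1):
        if rq.on_send_arrival(b"x" * 32) is None:
            break
        delivered += 1
        if n % repost_every == 0:
            rq.post_recv(initial_recvs + n)
    return delivered, rq.broken


if __name__ == "__main__":
    print(run(100, 4, 1))  # receiver keeps up
    print(run(100, 4, 2))  # receiver reposts too slowly
```

When the receiver reposts one buffer per arrival, all SENDs are delivered. When it reposts only after every second arrival, the queue eventually runs dry and the model marks the connection as broken, which is the situation the 'RX ERROR' question is probing for.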
I looked at dmesg and realized that there appears to be a firmware bug in the underlying NIC. I was able to work around the bug by disabling the relevant offload feature on the NIC and now the test runs fine.
There appears to be a different issue with the RDMA READ bandwidth test but I don't have time to debug it now. I will open a new ticket when I am able to gather more information.
This is on the latest master (commit id 6731fa60c32c9d4a73a27e0737a4fc99fe48d7c4) under kernel version 3.17.8, running perftest-2.4-1.el7 on Scientific Linux 7.2. The hang is purely in userspace.
Stack trace on server:
Stack trace on client:
This is reproducible about 90% of the time.
Please let me know if you need any more information to reproduce the issue.