zrlio / softiwarp

SoftiWARP: Software iWARP kernel driver and user library for Linux

ib_send_bw hangs #7

Closed by patrickmacarthur 8 years ago

patrickmacarthur commented 8 years ago

ib_send_bw hangs on the latest master (commit id 6731fa60c32c9d4a73a27e0737a4fc99fe48d7c4) under kernel version 3.17.8, using perftest-2.4-1.el7 on Scientific Linux 7.2. The hang is purely in userspace.

Stack trace on server:

#0  0x00007ffff665752e in siw_poll_cq_mapped () from /lib64/libsiw-rdmav2.so
#1  0x00000000004127b4 in ibv_poll_cq (wc=0x624e50, num_entries=16, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1277
#2  run_iter_bw_server (ctx=ctx@entry=0x7fffffffdd70, user_param=user_param@entry=0x7fffffffde90) at src/perftest_resources.c:2699
#3  0x0000000000403677 in main (argc=<optimized out>, argv=<optimized out>) at src/send_bw.c:429

Stack trace on client:

#0  pthread_spin_lock () at ../nptl/sysdeps/x86_64/pthread_spin_lock.S:26
#1  0x00007ffff665752e in siw_poll_cq_mapped (ibcq=0x624a90, num_entries=<optimized out>, wc=0x7fffffffdb60) at src/siw_uverbs.c:478
#2  0x00000000004050d0 in ibv_poll_cq (wc=0x7fffffffdb60, num_entries=1, cq=<optimized out>) at /usr/include/infiniband/verbs.h:1277
#3  rdma_read_keys (rem_dest=rem_dest@entry=0x625e60, comm=comm@entry=0x7fffffffdc40) at src/perftest_communication.c:407
#4  0x00000000004068c3 in ctx_hand_shake (comm=comm@entry=0x7fffffffdc40, my_dest=my_dest@entry=0x624b50, rem_dest=rem_dest@entry=0x625e60) at src/perftest_communication.c:1103
#5  0x00000000004036f0 in main (argc=<optimized out>, argv=<optimized out>) at src/send_bw.c:440

This is reproducible about 90% of the time.

Please let me know if you need any more information to reproduce the issue.

patrickmacarthur commented 8 years ago

After some more digging, I found that this issue only occurs for message sizes of about 32 bytes or less.

BernardMetzler commented 8 years ago

So far, I cannot reproduce it. It might be that the sender overruns the receiver with SENDs because the receiver cannot keep up with pre-posting RECEIVEs. A SEND to an empty RQ would break the connection. Do you see any 'RX ERROR' messages in the dmesg output?
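
To illustrate the overrun scenario described above, here is a minimal sketch in plain C. It is a toy model only: the `toy_rq` structure and the `toy_*` functions are hypothetical names made up for this example and are not part of SoftiWARP, libibverbs, or perftest. It assumes that each inbound SEND consumes exactly one pre-posted RECEIVE and that a SEND arriving at an empty RQ breaks the connection.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a receive queue (RQ); not SoftiWARP code. */
struct toy_rq {
    unsigned posted; /* RECEIVEs currently pre-posted */
    bool broken;     /* set once a SEND has found the RQ empty */
};

/* Receiver side: pre-post n RECEIVE work requests. */
void toy_post_recv(struct toy_rq *rq, unsigned n)
{
    rq->posted += n;
}

/*
 * An inbound SEND consumes one posted RECEIVE. If none is available,
 * the model marks the connection as broken. Returns true on success.
 */
bool toy_deliver_send(struct toy_rq *rq)
{
    if (rq->broken)
        return false;
    if (rq->posted == 0) {
        rq->broken = true;
        return false;
    }
    rq->posted--;
    return true;
}

/*
 * Deliver up to `sends` messages in bursts of `burst`. After each burst
 * the receiver replenishes `repost` RECEIVEs. Returns the number of
 * SENDs delivered before completion or before the connection broke.
 */
unsigned toy_run(unsigned initial, unsigned sends,
                 unsigned burst, unsigned repost)
{
    struct toy_rq rq = { 0, false };
    unsigned delivered = 0;

    assert(burst > 0); /* a zero-sized burst would never make progress */

    toy_post_recv(&rq, initial);
    while (delivered < sends) {
        for (unsigned i = 0; i < burst && delivered < sends; i++) {
            if (!toy_deliver_send(&rq))
                return delivered; /* SEND hit an empty RQ */
            delivered++;
        }
        toy_post_recv(&rq, repost);
    }
    return delivered;
}
```

In this model, `toy_run(16, 100, 8, 8)` delivers all 100 SENDs because the receiver reposts as fast as the sender consumes, while `toy_run(16, 100, 8, 4)` stops after 28 because the RQ drains faster than it is refilled. Whether this is what actually happens in the reported hang is not established here.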

patrickmacarthur commented 8 years ago

I looked at dmesg and realized that there appears to be a firmware bug in the underlying NIC. I was able to work around the bug by disabling the relevant offload feature on the NIC and now the test runs fine.

There appears to be a different issue with the RDMA READ bandwidth test, but I don't have time to debug it now. I will open a new ticket when I am able to gather more information.