NathanTP opened this issue 9 years ago
Replication Instructions:
I figured out what the problem is. The client is hanging in receive_tag_to_addr_info, which loops until it receives a message smaller than TAG_ADDR_MAP_SIZE_MSG. However, the corresponding send_tag_to_addr_info in rmem-server sends only one tag_addr_map message, so the terminating short message never arrives.
I've fixed the problem, but there is an issue with the way we're measuring statistics. The on_completion function is called from a different thread, so calling stats_start and stats_end there corrupts the stats stack. I've commented out the statistics collection in on_completion for now.
In the rvm_better_malloc branch, rvm_test_normal hangs during recovery (i.e. when passed the 'y' flag). Here is the backtrace of the hang from GDB:
```
(gdb) bt
#0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85
#1  0x000000000040233b in receive_tag_to_addr_info (rmem=0x60c2e0) at rmem.c:111
#2  0x00000000004026fd in rmem_connect (rmem=0x60c2e0, host=0x7fffffffed91 "f2", port=0x7fffffffed94 "12345") at rmem.c:175
#3  0x0000000000403360 in rvm_cfg_create (opts=0x7fffffffea00) at rvm.c:109
#4  0x0000000000401ad8 in main (argc=4, argv=0x7fffffffeb28) at tests/rvm_test_normal.c:87
```
The problem occurs when the buddy allocator's memory pool is increased from 16k to 32k (POOL_SZ in buddy_malloc.c:25). As far as I can tell, the only change this causes is that the client allocates 9 blocks instead of 5 during the first allocation (when the buddy pool is allocated).
This error does not occur in dgemv_test, but it does in rvm_test_normal and rvm_test_full. The only significant difference I can think of is that dgemv_test allocates only from the buddy pool, while the rvm_test* programs allocate from both pools (small allocations and >1pg allocations).