yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0

CaffeOnSpark uses InfiniBand but cannot find the address of another InfiniBand host. #290

Closed loveheng closed 6 years ago

loveheng commented 6 years ago

I1204 14:17:51.222190 51672 solver.cpp:60] Solver scaffolding done.
I1204 14:17:51.228431 51672 CaffeNet.cpp:240] RDMA adapter: mlx5_0
I1204 14:17:51.233731 51672 CaffeNet.cpp:388] 0-th RDMA addr: 0200000038010000c52c5c00
I1204 14:17:51.233794 51672 CaffeNet.cpp:388] 1-th RDMA addr:
I1204 14:17:51.233836 51672 JniCaffeNet.cpp:145] 0-th local addr: 0200000038010000c52c5c00
I1204 14:17:51.233851 51672 JniCaffeNet.cpp:145] 1-th local addr:
17/12/04 14:17:51 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 931 bytes result sent to driver
17/12/04 14:17:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 2
17/12/04 14:17:51 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 2)
17/12/04 14:17:51 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/12/04 14:17:51 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1563.0 B, free 7.1 KB)
17/12/04 14:17:51 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 19 ms
17/12/04 14:17:51 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.6 KB, free 9.7 KB)
17/12/04 14:17:51 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2
17/12/04 14:17:51 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 98.0 B, free 9.8 KB)
17/12/04 14:17:51 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 14 ms
17/12/04 14:17:51 INFO storage.MemoryStore: Block broadcast2 stored as values in memory (estimated size 392.0 B, free 10.1 KB)
I1204 14:17:51.413678 51672 common.cpp:61] 1-th string is NULL
F1204 14:17:51.440570 51672 rdma.cpp:327] Check failed: self Failed to register memory region

junshi15 commented 6 years ago

Is it possible that you have multiple Infiniband adapters in the boxes and the first adapter is down?
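
One way to check this on a Linux host is to read the port state that the kernel exposes through sysfs. The sketch below is only an illustration and is not part of CaffeOnSpark: the function name `list_ib_port_states` and the `sysfs_root` parameter are hypothetical, and it assumes the usual layout `/sys/class/infiniband/<device>/ports/<port>/state`.

```python
import os

def list_ib_port_states(sysfs_root="/sys/class/infiniband"):
    """Return {(device, port): state} for every port found under sysfs_root.

    Hypothetical helper; assumes <device>/ports/<port>/state files exist.
    """
    states = {}
    if not os.path.isdir(sysfs_root):
        # No InfiniBand devices visible (or a non-Linux host).
        return states
    for device in sorted(os.listdir(sysfs_root)):
        ports_dir = os.path.join(sysfs_root, device, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir)):
            state_file = os.path.join(ports_dir, port, "state")
            try:
                with open(state_file) as f:
                    states[(device, port)] = f.read().strip()
            except OSError:
                states[(device, port)] = "unreadable"
    return states

if __name__ == "__main__":
    for (device, port), state in list_ib_port_states().items():
        print(f"{device} port {port}: {state}")
```

If the first adapter listed reports a state other than ACTIVE while a later one is active, that would be consistent with the situation asked about here.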

loveheng commented 6 years ago

The log shows that the 1-th local addr is empty and that the "1-th string is NULL". What is the cause of this?

junshi15 commented 6 years ago

It is from here.

I assume you have two nodes. This log is from the rank-1 node; the 1-th string, i.e. its own address, is "" according to this line.

If you looked at logs from the other node, the 0-th string probably was empty.

There is nothing wrong with the addresses.
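
A small sketch of the pattern described above, for illustration only. It is not CaffeOnSpark's code: `local_addresses` is a hypothetical helper, and it makes the simplifying assumption that a node keeps one address slot per rank and leaves the slot for its own rank empty, since it does not connect to itself.

```python
def local_addresses(rank, cluster_size, local_addr):
    """Return one address slot per rank, with this node's own slot empty.

    Hypothetical helper; simplified so that every peer slot holds the
    same local address string.
    """
    if not 0 <= rank < cluster_size:
        raise ValueError("rank must be in [0, cluster_size)")
    return ["" if peer == rank else local_addr for peer in range(cluster_size)]

if __name__ == "__main__":
    # Address string taken from the log above; two nodes, viewed from rank 1.
    addrs = local_addresses(1, 2, "0200000038010000c52c5c00")
    for i, addr in enumerate(addrs):
        print(f"{i}-th local addr: {addr}")
```

Seen from rank 1, the 1-th slot is empty, which matches the log in the original post; seen from rank 0, the 0-th slot would be the empty one.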

loveheng commented 6 years ago

I see, thank you.