yahoo / CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Apache License 2.0
1.27k stars, 358 forks

RDMA over ethernet #74

Open markyeh opened 8 years ago

markyeh commented 8 years ago

Hi, folks,

Do you have any plans to support RoCE (RDMA over Converged Ethernet) devices?

I am trying to set up CaffeOnSpark in my GPU+RoCE environment, and I added a GID index in the source code for the RoCE connection.

However, ibv_reg_mr() now fails when I pass it the "data_" address. If I replace the "data_" address with a malloc'ed address (same size), it works.

It would be great if you could give me any suggestions about this issue. Thanks.
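As a side note on the GID index mentioned above: on Linux, the GID table of an RDMA port is normally exposed through sysfs, so it can be inspected before hard-coding an index. The following is only a minimal sketch, not CaffeOnSpark code. The function names `classify_gid` and `list_gids` are hypothetical, and the sysfs layout (`/sys/class/infiniband/<device>/ports/<port>/gids/<index>` and `gid_attrs/types/<index>`) is assumed to match your kernel and driver.

```python
import ipaddress
from pathlib import Path

# Assumption: standard Linux sysfs location for RDMA devices.
SYSFS_ROOT = Path("/sys/class/infiniband")


def classify_gid(gid_text):
    """Classify one GID string as 'empty', 'ipv4-mapped', 'link-local' or 'other'."""
    addr = ipaddress.IPv6Address(gid_text.strip())
    if int(addr) == 0:
        return "empty"            # unused slot in the GID table
    if addr.ipv4_mapped is not None:
        return "ipv4-mapped"      # GID derived from an IPv4 address
    if addr.is_link_local:
        return "link-local"       # fe80::/10, derived from the MAC address
    return "other"


def list_gids(device, port=1, root=SYSFS_ROOT):
    """Return (index, gid, kind, roce_type) for every populated GID entry."""
    port_dir = Path(root) / device / "ports" / str(port)
    entries = []
    for gid_file in sorted((port_dir / "gids").iterdir(), key=lambda p: int(p.name)):
        try:
            gid_text = gid_file.read_text().strip()
        except OSError:
            continue              # some drivers reject reads of unused slots
        kind = classify_gid(gid_text)
        if kind == "empty":
            continue
        try:
            roce_type = (port_dir / "gid_attrs" / "types" / gid_file.name).read_text().strip()
        except OSError:
            roce_type = "unknown"  # type attribute missing or unreadable
        entries.append((int(gid_file.name), gid_text, kind, roce_type))
    return entries


if __name__ == "__main__":
    if SYSFS_ROOT.is_dir():
        for dev in sorted(p.name for p in SYSFS_ROOT.iterdir()):
            for index, gid, kind, roce_type in list_gids(dev):
                print(dev, index, gid, kind, roce_type)
    else:
        print("no RDMA devices found under", SYSFS_ROOT)
```

The index printed for the entry you want is the value that would be passed as the source GID index when setting up the connection. Which entry is correct depends on the RoCE version and addressing used in your network.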

mriduljain commented 8 years ago

We don't have any plans for RoCE, but it would be great if you can make it work. As with InfiniBand RDMA support, it would be awesome to have RoCE support in CaffeOnSpark. I don't know much about RoCE, but I can provide whatever help you need. Let me check out the InfiniBand code tomorrow and help with what you are asking.

mriduljain commented 8 years ago

Since we use libibverbs, we are almost ready for RoCE too. I guess we just need to add a couple of lines for the RoCE handshake or equivalent.

mriduljain commented 8 years ago

@markyeh I see, you have apparently already added that in your code. In the next couple of weeks I may be able to check it by switching to RoCE, if you share the code.

shenjingGitHub commented 7 years ago

Hi markyeh! Were you able to run it over InfiniBand? When I use InfiniBand cards, no data is transferred between the sender and the receiver. Do you know how to resolve this problem? Thank you.

markyeh commented 7 years ago

Hi Shenjing,

Sorry, I do not use CaffeOnSpark anymore. Just FYI, the RoCE issue that I mentioned was most likely a driver setup issue. Good luck.