worldcoin / iris-mpc

Apache License 2.0

INFRA-3262: Add `Dockerfile` with `nccl` and `CUDA 12.6` #510

Closed marcin-janas closed 3 days ago

marcin-janas commented 1 week ago

Requestor/Issue: https://linear.app/worldcoin/issue/INFRA-3262/test-smpcv2-performance-on-new-cuda
Tested (yes/no): no
Description/Why: To be able to test nccl with CUDA 12.6 and fix the error below:

starting device 0...
ip-10-15-32-252:44:44 [0] NCCL INFO cudaDriverVersion 12060
ip-10-15-32-252:44:44 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to enp71s0
ip-10-15-32-252:44:44 [0] NCCL INFO Bootstrap : Using enp71s0:10.15.32.252<0>
ip-10-15-32-252:44:44 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
ip-10-15-32-252:44:44 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
ip-10-15-32-252:44:44 [0] NCCL INFO NET/Plugin: Loaded net plugin AWS Libfabric (v6)
ip-10-15-32-252:44:44 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
ip-10-15-32-252:44:44 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
ip-10-15-32-252:44:44 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
ip-10-15-32-252:44:44 [0] NCCL INFO NET/OFI Selected Provider is efa (found 32 nics)
ip-10-15-32-252:44:44 [0] NCCL INFO Using network AWS Libfabric
ip-10-15-32-252:44:44 [0] NCCL INFO ncclCommInitRank comm 0x5559516e5250 rank 1 nranks 3 cudaDev 0 nvmlDev 0 busId 53000 commId 0xe2a7e2c44a2cb587 - Init START

...

ip-10-15-32-252:44:92 [0] NCCL INFO Channel 07/0 : 1[0] -> 2[0] [send] via NET/AWS Libfabric/3/GDRDMA
ip-10-15-32-252:44:92 [0] NCCL INFO Connected all rings
Segmentation fault (core dumped)
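
For comparing runs across images, it may help to pull the component versions out of a log like the one above. The following is a minimal sketch, not part of this PR: the helper names `decode_cuda_version` and `summarize_nccl_log` are hypothetical, and the regular expressions assume the exact line wording shown in this log. The only external fact relied on is that CUDA encodes versions as `1000 * major + 10 * minor`, so `12060` corresponds to 12.6.

```python
import re


def decode_cuda_version(code: int) -> str:
    """Decode a CUDA version integer (1000 * major + 10 * minor) to 'major.minor'."""
    major, rest = divmod(code, 1000)
    return f"{major}.{rest // 10}"


def summarize_nccl_log(log: str) -> dict:
    """Collect version strings from NCCL INFO output (hypothetical helper).

    The patterns match the wording seen in the log above; other NCCL or
    plugin releases may phrase these lines differently.
    """
    patterns = {
        "cuda_driver": r"cudaDriverVersion (\d+)",
        "nccl": r"NCCL version (\S+)",
        "aws_ofi_nccl": r"Using aws-ofi-nccl (\S+)",
    }
    found = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log)
        if match:
            found[key] = match.group(1)
    # Convert the raw driver integer into a readable version string.
    if "cuda_driver" in found:
        found["cuda_driver"] = decode_cuda_version(int(found["cuda_driver"]))
    return found


# Three lines copied from the log above.
sample = """\
ip-10-15-32-252:44:44 [0] NCCL INFO cudaDriverVersion 12060
ip-10-15-32-252:44:44 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
ip-10-15-32-252:44:44 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
"""
print(summarize_nccl_log(sample))
```

This only summarizes what the log already reports (driver 12.6, NCCL 2.22.3+cuda12.6, aws-ofi-nccl 1.5.0aws); it does not identify the cause of the segmentation fault.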

linear[bot] commented 1 week ago

INFRA-3262 Test SMPCv2 performance on new CUDA