timurcarstensen closed 1 week ago
Closes #153
Could you check that this works for you @rheasukthanker ? I tested this on juwels and didn't have any issues
I tried the fix, but from deepspeed.profiling.flops_profiler import get_model_profile fails for me on a GPU node on juwels with:
raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)
Are you setting CUDA_HOME somewhere?
No, I’m just importing the GCC and CUDA modules and then everything runs just fine for me. Could you share your setup with me?
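For comparing setups, a small diagnostic along these lines might help. It is only a sketch: the helper name guess_cuda_home is made up for illustration, and it checks just the CUDA_HOME variable and whether nvcc is on PATH, which is an assumption about what loading the CUDA module changes and not a copy of DeepSpeed's own lookup.

```python
import os
import shutil


def guess_cuda_home(env=None, which=shutil.which):
    """Return a likely CUDA root directory, or None if nothing is found.

    Hypothetical helper for debugging only; it does not reproduce
    DeepSpeed's internal detection logic.
    """
    env = os.environ if env is None else env

    # 1. An explicitly exported CUDA_HOME wins.
    cuda_home = env.get("CUDA_HOME")
    if cuda_home:
        return cuda_home

    # 2. Otherwise infer the root from nvcc on PATH: <root>/bin/nvcc -> <root>.
    nvcc = which("nvcc")
    if nvcc:
        return os.path.dirname(os.path.dirname(nvcc))

    return None


print("CUDA root guess:", guess_cuda_home())
```

If this prints None on the GPU node, the CUDA module is probably not loaded in that shell.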
Thanks, I realised I wasn't loading CUDA on the GPU node (my other script still worked just fine). One thing I notice, however, is that DeepSpeed autodetects CUDA as the accelerator when running on a GPU, and the CPU as the accelerator when running on a CPU only. I am not sure whether we can somehow prevent DeepSpeed from autodetecting the GPU during import?
I think there's no way around this, since it happens when we import the profiler from DeepSpeed. I would merge this now :)
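Since the detection runs at import time, one option on the calling side is to guard the import so that a missing CUDA setup does not crash the whole script. The following is a rough sketch under that assumption; load_flops_profiler is a hypothetical wrapper and not part of whittle or DeepSpeed, and only the module path and get_model_profile name are taken from the thread above.

```python
import importlib


def load_flops_profiler():
    """Try to import DeepSpeed's get_model_profile; return None on failure.

    Hypothetical wrapper: it catches any exception raised during import
    (for example a missing package or a missing CUDA_HOME) instead of
    letting it propagate.
    """
    try:
        module = importlib.import_module("deepspeed.profiling.flops_profiler")
    except Exception as exc:  # broad on purpose: import-time failures vary
        print(f"DeepSpeed FLOPs profiler unavailable: {exc!r}")
        return None
    return getattr(module, "get_model_profile", None)


profile_fn = load_flops_profiler()
if profile_fn is None:
    print("Skipping FLOPs profiling.")
```

This does not stop DeepSpeed from autodetecting the accelerator; it only lets the caller skip profiling when the import fails.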