Closed cwsmith closed 4 years ago
Can you try setting `processes-per-node` to 4? I don't know what happens if it exceeds `processes`.
Thank you @suchyta1. I reduced `processes-per-node` to 4 for the coupler and got the same error at run time.
terminate called after throwing an instance of 'std::runtime_error'
what(): cudaGetDeviceCount(&m_cudaDevCount) error( cudaErrorNoDevice): no CUDA-capable device is detected /gpfs/alpine/fus123/scratch/cwsmith/spack-stage/spack-stage-kokkos-3.1-if6ttd7st5pdd3hese5w5d3stjdtgemz/spack-src/core/src/Cuda/Kokkos_Cuda_Instance.cpp:223
Traceback functionality not available
AFAIK, LSF needs to be told that the job step (not sure what the official LSF term for a `jsrun` call is...) requires four resource sets, each with one CPU process and one GPU. I recall that EFFIS was using ERF files to supply this info to LSF. For example, without using EFFIS, my `jsrun` command for the coupler is:
jsrun <environment stuff> \
-n 4 --tasks_per_rs 1 --cpu_per_rs 1 --gpu_per_rs 1 --bind rs \
/path/to/coupler 1
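The resource-set layout in the command above can be sketched as a small helper that assembles the `jsrun` arguments. This is only an illustration: `jsrun_args` and its parameters are hypothetical names, not EFFIS code, and the only flags used are the ones that appear in the command above.

```python
def jsrun_args(processes, cpus_per_process=1, use_gpus=False):
    """Build jsrun arguments with one resource set per MPI rank.

    Hypothetical helper for illustration; not part of EFFIS or LSF.
    """
    args = [
        "-n", str(processes),                    # one resource set per process
        "--tasks_per_rs", "1",                   # a single task in each resource set
        "--cpu_per_rs", str(cpus_per_process),
    ]
    if use_gpus:
        # Without this flag the resource set has no GPU, so the process
        # cannot see any CUDA device.
        args += ["--gpu_per_rs", "1"]
    args += ["--bind", "rs"]
    return args


# Reassemble the coupler command (environment settings omitted).
print(" ".join(["jsrun"] + jsrun_args(4, use_gpus=True) + ["/path/to/coupler", "1"]))
```

With `use_gpus=False` the `--gpu_per_rs` flag is simply left out, which would be consistent with the `cudaErrorNoDevice` failure reported above.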
Can you try using `True` or `On` instead of `1` for `use-gpus`? I'm not sure if `1` will resolve correctly in YAML as a boolean instead of an integer. The way it's implemented now, a GPU will be assigned to each MPI rank. (`use-gpus` only has a direct effect on Summit, because Summit explicitly has the `--gpu_per_rs` setting. On Rhea, if you use the gpu partition, `use-gpus` doesn't have any effect, as there's no GPU setting flag with srun.)
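To illustrate the boolean-versus-integer concern, here is a simplified stand-in for YAML 1.1 scalar resolution. It assumes the composition file is read by a YAML 1.1 parser such as PyYAML (an assumption, not something confirmed in this thread); `resolve_scalar` is a hypothetical helper that handles only booleans and integers.

```python
import re

# Boolean spellings recognized by YAML 1.1 (the rules PyYAML follows).
YAML11_BOOL = re.compile(
    r"^(?:yes|Yes|YES|no|No|NO|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF)$"
)
TRUE_WORDS = {"yes", "true", "on"}


def resolve_scalar(text):
    """Resolve a plain scalar to bool, int, or str (simplified sketch)."""
    if YAML11_BOOL.match(text):
        return text.lower() in TRUE_WORDS
    if re.fullmatch(r"[-+]?[0-9]+", text):
        return int(text)
    return text


# Show the resolved type and whether the value is the boolean True.
for raw in ["True", "On", "1"]:
    value = resolve_scalar(raw)
    print(raw, type(value).__name__, value is True)
```

Under these rules `True` and `On` resolve to a boolean while `1` stays an integer, so whether `1` works depends on how the consuming code tests the value.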
Are the environment things important? You might need to set those. Though I don't think you need to load CUDA, as far as I'm aware.
coupler:
  pre-submit-commands: ["mkdir out"]
  processes: 4
  processes-per-node: 4
  cpus-per-process: 1
  use-gpus: True
  executable_path: /gpfs/alpine/fus123/scratch/cwsmith/spack-install/linux-rhel7-power9le/gcc-8.1.1/coupler-develop-6eczhtc7ufb6o62onfeabrropnj4ahv6/bin/cpl
  commandline_args:
    - ${steps}
  env:
    OMP_NUM_THREADS: 1
    HDF5_USE_FILE_LOCKING: 'FALSE'
Adding `use-gpus: True` got the coupler job step past the `cudaGetDeviceCount` error. Thank you.
The change is here: https://github.com/SCOREC/testcases/commit/d9fd662daf312452b0157f662d86d6b3501eecd3
The `<environment stuff>` passed to `jsrun` is an `LD_PRELOAD` setting/hack to avoid a spack issue on Summit.
In the composition file here: https://github.com/SCOREC/testcases/blob/cplEffis/run_1/summit/run_1.yaml how can I request that each process for the `coupler` has one GPU associated with it? I found the `use-gpus` logic https://github.com/wdmapp/effis/blob/a80650d6d50c4adc422afeca64dbcc7ea909438c/util/effis-compose.py#L787-L788 but was not sure how to use it. My naive attempt to add `use-gpus: 1` in the `coupler:` section failed with an `effis-compose.py` indexing error.

Note, if this is not supported yet we can make Kokkos optional in the coupler to avoid this.