wdmapp / effis

specifying gpu requirements on summit #3

Closed cwsmith closed 4 years ago

cwsmith commented 4 years ago

In the composition file here: https://github.com/SCOREC/testcases/blob/cplEffis/run_1/summit/run_1.yaml how can I request that each coupler process has one GPU associated with it? I found the use-gpus logic https://github.com/wdmapp/effis/blob/a80650d6d50c4adc422afeca64dbcc7ea909438c/util/effis-compose.py#L787-L788 but was not sure how to use it. My naive attempt to add use-gpus: 1 in the coupler: section failed with an effis-compose.py indexing error.

Note: if this is not supported yet, we can make Kokkos optional in the coupler to avoid this.

suchyta1 commented 4 years ago

Can you try setting processes-per-node to 4? I don't know what happens if it exceeds processes.

cwsmith commented 4 years ago

Thank you, @suchyta1. I reduced processes-per-node to 4 for the coupler and got the same error at run time:

terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaGetDeviceCount(&m_cudaDevCount) error( cudaErrorNoDevice): no CUDA-capable device is detected /gpfs/alpine/fus123/scratch/cwsmith/spack-stage/spack-stage-kokkos-3.1-if6ttd7st5pdd3hese5w5d3stjdtgemz/spack-src/core/src/Cuda/Kokkos_Cuda_Instance.cpp:223
Traceback functionality not available

AFAIK, LSF needs to be told that the job step (not sure what the official LSF term for a jsrun call is...) requires four resource sets, each with one CPU process and one GPU. I recall that EFFIS was using ERF files to supply this info to LSF. For example, without using EFFIS, my jsrun command for the coupler is:

jsrun <environment stuff> \
  -n 4 --tasks_per_rs 1 --cpu_per_rs 1 --gpu_per_rs 1 --bind rs \
  /path/to/coupler 1

suchyta1 commented 4 years ago

Can you try using True or On instead of 1 for use-gpus? I'm not sure whether 1 will resolve in YAML as a boolean rather than an integer. The way it's implemented now, a GPU will be assigned to each MPI rank. (use-gpus only has a direct effect on Summit, because Summit explicitly has the --gpu_per_rs setting. On Rhea, if you use the gpu partition, use-gpus doesn't have any effect, as there's no GPU setting flag with srun.)
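
To make the boolean-versus-integer point concrete, here is a rough Python sketch of how a per-process settings block could be turned into the jsrun flags quoted earlier in this thread. It is not the EFFIS implementation: build_jsrun_args is a hypothetical helper, and the strict boolean check is an assumption for illustration only (the actual cause of the effis-compose.py indexing error is not established here). A YAML 1.1 loader such as PyYAML reads 1 as an integer and True or On as a boolean, and in Python isinstance(1, bool) is False even though 1 == True.

```python
# Hypothetical helper for illustration; NOT taken from effis-compose.py.
def build_jsrun_args(settings):
    """Translate a coupler-style settings dict into jsrun resource-set flags."""
    use_gpus = settings.get("use-gpus", False)

    # Assumption: only a genuine boolean is accepted. An integer 1 compares
    # equal to True, but it is not an instance of bool, so it is rejected.
    if not isinstance(use_gpus, bool):
        raise TypeError(f"use-gpus must be a boolean, got {use_gpus!r}")

    args = [
        "-n", str(settings["processes"]),  # one resource set per MPI rank
        "--tasks_per_rs", "1",             # one rank in each resource set
        "--cpu_per_rs", str(settings.get("cpus-per-process", 1)),
    ]
    if use_gpus:
        args += ["--gpu_per_rs", "1"]      # one GPU for each rank
    args += ["--bind", "rs"]
    return args


coupler = {
    "processes": 4,
    "processes-per-node": 4,
    "cpus-per-process": 1,
    "use-gpus": True,
}
print(" ".join(["jsrun"] + build_jsrun_args(coupler)))
```

With use-gpus set to True, this prints the same resource-set flags as the hand-written jsrun command above; with use-gpus set to the integer 1, the sketch raises a TypeError instead of guessing the intent.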

Are the environment things important? You might need to set those. Though I don't think you need to load CUDA, as far as I'm aware.

  coupler:
    pre-submit-commands: ["mkdir out"]
    processes: 4
    processes-per-node: 4
    cpus-per-process: 1
    use-gpus: True
    executable_path: /gpfs/alpine/fus123/scratch/cwsmith/spack-install/linux-rhel7-power9le/gcc-8.1.1/coupler-develop-6eczhtc7ufb6o62onfeabrropnj4ahv6/bin/cpl
    commandline_args:
      - ${steps}
    env:
      OMP_NUM_THREADS: 1
      HDF5_USE_FILE_LOCKING: 'FALSE'

cwsmith commented 4 years ago

Adding use-gpus: True got the coupler job step past the cudaGetDeviceCount error. Thank you.

The change is here: https://github.com/SCOREC/testcases/commit/d9fd662daf312452b0157f662d86d6b3501eecd3

The <environment stuff> passed to jsrun is an LD_PRELOAD setting/hack to avoid a spack issue on Summit.