Reference scripts to run QUDA test utilities on Summit. Not guaranteed to be optimal.
To run these scripts as-is, clone this repository, make a symbolic link (`ln -s`) to `./staggered_invert_test` in this directory, build a submit script using `./build-submit-script.sh` (run it without any arguments to get an error message telling you what to do), and submit the built script using `bsub`.
These scripts aren't set up to be run in any particular directory structure; they are a base that should be customized for each user's specific needs. Currently, the scripts assume that a job will be submitted from the same directory the scripts live in (but again, it should be clear and easy to see how to change that). The scripts are also currently hard-coded to assume the QUDA test executable `staggered_invert_test` lives in the same directory, either copied there or symbolically linked (i.e., you called `ln -s [QUDA test directory]/staggered_invert_test` in this directory). I'll document how to modify this below.
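As a concrete illustration, here is a minimal end-to-end sketch of that workflow, using the 8-node example described further below (the QUDA path is a placeholder for your own build location):

```bash
# Illustrative workflow only; substitute your own QUDA build path.
ln -s [QUDA test directory]/staggered_invert_test .   # link the test executable here
./build-submit-script.sh 8 48 48 48 96 1 2 4 6        # 48^3 x 96 volume, 8 nodes, 1x2x4x6 topology
bsub submit-script-48x48x48x96-n8-1x2x4x6.lsf          # submit the generated script
```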
These scripts have only been tested with the `feature/p2p-zero-copy` branch of QUDA (though I don't see any reason why they wouldn't work with any modern branch), built with the QMP interface. These scripts assume `t` is the fast direction, i.e., it's preferentially split within a node, which can be verified with the output from the `feature/p2p-zero-copy` branch. I don't see why using a different (modern) branch, or using MPI instead of QMP when building QUDA, should make a difference. Please tell me if you find any issues.
I'm not sure if it makes a difference, but to be complete, my `~/.profile` file contains (and contained when I built QMP, QIO, and QUDA):

```
module load cmake/3.9.2
module load git/2.13.0
module load makedepend/1.0.5
module load screen/4.3.1
module load cuda/9.2.64
```
The description of the files is as follows:
`bind-4gpu.sh`: The `numactl` binding script used when you only use 4 GPUs per node. This only gets used if there isn't a factor of 3 in the `T` or `Z` direction. (This choice is based on the assumption that the global dimensions in the `X`, `Y`, and `Z` directions are all equal; I'm sure this will break with some topologies.) This may not be ideal: I'm looking for feedback on this.

`bind-6gpu.sh`: The `numactl` binding script used when you use the full 6 GPUs per node. This gets used if my scripts can detect a factor of 3 in the `T` or `Z` direction. This may not be ideal: I'm looking for feedback on this.
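For intuition, here is a guess at what that selection amounts to. Whether the scripts actually test the grid (rank) dimensions or the global dimensions is my assumption, and the variable `BINDSCRIPT` is purely illustrative, so treat this as a sketch only:

```bash
# Sketch only: pick the binding script based on a factor of 3 in the T or Z
# split of the process grid (assumed; the real scripts may check differently).
if (( GRIDT % 3 == 0 || GRIDZ % 3 == 0 )); then
  BINDSCRIPT=./bind-6gpu.sh   # 6 ranks / 6 GPUs per node
else
  BINDSCRIPT=./bind-4gpu.sh   # 4 ranks / 4 GPUs per node
fi
```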
`build-submit-script.sh`: The script that writes submit scripts by modifying `submit-script-base.lsf`, which is described below. The script also creates a directory where a QUDA tunecache gets saved (if it does not already exist). The syntax is `./build-submit-script.sh [nnodes] [global x] ... [global t] [grid x] ... [grid t]`, where `global` refers to the global volume, and `grid` refers to the breakdown of the topology, i.e., a topology of `1 1 1 6` refers to not breaking up the x, y, and z directions, and splitting the t direction 6 ways. As an example, a `48^3 x 96` volume being run on 8 nodes (48 GPUs) with a topology `1 x 2 x 4 x 6` will generate the identifier string `48x48x48x96-n8-1x2x4x6`. The script performs `sed` global find-replaces on the base file `submit-script-base.lsf`, saving to a useable submit script (given the example above) of `submit-script-48x48x48x96-n8-1x2x4x6.lsf`. There are 10 replacements:

* `:NNODES:` - the number of nodes.
* `:NGPUS:` - the number of GPUs.
* `:GRIDX:`, `:GRIDY:`, ... - the topology.
* `:DIMX:`, `:DIMY:`, ... - the global volume.

For the example above, the tunecache directory created would be `tunecache-48x48x48x96-n8-1x2x4x6`.
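As a rough illustration of what those replacements amount to (not necessarily how the script is actually written; the variable names and the `NGPUS` formula here are my assumptions):

```bash
# Sketch only: build the identifier string and apply the 10 find-replaces.
IDSTRING="${DIMX}x${DIMY}x${DIMZ}x${DIMT}-n${NNODES}-${GRIDX}x${GRIDY}x${GRIDZ}x${GRIDT}"
NGPUS=$(( GRIDX * GRIDY * GRIDZ * GRIDT ))   # assumed: one GPU per MPI rank
mkdir -p "tunecache-${IDSTRING}"
sed -e "s/:NNODES:/${NNODES}/g" -e "s/:NGPUS:/${NGPUS}/g" \
    -e "s/:GRIDX:/${GRIDX}/g"   -e "s/:GRIDY:/${GRIDY}/g" \
    -e "s/:GRIDZ:/${GRIDZ}/g"   -e "s/:GRIDT:/${GRIDT}/g" \
    -e "s/:DIMX:/${DIMX}/g"     -e "s/:DIMY:/${DIMY}/g"   \
    -e "s/:DIMZ:/${DIMZ}/g"     -e "s/:DIMT:/${DIMT}/g"   \
    submit-script-base.lsf > "submit-script-${IDSTRING}.lsf"
```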
`submit-script-base.lsf`: The base submit script that gets modified by `build-submit-script.sh`. This script will not run as-is. The relevant lines each person could want to modify are noted below; a condensed sketch of what these lines amount to follows the list.

* The job and output names, which are built from the identifier string set by `build-submit-script.sh`. The output name as written breaks the convention of using `-` within the identifier string, using `.` instead, because the job submit parser complains about the "`-n`" that lives within the identifier string. I should fix it so a consistent character is used everywhere (maybe `_`?), I just haven't gotten around to it.
* `bash` variables based on what's set by `build-submit-script.sh`. As written, the topology and the global volume are passed in (living in the variables `GRIDX`, ..., `GRIDT` and `DIMX`, ..., `DIMT`), and the per-GPU volume is reconstructed (`LOCALDIMX`, ..., `LOCALDIMT`). The consistency checks in `build-submit-script.sh` guarantee the integer division will be safe.
* The definition of the application command, which calls `staggered_invert_test`. This assumes the executable, or a symbolic link to it, lives in the same directory. The topology and per-GPU volume are passed in (via the `--gridsize` and `--dim` flags of the QUDA test executable), as well as any additional flags. As written, 12-parameter link reconstruction is used (`--recon 12 --recon-sloppy 12`), and the inversion is run to a maximum of 10000 iterations or a tolerance of 1e-5 (`--niter 10000 --tol 1e-5`). The last flag (`--pipeline 1`) enables a version of the CG algorithm which fuses the two reductions into one, improving strong scaling.
* The export of `$APP`. The `numactl` scripts `bind-4gpu.sh` and `bind-6gpu.sh` assume this variable is set.
* Optionally, setting `export APP=./jsrun_layout`, the utility given here, to understand how the `jsrun` command described below works.
* The export of `QUDA_RESOURCE_PATH`, consistent with the tunecache directory created in `build-submit-script.sh`.
* Building the `jsrun` command, using the 4 or 6 GPU binding script as appropriate. The way I build the command is "consistent" with the binding scripts, in so far as they work. This may not be ideal: I'm looking for feedback on this. Description of the command:
  * `--nrs ${NNODES}`: Request `${NNODES}` "resource sets", in the parlance of the resource manager on Summit. The convention of the number of resource sets equalling the number of nodes is consistent with how I defined the subsequent flags.
  * `-a6 -g6 -c42 -dpacked -b packed:7` (6 GPU case): request 6 MPI ranks per resource set/node, 6 GPUs (such that each rank can see all 6 GPUs; QUDA handles assigning them), and 42 cores (the bind script binds CPUs to GPUs appropriately, so far as I can tell). So far as I understand, `-dpacked` and `-b packed:7` specify how ranks are ordered among multiple nodes and assign 7 hardware cores to each rank (if I remember correctly). The explicit `export OMP_NUM_THREADS=7` may not be needed in all cases, for example with QDPJIT, where there should only be one launching thread. This probably isn't ideal, and I'd love corrections or a better explanation.
  * `-a4 -g6 -c40 -dpacked -b packed:10` (4 GPU case): request 4 MPI ranks per resource set/node, 6 GPUs (this, combined with the line `export CUDA_VISIBLE_DEVICES=0,1,3,4`, ensures that pairs of GPUs connected by NVLink are used, as opposed to an asymmetric setup of 3 GPUs connected by NVLink and another on its own within a node), and 40 cores, where `-dpacked` and `-b packed:10` specify the ordering of ranks and assign 10 hardware cores to each rank. See the above comment about the `export OMP_NUM_THREADS` line. This probably isn't ideal, and I'd love corrections or a better explanation.
  * `--latency_priority gpu-cpu`: This probably gets ignored because bindings are specified via `numactl`, but in principle, when you trust `jsrun` to assign a layout for you, it preferentially assigns one that minimizes GPU-to-CPU latencies as opposed to CPU-to-CPU latencies (`cpu-cpu`).
  * `./bind-4gpu.sh` or `./bind-6gpu.sh`: the appropriate `numactl` binding script.
* Echoing the generated `jsrun` command.
* Running the `jsrun` command. You can comment this out for testing: I believe you can run `./build-submit-script.sh` to build a submit script, then run the built script without `bsub` to investigate the generated `jsrun` command (see the short example after this list).
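To tie the pieces above together, here is a condensed, hand-written sketch of roughly what the relevant portion of a generated submit script looks like after the replacements, for the 6 GPU case. This is illustrative only: the exact variable names, line order, and `jsrun` invocation in `submit-script-base.lsf` may differ.

```bash
#!/bin/bash
# Illustrative sketch only -- not a copy of submit-script-base.lsf.
# Values shown are those the :NNODES:/:GRID*:/:DIM*: replacements would fill in
# for the 48^3 x 96, 8-node, 1x2x4x6 example.
NNODES=8
GRIDX=1; GRIDY=2; GRIDZ=4; GRIDT=6
DIMX=48; DIMY=48; DIMZ=48; DIMT=96

# Per-GPU volume, reconstructed from the global volume and the topology.
LOCALDIMX=$(( DIMX / GRIDX ))
LOCALDIMY=$(( DIMY / GRIDY ))
LOCALDIMZ=$(( DIMZ / GRIDZ ))
LOCALDIMT=$(( DIMT / GRIDT ))

# The application command: topology and per-GPU volume, plus the other flags
# described above. (Swap in ./jsrun_layout here to inspect the layout instead.)
export APP="./staggered_invert_test \
  --gridsize ${GRIDX} ${GRIDY} ${GRIDZ} ${GRIDT} \
  --dim ${LOCALDIMX} ${LOCALDIMY} ${LOCALDIMZ} ${LOCALDIMT} \
  --recon 12 --recon-sloppy 12 --niter 10000 --tol 1e-5 --pipeline 1"

# Tunecache directory created by build-submit-script.sh.
export QUDA_RESOURCE_PATH=./tunecache-48x48x48x96-n8-1x2x4x6
export OMP_NUM_THREADS=7

# The jsrun command for the 6 GPU case (the 4 GPU case would use
# -a4 -c40 -b packed:10, export CUDA_VISIBLE_DEVICES=0,1,3,4, and bind-4gpu.sh).
CMD="jsrun --nrs ${NNODES} -a6 -g6 -c42 -dpacked -b packed:7 \
  --latency_priority gpu-cpu ./bind-6gpu.sh"
echo "${CMD}"
${CMD}
```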
After generating a submit script, you can submit it as-is using `bsub [submit script]` without any further flags (unless you want to, of course).
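Relatedly, if you just want to inspect the generated `jsrun` command without submitting (as mentioned in the list above), something like the following should work, though I haven't pinned down the exact steps:

```bash
# Sketch: build a submit script, then (after commenting out the line that
# actually runs jsrun) execute it directly to see the echoed jsrun command.
./build-submit-script.sh 16 48 48 48 96 1 2 4 12
bash submit-script-48x48x48x96-n16-1x2x4x12.lsf
```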
A few example commands:
* `48^3 x 96` volume, topology `1x2x4x12`, so 16 nodes: `./build-submit-script.sh 16 48 48 48 96 1 2 4 12` creates the submit script `submit-script-48x48x48x96-n16-1x2x4x12.lsf` and the tunecache directory `tunecache-48x48x48x96-n16-1x2x4x12`, using the 6 GPU binding script because there's a factor of 3.
* `64^3 x 128` volume, topology `1x4x4x16`, so 64 nodes: `./build-submit-script.sh 64 64 64 64 128 1 4 4 16` creates the submit script `submit-script-64x64x64x128-n64-1x4x4x16.lsf` and the tunecache directory `tunecache-64x64x64x128-n64-1x4x4x16`, using the 4 GPU binding script because there are no factors of 3.
* `96^3 x 192` volume, topology `4x8x8x48`, so 2048 (!) nodes: `./build-submit-script.sh 2048 96 96 96 192 4 8 8 48` creates the submit script `submit-script-96x96x96x192-n2048-4x8x8x48.lsf` and the tunecache directory `tunecache-96x96x96x192-n2048-4x8x8x48`, using the 6 GPU binding script because there is a factor of 3.

I hope this is a sufficient description of the scripts and how to use them. If there's anything unclear, please send me a message at evansweinberg [at] gmail.com. Alternatively, I'm happy to give anyone access to the repo to make edits or submit a pull request; similarly, send me a message.
Most importantly: If there's anything that's sub-optimal, misguided, or wrong, PLEASE let me know!
Cheers!