The SchedulerAdaptorDescription should include methods to determine which counts can be used.
We could add a boolean SchedulerAdaptorDescription.supportsMultiNode() method.
This would flag the local and ssh adaptors as unable to run jobs across multiple machines, i.e. nodeCount > 1.
We could also flag GridEngine as not supporting multi-node, which would make the whole parallel environment mapping much easier. @arnikz do you need GridEngine multi-node support?
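For illustration, a caller could then guard submission roughly like this (just a sketch: supportsMultiNode() is the method proposed above, and the surrounding calls are only assumed to roughly match the current API):

// Sketch only: supportsMultiNode() is the proposed method, not existing API.
void submitChecked(Scheduler scheduler, JobDescription description) throws XenonException {
    SchedulerAdaptorDescription desc = Scheduler.getAdaptorDescription(scheduler.getAdaptorName());
    if (description.getNodeCount() > 1 && !desc.supportsMultiNode()) {
        throw new InvalidJobDescriptionException(scheduler.getAdaptorName(),
                "this adaptor cannot run jobs across multiple machines (nodeCount > 1)");
    }
    scheduler.submitBatchJob(description);
}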
Makes sense for local, ssh and at, but the GridEngine case is a bit shaky, as it does support multi-node runs; it just doesn't allow you to ask for nodes ....
It does seem that SGE supports -l excl=true to allow you to reserve an entire node for yourself.
Not sure if this completely solves the issue though. It would allow correct behavior when you specify "-nodes 10", but I'm not sure what would happen if you said "-nodes 10 -core-per-node 16" on a cluster with 4 cores per machine....
There is some info in the links below that discusses getting information about the nodes using qhost:
https://serverfault.com/questions/266848/how-to-reserve-complete-nodes-on-sun-grid-engine https://stackoverflow.com/questions/33372283/requesting-integer-multiple-of-m-cores-per-node-on-sge
After giving this some further thought, another option would be to go for a concept based on tasks instead, somewhat similar to what SLURM is doing.
The idea is that each task basically represents an "executable" being started somewhere (this can also be a script, of course). Each task may need 1 or more cores. In addition, you may wish to start several of these tasks instead of just one. This is straightforward to specify using:
--tasks T (default=1)
--coresPerTask C (default=1)
When you want more than one task (T > 1), these need to be distributed over one or more nodes. You could either fill up each node with as many tasks as will fit (taking coresPerTask into account, as well as other constraints such as memory), or choose to assign fewer tasks per node (and thereby need more nodes). This can simply be specified using:
--tasksPerNode N (default is unset; let scheduler decide).
With this approach, running sequential (single task, single core) and multi-threaded (single task, multiple cores) jobs is still simple. In addition, it allows schedulers to decide what the best task-to-node assignment is (by simply not specifying tasksPerNode), which is useful for SLURM and SGE in some cases. If needed, the values of T, C and N can be used to compute the node count for SLURM and TORQUE, and the slot count for SGE.
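Roughly, that computation could look like this (just a sketch to illustrate the mapping; the helper names are made up):

// Sketch of the count mapping; tasks = T, coresPerTask = C, tasksPerNode = N.
// N <= 0 means "let the scheduler decide", in which case we pack as many tasks as fit.
static int nodeCount(int tasks, int coresPerTask, int tasksPerNode, int coresPerNode) {
    int perNode = (tasksPerNode > 0) ? tasksPerNode : Math.max(1, coresPerNode / coresPerTask);
    return (tasks + perNode - 1) / perNode;   // ceil(T / tasks-per-node), for SLURM and TORQUE
}

static int slotCount(int tasks, int coresPerTask) {
    return tasks * coresPerTask;              // total cores requested, for SGE slots
}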
When the job is started, you can either start the executable once per job, or once for each task. The first seems to be the default on all schedulers:
--start-per-job (default)
--start-per-task
To start once per task, the adaptors can use the nodefile (TORQUE and SGE) or srun (SLURM). This approach will also make it easy to start MPI jobs, by simply using mpirun. I think this approach is easy to understand and the most flexible. I'll try to implement it to see if I run into any issues.
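As a rough sketch, an adaptor could distinguish the two start modes along these lines (the getTasks/getCoresPerTask/isStartPerTask accessors are the proposed ones; real adaptors would generate scheduler-specific job scripts):

// Sketch only: builds the command placed in the generated job script.
String launchCommand(JobDescription d) {
    String exe = d.getExecutable() + " " + String.join(" ", d.getArguments());
    if (!d.isStartPerTask()) {
        return exe;   // --start-per-job: run the executable once
    }
    // --start-per-task on SLURM: let srun fan the command out over the tasks.
    // On TORQUE and SGE the adaptor would iterate over the nodefile instead.
    return "srun --ntasks=" + d.getTasks() + " --cpus-per-task=" + d.getCoresPerTask() + " " + exe;
}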
In the JobDescription this would translate to:
private int tasks = 1;                 // T: number of tasks to start
private int coresPerTask = 1;          // C: cores needed by each task
private int tasksPerNode = -1;         // N: tasks per node, -1 = let the scheduler decide
private boolean startPerTask = false;  // start the executable once per task instead of once per job
The rest follows from there....
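For example (hypothetical usage, assuming setters matching the fields above are added):

// Hypothetical usage of the proposed fields; setter names are assumed.
JobDescription description = new JobDescription();
description.setExecutable("my_app");
description.setTasks(16);           // T: start 16 tasks
description.setCoresPerTask(4);     // C: each task needs 4 cores
description.setTasksPerNode(4);     // N: 4 tasks per node, so 4 nodes in total
description.setStartPerTask(true);  // start the executable once per task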
Implemented in a27b63f, which passes all unit and integration tests.
Not entirely sure about the mapping in SGE yet. Need a multi-node, multi-core cluster setup to test this.
There is some recurring confusion about the semantics of the node, process and thread counts in the JobDescription. See for example https://github.com/NLeSC/xenon-cli/issues/63, https://github.com/NLeSC/xenon-cli/issues/57 and https://github.com/NLeSC/xenon/issues/206. Currently we have nodeCount, processesPerNode, threadsPerProcess and startSingleProcess.
This filters through to xenon-cli, which has command line options to set these values.
After some discussion we came to the following command line options for the cli:
--nodes N (default=1)
--cores-per-node C (default=1)
and for starting the processes one of the following options:
-start-per-node
-start-per-core
(if neither is given, the executable is started only once, on the first node)
All options are optional. If no values are set, the default is used. This leads to the following behavior:
--cores-per-node 2 : you will get 1 node, 2 cores, 1 executable started
--cores-per-node 2 -start-per-core : you will get 1 node, 2 cores, 2 executables started
--nodes 2 : you will get 2 nodes, 1 core each, 1 executable started on the first node
--nodes 2 --cores-per-node 2 : you will get 2 nodes, 2 cores each, 1 executable started on the first node
--nodes 2 --cores-per-node 2 -start-per-node : you will get 2 nodes, 2 cores each, 1 executable started on each node (2 in total)
--nodes 2 --cores-per-node 2 -start-per-core : you will get 2 nodes, 2 cores each, 1 executable started on each core (4 in total)

This approach is slightly less flexible than the previous one, as it is not possible to directly express starting a job on 4 nodes with 4 processes per node and 4 threads per process (for running a mixed MPI/OpenMP job, for example). However, just starting 4 nodes with 16 cores each will probably give you the same result.
For the JobDescription this would result in processesPerNode being renamed into coresPerNode, threadsPerProcess disappearing, and startSingleProcess turning into some enum. Any comments?
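A sketch of what this could look like in the JobDescription (the enum name and constants are only a suggestion, following the cli options above):

// Sketch of the proposed change; names follow the discussion above.
public enum StartMode { PER_JOB, PER_NODE, PER_CORE }

private int nodeCount = 1;                        // --nodes
private int coresPerNode = 1;                     // --cores-per-node (was processesPerNode)
private StartMode startMode = StartMode.PER_JOB;  // replaces startSingleProcess; threadsPerProcess is gone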