xenon-middleware / xenon

A middleware abstraction library that provides a simple programming interface to various compute and storage resources.
http://xenon-middleware.github.io/xenon/
Apache License 2.0

Cleanup of node, process and thread count. #625

Closed jmaassen closed 5 years ago

jmaassen commented 6 years ago

There is some recurring confusion about the semantics of the node, process and thread counts in the JobDescription. See for example https://github.com/NLeSC/xenon-cli/issues/63 , https://github.com/NLeSC/xenon-cli/issues/57 and https://github.com/NLeSC/xenon/issues/206

Currently we have:

private int nodeCount = 1;
private int processesPerNode = 1;
private int threadsPerProcess = -1;
private boolean startSingleProcess = false;

This filters through to xenon-cli, which has command-line options to set these values.

After some discussion we came to the following command line options for the cli:

--nodes X  (default 1)
--cores-per-node Y (default 1)

and, for starting the processes, one of the following options:

--start-per-job (default)
--start-per-node
--start-per-core

All options are optional. If no values are set, the defaults are used.

This approach is slightly less flexible than the previous one, as it is not possible to directly express starting a job on 4 nodes with 4 processes per node and 4 threads per process (for running a mixed MPI/OpenMP job, for example). However, just requesting 4 nodes with 16 cores each will probably give you the same result.

For the JobDescription this would mean renaming processesPerNode to coresPerNode, removing threadsPerProcess, and turning startSingleProcess into some enum.
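A minimal sketch of what that could look like (the enum name and its values are assumptions; the proposal only says startSingleProcess becomes "some enum"):

```java
// Sketch only: a stand-in for the proposed JobDescription change,
// not the actual Xenon class. StartMode mirrors the three CLI options.
public class ProposedJobDescription {

    enum StartMode { PER_JOB, PER_NODE, PER_CORE }

    private int nodeCount = 1;                       // --nodes
    private int coresPerNode = 1;                    // --cores-per-node (was processesPerNode)
    private StartMode startMode = StartMode.PER_JOB; // replaces startSingleProcess

    public static void main(String[] args) {
        ProposedJobDescription d = new ProposedJobDescription();
        // All options are optional, so a fresh description shows the defaults.
        System.out.println(d.nodeCount + " " + d.coresPerNode + " " + d.startMode);
    }
}
```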

Any comments?

sverhoeven commented 6 years ago

The SchedulerAdaptorDescription should include methods to determine which counts can be used

sverhoeven commented 6 years ago

We could add a boolean SchedulerAdaptorDescription.supportsMultiNode()

This would flag the local and ssh adaptors as unable to run jobs across multiple machines, i.e. nodeCount > 1.

We could also flag GridEngine as not supporting multi-node, making the whole parallel environment mapping much easier. @arnikz do you need GridEngine multi-node support?

jmaassen commented 6 years ago

Makes sense for local, ssh and at, but the GridEngine case is a bit shaky, as it does support multi-node runs; it just doesn't allow you to ask for nodes ....

jmaassen commented 6 years ago

It does seem that SGE supports -l excl=true to allow you to reserve an entire node for yourself.

Not sure if this completely solves the issue though. It would allow correct behavior when you specify "--nodes 10", but I'm not sure what would happen if you said "--nodes 10 --cores-per-node 16" on a cluster with 4 cores per machine....

jmaassen commented 6 years ago

There is some info here that discusses getting info on the nodes using qhost:

https://serverfault.com/questions/266848/how-to-reserve-complete-nodes-on-sun-grid-engine
https://stackoverflow.com/questions/33372283/requesting-integer-multiple-of-m-cores-per-node-on-sge

jmaassen commented 6 years ago

After giving this some further thought, another option would be to go for a concept based on tasks instead, somewhat similar to what SLURM is doing.

The idea is that each task basically represents an "executable" being started somewhere (this can also be a script, of course). A task may need one or more cores, and you may wish to start several of these tasks instead of just one. This is straightforward to specify using:

--tasks T (default=1)
--coresPerTask C (default=1)

When you want more than one task (T > 1), these need to be distributed over one or more nodes. You could either fill up each node with as many tasks as will fit (taking coresPerTask into account, as well as other constraints such as memory), or choose to assign fewer tasks per node (thereby needing more nodes). This can simply be specified using:

--tasksPerNode N (default is unset; let scheduler decide). 

With this approach, running sequential (single task, single core) and multi-threaded (single task, multiple cores) jobs is still simple. It also allows schedulers to decide the best task-to-node assignment (by simply not specifying tasksPerNode), which is useful for SLURM and SGE in some cases. If needed, the values of T, C and N can be used to compute the node count for SLURM and TORQUE, and the slot count for SGE.
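As a sketch of that computation (helper names are mine, not part of the Xenon API): with tasksPerNode set, the node count follows by ceiling division, and the SGE slot count is one slot per core in use:

```java
// Hypothetical helpers mapping tasks (T), coresPerTask (C) and
// tasksPerNode (N) onto scheduler-specific quantities.
public class TaskMapping {

    // SLURM/TORQUE style: nodes needed when tasksPerNode is explicitly set.
    static int nodeCount(int tasks, int tasksPerNode) {
        return (tasks + tasksPerNode - 1) / tasksPerNode; // ceiling division
    }

    // SGE style: total slot count, one slot per core.
    static int slotCount(int tasks, int coresPerTask) {
        return tasks * coresPerTask;
    }

    public static void main(String[] args) {
        // Example: 10 tasks of 4 cores each, at most 2 tasks per node.
        System.out.println(nodeCount(10, 2)); // 5 nodes
        System.out.println(slotCount(10, 4)); // 40 slots
    }
}
```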

When the job is started, you can either start the executable once per job or once per task. The former seems to be the default on all schedulers:

--start-per-job (default)
--start-per-task

To start once per task, the adaptors can use the nodefile (TORQUE and SGE) or srun (SLURM). This approach also makes it easy to start MPI jobs, simply by using mpirun. I think this approach is easy to understand and the most flexible. I'll try to implement it and see if I run into any issues.

In the JobDescription this would translate to:

private int tasks = 1;
private int coresPerTask = 1;
private int tasksPerNode = -1;         // -1 = unset; let the scheduler decide
private boolean startPerTask = false;  // false = start once per job

The rest follows from there....

jmaassen commented 6 years ago

Implemented in a27b63f which passes all unit and integration tests.

Not entirely sure about the mapping in SGE yet. Need a multi-node, multi-core cluster setup to test this.