yngtodd / hyperspace

Distributed Bayesian Optimization

Capability to change each hyperspace to use < a node #23

Open jdakka opened 5 years ago

jdakka commented 5 years ago

Is there a way to change the HyperSpace models to use less than a node per process? This will be the case for XSEDE-Bridges if we decide to run simulations, as we won't have enough nodes to spawn 16 MPI processes in which each process takes a node. Also for Summit the number of cores/node will fluctuate anywhere from 42 cores to 128 cores. Ideally we would want to have each process use the same number of cores, but we should benchmark what is the sweet spot on the optimal number of cores per process.

yngtodd commented 5 years ago

Sorry it has taken me so long to get back with you! I just saw this issue.

I may have confused us when talking about HyperSpace running one optimization per node. Technically, it is one optimization per MPI rank. So depending on the system, we could set a number of ranks per compute node. Over the last week or so I have been running 256 ranks on a single DGX. Normally, the placement of MPI ranks would be handled by aprun on machines like Titan, or now by jsrun on Summit. Would the allocation of resources per MPI rank be handled by the Radical scheduler?

karahbit commented 4 years ago

Hi Todd. Following Jumana's question, I see that the minimum is 1 MPI Rank per optimization then, correct? Also, am I correct in assuming each optimization requires only 1 core? (I am interfacing through RADICAL)

yngtodd commented 4 years ago

Hey @karahbit, when asking about the minimum number of MPI ranks per optimization, do you mean the total number of ranks required by hyperspace for a given problem, or do you mean the number of ranks assigned to a given Bayesian optimization loop? Each Bayesian optimization loop gets one MPI rank, but hyperspace runs many of those in parallel, and the total number of ranks is given by 2^{D}, where D is the dimension of your search space.

In the simplest case, it is possible to run the algorithm over a single search dimension. Say this search space is the following:

x = [0, 1, 2, 3]

Hyperspace would divide that search space into two overlapping subintervals:

x_0 = [0, 1, 2]
x_1 = [1, 2, 3]

Then it will run two parallel Bayesian optimization steps, one for each subinterval of the search space. Each Bayesian optimization step gets its own MPI rank.

You are right: each Bayesian optimization step requires only one core. The optimization at each rank is handled by scikit-optimize, and it only needs a single core.
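For a concrete picture, here is a minimal sketch of that idea (an illustration using scikit-optimize directly, not the hyperspace API itself): one independent optimization loop per overlapping subinterval, where in hyperspace each loop would sit on its own MPI rank and a D-dimensional search space would produce 2^D such loops.

from skopt import gp_minimize

def objective(x):
    # toy objective over a single integer dimension
    return (x[0] - 2) ** 2

# the two overlapping subintervals from the example above
subspaces = [(0, 2), (1, 3)]

# one independent Bayesian optimization loop per subinterval;
# in hyperspace, each of these would live on its own MPI rank
results = [gp_minimize(objective, [bounds], n_calls=10, random_state=0)
           for bounds in subspaces]
print([res.x for res in results])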

karahbit commented 4 years ago

Excellent, thank you Todd. I just wanted to confirm my assumption because, as Jumana did, I want to run multiple Bayesian optimizations on a single node without requesting more cores than I actually need to.


karahbit commented 4 years ago

Hi @yngtodd, picking up from this previous conversation: I am doing hyperparameter optimization using HyperSpace! But I would like to understand its behavior a little better.

So I have 4 parameters: this would be 16 MPI ranks, requiring 16 cores. I was effectively able to solve the problem, but I would like to show how big and how fast I can go. Specifically, I am trying to show strong scaling behavior. However, when I ran HyperSpace for the same problem but using 8 MPI ranks, it was still able to finish it, taking approximately half the time. Do you mind giving me a brief explanation of what's going on behind the scenes here?

Thank you!

yngtodd commented 4 years ago

Hey @karahbit , thanks for using the library.

How many results are being saved when you have 4 parameters but run with 8 MPI ranks? I have a sneaking suspicion that you may only have 8 results. Hyperspace would then only be running the Bayesian optimization on half of the sub-spaces. I just tried that on one of the benchmarks, and that seems to be the case. If you are also seeing that behavior, then I should add a warning for this.
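To make the suspicion concrete (a hypothetical illustration of the indexing, not the actual hyperspace code): if each rank simply picks up the subspace at its own index, then 8 ranks over the 16 subspaces of a 4-parameter search leave the second half untouched.

num_subspaces = 2 ** 4                           # 4 hyperparameters -> 16 subspaces
num_ranks = 8
optimized = list(range(num_ranks))               # subspace indices that get a rank: 0..7
skipped = list(range(num_ranks, num_subspaces))  # 8..15 never get a rank
print(len(optimized), len(skipped))              # 8 8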

karahbit commented 4 years ago

Hi @yngtodd, of course, it has proven to be useful and I appreciate your work.

As you correctly say, when running 8 MPI processes for 4 hyperparameters, I see only 8 traces, or optimizations. I'm missing the other half, so the solution is incomplete and the score it gives is not very good.

What about running the same required 16 MPI processes but on fewer cores, say 8? I would be loading each core with two processes, slowing down the solution, but at least I would get an accurate one. I believe in the MPI world this is called oversubscribing. Do you know anything about this, and is it possible with Hyperspace?

yngtodd commented 4 years ago

Yeah, that would be possible. In the case that num_subspaces > num_ranks, we could place the remaining num_subspaces - num_ranks subspaces on ranks that already have a search space to work on. I don't think it would take much to make that happen.
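Something like the following assignment would be the idea (just a sketch of the logic, an assumption rather than anything hyperspace does today): wrap the leftover subspaces back onto the existing ranks round-robin, so every subspace still gets optimized.

num_subspaces = 16
num_ranks = 8
assignment = {rank: [] for rank in range(num_ranks)}
for i in range(num_subspaces):
    assignment[i % num_ranks].append(i)   # subspace i goes to rank i mod num_ranks
# rank 0 gets subspaces [0, 8], rank 1 gets [1, 9], and so on
print(assignment)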

yngtodd commented 4 years ago

In the case that you were just testing, you could use dualdrive(). That would run two subspaces on each rank, so if you want to use exactly half as many ranks as subspaces, you would be good to go. But if the number of ranks is not exactly half the number of subspaces, then you would be back to silently leaving out some of the spaces.

karahbit commented 4 years ago

I was actually taking a look into dualdrive. That seems to be a viable solution for the case of spawning 16 MPI processes (16 subspaces) across 8 cores/ranks. But what about if we have only 4 cores now, or 2? To give you some context: I ask these questions because I am interested in the strong scaling behavior of the solution. Please correct me if I'm wrong with any terminology, as I am learning all of this.

yngtodd commented 4 years ago

Yeah, in that case the dualdrive would not be the way to go. We would want to go with that new approach I started to mention.

karahbit commented 4 years ago

Ah, I see. So just playing around with the MPI launch command mpirun --hostfile hostfile -n 16 ... so that it places the 16 processes on 4 cores through a hostfile, for example, wouldn't do the trick. In other words, the oversubscribing functionality that MPI provides won't work for our purposes. We would need to modify the approach taken by Hyperspace itself. Is this correct, or am I missing something?

yngtodd commented 4 years ago

Yeah, it would require some changes in Hyperspace. Originally, the subspaces were scattered out to the various ranks from rank 0. Now, each rank sees all of the subspaces and indexes into them by rank here. This is fine if you know that you want exactly one subspace per rank and the number of ranks equals the number of subspaces. But when you want more than one subspace per rank, this would need to change.
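Roughly, the change could look like this with mpi4py (the variable names are illustrative, not hyperspace's actual code): instead of each rank taking only the subspace at its own index, it would take every size-th subspace, so nothing is dropped when there are fewer ranks than subspaces.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

subspaces = list(range(16))   # stand-in for the 2**D subspaces

# current scheme: exactly one subspace per rank, assuming size == len(subspaces)
# mine = [subspaces[rank]]

# generalized scheme: each rank takes every size-th subspace, so all
# subspaces are covered even when size < len(subspaces)
mine = subspaces[rank::size]
print(rank, mine)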

karahbit commented 4 years ago

Thank you for your input on this matter, it was really helpful!