Closed wlandau closed 1 year ago
This seems sensible to enable.
As I have just implemented the use of L'Ecuyer-CMRG streams in mirai, I have not yet had the time to think about the best way to integrate this with crew. There may be a clever way that it can use the .Random.seed directly from mirai, although that probably also necessitates further improvements - I'll try to summarise below.
The motivation is to ensure good statistical properties for computations which are split into parallel processes and then brought back together - the classic parLapply type functional programming. From what I understand, even using 'random' random seeds does not guarantee non-synchronicity across multiple child processes. I guess this becomes more relevant the more long-running the computations are. The solution was devised by Luke Tierney and BD Ripley himself it seems (from the source attribution) - using L'Ecuyer-CMRG streams which are generated iteratively, with each one used to generate the next.
The implementation in the base package parallel actually sends the seeds once the cluster is set up, as in the function parallel::clusterSetRNGStream, whereas I have made it part of the command line argument to daemon() to ensure that it is always set. This means that non-programmeRs at least get a good default out of the box.
At the moment, the implementation in mirai does, I believe, at least what parallel does. It is also an important improvement on the previous situation, where the random seed was reset (randomly) after each evaluation - it is now persisted. Statistical 'safety' is ensured as each process uses a different stream. Reproducibility, however, is not guaranteed when dispatcher (or any load-balancing algorithm) is used, as which tasks are sent to which workers is then not deterministic.
Non-dispatcher mirai, however, does now generally allow reproducibility due to the round-robin behaviour of NNG, so we know that tasks are allocated to workers sequentially. As I alluded to above, a better, more generalised solution is likely possible but at the cost of more complexity. This also means that crew may not benefit directly until such a solution is found, as I believe targets does require a high level of reproducibility.
However, the changes should also have no downsides vs before (just to note that the RNGkind in the worker processes is now L'Ecuyer-CMRG by default, although you may also choose to override this at the crew level for consistency with previous behaviour if that is important).
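To make the reproducibility point concrete, here is a minimal in-process sketch. It launches no real workers: the two "workers" and the simulate() helper are hypothetical, and the only assumption is that each worker persists its own stream between tasks, as described above.

```r
# Hypothetical in-process simulation; no mirai daemons are involved.
RNGkind("L'Ecuyer-CMRG")
set.seed(1L)
stream_a <- .Random.seed                       # stream for simulated worker "a"
stream_b <- parallel::nextRNGStream(stream_a)  # separate stream for worker "b"

# `assignment` names the worker that receives each task, in order.
simulate <- function(assignment) {
  state <- list(a = stream_a, b = stream_b)
  vapply(assignment, function(worker) {
    assign(".Random.seed", state[[worker]], envir = globalenv())
    draw <- runif(1L)
    # Persist the advanced stream so the worker's next task continues it.
    state[[worker]] <<- get(".Random.seed", envir = globalenv())
    draw
  }, numeric(1L), USE.NAMES = FALSE)
}

round_robin_1 <- simulate(c("a", "b", "a", "b"))
round_robin_2 <- simulate(c("a", "b", "a", "b"))
load_balanced <- simulate(c("a", "a", "b", "a"))

identical(round_robin_1, round_robin_2)  # same allocation, same draws
identical(round_robin_1, load_balanced)  # allocation changed, so draws change
```

The streams themselves are the same in every run; only the mapping of tasks to workers differs, which is why a deterministic allocation is enough to recover reproducibility.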
Is the main issue statistical reproducibility, or is it how "random" the draws look?
For reproducibility, targets assigns each target a unique deterministic seed of digest::digest2int(as.character(TARGET_NAME), seed = GLOBAL_SEED), where GLOBAL_SEED is configurable and has a fixed default of 0L. I was thinking crew users could emulate this on a task-by-task basis if needed.
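As a rough illustration of emulating this per task, something like the following could work. It assumes the digest package is installed; task_seed() and run_seeded() are hypothetical helper names, not part of targets or crew.

```r
# Hypothetical helper mirroring the formula quoted above (needs digest).
task_seed <- function(name, global_seed = 0L) {
  digest::digest2int(as.character(name), seed = global_seed)
}

# Hypothetical wrapper: run a task under its own name-derived seed.
run_seeded <- function(name, fun) {
  set.seed(task_seed(name))
  fun()
}

first <- run_seeded("model_fit", function() rnorm(3L))
again <- run_seeded("model_fit", function() rnorm(3L))
identical(first, again)  # the same task name always yields the same draws
```

This covers reproducibility only. It says nothing about how the draws of two differently named tasks relate to each other.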
The latter issue seems trickier. If each task sets its own unique seed deterministically, then that would ignore the part of the RNG algorithm that transitions from state to state, and so the draws might not emulate randomness exactly as advertised.
Or, I might actually be forgetting how seeds work.
For reproducibility, targets assigns each target a unique deterministic seed of digest::digest2int(as.character(TARGET_NAME), seed = GLOBAL_SEED), where GLOBAL_SEED is configurable and has a fixed default of 0L. I was thinking crew users could emulate this on a task-by-task basis if needed.
If this covers reproducibility, would I really need L'Ecuyer-CMRG?
https://stackoverflow.com/a/13807851 seems relevant. The statistical guarantees are supposed to be:
The current approach of targets guarantees (1), but in hindsight, I am not so sure about (2).
The latter issue seems trickier. If each task sets its own unique seed deterministically, then that would ignore the part of the RNG algorithm that transitions from state to state, and so the draws might not emulate randomness exactly as advertised.
Or, I might actually be forgetting how seeds work.
Exactly this. If it were as simple as setting the seed then BD Ripley would probably not have had to invent such an elaborate solution. The use of L'Ecuyer or not has no bearing on reproducibility.
I don't have a definitive answer as of now as to how much an issue setting the seed deterministically actually might be - especially as each 'task' where this is done could be atomic on one hand or involve a very long sequence of statistical draws on the other.
If helpful, I could export the function nextstream() for you to access (and advance) the stream currently stored on host, as an alternative approach, although this is currently not reproducible as I mentioned previously.
The topic probably merits a deeper dive at some point. But at least we are incrementally making improvements!
From what I understand, even using 'random' random seeds does not guarantee non-synchronicity across multiple child processes. I guess this becomes more relevant the more long-running the computations are.
So if the computation runs long enough on an existing set of parallel processes, then the PRNG state in one process could potentially overlap the PRNG state in a different parallel process? Because it's not just one long sequence which e.g. Mersenne Twister alone could mitigate?
The solution was devised by Luke Tierney and BD Ripley himself it seems (from the source attribution) - using L'Ecuyer-CMRG streams which are generated iteratively, with each one used to generate the next.
The use of L'Ecuyer or not has no bearing on reproducibility.
Yeah, so I guess RNGkind()[1L] could be the default. Changed in b0066d27b2db087257bf7251d0509246d3b1f42d.
From what I understand, even using 'random' random seeds does not guarantee non-synchronicity across multiple child processes. I guess this becomes more relevant the more long-running the computations are.
So if the computation runs long enough on an existing set of parallel processes, then the PRNG state in one process could potentially overlap the PRNG state in a different parallel process? Because it's not just one long sequence which e.g. Mersenne Twister alone could mitigate?
Just because Mersenne-Twister has a long period does not guarantee that two different processes will not start at similar points and hence overlap, I guess.
The L'Ecuyer-CMRG streams (at least as implemented in base R) solve this problem: the streams are created beforehand and their random seeds are passed to the child processes. Each of these streams is then guaranteed to be independent of the others. This is what is now implemented in mirai.
Thanks, that helps.
I see mirai uses nextRNGStream(), and the documentation is clear.
So this is my understanding of how to create independent RNG streams. First create an initial stream, which is just a vector of 7 integers.
RNGkind("L'Ecuyer-CMRG")
set.seed(0L) # global seed doesn't matter except for reproducibility
streams <- list()
streams[[1L]] <- .Random.seed # [[ ]] stores the whole length-7 vector as one element
Then each subsequent stream is created recursively from the previous one.
streams[[2L]] <- parallel::nextRNGStream(streams[[1L]])
streams[[3L]] <- parallel::nextRNGStream(streams[[2L]])
...
What's more, each nextRNGStream(streams[[i]]) is deterministic.
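The recursion above generalises to any number of streams with a loop. A minimal sketch, where make_streams() is a hypothetical helper name and the global seed of 0L is only there for reproducibility:

```r
# Hypothetical helper: build n L'Ecuyer-CMRG streams from one global seed.
# Note that it switches the RNG kind of the calling session as a side effect.
make_streams <- function(n, seed = 0L) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  streams <- vector("list", n)
  streams[[1L]] <- .Random.seed
  for (i in seq_len(n - 1L)) {
    # Each stream is derived from the one before it.
    streams[[i + 1L]] <- parallel::nextRNGStream(streams[[i]])
  }
  streams
}

streams <- make_streams(4L)
lengths(streams)                      # every stream is a length-7 integer vector
identical(streams, make_streams(4L))  # same global seed, same streams
anyDuplicated(streams) == 0L          # no two streams are the same
```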
If mirai already does all this, I wonder if crew should step aside by default and avoid setting seeds altogether. Does that sound reasonable? Users in crew could still set seeds and algorithms if they really care, but this would not be the default.
Yes that's right. The only addition in my create_stream() is that the .Random.seed in the host process is restored, analogous to parallel::clusterSetRNGStream().
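For illustration, restoring the host state could look roughly like this. It is a sketch of the idea only, not mirai's actual create_stream(); initial_stream() is a hypothetical name.

```r
# Hypothetical sketch: create an initial stream without disturbing the host RNG.
initial_stream <- function(seed = 0L) {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = globalenv())
  # Put the host state back however the function exits.
  on.exit(
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  )
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  get(".Random.seed", envir = globalenv())
}

set.seed(42L)                    # some existing host RNG state
before <- .Random.seed
stream <- initial_stream()
identical(before, .Random.seed)  # the host state is unchanged afterwards
```

Because the first element of .Random.seed encodes the generator kind, putting the old vector back also restores the host's previous RNG kind.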
If mirai already does all this already, I wonder if crew should step aside by default and avoid setting seeds altogether. Does that sound reasonable? Users in crew could still set seeds and algorithms if they really care, but this would not be the default.
Not an issue for crew to do that. But currently this implementation only ensures the statistical properties without being reproducible. To do so would require mapping tasks to workers beforehand or recording what happens so that it can be repeated. Is that something crew can do?
Basically I have just replicated what happens in parallel to this point. It is an improvement from completely unreproducible / random statistical properties.
currently this implementation only ensures the statistical properties without being reproducible. To do so would require mapping tasks to workers beforehand or recording what happens so that it can be repeated. Is that something crew can do?
crew records the seed supplied by the user to push(). I could change that to the length-7 L'Ecuyer seed from .Random.seed before the task begins, and I could make sure it is meaningful by disabling the newly added algorithm argument. Sound appropriate?
If I understand you correctly, you are saying that the seed used is recorded by targets and hence allows reproducibility if re-run?
If that's the case then great - yes you can change the seed recorded to the length 7 integer vector. In which case you would not want the algorithm to be changed. Note that the actual seed is 6 integers - the first just identifies the .Random.seed as L'Ecuyer I think.
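A quick way to inspect that layout in base R (a sketch; the variable name seed is arbitrary):

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(0L)
seed <- .Random.seed

length(seed)      # 7 integers in total
seed[1L] %% 100L  # the low two digits of the first element code the generator kind
seed[-1L]         # the remaining 6 integers are the actual generator state
```

Per the documentation for .Random.seed, the higher digits of the first element encode the normal generator and the sampling method, so it is best treated as opaque and recorded together with the other six values.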
If that works then shall I open an interface to get and advance the stream for each compute profile? I think this will be best practice for maintainability.
If I understand you correctly, you are saying that the seed used is recorded by targets and hence allows reproducibility if re-run?
Both targets and crew do this. For crew, I am thinking a task could return .Random.seed if algorithm = "mirai" (will be the default) and otherwise the length 1 integer supplied to set.seed().
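A minimal sketch of the record-and-replay idea, run in a single process. run_task() is a hypothetical helper and not crew's actual implementation; the assumption is simply that the stream in effect before the task starts is what gets returned.

```r
# Hypothetical task runner: start the task from a given stream and
# return that stream alongside the result so the task can be replayed.
run_task <- function(fun, stream) {
  assign(".Random.seed", stream, envir = globalenv())
  list(result = fun(), seed = stream)
}

RNGkind("L'Ecuyer-CMRG")
set.seed(0L)
stream <- parallel::nextRNGStream(.Random.seed)

original <- run_task(function() rnorm(2L), stream)
replayed <- run_task(function() rnorm(2L), original$seed)
identical(original$result, replayed$result)  # replaying the recorded seed repeats the draws
```

Recording the seed per task sidesteps the question of which worker ran it, since the replay no longer depends on the allocation.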
That's a nice name for the algorithm :) Agree there!
In addition, note that it is the responsibility of the launcher to get and advance the stream for each worker. mirai does that for the ones it launches itself e.g. locally. Each time a compute profile (environment) is created, a stream is also created and stored there. So my question above is just to confirm if a slightly modified nextstream(.compute) function should be exported?
Hmm... so then it looks like crew needs to do more manual work than I realized. Seems doable though, using something like https://github.com/wlandau/crew/issues/113#issuecomment-1706703123.
So my question above is just to confirm if a slightly modified nextstream(.compute) function should be exported?
Yeah, I think that would help a lot.
Ok! I'm currently on my 'commute' so I'll get this to you with some pointers a bit later. Should be straightforward.
nextstream() in mirai is now ready to go in 9495f5c. I've tested with crew and will put up a PR with the minimal changes required to make it work.
Next to seed. c.f. #112