User-supplied RNG algorithm in `push()`

wlandau commented 1 year ago

Next to seed. c.f. #112

shikokuchuo commented 1 year ago

This seems sensible to enable.

As I have just implemented the use of L'Ecuyer-CMRG streams in mirai, I have not yet had the time to think about the best way to integrate this with crew. There may be a clever way that it can use the .Random.seed directly from mirai, although that probably also necessitates further improvements - I'll try to summarise below.

The motivation is to ensure good statistical properties for computations which are split into parallel processes and then brought back together - the classic parLapply type functional programming. From what I understand, even using 'random' random seeds does not guarantee non-synchronicity across multiple child processes. I guess this becomes more relevant the more long-running the computations are. The solution was devised by Luke Tierney and BD Ripley himself it seems (from the source attribution) - using L'Ecuyer-CMRG streams which are generated iteratively, with each one used to generate the next.

The implementation in the base package parallel actually sends the seeds once the cluster is set up, as in the function parallel::clusterSetRNGStream, whereas I have made it part of the command line argument to daemon() to ensure that it is always set. This means that non-programmeRs at least get a good default out of the box.

At the moment, the implementation in mirai does, I believe, at least what parallel does. It is also an important improvement on the previous situation, where the random seed was reset (randomly) after each evaluation - it is now persisted. Statistical 'safety' is ensured as each process uses a different stream. Reproducibility, however, is not guaranteed when dispatcher (or any load-balancing algorithm) is used, because which worker receives which task is then not deterministic.

Non-dispatcher mirai however does now generally allow reproducibility due to the round-robin behaviour of NNG, so we know that tasks are allocated to workers sequentially. As I alluded to above, a better more-generalised solution is likely possible but at the cost of more complexity. This also means that for crew, until this is found, it may not benefit directly as I believe targets does require a high level of reproducibility.

However, the changes should also have no downsides vs before (just to note that the RNGkind in the worker processes is now L'Ecuyer-CMRG by default, although you may also choose to override this at the crew level for consistency with previous behaviour if that is important).

wlandau commented 1 year ago

Is the main issue statistical reproducibility, or is it how "random" the draws look?

For reproducibility, targets assigns each target a unique deterministic seed of digest::digest2int(as.character(TARGET_NAME), seed = GLOBAL_SEED), where GLOBAL_SEED is configurable and has a fixed default of 0L. I was thinking crew users could emulate this on a task-by-task basis if needed.

The latter issue seems trickier. If each task sets its own unique seed deterministically, then that would ignore the part of the RNG algorithm that transitions from state to state, and so the draws might not emulate randomness exactly as advertised.
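The per-target seed described above (hashing the target name together with a global seed) could be sketched roughly as follows. This assumes the digest package is installed; `target_seed()` is a hypothetical helper name used only for illustration, not a function from targets or crew.

```r
# Hypothetical helper (not part of targets or crew): derive a deterministic
# integer seed from a task name and a configurable global seed.
# Requires the digest package.
target_seed <- function(name, global_seed = 0L) {
  digest::digest2int(as.character(name), seed = global_seed)
}

# The same name and global seed always map to the same integer seed.
seed_a <- target_seed("model_fit")
seed_b <- target_seed("model_fit")

# Setting the derived seed before a task makes its draws repeatable.
set.seed(seed_a)
draws_1 <- runif(3)
set.seed(seed_b)
draws_2 <- runif(3)
```

This only addresses reproducibility. It says nothing about whether draws from differently seeded tasks are independent of each other.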

wlandau commented 1 year ago

> The latter issue seems trickier. If each task sets its own unique seed deterministically, then that would ignore the part of the RNG algorithm that transitions from state to state, and so the draws might not emulate randomness exactly as advertised.

Or, I might actually be forgetting how seeds work.

wlandau commented 1 year ago

> For reproducibility, targets assigns each target a unique deterministic seed of digest::digest2int(as.character(TARGET_NAME), seed = GLOBAL_SEED), where GLOBAL_SEED is configurable and has a fixed default of 0L. I was thinking crew users could emulate this on a task-by-task basis if needed.

If this covers reproducibility, would I really need L'Ecuyer-CMRG?

wlandau commented 1 year ago

https://stackoverflow.com/a/13807851 seems relevant. The statistical guarantees are supposed to be:

1. Reproducibility.
2. Independence.

The current approach of targets guarantees (1), but in hindsight, I am not so sure about (2).

shikokuchuo commented 1 year ago

> The latter issue seems trickier. If each task sets its own unique seed deterministically, then that would ignore the part of the RNG algorithm that transitions from state to state, and so the draws might not emulate randomness exactly as advertised.
>
> Or, I might actually be forgetting how seeds work.

Exactly this. If it were as simple as setting the seed then BD Ripley would probably not have had to invent such an elaborate solution. The use of L'Ecuyer or not has no bearing on reproducibility.

I don't have a definitive answer as of now as to how much of an issue setting the seed deterministically actually might be - especially as each 'task' where this is done could be atomic on one hand or involve a very long sequence of statistical draws on the other.

~~If helpful, I could export the function nextstream() for you to access (and advance) the stream currently stored on host, as an alternative approach.~~ This is currently not reproducible, as I mentioned previously.

The topic probably merits a deeper dive at some point. But at least we are incrementally making improvements!

wlandau commented 1 year ago

> From what I understand, even using 'random' random seeds does not guarantee non-synchronicity across multiple child processes. I guess this becomes more relevant the more long-running the computations are.

So if the computation runs long enough on an existing set of parallel processes, then the PRNG state in one process could potentially overlap the PRNG state in a different parallel process? Because it's not just one long sequence which e.g. Mersenne Twister alone could mitigate?

> The solution was devised by Luke Tierney and BD Ripley himself it seems (from the source attribution) - using L'Ecuyer-CMRG streams which are generated iteratively, with each one used to generate the next.

> The use of L'Ecuyer or not has no bearing on reproducibility.

Yeah, so I guess RNGkind()[1L] could be the default. Changed in b0066d27b2db087257bf7251d0509246d3b1f42d.
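As a rough illustration of a task that accepts both a seed and an RNG algorithm, with the current `RNGkind()[1L]` as the default, here is a minimal base R sketch. `run_with_rng()` is a hypothetical helper and is not how crew implements `push()`.

```r
# Hypothetical helper (not crew's code): run `fun` under a user-chosen seed
# and RNG algorithm, then restore the caller's RNG kind and state.
run_with_rng <- function(fun, seed, algorithm = RNGkind()[1L]) {
  old_kind <- RNGkind()[1L]
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = globalenv())
  on.exit({
    RNGkind(kind = old_kind)
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    }
  })
  set.seed(seed, kind = algorithm)
  fun()
}

kind_before <- RNGkind()[1L]
task <- function() runif(2)
x1 <- run_with_rng(task, seed = 1L, algorithm = "L'Ecuyer-CMRG")
x2 <- run_with_rng(task, seed = 1L, algorithm = "L'Ecuyer-CMRG")
x3 <- run_with_rng(task, seed = 1L, algorithm = "Mersenne-Twister")
kind_after <- RNGkind()[1L]
```

The same seed with the same algorithm repeats the draws, while the same seed under a different algorithm does not, which is why a recorded seed is only meaningful together with the algorithm it was used with.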

shikokuchuo commented 1 year ago

> From what I understand, even using 'random' random seeds does not guarantee non-synchronicity across multiple child processes. I guess this becomes more relevant the more long-running the computations are.

> So if the computation runs long enough on an existing set of parallel processes, then the PRNG state in one process could potentially overlap the PRNG state in a different parallel process? Because it's not just one long sequence which e.g. Mersenne Twister alone could mitigate?

Just because Mersenne-Twister has a long period does not guarantee that two different processes will not start at similar points and hence overlap, I guess.

The L'Ecuyer-CMRG streams (at least as implemented in base R) solve this problem: the streams are created beforehand and their random seeds are passed to the child processes. The streams are then guaranteed to be independent of each other. This is what is now implemented in mirai.

wlandau commented 1 year ago

Thanks, that helps.

I see mirai uses nextRNGStream(), and the documentation is clear.

So this is my understanding of how to create independent RNG streams. First create an initial stream, which is just a vector of 7 integers.

RNGkind("L'Ecuyer-CMRG")
set.seed(0L) # global seed doesn't matter except for reproducibility
streams <- list()
streams[[1L]] <- .Random.seed # [[ ]] stores the whole vector as one list element

Then each subsequent stream is created recursively from the previous one.

streams[[2L]] <- parallel::nextRNGStream(streams[[1L]])
streams[[3L]] <- parallel::nextRNGStream(streams[[2L]])
...

What's more, each nextRNGStream(streams[[i]]) is deterministic.
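To check the determinism claim, the steps above can be wrapped in a small function and run twice. This is a sketch only; `make_streams()` is a hypothetical name, and it changes the session's RNG kind as a side effect.

```r
library(parallel) # provides nextRNGStream()

# Hypothetical helper: build `n` L'Ecuyer-CMRG streams from one global seed.
# Side effect: switches the session's RNG kind to "L'Ecuyer-CMRG".
make_streams <- function(n, seed = 0L) {
  RNGkind("L'Ecuyer-CMRG")
  set.seed(seed)
  streams <- vector("list", n)
  streams[[1L]] <- get(".Random.seed", envir = globalenv())
  for (i in seq_len(n - 1L)) {
    # Each stream is derived only from the previous one.
    streams[[i + 1L]] <- nextRNGStream(streams[[i]])
  }
  streams
}

streams_a <- make_streams(3L)
streams_b <- make_streams(3L)
same_streams <- identical(streams_a, streams_b)
n_distinct <- length(unique(streams_a))
```

Because every stream is a pure function of the one before it, the whole sequence is fixed once the global seed is fixed.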

If mirai already does all this, I wonder if crew should step aside by default and avoid setting seeds altogether. Does that sound reasonable? Users in crew could still set seeds and algorithms if they really care, but this would not be the default.

shikokuchuo commented 1 year ago

Yes that's right. The only addition in my create_stream() is that the .Random.seed in the host process is restored, analogous to parallel::clusterSetRNGStream().
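For illustration, restoring the host's .Random.seed after deriving a stream might look like the sketch below, which follows the save-and-restore pattern used by parallel::clusterSetRNGStream(). `initial_stream()` is a hypothetical name; this is not the code of mirai's create_stream().

```r
# Hypothetical sketch (not mirai's create_stream()): obtain an initial
# L'Ecuyer-CMRG stream without disturbing the host session's RNG.
initial_stream <- function() {
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old_seed <- if (had_seed) get(".Random.seed", envir = globalenv())
  old_kind <- RNGkind()[1L]
  on.exit({
    RNGkind(kind = old_kind)
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  })
  RNGkind("L'Ecuyer-CMRG") # switching kind also initialises .Random.seed
  get(".Random.seed", envir = globalenv())
}

set.seed(42)
before <- .Random.seed
stream <- initial_stream()
host_unchanged <- identical(before, .Random.seed)
```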

shikokuchuo commented 1 year ago

> If mirai already does all this, I wonder if crew should step aside by default and avoid setting seeds altogether. Does that sound reasonable? Users in crew could still set seeds and algorithms if they really care, but this would not be the default.

Not an issue for crew to do that. But currently this implementation only ensures the statistical properties without being reproducible. To do so would require mapping tasks to workers beforehand or recording what happens so that it can be repeated. Is that something crew can do?

Basically I have just replicated what happens in parallel to this point. It is an improvement from completely unreproducible / random statistical properties.

wlandau commented 1 year ago

> currently this implementation only ensures the statistical properties without being reproducible. To do so would require mapping tasks to workers beforehand or recording what happens so that it can be repeated. Is that something crew can do?

crew records the seed supplied by the user to push(). I could change that to the length-7 L'Ecuyer seed from .Random.seed before the task begins, and I could make sure it is meaningful by disabling the newly added algorithm argument. Sound appropriate?

shikokuchuo commented 1 year ago

If I understand you correctly, you are saying that the seed used is recorded by targets and hence allows reproducibility if re-run?

If that's the case then great - yes you can change the seed recorded to the length 7 integer vector. In which case you would not want the algorithm to be changed. Note that the actual seed is 6 integers - the first just identifies the .Random.seed as L'Ecuyer I think.
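The layout of .Random.seed under L'Ecuyer-CMRG can be inspected directly in base R. A small sketch:

```r
RNGkind("L'Ecuyer-CMRG")
set.seed(0L)
seed <- .Random.seed

seed_length <- length(seed)   # 7: one kind code plus six state integers
kind_code <- seed[1L] %% 100L # lowest two decimal digits encode the RNG kind
state <- seed[-1L]            # the six integers of generator state
```

The first element also encodes the normal and sample kinds in its higher digits, so it is safest to record and restore the full length 7 vector rather than only the six state integers.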

shikokuchuo commented 1 year ago

If that works then shall I open an interface to get and advance the stream for each compute profile? I think this will be best practice for maintainability.

wlandau commented 1 year ago

> If I understand you correctly, you are saying that the seed used is recorded by targets and hence allows reproducibility if re-run?

Both targets and crew do this. For crew, I am thinking a task could return .Random.seed if algorithm = "mirai" (will be the default) and otherwise the length 1 integer supplied to set.seed().
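A minimal sketch of the idea of returning the seed alongside the result, so that a task can be replayed later regardless of which worker originally ran it. `run_task()` is a hypothetical stand-in, not crew's task code.

```r
# Hypothetical task runner: set the RNG state from `seed`, run `fun`, and
# return the result together with the seed that was in effect at the start.
run_task <- function(fun, seed) {
  assign(".Random.seed", seed, envir = globalenv())
  list(result = fun(), seed = seed)
}

RNGkind("L'Ecuyer-CMRG")
set.seed(0L)
first <- run_task(function() rnorm(2), seed = .Random.seed)

# Replaying with the recorded seed reproduces the same draws.
replay <- run_task(function() rnorm(2), seed = first$seed)
```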

shikokuchuo commented 1 year ago

That's a nice name for the algorithm :) Agree there!

In addition, note that it is the responsibility of the launcher to get and advance the stream for each worker. mirai does that for the ones it launches itself e.g. locally. Each time a compute profile (environment) is created, a stream is also created and stored there. So my question above is just to confirm if a slightly modified nextstream(.compute) function should be exported?
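To make the "get and advance" responsibility concrete, here is a hedged base R sketch of a stream store that hands out the current stream and then moves on to the next one. `new_stream_store()` is a hypothetical name and is not mirai's nextstream().

```r
library(parallel) # provides nextRNGStream()

# Hypothetical stand-in for a per-profile stream store: each call returns
# the current stream and advances the stored one for the next worker.
new_stream_store <- function(initial) {
  current <- initial
  function() {
    out <- current
    current <<- nextRNGStream(current)
    out
  }
}

RNGkind("L'Ecuyer-CMRG")
set.seed(0L)
next_stream <- new_stream_store(.Random.seed)

# A launcher would call this once per worker it starts.
worker_seeds <- lapply(1:3, function(i) next_stream())
```

Keeping the state inside a closure means a launcher cannot accidentally hand the same stream to two workers, as long as every worker seed comes from the same store.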

wlandau commented 1 year ago

Hmm... so then it looks like crew needs to do more manual work than I realized. Seems doable though, using something like https://github.com/wlandau/crew/issues/113#issuecomment-1706703123.

> So my question above is just to confirm if a slightly modified nextstream(.compute) function should be exported?

Yeah, I think that would help a lot.

shikokuchuo commented 1 year ago

Ok! I'm currently on my 'commute' so I'll get this to you with some pointers a bit later. Should be straightforward.

shikokuchuo commented 1 year ago

nextstream() in mirai is now ready to go in 9495f5c. I've tested with crew and will put up a PR with the minimal changes required to make it work.

wlandau / crew

User-supplied RNG algorithm in `push()` #113