zoonproject / zoon

The zoon R package

Stochastic models #15

Open timcdlucas opened 9 years ago

timcdlucas commented 9 years ago

Need to call set.seed() within each module so that parallel modules are comparing the same thing. E.g. if we compare two data sets, we want the same pseudorandom background points. Need to check I'm doing it sensibly and reproducibly all the way through, though.
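A minimal sketch of the idea, in base R only. The function `sample_background()` and its arguments are hypothetical stand-ins for a process module, not zoon's actual code; the point is just that seeding inside the module makes two separate calls draw identical background points.

```r
# Hypothetical stand-in for a background-sampling process module.
# Seeding inside the function means every call draws the same points.
sample_background <- function(extent, n = 100, seed = 1) {
  set.seed(seed)
  data.frame(lon = runif(n, extent[1], extent[2]),
             lat = runif(n, extent[3], extent[4]))
}

uk_extent <- c(-10, 2, 50, 60)  # illustrative bounding box, not from zoon

# Two independent calls, e.g. one per occurrence data set
bg_a <- sample_background(uk_extent)
bg_b <- sample_background(uk_extent)

same_points <- identical(bg_a, bg_b)
```

Here `same_points` is `TRUE`, so any difference between the two fitted models would come from the occurrence data rather than from the background sample.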

AugustT commented 8 years ago

I think this should be in the module rather than in the core workflow. If it was in the core workflow you could never turn it off, which could be quite annoying.

timcdlucas commented 8 years ago

I guess you should be able to turn it off, but I can only think of one situation when you'd want to: running the entire same workflow multiple times to see what effect the random elements have.

e.g.

work1 <- workflow(occurrence = UKAnophelesPlumbeus,
                  covariate  = UKAir,
                  process    = OneHundredBackground,
                  model      = list(RandomForest, RandomForest),
                  output     = PrintMap)

However, having looked a bit more carefully at the parallel package, I think that does most of it for you. But then it is probably very difficult to reproduce one model out of a large workflow (e.g. how do you exactly reproduce the second random forest model in the workflow above?).

Some of this applies to both parallel and serial computation.
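One possible way to make a single model in a list reproducible is sketched below, using the L'Ecuyer-CMRG streams that the parallel package provides. `fit_model()` is a hypothetical stand-in for a stochastic model module, and whether zoon should work this way is an open question; the sketch only shows that each model can be given its own stream derived from one master seed, so the second model can be rerun alone.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # generator that supports independent streams
set.seed(2016)            # single master seed for the whole workflow

# One stream per model, each derived from the previous one
stream1 <- .Random.seed
stream2 <- nextRNGStream(stream1)

# Hypothetical stand-in for a stochastic model fit:
# install the given stream, then draw random numbers from it
fit_model <- function(stream) {
  assign(".Random.seed", stream, envir = globalenv())
  runif(3)
}

# "Full workflow": both models, each on its own stream
full_run <- list(fit_model(stream1), fit_model(stream2))

# Later: reproduce only the second model, without rerunning the first
second_only <- fit_model(stream2)

reproduced <- identical(full_run[[2]], second_only)
```

`reproduced` is `TRUE` because the second model's draws depend only on `stream2`, not on how much randomness the first model consumed. The same streams work whether the models run serially or on separate workers.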

AugustT commented 8 years ago

how do you exactly reproduce the second random forest model in the workflow above

by setting the random seed explicitly?

work1 <- workflow(occurrence = UKAnophelesPlumbeus,
                  covariate  = UKAir,
                  process    = OneHundredBackground,
                  model      = list(RandomForest, RandomForest(seed = 123)),
                  output     = PrintMap)

I have added this argument into all of the process modules, doing a PR now. I guess we need something similar for the model modules.

timcdlucas commented 8 years ago

As you currently have it, if someone publishes a big workflow and hasn't set the seeds in each model explicitly, then other users still can't replicate it. I feel like there might be a better way of doing this.

Something like, if seed = NULL the seed gets selected (in a sensible way...) and then saved. Perhaps written into work1$call.list or somewhere. Then if I want to replicate a particular bit I would set the seed explicitly. So you could do

rework <- workflow(occurrence = UKAnophelesPlumbeus,
                   covariate  = UKAir,
                   process    = OneHundredBackground,
                   model      = RandomForest(seed = work1$call.list[[4]][[2]]$paras$seed),
                   output     = PrintMap)

which would directly replicate the second RandomForest above, even if they hadn't set a seed.

I guess it's convoluted... It also means we don't rely on people doing set.seed() at the top of their scripts nor reporting the seed value if they did set it. Currently a workflow object uploaded to figshare for example isn't replicable because it doesn't store this information.
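A rough sketch of the "pick a seed if none is given, then save it" behaviour. `background_module()` and the shape of its return value are assumptions made for illustration; they are not how zoon modules or `call.list` are actually structured.

```r
# Hypothetical module: if seed is NULL, choose one, use it, and report it
background_module <- function(n, seed = NULL) {
  if (is.null(seed)) {
    # Draw a fresh seed so unseeded runs still differ from each other
    seed <- sample.int(.Machine$integer.max, 1)
  }
  set.seed(seed)
  list(points = runif(n), seed = seed)  # the seed travels with the result
}

# Original author never set a seed...
first <- background_module(5)

# ...but anyone holding the result can replicate it exactly
again <- background_module(5, seed = first$seed)

replicated <- identical(first$points, again$points)
```

`replicated` is `TRUE`, and the stored `first$seed` is what would need to be written into the saved workflow object for an uploaded workflow to be replicable.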

AugustT commented 8 years ago

Hmmm, good point.

Why not have the seed set by default? So like yours above, except seed = 123. If, as you say, non-random is the user's normal requirement, then this would make things more straightforward, and users can use seed = NULL for cases where they do want randomness.

timcdlucas commented 8 years ago

Hmmm. It feels weird having a default number. For example, if we just so happen to pick a weird seed for a module, then all analyses could have this weird pattern. Maybe I'm worrying needlessly. But for a cross-validation module, it could conceivably happen that the default seed gave

sample(c(1:5), N, replace = TRUE)
[1] 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3

or something. Highly unlikely, but the fact that the same weirdness would be repeated could be an issue.

AugustT commented 8 years ago

This could happen, but it could also happen when not setting a seed. In both cases the solution is the same: use a different seed.