yihui / knitr

A general-purpose tool for dynamic report generation in R
https://yihui.org/knitr/

parallel chunks #744

Closed rubenarslan closed 9 years ago

rubenarslan commented 10 years ago

I just made a feature request at RStudio and thought I'd lodge it here too.

I often use knitr to group my models with a bit of context etc., so I'll have something like:

## Model1
```{r}
slow_glmer_call
```

## Model2
```{r}
slow_glmer_call2
```

I was wondering whether it would be feasible and/or interesting for others to add an order parameter to knitr chunks. Chunks that have the same order value could then be executed in parallel automatically.
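
A sketch of what that might look like (the `order` chunk option is hypothetical; it does not exist in knitr): chunks sharing the same `order` value would be eligible to run concurrently.

## Model1
```{r model1, order = 1}
# `order` is the proposed (hypothetical) chunk option
slow_glmer_call
```

## Model2
```{r model2, order = 1}
slow_glmer_call2
```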

For me, this would be nice, since I like having a readable, reproducible report, but I also dislike waiting hours for my 20 models to finish in sequence on one core out of eight.

I'm guessing the main barrier would be doing this automatically across different architectures, or does this somehow go contrary to knitr's philosophy?
yihui commented 10 years ago

It looks interesting, but I guess it should be more straightforward for you to write the parallel code in one code chunk than for me to come up with a new design. If I were to do it, it is not clear to me which back-end I should use: the parallel package, or other packages?...
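
For reference, the one-chunk approach might look roughly like this (a minimal sketch using the base parallel package; the lme4 sleepstudy data and formulas are placeholders standing in for the slow glmer calls):

```r
# Fit several slow models from a single chunk using the base parallel package.
library(parallel)
library(lme4)

# Placeholder models standing in for the slow glmer calls.
models <- list(
  m1 = Reaction ~ Days + (Days | Subject),
  m2 = Reaction ~ Days + (1 | Subject)
)

# mclapply() forks one worker per model; on Windows mc.cores must stay at 1.
fits <- mclapply(models, function(f) lmer(f, data = sleepstudy), mc.cores = 2)
```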

rubenarslan commented 10 years ago

Of course putting it all in one chunk is already possible, but it doesn't result in a readable report, and it would mean that the coefplots I generate from my models are made in a different place than the models themselves, and so on. I wanted to bring the idea up. I might try my hand at a pull request if I ever get time (maybe while waiting for glmer to finish).

Additionally, I think that while some people with a pressing need would of course go the within-chunk route, people with a less pressing need would still benefit from the speed increase through automatic parallelisation. I think parallelisation fits quite naturally into knitr because of the chunk concept; most R novices probably wouldn't use it otherwise. As an example, I use knitr in my survey framework so that scientists can generate feedback (including ggplots etc.) for the survey takers. The average user would only use high-level R functions to generate these plots (which can nonetheless be somewhat complex and slow, e.g. a LOESS smooth superimposed on mood data from a diary), but if knitr parallelised the plot chunks, the feedback would be generated faster.

I'm not sure whether some logic could be borrowed from what you do with autodep for caching, to work out which chunks can be executed in parallel automatically. I think there is a bunch of useful heuristics for this sort of thing. For example, I (and probably many others) do my data munging in a separate document from my analyses. In my analyses document every chunk could run in parallel; it's all plots and analyses there.

I believe parallel would be the way to go, but the last time I really read up on this was when multicore/snow were still current.

yanlinlin82 commented 10 years ago

I agree with @rubenarslan here that parallel chunk support is reasonable, much like the "-j" option of GNU make, which lets the build run several jobs in parallel without caring whether each individual command is multi-threaded or not.

Another reason is that hard-coding parallelism would, in most (if not all) cases, increase the complexity of the computing code and reduce the elegance of literate programming.

Moreover, since knitr supports the "autodep" option, it should be easy to find the independent chunks, so why not run them in parallel?
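
For context, knitr's existing dependency machinery looks roughly like this (cache, autodep and dependson are real chunk options; with autodep = TRUE the dependson values below would be inferred from the cached chunks, and running the independent model chunks in parallel is the hypothetical part):

```{r setup, include=FALSE}
# Enable caching and automatic dependency detection.
library(knitr)
opts_chunk$set(cache = TRUE, autodep = TRUE)
```

```{r munge}
dat <- transform(airquality, logOzone = log(Ozone))
```

```{r model1, dependson="munge"}
fit1 <- lm(logOzone ~ Temp, data = dat)
```

```{r model2, dependson="munge"}
fit2 <- lm(logOzone ~ Wind, data = dat)
```

Here model1 and model2 depend only on munge, so they are exactly the kind of independent chunks that a "-j"-style scheduler could run concurrently.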

yihui commented 10 years ago

This makes sense. I'll think about it. Running code in parallel is a bit tricky when there are plots produced as side effects. I'm not entirely sure how this would work.

rubenarslan commented 10 years ago

@yihui Would the plots get lost when you collect the results? That would be bad. Or are you thinking about the file-system access, i.e. writing to the figure directory? Because again, that's why I think it would work so nicely with chunks: they already keep those apart.

One issue I thought of last night was what should happen if a chunk spawns forks itself (e.g. because it's bootstrapping something). I don't think you can get a good handle on forks you didn't spawn, especially on Windows. But that needn't be supported, I'd say; it just needs to be possible to switch the heuristics off.
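
To make the nested-forking concern concrete, here is a sketch of such a chunk (the statistic and core count are purely illustrative, using the boot package's own city data): if knitr forked this chunk as well, the forks would be nested, so a per-chunk opt-out would be needed.

```r
# A chunk that spawns its own forks: a parallel bootstrap via the boot package.
# (Illustrative example; the statistic and ncpus value are arbitrary.)
library(boot)

ratio <- function(d, i) mean(d$x[i]) / mean(d$u[i])
b <- boot(city, ratio, R = 999,
          parallel = "multicore", ncpus = 4)  # this call forks 4 workers itself
```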

yanlinlin82 commented 10 years ago

Indeed. Since we are always plotting to dev.cur(), which is effectively a global variable, a race condition is inevitable when running in parallel. Maybe it is impossible to plot safely in threads unless we introduce a lock around the graphics device and the plotting functions/chunks.
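
One way around that, at least for forked processes rather than threads, is for each worker to open its own off-screen file device so the parent's dev.cur() is never touched; a minimal sketch (file names and plot contents are made up):

```r
library(parallel)

plot_one <- function(i) {
  # Each forked worker opens and closes its own off-screen device,
  # so nothing is ever drawn on the parent's current device.
  png(file.path(tempdir(), sprintf("model-%d.png", i)), width = 600, height = 400)
  on.exit(dev.off())
  plot(rnorm(100), main = paste("model", i))
  invisible(NULL)
}

# mc.cores must be 1 on Windows.
invisible(mclapply(1:4, plot_one, mc.cores = 4))
```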

rubenarslan commented 10 years ago

Naïve point of view (I'm not a proper programmer), but running several R sessions in parallel (e.g. two RStudio projects, or RStudio and R.app) seems safe on my Mac, and so does plotting.

And it doesn't seem like the necessary time doubles (is there some sort of locking already implemented?), presumably because printing isn't the only step.

But apparently two forks of the same process are not as safe as two sessions:

From the parallel::mclapply documentation:

> It is strongly discouraged to use these functions in GUI or embedded environments, because it leads to several processes sharing the same GUI which will likely cause chaos (and possibly crashes). Child processes should never use on-screen graphics devices.
>
> Some precautions have been taken to make this usable in R.app on OS X, but users of third-party front-ends should consult their documentation.
>
> Note that tcltk counts as a GUI for these purposes since Tcl runs an event loop. That event loop is inhibited in a child process but there could still be problems with Tk graphical connections.

Probably this isn't surprising to anyone but me :-)

yihui commented 9 years ago

I'm closing this issue since I do not think I will ever find time to do it. It sounds a little complicated for me. If anyone wants to experiment with the idea, please feel free to, and pull requests are always welcome :)

github-actions[bot] commented 3 years ago

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary.