yihui / knitr

A general-purpose tool for dynamic report generation in R
https://yihui.org/knitr/

Report writing workflow: decouple steps of writing of code and document. #536

Closed jeroen closed 11 years ago

jeroen commented 11 years ago

Rnw/Rmd is a great format for reproducible reports. However, I think the current workflow for writing such reports is not optimal. When writing (non-reproducible) reports, people like to first produce the materials, and then write the document. There is no reason why that should be different for reproducible reports.

I would like to make a web-based editor for writing Rnw/Rmd. However, instead of users iterating between updating their Rmd/Rnw files and running knit(), I want a workflow as follows:

Creating a new report

  1. User creates a single R script that runs the study. No weaving.
  2. When the script has been completed, the user opens the script in the editor. The editor runs evaluate() once and stores the results-object. The results-object contains all code, console output, figures, etc. that will end up in the report. The user is now done with R.
  3. The user writes the report using standard markdown/latex or other document editing workflow. The obtained results (or previews) are displayed in a wysiwyg editor.
  4. The editor creates the weaved code containing md/tex and original R code. The final result is a standard Rnw/Rmd script as we know it.

Compiling / Editing an existing report

  5. The user opens up the Rnw/Rmd file in the editor.
  6. The editor runs the R code in the report once using evaluate() and stores all results.
  7. Same as step 3 above.
  8. Same as step 4 above.

To implement this, we need an alternative interface that decouples the knit() process into two steps:

  1. The first function takes an .Rnw/.Rmd document, runs evaluate() on the code and returns a list that contains both the R output (evaluate results-object) and the latex/markdown code, in the original order. An R script is a special case of an Rnw document that contains no md/tex, hence the output will be identical to the output of evaluate().
  2. A second function compiles this list of R output and md/tex into the final html/pdf/etc.

The knit() function can then be implemented as simply running both these functions in sequence.
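
A minimal sketch of the two-step split described above, in base R only. The function names `run_document()` and `render_document()` are hypothetical, and `capture.output()` is used as a crude stand-in for evaluate(), so figures, warnings and errors are not handled; this is an illustration of the idea, not how knitr is implemented.

```r
fence <- strrep("`", 3)  # a markdown code fence, built programmatically

# Step 1 (hypothetical name): split the document into text and code pieces,
# run each code piece once, and keep source and output in document order.
run_document <- function(lines) {
  is_fence <- grepl("^```", lines)
  block_id <- cumsum(is_fence)  # odd ids lie inside a chunk, even ids are text
  pieces <- split(lines[!is_fence], block_id[!is_fence])
  env <- new.env(parent = globalenv())  # shared by all chunks
  lapply(names(pieces), function(id) {
    src <- pieces[[id]]
    if (as.integer(id) %% 2 == 0) return(list(type = "text", src = src))
    exprs <- parse(text = src)
    # Stand-in for evaluate(): print visible values and capture the console text.
    out <- capture.output(for (e in exprs) {
      res <- withVisible(eval(e, env))
      if (res$visible) print(res$value)
    })
    list(type = "code", src = src, out = out)
  })
}

# Step 2 (hypothetical name): turn the stored list into markdown
# without running any R code from the document again.
render_document <- function(results) {
  unlist(lapply(results, function(piece) {
    if (piece$type == "text") return(piece$src)
    c(paste0(fence, "r"), piece$src, fence, fence, piece$out, fence)
  }))
}

doc <- c("Some text.", paste0(fence, "{r}"), "x <- 1 + 1", "x", fence, "More text.")
results <- run_document(doc)             # the code runs exactly once, here
results[[1]]$src <- "Some edited text."  # editing prose leaves the stored output alone
final <- render_document(results)
```

Running both functions in sequence gives the one-shot behaviour; keeping `results` around is what lets the text be edited without re-running the analysis.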

ramnathv commented 11 years ago

I would be happy to contribute. Here is a web app I put together using some of Yihui's code. I can hook it up to OpenCPU to do the knitting, but wasn't sure about the security implications. Let me know if you want to talk.

http://ramnathv.github.io/rNotebook/

rmflight commented 11 years ago

Here is an idea. Why not just have an R script that you run once (one that saves its results to an RData file at the end), and then load that RData file as the first thing in the Rmd/Rnw document? If you use ggplot2 for figures, I think you can actually save the plot to an object and then print it in the document where you want it to appear. Couple this process with the Ace editor and previewer that @ramnathv hooked up based on @yihui's demo code, and that might work.
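
A minimal sketch of this save-then-load idea, using base R only. The file path and the `lm()` fit are placeholders chosen for illustration; in practice the first half would live in the analysis script and the second half in the first chunk of the Rmd/Rnw document.

```r
# --- analysis script: run once ---
results_file <- file.path(tempdir(), "results.RData")  # hypothetical location
fit <- lm(dist ~ speed, data = cars)  # stand-in for a long-running computation
save(fit, file = results_file)

rm(fit)  # simulate the fresh R session in which the report is compiled

# --- first chunk of the Rmd/Rnw document ---
loaded <- load(results_file)  # returns the names of the restored objects
coef(fit)                     # results are available without recomputing them
```

The report then only pays the cost of `load()`, at the price of keeping the script and the document in sync by hand.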

I would personally only do this for longer running things. I often find myself editing both the code and the document concurrently, and in that case it is nice to have them in the same actual document. And having the code that generates results near where I talk about them is also a useful way to be reminded what was actually done, instead of having to go back to the script.

ramnathv commented 11 years ago

I agree with @rmflight on the workflow bit. I prefer writing an Rmd document directly and executing the code chunks in RStudio, instead of adorning the R file with comments.

baptiste commented 11 years ago

For anything that takes time to compute (i.e. more than a few seconds, and often up to several minutes in my documents), I agree that the current workflow can be painful, even with cache, when one only really wants to polish the textual part at the end. I make heavy use of external chunks, which in a way gets close to this proposal. In an R document with # @knitr tags I have all my analysis, which I play with interactively until there are no bugs and everything runs as intended. I then write the rmd file, inserting these chunk references, and evaluate it once using caching throughout. That works relatively well, except that the R and markdown bits are not ideally decoupled: e.g. figure captions, which I want to work on during the report writing, somehow trigger the R code for that chunk to be run again. If caching were more clever, and R code were never run again when only working on the text, I think I'd be mostly happy with the current setup.

rmflight commented 11 years ago

@baptiste can you give a reproducible example of code being re-executed for fig captions when you think it shouldn't be? Maybe there is an alternative way to write the code.

baptiste commented 11 years ago

@rmflight Any chunk will do, e.g.

```{r, cache=TRUE, fig.cap="This chunk takes **time**."}
Sys.sleep(10)
plot(cars)
```

If you run this a second time and modify only the caption, it will re-run the R code, which is unfortunate. There might be a workaround using a predefined global variable, but that's not ideal. (Note that this is a known feature request, nothing new).

rmflight commented 11 years ago

Ah, that kind of figure caption. Sorry, I misunderstood. Yeah, that would be annoying. Alternatively, could you not use knitr generated figure captions and instead insert the caption in the body of the text?

baptiste commented 11 years ago

Probably, yes, but I like to avoid temporary workarounds; they somewhat defeat the purpose of wanting to reuse the same construct/template several months from now. For now, and until the caching improves, one can always set Sys.sleep() to a convenient global variable coffee_break=600 :)

baptiste commented 11 years ago

Another positive aspect of jeroenooms's proposal is that it potentially opens knitr up to broader uses, precisely by optionally enabling more separation between the pure writing process and the coding parts. This would expand the range of uses to cover the full spectrum from "pure" literate programming to "standard" word processors. This is good, because different projects have different needs across this spectrum (e.g. some reports are almost pure code with a few lines of text, others are mostly literature).

Say, for example, one wanted to create a reproducible wiki based on rmd documents. The casual online reader and contributor might want to fix a typo, or add a reference, or a new section, etc. These online changes typically happen on the markdown file that was produced by knitr; as such they cannot currently be inserted back in the original rmd source. If, as suggested here, one could decouple the R code and the text, such reverse-editing could become possible. The output of the code itself would mostly retain its badge of reproducibility as it couldn't be tampered with from the text contributor's side.

rmflight commented 11 years ago

@jeroenooms OK, I re-read your suggestion again, and again, and I think I get it. I actually don't agree with your suggested workflow at all, although at one time I would have.

Traditionally, I think you are right. People tended to do a bunch of experiments, analyze the results, and then write them up. This is how we traditionally think of the scientific process. However, I think we all know that the reality is often much messier.

Finally, we get something we can communicate to others. What is absolutely incredible about the current way that knitr works is that it makes this workflow explicit, at least for completely computation-based papers, or those that involve data manipulation of some kind. The process of writing about the results informs the analysis, and there is this back and forth between the two. In a knitr document and workflow, this happens all the time, at least if the computations are embedded in the document itself.

I do think a compromise solution for what you want is my first proposal: do all the computations, save the results, and then load them. Either that, or design a complete alternative to knitr that does what you are looking for, because it seems like a break from the current idea that @yihui has in mind.

I don't know if the workflow I describe above is what @yihui had in mind when he wrote knitr, or if it came about because of its Sweave influence, but I think it actually fits any experimental writing paradigm better than what @jeroenooms proposes. But that's just my 2 cents.

rmflight commented 11 years ago

@baptiste sure, I can see those points. I don't know that implementing all of this in knitr is the best way to go about it, though.

baptiste commented 11 years ago

Out of curiosity, has someone started working on this idea?

yihui commented 11 years ago

@baptiste I can see your pain and I very much agree with you. The best news I can tell you is that the current weak cache feature has also been biting me hard (when something annoys the author himself, you will have hope :smiley: ). That is the whole point of #396, and I do plan to work on it soon.

Oh this thread is so long, and I have to go through the discussion again to figure out what the rest of people were talking about.

yihui commented 11 years ago

Now it is possible to cache the evaluate() results. See my comment under #396

baptiste commented 11 years ago

Thanks for the improved cache! However, I feel like the OP (jeroenooms) got cheated here, because you closed the issue based on answering my tangentially-related request! :)

The original proposal is still open for thoughts and discussions -- what's your perspective on it?

yihui commented 11 years ago

Not really -- I think the current cache system has done what he wanted:

  1. you can save the results of evaluate();
  2. you can reload these results and build new documents on top of them without breaking the cache;

Yes, your request seems to be "tangentially-related", but I believe it actually happens to be essential.

I'll give a concrete example later, to show how to start from an R script, save the results, and weave them into Rmd/Rnw/whatever documents.

baptiste commented 10 years ago

I have a concrete example motivating me to have another look at this issue. For my tamm package, I decided to pull the GitHub wiki as a git submodule; it now lives in inst/wiki, which feels very natural. That means I have a reproducible wiki to document my package (kind of like improved vignettes, with the usual cross-linking and other features inherited from the proven wiki framework). It also means that all the source files (rmd files) are available to the package user (with a suitable .Rbuildignore to clean up the directory). Now, if there were a way to weave changes made to the text part by online wiki users back into the Rmd source, this would be a fantastic tool for collaborative documentation of packages (or books, e.g. Hadley's ongoing projects). I wonder what minimal subset of knitr (a fully grown codebase by now) would be required to test this alternative implementation (and interpretation) of interactive/reproducible documents, as a proof of principle?

yihui commented 9 years ago

@baptiste Note that Hadley points the "Edit this page" button to the Rmd source document on GitHub, so there is no such step as "bring the changes in md back to Rmd". It is fairly easy to set up such buttons in HTML pages, but if you have to rely on the GitHub wiki system, it will become tricky. I'm not quite sure about the difficulty of implementing your idea, but I'm sure that publishing to the gh-pages branch is fairly easy.

baptiste commented 9 years ago

@yihui There are indeed alternative ways to go about this, as Gabor C. recently suggested on R-devel, and choice is a good thing. What I'm advocating here is opening the possibility of leveraging existing wiki systems such as GitHub's. Wikis are more than just HTML pages; there's a whole infrastructure that goes with them (page creation, search, autolinks, online editing/formatting tools, image uploads, ...). In that sense, an alternative framework for reproducible documents that allows two-way sync of the textual content could be an interesting option to explore. It's not the only way to create a reproducible wiki, nor perhaps the best way to do it, but I would argue that it's a useful option to consider.

Do you have a minimal set of core functions to reproduce the essential functionality of knitr, without all the bells and whistles, but easier to modify? knitrtoy could be forked and used as a toy model for alternative implementations.

yihui commented 9 years ago

I agree it is an interesting idea, but I doubt whether people will adopt it (there is a similar idea at https://www.stat.auckland.ac.nz/~paul/Reports/invert/invert.html, and I also doubt whether it will prove useful). The easiest solution for such a problem is to invent a special document format, e.g. IPython notebooks are essentially JSON files, and JSON is way easier for expressing structured documents than Markdown (#785).

The key issue in this case is that you have to come up with some special markers in the document that tell the parser which part is the source code chunk, and which part is its output. I imagine the document may look like this:

text text text

<!--source:chunk-a, opt1=val1, opt2=val2
1 + 1
source-->
<!--output:chunk-a-->
```r
1 + 1
[1] 2
```
<!--/output:chunk-a-->

text text text

What you might do is to define a custom parser (through `knit_patterns$set()`) and a custom renderer (through `knit_hooks$set()`) to process this file. Every time you re-compile the document, you add `<!--output:label-->` and `<!--/output:label-->` to the new chunk output via `knit_hooks$get('chunk')`, and strip off the previous output using `knit_hooks$get('text')`. I think this should work. Please feel free to experiment with it.
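
A minimal sketch of the "strip off the previous output" step, written as plain line filtering in base R rather than as an actual knitr hook. The function name `strip_output()` is hypothetical, and the marker syntax is the one proposed above; wiring this into `knit_hooks` is left out.

```r
# Drop everything between <!--output:label--> and <!--/output:label-->,
# including the two marker lines themselves (hypothetical helper).
strip_output <- function(lines) {
  is_open  <- grepl("^<!--output:.*-->$", lines)
  is_close <- grepl("^<!--/output:.*-->$", lines)
  inside <- cumsum(is_open) - cumsum(is_close) > 0  # TRUE from the opening marker on
  lines[!(inside | is_close)]
}

fence <- strrep("`", 3)  # a markdown code fence, built programmatically
doc <- c(
  "text text text",
  "<!--source:chunk-a, opt1=val1, opt2=val2",
  "1 + 1",
  "source-->",
  "<!--output:chunk-a-->",
  paste0(fence, "r"), "1 + 1", "[1] 2", fence,
  "<!--/output:chunk-a-->",
  "text text text"
)
stripped <- strip_output(doc)
```

After stripping, only the prose and the commented-out source remain, so text edits made to the rendered file survive while the output is regenerated on the next compile.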

This has apparently digressed from Jeroen's original idea, so a new issue might be more appropriate.

github-actions[bot] commented 3 years ago

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue by following the issue guide (https://yihui.org/issue/), and link to this old issue if necessary.