yihui / knitr

A general-purpose tool for dynamic report generation in R
https://yihui.org/knitr/

Use knitr::spin() with R and Python code in one document #1773

Open fdetsch opened 4 years ago

fdetsch commented 4 years ago

Question directly copied from StackOverflow:

With the advent of reticulate, combining R and Python in a single .Rmd document has become increasingly popular among the R community (myself included). Now, my personal workflow usually starts with an R script and, at some point, I create a shareable report using knitr::spin() with the plain .R document as input in order to avoid code duplication (see also Knitr's best hidden gem: spin for more on the topic).

However, as soon as Python code is involved in my analysis, I am currently forced to break this workflow and manually convert (i.e. copy and paste) my initial .R script into .Rmd before compiling the report. I wonder, does anybody know whether it is – or for that matter, will ever be – possible to make knitr::spin() work with both R and Python code chunks in a single .R file without taking this detour? I mean, just like it works when mixing the two languages, and exchanging objects between them, in a .Rmd file. There is, at least to the best of my knowledge, no possibility to add something like engine = 'python' to spin documents at the moment.



cderv commented 4 years ago

I find this an interesting question. It made me dig into the spin function.

For now, spin is used to convert an R script to a literate programming document. One fact here is that an R script can only contain comments and R code. Thus, spin internally uses the base R parser parse() to do its magic https://github.com/yihui/knitr/blob/dccdad769b67a23d9cf8b8414f844e2a6a74f21e/R/spin.R#L69 and parse() from base R only knows how to deal with R code.

Independently of knitr and spin, I don't think an R script file can contain Python code too. It works in a literate programming format like Rmd but not in a script file format. You would not put literal R code in a .py script, and thus no Python code in a .R script. Seems fair to me.

I think all this makes it difficult for an R script to contain anything other than R code. Technically, with knitr you can change the engine with the engine option (see knitr engines), but it would not work with spin because it will try to parse all code as R code before even looking at the engine.

Example

Here is an example of what is blocking. I put this in test.R

#+ setup, include = FALSE
library(reticulate)
use_miniconda()

# Using python directly
#+ engine = python
import os
os.getenv("RSTUDIO_PANDOC")

# Using R with reticulate
os <- import("os")
os$getenv("RSTUDIO_PANDOC")

We get this error

Error in parse(text = x, keep.source = TRUE) : 
  <text>:7:8: unexpected symbol
6: #+ engine = python
7: import os
          ^

This is because what we know to be Python code is parsed by R as R code, and it is invalid R code.

Solutions ?

If we really want an equivalent of spin (comments and code only, instead of text and code chunks) for multi-engine code, I think it would require a new format (.Rmix ?) with its own parsing logic to identify non-R code in the uncommented parts and process it correctly. However, it would be really similar to the Rmd format I guess, and not easy to maintain 🤔

Out of curiosity, why do you prefer spin and an R script file when you are working on a Python + R project instead of using Rmd files directly? Rmd files are a good container for mixing both languages in order to produce a report.

Also, the answer you got on SO is really interesting: using source_python in an R script to mix Python and R. This is like child documents in R Markdown, where you split your report into several reusable documents. Pretty clever and really easy!

I am just sharing some thoughts here on this topic to contribute to the discussion. Hope it helps.

fdetsch commented 4 years ago

Thanks for sharing your thoughts on that. I fully agree that it's a fair split that R and Python have their own places in .R and .py files, respectively. With the seamless integration of Python in .Rmd at hand (and probably lacking in-depth knowledge of the underlying mechanisms), I just became curious whether a similar integration could be feasible for spin(). Call it wishful thinking 😉

As regards your question: for me coming from the R side, it feels like a more native approach to start out with a plain .R script. My files usually include >95% code (mostly R) vs. <5% (mostly informal) comments, which renders spin() the ideal solution. In .Rmd, you need to explicitly insert a code chunk whenever you want to perform actual coding. Maybe I am just lazy about writing, but this seems a little over the top for my purposes where results are mostly conveyed via tables and figures, which can be accomplished quite simply using spin().

yihui commented 4 years ago

#+ engine="python" should have worked. The PR #1605 made it fail. This is a known bug, as I replied at https://github.com/yihui/knitr/pull/1605#issuecomment-478132556. Since @Hemken didn't file a new issue, I have completely forgotten it. Sorry.

That said, I probably won't have time to fix it in the near future...

cderv commented 4 years ago

Oh thanks! I did not notice that.

It was not intuitive for me that an .R script should contain several languages in the same file using knitr features. I would rather expect special multi-language formats to live in special files like .Rmd (.Rk, .Rknitr or else), to indicate clearly that the file can't be run like an R script (i.e. Rscript -f my-py-and-Rcode.R) but requires a special tool, here knitr. So I did not think of it as a bug.

My opinion being shared (☺️ ) and now that I know this is a bug, I can dig into it during my spare time. That does not mean right now, but in a near future that may be closer than yours 😉

Hemken commented 4 years ago

My solution was just to check if the language was R before the parsing step. If not, then skip that check (with the consequence that multiline comments etc. remain problematic in languages other than R).

I’ll try to dig out this code and make a clean pull request so that you can review it.
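The check described in this comment can be sketched roughly as follows. This is an illustrative Python model of the idea only, not knitr's code: `chunk_engine`, `split_chunks`, and `chunks_to_validate` are hypothetical names, and the header handling is a simplification that assumes chunks are delimited only by `#+` lines.

```python
import re

# Hypothetical pattern: finds `engine = python`, `engine = "python"`, etc.
ENGINE_RE = re.compile(r"engine\s*=\s*['\"]?(\w+)")


def chunk_engine(header: str) -> str:
    """Return the engine named in a '#+' chunk header, defaulting to 'r'."""
    match = ENGINE_RE.search(header)
    return match.group(1).lower() if match else "r"


def split_chunks(lines):
    """Group script lines into (engine, code_lines) pairs at each '#+' header."""
    chunks = [("r", [])]
    for line in lines:
        if line.startswith("#+"):
            chunks.append((chunk_engine(line), []))
        else:
            chunks[-1][1].append(line)
    # Drop chunks that hold nothing but blank lines.
    return [(engine, code) for engine, code in chunks
            if any(s.strip() for s in code)]


def chunks_to_validate(lines):
    """Only R chunks would be handed to the R parser; the rest are skipped."""
    return [code for engine, code in split_chunks(lines) if engine == "r"]


TEST_R = """#+ setup, include = FALSE
library(reticulate)
use_miniconda()

# Using python directly
#+ engine = python
import os
os.getenv("RSTUDIO_PANDOC")

# Using R with reticulate
os <- import("os")
os$getenv("RSTUDIO_PANDOC")
"""

for engine, code in split_chunks(TEST_R.splitlines()):
    print(engine, len(code))
```

Note that in this model everything after the `#+ engine = python` header, including the trailing R lines, belongs to the Python chunk, because no later `#+` header switches the engine back.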

yihui commented 7 months ago

With https://github.com/yihui/knitr/commit/74bcff85455a9f92a3099abecba1a8d70ab0cfa4, using engine = "python" should work now, although I tend to agree with @cderv above that it doesn't feel right to have both Python and R code in the same script.

cat(spin(knit = FALSE, text = '#+ setup, include = FALSE
library(reticulate)
use_miniconda()

# Using python directly
#+ engine = "python"
import os
os.getenv("RSTUDIO_PANDOC")

# Using R with reticulate
os <- import("os")
os$getenv("RSTUDIO_PANDOC")'), sep = '\n')
```{r setup, include = FALSE}
library(reticulate)
use_miniconda()

# Using python directly
import os
os.getenv("RSTUDIO_PANDOC")

# Using R with reticulate
os <- import("os")
os$getenv("RSTUDIO_PANDOC")
```

katrinabrock commented 2 weeks ago

I might be missing something...is there currently a way to spin a fully python script into Rmd?

As in start with example.py that contains

#' Example text

# example comment
print([i for i in 'abcdefg'])

Run something like knitr::spin('example.py', knit = FALSE) and end up with example.Rmd something like:

Example text

```{python}
# example comment
print([i for i in 'abcdefg'])
```

The closest I've got is adding this to the top of my python file:

#+ eval=TRUE, include=FALSE
knitr::opts_chunk$set(engine = 'python')


However, here I'm adding some R to a python script, so the script can no longer run fully on its own. I know one way is to set the option at the knit stage, but I would like to produce an Rmd script that runs on its own.

Hemken commented 2 weeks ago

I think the concept here is that you skip the Rmd file – it is never produced. Instead of RMD -> MD -> final document, you go from source script -> MD -> final document. In the case of a Python script it would go from PY -> MD -> final document.

BITD a lot of us used to write source code with lots of comments. “Spinning” is just a way of supercharging those comments to mimic knitting and weaving.

Doug Hemken

Statistical consultant (retired) Social Science Computing Cooperative Univ. of Wisc. – Madison

cderv commented 2 weeks ago

@katrinabrock currently, #+ engine = 'python' needs to be set on each part of the code. No way to globally set it I think. cc @yihui - should we consider python if spin is on a .py file ?

#' Example text
#' 

#+ engine='python'
# example comment
print([i for i in 'abcdefg'])

> I think the concept here is that you skip the Rmd file – it is never produced.

@Hemken with spin() an Rmd file is produced, and then rmarkdown::render() is called on it.

katrinabrock commented 2 weeks ago

@cderv Indeed #+ engine = 'python' only applies to the subsequent chunk (and is ignored by python 👍 ), but running knitr::opts_chunk$set(engine = 'python') inside a chunk sets it for the whole document (or until unset). This behavior is documented here.

I'm trying to think if there is a creative way...maybe with multiline quotes and/or adjusting the spin regexes ...to get python interpreter to ignore that line while spin can see it.

EDIT: Here's the best I've come up with so far:

# /*
''' # */
#+ eval=TRUE, include=FALSE
knitr::opts_chunk$set(engine = 'python')
# /*
''' # */

It's super ugly, but it avoids a Python syntax error (or any behavior change) while adding the following to the .Rmd (which results in subsequent {r} blocks being interpreted as Python).

```{r eval=TRUE, include=FALSE}
knitr::opts_chunk$set(engine = 'python')
```

cderv commented 2 weeks ago

> running knitr::opts_chunk$set(engine = 'python') inside a chunk sets it for the whole document (or until unset). This behavior is documented here.

I know that, but this is R code, so it will lead to R cells in the .Rmd that is created. I thought you did not want that, especially because the script is a .py file.

I think the best way would be to consider that all code in a .py script is to be placed in an engine = "python" code cell.

But that is not how spin() works for now, unfortunately.

katrinabrock commented 2 weeks ago

Yes, the result of my workaround in the Rmd is there is one (real, hidden) R cell that sets the option. Then all the rest of the cells are "R" cells with {r}, but they contain python code and the python code runs successfully when I knit. To me, for .py files this is better than the workaround of adding #+ engine = "python" to each cell because I would then have to sprinkle that line all over my script and if I missed a spot, knitting would fail. With the crazy ''' # */ chunk above, my .py file even still runs cleanly by itself because the R code it contains is sequestered into a string. (But released at the spin stage.)
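To make the "sequestered into a string" point concrete, here is a small Python check of both views of the workaround. The first half uses the standard `ast` module to confirm that Python sees the R line as part of a string literal. The second half is only a rough model of the spin side: it assumes spin drops everything between lines matching the two regexes below (my understanding of the defaults of spin's `comment` argument, not verified here), and `drop_block_comments` is a hypothetical helper.

```python
import ast
import re

SCRIPT = """\
# /*
''' # */
#+ eval=TRUE, include=FALSE
knitr::opts_chunk$set(engine = 'python')
# /*
''' # */

# example comment
print([i for i in 'abcdefg'])
"""

# View 1: what Python sees. The R line sits inside a string literal.
module = ast.parse(SCRIPT)
first = module.body[0]
hidden_in_string = (
    isinstance(first, ast.Expr)
    and isinstance(first.value, ast.Constant)
    and "knitr::opts_chunk" in first.value.value
)

# View 2: what a spin-like pass would see, ASSUMING these delimiters.
COMMENT_START = re.compile(r"^[# ]*/[*]")
COMMENT_END = re.compile(r"^.*[*]/ *$")


def drop_block_comments(lines):
    """Drop every line from a start-delimiter line through an end-delimiter line."""
    kept, inside = [], False
    for line in lines:
        if not inside and COMMENT_START.search(line):
            # A line may open and close a comment at once.
            inside = not COMMENT_END.search(line)
            continue
        if inside:
            if COMMENT_END.search(line):
                inside = False
            continue
        kept.append(line)
    return kept


visible_to_spin = drop_block_comments(SCRIPT.splitlines())
print(hidden_in_string)
print("\n".join(visible_to_spin))
```

Under those assumptions, the two `# /*` and `''' # */` pairs vanish on the spin side while the `#+` header and the `opts_chunk$set()` call survive, which matches the behaviour described above.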

Indeed, I would prefer a "real" fix where spin recognizes that this is a .py file and inserts {python} instead of {r} and thus neither my workaround nor yours would be needed.
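For illustration, the behaviour I have in mind could look something like the sketch below. It is a hypothetical stand-alone Python converter, not part of knitr: `spin_py` is a made-up name, and it only handles `#'` prose lines (no `#+` chunk options, inline expressions, or block comments).

```python
FENCE = "`" * 3  # built programmatically to keep this example fence-safe


def spin_py(source: str) -> str:
    """Turn a .py script with #' prose lines into Rmd text with {python} chunks."""
    out, code = [], []

    def flush():
        # Trim blank lines around the pending code, then emit it as one chunk.
        while code and not code[0].strip():
            code.pop(0)
        while code and not code[-1].strip():
            code.pop()
        if code:
            if out and out[-1] != "":
                out.append("")
            out.extend([FENCE + "{python}", *code, FENCE, ""])
        code.clear()

    for line in source.splitlines():
        if line.startswith("#'"):
            flush()
            text = line[2:]
            # Strip at most one space after the #' marker.
            out.append(text[1:] if text.startswith(" ") else text)
        else:
            code.append(line)
    flush()
    return "\n".join(out).rstrip() + "\n"


EXAMPLE = """#' Example text

# example comment
print([i for i in 'abcdefg'])
"""

print(spin_py(EXAMPLE))
```

Because the input stays a plain .py file with only comments added, it still runs on its own, and the generated Rmd needs no hidden R cell to switch engines.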