knitr engine API and cache compatibility with reticulate engine

tmastny commented 6 years ago

I think whether or not this is a knitr or reticulate bug depends on the knitr engine API, which I do not completely understand. I carefully searched for this bug in the knitr and reticulate issues and didn't see anything, so I apologize if this is already known.

The bug

Suppose we have the following file called python_test.Rmd:

---
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE, 
                      results='show', cache=TRUE, autodep=FALSE)
knitr::opts_knit$set(progress = TRUE, verbose = TRUE)
knitr::knit_engines$set(python = reticulate::eng_python)

x = 1
print(x)

print(x + 9)

When you first press the **Knit** button the document compiles successfully.

Now suppose I change chunk2, so it has `print(x + 10)` and I save the file. If I try clicking the **Knit** button I get the following error:

Quitting from lines 19-20 (python_test.Rmd) Error in py_run_string_impl(code, local, convert) : NameError: name 'x' is not defined

Detailed traceback: File "", line 1, in Calls: ... force -> py_run_string -> py_run_string_impl -> .Call Execution halted

----
## My efforts to debug
Here's what I've learned about the error:

It reliably happens when I call `knitr::knit('python_test.Rmd', envir = new.env())` *after a session restart*, so I don't think it is an `rmarkdown::render` error. 

The error message points to `py_run_string_impl`, which is a `reticulate` function. But I believe the problem arises before `knitr` reaches the python engine. 

When you call `knitr::knit('python_test.Rmd', envir = new.env())`, chunk1 eventually enters the `call_block` function in `knitr/R/block.R`. It passes `if (params$cache > 0)`, and the hash comes up with the same value. Then `cache$load` tries to bring the saved data into `knit_global`. 

At this point, if you `ls(knit_global())` you'll see `character(0)`. So it isn't clear if the python object `x=1` was even saved. 

However, whether it was saved doesn't even matter because it doesn't get a chance to use it. When `chunk2` starts down the same path, its hash has changed so it moves onto `block_exec`. If this were R code, it would have access to the `cache$load`ed objects from chunk1 with `env = knit_global()`, but non-R engines go down a separate branch.

`block_exec` tells the reticulate engine to execute `print(x + 9)`, but it fails because it doesn't know `x`. You can verify this in `eng_python_synchronize_before` by checking to see if `main` contains `x`, which it doesn't (assuming you alled `knit` after a session restart). The only thing that passes to the `eng_python` is `options` which as far as I can tell doesn't include any environment information such as `x=1`. 

## What is not clear to me

Despite crawling through the reticulate source code, when `cache=FALSE`, I'm not actually sure how the state is saved between chunks. Each time a python chunk is executed by `eng_python` `reticulate/R/knitr-engine.R` it calls `import_main` which provides an object `main` that has the previous chunk's variables (`x = 1`), but I don't see where this data is saved chunk to chunk. 

I know the `main` data has to be saved somewhere, because if you don't restart the session after calling `knitr::knit('python_test.Rmd', envir = new.env())` the `main` variable will still contain `x = 1`, even though the `knit(..)` call never runs `x = 1` or loads it into memory since it was cached.

--- 
## What to do?
I know everyone's busy, so I'm happy to help by making a PR but it is not clear to me how to fix this.

Ideas:
1. Factor out the chunk hashing. If `autodep=FALSE`, then if any cache fails you have to `block_exec` every chunk. 

2. Rethink the engine API, so the language can provide loading/saving for its chunks. In fact, reticulate already supports pickle, which is a python package that can save and load python objects. As far as I can tell, this would still require refactoring some knitr code since engines don't touch caching at all at the moment, as far as I can tell.

3. Suppose `.RData` is able to save the python object (I don't know if possible). Similar to number 2, you could slightly change the engine API so you pass that object to the engine to process.

---
## Session data

sessionInfo() R version 3.4.1 (2017-06-30) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: OS X El Capitan 10.11.6

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached): [1] compiler_3.4.1 backports_1.1.2 magrittr_1.5.0 rsconnect_0.8.5 [5] rprojroot_1.3-2 htmltools_0.3.6 tools_3.4.1 yaml_2.1.16
[9] Rcpp_0.12.15 stringi_1.1.6 rmarkdown_1.8.10 knitr_1.19.2
[13] stringr_1.2.0 digest_0.6.15 evaluate_0.10.1

yihui commented 6 years ago

I'm afraid I'll have to defer this issue to @kevinushey (author of the python engine in reticulate).

kevinushey commented 6 years ago

For what it's worth, I never tried wiring in cache support into the reticulate engine as I wasn't exactly sure what that would entail, but it sounds like we'd need:

Serialization of the 'main' module (probably with pickle?)
A list of Python modules to be imported (what about state within a particular module?)

tmastny commented 6 years ago

Thanks @kevinushey. Something else to look into is dill which extends pickle. From the readme:

Hence, it would be feasable to save a interpreter session, close the interpreter, ship the pickled file to another computer, open a new interpreter, unpickle the session and thus continue from the 'saved' state of the original interpreter session.

And the reason I opened the issue here is because I think knitr will also require some refactoring to allow specific engines to handle caching. Right now all caching is handled by the cache methods in R/block.R call_block for loading and block_cache for saving.

If this is something you'd like for reticulate, I'd be interested in helping out with PRs. I'd like to be able write Python with rmarkdown using all the knitr features.

kevinushey commented 6 years ago

I'd definitely be open to reviewing a PR, but it seems like this will be tough to get right and I unfortunately won't have that much time to help with the actual implementation in the coming months.

leogama commented 2 years ago

Hi! I'm working on it (by sheer necessity). There are some some serious problems on the dill package at the moment, but I'm also updating some currently broken logic in the code proposed by @tmastny and it is mostly working by now, with some edge cases requiring monkey patches. As soon as the bugs on dill are fixed, I'll send a new pull request based on his code.

Beyond basic usage, I find that a Python cache engine for knitr is essential. We need this, folks! :rocket:

yihui / knitr

knitr engine API and cache compatibility with reticulate engine #1505

The bug