yihui / knitr

A general-purpose tool for dynamic report generation in R
https://yihui.org/knitr/

Caching reference objects #2176

Open rhijmans opened 2 years ago

rhijmans commented 2 years ago

Is there a way, or could support be added, to tell knitr how to serialize objects of a particular class before caching, and how to restore them after caching? See this and this question on Stack Overflow for more context.

The "terra" package uses S4 objects with one slot that holds a reference to a C++ object (via an Rcpp module). These S4 objects can be serialized and unserialized with very little effort. For example, one can do:

library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)  # S4 object that has a reference class
x <- serialize(r, NULL) 
# recreate the object 
s <- terra::rast( unserialize(x) )  
s
#class       : SpatRaster 
#dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#source      : memory 
#name        : elevation 
#min value   :       141 
#max value   :       547 

But can one use that mechanism with knitr caching? I suppose that it could be possible, in principle, to specify that objects of class "SpatRaster" should be passed to serialize before caching, and that terra::rast should be used when restoring the object. Whether this is easy to do in practice, I do not know.
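
To make the scenario concrete, here is a sketch of the kind of document I have in mind (the chunk layout is only an illustration):

# chunk 1, with cache=TRUE
library(terra)
r <- rast(system.file("ex/elev.tif", package = "terra"))

# chunk 2, not cached: on a later knit, r is restored from the cache, and without
# special handling the external pointer to the C++ object is no longer valid
plot(r)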



yihui commented 2 years ago

Please ignore my initial reply below. I was using an old version of terra. It seems the latest version does support writing and reading objects from .rds now. I'm investigating and will see what I can do.


Thanks for the idea! I think knitr's cache does use serialize() and unserialize() under the hood (via functions like save() or saveRDS()) to write R objects to disk. The actual tricky part, as mentioned in the comments on the SO post, is unserializing the data from a cache file in a new R session; that is when it breaks. Your example shows that it works in memory within the same R session.

library(terra)
r = rast(matrix(1:12,3,4))
saveRDS(r, 'rast-data.rds')
readRDS('rast-data.rds')
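# one way to see the cross-session problem (an assumption on my part: using the
# callr package to run readRDS() in a fresh R process, where the C++ object
# behind the external pointer no longer exists, so using the value should fail)
callr::r(function() terra::ncell(readRDS('rast-data.rds')))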

unlink('rast-data.rds')

If there were a way to write the object to a file and read it back later, it would be possible to support caching such objects.

yihui commented 2 years ago

Okay, I have a better idea on how to implement this now. Basically, as you hinted in the SO posts, neither save() + load() nor tools:::makeLazyLoadDB() + lazyLoad() works for these raster objects, but saveRDS() + readRDS() works. That means I can cache to .rds files instead of .RData.

However, again, as you mentioned, users would need to call terra::rast() on the value from readRDS(). Although I could do that automatically in knitr, I would prefer not to deal with this type of special case unless I must. I wonder if it's possible for the terra package to automatically deal with packed objects. I'm asking since I just realized who you are 20 minutes ago :)
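
To be concrete, the restore step with an .rds cache would look roughly like this (the file name is just a placeholder):

r = terra::rast(readRDS('some-cache-file.rds'))  # readRDS() alone is not enough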

rhijmans commented 2 years ago

I agree that you should not have to deal with special cases. Here are some ideas for "terra" and some new things that I have implemented. Feedback appreciated.

1) It would be possible, I think, to automagically restore the "terra" objects if (a) instead of creating separate packed objects like PackedSpatRaster, I put the data in a slot of the original object (SpatRaster), and (b) every time a SpatRaster is used, a check is done to see whether it needs to be unpacked (a rough sketch follows below). That would mean adding the check to all functions that take a SpatRaster (there are several hundred). A bit of a nightmare, but doable. There could also be a huge performance penalty, because the unpacking would need to be done every time the object is used (unless it is possible to safely overwrite the object in the global environment). If only there were something like an "onReadRDS" event that could trigger unwrapping...
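
A rough sketch of the check in (b); the helper and the two functions it calls are made up for illustration:

.check_unpack <- function(x) {
  # hypothetical helper that every terra function taking a SpatRaster would call:
  # if the object carries packed data rather than a live C++ pointer, rebuild it
  if (is_packed(x)) unpack(x) else x  # is_packed()/unpack() are placeholders
}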

2) I have added a method unwrap that could be run after a cached block. It restores the original object if it is a PackedSpatRaster or PackedSpatVector, and lets all other objects pass through. That is much better than using rast, because it works in all cases; if you run rast on a SpatRaster, it returns a different object (a template with no cell values). But this would mean that a user has to remember to call it on all terra objects used after a cached block (as sketched after the example below). Not pretty.

library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)  # S4 object that has a reference class
p <- wrap(r)  # pack: creates a PackedSpatRaster that can be safely serialized
unwrap(p)     # restore the original SpatRaster
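
In a knitr document, idea (2) would mean that the first chunk using a cached terra object has to start with something like:

r <- unwrap(r)  # no-op for ordinary objects; restores PackedSpatRaster/PackedSpatVector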

3) Allow for setting an unwrap method that is called after restoring the cache. Something in the spirit of

cacherestore: terra::unwrap

All objects could pass through, and a user could provide more specialized functions. That would still be a bit of additional work, but probably a nicer user experience than having to call unwrap on all terra objects in blocks after a cached block.

4) This may be the best and the worst solution. I have made readRDS and unserialize generic functions, and now you get the terra version if you load the package (unless another package does the same and is loaded later). I make them do:

base::readRDS("file.rds") |> unwrap()
# and
base::unserialize(x) |> unwrap()

And now I see this:

x <- serialize(r, NULL)
(y <- unserialize(x))  # parentheses so that the restored object is printed
#class       : SpatRaster 
#dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#source      : memory 
#name        : elevation 
#min value   :       141 
#max value   :       547 

And

frds <- "test.rds"
saveRDS(r, frds) 
readRDS(frds)
#class       : SpatRaster 
#dimensions  : 90, 95, 1  (nrow, ncol, nlyr)
#resolution  : 0.008333333, 0.008333333  (x, y)
#extent      : 5.741667, 6.533333, 49.44167, 50.19167  (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326) 
#source      : memory 
#name        : elevation 
#min value   :       141 
#max value   :       547 

This could work well with knitr caching if knitr can use saveRDS. However, while saveRDS dispatches on the object to be saved (say, a SpatRaster), readRDS dispatches on "character" (the filename). So this seems a bit fragile, as another package could overwrite the method. While method overwriting is not uncommon, this one would be hard for a user to spot, and they could not fix it by calling terra::readRDS because they do not call readRDS directly.

Also, you have to load the package for this to work. It does not work like this in a clean session:

f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)
frds <- "test.rds"
saveRDS(r, frds) 
readRDS(frds)

But this works:

terra::saveRDS(r, frds) 
terra::readRDS(frds)

Perhaps an ideal long-term solution would be for base to have a generic unwrap method that is called in readRDS such that packages could implement it for different types of wrapped objects.

yihui commented 2 years ago

Thanks for these ideas! For now, I feel the best way might be that I provide an S3 generic function in knitr to process cached objects. Then other package authors like you can register the methods in their own packages. The default method of this function will be simply the identity function, i.e., function(x) x.
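
A rough sketch of what I mean (the function name here is made up and not final):

# in knitr: a generic applied to every object restored from the cache
cache_postprocess = function(x, ...) UseMethod('cache_postprocess')
cache_postprocess.default = function(x, ...) x  # identity: most objects pass through
# a package like terra could then provide a method for its own class, e.g.
# cache_postprocess.PackedSpatRaster = function(x, ...) terra::unwrap(x)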

rhijmans commented 2 years ago

Would the downside of that solution be that these packages then would need to depend on knitr?

A similar step would be needed when creating the cache, unless you can use saveRDS. Using saveRDS would be great, I think, because it is very general.

yihui commented 2 years ago

I'll use saveRDS().

You don't need to depend on knitr. Here is the trick we've been using to register S3 methods dynamically on load: https://github.com/rstudio/htmltools/blob/5fa01e7197143844be141dcc0cca85096059497d/R/tags.R#L22-L56
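
Roughly, the registration could live in terra's .onLoad() like this (a simplified sketch of what the linked htmltools code does, reusing the hypothetical cache_postprocess() generic from above):

.onLoad = function(libname, pkgname) {
  register = function() registerS3method(
    'cache_postprocess', 'PackedSpatRaster',
    function(x, ...) terra::unwrap(x), envir = asNamespace('knitr')
  )
  if (isNamespaceLoaded('knitr')) register()
  # also register if knitr is loaded after terra
  setHook(packageEvent('knitr', 'onLoad'), function(...) register())
}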

gavril0 commented 7 months ago

I have submitted a new issue (#2339) that might be related. Caching a torch module or dataset causes an "external pointer not valid" error, which might be due to the fact that the torch package relies on reference classes.

atusy commented 7 months ago

@rhijmans @yihui I opened #2340 to solve this issue, although I am not sure if I fully understand the discussion.

This PR adds knit_cache_preprocess and knit_cache_postprocess, which can modify objects being saved and loaded, respectively. Does it meet @rhijmans's request?

registerS3method("knit_cache_preprocess", "terra", function(x) {
  # modify the object here before it is written to the cache
  x
}, envir = asNamespace("knitr"))
registerS3method("knit_cache_postprocess", "terra", function(x) {
  # modify the object here after it is read back from the cache
  x
}, envir = asNamespace("knitr"))
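
For terra specifically, I imagine the methods could roughly look like this (assuming these generics dispatch on the class of the cached object; untested):

registerS3method("knit_cache_preprocess", "SpatRaster", function(x) {
  terra::wrap(x)    # pack the object so it can be written to the cache
}, envir = asNamespace("knitr"))
registerS3method("knit_cache_postprocess", "PackedSpatRaster", function(x) {
  terra::unwrap(x)  # restore the live object when the cache is loaded
}, envir = asNamespace("knitr"))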

cderv commented 3 months ago

@atusy can you update the example above to show how the new PR logic with hooks could solve this issue?

thanks!

atusy commented 3 months ago

Sure. I will add comments after I recover from COVID-19...

cderv commented 3 months ago

after I recover from COVID-19...

Oh 😢 - Hope you'll recover quickly and fully. Take care!