rhijmans opened this issue 2 years ago
Please ignore my initial reply below. I was using an old version of terra. It seems the latest version does support writing and reading objects from .rds now. I'm investigating and will see what I can do.
Thanks for the idea! I think knitr's cache does use serialize() and unserialize() under the hood (via functions like save() or saveRDS()) to write R objects to disk. The actual tricky thing, as mentioned in the SO post comments, is to unserialize the data from a cache file or in a new R session. That's when it will break. Your example shows that it works in-memory in the same R session.
library(terra)
r = rast(matrix(1:12,3,4))
saveRDS(r, 'rast-data.rds')
readRDS('rast-data.rds')
unlink('rast-data.rds')
If there were a way to write the object to a file and read it back later, there would be a possibility to support caching such objects.
Okay, I have a better idea on how to implement this now. Basically, as you hinted in the SO posts, neither save() + load() nor tools:::makeLazyLoadDB() + lazyLoad() works for these raster objects, but saveRDS() + readRDS() works. That means I can cache to .rds files instead of .RData.
However, again, as you mentioned, users would need to call terra::rast() on the value returned by readRDS(). Although I could do that automatically in knitr, I would prefer not to deal with this type of special case unless I must. I wonder if it's possible for the terra package to deal with packed objects automatically. I'm asking since I just realized who you are 20 minutes ago :)
I agree that you should not have to deal with special cases. Here are some ideas for "terra" and some new things that I have implemented. Feedback appreciated.
1) It would be possible, I think, to automagically restore the "terra" objects if (a) instead of creating separate packed objects like PackedSpatRaster I would put the data in a slot in the original object (SpatRaster) and (b) any time a SpatRaster is used, a check would be done if it needs to be unpacked. So that means that I would have to add that check to all functions that take a SpatRaster (there are several 100s). A bit of a nightmare but doable. There could also be a huge performance penalty because the unpacking would need to be done every time the object is used (unless it is possible to safely overwrite the object in the global environment). If only there were something like an "onReadRDS" event that could trigger unwrapping...
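The per-function check in idea (1) could look something like this minimal sketch (ensure_unpacked is a hypothetical name, not a terra function):

```r
# Hypothetical helper: the check each of those several hundred functions
# would need to run on its SpatRaster/SpatVector arguments.
ensure_unpacked <- function(x) {
  if (inherits(x, "PackedSpatRaster") || inherits(x, "PackedSpatVector")) {
    terra::unwrap(x)  # rebuild the C++-backed reference object
  } else {
    x  # everything else passes through untouched
  }
}
```

The performance worry above is that this check would run on every call, for every argument, across the whole package.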
2) I have added a method unwrap that could be run after a cached block. It will restore the original object if it is a PackedSpatRaster or PackedSpatVector, and let all other objects pass through. That is much better than using rast, because unwrap works in all cases, whereas if you run rast on a SpatRaster it returns a different object (a template with no cell values). But this would mean that a user would have to remember to call unwrap for all terra objects used after a cached block. Not pretty.
library(terra)
f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f) # S4 object that has a reference class
p <- wrap(r)
unwrap(p)
3) knitr could offer a way to register a restore function that is applied to all cached objects, for example: cacherestore: terra::unwrap. All objects could pass through, and a user could provide more specialized functions. That would still be a bit of additional work, but probably a nicer user experience than having to use unwrap on all terra objects in blocks after a cached block.
4) This may be the best and the worst solution. I have made readRDS and unserialize generic functions, and now you get the terra version if you load the package (unless another package does the same and is loaded later). I make them do:
base::readRDS("file.rds") |> unwrap()
# and (unserialize takes a connection or raw vector, not a file name)
base::unserialize(con) |> unwrap()
And now I see this:
x <- serialize(r, NULL)
y <- unserialize(x)
(y <- unserialize(x))
#class : SpatRaster
#dimensions : 90, 95, 1 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : 5.741667, 6.533333, 49.44167, 50.19167 (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326)
#source : memory
#name : elevation
#min value : 141
#max value : 547
And
frds <- "test.rds"
saveRDS(r, frds)
readRDS(frds)
#class : SpatRaster
#dimensions : 90, 95, 1 (nrow, ncol, nlyr)
#resolution : 0.008333333, 0.008333333 (x, y)
#extent : 5.741667, 6.533333, 49.44167, 50.19167 (xmin, xmax, ymin, ymax)
#coord. ref. : lon/lat WGS 84 (EPSG:4326)
#source : memory
#name : elevation
#min value : 141
#max value : 547
This could work well with knitr caching if knitr can use saveRDS. However, while saveRDS has a specific signature for the object to be saved (say, a SpatRaster), the signature of readRDS is "character" (the filename). So this seems a bit fragile, as other packages could overwrite it. While method overwriting is not uncommon, this one would be hard for a user to spot, and they could not fix it with terra::readRDS because they do not call readRDS themselves.
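For concreteness, the masking described in (4) could be shaped roughly like this sketch (an assumption about the code, not terra's actual implementation):

```r
# A package-level readRDS() that masks base::readRDS() and unwraps any
# packed terra object on the way in; everything else returns unchanged.
readRDS <- function(file, refhook = NULL) {
  out <- base::readRDS(file, refhook = refhook)
  if (inherits(out, "PackedSpatRaster") || inherits(out, "PackedSpatVector")) {
    terra::unwrap(out)  # restore the reference-class object
  } else {
    out  # non-terra objects pass through as with base::readRDS()
  }
}
```

For ordinary objects this behaves exactly like base::readRDS(), which is why the masking goes unnoticed until another package does the same thing.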
Also, you have to load the package for this to work. It does not work like this in a clean session:
f <- system.file("ex/elev.tif", package="terra")
r <- terra::rast(f)
frds <- "test.rds"
saveRDS(r, frds)
readRDS(frds)
This works:
terra::saveRDS(r, frds)
terra::readRDS(frds)
Perhaps an ideal long-term solution would be for base to have a generic unwrap method that is called in readRDS, such that packages could implement it for different types of wrapped objects.
Thanks for these ideas! For now, I feel the best way might be that I provide an S3 generic function in knitr to process cached objects. Then other package authors like you can register the methods in their own packages. The default method of this function will simply be the identity function, i.e., function(x) x.
Would the downside of that solution be that these packages then would need to depend on knitr?
The same would be needed when creating the cache, unless you can use saveRDS. Using saveRDS would be great, I think, because that is very general.
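As a toy illustration of why saveRDS()-based caching composes well with per-class hooks (the helper names here are invented for the sketch):

```r
# Hypothetical cache helpers: a pre-hook runs before writing (e.g. a
# wrap() step) and a post-hook after reading (e.g. an unwrap() step);
# the identity function is the default for both.
cache_store <- function(x, file, pre = identity) {
  saveRDS(pre(x), file)
  invisible(file)
}
cache_fetch <- function(file, post = identity) {
  post(readRDS(file))
}
```

With the identity defaults this is an ordinary saveRDS()/readRDS() round trip; a package would only have to supply its own pre/post pair for its classes.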
I'll use saveRDS().
You don't need to depend on knitr. Here is the trick we've been using to register S3 methods dynamically on load: https://github.com/rstudio/htmltools/blob/5fa01e7197143844be141dcc0cca85096059497d/R/tags.R#L22-L56
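A minimal, terra-free demonstration of that registration trick, using registerS3method() with an identity default as proposed above (the generic and class names mirror the proposal, but the method body is a stand-in):

```r
# The generic knitr would own, with identity as the default method:
knit_cache_postprocess <- function(x) UseMethod("knit_cache_postprocess")
knit_cache_postprocess.default <- function(x) x  # identity by default

# What another package could do at load time, instead of declaring the
# method in its NAMESPACE. "PackedSpatRaster" is just a class label on a
# plain list here; terra's real method would call terra::unwrap(x).
registerS3method("knit_cache_postprocess", "PackedSpatRaster",
                 function(x) {
                   attr(x, "restored") <- TRUE  # stand-in for unwrap()
                   x
                 })
```

Because the method is registered rather than exported, the registering package needs knitr only in Suggests, not as a hard dependency.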
I have submitted a new issue (#2339) that might be related. Caching a torch module or dataset causes an 'external pointer not valid' error, which might be due to the fact that the torch package relies on reference classes.
@rhijmans @yihui I opened #2340 to solve this issue, although I am not sure if I fully understand the discussion.
This PR adds knit_cache_preprocess and knit_cache_postprocess, which can modify objects being saved and loaded, respectively.
Does it meet @rhijmans's request?
registerS3method("knit_cache_preprocess", "SpatRaster", function(x) {
  # modify the object here before it is cached (e.g., terra::wrap(x))
  x
}, envir = asNamespace("knitr"))
registerS3method("knit_cache_postprocess", "PackedSpatRaster", function(x) {
  # modify the object here after it is loaded (e.g., terra::unwrap(x))
  x
}, envir = asNamespace("knitr"))
@atusy can you update the example above on how the new PR logic with hooks could solve this issue?
Thanks!
Sure. I will add comments after I recover from COVID-19...
after I recover from COVID-19...
Oh 😢 - Hope you'll recover quickly and correctly. take care !
Is there a way, or could support be added, to tell knitr how to serialize objects of a particular class before caching, and how to restore them after caching? See this and this question on Stack Overflow for more context.
The "terra" package uses S4 objects with one slot that has a reference to a C++ object (via an Rcpp module). These S4 objects can be serialized and unserialized with very little effort. For example, one can do
But can one use that mechanism with knitr caching? I suppose that it could be possible, in principle, to specify that objects of class "SpatRaster" should be passed to serialize before caching, and that terra::rast should be used when restoring the object. Whether this is easy to do in practice, I do not know.