Open goldingn opened 9 years ago
If we could spawn modules in their own, clean environments, load the things they need (and only them), then take their products and destroy them, that would be a lovely clean solution.
One approach would be to use existing parallel computing libraries to spawn child processes. E.g. using snowfall
something like this might work:
sfInit(cpus = 1, parallel = FALSE)
sfLibrary(list_of_module_dependencies)
sfExport(input_data_and_module_arguments)
ans <- sfLapply(NULL, some_model_module_in_a_wrapper)[[1]]
sfStop()
We could also act on lists to do parellelism easily, though we'd want to load different packages on different cluster nodes.
A downside of child processes is that it makes debugging much harder.
Tim's suggestion would be worth trying first, although it doesn't protect the modules from packages the user has loaded previously.
any thoughts on this appreciated - maybe @richfitz has ideas?
I don't know much about S4 classes or how often they are going to be a problem. Was the leaflet example an issue with S4? I get the feeling the detach("package:packagename", unload=TRUE)
would fix most problems if not quite all.
It isn't totally obviously how to make sure modules even remotely clear up after themselves as they are user contributed. I don't know if you could add a line in GetPackage
that runs detach in the parent environment (i.e. the module function.)
Something like
package <- as.character(substitute(package))
str <- paste0('package:', package)
eval.parent(on.exit(detach(str, unload=TRUE)) , n= 1)
The idea being, that this will be evaluated in the module function environment so that once the module has finished, any packages used will be detached.
However, I can't work out how to get str
to evaluate immediataly, but the rest remain as an expression.
Anyway, then we only have to push that people use GetPackage
rather than library
or require
. But it doesn't fix the issue of S4 etc.
Child process might still be a better idea at least in the long term as (as you said) we want to make stuff parallel anyway. Just need to be careful that we don't duplicate big memory objects. Model objects (e.g. randomforests) can get pretty big, let alone data.
It's not really an S4 thing - that was part of that SO thread, but the detaching/unloading thing is more generic. It's down to the dependencies and loaded in by each packages, and not being able to work out reverse dependencies without trying to remove them from packages you want to keep.
I tried all the combinations of detach
and unloadNamespaces
and various functions suggested in forums for finding dependencies and removing them. I don't think there's a way to deal with it that way without rewriting R.
I'd love a way of getting isolated environments within an R session, but hands down the safest way would be a subprocess (along those lines I'm stewing on the idea of trying to replicate Python's subprocess module in R which would allow for possibly nicer management of subprocesses).
Previously, I wondered if this could be done by manipulating the search path (this might not work well for S4 classes, but then they're generally horrible things anyway). Basically, I'm wondering if you can either:
baseenv()
so that multiple "global" search paths are possibleattach
while evaluating module code to inject different package sets into the search path (using on.exit
to ensure that detach
is always run).For both approaches you'd want to prohibit user use of library
calls or everything will get messed up.
The trick with most of these is not to unload the packages but to use things like requireNamespace
to get the package set up properly (shared libraries loaded etc) and avoid unloading the namespace as doing that in the wrong order can either fail to unload or crash R if you force it.
Cool, we should definitely try attach
ing or similar then to see if that works.
In the meantime, here's a basic version of child processes using parallel
in a way similar to zoon::workflow
's current method:
# load the parallel package to spawn child processes
library(parallel)
# load some other arbitrary package as a user might
library(plyr)
# set up a cluster on three cores
cl <- makeCluster(3)
# function to laod multiple libraries
libraries <- function (list) {
lapply(as.list(list),
library,
character.only = TRUE)
}
# a list for our three modules, giving the required packages
dependencies <- list(one = c('dismo', 'spocc'),
two = c('leaflet'),
three = 'base') # load nothing
# (need to get a list of dependencies for each module
# - should enforce that these are listed in the docs)
# send packages out
clusterMap(cl, libraries, dependencies) -> .
# packages loaded for the user
names(sessionInfo()$otherPkgs)
# [1] "plyr"
# packages laoded on each child
clusterEvalQ(cl, names(sessionInfo()$otherPkgs))
# [[1]]
# [1] "spocc" "dismo" "raster" "sp"
# [[2]]
# [1] "leaflet"
# [[3]]
# NULL
# ~~~~~~~~~~~
# export modules and arguments in the zoon style
# define some silly modules (all have the x argument)
printer <- function(x) {
paste('result:', x)
}
printer2 <- function(x, ...) {
x <- as.vector(c(x, list(...)))
paste('result:', x)
}
emphasiser <- function(x) {
x <- toupper(x)
paste('result:', x)
}
# define the list of modules and their arguments
# (workflow has something like this)
modules <- list(one = list(module = printer,
paras = list(x = 'test')),
two = list(module = printer2,
paras = list(x = 'test',
a = 1)),
three = list(module = emphasiser,
paras = list(x = 'test')))
# note that we pass the modules, not their names, so that the children
# have them
# a function to execute the modules from their lists
doer <- function(module_list) {
do.call(module_list$module, module_list$paras)
}
# show that it works at the top level
(res1 <- doer(modules$one))
# [1] "result: test"
(res2 <- doer(modules$two))
# [1] "result: test" "result: 1"
(res3 <- doer(modules$three))
# [1] "result: TEST"
# execute on the child processes
(res <- clusterMap(cl, doer, modules))
# $one
# [1] "result: test"
# $two
# [1] "result: test" "result: 1"
# $three
# [1] "result: TEST"
This is all in parallel already, which is nice.
A downside is that we can't plot directly to the open graphics device with the output modules. We can always require the user to save plots instead, and that may solve a lot of other problems (e.g. compiling results into a report, other unforeseen pain) further down the line. We could easily write a ZoonPlot function to handle that nicely with a temp directory.
Here's a more concrete definition of approach 2:
in_environment <- function(expr, packages, envir=parent.frame()) {
nms <- sprintf("fakepackage:%s", packages)
cleanup <- function() {
loaded <- intersect(rev(nms), search())
for (p in loaded) {
try(detach(p, character.only=TRUE))
}
}
on.exit(cleanup())
for (i in seq_along(packages)) {
requireNamespace(packages[[i]], quietly=TRUE)
attach(asNamespace(packages[[i]]), name=nms[[i]])
}
eval(expr, envir)
}
The function in_environment
takes two arguments: an unevaluted expression and a character vector of packages to load (to be loaded first to last). It will load these packages, evaluate the expression and unload the packages in reverse order, leaving the search path untouched:
in_environment(fractions(pi), "MASS")
sweet. thanks!
Hmmm...
Running the following on the definitions in my example above, in_environment
doesn't clear out those attached packages after it's done:
in_environment <- function(expr, packages, envir=parent.frame()) {
nms <- sprintf("fakepackage:%s", packages)
cleanup <- function() {
loaded <- intersect(rev(nms), search())
for (p in loaded) {
try(detach(p, character.only=TRUE))
}
}
on.exit(cleanup())
for (i in seq_along(packages)) {
requireNamespace(packages[[i]], quietly=TRUE)
attach(asNamespace(packages[[i]]), name=nms[[i]])
}
eval(expr, envir)
}
attached <- function() names(sessionInfo()$loadedOnly)
attached()
# "tools" "Rcpp"
in_environment(attached(),
packages = dependencies$one)
# The following object is masked from fakepackage:dismo:
#
# .onLoad
#
# The following objects are masked from package:plyr:
#
# mapvalues, quickdf, rename, revalue
#
# The following object is masked from package:base:
#
# strtrim
#
# [1] "Rcpp" "tools" "digest" "lubridate" "jsonlite"
# [6] "memoise" "ecoengine" "gtable" "lattice" "rgbif"
# [11] "DBI" "rstudioapi" "mapproj" "curl" "rbison"
# [16] "proto" "dismo" "dplyr" "httr" "stringr"
# [21] "leafletR" "raster" "maps" "rvertnet" "grid"
# [26] "rebird" "data.table" "R6" "XML" "sp"
# [31] "ggplot2" "reshape2" "magrittr" "whisker" "AntWeb"
# [36] "scales" "spocc" "MASS" "assertthat" "colorspace"
# [41] "brew" "V8" "stringi" "munsell" "chron"
# [46] "rjson"
attached()
# [1] "Rcpp" "tools" "digest" "lubridate" "jsonlite"
# [6] "memoise" "ecoengine" "gtable" "lattice" "rgbif"
# [11] "DBI" "rstudioapi" "mapproj" "curl" "rbison"
# [16] "proto" "dismo" "dplyr" "httr" "stringr"
# [21] "leafletR" "raster" "maps" "rvertnet" "grid"
# [26] "rebird" "data.table" "R6" "XML" "sp"
# [31] "ggplot2" "reshape2" "magrittr" "whisker" "AntWeb"
# [36] "scales" "spocc" "MASS" "assertthat" "colorspace"
# [41] "brew" "V8" "stringi" "munsell" "chron"
# [46] "rjson"
Yeah, it doesn't unload them (which is the bit that might fail) but it will remove them from the search path which removes conflicts (check out search()
). Is there a situation where having loaded, unattached, packages would cause problems?
As a potential warning about unloading packages: if you use a package that uses Rcpp modules and unload or reload the package before all garbage collection of R objects is complete, R will crash.
Ah OK, thanks.
I'm not sure whether they need to be out of the search path or unloaded in the leafletR
vs leaflet
example that spurred this on, but will give this a go
unfortunately in_environment
seems not to work for the leaflet
case.
On a clean R session, this works just fine:
# install.packages('raster')
# devtools::install_github('environmentalinformatics-marburg/Rsenal')
library(Rsenal); library(raster)
mapView(raster(system.file("external/test.grd", package="raster")))
but this does not:
# install.packages(c('spocc', 'raster'))
# devtools::install_github('environmentalinformatics-marburg/Rsenal')
in_environment <- function(expr, packages, envir=parent.frame()) {
nms <- sprintf("fakepackage:%s", packages)
cleanup <- function() {
loaded <- intersect(rev(nms), search())
for (p in loaded) {
try(detach(p, character.only=TRUE))
}
}
on.exit(cleanup())
for (i in seq_along(packages)) {
requireNamespace(packages[[i]], quietly=TRUE)
attach(asNamespace(packages[[i]]), name=nms[[i]])
}
eval(expr, envir)
}
in_environment(occ('Anopheles plumbeus'), 'spocc')
in_environment(mapView(raster(system.file("external/test.grd", package="raster"))),
c('raster', 'Rsenal'))
for completeness, the parallel version works fine on the leaflet example:
libraries <- function (list)
lapply(list, library, character.only = TRUE)
dependencies <- list(one = 'spocc',
two = c('raster', 'Rsenal'))
expr_list <- list(one = expression(occ('Anopheles plumbeus')),
two = expression(mapView(raster(system.file("external/test.grd",
package="raster")))))
library(parallel)
cl <- makeCluster(2)
clusterMap(cl, libraries, dependencies) -> .
res <- clusterMap(cl, eval, expr_list)
str(res$one, 1)
library('leaflet') # needed for plotting too
m
Is there a copy of external/test.grd somewhere? I could have a look? Alternatively, what is the error message?
The parallel versions will be running in separate clean environments. However, presumably if a future call required the alternate loading order things could fail (non deterministically as jobs get allocated to nodes).
That's in the raster
package, so install that and you should be good to go. It's the meuse spatial dataset, so it'll also be in gstat
and a bunch of other places.
The error message is:
Error in browseURL(x, ...) : 'url' must be a non-empty character string
but when I tried debugging (Rsenal::mapView
is S4 so I didn't get very far) I got a warning about leafletR
not being available. Can't seem to reproduce it now...
what do you mean by the alternate loading order in that example?
OK, so I managed to debug more successfully now. The error arises because the result has the classes: 'leaflet', 'htmlwidget'
and leafletR::print.leaflet
is being called instead of htmlwidgets::print.htmlwidget
library(spocc); library(raster); library(Rsenal)
m <- mapView(raster(system.file("external/test.grd", package="raster")))
m
# Error in browseURL(x, ...) : 'url' must be a non-empty character string
sessionInfo()
# R version 3.2.1 (2015-06-18)
# Platform: x86_64-apple-darwin14.4.0 (64-bit)
# Running under: OS X 10.10.3 (Yosemite)
#
# locale:
# [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] Rsenal_0.1.85 raster_2.4-18 sp_1.1-1 spocc_0.3.0
#
# loaded via a namespace (and not attached):
# [1] Rcpp_0.12.0 plyr_1.8.3 base64enc_0.1-3 tools_3.2.1
# [5] digest_0.6.8 lubridate_1.3.3 jsonlite_0.9.16 memoise_0.2.1
# [9] ecoengine_1.9.1 gtable_0.1.2 lattice_0.20-31 rgbif_0.8.8
# [13] png_0.1-7 DBI_0.3.1 rstudioapi_0.3.1 mapproj_1.2-3
# [17] rgdal_1.0-4 curl_0.9.2 parallel_3.2.1 rbison_0.4.8
# [21] proto_0.3-10 dplyr_0.4.2 httr_1.0.0 stringr_1.0.0
# [25] leafletR_0.3-3 htmlwidgets_0.5 maps_2.3-11 leaflet_1.0.0
# [29] rvertnet_0.3.0 grid_3.2.1 rebird_0.2 data.table_1.9.4
# [33] R6_2.1.0 XML_3.98-1.3 ggplot2_1.0.1 reshape2_1.4.1
# [37] magrittr_1.5 whisker_0.3-2 AntWeb_0.7 htmltools_0.2.6
# [41] scales_0.2.5 MASS_7.3-40 assertthat_0.1 colorspace_1.2-6
# [45] brew_1.0-6 V8_0.6 stringi_0.5-5 munsell_0.4.2
# [49] chron_2.3-47 rjson_0.2.15
I'm now explicitly calling the correct print method in th emodule InteractiveMap, which fixes the problem. Would be good to know if there's an elegant and generalisable way around this though!
I still like the idea of using child processes to execute each module, but realised that would mean predict methods aren't visible in the output modules. We'd need to find a way of identifying the package hosting the relevant predict method to load when executing output modules.
Or, preferably, pass the predict method with the model so that modules writers can define their own predict method for weird models.
As above I think that model modules returning objects that contain a bespoke predict method with the correct behaviour would solve a lot of problems in output modules. It would also mean we can execute those in clean processes without the package containing the predict method having to be in the global namespace.
Here's a prototype which which makes model module developers define prediction code to conform to set (yet to be determined) IO standards via the MakePredictor
function, and enables output module developers to use the ZoonPredict
function, which will have fixed and expected behaviour
# these functions defined in zoon:
# generic predict function for predict objects in zoon output modules
# usage along the lines of:
# prediction <- ZoonPredict(.model$model, SomeNewData)
ZoonPredict <- function(predict_object, newdata) {
# get required packages
require (predict_object$packages,
character.only = TRUE)
# define prediction function using module code
fun_text <- sprintf('fun <- function (model, newdata) {%s}',
predict_object$code)
fun <- eval(parse(text = fun_text))
# run the predictor and return result
ans <- fun(predict_object$model, newdata = newdata)
return (ans)
}
# utility function to help module developers make prediction objects
# packages is the packages required for prediction
# (i.e. wherever the predict method is defined)
# model is the fitted model object
# code is a code snippet used to make predictions that conform to clear
# IO rules (tbd). e.g:
# must use the object 'model' and a dataframe 'newdata' (which may have some constraints)
# and nothing else;
# must return a numeric vector with length matching the number of rows
# of newdata and on the response scale.
MakePredictor <- function(packages,
model,
code) {
# catch the code as text
code <- deparse(substitute(code))
code <- paste(code, collapse = '\n')
# return the list
list(packages = packages,
model = model,
code = code)
}
# ~~~~~~~~~
# example usage
library(zoon)
# an using this approach for the logistic regression module:
LogisticRegression <- function (.df) {
# fit the model, as normal
covs <- as.data.frame(.df[, 6:ncol(.df)])
names(covs) <- names(.df)[6:ncol(.df)]
m <- glm(.df$value ~ .,
data = covs,
family = 'binomial')
# make the predictor object
MakePredictor(packages = 'stats',
model = m,
code = stats::predict.glm(object = model,
newdata = newdata,
type = 'response'))
}
# fake data to test
.df <- data.frame(value = sample(0:1, 10, replace = TRUE),
matrix(rnorm(100), nrow = 10))
# run the module to get the predict object
out <- LogisticRegression(.df)
# make predictions (here, against the training data)
ZoonPredict(predict_object = out,
newdata = .df[, 6:NCOL(.df)])
X5 X5 X5 X5 X5 X5 X5
0.6165592 0.5010722 0.6527457 0.5427995 0.6649432 0.6104505 0.7657966
X5 X5 X5
0.4713027 0.5023563 0.7618802
A similar one for QuickGRaF would therefore account for the fact that predictions are a matrix
QuickGRaF <- function (.df, l = NULL) {
zoon:::GetPackage('GRaF')
if (!all(.df$type %in% c('presence', 'absence', 'background'))) {
stop ('only for presence/absence or presence/background data')
}
# get the covariates
covs <- as.data.frame(.df[, 6:ncol(.df)])
names(covs) <- names(.df)[6:ncol(.df)]
# set up l
if (!is.null(l)) {
# duplicate if necessary
if (length(l) == 1) {
l <- rep(l, ncol(covs))
}
# check l has the correct length
if (length(l) != ncol(covs)) {
stop (sprintf('l has %i elements, but there are %i covariates', length(l), ncol(covs)))
}
# check l is of the correct value
if (any(l <= 0)) {
stop(sprintf('l must be positive, but the values provided were: %s',
paste(format(l, digits = 3), collapse = ', ')))
}
}
# fit the model
m <- graf(.df$value,
covs,
l = l)
# make the predictor object
MakePredictor(packages = 'GRaF',
model = m,
code = {
p <- GRaF::predict.graf(object = model,
newdata = newdata,
type = 'response')
p[, 1]
})
}
# run the module to get the predict object
out <- QuickGRaF(.df)
# make predictions (here, against the training data)
ZoonPredict(predict_object = out,
newdata = .df[, 6:NCOL(.df)])
X5 X5 X5 X5 X5 X5 X5
0.6165592 0.5010722 0.6527457 0.5427995 0.6649432 0.6104505 0.7657966
X5 X5 X5
0.4713027 0.5023563 0.7618802
This would mean we could get of this snippet which appears in lots of output modules:
# if pred is a matrix/dataframe, take only the first column
if(!is.null(dim(pred))) {
pred <- pred[, 1]
}
And because the model object, package name and code are all passed in a single object, the object can easily be passed to a child process and executed cleanly.
What are people's thoughts on this as an approach to the problem?
When I get some time, I'll start a new branch to implement both this and execution of modules in child processes.
I think it makes sense as long as we do good documentation for module builders (which is important anyway).
Part of that will be simply rewritting the model modules we already have. I think people will often use current modules as examples.
I'm just wondering whether the default value for the code
arg in MakePredictor
could be something useful that would cover the many algos that do use
predict(object = model, newdata = newdata, type = 'response')
That said, we would probably want to write modules for most of those basic algorithms anyway. So maybe pointless.
Finally I'm not sure I like the name MakePredictor
as I use predictor to mean independent variable (because I get confused between independent and dependent. I ain't the sharpest knife in the drawer.)
Yup, sounds good.
I'm just wondering whether the default value for the code arg in MakePredictor could be something useful that would cover the many algos that do use
predict(object = model, newdata = newdata, type = 'response')
That would help developers a bit, but is not explicit about the predict function. If a model has two classes, it would help to be explicit about the required function. I think that being explicit won't be to arduous, and as you say they can copy from existing functions.
Finally I'm not sure I like the name MakePredictor as I use predictor to mean independent variable (because I get confused between independent and dependent. I ain't the sharpest knife in the drawer.)
Yup, I agree - worth thinking about a better name... ZoonModel
?
I've been having a back at child processes again and thought I'd record my notes here. TL;DR: nope.
It's definitely possible to execute modules in child processes, but every time a new clean child process is started there's a significant overhead (~2s on my 2015 mbp) to loading zoon and raster. This would apply to every module in a workflow so even the smallest would have 10 seconds tacked on, which I think is unacceptable.
Instead of using brand new processes, we can instead fork from existing processes with next to no computational cost (if the parent process has zoon & raster, the fork is born with them loaded too). If we span up one clean process when zoon is loaded in an R session, then we can fork from that, and only pay the 2s cost once per session.
Unfortunately, #?*!ing windows doesn't do forking so that's a non-starter for the majority of our user base.
We could think about a two-tier option, where things just work more cleanly on Unix, but that doesn't appeal to me - workflows developed on Unix could error for a Windows user.
I think it would be better to invest some time trying to implement @richfitz' suggestion of using environments and attachment. AFAIK that wouldn't protect us from a user library
ing and causing conflicts, but its a step in the right direction.
Packages required for one module are loaded into the global namespace and are visible to subsequent modules and to users. As is happening with #77, already loaded packages can conflict with subsequent modules. As the number of modules increases, the number of such conflicts will rise and it will become impossible to test for them.
It would be great if we could just clean up after each module by unloading all the packages it's loaded. I've searched for a while on this to no avail, and this comment seems pretty definitive: