zoonproject / zoon

The zoon R package

clean environments for modules #84

Open goldingn opened 9 years ago

goldingn commented 9 years ago

Packages required for one module are loaded into the global namespace and are visible to subsequent modules and to users. As is happening with #77, already loaded packages can conflict with subsequent modules. As the number of modules increases, the number of such conflicts will rise and it will become impossible to test for them.

It would be great if we could just clean up after each module by unloading all the packages it's loaded. I've searched for a while on this to no avail, and this comment seems pretty definitive:

It's not possible to return R to a fresh slate. I've talked with John Chambers about this, and it's particularly difficult to do for S4 class/method registration. – hadley Sep 26 '11 at 20:24

goldingn commented 9 years ago

If we could spawn modules in their own, clean environments, load the things they need (and only them), then take their products and destroy them, that would be a lovely clean solution.

One approach would be to use existing parallel computing libraries to spawn child processes. E.g. using snowfall something like this might work:

sfInit(cpus = 1, parallel = FALSE)
sfLibrary(list_of_module_dependencies)
sfExport(input_data_and_module_arguments)
ans <- sfLapply(NULL, some_model_module_in_a_wrapper)[[1]]
sfStop()

We could also act on lists to do parallelism easily, though we'd want to load different packages on different cluster nodes.

A downside of child processes is that it makes debugging much harder.

goldingn commented 9 years ago

Tim's suggestion would be worth trying first, although it doesn't protect the modules from packages the user has loaded previously.

goldingn commented 9 years ago

any thoughts on this appreciated - maybe @richfitz has ideas?

timcdlucas commented 9 years ago

I don't know much about S4 classes or how often they are going to be a problem. Was the leaflet example an issue with S4? I get the feeling that detach("package:packagename", unload = TRUE) would fix most problems, if not quite all.

It isn't totally obvious how to make sure modules even remotely clear up after themselves, as they are user contributed. I don't know if you could add a line in GetPackage that runs detach in the parent environment (i.e. the module function).

Something like

package <- as.character(substitute(package))
str <- paste0('package:', package)
eval.parent(on.exit(detach(str, unload = TRUE)), n = 1)

The idea being that this will be evaluated in the module function's environment, so that once the module has finished, any packages used will be detached.

However, I can't work out how to get str to evaluate immediately, while the rest remains as an expression.

Anyway, then we only have to push people to use GetPackage rather than library or require. But it doesn't fix the S4 issue etc.
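
For what it's worth, one way to force str to be evaluated up front would be to build the detach() call eagerly and register it as an exit handler in the calling frame. A minimal sketch of what GetPackage could look like (this is not the current zoon implementation, just the idea):

# a sketch only: load a package and register a detach() call to run when
# the *calling* module function exits
GetPackage <- function (package) {

  # resolve the package name and load it as usual
  package <- as.character(substitute(package))
  library(package, character.only = TRUE)

  # build the cleanup call with the package name already substituted in
  cleanup <- bquote(detach(.(paste0('package:', package)),
                           unload = TRUE, character.only = TRUE))

  # register it in the caller's frame; do.call with envir is needed
  # because on.exit registers handlers in the frame it is evaluated in
  do.call(on.exit, list(cleanup, add = TRUE), envir = parent.frame())
}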

Child processes might still be a better idea, at least in the long term, as (as you said) we want to make stuff parallel anyway. We just need to be careful that we don't duplicate big in-memory objects. Model objects (e.g. random forests) can get pretty big, let alone the data.

goldingn commented 9 years ago

It's not really an S4 thing - that was part of that SO thread, but the detaching/unloading problem is more generic. It's down to the dependencies loaded in by each package, and not being able to work out reverse dependencies without accidentally removing things that packages you want to keep still depend on.

I tried all the combinations of detach and unloadNamespace and the various functions suggested in forums for finding dependencies and removing them. I don't think there's a way to deal with it that way without rewriting R.

richfitz commented 9 years ago

I'd love a way of getting isolated environments within an R session, but hands down the safest way would be a subprocess (along those lines, I'm stewing on the idea of trying to replicate Python's subprocess module in R, which would allow for possibly nicer management of subprocesses).

Previously, I wondered if this could be done by manipulating the search path (this might not work well for S4 classes, but then they're generally horrible things anyway). Basically, I'm wondering if you can either:

  1. Build a new chain of environments up from baseenv() so that multiple "global" search paths are possible (a rough sketch of this is at the end of this comment)
  2. Temporarily use attach while evaluating module code to inject different package sets into the search path (using on.exit to ensure that detach is always run).

For both approaches you'd want to prohibit users from calling library directly, or everything will get messed up.

The trick with most of these is not to unload the packages, but to use things like requireNamespace to get the package set up properly (shared libraries loaded, etc.) and to avoid unloading the namespace, as doing that in the wrong order can either fail to unload or crash R if you force it.
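
To make option 1 a bit more concrete, here's a rough, untested sketch: copy each package's exports into a fresh chain of environments rooted at baseenv(), then evaluate the module code in an environment sitting on top of that chain, leaving the real search path alone.

module_env <- function (packages) {
  parent <- baseenv()
  for (p in packages) {
    # load the namespace without attaching anything
    requireNamespace(p, quietly = TRUE)
    # copy the package's exports into a new link in the chain
    env <- new.env(parent = parent)
    for (nm in getNamespaceExports(p)) {
      assign(nm, getExportedValue(p, nm), envir = env)
    }
    parent <- env
  }
  # the module's evaluation environment sits on top of the chain
  # (note: package functions keep their own namespaces as enclosures,
  #  so this only controls what the module code itself can see)
  new.env(parent = parent)
}

# e.g. evaluate an expression with only MASS (plus base) visible:
eval(quote(fractions(pi)), module_env('MASS'))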

goldingn commented 9 years ago

Cool, we should definitely try attaching or similar then to see if that works.

In the meantime, here's a basic version of child processes using parallel in a way similar to zoon::workflow's current method:

# load the parallel package to spawn child processes
library(parallel)

# load some other arbitrary package as a user might
library(plyr)

# set up a cluster on three cores
cl <- makeCluster(3)

# function to load multiple libraries
libraries <- function (list) {
  lapply(as.list(list),
         library,
         character.only = TRUE)
}

# a list for our three modules, giving the required packages
dependencies <- list(one = c('dismo', 'spocc'),
                     two = c('leaflet'), 
                     three = 'base')  # load nothing

# (need to get a list of dependencies for each module
#  - should enforce that these are listed in the docs)

# send packages out
clusterMap(cl, libraries, dependencies) -> .

# packages loaded for the user
names(sessionInfo()$otherPkgs)
# [1] "plyr"

# packages loaded on each child
clusterEvalQ(cl, names(sessionInfo()$otherPkgs))
# [[1]]
# [1] "spocc"  "dismo"  "raster" "sp"    
# [[2]]
# [1] "leaflet"
# [[3]]
# NULL

# ~~~~~~~~~~~
# export modules and arguments in the zoon style

# define some silly modules (all have the x argument)
printer <- function(x) {
  paste('result:', x)
}
printer2 <- function(x, ...) {
  x <- as.vector(c(x, list(...)))
  paste('result:', x)
}
emphasiser <- function(x) {
  x <- toupper(x)
  paste('result:', x)
}

# define the list of modules and their arguments
# (workflow has something like this)
modules <- list(one = list(module = printer,
                           paras = list(x = 'test')),
                two = list(module = printer2,
                           paras = list(x = 'test',
                                        a = 1)),
                three = list(module = emphasiser,
                             paras = list(x = 'test')))

# note that we pass the modules, not their names, so that the children
# have them

# a function to execute the modules from their lists
doer <- function(module_list) {
  do.call(module_list$module, module_list$paras)
}

# show that it works at the top level
(res1 <- doer(modules$one))
# [1] "result: test"
(res2 <- doer(modules$two))
# [1] "result: test" "result: 1"   
(res3 <- doer(modules$three))
# [1] "result: TEST"

# execute on the child processes
(res <- clusterMap(cl, doer, modules))
# $one
# [1] "result: test"
# $two
# [1] "result: test" "result: 1"   
# $three
# [1] "result: TEST"

This is all in parallel already, which is nice.

A downside is that we can't plot directly to the open graphics device from the output modules. We can always require the user to save plots instead, and that may solve a lot of other problems (e.g. compiling results into a report, other unforeseen pain) further down the line. We could easily write a ZoonPlot function to handle that nicely with a temp directory.
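
A helper along these lines might be enough (ZoonPlot is hypothetical, just a sketch of the idea):

# run plotting code, save it to a temporary png and return the file path,
# so a child process can hand its plot back to the master for display
ZoonPlot <- function (expr, width = 480, height = 480) {
  path <- tempfile(fileext = '.png')
  png(path, width = width, height = height)
  on.exit(dev.off())  # close the device even if the plotting code errors
  eval(expr)
  invisible(path)
}

# usage: an output module would wrap its plotting code, e.g.
# f <- ZoonPlot(quote(plot(1:10)))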

richfitz commented 9 years ago

Here's a more concrete definition of approach 2:

in_environment <- function(expr, packages, envir=parent.frame()) {
  nms <- sprintf("fakepackage:%s", packages)
  cleanup <- function() {
    loaded <- intersect(rev(nms), search())
    for (p in loaded) {
      try(detach(p, character.only=TRUE))
    }
  }
  on.exit(cleanup())
  for (i in seq_along(packages)) {
    requireNamespace(packages[[i]], quietly=TRUE)
    attach(asNamespace(packages[[i]]), name=nms[[i]])
  }
  eval(expr, envir)
}

The function in_environment takes two arguments: an unevaluated expression and a character vector of packages to load (to be loaded first to last). It will load these packages, evaluate the expression and unload the packages in reverse order, leaving the search path untouched:

in_environment(fractions(pi), "MASS")

goldingn commented 9 years ago

sweet. thanks!

goldingn commented 9 years ago

Hmmm...

Running the following on the definitions in my example above, in_environment doesn't clear out those attached packages after it's done:

in_environment <- function(expr, packages, envir=parent.frame()) {
  nms <- sprintf("fakepackage:%s", packages)
  cleanup <- function() {
    loaded <- intersect(rev(nms), search())
    for (p in loaded) {
      try(detach(p, character.only=TRUE))
    }
  }
  on.exit(cleanup())
  for (i in seq_along(packages)) {
    requireNamespace(packages[[i]], quietly=TRUE)
    attach(asNamespace(packages[[i]]), name=nms[[i]])
  }
  eval(expr, envir)
}

attached <- function() names(sessionInfo()$loadedOnly)

attached()
# "tools" "Rcpp"

in_environment(attached(),
               packages = dependencies$one)
# The following object is masked from fakepackage:dismo:
#   
#   .onLoad
# 
# The following objects are masked from package:plyr:
#   
#   mapvalues, quickdf, rename, revalue
# 
# The following object is masked from package:base:
#   
#   strtrim
# 
# [1] "Rcpp"       "tools"      "digest"     "lubridate"  "jsonlite"  
# [6] "memoise"    "ecoengine"  "gtable"     "lattice"    "rgbif"     
# [11] "DBI"        "rstudioapi" "mapproj"    "curl"       "rbison"    
# [16] "proto"      "dismo"      "dplyr"      "httr"       "stringr"   
# [21] "leafletR"   "raster"     "maps"       "rvertnet"   "grid"      
# [26] "rebird"     "data.table" "R6"         "XML"        "sp"        
# [31] "ggplot2"    "reshape2"   "magrittr"   "whisker"    "AntWeb"    
# [36] "scales"     "spocc"      "MASS"       "assertthat" "colorspace"
# [41] "brew"       "V8"         "stringi"    "munsell"    "chron"     
# [46] "rjson"     

attached()
# [1] "Rcpp"       "tools"      "digest"     "lubridate"  "jsonlite"  
# [6] "memoise"    "ecoengine"  "gtable"     "lattice"    "rgbif"     
# [11] "DBI"        "rstudioapi" "mapproj"    "curl"       "rbison"    
# [16] "proto"      "dismo"      "dplyr"      "httr"       "stringr"   
# [21] "leafletR"   "raster"     "maps"       "rvertnet"   "grid"      
# [26] "rebird"     "data.table" "R6"         "XML"        "sp"        
# [31] "ggplot2"    "reshape2"   "magrittr"   "whisker"    "AntWeb"    
# [36] "scales"     "spocc"      "MASS"       "assertthat" "colorspace"
# [41] "brew"       "V8"         "stringi"    "munsell"    "chron"     
# [46] "rjson" 
richfitz commented 9 years ago

Yeah, it doesn't unload them (which is the bit that might fail) but it will remove them from the search path, which removes conflicts (check out search()). Is there a situation where having loaded but unattached packages would cause problems?
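
A quick way to see the distinction (loaded namespaces are in memory; only attached packages are on the search path and can mask things):

# loaded namespaces (in memory) vs attached packages (on the search path,
# where masking and conflicts actually happen)
loadedNamespaces()
grep('^package:', search(), value = TRUE)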

As a potential warning about unloading packages: if you use a package that uses Rcpp modules and unload or reload the package before all garbage collection of R objects is complete, R will crash.

goldingn commented 9 years ago

Ah OK, thanks.

I'm not sure whether they need to be out of the search path or unloaded in the leafletR vs leaflet example that spurred this on, but I'll give this a go.

goldingn commented 9 years ago

unfortunately in_environment seems not to work for the leaflet case.

On a clean R session, this works just fine:

# install.packages('raster')
# devtools::install_github('environmentalinformatics-marburg/Rsenal')
library(Rsenal); library(raster)
mapView(raster(system.file("external/test.grd", package="raster")))

but this does not:

# install.packages(c('spocc', 'raster'))
# devtools::install_github('environmentalinformatics-marburg/Rsenal')

in_environment <- function(expr, packages, envir=parent.frame()) {
  nms <- sprintf("fakepackage:%s", packages)
  cleanup <- function() {
    loaded <- intersect(rev(nms), search())
    for (p in loaded) {
      try(detach(p, character.only=TRUE))
    }
  }
  on.exit(cleanup())
  for (i in seq_along(packages)) {
    requireNamespace(packages[[i]], quietly=TRUE)
    attach(asNamespace(packages[[i]]), name=nms[[i]])
  }
  eval(expr, envir)
}

in_environment(occ('Anopheles plumbeus'), 'spocc')
in_environment(mapView(raster(system.file("external/test.grd", package="raster"))),
               c('raster', 'Rsenal'))

goldingn commented 9 years ago

for completeness, the parallel version works fine on the leaflet example:

libraries <- function (list)
  lapply(list, library, character.only = TRUE)

dependencies <- list(one = 'spocc',
                     two = c('raster', 'Rsenal'))

expr_list <- list(one = expression(occ('Anopheles plumbeus')),
                  two = expression(mapView(raster(system.file("external/test.grd",
                                                              package="raster")))))

library(parallel)
cl <- makeCluster(2)
clusterMap(cl, libraries, dependencies) -> .
res <- clusterMap(cl, eval, expr_list)

str(res$one, 1)

library('leaflet')  # needed for plotting too
res$two  # the map object returned by the second node

richfitz commented 9 years ago

Is there a copy of external/test.grd somewhere? I could have a look? Alternatively, what is the error message?

The parallel versions will be running in separate clean environments. However, presumably if a future call required the alternate loading order, things could fail (non-deterministically, as jobs get allocated to nodes).

goldingn commented 9 years ago

That's in the raster package, so install that and you should be good to go. It's the meuse spatial dataset, so it'll also be in gstat and a bunch of other places.

The error message is: Error in browseURL(x, ...) : 'url' must be a non-empty character string. But when I tried debugging (Rsenal::mapView is S4, so I didn't get very far) I got a warning about leafletR not being available. Can't seem to reproduce it now...

goldingn commented 9 years ago

what do you mean by the alternate loading order in that example?

goldingn commented 9 years ago

OK, so I managed to debug more successfully now. The error arises because the result has the classes 'leaflet' and 'htmlwidget', and leafletR::print.leaflet is being called instead of htmlwidgets::print.htmlwidget.

library(spocc); library(raster); library(Rsenal)
m <- mapView(raster(system.file("external/test.grd", package="raster")))
m
# Error in browseURL(x, ...) : 'url' must be a non-empty character string
sessionInfo()
# R version 3.2.1 (2015-06-18)
# Platform: x86_64-apple-darwin14.4.0 (64-bit)
# Running under: OS X 10.10.3 (Yosemite)
# 
# locale:
#   [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
#   [1] Rsenal_0.1.85 raster_2.4-18 sp_1.1-1      spocc_0.3.0  
# 
# loaded via a namespace (and not attached):
#   [1] Rcpp_0.12.0      plyr_1.8.3       base64enc_0.1-3  tools_3.2.1     
# [5] digest_0.6.8     lubridate_1.3.3  jsonlite_0.9.16  memoise_0.2.1   
# [9] ecoengine_1.9.1  gtable_0.1.2     lattice_0.20-31  rgbif_0.8.8     
# [13] png_0.1-7        DBI_0.3.1        rstudioapi_0.3.1 mapproj_1.2-3   
# [17] rgdal_1.0-4      curl_0.9.2       parallel_3.2.1   rbison_0.4.8    
# [21] proto_0.3-10     dplyr_0.4.2      httr_1.0.0       stringr_1.0.0   
# [25] leafletR_0.3-3   htmlwidgets_0.5  maps_2.3-11      leaflet_1.0.0   
# [29] rvertnet_0.3.0   grid_3.2.1       rebird_0.2       data.table_1.9.4
# [33] R6_2.1.0         XML_3.98-1.3     ggplot2_1.0.1    reshape2_1.4.1  
# [37] magrittr_1.5     whisker_0.3-2    AntWeb_0.7       htmltools_0.2.6 
# [41] scales_0.2.5     MASS_7.3-40      assertthat_0.1   colorspace_1.2-6
# [45] brew_1.0-6       V8_0.6           stringi_0.5-5    munsell_0.4.2   
# [49] chron_2.3-47     rjson_0.2.15  

goldingn commented 9 years ago

I'm now explicitly calling the correct print method in the module InteractiveMap, which fixes the problem. Would be good to know if there's an elegant and generalisable way around this though!
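
For reference, the explicit call is presumably something along these lines (a guess at the approach, not the actual InteractiveMap code):

# bypass S3 dispatch and call the htmlwidgets print method directly
print_widget <- getS3method('print', 'htmlwidget')
print_widget(m)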

goldingn commented 9 years ago

I still like the idea of using child processes to execute each module, but realised that would mean predict methods aren't visible in the output modules. We'd need to find a way of identifying the package hosting the relevant predict method to load when executing output modules.

Or, preferably, pass the predict method with the model so that module writers can define their own predict method for weird models.

goldingn commented 9 years ago

As above I think that model modules returning objects that contain a bespoke predict method with the correct behaviour would solve a lot of problems in output modules. It would also mean we can execute those in clean processes without the package containing the predict method having to be in the global namespace.

Here's a prototype which makes model module developers define prediction code conforming to a set of (yet to be determined) IO standards via the MakePredictor function, and enables output module developers to use the ZoonPredict function, which will have fixed and expected behaviour.

# these functions defined in zoon:

# generic predict function for predict objects in zoon output modules
# usage along the lines of:
#   prediction <- ZoonPredict(.model$model, SomeNewData)
ZoonPredict <- function(predict_object, newdata) {

  # load the required packages (there may be more than one)
  for (package in predict_object$packages) {
    require(package, character.only = TRUE)
  }

  # define a prediction function using the module's code snippet
  fun_text <- sprintf('fun <- function (model, newdata) {%s}',
                      predict_object$code)
  fun <- eval(parse(text = fun_text))

  # run the predictor and return the result
  ans <- fun(predict_object$model, newdata = newdata)
  return (ans)
}

# utility function to help module developers make prediction objects
# packages is the packages required for prediction
# (i.e. wherever the predict method is defined)
# model is the fitted model object
# code is a code snippet used to make predictions that conform to clear
# IO rules (tbd). e.g: 
# must use the object 'model' and a dataframe 'newdata' (which may have some constraints)
# and nothing else;
# must return a numeric vector with length matching the number of rows
# of newdata and on the response scale.
MakePredictor <- function(packages,
                          model,
                          code) {

  # catch the code as text
  code <- deparse(substitute(code))
  code <- paste(code, collapse = '\n')

  # return the list
  list(packages = packages,
       model = model,
       code = code)
}
# ~~~~~~~~~
# example usage
library(zoon)
# an example of using this approach for the logistic regression module:
LogisticRegression <- function (.df) {

  # fit the model, as normal
  covs <- as.data.frame(.df[, 6:ncol(.df)])
  names(covs) <- names(.df)[6:ncol(.df)]
  m <- glm(.df$value ~ .,
           data = covs,
           family = 'binomial')

  # make the predictor object
  MakePredictor(packages = 'stats',
                model = m,
                code = stats::predict.glm(object = model,
                                          newdata = newdata,
                                          type = 'response'))

}

# fake data to test
.df <- data.frame(value = sample(0:1, 10, replace = TRUE),
                  matrix(rnorm(100), nrow = 10))

# run the module to get the predict object
out <- LogisticRegression(.df)
# make predictions (here, against the training data)
ZoonPredict(predict_object = out,
            newdata = .df[, 6:NCOL(.df)])
      X5        X5        X5        X5        X5        X5        X5 
0.6165592 0.5010722 0.6527457 0.5427995 0.6649432 0.6104505 0.7657966 
       X5        X5        X5 
0.4713027 0.5023563 0.7618802

A similar one for QuickGRaF would need to account for the fact that its predictions are a matrix:

QuickGRaF <- function (.df, l = NULL) {

  zoon:::GetPackage('GRaF')

  if (!all(.df$type %in% c('presence', 'absence', 'background'))) {
    stop ('only for presence/absence or presence/background data')
  }

  # get the covariates
  covs <- as.data.frame(.df[, 6:ncol(.df)])
  names(covs) <- names(.df)[6:ncol(.df)]

  # set up l
  if (!is.null(l)) {

    # duplicate if necessary
    if (length(l) == 1) {
      l <- rep(l, ncol(covs))
    }

    # check l has the correct length
    if (length(l) != ncol(covs)) {
      stop (sprintf('l has %i elements, but there are %i covariates', length(l), ncol(covs)))
    }

    # check l is of the correct value
    if (any(l <= 0)) {
      stop(sprintf('l must be positive, but the values provided were: %s',
                   paste(format(l, digits = 3), collapse = ', ')))
    }
  }
  # fit the model
  m <- graf(.df$value,
            covs,
            l = l)

  # make the predictor object
  MakePredictor(packages = 'GRaF',
                model = m,
                code = {
                  p <- GRaF::predict.graf(object = model,
                                          newdata = newdata,
                                          type = 'response')
                  p[, 1]
                })

}

# run the module to get the predict object
out <- QuickGRaF(.df)
# make predictions (here, against the training data)
ZoonPredict(predict_object = out,
            newdata = .df[, 6:NCOL(.df)])
      X5        X5        X5        X5        X5        X5        X5 
0.6165592 0.5010722 0.6527457 0.5427995 0.6649432 0.6104505 0.7657966 
       X5        X5        X5 
0.4713027 0.5023563 0.7618802

This would mean we could get rid of this snippet, which appears in lots of output modules:

  # if pred is a matrix/dataframe, take only the first column
  if(!is.null(dim(pred))) {
    pred <- pred[, 1]
  }

And because the model object, package name and code are all passed in a single object, the object can easily be passed to a child process and executed cleanly.
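
For instance, executing the prediction on a clean child process could look something like this (a sketch using the objects defined above; zoon doesn't do this yet):

library(parallel)

# start one clean child process
cl <- makeCluster(1)

# ZoonPredict and its arguments are shipped to the child with the call;
# ZoonPredict loads the packages it needs on the child itself
pred <- clusterCall(cl, ZoonPredict,
                    predict_object = out,
                    newdata = .df[, 6:NCOL(.df)])[[1]]

stopCluster(cl)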

What are people's thoughts on this as an approach to the problem?

When I get some time, I'll start a new branch to implement both this and execution of modules in child processes.

timcdlucas commented 9 years ago

I think it makes sense as long as we write good documentation for module builders (which is important anyway).

Part of that will be simply rewriting the model modules we already have. I think people will often use current modules as examples.

I'm just wondering whether the default value for the code arg in MakePredictor could be something useful that would cover the many algos that do use predict(object = model, newdata = newdata, type = 'response')
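
Something like this might work as the default (a hypothetical variant of MakePredictor; it relies on substitute() picking up the default expression when code isn't supplied):

MakePredictor <- function(packages,
                          model,
                          code = predict(object = model,
                                         newdata = newdata,
                                         type = 'response')) {

  # substitute() returns the default expression when code isn't supplied,
  # so the common predict() call becomes the fallback snippet
  code <- deparse(substitute(code))
  code <- paste(code, collapse = '\n')

  list(packages = packages,
       model = model,
       code = code)
}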

That said, we would probably want to write modules for most of those basic algorithms anyway. So maybe pointless.

Finally I'm not sure I like the name MakePredictor as I use predictor to mean independent variable (because I get confused between independent and dependent. I ain't the sharpest knife in the drawer.)

goldingn commented 9 years ago

Yup, sounds good.

I'm just wondering whether the default value for the code arg in MakePredictor could be something useful that would cover the many algos that do use predict(object = model, newdata = newdata, type = 'response')

That would help developers a bit, but is not explicit about the predict function. If a model has two classes, it would help to be explicit about the required function. I think that being explicit won't be too arduous, and as you say, they can copy from existing functions.

Finally I'm not sure I like the name MakePredictor as I use predictor to mean independent variable (because I get confused between independent and dependent. I ain't the sharpest knife in the drawer.)

Yup, I agree - worth thinking about a better name... ZoonModel?

goldingn commented 8 years ago

I've been having another crack at child processes and thought I'd record my notes here. TL;DR: nope.

It's definitely possible to execute modules in child processes, but every time a new clean child process is started there's a significant overhead (~2s on my 2015 mbp) to loading zoon and raster. This would apply to every module in a workflow, so even the smallest workflow would have 10 seconds tacked on, which I think is unacceptable.
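
For reference, the overhead can be measured roughly like this (numbers will vary by machine):

library(parallel)

# rough timing of starting a clean child and loading the core packages
system.time({
  cl <- makeCluster(1)
  clusterEvalQ(cl, { library(zoon); library(raster) })
  stopCluster(cl)
})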

Instead of using brand new processes, we can instead fork from existing processes with next to no computational cost (if the parent process has zoon & raster loaded, the fork is born with them loaded too). If we spin up one clean process when zoon is loaded in an R session, then we can fork from that, and only pay the 2s cost once per session.
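
A minimal sketch of the forking idea (Unix only), assuming we just want to run a single module function in a fork of the current session:

library(parallel)

# mcparallel forks the current process, so whatever is already loaded
# (zoon, raster, ...) is available in the child for free
run_module_forked <- function (module, args) {
  job <- mcparallel(do.call(module, args))
  mccollect(job)[[1]]
}

# e.g. run_module_forked(printer, list(x = 'test'))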

Unfortunately, #?*!ing Windows doesn't do forking, so that's a non-starter for the majority of our user base.

We could think about a two-tier option, where things just work more cleanly on Unix, but that doesn't appeal to me - workflows developed on Unix could error for a Windows user.

I think it would be better to invest some time trying to implement @richfitz' suggestion of using environments and attachment. AFAIK that wouldn't protect us from a user librarying and causing conflicts, but its a step in the right direction.