zoonproject / zoon

The zoon R package

store and re-use package versions #169

Open goldingn opened 9 years ago

goldingn commented 9 years ago

Currently zoon ignores the fact that packages change version between the time a workflow is written and the time it is reproduced.

If we could store the versions for all packages being used when running a workflow, we should be able to reinstall the same version when re-running it.

e.g. devtools::install_version will install a specific version by building it from the source tarballs in the CRAN archives.
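As a rough illustration of the idea above, here is a minimal sketch of reinstalling from a stored set of versions. The function name `reinstallVersions` and its `installer` argument are hypothetical (they are not part of zoon); the default installer simply calls devtools::install_version, which needs devtools and a network connection. The version numbers are only examples.

```r
# Reinstall recorded package versions.
# `versions` is a named character vector: names are packages, values versions.
# `installer` is a hypothetical hook so the function can be tried offline;
# the default calls devtools::install_version (needs devtools + network).
reinstallVersions <- function (versions,
                               installer = function (pkg, version) {
                                 devtools::install_version(pkg, version = version)
                               }) {
  stopifnot(is.character(versions), !is.null(names(versions)))

  for (pkg in names(versions)) {
    installer(pkg, versions[[pkg]])
  }

  invisible(versions)
}

# versions as they might have been recorded when the workflow was first run
recorded <- c(abind = "1.4-3", bitops = "1.0-6")

# dry run: report what would be installed instead of installing it
reinstallVersions(recorded,
                  installer = function (pkg, version) {
                    message(sprintf("would install %s %s", pkg, version))
                  })
```

Dropping the `installer` argument in the last call would attempt the real installs, building each package from source.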

goldingn commented 9 years ago

I wrote this function to grab the versions of named packages in a named library (by default, whatever's returned by .Library):

# return the version of the package `pkg` installed in library `lib`
getVersion <- function (pkg, lib = .Library) {

  # vectorise by recursion
  if (length(pkg) > 1) {
    ans <- sapply(pkg,
                  getVersion,
                  lib)

    return (ans)
  }

  # get path to package description
  desc_path <- sprintf('%s/%s/DESCRIPTION',
                       lib,
                       pkg)

  # check it exists
  if (!file.exists(desc_path)) {

    warning (sprintf('package %s is not in the specified library, returning NA',
                     pkg))
    return(NA)

  } else {

    lines <- readLines(desc_path)
    vers_line <- lines[grep('^Version: *', lines)]
    vers <- gsub('Version: ', '', vers_line)
    return (vers)

  }
}

e.g.

getVersion('zoon')

[1] "0.3.2"

getVersion(list.files(.Library))
          abind          AntWeb             ape 
        "1.4-3"           "0.7"           "3.3" 
     assertthat            base       base64enc 
          "0.1"         "3.2.2"         "0.1-3" 
             BH         biomod2          bitops 
     "1.58.0-1"        "3.1-64"         "1.0-6" 
           boot            brew         caTools 
       "1.3-17"         "1.0-6"        "1.17.1"
                        ...

which should be a start.
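For comparison, base R can report the same information through utils::installed.packages(), which accepts a `lib.loc` argument. The sketch below is an alternative, not a replacement for the function above; `libraryVersions` is a hypothetical name.

```r
# versions of every package in one library, via utils::installed.packages()
libraryVersions <- function (lib = .Library) {
  ip <- utils::installed.packages(lib.loc = lib)
  # row names of the matrix are the package names
  stats::setNames(ip[, "Version"], rownames(ip))
}

# default library only
vers <- libraryVersions()
head(vers)

# one named vector per library on the search path
all_vers <- lapply(stats::setNames(.libPaths(), .libPaths()),
                   libraryVersions)
```

Keeping the results separate per library makes it explicit which library each version came from, which matters if the same package is installed in more than one place.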

I imagine that there are headaches down the road with multiple libraries. We could enforce .Library as a default, allowing users to change it if they want, provided they tell us what library they are using. I.e. this as the usually unseen default:

workflow(..., library = .Library)

goldingn commented 9 years ago

Packages which provide alternative approaches include:

packrat, which sets up a private package library in a user's project directory and does some version control on it. I don't think this is very close to what we want though.

checkpoint, which talks to a Revolution R server (MRAN) that copies the CRAN binaries at midnight every day. We could force users to install packages afresh (rather than using whatever they already have installed) for every zoon workflow, and then be able to fetch those again. That assumes the package wasn't updated during the day in question and that Revolution R keeps maintaining that server...

I think we'd be better off avoiding these two, but I'm open to suggestions.

timcdlucas commented 9 years ago

Reinstalling packages on a regular basis sounds like a big turn-off and a pain.

Maybe use the checkpoint idea, but by default zoon just uses what's available (and records what it used). Then have an argument to enforce perfectly reproducing a workflow, which would only be needed if someone is failing to reproduce a workflow.

timcdlucas commented 9 years ago

There's forceReproducible already in workflow. But most of this discussion really refers to running rerunWorkflow on a workflow object.

goldingn commented 9 years ago

Right, the checkpoint thing would work if we only installed that day's package in a force reproducible call.

ReRunWorkflow calls will be rare enough that the overhead of installing specific versions afresh shouldn't be an issue.

Maybe we just try to match by package version visible in the library by the end of the workflow as the standard method. That doesn't require fresh downloading.

We could do checkpoint in forceReproducible calls if needed, though I'm not sure if that would add much...
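A minimal sketch of the "match by version visible in the library" check might look like the following. It assumes both the recorded and the currently installed versions are available as named character vectors; `versionMismatches` is a hypothetical helper, and the `installed` values are made up for illustration.

```r
# Compare versions recorded with a workflow against those in the library.
# Both arguments are named character vectors (names = package names).
# Returns one row per package that is missing or at a different version.
versionMismatches <- function (recorded, installed) {
  pkgs <- names(recorded)

  # look each recorded package up in the library; absent ones become NA
  found <- unname(installed[pkgs])

  changed <- is.na(found) | found != unname(recorded)

  data.frame(package = pkgs[changed],
             recorded = unname(recorded)[changed],
             installed = found[changed],
             stringsAsFactors = FALSE)
}

recorded  <- c(abind = "1.4-3", ape = "3.3", bitops = "1.0-6")
installed <- c(abind = "1.4-3", ape = "3.4")  # made-up current library

versionMismatches(recorded, installed)
```

An empty result would mean the workflow can be re-run as-is; anything else could trigger a warning, or a fresh install when forceReproducible is set.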

goldingn commented 9 years ago

Would be great if we could work out how to install binaries of specific versions from checkpoint's MRAN server (which we can query by date). checkpoint may do this internally, or we could scrape something...

goldingn commented 9 years ago

So it looks simple enough to scrape CRAN's archives for version publication dates, then download the specific package version as a binary from MRAN, avoiding the checkpoint package altogether (we can get the required day's mirror as e.g. MRAN.revolutionanalytics.com/snapshot/20140909)

goldingn commented 9 years ago

Sorry, that's https://MRAN.revolutionanalytics.com/snapshot/2014-09-09
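To make the snapshot idea concrete, here is a small sketch that builds the repository URL for a given day in the format quoted above. `snapshotRepo` is a hypothetical helper; it only constructs the URL and does not check that the server is reachable or that a snapshot exists for that date.

```r
# build the MRAN snapshot URL for a given day
snapshotRepo <- function (date) {
  date <- as.Date(date)
  sprintf("https://MRAN.revolutionanalytics.com/snapshot/%s",
          format(date, "%Y-%m-%d"))
}

repo <- snapshotRepo("2014-09-09")
repo

# the URL could then be passed as a repository (needs network access):
# install.packages("abind", repos = repo)
```

Going through as.Date() means a malformed date fails early rather than producing a bad URL.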

goldingn commented 9 years ago

ooh, look at this new R package that's appeared that does just what we want: https://github.com/goldingn/versions

will get it on CRAN soon

goldingn commented 9 years ago

On CRAN now: https://cran.r-project.org/web/packages/versions/

AugustT commented 8 years ago

Nick you are a machine!

Did you just reinvent switchr? https://github.com/gmbecker/switchr

goldingn commented 8 years ago

Ha! I looked around but never found that one.

versions has no dependencies and is definitely multi-platform. Judging by the vignette, switchr needs RTools on Windows since it installs from source, but it does have a nice facility for handling multiple libraries.

We can go with whichever works best!

AugustT commented 8 years ago

So currently the session info is captured in a workflow:

w <- workflow(UKAnophelesPlumbeus,
              UKAir,
              OneHundredBackground, 
              LogisticRegression,
              SameTimePlaceMap)

w$session.info

R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] rgdal_1.1-3         viridis_0.3.2       htmlwidgets_0.5    
[4] leaflet_1.0.0       randomForest_4.6-12 dismo_1.0-15       
[7] zoon_0.4.21         raster_2.5-2        sp_1.2-2           

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3      magrittr_1.5     munsell_0.4.2   
 [4] colorspace_1.2-6 lattice_0.20-33  R6_2.1.2        
 [7] httr_1.1.0       plyr_1.8.3       tools_3.2.3     
[10] grid_3.2.3       gtable_0.1.2     htmltools_0.3   
[13] yaml_2.1.13      digest_0.6.9     rfigshare_0.3.7 
[16] RJSONIO_1.3-0    gridExtra_2.0.0  ggplot2_2.0.0   
[19] bitops_1.0-6     RCurl_1.95-4.7   scales_0.3.0    
[22] XML_3.98-1.3     httpuv_1.3.3    

I can then use your package to install all these packages at the beginning of the re-run:

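# Something like...
library(versions)

# attached and namespace-loaded packages from the stored session info
pkgs <- c(w$session.info$otherPkgs, w$session.info$loadedOnly)

# 2 x n character matrix: row 1 = package names, row 2 = versions
to_install <- vapply(pkgs,
                     FUN = function (x) c(x$Package, x$Version),
                     FUN.VALUE = character(2))

install.versions(pkgs = to_install[1, ], versions = to_install[2, ])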
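One thing worth noting about the snippet above: the "loaded via a namespace" list includes tools_3.2.3 and grid_3.2.3, which ship with R itself and cannot be installed from CRAN as separate packages. The sketch below drops such packages before building the install list. `sessionVersions` is a hypothetical helper, and this is not claimed to be the cause of any particular error.

```r
# Turn a sessionInfo object into a named vector of versions, dropping
# packages that ship with R itself (DESCRIPTION field `Priority: base`).
sessionVersions <- function (info = utils::sessionInfo()) {

  pkgs <- c(info$otherPkgs, info$loadedOnly)

  is_base <- vapply(pkgs,
                    function (x) identical(x$Priority, "base"),
                    logical(1))
  pkgs <- pkgs[!is_base]

  # names of the result are the package names
  vapply(pkgs, function (x) x$Version, character(1))
}

to_install <- sessionVersions()
to_install

# with the versions package installed, this could then be passed on as:
# versions::install.versions(names(to_install), unname(to_install))
```

For a stored workflow, `sessionVersions(w$session.info)` would be the equivalent call, assuming `w$session.info` is the object returned by sessionInfo().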

This seems fine, but I worry about overwriting all the versions that the user currently has installed. We need a way to reverse that afterwards. I could simply do the same thing in reverse (capture session info and reinstall the previous versions at the end of the workflow), but I wonder if there is something more elegant?

I also have an error message which I have posted here https://github.com/goldingn/versions/issues/5

goldingn commented 8 years ago

It's a good point, maybe something switchr-like to install the packages in a temp library would be a good shout?

Perhaps that behaviour should be optional as it incurs a significant overhead installing all the used packages and their dependencies. Something like a forceReproducible argument for re-running someone else's workflow? Or a cleanLibrary option?
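A minimal sketch of the temporary-library idea, using only base R: create a throwaway library, put it at the front of the library search path while some code runs, then restore the original paths. `withTempLibrary` is a hypothetical helper, not an existing zoon or switchr function.

```r
# Run `fun` with a fresh temporary library at the front of .libPaths(),
# restoring the original library paths (and deleting the temporary
# library) afterwards, even if `fun` fails.
withTempLibrary <- function (fun) {

  old <- .libPaths()
  lib <- tempfile("zoon_library_")
  dir.create(lib)

  on.exit({
    .libPaths(old)
    unlink(lib, recursive = TRUE)
  })

  .libPaths(c(lib, old))

  fun(lib)
}

first_lib <- withTempLibrary(function (lib) {
  # specific versions could be installed here, e.g.
  # versions::install.versions(pkgs, versions, lib = lib)
  .libPaths()[1]
})

first_lib
```

Packages installed this way never touch the user's own library. Note that namespaces already loaded in the session stay loaded from wherever they came from, so this would need to happen before the workflow's packages are loaded.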

Thanks, will check out the bug!