workflowr / workflowr

Organize your project into a research website
https://workflowr.github.io/workflowr
Other
815 stars 106 forks source link

Integration with package managers like renv/packrat #128

Open sdhutchins opened 5 years ago

sdhutchins commented 5 years ago
  1. Describe the suggested feature.

My suggestion is that there is a packrat parameter in the wflow_start() function. The user would be able to use a boolean TRUE/FALSE to determine whether packrat would be used with project.

When using wflow_publish(), packrat would take a snapshot packrat::snapshot() of the repository.

There are a few quirks with regards to bioconductor (https://github.com/rstudio/packrat/issues/507), but I just had some success using and setting it up pretty easily.

  1. Describe why you think this feature is needed.

For the sake of reproducibility, using packrat ensures that not only is the session information shared (like workflowr provides) but that also a repository or snapshot of those packages is available for re-running analyses in the future or sharing.

  1. Tell us the version of workflowr you are using. Run packageVersion("workflowr") in R or RStudio, and copy & paste the output here.
    > packageVersion("workflowr")
    [1] ‘1.1.1’
jdblischak commented 5 years ago

@sdhutchins This is an interesting idea. Since packrat is itself an R package, and is limited to versioning R packages, it would be possible to wrap packrat functions automatically within workflowr functions.

The big decision in my view is deciding whether there is enough added value for integrating packrat within workflowr versus using the two separately. This is because integrating packrat will be non-trivial since the workflowr functions would have to handle any errors when running packrat functions.

Would you be interested in trying to implement these changes? Like you suggested, the solution would be to add packrat::init() to wflow_start() (making sure the files it creates are version controlled), and then add packrat::snapshot() to wflow_publish() (and add packrat/packrat.lock, packrat/packrat.opts, and packrat/init.R to the files to be committed).

Also, from my quick experimenting with packrat and workflowr, I think that snapshot() should set ignore.stale = TRUE because packrat is refusing to create a snapshot for me unless I set that argument (I assume I'm not using packrat correctly).

xref: #52

sdhutchins commented 5 years ago

@jdblischak That's interesting about the (ignore.stale = TRUE). That aspect of it (having potentially out of date versions of packages) is the main issue or challenge. I found that it was having a bit of an issue with Bioconductor packages but I found a fix. I'm definitely interested in integrating this, but I'll certainly try contributing some instructions on using them together first and playing with the functionality to make sure it's a good fit.

jdblischak commented 5 years ago

I'm definitely interested in integrating this, but I'll certainly try contributing some instructions on using them together first and playing with the functionality to make sure it's a good fit.

@sdhutchins That sounds like a great plan! And when you're ready to contribute, please check back with me so I can advise on how best to do so.

Thanks!

wlandau commented 5 years ago

I recently learned about pak, and I think it may be worth noting here. I am not sure it is as focused as packrat on tightly-controlled reproducibility for all packages, but it does create an independent local package library, and it is much faster to use. It would be nice if any new integrated package management functionality were as slick as workflowr itself.

jdblischak commented 5 years ago

@wlandau Thanks for the suggestion! I agree that pak looks very useful.

My main concern is still the same as above, will integrating packrat directly into workflowr make it more convenient?

The big decision in my view is deciding whether there is enough added value for integrating packrat within workflowr versus using the two separately. This is because integrating packrat will be non-trivial since the workflowr functions would have to handle any errors when running packrat functions.

Managing dependencies is difficult, and each system that is created to manage them has its pros and cons. I found this rOpenSci discussion illuminating: https://github.com/ropensci/unconf17/issues/74

I'd appreciate any thoughts on how package management could be integrated into workflowr. One big decision would be whether to manage the packages per project, per R Markdown file, or allow the user to decide between either.

sdhutchins commented 5 years ago

A new package/project manager for R has emerged in renv.

You bring up a good point, @jdblischak, about whether using something like packrat makes it more convenient. renv, however, seems (so far) like it intends to be the replacement for packrat and a much better one.

jdblischak commented 5 years ago

@sdhutchins Thanks for alerting me to this new package! It looks really interesting.

Going back to your original proposal, have you been using packrat regularly? Has it worked well for your reproducibility needs? If yes, would you be interested in implementing the features you had described?

jdblischak commented 5 years ago

Another thing to consider: the workflowr functions can be run outside of a workflowr project using the argument project. Thus if the current working directory is not in the workflowr project, then the .Rprofile will not be executed. Thus in this case wflow_publish() would need to first run source(file.path(project, "packrat/init.R").

sdhutchins commented 5 years ago

@jdblischak I'm going to take a look at it this week (again). When I last used packrat, I had a few issues, but I think the project had far too many dependencies. I'm not sure if packrat has improved at handling that, but I have used it for smaller projects.

jdblischak commented 5 years ago

@sdhutchins That'd be great! Thanks!

sdhutchins commented 5 years ago

After testing packrat again in an existing project, I still find a lot of the same issues exist with speed, lack of integration with Bioconductor, and lots of warnings.

I'm still working through it, but I wanted to make note of this article and https://environments.rstudio.com/. To my knowledge, there's no mention of packrat here so I'm growing cautious about integrating it. It seems that Rstudio is making a huge shift towards renv. I'm going to try it out as well this week.

I'm curious what you think. My concern with either is that...packrat may be soon less prioritized by Rstudio and has obvious issues although better known than renv, but renv seems to have a better approach to managing packages but is still in the very early stages of development.

jdblischak commented 5 years ago

@sdhutchins Thanks for these links! They were very informative.

Here are my current thoughts:

  1. I don't personally use packrat (I use conda), and no other workflowr users have requested that packrat support be added. That being said, I think some sort of package management would be beneficial to workflowr users, and could be nicely integrated into workflowr. Thus I would welcome a PR that adds this feature.
  2. I think renv looks promising, but I wouldn't be comfortable integrating it into workflowr until it is released on CRAN.
  3. packrat and renv appear to have a similar API (e.g. init(), snapshot(), and restore()). Thus if you add support for packrat now, it should be straightforward to incorporate support for renv once it is released on CRAN (and then workflowr users could choose which to use, or neither).
sdhutchins commented 5 years ago

@jdblischak Just wanted to let you know I'm currently working on implementing this as we had initially discussed in your comment.

Also, your code is incredibly elegant! I'll do my best to keep up with your standards.

jdblischak commented 5 years ago

Just wanted to let you know I'm currently working on implementing this as we had initially discussed in your comment.

@sdhutchins Awesome! I look forward to your pull request. One thing to keep in mind as you develop this is how to make it general so that in the future we can add support for renv and other package managers (I'd like to try to support conda). Maybe we need to use an option like workflowr.pkgmanager that could be set to "packrat", "renv", "conda", etc?

Also note that I just recently changed my development workflow. I used to develop on a branch named "dev" but now I just use master. Thus make sure you use the master branch as your starting point (and if you use a feature branch, you should branch off of master). Please don't hesitate to ask if you run into Git problems.

Also, your code is incredibly elegant! I'll do my best to keep up with your standards.

Thank you so much! But no pressure. We can always refine the implementation once we it is working well.

jdblischak commented 4 years ago

Some motivation for implementing tighter integration with dependency management package like renv: https://twitter.com/PeteHaitch/status/1192206056488722432

kevinushey commented 4 years ago

Now that renv is on CRAN, I would strongly recommend evaluating it as a replacement for Packrat here. Hopefully the headaches that @sdhutchins bumped into are now gone with renv; and if they're not, I'd be glad to fix any issues ASAP.

Hopefully, porting from Packrat to renv is as simple as replacing packrat:: with renv:: :-)

jdblischak commented 4 years ago

@kevinushey Thanks for the feedback (and for your work on both packrat and renv!). I agree that renv should be the main goal now that it is on CRAN. I've updated the Issue title accordingly.

But from the perspective of workflowr, it should be able to be fairly agnostic to the choice of packrat or renv. Workflowr needs an initiating function to run during wflow_start(), a "snapshot" function to run during wflow_publish(), and a lockfile to commit to the Git repository. Since both packrat and renv have those features, if a workflowr user really wanted to use packrat for some reason, we could let them.

sdhutchins commented 4 years ago

Great news, @kevinushey! I'll have to try that out.

kieran-mace commented 3 years ago

+1 for renv <--> workflowr

rgayler commented 3 years ago

FWIW, I am currently using renv:: with a couple of workflowr:: projects and haven't found the lack of integration between them to be an issue. They really seem quite orthogonal to me (although, related via their ultimate purpose of reproducibility). Of course, it's always possible that somebody else's use-case or workflow might get different mileage.

I can see that there might be some value in workflowr:: reporting the result of renv::status() in the rendered documents rather than it only being visible in the console when run manually.

jdblischak commented 3 years ago

+1 for renv <--> workflowr

@kieran-mace Thanks for adding expressing your interest. It's helpful to know which feature requests are most popular among workflowr users.

I can see that there might be some value in workflowr:: reporting the result of renv::status() in the rendered documents rather than it only being visible in the console when run manually.

@rgayler Thanks for providing your insight. I was going to say almost the exact same thing, but it's much more impactful coming from a user with direct experience with renv+workflowr.

To summarize what @rgayler noted:

  1. Workflowr is compatible with the package manager of your choice (e.g. renv,conda, etc.), and the use of a package manager with workflowr is highly encouraged.
  2. Currently workflowr only automatically records the session information at the end of every analysis run. The idea of this PR would be to add some convenience functionality related to renv, e.g. have wflow_publish() automatically snapshot the environment and commit the changes to renv.lock, as well as adding renv status to the workflowr report at the top of each file.
rgayler commented 3 years ago

FWIW I have taken to adding the following lines to .Rprofile in workflowr projects:

message("renv status")
renv::status()
message()

message("workflowr status")
workflowr::wflow_status()
message()

I find it useful to get a heads-up when re-entering the project. It tends to keep what's happening in renv at front of mind, so I'm less likely to do something dumb with it.

Also, I tend to assume that updated packages will be an improvement unless positively demonstrated to break something. Consequently, I tend to update packages frequently, do that directly via renv, and explicitly commit the updated renv.lock. I think I would be uncomfortable having wflow_publish() automatically update and commit renv.lock for me on the grounds that I would rather retain explicit control of package updating.

I would probably be happier if workflowr just did renv::status() as one of its checks, and failed that check if the renv:::status() was anything other than clean.

jdblischak commented 3 years ago

I find it useful to get a heads-up when re-entering the project. It tends to keep what's happening in renv at front of mind, so I'm less likely to do something dumb with it.

@rgayler I really like that .Rprofile setup. Thanks for sharing!

I think I would be uncomfortable having wflow_publish() automatically update and commit renv.lock for me on the grounds that I would rather retain explicit control of package updating.

I appreciate that feedback. This is always a tough decision for workflowr: how to be useful without being draconian? At the end of the day, users won't put up with a reproducibility framework that is too restrictive, which is then self-defeating. My goal is for workflowr to always feel like it is assisting the user with their reproducibility needs.

I would probably be happier if workflowr just did renv::status() as one of its checks, and failed that check if the renv:::status() was anything other than clean.

I really like this idea. It would provide useful feedback without having to worry about the risk of borking the user's package installations.

Do you know of the easiest way to determine if the current environment matches the lock file? renv::diagnostics() returns NULL. renv::status() returns the current package versions installed and the versions specified in the lock file. At worst I could write code to compare all the package versions and report discrepancies, but I don't want to duplicate functionality that already exists in renv.

rgayler commented 3 years ago

Do you know of the easiest way to determine if the current environment matches the lock file?

@jdblischak - sorry, I don't.

I just went and looked at the source for renv::status() for the first time. The main function (renv_status_impl()) seems to be calling a bunch of other functions for their printing side-effects without capturing their return values - so I don't hold much hope for renv::status() as it currently stands being directly useful programmatically within workflowr.

At worst I could write code to compare all the package versions and report discrepancies

Is that level of detail necessary? I think a simple binary report is all that's needed (renv is happy/unhappy). Then it's the user's responsibility to use renv to figure out why it's unhappy.

I don't want to duplicate functionality that already exists in renv.

Rightly so. Could you use capture.output() to catch the printed output from renv::status() then check to see whether the output is equal to '* The project is already synchronized with the lockfile.'? That's hacky and brittle, but would probably work.

I think the better approach would be to put in a feature request to renv. It's a bit of an anomaly in a functional programming language like R, to have a package where the functions don't return useful values. I would have expected renv::status() to invisibly return a list of the results for all the sub-tests it runs. The documentation of the ... argument shows that they are allowing for future expansion, so the feature request would be to have renv::status() always invisibly return a list of all the test results and to add an argument (e.g. silent = TRUE) to suppress the printing.

Slightly tangentially:

@rgayler I really like that .Rprofile setup. Thanks for sharing!

I am slowly retrofitting targets into my current main workflowr project. Running tar_make() starts a new R process which, by default, sources .Rprofile. I found the repeated running of renv::status() and wflow_status() via .Rprofile to be annoying, so I took those calls out of .Rprofile. (It was easier to do that than work out how to get targets::tar_make() to source an alternate .Rprofile.) I only want to be reminded of the status when I enter the project, not every time I run something.

What I have done is create a workflowr notebook that executes renv::status(), workflowr::wflow_status(), and targets::tar_visnetwork(), so I get a rendered status report. I have put the following lines in .Rprofile so that when I enter the project (but not when I run tar_make()) I get a reminder to look at the project status. This also makes the status update optional, which may be helpful given that it might take a considerable time to calculate if the project is complex.

if(interactive())
  message("Run 'wflow_build(\"analysis/m_00_status.Rmd\")' to see the project status")
rgayler commented 3 years ago

I think the better approach would be to put in a feature request to renv. It's a bit of an anomaly in a functional programming language like R, to have a package where the functions don't return useful values. I would have expected renv::status() to invisibly return a list of the results for all the sub-tests it runs.

@jdblischak I note that workflowr is much more how I would expect functional programming to be, in that the major action functions (e.g. wflow_build() and wflow_publish()) essentially return lists of their arguments, even though they are generally executed only for their side effects. In the spirit of kibble-ophagy, it would be helpful to extend the returned values with fields showing the results of the status tests executed by the functions and the actions that will be carried out as a consequence. That way, if you executed a dry run the returned value would programmatically indicate the actions to be carried out, and why. (The same values for dry_run = FALSE would indicate what actions were carried out, and why.)

EDIT: I just re-read the wflow_build() documentation and noticed that the built and html components of the returned value indicate the files that will-be/have-been rebuilt, so could be considered invalidated targets.

That would make it easier for some external logic to control workflowr actions without replicating all the workflowr internal logic. It might assist in integration with targets, using the returned values from a workflowr dry run to control the invalidation of targets.

wlandau commented 3 years ago

I am slowly retrofitting targets into my current main workflowr project. Running tar_make() starts a new R process which, by default, sources .Rprofile. I found the repeated running of renv::status() and wflow_status() via .Rprofile to be annoying, so I took those calls out of .Rprofile. (It was easier to do that than work out how to get targets::tar_make() to source an alternate .Rprofile.)

You can configure that external process with the callr_arguments argument of tar_make(). Acceptable elements include everything here (if callr_function is callr::r) except func and args. You might try callr_arguments(user_profile = TRUE) with the R_PROFILE_USER environment variable set to the path you choose (disclaimer: haven't tried myself).

rgayler commented 3 years ago

You might try callr_arguments(user_profile = TRUE) with the R_PROFILE_USER environment variable set to the path you choose

Thanks @wlandau . That looks like it ought to work. (It would be even neater if callr::r() allowed direct specification of a file or string to use as the .Rprofile in addition to replicating standard start-up behaviour.)

What has become clear to me is that the limiting factor in my use of workflowr and targets to do non-trivial things is my (lack of) breadth and depth of R knowledge. Both packages are exceptionally well documented, but working out the implications of the documented behaviour can stump me. For example, I read through the tar_make() help and spent some time reading callr::r (it was the first time I had encountered callr). I came away with the impression that I could modify anything to do with the child R process, but no ideas of what might be useful to modify for my purposes.

wlandau commented 3 years ago

It would be even neater if callr::r() allowed direct specification of a file or string to use as the .Rprofile in addition to replicating standard start-up behaviour

On reflection, _targets.R is conceptually kind of like its own .Rprofile for all the targets functions that spawn external processes (tar_visnetwork() etc. as well as tar_make()). So maybe just source("/custom/path/.Rprofile") inside _targets.R?

rgayler commented 3 years ago

@wlandau - even better! I like the idea of everything used by the tar_make() processes being explicit in _targets.R. If you wanted the tar_make() processes to use the project .Rprofile you would put source(here::here(".Rprofile")) in _targets.R. I presume I would need to add something like callr_arguments = list(user_profile = FALSE) as an argument to tar_make() to prevent it from trying to source .Rprofile for itself.

AmelZulji commented 3 years ago

Hi everyone,

I am wondering what would be the recommended way to combine workflowr with renv?

Thank you for your help Amel

rgayler commented 3 years ago

@AmelZulji I have been using renv and workflowr together for a while and it "just works". That is, there is nothing special I have to do. I use renv like I would in any non-workflowr project and I use workflowr like I would in any non-renv project.

I am not an expert in either renv or workflowr, so I may be missing some important point, but I treat them as totally independent, unaware of each other, and not needing to be aware of each other. This is just what I do, I wouldn't go so far as to call it "recommended".

I found the collaboration section of the renv documentation to be helpful. When you are happy that your workflowr project is in a good state you execute renv::snapshot() to capture the state of the packages into the renv lockfile. Then commit the lockfile to the git repo. When your collaborator (possibly future you) wants to run the workflowr project code, they restore the project (including the lockfile) from the git repo and execute renv::restore(), which installs the correct versions of the snapshotted packages.

jdblischak commented 3 years ago

I am wondering what would be the recommended way to combine workflowr with renv?

@AmelZulji Thanks for your question. This thread is discussing how to more tightly integrate workflowr and renv. For example, to include the renv status in the workflowr reproducibility report. This hasn't been implemented yet, but it would only be an added value. As @rgayler noted, you can use renv like you would for any other R project, and it should just work. I recommend adding the lock file renv creates to version control (e.g. git add renv.lock or wflow_git_commit("renv.lock")) and regularly committing any changes made by renv::snapshot().

AmelZulji commented 3 years ago

Thank you very much, @rgayler and @jdblischak !

sdhutchins commented 1 year ago

@jdblischak Found my way back to this 😄 as I'm working on analysis and using workflowr. Our lab has a huge emphasis on reproducibility so I may try a test run of what it'd look like implementing renv into workflowr as an environment manager. When I have a general mapping on that (referencing our previous conversations), I'll will post here on updates!