Open sdhutchins opened 5 years ago
@sdhutchins This is an interesting idea. Since packrat is itself an R package, and is limited to versioning R packages, it would be possible to wrap packrat functions automatically within workflowr functions.
The big decision in my view is deciding whether there is enough added value for integrating packrat within workflowr versus using the two separately. This is because integrating packrat will be non-trivial since the workflowr functions would have to handle any errors when running packrat functions.
Would you be interested in trying to implement these changes? Like you suggested, the solution would be to add packrat::init()
to wflow_start()
(making sure the files it creates are version controlled), and then add packrat::snapshot()
to wflow_publish()
(and add packrat/packrat.lock
, packrat/packrat.opts
, and packrat/init.R
to the files to be committed).
Also, from my quick experimenting with packrat and workflowr, I think that snapshot()
should set ignore.stale = TRUE
because packrat is refusing to create a snapshot for me unless I set that argument (I assume I'm not using packrat correctly).
xref: #52
@jdblischak That's interesting about the (ignore.stale = TRUE
). That aspect of it (having potentially out of date versions of packages) is the main issue or challenge. I found that it was having a bit of an issue with Bioconductor packages but I found a fix. I'm definitely interested in integrating this, but I'll certainly try contributing some instructions on using them together first and playing with the functionality to make sure it's a good fit.
I'm definitely interested in integrating this, but I'll certainly try contributing some instructions on using them together first and playing with the functionality to make sure it's a good fit.
@sdhutchins That sounds like a great plan! And when you're ready to contribute, please check back with me so I can advise on how best to do so.
Thanks!
I recently learned about pak
, and I think it may be worth noting here. I am not sure it is as focused as packrat
on tightly-controlled reproducibility for all packages, but it does create an independent local package library, and it is much faster to use. It would be nice if any new integrated package management functionality were as slick as workflowr
itself.
@wlandau Thanks for the suggestion! I agree that pak looks very useful.
My main concern is still the same as above, will integrating packrat directly into workflowr make it more convenient?
The big decision in my view is deciding whether there is enough added value for integrating packrat within workflowr versus using the two separately. This is because integrating packrat will be non-trivial since the workflowr functions would have to handle any errors when running packrat functions.
Managing dependencies is difficult, and each system that is created to manage them has its pros and cons. I found this rOpenSci discussion illuminating: https://github.com/ropensci/unconf17/issues/74
I'd appreciate any thoughts on how package management could be integrated into workflowr. One big decision would be whether to manage the packages per project, per R Markdown file, or allow the user to decide between either.
A new package/project manager for R has emerged in renv
.
You bring up a good point, @jdblischak, about whether using something like packrat makes it more convenient. renv
, however, seems (so far) like it intends to be the replacement for packrat and a much better one.
@sdhutchins Thanks for alerting me to this new package! It looks really interesting.
Going back to your original proposal, have you been using packrat regularly? Has it worked well for your reproducibility needs? If yes, would you be interested in implementing the features you had described?
wflow_start()
: if packrat is installed and packrat = TRUE
, run packrat::init()
wflow_publish()
: If there is a subdirectory packrat/
, run packrat::snapshot(ignore.stale = TRUE, prompt = FALSE)
and then commit packrat/packrat.lock
and packrat/packrat.opts
(done in the first step, prior to building any of the rmarkdown files)Another thing to consider: the workflowr functions can be run outside of a workflowr project using the argument project
. Thus if the current working directory is not in the workflowr project, then the .Rprofile
will not be executed. Thus in this case wflow_publish()
would need to first run source(file.path(project, "packrat/init.R")
.
@jdblischak I'm going to take a look at it this week (again). When I last used packrat, I had a few issues, but I think the project had far too many dependencies. I'm not sure if packrat has improved at handling that, but I have used it for smaller projects.
@sdhutchins That'd be great! Thanks!
After testing packrat
again in an existing project, I still find a lot of the same issues exist with speed, lack of integration with Bioconductor, and lots of warnings.
I'm still working through it, but I wanted to make note of this article and https://environments.rstudio.com/. To my knowledge, there's no mention of packrat
here so I'm growing cautious about integrating it. It seems that Rstudio is making a huge shift towards renv
. I'm going to try it out as well this week.
I'm curious what you think. My concern with either is that...packrat
may be soon less prioritized by Rstudio and has obvious issues although better known than renv
, but renv
seems to have a better approach to managing packages but is still in the very early stages of development.
@sdhutchins Thanks for these links! They were very informative.
Here are my current thoughts:
init()
, snapshot()
, and restore()
). Thus if you add support for packrat now, it should be straightforward to incorporate support for renv once it is released on CRAN (and then workflowr users could choose which to use, or neither).@jdblischak Just wanted to let you know I'm currently working on implementing this as we had initially discussed in your comment.
Also, your code is incredibly elegant! I'll do my best to keep up with your standards.
Just wanted to let you know I'm currently working on implementing this as we had initially discussed in your comment.
@sdhutchins Awesome! I look forward to your pull request. One thing to keep in mind as you develop this is how to make it general so that in the future we can add support for renv and other package managers (I'd like to try to support conda). Maybe we need to use an option like workflowr.pkgmanager
that could be set to "packrat"
, "renv"
, "conda"
, etc?
Also note that I just recently changed my development workflow. I used to develop on a branch named "dev" but now I just use master. Thus make sure you use the master branch as your starting point (and if you use a feature branch, you should branch off of master). Please don't hesitate to ask if you run into Git problems.
Also, your code is incredibly elegant! I'll do my best to keep up with your standards.
Thank you so much! But no pressure. We can always refine the implementation once we it is working well.
Some motivation for implementing tighter integration with dependency management package like renv: https://twitter.com/PeteHaitch/status/1192206056488722432
Now that renv
is on CRAN, I would strongly recommend evaluating it as a replacement for Packrat here. Hopefully the headaches that @sdhutchins bumped into are now gone with renv
; and if they're not, I'd be glad to fix any issues ASAP.
Hopefully, porting from Packrat to renv
is as simple as replacing packrat::
with renv::
:-)
@kevinushey Thanks for the feedback (and for your work on both packrat and renv!). I agree that renv should be the main goal now that it is on CRAN. I've updated the Issue title accordingly.
But from the perspective of workflowr, it should be able to be fairly agnostic to the choice of packrat or renv. Workflowr needs an initiating function to run during wflow_start()
, a "snapshot" function to run during wflow_publish()
, and a lockfile to commit to the Git repository. Since both packrat and renv have those features, if a workflowr user really wanted to use packrat for some reason, we could let them.
Great news, @kevinushey! I'll have to try that out.
+1 for renv <--> workflowr
FWIW, I am currently using renv::
with a couple of workflowr::
projects and haven't found the lack of integration between them to be an issue. They really seem quite orthogonal to me (although, related via their ultimate purpose of reproducibility). Of course, it's always possible that somebody else's use-case or workflow might get different mileage.
I can see that there might be some value in workflowr::
reporting the result of renv::status()
in the rendered documents rather than it only being visible in the console when run manually.
+1 for renv <--> workflowr
@kieran-mace Thanks for adding expressing your interest. It's helpful to know which feature requests are most popular among workflowr users.
I can see that there might be some value in workflowr:: reporting the result of renv::status() in the rendered documents rather than it only being visible in the console when run manually.
@rgayler Thanks for providing your insight. I was going to say almost the exact same thing, but it's much more impactful coming from a user with direct experience with renv+workflowr.
To summarize what @rgayler noted:
wflow_publish()
automatically snapshot the environment and commit the changes to renv.lock
, as well as adding renv status to the workflowr report at the top of each file.FWIW I have taken to adding the following lines to .Rprofile
in workflowr projects:
message("renv status")
renv::status()
message()
message("workflowr status")
workflowr::wflow_status()
message()
I find it useful to get a heads-up when re-entering the project. It tends to keep what's happening in renv at front of mind, so I'm less likely to do something dumb with it.
Also, I tend to assume that updated packages will be an improvement unless positively demonstrated to break something. Consequently, I tend to update packages frequently, do that directly via renv, and explicitly commit the updated renv.lock
. I think I would be uncomfortable having wflow_publish()
automatically update and commit renv.lock
for me on the grounds that I would rather retain explicit control of package updating.
I would probably be happier if workflowr just did renv::status()
as one of its checks, and failed that check if the renv:::status()
was anything other than clean.
I find it useful to get a heads-up when re-entering the project. It tends to keep what's happening in renv at front of mind, so I'm less likely to do something dumb with it.
@rgayler I really like that .Rprofile
setup. Thanks for sharing!
I think I would be uncomfortable having wflow_publish() automatically update and commit renv.lock for me on the grounds that I would rather retain explicit control of package updating.
I appreciate that feedback. This is always a tough decision for workflowr: how to be useful without being draconian? At the end of the day, users won't put up with a reproducibility framework that is too restrictive, which is then self-defeating. My goal is for workflowr to always feel like it is assisting the user with their reproducibility needs.
I would probably be happier if workflowr just did renv::status() as one of its checks, and failed that check if the renv:::status() was anything other than clean.
I really like this idea. It would provide useful feedback without having to worry about the risk of borking the user's package installations.
Do you know of the easiest way to determine if the current environment matches the lock file? renv::diagnostics()
returns NULL
. renv::status()
returns the current package versions installed and the versions specified in the lock file. At worst I could write code to compare all the package versions and report discrepancies, but I don't want to duplicate functionality that already exists in renv.
Do you know of the easiest way to determine if the current environment matches the lock file?
@jdblischak - sorry, I don't.
I just went and looked at the source for renv::status()
for the first time. The main function (renv_status_impl()
) seems to be calling a bunch of other functions for their printing side-effects without capturing their return values - so I don't hold much hope for renv::status()
as it currently stands being directly useful programmatically within workflowr
.
At worst I could write code to compare all the package versions and report discrepancies
Is that level of detail necessary? I think a simple binary report is all that's needed (renv
is happy/unhappy). Then it's the user's responsibility to use renv
to figure out why it's unhappy.
I don't want to duplicate functionality that already exists in renv.
Rightly so. Could you use capture.output()
to catch the printed output from renv::status()
then check to see whether the output is equal to '* The project is already synchronized with the lockfile.'
? That's hacky and brittle, but would probably work.
I think the better approach would be to put in a feature request to renv
. It's a bit of an anomaly in a functional programming language like R, to have a package where the functions don't return useful values. I would have expected renv::status()
to invisibly return a list of the results for all the sub-tests it runs. The documentation of the ...
argument shows that they are allowing for future expansion, so the feature request would be to have renv::status()
always invisibly return a list of all the test results and to add an argument (e.g. silent = TRUE
) to suppress the printing.
Slightly tangentially:
@rgayler I really like that .Rprofile setup. Thanks for sharing!
I am slowly retrofitting targets
into my current main workflowr
project. Running tar_make()
starts a new R process which, by default, sources .Rprofile
. I found the repeated running of renv::status()
and wflow_status()
via .Rprofile
to be annoying, so I took those calls out of .Rprofile
. (It was easier to do that than work out how to get targets::tar_make()
to source an alternate .Rprofile
.) I only want to be reminded of the status when I enter the project, not every time I run something.
What I have done is create a workflowr
notebook that executes renv::status()
, workflowr::wflow_status()
, and targets::tar_visnetwork()
, so I get a rendered status report. I have put the following lines in .Rprofile
so that when I enter the project (but not when I run tar_make()
) I get a reminder to look at the project status. This also makes the status update optional, which may be helpful given that it might take a considerable time to calculate if the project is complex.
if(interactive())
message("Run 'wflow_build(\"analysis/m_00_status.Rmd\")' to see the project status")
I think the better approach would be to put in a feature request to renv. It's a bit of an anomaly in a functional programming language like R, to have a package where the functions don't return useful values. I would have expected renv::status() to invisibly return a list of the results for all the sub-tests it runs.
@jdblischak I note that workflowr
is much more how I would expect functional programming to be, in that the major action functions (e.g. wflow_build()
and wflow_publish()
) essentially return lists of their arguments, even though they are generally executed only for their side effects. In the spirit of kibble-ophagy, it would be helpful to extend the returned values with fields showing the results of the status tests executed by the functions and the actions that will be carried out as a consequence. That way, if you executed a dry run the returned value would programmatically indicate the actions to be carried out, and why. (The same values for dry_run = FALSE
would indicate what actions were carried out, and why.)
EDIT: I just re-read the wflow_build()
documentation and noticed that the built
and html
components of the returned value indicate the files that will-be/have-been rebuilt, so could be considered invalidated targets.
That would make it easier for some external logic to control workflowr
actions without replicating all the workflowr
internal logic. It might assist in integration with targets
, using the returned values from a workflowr
dry run to control the invalidation of targets.
I am slowly retrofitting targets into my current main workflowr project. Running tar_make() starts a new R process which, by default, sources .Rprofile. I found the repeated running of renv::status() and wflow_status() via .Rprofile to be annoying, so I took those calls out of .Rprofile. (It was easier to do that than work out how to get targets::tar_make() to source an alternate .Rprofile.)
You can configure that external process with the callr_arguments
argument of tar_make()
. Acceptable elements include everything here (if callr_function
is callr::r
) except func
and args
. You might try callr_arguments(user_profile = TRUE)
with the R_PROFILE_USER
environment variable set to the path you choose (disclaimer: haven't tried myself).
You might try callr_arguments(user_profile = TRUE) with the R_PROFILE_USER environment variable set to the path you choose
Thanks @wlandau . That looks like it ought to work. (It would be even neater if callr::r()
allowed direct specification of a file or string to use as the .Rprofile
in addition to replicating standard start-up behaviour.)
What has become clear to me is that the limiting factor in my use of workflowr
and targets
to do non-trivial things is my (lack of) breadth and depth of R knowledge. Both packages are exceptionally well documented, but working out the implications of the documented behaviour can stump me. For example, I read through the tar_make()
help and spent some time reading callr::r
(it was the first time I had encountered callr
). I came away with the impression that I could modify anything to do with the child R process, but no ideas of what might be useful to modify for my purposes.
It would be even neater if callr::r() allowed direct specification of a file or string to use as the .Rprofile in addition to replicating standard start-up behaviour
On reflection, _targets.R
is conceptually kind of like its own .Rprofile
for all the targets
functions that spawn external processes (tar_visnetwork()
etc. as well as tar_make()
). So maybe just source("/custom/path/.Rprofile")
inside _targets.R
?
@wlandau - even better! I like the idea of everything used by the tar_make()
processes being explicit in _targets.R
. If you wanted the tar_make()
processes to use the project .Rprofile
you would put source(here::here(".Rprofile"))
in _targets.R
.
I presume I would need to add something like callr_arguments = list(user_profile = FALSE)
as an argument to tar_make()
to prevent it from trying to source .Rprofile
for itself.
Hi everyone,
I am wondering what would be the recommended way to combine workflowr with renv?
Thank you for your help Amel
@AmelZulji I have been using renv
and workflowr
together for a while and it "just works". That is, there is nothing special I have to do. I use renv
like I would in any non-workflowr
project and I use workflowr
like I would in any non-renv
project.
I am not an expert in either renv
or workflowr
, so I may be missing some important point, but I treat them as totally independent, unaware of each other, and not needing to be aware of each other. This is just what I do, I wouldn't go so far as to call it "recommended".
I found the collaboration section of the renv
documentation to be helpful. When you are happy that your workflowr
project is in a good state you execute renv::snapshot()
to capture the state of the packages into the renv
lockfile. Then commit the lockfile to the git repo. When your collaborator (possibly future you) wants to run the workflowr
project code, they restore the project (including the lockfile) from the git repo and execute renv::restore()
, which installs the correct versions of the snapshotted packages.
I am wondering what would be the recommended way to combine workflowr with renv?
@AmelZulji Thanks for your question. This thread is discussing how to more tightly integrate workflowr and renv. For example, to include the renv status in the workflowr reproducibility report. This hasn't been implemented yet, but it would only be an added value. As @rgayler noted, you can use renv like you would for any other R project, and it should just work. I recommend adding the lock file renv creates to version control (e.g. git add renv.lock
or wflow_git_commit("renv.lock")
) and regularly committing any changes made by renv::snapshot()
.
Thank you very much, @rgayler and @jdblischak !
@jdblischak Found my way back to this 😄 as I'm working on analysis and using workflowr. Our lab has a huge emphasis on reproducibility so I may try a test run of what it'd look like implementing renv into workflowr as an environment manager. When I have a general mapping on that (referencing our previous conversations), I'll will post here on updates!
My suggestion is that there is a
packrat
parameter in thewflow_start()
function. The user would be able to use a booleanTRUE/FALSE
to determine whether packrat would be used with project.When using
wflow_publish()
,packrat
would take a snapshotpackrat::snapshot()
of the repository.There are a few quirks with regards to bioconductor (https://github.com/rstudio/packrat/issues/507), but I just had some success using and setting it up pretty easily.
For the sake of reproducibility, using packrat ensures that not only is the session information shared (like workflowr provides) but that also a repository or snapshot of those packages is available for re-running analyses in the future or sharing.
packageVersion("workflowr")
in R or RStudio, and copy & paste the output here.