The use of in-house scripting is alarming

ypriverol commented 5 years ago

The number of in-house scripting is alarming in computational proteomics by 2019. The use of in-house scripts is the source of:

[ ] lack of traceability, reproducibility
[ ] the more complex part and decision prone of the data analysis is the downstream analysis if is not public the way the analysis has been performed we can't guarantee the results are right.

jspmccain commented 5 years ago

I'm not sure I agree with this. Tools like pyteomics, pyopenms, or MSnbase are all built to write scripts to do analyses on a case-by-case basis!

I think rather than discouraging proteomics researchers from using or writing in-house scripts (mainly because I think it will inevitably happen regardless), encouraging best practices with writing in-house scripts might be a better option. Coding defensively, commenting code and documenting, requiring in-house scripts to be hosted online (or as a supplement!) - this might all be more constructive. This would also enable researchers to actually learn the underpinnings of the software they use, and then maybe lead to more development!

To your second point, can we ever guarantee the results are right? I think there are many more ways to misuse a software that to use it as intended. And I think there are plenty of ways to misuse even well established pipelines!

ypriverol commented 5 years ago

Agree, with you @jspmccain. The use of in-house scripts meaning that we have some customized part of the analysis that is not public making difficult to reproduce the results. I'm not against R/python scripting, I'm against that they are not publically available.

We should probably go deep in this direction and as a result of this study propose a mechanism to easily provide the code and some guidelines about how to executed attached to the research papers. Also, talk to the journals and raise the point that this problem continues happening in the field.

We will recommend that the code gets deposited in a public code repository like GitHub which will enable us in the future to trace usability for python/R/Java libraries. If we also continue adding the code as supplementary (non-downloadable/searchable) materials we will never be able to trace the impact of our libraries.

What do you think @lgatto @mvaudel

jspmccain commented 5 years ago

Great point about the impact of our libraries. I guess one thing I've been thinking about also: do we deposit code to perfectly reproduce an analysis? Or do we deposit it so there is a more clear of what exactly we've done? (probably both!)

Ideally I think it's more the former. But what about long term? Surely some libraries will change and something won't work in the code, or it will become too difficult to reconstruct the exact computing environment. This makes me lean towards the latter goal, such that the ultimate goal of depositing code is a more clear and explicit description of how the data were analyzed.

I would love to hear thoughts on this!

ypriverol commented 5 years ago

Great point about the impact of our libraries. I guess one thing I've been thinking about also: do we deposit code to perfectly reproduce an analysis? Or do we deposit it so there is a more clear of what exactly we've done? (probably both!)

Here, I think that by forcing the researchers to deposit the code in GitHub is the starting point for guidelines of documenting, and implementing best practices on code position.

Ideally I think it's more the former. But what about long term? Surely some libraries will change and something won't work in the code, or it will become too difficult to reconstruct the exact computing environment. This makes me lean towards the latter goal, such that the ultimate goal of depositing code is a more clear and explicit description of how the data were analyzed.

For the long-term we really one to preserve what the core libraries are doing and trace how they are used as you said MSnbase and other python libraries. Is more difficult to know what users are doing with your library if this is only included in the supplementary information. I guess is also a way to recover feedback about the libraries and add some of those analyses into them. Another side effect of this practice is that encourage researchers to do properly scripting code because they know their code will be really exposed.

Finally, we can find a way to pack in containers, for example, the deposited code with the dependencies and move on the field.

I would love to hear thoughts on this!

higsch commented 5 years ago

A suggestion would be to pack a container for each finalised and published paper. The container should contain

code to do the data analysis and to produce the figures
all the dependencies With that even in-house scripted stuff will persist and use is traceable through Dockerfiles.

So to wrap it up: In my opinion an ideal case looks like this:

paper published
all the code to generate the paper analysis/figures is on GitHub
there's a Dockerfile existing
with that I can build a container and rerun the analysis / reproduce the figures

With that, the work is reproducible, comprehensible and documented.

jspmccain commented 5 years ago

I think we can also get some good recommendations from 'Good enough practices in scientific computing' (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510) and 'Best practices in scientific computing' (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745).

I bring this up because perhaps if the recommendation is for the 'best case scenario' then people might give up right away, and it's helpful to think of what would be 'good enough' (or at least a step in the right direction!).

ypriverol commented 5 years ago

You are right. For me first step is publish your code in github.

higsch commented 5 years ago

And both of the papers do not mention (docker) packaging, which I find quite essential. Probably, it is to complicated to assemble Dockerfiles, yet.

prvst commented 5 years ago

I'm leaving this link here just in case you missed. I think it fits the discussion.

https://elifesciences.org/labs/ad58f08d/introducing-elife-s-first-computationally-reproducible-article

higsch commented 5 years ago

It's a nice effort, but have you read the section "The technical magic behind the reproducible article"? The complexity is quite daunting. It seems easier and more applicable to me to just make a github repo with a Dockerfile for each paper. No matter where it will be published.

prvst commented 5 years ago

Yes, it also brings extra costs to the journal to maintain something like this that can end up being added down to the publication fee.

ypriverol / CompProt2019-paper

The use of in-house scripting is alarming #13