
WOMBAT-P Pipelines

Summary

wombat-p pipelines is a bioinformatics analysis pipeline that bundles different workflows for the analysis of label-free proteomics data, with the purpose of comparison and benchmarking. It supports input files in the proteomics metadata standard SDRF.

It is aimed both at experienced end users who want to test different workflows and configurations, and at developers who want to, for example, evaluate new software in a workflow setting. Contributions to this project are most welcome.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. We used one of the nf-core templates.

This work contains four major workflows for the analysis of label-free proteomics data originating from LC-MS experiments.

  1. MaxQuant + NormalyzerDE
  2. SearchGUI + Proline + PolySTest
  3. Compomics tools + FlashLFQ + MSqRob
  4. Tools from the Trans-Proteomic Pipeline + ROTS

Initialization and parameterization of the workflows is based on tools from the SDRF pipelines, the ThermoRawFileParser with our own contributions, and additional programs from the wombat-p organization [https://github.com/wombat-p/Utilities] as well as our fork. This includes setting a generalized set of data analysis parameters and the calculation of multiple benchmarks.

Usage

Installation and testing

  1. Install Nextflow (>=23.04.0)

  2. Install Docker or Singularity (you can follow this tutorial). You can also use Conda, both to install Nextflow itself and to manage software within pipelines; please only use it within pipelines as a last resort (see docs).

  3. Download the pipeline and test it on a minimal dataset with a single command:

    git clone https://github.com/wombat-p/WOMBAT-Pipelines
    cd WOMBAT-Pipelines
    nextflow run main.nf -profile test,YOURPROFILE

    Alternatively, download and unpack the repository archive with wget, curl, or a similar tool.

Configuration and execution

  1. Set up the system for running the analysis

Note that some form of configuration will be needed so that Nextflow knows how to fetch the required software. This is usually done in the form of a config profile (YOURPROFILE in the example command above). You can chain multiple config profiles in a comma-separated string; a short example follows the list below.

  • The pipeline comes with config profiles called docker, singularity and conda which instruct the pipeline to use the named tool for software management. For example, -profile test,docker.
  • If you are using docker, your host system might need a kernel parameter that prevents the mono-based programs from failing when running large data sets on multiple threads. For that, please run sudo sysctl -w vm.max_map_count=262144
  • If you are using singularity, setting the NXF_SINGULARITY_CACHEDIR or singularity.cacheDir Nextflow options enables you to store and re-use the images from a central location for future pipeline runs.
  • If you are using conda, it is highly recommended to use the NXF_CONDA_CACHEDIR or conda.cacheDir settings to store the environments in a central location for future pipeline runs.
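
As an example of chaining profiles and reusing a container cache, here is a minimal sketch assuming Singularity (the cache path is a placeholder):

    export NXF_SINGULARITY_CACHEDIR=/path/to/container_cache   # shared image cache reused across runs
    nextflow run main.nf -profile test,singularity
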
  2. Start running your own analysis!

For a detailed explanation of the parameters, see below. Not all parameters are needed.

nextflow run main.nf --sdrf experimental_metadata.sdrf --fasta your_fasta_file.fasta --parameters your_parameters_yaml --raws thermo_raw_files --exp_design simple_experimental_design --workflow [other more specific parameters] -profile <docker/singularity/conda>

Input options and parameters

WOMBAT-P can run the workflows from different (minimal) input combinations:

1) With an SDRF file (raw files can be given as a parameter or are downloaded from the locations specified in the SDRF file):
  a) SDRF file + fasta file
  b) SDRF file + fasta file + experimental design file (overwrites the experimental design in the SDRF)
  c) SDRF file + fasta file + experimental design file + yaml parameter file (overwrites default and SDRF parameters)

2) Without an SDRF file:
  a) Raw files + fasta file + yaml parameter file
  b) Raw files + fasta file + yaml parameter file + experimental design file
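
A minimal sketch for case 1a), assuming a locally stored SDRF file and a UniProt FASTA database (file names are placeholders):

    nextflow run main.nf --sdrf sdrf.tsv --fasta uniprot_proteome.fasta -profile docker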

-profile Set the profile and environment as described above

--sdrf This is a tab-delimited file containing details about the experimental design; it can also include all parameters given in the --parameters yaml file. Several data sets in the PRIDE repository come with an SDRF file, which can then be found together with the other deposited files. For the dataset PXD001819, this would be https://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/sdrf.tsv. See also the collection of annotated SDRF files: https://github.com/bigbio/proteomics-metadata-standard/tree/master/annotated-projects and the description of the extended SDRF including data analysis parameters: https://github.com/bigbio/proteomics-metadata-standard/blob/master/sdrf-proteomics/Data-analysis-metadata.adoc

--fasta You also need a fasta database to run the database search in the workflows. Standard databases can be downloaded from UniProt

--parameters When deviating from the standard settings, use a yaml file containing the new parameter settings. For more details about the different parameters and an example file, see https://github.com/bigbio/proteomics-metadata-standard/blob/master/sdrf-proteomics/Data-analysis-metadata.adoc. As not all of these parameters are available for all workflows, see the table below for an overview.

--raws If no SDRF file containing the paths to the raw data files (Thermo raw format) is given, or if you have already downloaded the files, specify a wildcard pattern (e.g. "*" or "?") to access the files on your system. We recommend putting this parameter in 'single quotes', as you might otherwise run into errors when using wildcards.
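
A sketch for input case 2a), combining raw files with a FASTA database and a parameter file (paths are placeholders; note the single quotes around the wildcard):

    nextflow run main.nf --raws '/data/my_project/*.raw' --fasta uniprot_proteome.fasta \
        --parameters my_parameters.yml -profile docker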

--exp_design An experimental design is automatically calculated from differences in the samples in the SDRF file. Alternatively, provide a tab-separated file with the five columns _rawfile, _expcondition, biorep, fraction and techrep.
_rawfile: raw file names without path. Incorrect or incomplete names will lead to errors.
_expcondition: arbitrary names for the sample groups. Files with the same sample group name will be considered replicates.
biorep: biological replicate with numbering starting with 1.
fraction: fraction number with numbering starting with 1.
techrep: technical replicate with numbering starting with 1.
Note: The numbering needs to be consistent and each line needs to be unique in the combination of _expcondition, biorep, fraction and techrep. See example.
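
In addition to the linked example, here is a minimal sketch that writes such a file for two sample groups with two biological replicates each (raw file and group names are placeholders; check the linked example for the exact header spelling):

    # Write a tab-separated experimental design file with placeholder names.
    printf '%s\t%s\t%s\t%s\t%s\n' \
      _rawfile _expcondition biorep fraction techrep \
      groupA_rep1.raw A 1 1 1 \
      groupA_rep2.raw A 2 1 1 \
      groupB_rep1.raw B 1 1 1 \
      groupB_rep2.raw B 2 1 1 > exp_design.tsv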

--workflow Instead of running 'all' workflows (default), run only one of 'maxquant', 'proline', 'compomics' or 'tpp'

other parameters:

--comps (only maxquant workflow): Provide contrasts (specific comparisons) for the statistical tests. This is a list of comma-separated group names, e.g. "B-A,C-A" when having the three sample groups A, B and C

--proline_engine (only proline workflow): Define the search engine for the database search. Can be one or multiples of "xtandem", "msgf", "ms-amanda", "tide", "comet", "myrimatch", "meta_morpheus" and "andromeda". Note that not all engines necessarily work well with each data set.
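
A sketch restricting the run to the Proline workflow with a single search engine (input file names are placeholders):

    nextflow run main.nf --sdrf sdrf.tsv --fasta uniprot_proteome.fasta \
        --workflow proline --proline_engine xtandem -profile docker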

You can add other Nextflow parameters as described extensively here.

Valid data analysis parameters per workflow

The following parameters can be provided via the parameter yaml file (--parameters flag)

As not all parameters are available for each workflow, the last columns describe their applicability. Here, TRUE means that the parameter is available and can be modified.

| parameter | type | sdrf name | default | maxquant | proline | compomics | tpp |
|---|---|---|---|---|---|---|---|
| fixed_mods | ontology | modification parameters | NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4 | TRUE | TRUE | TRUE | TRUE |
| variable mods | ontology | modification parameters | NT=oxidation;MT=variable;TA=M;AC=UNIMOD:35 | TRUE | TRUE | TRUE | TRUE |
| precursor_mass_tolerance | string | precursor mass tolerance | 30 ppm | TRUE | TRUE | TRUE | TRUE |
| fragment_mass_tolerance | string | fragment mass tolerance | 0.05 Da | TRUE | TRUE | TRUE | TRUE |
| enzyme | ontology | cleavage agent details | Trypsin | TRUE | TRUE | TRUE | TRUE |
| fions | class | forward ions | b | FALSE | TRUE | TRUE | TRUE |
| rions | class | reverse ions | y | FALSE | TRUE | TRUE | TRUE |
| isotope_error_range | integer | isotope error range | 0 | FALSE | TRUE | TRUE | TRUE |
| add_decoys | boolean | add decoys | true | FALSE | TRUE | TRUE | TRUE |
| num_hits | integer | num peptide hits | 1 | FALSE | FALSE | FALSE | FALSE |
| allowed_miscleavages | integer | allowed miscleavages | 1 | TRUE | TRUE | TRUE | TRUE |
| min_precursor_charge | integer | minimum precursor charge | 2 | FALSE | TRUE | TRUE | TRUE |
| max_precursor_charge | integer | maximum precursor charge | 3 | TRUE | TRUE | TRUE | TRUE |
| min_peptide_length | integer | minimum peptide length | 8 | TRUE | TRUE | TRUE | TRUE |
| max_peptide_length | integer | maximum peptide length | 12 | FALSE | TRUE | TRUE | TRUE |
| max_mods | integer | maximum allowed modification | 4 | TRUE | TRUE | TRUE | TRUE |
| ident_fdr_psm | float | fdr on psm level | 0.01 | TRUE | TRUE | TRUE | TRUE |
| ident_fdr_peptide | float | fdr on peptide level | 0.01 | TRUE | TRUE | TRUE | TRUE |
| ident_fdr_protein | float | fdr on protein level | 0.01 | TRUE | TRUE | Not clear | Not clear |
| match_between_runs | boolean | run match between runs | true | TRUE | FALSE | TRUE | Not available |
| protein_inference | class | protein inference method | unique | TRUE | FALSE | TRUE | TRUE |
| quantification_method | class | quantification method | intensity | FALSE | FALSE | FALSE | FALSE |
| summarization_proteins | class | summarization of proteins method | sum_abs | FALSE | FALSE | FALSE | FALSE |
| min_num_peptides | integer | minimum number of peptides per protein | 2 | TRUE | TRUE | TRUE | TRUE |
| summarization_psms | class | summarization of psms method | sum_abs | FALSE | FALSE | FALSE | FALSE |
| quant_transformation | class | transformation of quantitative values | log | FALSE | FALSE | FALSE | FALSE |
| normalization_method | class | normalization method | median | TRUE | FALSE | FALSE | FALSE |
| run_statistics | boolean | run statistical tests | true | TRUE | TRUE | TRUE | TRUE |
| fdr_method | class | method for correction of multiple testing | benjamini-hochberg | FALSE | FALSE | FALSE | FALSE |
| fdr_threshold | float | threshold for statistical test fdr | 0.01 | By filtering the results | By filtering the results | By filtering the results | By filtering the results |
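
As a hedged sketch, a parameter YAML file overriding two of the defaults from the table above could look as follows; the flat key/value layout is an assumption, so verify it against the linked Data-analysis-metadata description and its example file:

    cat > my_parameters.yml <<'EOF'
    # assumed flat key/value layout; verify against the linked example parameter file
    precursor_mass_tolerance: 10 ppm
    allowed_miscleavages: 2
    EOF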

Workflow output

Intermediate and final files are provided in the results folder or the folder specified via the --outdir parameter.

On top of the workflow-specific output, a standardized tabular format on both peptide (stand_pep_quant_merged.csv) and protein (stand_prot_quant_merged.csv) level is given.
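
For example, after a finished run the standardized tables can be located like this (assuming the default results folder; substitute your --outdir value otherwise):

    find results -name "stand_*_quant_merged.csv"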

For each of the workflows, WOMBAT-Pipelines calculates the same set of benchmarks, allowing a more systematic and thorough comparison between workflows and/or between different values of the data analysis parameters. For details about the benchmarks, see the following table:

| Category | Aspect | Subgroup | Name | Name in JSON file | Definition | Value |
|---|---|---|---|---|---|---|
| Functionality | Traceability | Spectra | Traceable spectra | TraceableSpectra | Results traceable to original spectra | Y/N |
| Functionality | Traceability | Spectra | Universal spectrum identifiers | UniversalSpectumIdentifiers | Workflow generates USIs (Universal Spectrum Identifier) | Y/N |
| Functionality | Traceability | Spectra | Peptide to spectra | PeptideToSpectra | Corresponding spectrum numbers/ids available from peptide level | Y/N |
| Functionality | Traceability | Spectra | Protein to spectra | ProteinToSpectra | Corresponding spectrum numbers/ids available from protein level | Y/N |
| Functionality | Traceability | File names | Results to raw files | ResultsToRawFiles | Raw input file names preserved in tables on PSM/peptide/protein level | Y/N |
| Functionality | Traceability | File names | Public raw files | PublicRawFiles | Raw files publicly available | Y/N |
| Functionality | Traceability | Parameters | Experimental design | ExperimentalDesign | Biological and technical replicates can be identified in results | Y/N |
| Functionality | Performance | Identification | PSM number | PSMNumber | Number of identified PSMs passing preset FDR | Integer |
| Functionality | Performance | Identification | Peptide number | PeptideNumber | Number of unique peptide identifications passing preset FDR | Integer |
| Functionality | Performance | Identification | Protein number | ProteinNumber | Number of unique protein identifications passing preset FDR | Integer |
| Functionality | Performance | Identification | Protein group number | ProteinGroupNumber | Number of different protein groups passing preset FDR | Integer |
| Functionality | Performance | Identification | Peptide coverage | PeptideCoverage | Percentage of peptides identified in all samples | Double |
| Functionality | Performance | Identification | Protein coverage | ProteinCoverage | Percentage of proteins identified in all samples | Double |
| Functionality | Performance | Identification | Peptides per protein | PeptidesPerProtein | Distribution of peptides per protein group | Set of Integer |
| Functionality | Performance | Quantification | Correlation peptides | CorrelationPeptides | Mean of Pearson correlation of peptide abundances between replicates (log2-scale) | Double |
| Functionality | Performance | Quantification | Correlation proteins | CorrelationProteins | Mean of Pearson correlation of protein abundances between replicates (log2-scale) | Double |
| Functionality | Performance | Quantification | Number peptides | NumberOfPeptides | Number of quantified peptides with at least 50% coverage | Integer |
| Functionality | Performance | Quantification | Number protein groups | NumberOfProteinGroups | Number of quantified protein groups with at least 50% coverage | Integer |
| Functionality | Performance | Quantification | Dynamic peptide range | DynamicPeptideRange | Difference of peptide abundance (top 5% versus bottom 5% quantile) | Double |
| Functionality | Performance | Quantification | Dynamic protein range | DynamicProteinRange | Difference of protein abundance (top 5% versus bottom 5% quantile) | Double |
| Functionality | Performance | Statistics | Differentially regulated peptides 5% | DifferentialRegulatedPeptides5Perc | Number of differentially regulated peptides with FDR below 5% | Set of Double |
| Functionality | Performance | Statistics | Differentially regulated proteins 5% | DifferentialRegulatedProteins5Perc | Number of differentially regulated proteins with FDR below 5% | Set of Double |
| Functionality | Performance | Statistics | Differentially regulated peptides 1% | DifferentialRegulatedPeptides1Perc | Number of differentially regulated peptides with FDR below 1% | Set of Double |
| Functionality | Performance | Statistics | Differentially regulated proteins 1% | DifferentialRegulatedProteins1Perc | Number of differentially regulated proteins with FDR below 1% | Set of Double |
| Functionality | Performance | Statistics | Missing peptide values | MissingPeptideValues | Percentage of missing values in entire peptide set | Double |
| Functionality | Performance | Statistics | Missing protein values | MissingProteinValues | Percentage of missing values in entire protein set | Double |
| Functionality | Performance | Digestion | Digestion efficiency | Efficiency | Distribution of number of miscleavages | Set of Double |
| Functionality | Performance | PTMs | PTM Distribution | PTMDistribution | Percentage of peptides with PTM xyz | Set of Double |
| Functionality | Performance | PTMs | PTM Occupancy | PTMOccupancy | Distribution of peptides with 1, 2, ... PTMs | Set of Double |
| Functionality | Parameter | Identification | Database size | DatabaseSize | Number of entries in fasta file | Integer |
| Functionality | Parameter | Identification | Canonical sequences | CanonicalSequences | Database includes canonical sequences | Y/N |
| Functionality | Parameter | Identification | PTM localization | PTMLocalization | Is PTM localization scoring software included in the workflow | Y/N |

Contribute

Contributions to change and modify the workflows are most welcome. For this, please create a fork and add your changes. We strongly recommend reaching out to us, either via email or by creating an issue in this repository, so that we can help with integrating the new implementation.

Credits

nf-core/wombat was originally written by the members of the ELIXIR Implementation study Comparison, benchmarking and dissemination of proteomics data analysis pipelines under the lead of Veit Schwämmle, with major contributions from David Bouyssié and Fredrik Levander.

Citations

Preprint available: https://www.biorxiv.org/content/10.1101/2023.10.02.560412v1

As the workflows are using an nf-core template, we refer to the publication:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.