Make tools for statistical tests exchangeable

veitveit commented 7 months ago

Description of feature

Instead of attaching one specific statistical test to a workflow, let the tests run on any of the generalized stand_pep/stand_prot files and then provide an updated version of these files containing the p-values and FDRs, as well as the log-ratios.
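
As a rough illustration of the FDR column such an updated file would carry, here is a minimal sketch of a Benjamini-Hochberg adjustment in plain Python. The choice of Benjamini-Hochberg and the function name `benjamini_hochberg` are assumptions made for illustration; the issue does not prescribe a correction method, and each statistical tool brings its own.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: every entry is a valid p-value (no missing values).
    """
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

Proteins without a p-value (for example, too few observations) would need to be filtered out before the call and re-inserted afterwards.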

Questions:

Are p-values really needed?
Include normalization?
Add MSStats?
Do statistical tests (only) need a simplified experimental design file?

Current state:
https://github.com/wombat-p/WOMBAT-Pipelines/tree/mutli_stat_tests

Proline workflow nearly done
Compomics needs conversion to standard format + change to msqrob2
TPP needs conversion to standard format and adaptation of ROTS
MaxQuant needs conversion to standard format and adaptation of Normalyzer

Rough working plan

[ ] Rearrange all workflow modules and provide standard output before running the tests (preferably including protein, peptide and ion level).
[ ] Change to msqrob2
[ ] Adapt modules for statistical testing to read and write standard format
[ ] Add parameter to select any of the current four options

@veitveit @wraff

veitveit commented 7 months ago

To get started with the converters, this is my suggestion for the standardized output format, meant as input for the statistical tests.

Experimental design
The experimental design file will already exist and has the format given in the README of https://github.com/wombat-p/WOMBAT-Pipelines
The column "exp_condition" in this file is crucial, as it defines the column names in the standardized format.

Sample nomenclature:
For each of the files, fractions will be summarized into "samples". Each sample is then named by the value in "exp_condition" from the experimental design file plus "_" and the number of the biological/technical replicate, giving column names of the form "INFOTYPE_EXPCOND_BIOREP". For example: "number_of_peptides_100.amol_3"

General comments:
- Different fractions should be merged before providing the standardized format.
- Special characters in column names can complicate things quite a lot, as can names starting with a number
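
The naming scheme and the warning about special characters can be sketched as follows. The helper names `sample_column` and `is_awkward`, and the rule for what counts as a "safe" name (letters, digits, "_" and ".", not starting with a digit), are assumptions for illustration and not part of the proposal.

```python
import re

# Assumption: only letters, digits, "_" and "." are considered safe,
# and a safe name must not start with a digit.
SAFE_NAME = re.compile(r"^[A-Za-z_.][A-Za-z0-9_.]*$")


def sample_column(info_type, exp_condition, bio_rep):
    """Build a column name of the form INFOTYPE_EXPCOND_BIOREP."""
    return f"{info_type}_{exp_condition}_{bio_rep}"


def is_awkward(name):
    """Flag names with special characters or a leading digit."""
    return SAFE_NAME.match(name) is None


print(sample_column("number_of_peptides", "100.amol", 3))
# number_of_peptides_100.amol_3
```

A converter could run `is_awkward` over the values of "exp_condition" and warn early, rather than failing later inside a statistics module.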

Protein level file stand_prot_quant.csv
The file should contain the following columns:

"protein_group": UniProt accession numbers (no "sp|" or other info). If multiple, then join them using ";"
"number_of_peptides_...": Number of different peptides used for the identification/quantification. "..." is the sample name as specified above.
"abundance_...": log-transformed protein abundance. "..." is the sample name as specified above.
Others (optional): protein description, contaminant, ...
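
A converter could check its own output against this column list before handing it to a statistics module. The sketch below assumes the "number_of_peptides_" prefix from the nomenclature example above; the function name `missing_prot_columns`, the sample names and the data row are made up for illustration.

```python
import csv
import io


def missing_prot_columns(header, samples):
    """Return required columns that are absent from the header."""
    required = ["protein_group"]
    for sample in samples:
        required.append(f"number_of_peptides_{sample}")
        required.append(f"abundance_{sample}")
    return [col for col in required if col not in header]


# Hypothetical two-sample file; values are made up for illustration.
example = io.StringIO(
    "protein_group,number_of_peptides_A_1,abundance_A_1,abundance_A_2\n"
    "P12345;Q67890,4,21.3,20.9\n"
)
header = next(csv.reader(example))
print(missing_prot_columns(header, ["A_1", "A_2"]))
# ['number_of_peptides_A_2']
```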

Peptide level file stand_pep_quant.csv
The file should contain the following columns:

"modified_peptide": peptide sequence with modifications given via ProForma nomenclature (https://github.com/HUPO-PSI/ProForma). This means the modification is given by its Unimod Interim name in brackets after the modified amino acid. Fixed modifications should not be included.
"protein_group": see protein level file
"charge": charge state(s) of peptide, joined by ";" in the case of multiples
"number_of_psms_...": Number of PSMs per sample. "..." is the sample name as specified above.
"abundance_...": Abundance/intensity that is not (!) log-transformed. "..." is the sample name as specified above.
Others (optional): miscleavages, ...
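
For the "modified_peptide" column, a converter has to turn whatever the upstream tool reports into the bracket notation. A minimal sketch, assuming the tool output has already been reduced to a plain sequence plus a map from 1-based residue position to modification name (the function name `to_proforma` and this input shape are assumptions; terminal modifications are not handled):

```python
def to_proforma(sequence, modifications):
    """Insert [Name] after each modified residue.

    `modifications` maps 1-based residue positions to modification
    names. Fixed modifications should be left out by the caller.
    """
    parts = []
    for position, residue in enumerate(sequence, start=1):
        parts.append(residue)
        if position in modifications:
            parts.append(f"[{modifications[position]}]")
    return "".join(parts)


print(to_proforma("EMEVEESPEK", {2: "Oxidation", 7: "Phospho"}))
# EM[Oxidation]EVEES[Phospho]PEK
```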

Ion level file stand_ion_quant.csv (optional, mainly to be able to send the output to ProteoBench):
Same as peptide level file, but with charge states separated to represent the peaks in the chromatogram
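
The relation between the two files can be sketched in the other direction: collapsing ion-level rows into peptide-level rows. Summing the abundances of the charge states is an assumption here (the proposal does not say how charge states are combined), as are the function name `ions_to_peptides` and the row-as-dict representation. Missing abundances are not handled.

```python
from collections import defaultdict


def ions_to_peptides(ion_rows, abundance_columns):
    """Collapse ion-level rows into one row per modified peptide.

    Assumption: abundances of charge states are summed (values are
    not log-transformed at this level).
    """
    charges = defaultdict(list)
    sums = defaultdict(lambda: defaultdict(float))
    for row in ion_rows:
        key = (row["modified_peptide"], row["protein_group"])
        charges[key].append(str(row["charge"]))
        for col in abundance_columns:
            sums[key][col] += float(row[col])
    peptides = []
    for key, charge_list in charges.items():
        peptide = {
            "modified_peptide": key[0],
            "protein_group": key[1],
            # Charge states joined by ";" as in the peptide level file.
            "charge": ";".join(charge_list),
        }
        peptide.update(sums[key])
        peptides.append(peptide)
    return peptides
```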

veitveit commented 7 months ago

examples_PXD011153.zip

And here are the example files for FlashLFQ and the TPP output generated with my own scripts. The TPP output seems to be mostly complete, although it is missing a good way to deal with the modifications.

veitveit commented 6 months ago

@wraff Sorry, I think we need a small correction for the column names such as "abundance_...", so that they include the technical replicates: "INFOTYPE_EXPCOND_BIOREP_TECHREP".
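
With the technical replicate appended, a statistics module can recover the four parts from a column name by splitting from the right. This sketch assumes that only INFOTYPE may contain underscores (so an "exp_condition" value with an underscore would break it) and that both replicate fields are integers; the function name `parse_sample_column` is hypothetical.

```python
def parse_sample_column(name):
    """Split INFOTYPE_EXPCOND_BIOREP_TECHREP into its four parts.

    Assumption: only INFOTYPE may contain underscores, so the last
    three underscores are taken as the separators.
    """
    parts = name.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected column name: {name!r}")
    info_type, exp_condition, bio_rep, tech_rep = parts
    return info_type, exp_condition, int(bio_rep), int(tech_rep)


print(parse_sample_column("number_of_peptides_100.amol_3_1"))
# ('number_of_peptides', '100.amol', 3, 1)
```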

wombat-p / WOMBAT-Pipelines

Make tools for statistical tests exchangeable #17
