xomicsdatascience / zoDIAq

Cosine Similarity Optimization for DIA qualitative and quantitative analysis
MIT License

Library table lookup + protein names #31

Open AlexandreHutton opened 2 years ago

AlexandreHutton commented 2 years ago

Protein names are not consistent between the library and the CsoDIAq output files, even for the same protein. This applies only to proteins with synonyms (e.g. 3/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN); the synonyms can appear in a different order. As a result, matching an output entry back to the library requires trying each permutation of the synonyms until one is found (assuming that the library is self-consistent).

Recommendation: ensure that the CsoDIAq protein output is consistent with the library; add an option to parse the library and fix it (e.g. order the synonyms alphabetically). This seems to be an issue exclusively for the 'leadingProtein' output field.
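
To illustrate the alphabetical-ordering idea, here is a minimal sketch. It assumes the group string has the layout seen in the example above (a member count, then the members, all separated by `/`). The function name `normalize_protein_group` is hypothetical and is not part of CsoDIAq.

```python
def normalize_protein_group(group: str) -> str:
    """Return the protein group with its members sorted alphabetically.

    Assumes the "N/member1/member2/..." layout from the example above;
    anything that does not fit that layout is returned unchanged.
    """
    count, *members = group.split("/")
    if not count.isdigit() or int(count) != len(members):
        # Single protein or unexpected layout: leave it untouched.
        return group
    return "/".join([count, *sorted(members)])


library_name = "3/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN"
output_name = "3/sp|P08237|PFKAM_HUMAN/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN"

# Both orderings collapse to the same key, so one lookup is enough.
print(normalize_protein_group(library_name) == normalize_protein_group(output_name))
```

Applying the same normalization to both the library and the output would replace the permutation search with a single lookup, provided both sides contain the same set of synonyms.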

jessegmeyerlab commented 2 years ago

This is the behavior csodiaq uses to deal with peptides that match multiple proteins when no additional peptides can narrow the list down to one of the proteins in the subset. I see how this could complicate re-matching to the library for quant, so thanks for catching it.

A solution may be to add a column to the csodiaq output for "library protein"?

CCranney commented 1 year ago

If I'm understanding this correctly, would sorting the leadingProtein proteins be sufficient? Part of the problem with matching the library to zoDIAq output files, specifically with regard to the leadingProtein column, is that the two are not expected to match. For example, if the protein grouping 3/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN existed in the library, but the first protein sp|Q01813|PFKAP_HUMAN was removed by the ID Picker algorithm, we would expect the leadingProtein column to contain 2/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN instead.
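
Given that, an exact string match (even after sorting) would miss the reduced group. One rough way to relate the two, sketched here under the assumption that group strings follow the `N/member/...` layout above, is to compare them as sets and check that the leadingProtein members are a subset of the library group. The helper names `protein_set` and `is_consistent_with_library` are hypothetical and not part of zoDIAq.

```python
from typing import FrozenSet


def protein_set(group: str) -> FrozenSet[str]:
    """Split an "N/member1/member2/..." string into its set of members.

    Assumed layout only; a string without a numeric count prefix is
    treated as a single protein.
    """
    count, *members = group.split("/")
    if count.isdigit() and members:
        return frozenset(members)
    return frozenset([group])


def is_consistent_with_library(leading: str, library: str) -> bool:
    """True if every leadingProtein member also appears in the library group."""
    return protein_set(leading) <= protein_set(library)


library_group = "3/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN"
leading_group = "2/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN"

print(is_consistent_with_library(leading_group, library_group))
```

A subset test can match more than one library group, so it narrows the candidates rather than guaranteeing a unique library entry.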