Open AlexandreHutton opened 2 years ago
This is the behavior csodiaq uses to deal with peptides that match multiple proteins when no additional peptides can narrow the list down to one of the subset proteins. I see how this could complicate re-matching to the library for quant, so thanks for catching it.
A solution may be to add a column to the csodiaq output for "library protein"?
If I'm understanding this correctly, would sorting the `leadingProtein` proteins be sufficient? Part of the problem with matching the library to zoDIAq output files, specifically in regards to the `leadingProtein` column, is they are not expected to match. For example, if the protein grouping `3/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN` existed in the library, but the first protein `sp|Q01813|PFKAP_HUMAN` was removed by the ID Picker algorithm, we would expect the `leadingProtein` column to contain `2/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN` instead.
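To illustrate why sorting alone may not be enough in that case, here is a minimal sketch of matching by subset rather than by exact string. It assumes group strings have the `N/protein/protein/...` shape shown above; `parse_group` and `library_groups_containing` are hypothetical helper names, not part of zoDIAq, and `1/PROT_A` is a placeholder entry.

```python
from __future__ import annotations


def parse_group(group: str) -> frozenset[str]:
    """Split a group string into its member proteins, ignoring order."""
    parts = group.split("/")
    # Assumption: a leading all-digit field is the member count, not a protein.
    if parts[0].isdigit():
        parts = parts[1:]
    return frozenset(parts)


def library_groups_containing(leading_protein: str, library_groups: list[str]) -> list[str]:
    """Return every library group whose members include all proteins in leading_protein."""
    wanted = parse_group(leading_protein)
    return [g for g in library_groups if wanted <= parse_group(g)]


library = [
    "3/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN",
    "1/PROT_A",  # placeholder entry for illustration
]
reduced = "2/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN"
matches = library_groups_containing(reduced, library)
```

Here `matches` contains only the three-protein library group, even though the output string is neither identical to it nor a reordering of it. A subset match can return more than one library group, so a real lookup would probably also need the peptide to disambiguate.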
Protein names are not consistent between the library and the CsoDIAq output files, even for the same protein. This applies only to proteins with synonyms (e.g. 3/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN): the synonyms appear in a different order in the output than in the library. As a result, matching an output entry back to the library requires trying each permutation of the synonyms until one is found (assuming that the library is self-consistent).
Recommendation: ensure that the CsoDIAq protein output is consistent with the library; add an option to parse the library and fix it (e.g. order the synonyms alphabetically). This seems to be an issue exclusively for the 'leadingProtein' output field.
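As a rough sketch of the alphabetical-ordering idea, the helper below rewrites a group string so its members are sorted, which gives the library and the output the same key without trying permutations. `canonical_protein_group` is a hypothetical name, and the code assumes a leading all-digit field is the member count; it is not taken from the CsoDIAq code base.

```python
def canonical_protein_group(group: str) -> str:
    """Return the group string with its member proteins sorted alphabetically."""
    head, *rest = group.split("/")
    if not head.isdigit():
        # No count prefix: treat every field as a protein name.
        return "/".join(sorted([head, *rest]))
    # Keep the count prefix in place and sort only the members.
    return "/".join([head, *sorted(rest)])


from_library = "3/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN/sp|P08237|PFKAM_HUMAN"
from_output = "3/sp|P08237|PFKAM_HUMAN/sp|Q01813|PFKAP_HUMAN/sp|P17858|PFKAL_HUMAN"
same_key = canonical_protein_group(from_library) == canonical_protein_group(from_output)
```

Applying the same function to both the library column and the 'leadingProtein' column before joining would make the two orderings agree, as long as both sides list the same set of proteins.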