wfondrie / mokapot

Fast and flexible semi-supervised learning for peptide detection in Python
https://mokapot.readthedocs.io
Apache License 2.0

Issue with decoy prefix when trying to get protein level confidence. #114

Open louisebuur opened 10 months ago

louisebuur commented 10 months ago

Hi!

I am having an issue with getting protein level confidence.

I have an MS Amanda output file that I read with the psm_utils package and then convert to a LinearPsmDataset.

This is the file that I am using: https://drive.google.com/file/d/1PiztK5BY4byAR2Loup6kTf7j9Q4QMgGQ/view?usp=drive_link

from ms2rescore.rescoring_engines.mokapot import convert_psm_list
from psm_utils.io import read_file
import mokapot
psm_list = read_file("path_to_file.csv", filetype="msamanda")
mokapot_psms = convert_psm_list(psm_list)

I used the make_decoys function to add decoy sequences to my FASTA file:

mokapot.make_decoys(fasta="path_to_file.fasta", decoy_prefix="REV_", reverse=True, out_file="E:/test.fasta")

Then I use the add_proteins function with the parameters that correspond to the ones I used in the search:

mokapot_psms.add_proteins("E:/test.fasta", enzyme="[KR]", missed_cleavages=2, min_length=6, max_length=60)
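For readers following along, the digest settings above can be pictured with a minimal sketch of an in-silico digest. This is an illustration only, not mokapot's own implementation: the `digest` helper and the toy protein sequence are hypothetical, and it assumes the enzyme regular expression marks residues after which the protein is cleaved.

```python
import re


def digest(sequence, enzyme=r"[KR]", missed_cleavages=2, min_length=6, max_length=60):
    """Return the set of peptides from a simple in-silico digest.

    Hypothetical helper for illustration; not mokapot's implementation.
    """
    # Cleavage sites: the start, the end of every enzyme match, and the sequence end.
    sites = [0] + [m.end() for m in re.finditer(enzyme, sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))

    peptides = set()
    for i in range(len(sites) - 1):
        # Extend fragment i by up to `missed_cleavages` neighbouring fragments.
        for j in range(i + 1, min(i + missed_cleavages + 2, len(sites))):
            peptide = sequence[sites[i]:sites[j]]
            # Apply the length filter from the digest settings.
            if min_length <= len(peptide) <= max_length:
                peptides.add(peptide)
    return peptides


# Toy protein sequence (an assumption, not taken from the FASTA file in this issue).
toy_protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
peptides = digest(toy_protein)
```

A peptide reported by the search engine can only be mapped if its plain sequence appears in a set like `peptides`, so any mismatch in enzyme, missed cleavages, or length limits shows up as unmapped peptides.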

Then I want to assign confidence and print the results

confidence_result = mokapot_psms.assign_confidence()
print(confidence_result.accepted)

However, I get this error:

25362 out of 46118 peptides could not be mapped. Please check your digest settings.
ValueError: Fewer than 90% of all peptides could be matched to proteins. Please verify that your digest settings are correct.

I realized that I hadn't included the decoy_prefix, so I added it:

mokapot_psms.add_proteins("E:/test.fasta", enzyme="[KR]", missed_cleavages=2, min_length=6, max_length=60, decoy_prefix="REV_")

Then I ran this again:

confidence_result = mokapot_psms.assign_confidence()
print(confidence_result.accepted)

And then I get this error:

46118 out of 46118 peptides could not be mapped. Please check your digest settings.
ValueError: Fewer than 90% of all peptides could be matched to proteins. Please verify that your digest settings are correct.

I did double-check that the digest settings are correct. It seems that roughly half of the peptides can be mapped if I do not specify the decoy prefix in the add_proteins function, so to me it looks like there is an issue with the decoy prefix pattern.
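Two quick checks may help narrow this down; the following is a minimal sketch, not part of mokapot. It counts how many FASTA headers start with the decoy prefix and converts the counts from the two error messages into mapped fractions. The `count_decoy_headers` and `mapped_fraction` helpers and the toy FASTA text are hypothetical, and the sketch assumes the prefix sits at the very start of the header identifier.

```python
def count_decoy_headers(fasta_text, decoy_prefix="REV_"):
    """Count target and decoy entries from the first token of each header."""
    targets = decoys = 0
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # skip sequence lines
        # The identifier is the first whitespace-delimited token after '>'.
        fields = line[1:].split()
        identifier = fields[0] if fields else ""
        if identifier.startswith(decoy_prefix):
            decoys += 1
        else:
            targets += 1
    return targets, decoys


def mapped_fraction(unmapped, total):
    """Fraction of peptides that were mapped, given the unmapped count."""
    return (total - unmapped) / total


# Toy FASTA text (an assumption; not the contents of E:/test.fasta).
example_fasta = """>sp|P00001|TOY_HUMAN toy target
MKTAYIAKQR
>REV_sp|P00001|TOY_HUMAN toy decoy
MKAIYATKQR
"""
targets, decoys = count_decoy_headers(example_fasta)

# Counts taken from the two error messages in this issue.
without_prefix = mapped_fraction(25362, 46118)
with_prefix = mapped_fraction(46118, 46118)
```

With the numbers reported here, `without_prefix` comes out at about 0.45 and `with_prefix` at 0.0, both well under the 90% threshold named in the error.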

I also tried the default decoy prefix, decoy_, when creating the FASTA file with the make_decoys function, and changed the decoy prefix in the input file to match. I still got the same errors.

I am using version 0.10.0 of mokapot.

Can you give me a hint of what may be wrong here? And please let me know if I should provide further information.

Thanks in advance!

wfondrie commented 10 months ago

This indeed sounds like a problem! I've just requested access to the files so I can take a look. My guess is that perhaps the peptide strings are formatted in a way that isn't accounted for in mokapot, but I'll have to take a closer look.
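If the peptide formatting is the cause, one way to test the idea is to reduce each peptide string to its plain amino acid sequence before comparing it with the digest. The sketch below is an assumption-laden illustration: the `bare_sequence` helper is hypothetical, and the notations it handles (flanking residues, bracketed or parenthesized modifications, a trailing charge state) are guesses at common formats, not confirmed details of the files in this issue.

```python
import re


def bare_sequence(peptide):
    """Reduce an annotated peptide string to its plain amino acid sequence.

    Hypothetical helper; the handled notations are assumptions.
    """
    # Drop flanking residues written as "K.PEPTIDE.R".
    match = re.fullmatch(r"[A-Z\-]\.(.+)\.[A-Z\-]", peptide)
    if match:
        peptide = match.group(1)
    # Remove bracketed or parenthesized modification annotations.
    peptide = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", peptide)
    # Remove a trailing charge state such as "/2".
    peptide = re.sub(r"/\d+$", "", peptide)
    # Keep only uppercase letters, dropping terminal '-' and similar marks.
    return re.sub(r"[^A-Z]", "", peptide)


# Toy peptide strings (assumptions; not taken from the shared file).
examples = [
    "K.PEPT[+79.966]IDEK.A",
    "[Acetyl]-PEPTIDEK/2",
    "PEPTIDEK",
]
cleaned = [bare_sequence(p) for p in examples]
```

If the cleaned sequences map to the digested FASTA while the raw strings do not, that would point to the peptide format rather than the digest settings.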

Would it be alright to post examples from the pin file here as future documentation for the issue?

louisebuur commented 10 months ago

Hi Will, you should have access to the file now. And yes, of course, feel free to post examples :)

The file I analysed with MS Amanda is not mine, but obtained from this data set

Thanks!