wfondrie / mokapot

Fast and flexible semi-supervised learning for peptide detection in Python
https://mokapot.readthedocs.io
Apache License 2.0
40 stars 14 forks source link

Can't get protein level confidence. "IndexError: index -3 is out of bounds for axis 0 with size 1" #113

Open roushrsh opened 8 months ago

roushrsh commented 8 months ago

I'm trying to run mokapot on this file (TestAX.pin): https://drive.google.com/file/d/1T4447c9Y24hx3qHla_cW55HEg4A_XEuW/view?usp=sharing

Which works great to give peptide confidence:

psms = mokapot.read_pin("TestAX.pin")
results, models = mokapot.brew(psms)
print(results)
A mokapot.confidence.LinearConfidence object:
    - PSMs at q<=0.01: 83617
    - Peptides at q<=0.01: 42508

I've then shuffled the peptides myself to make decoys which I would like to provide.

I've tried keeping them either their own row: https://drive.google.com/file/d/1mP-nsqDAER2JcxoUxHc1H6Ed5qx6SCH7/view?usp=sharing or grouped into one string so mokapot does the digesting (can I have this not done?) https://drive.google.com/file/d/1_A-q9rqytLr9-YYrW6UD69jrtf0APaXA/view?usp=sharing

I do :

psms.add_proteins('Protein.csv',
    enzyme="[KR]",
    decoy_prefix="decoy_",
    missed_cleavages=2,
                )

then run mokapot.brew(psms) again.

Regardless, in case 1)

"WARNING:mokapot.picked_protein: 149241 out of 317925 peptides could not be mapped. Please check your digest settings." 

and I get

ValueError: Fewer than 90% of all peptides could be matched to proteins. Please verify that your digest settings are correct."

and in case 2)

WARNING:mokapot.picked_protein:11128 out of 317925 peptides could not be mapped. Please check your digest settings.

and I get the error:

    258 scores[rights == 0] = g[0] - (medians[0] - scores[rights == 0]) * derr # reuse "scores" array to save memory
    260 derl = (g[-1] - g[-2]) / (medians[-1] - medians[-2]) + (medians[-1] - medians[-2]) / 6 * gamma[-3]
IndexError: index 0 is out of bounds for axis 0 with size 0"

Any suggestions?

Thanks

wfondrie commented 8 months ago

Hi @roushrsh 👋

Can you please provide the version of mokapot that you're using and please provide the full log? You should be able to find the version with:

import mokapot
print(mokapot.__version__)

To enable detailed logging within your python session, you can add:

import logging
logging.basicConfig(level=logging.INFO)

I'll do my best to answer your questions without this information for now though.

For protein-level confidence estimates, mokapot internally digests proteins from a FASTA file; the CSV files that you provided above are not supported. The protein grouping is performed based on all theoretical digested peptides for a protein sequence, so I would generally advise letting mokapot handle it. However, if you really want to provide your own sequences, you can do so by initializing the mokapot.proteins.Proteins object directly:

https://github.com/wfondrie/mokapot/blob/ca5a8dfd72d57156f0a8420ce17b6c1199e4dba7/mokapot/proteins.py#L7-L50

However, you'll need to provide a mapping of target proteins to decoy proteins, a mapping of peptides to their generating proteins, and set of all shared peptides, in addition to other information. I strongly recommend using a FASTA file.

To my knowledge, the error you're seeing does not originate from the mokapot codebase, so I can't provide much insight without seeing the full log.

roushrsh commented 8 months ago

Edit/Update: Version is 0.10.0, but I had the same problem on the older version.

I made the protein structure and all the parts:

proteins = Proteins(peptide_map=dictionaryPepts, protein_map=dictionaryProts, shared_peptides= dict(zip([],[])), has_decoys=True, decoy_prefix='decoy_')

but when I use add_proteins the Protein confidence estimates enabled remains False and nothing happens. How can I add them properly? Even loading from fasta and replacing the parts with mine seems to do nothing.

I got the 'fasta' version to work though, thanks!

wfondrie commented 7 months ago

The approach you took should work with the Proteins object should work. Can you create a tiny reproducible example to share?

Glad the fasta is working for you!