xomicsdatascience / zoDIAq

Cosine Similarity Optimization for DIA qualitative and quantitative analysis
MIT License

ValueError: not enough values to unpack (expected 2, got 0) #29

Open LiaSerrano opened 2 years ago

LiaSerrano commented 2 years ago

I tried to format a Prosit library like the TraML library. I am getting an error similar to the one I got with an MGF MassIVE-KB library. I think it's not able to associate peptides with proteins?

I get a “file_corrected” output but the peptide/spectral FDR outputs are empty and there is no proteinFDR output. These outputs appear when I take out the decoys, however.

Let me know if it would be helpful for me to send over the library I was using.

Attachment: csodiaq_error_August.pdf

jessegmeyerlab commented 2 years ago

@AlexandreHutton was going to add direct support for Prosit libraries. Let's see if that works, because that would avoid any problems with library conversion. Lex, can you please update us on where we are with that?

AlexandreHutton commented 2 years ago

I found this exact error while working with the FragPipe library. I thought it might be a problem with the library itself, but it sounds like it's an issue with the code. I think the problem might be with the FDR calculation somewhere. Adding in decoys gets us past that error but then produces an empty proteinFDR file. I'm investigating.

jessegmeyerlab commented 2 years ago

Thanks Lex for the update. I have some ideas.

If there was a way to convert a library that we know works to the other formats then we could rule out or confirm the issue relates to edge cases with the library format.

Since you think the problem is with the FDR calculation, and adding decoys gets past the error but produces empty output, I wonder how the FDR calculation deals with the case where there are no decoy hits. This could happen if the library contains no decoys, or by luck in some rare circumstances.

@LiaSerrano, does your library have decoys?

@AlexandreHutton does the FragPipe library have decoys?

LiaSerrano commented 2 years ago

The library I was using has reverse-sequence decoys predicted by Prosit. This error actually doesn't happen when I take out the decoy entries. Let me know if you would like me to send an example! Thank you

AlexandreHutton commented 2 years ago

> @AlexandreHutton does the FragPipe library have decoys?

It does not. I converted some entries from another (functional) library and added them in, which resulted in the empty output mentioned previously.

jessegmeyerlab commented 2 years ago

I wonder if the decoy is the same as the label CsoDIAq looks for.

It might help us understand if you can share the exact library you're using. You could email it to Lex and me if you want to keep the library private.

AlexandreHutton commented 2 years ago

> The library I was using has reverse-sequence decoys predicted by Prosit. This error actually doesn't happen when I take out the decoy entries. Let me know if you would like me to send an example! Thank you

Please do!

LiaSerrano commented 2 years ago

I'll shoot over an email, thanks!

jessegmeyerlab commented 2 years ago

> @AlexandreHutton does the FragPipe library have decoys?
>
> It does not. I converted some entries from another (functional) library and added them in, which resulted in the empty output mentioned previously.

Thanks Lex,

It might be how the FDR calculation handles where the decoys are hit. If it hits a decoy within the first 100 proteins (sorted by MaCC), then I believe it should return an empty list.

I don't remember how @CCranney made it handle when it never hits a decoy but that could be another place to look. It might help you debug if you can look at the intermediate matches list (that would be in memory) for proteins and see where the decoys fall in the order.
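To make the two edge cases concrete, here is a minimal sketch of a target-decoy FDR cutoff. It is not the actual CsoDIAq code: the function name `hits_passing_fdr`, the 1% threshold, and the rule of returning nothing when the threshold is crossed before 100 hits accumulate are all assumptions based on the description above.

```python
def hits_passing_fdr(is_decoy_flags, threshold=0.01):
    """Return how many leading hits pass the FDR threshold.

    `is_decoy_flags` is a list of booleans ordered best score first
    (e.g. sorted by MaCC). Hypothetical helper, not part of CsoDIAq.
    """
    # Assumption: at least 1/threshold hits are needed before the
    # threshold is considered resolvable (100 hits for 1% FDR).
    min_hits = round(1 / threshold)
    decoys = 0
    for i, is_decoy in enumerate(is_decoy_flags):
        if is_decoy:
            decoys += 1
        if decoys / (i + 1) > threshold:
            # Threshold crossed at row i: keep the i rows before it,
            # or nothing if the crossing came too early.
            return i if i >= min_hits else 0
    # Threshold never crossed, e.g. the library has no decoys at all.
    return len(is_decoy_flags)


early_decoy = [False] * 5 + [True] + [False] * 200
no_decoys = [False] * 50
late_decoys = [False] * 150 + [True] + [False] * 9 + [True] + [False] * 39

print(hits_passing_fdr(early_decoy))  # early decoy: nothing passes
print(hits_passing_fdr(no_decoys))    # no decoys: everything passes
print(hits_passing_fdr(late_decoys))  # cutoff at the second decoy
```

Under these assumptions, an early decoy yields an empty result, while a library without decoys passes every hit. Printing the position of each decoy in the sorted list would show which case applies.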

Thanks for looking at this Lex

CCranney commented 2 years ago

Hi all,

I dug into the code, looking specifically for the error @LiaSerrano included in her PDF in the first comment. Backtracing the error, I think no peptides were identified (the _peptideFDR.csv output is completely blank). That, or the library used lacks or has different peptide and/or protein labels, and as such the "peptide" and/or "protein" columns of the peptideFDR output file are blank. This is just me extrapolating what the error could be, but could I have access to the data/GUI settings that led to this error?

Breakdown of my thought process: The error is found here:

    File "C:\Users\lrserrano\Anaconda3\envs\csod\lib\site-packages\csodiaq\idpicker.py", line 23, in group_nodes_with_same_edge
        if first: l1, l2 = map(list,zip(*data))

It looks like it tried to break data into two lists when data was actually blank. This data variable should have been a list of length-2 tuples, pairing peptides to proteins. So going back to where data came from, it looks like it was created and passed down through the following functions:

  1. Start: File csodiaq_identification_functions.py, . The <peptideDf> variable, a dataframe that was used to create the _peptideFDR.csv file in the output.
  2. This variable was passed into the function format_peptide_protein_connections(peptideDf) on line 104.
  3. Each peptide is tied to the proteins in its protein group in a 1:1 fashion as a list of tuples. For example, if the peptide EHALLAYTLGVK was attached to the protein group 3/sp|Q5VTE0|EF1A3_HUMAN/sp|Q05639|EF1A2_HUMAN/sp|P68104|EF1A1_HUMAN, you would expect the following list of tuples to be created. All peptide-protein connections would be put into the same list.
    [
    ('EHALLAYTLGVK', 'sp|Q5VTE0|EF1A3_HUMAN'),
    ('EHALLAYTLGVK', 'sp|Q05639|EF1A2_HUMAN'),
    ('EHALLAYTLGVK', 'sp|P68104|EF1A1_HUMAN')
    ]
  4. This list of tuples is ultimately what the error is occurring on, most likely because the list is empty (no peptide-protein connections). So either the _peptideFDR.csv is completely empty, or the peptide and/or protein columns of the _peptideFDR.csv file are blank. I'm leaning towards the former, but won't know without looking at the data in question.
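The steps above can be reproduced in isolation. The sketch below is not the CsoDIAq implementation: `pair_peptides_with_proteins` and `split_pairs` are hypothetical helpers, and the protein-group format (a leading count followed by slash-separated protein names) is inferred from the example in step 3. It shows how the tuple list is built, and why the unpacking in `group_nodes_with_same_edge` fails when that list is empty.

```python
def pair_peptides_with_proteins(rows):
    """Expand (peptide, protein_group) rows into (peptide, protein) tuples.

    Hypothetical stand-in for format_peptide_protein_connections.
    """
    pairs = []
    for peptide, protein_group in rows:
        if not peptide or not protein_group:
            continue  # blank label: no connection can be made
        # "3/sp|A/sp|B/sp|C" -> drop the leading count, keep the proteins
        _count, *proteins = protein_group.split("/")
        pairs.extend((peptide, protein) for protein in proteins if protein)
    return pairs


def split_pairs(pairs):
    """Split pairs into two parallel lists, guarding the empty case."""
    if not pairs:
        return [], []
    peptides, proteins = map(list, zip(*pairs))
    return peptides, proteins


group = "3/sp|Q5VTE0|EF1A3_HUMAN/sp|Q05639|EF1A2_HUMAN/sp|P68104|EF1A1_HUMAN"
pairs = pair_peptides_with_proteins([("EHALLAYTLGVK", group)])
print(pairs)

# With no connections, zip(*[]) yields nothing, so unpacking into two
# names raises the ValueError reported in this issue.
try:
    l1, l2 = map(list, zip(*[]))
except ValueError as err:
    message = str(err)
    print(message)

print(split_pairs([]))  # the guarded version returns two empty lists
```

A guard like the one in `split_pairs` would only replace the crash with empty output. The underlying question of why no peptide-protein connections were produced would still need the data to answer.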