xomicsdatascience / zoDIAq

Cosine Similarity Optimization for DIA qualitative and quantitative analysis
MIT License
3 stars 4 forks source link

IndexError: index -1 is out of bounds for axis 0 with size 0 #25

Closed jymbcrc closed 1 year ago

jymbcrc commented 2 years ago

Hello, Cranney, Sorry to disturb you again, recently I meet a new issue when analyzing the DIA data. it reports a IndexError: index -1 is out of bounds for axis 0 with size 0. Below is the whole console content and the raw data and mzMxl data I used. Thank you very much. I am sorry to bother you again csoDIAq console.txt data2.zip

jessegmeyerlab commented 2 years ago

@CCranney just wanted to check if you saw this

CCranney commented 2 years ago

I will investigate this weekend and get back to you by EOD Saturday.

jessegmeyerlab commented 2 years ago

Thanks Caleb!!! I started trying to investigate, and it seems like a problem with at the spectra 33 and 37. will update if I figure out more

jessegmeyerlab commented 2 years ago

I think I figured it out!!!

Within class IdentificationSpectraMatcher, when self.ppmMatches has length zero (no matches) it throws this error.

If we add a check for len(self.ppmMatches) before running line 149 with the np.delete then it should be okay except I'm not sure if skipping that line sometimes will break something else.

I added this above and it allows it to finish the spectra comparisons:

if len(self.ppmMatches>0):

            self.libraryTags, self.queryTags, self.libraryIntensities, self.queryIntensities, self.ppmMatches = [np.delete(x, remove_indices) for x in [self.libraryTags, self.queryTags, self.libraryIntensities, self.queryIntensities, self.ppmMatches]]
            if len(self.decoys) > 0: self.decoys = np.delete(self.decoys, remove_indices)

Then it throws a different error when it tries to write: I think it doesn't know how to handle writing a report if there are zero matches.

This data is quite sparse compared to previous data we used in testing so that makes sense why we never saw this error before.


Begin Writing to File: 
0:01:40.518557

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_14508/3703801407.py in <module>
----> 1 cif.perform_spectra_pooling_and_analysis(   args['files'][i],
      2                                             outFile,
      3                                             lib,
      4                                             args['fragmentMassTolerance'],
      5                                             maxQuerySpectraToPool,

~\Documents\GitHub\CsoDIAq-master\csodiaq_identification_functions.py in perform_spectra_pooling_and_analysis(querySpectraFile, outFile, lib, tolerance, maxQuerySpectraToPool, corrected, histFile)
     94 
     95     smf.print_milestone('\nBegin Writing to File: ')
---> 96     allSpectraMatches.write_output(outFile, querySpectraFile, maccCutoff, queScanValuesDict, libIdToKeyDict, lib)
     97 
     98 

~\Documents\GitHub\CsoDIAq-master\IdentificationSpectraMatcher.py in write_output(self, outFile, expSpectraFile, scoreCutoff, queValDict, idToKeyDict, lib)
    220                             expSpectraFile, #fileName
    221                             scan, #scan
--> 222                             queValDict[scan]['precursorMz'], #MzEXP
    223                             libKey[1], #peptide
    224                             lib[libKey]['ProteinName'], #protein

KeyError: 'precursorMz'
CCranney commented 2 years ago

Hi all,

I've been stuck re-installing conda commands where I'm getting some new errors (with PyQt5 and pip rather than CsoDIAq). I'll be making another issue to address that one - to help fix that, can one or both of you share with me the version of pip and PyQt5 you are using?

In trying to debug what I can without actually being able to run code (thank you for your investigation Jesse!!), there are a few oddities I'd like to explore:

  1. If self.ppmMatches is empty, I would expect self.libraryTags, self.queryTags etc. to also be empty. The use of all those variables was meant to function as a kind of 2D array, and was used as a stop-gap measure until the numba package allows for actual 2D arrays. As such, they should all be the same length. If self.ppmMatches is empty while the others are full, the problem likely lies in the find_matching_peaks function in the spectra_matcher_functions.py file. Can you verify in your if statement in your code that something like self.libraryTags also has a length of 0?
  2. If they are all empty (which is most likely, though it's odd to me that self.ppmMatches would be the parameter to throw the error), then we would need to effectively skip the window that is throwing the error and let the program chug along past it. I think your code effectively does that - we may want to investigate further, but it should work for now.
  3. If none of the windows have any matches, I would actually have expected a different error when writing the output (though also one to address!). I would expect line 206 curLibTag = self.libraryTags[0] to throw an error because this list would be empty. Notice that problem #2 above is from a problem that is done in batches, one per window. It's likely that one batch had no matches, but others did, so in this instance it is writing to a file. If I'm reading this right, it looks like there is a scan number found in the final results that was not originally found in the file - in other words, it's either a bizarre artifact from the earlier error, or it's a completely new one.

I've written some temporary fixes to a new branch emptyWindowSkip. I added three lines that effectively:

  1. Skip a window if there are no matches in it (same thing you did, though a bit differently since I can't see all of your code)
  2. Write an empty file (only header, no values) if no matches are found by the time the output is written.
  3. Skip lines in the output that stem from an abnormal scan number.

These are untested, as I need to resolve the other issue before I can run CsoDIAq, but at the very least in pinpoints places in the code where fixes will likely need to be made. Especially in reference to the last change, these are temporary patches that will require more investigation once I figure out this PyQt5/pip dilemma.

Some notes for investigating problem 3: The queValDict is initialized, I believe, on line 285 of csodiaq_identification_functions.py (queScanValuesDict). It is a defaultdict, which means that if you try to access a key that does not exist, it will populate a new key:value pair rather than throw an error. When the mystery scan is accessed in IdentificationSpectraMatcher.write_output() function, it creates a new, empty dictionary as the value that has none of the expected metadata (precursor mz, etc.) that queValDict was made to reference. I'd want to determine what that mysterious scan is - probably by adding print(scan) right before my third code addition and seeing the output, trying to determine 1) where it came from and 2) why it wasn't a key in queValDict.

Also, sorry for closing the issue temporarily - not sure which key I hit that did that, but it was a complete accident.

jessegmeyerlab commented 2 years ago

Thanks for looking at this Caleb.

To clarify, self.ppmMatches didn't throw an error, the error comes from np.delete index -1 on this line:

self.libraryTags, self.queryTags, self.libraryIntensities, self.queryIntensities, self.ppmMatches = [np.delete(x, remove_indices) for x in [self.libraryTags, self.queryTags, self.libraryIntensities, self.queryIntensities, self.ppmMatches]]

but I was suspicious if the error came from a lack of matches since I was printing the query spectra and # 33 looked spare. Then I printed the length of self.ppmMatches and saw that it was empty for spectra 33 and 37 in this specific file, so likely all relevant match parameters in self are empty. I think the fix needs to account for this rare possibility that zero fragments match. We should exclude them from FDR calculation(s) as I think you suggest in your fix

  1. Does 'windows' == 'queury spectra'? Other query spectra had matches, it looks like at least 33 and 37 do not. Probably more too but I didn't look further

Let me know if I can help

jessegmeyerlab commented 2 years ago

I can confirm that self.queryIntensities is also empty for scan 33

CCranney commented 2 years ago

Perfect, thank you Jesse! That eliminates the likelihood of a bizarre edge case - we're purely looking at what happens when a scan has no hits.

Ok, I think I made the appropriate fixes in the emptyWindowSkip branch. Below are the issues addressed by the fixes:

  1. As you saw and added code for, there were several processes that should be skipped if no hits were found is a query scan (I called it a "window" earlier, sorry about that).
  2. The later bug was a datatype issue. Adding empty lists to the mix coerced the datatype of self.queryTags to floats instead of ints. So it would search for metadata associated with scan 9.0 and not find anything, because all metadata for that scan was saved under 9. That was the "mystery scan" I referenced earlier.
  3. This never came up, but if none of the scans had any hits, the write_output would need to account for that and print a blank document.

I ran the data files given above and reached the end without a problem. Can you switch branches and try it on your end? You may need to uninstall csodiaq (pip uninstall csodiaq) and re-install it. If it works on your end, I'll merge it with the main branch.

jessegmeyerlab commented 2 years ago

Awesome thanks! I will try this in 30 minutes

On Mon, Feb 7, 2022, 11:21 AM Caleb Cranney @.***> wrote:

Perfect, thank you Jesse! That eliminates the likelihood of a bizarre edge case - we're purely looking at what happens when a scan has no hits.

Ok, I think I made the appropriate fixes in the emptyWindowSkip branch. Below are the issues addressed by the fixes:

  1. As you saw and added code for, there were several processes that should be skipped if no hits were found is a query scan.
  2. The later bug was a datatype issue. Adding empty lists to the mix coerced the datatype of self.queryTags to floats instead of ints. So it would search for metadata associated with scan 9.0 and not find anything, because all metadata for that scan was saved under 9. That was the "mystery scan" I referenced earlier.
  3. This never came up, but if none of the scans had any hits, the write_output would need to account for that and print a blank document.

I ran the data files given above and reached the end without a problem. Can you switch branches and try it on your end? You may need to uninstall csodiaq (pip uninstall csodiaq) and re-install it. If it works on your end, I'll merge it with the main branch.

— Reply to this email directly, view it on GitHub https://github.com/CCranney/CsoDIAq/issues/25#issuecomment-1031719611, or unsubscribe https://github.com/notifications/unsubscribe-auth/APRLBLA2AZBUN3KSARN5NE3UZ75SDANCNFSM5NJUJV3Q . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

jessegmeyerlab commented 2 years ago

@CCranney This appears to work, thank you!

@jymbcrc can you please try this update? With your 45k data I get exactly 1,000 protein groups

jessegmeyerlab commented 2 years ago

Thanks Caleb for putting together the wiki - I was wondering about the two protein columns and then found your descriptions here: https://github.com/CCranney/CsoDIAq/wiki/File-Content-Descriptions#csodiaq-file_input-query-file-name-tag_proteinfdrcsv

Before you merge this can you please rename 'leadingProtein' to 'proteinGroup' and 'protein' to 'allProteinMatches'?

jessegmeyerlab commented 2 years ago

I misunderstood the column definitions and wrote 1000 proteins above but the groups column is 'leadingProtein' so actually there are only 531 unique protein groups CsoDIAq-file1_HELA45K_corrected_proteinFDR.csv

CCranney commented 1 year ago

The 2.0 rewrite no longer has this problem. I ran the data through and it finished without errors. Feel free to reopen it if I missed something.