wenbostar / PDV

PDV: an integrative proteomics data viewer
GNU General Public License v3.0

Fail to parse Tide pepXML #42

Open wsnoble opened 1 year ago

wsnoble commented 1 year ago

I tried to parse a Tide pepXML file, but failed. The error is "Failed to parse the PepXML file, please check your file." I suspect that this is because our format has changed since you first evaluated Tide's PepXML back at Crux v3.2. Can you take a look at the attached file and see if it's possible to support it, or if we need to make changes on our end?

plasmo-neighbors.trypsin-p.narrow.tide-search.pep.xml.txt.gz MSB17171Trypsin030814.mgf.txt.gz

wenbostar commented 1 year ago

The log file generated by PDV when loading the files shows there is a problem in spectra mapping between pepXML and the mgf files.

Tue Oct 25 12:14:52 PDT 2022: PDV-1.7.4
java.lang.IndexOutOfBoundsException: Index: 9729, Size: 8815
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at com.compomics.util.experiment.io.massspectrometry.MgfIndex.getSpectrumTitle(MgfIndex.java:239)
        at com.compomics.util.experiment.massspectrometry.SpectrumFactory.getSpectrumTitle(SpectrumFactory.java:994)
        at PDVGUI.fileimport.PepXMLFileImport.parsePepXML(PepXMLFileImport.java:669)
        at PDVGUI.fileimport.PepXMLFileImport.access$000(PepXMLFileImport.java:33)
        at PDVGUI.fileimport.PepXMLFileImport$1.run(PepXMLFileImport.java:185)

For mgf/pepXML input from Crux, we use start_scan from the pepXML file as the spectrum ID to extract MS/MS spectrum data from the mgf file. In the pepXML we generated with a previous version of Crux, start_scan was the index of the spectrum in the MGF file, not its scan number. In your pepXML file, however, start_scan looks like the scan number from the mgf file, which would explain why the lookup runs past the end of the spectrum list (index 9729 in a list of size 8815). Has the way start_scan is assigned changed in the latest Crux?
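To make the mismatch concrete, here is a minimal Python sketch of the two lookup strategies. It is not PDV's actual code (PDV is written in Java); the function names and toy data are hypothetical, and the 1-based index is an assumption made only for illustration.

```python
from typing import Dict, List, Optional


def lookup_by_index(titles: List[str], start_scan: int) -> str:
    """Treat start_scan as the position of the spectrum in the MGF file.

    Assumption for this sketch: positions are 1-based.
    """
    return titles[start_scan - 1]


def lookup_by_scan(scan_to_title: Dict[int, str], start_scan: int) -> Optional[str]:
    """Treat start_scan as the scan number stored with the spectrum."""
    return scan_to_title.get(start_scan)


# Toy data: three spectra whose scan numbers are not contiguous.
spectra = [(7, "scan=7"), (12, "scan=12"), (40, "scan=40")]
titles = [title for _, title in spectra]
scan_to_title = {scan: title for scan, title in spectra}

start_scan = 40  # a scan number, as it appears to be in the newer pepXML
try:
    lookup_by_index(titles, start_scan)
except IndexError as exc:
    # Same kind of failure as the IndexOutOfBoundsException in the log above.
    print("index lookup failed:", exc)

print("scan lookup:", lookup_by_scan(scan_to_title, start_scan))
```

When scan numbers exceed the number of spectra in the file, the index-based lookup fails outright. When they do not, it can silently return the wrong spectrum, which is harder to notice.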

wsnoble commented 1 year ago

Unfortunately, I don't know the answer to this. Looking back at the release notes, it could be that these changes were in this update:

May 28, 2020: Added fixes for pepXML schema validation failures.

wenbostar commented 1 year ago

If there is no scan number (SCANS) in the MGF file, what will be used as start_scan in the latest Crux pepXML output? It is common for mgf files not to have scan numbers.

BEGIN IONS
TITLE=controllerType=0 controllerNumber=1 scan=7
SCANS=7
RTINSECONDS=122.229984
PEPMASS=381.409973144531
CHARGE=3+
111.1171341 1240.9335937500
113.8793030 1190.6700439453
115.0367432 1258.7552490234
153.8232880 1130.4077148438

wsnoble commented 1 year ago

It uses ordinal numbers instead in that case. Here is the line that gets printed to the log file:

INFO: Parser could not determine scan numbers for this file, using ordinal numbers as scan numbers.

A sample MGF and pepxml file are attached.

plasmo-neighbors.trypsin-p.narrow.tide-search.pep.xml.txt short.mgf.txt

wenbostar commented 1 year ago

This new example can be imported into PDV successfully:

image

If an MGF file is combined from multiple MGF files (e.g., multiple fractions of the same sample), this MGF file is likely to have spectra with the same scan numbers. In this case, how will Crux set the start_scan in the pepXML output?
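Whether a particular MGF file is affected can be checked before searching. The following is a minimal Python sketch that counts repeated SCANS values. It assumes each SCANS line holds a single integer, and the function names and the sample text are hypothetical, not taken from Crux or PDV.

```python
import re
from collections import Counter
from typing import Dict, List

# Matches a header line such as "SCANS=7" (single integer assumed).
SCANS_LINE = re.compile(r"^SCANS=(\d+)[ \t]*$", re.MULTILINE)


def scans_in_mgf(mgf_text: str) -> List[int]:
    """Return the SCANS values in the order they appear in the MGF text."""
    return [int(value) for value in SCANS_LINE.findall(mgf_text)]


def duplicated_scans(mgf_text: str) -> Dict[int, int]:
    """Map each scan number that occurs more than once to its count."""
    counts = Counter(scans_in_mgf(mgf_text))
    return {scan: n for scan, n in counts.items() if n > 1}


# Toy MGF text merged from two fractions that both contain scan 7.
merged = """BEGIN IONS
TITLE=fraction1 scan=7
SCANS=7
PEPMASS=381.41
END IONS
BEGIN IONS
TITLE=fraction2 scan=7
SCANS=7
PEPMASS=502.27
END IONS
BEGIN IONS
TITLE=fraction2 scan=8
SCANS=8
PEPMASS=610.33
END IONS
"""

print(duplicated_scans(merged))  # {7: 2}
```

Any non-empty result means a scan number alone cannot identify a spectrum in that file.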

freejstone commented 1 year ago

I am really unsure if this is helpful, but for what it's worth, I was able to load the files into PDV using the "database searching" feature, with a pepXML containing a single PSM and the complete mgf file. However, no ions are annotated:

Screen Shot 2022-10-27 at 1 58 37 pm

As soon as I reduce the mgf file to the single scan of interest, it does not parse. What does work is using PDV's "one PSM" feature. In that case it will accept the mgf with the single scan.

I have attached the complete mgf, the single mgf, and the pepxml containing the single PSM.

MSB19717Trypsin021915_1910.mgf.txt MSB19717Trypsin021915.mgf.txt plasmo-neighbors.trypsin-p.wide.tide-search_single_psm.pep.xml.txt

kfattila commented 1 year ago

I am not familiar with the data parsing functions in Crux, but my understanding so far is that Crux uses ProteoWizard to parse the input files (mgf, etc.). I think there has been a ProteoWizard update. I hope my comment helps.

wenbostar commented 1 year ago

For Crux output, the current version of PDV can correctly match PSMs in pepXML to spectra in the mgf file only when start_scan in the pepXML is the ordinal number of the spectrum in the mgf file.

wenbostar commented 1 year ago

I tested Crux v4.1. The current version of PDV works well with mzML/pepXML, mzML/mzid, mzXML/pepXML, and mzXML/mzid files. I added a few examples generated using Crux v4.1 to the README.

For MGF input, it looks like the spectrum ID mapping for both pepXML and mzid outputs was changed in v4.1, so PDV cannot parse the result successfully in some cases.

So far, I have found that start_scan is assigned differently depending on the header format of the MGF file:

  1. When there is no SCANS field in the MGF file and Crux apparently cannot parse a scan number from TITLE, start_scan is the ordinal number of the spectrum in the mgf file. PDV works well with this;
  2. When there is no SCANS field in the MGF file and TITLE is in a format like "TITLE=SF_200217_U2OS_TiO2_HCD_OT_rep1.1501.1501.2", start_scan is the scan number parsed from the title (1501 in this example);
  3. When there is a SCANS field in the MGF file, start_scan is the scan number from SCANS.

Considering that different spectra may have the same scan number when an MGF file is combined from multiple MGF files, I would suggest always assigning start_scan as the ordinal number of the spectrum in the mgf file. Using one consistent spectrum mapping for the same MS/MS data format will make it easier for users to parse the results.
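The three observed cases can be summarized in a short Python sketch. This only illustrates the behaviour described in this comment and is not Crux's actual code. The function name, the header dictionary, and the TITLE regular expression are assumptions made for the example.

```python
import re
from typing import Dict

# Hypothetical pattern for titles shaped like "<run>.<scan>.<scan>.<charge>".
TITLE_SCAN = re.compile(r"^.+\.(\d+)\.(\d+)\.(\d+)$")


def observed_start_scan(headers: Dict[str, str], ordinal: int) -> int:
    """Return the start_scan that appears to be written for one spectrum.

    headers: MGF header fields of the spectrum, e.g. {"TITLE": ..., "SCANS": ...}
    ordinal: position of the spectrum in the MGF file
    """
    if "SCANS" in headers:
        return int(headers["SCANS"])  # case 3: SCANS is present
    match = TITLE_SCAN.match(headers.get("TITLE", ""))
    if match:
        return int(match.group(1))  # case 2: scan number parsed from TITLE
    return ordinal  # case 1: fall back to the ordinal number


title = "SF_200217_U2OS_TiO2_HCD_OT_rep1.1501.1501.2"
print(observed_start_scan({"TITLE": title}, ordinal=5))  # 1501
print(observed_start_scan({"TITLE": "spectrum A", "SCANS": "7"}, ordinal=3))  # 7
print(observed_start_scan({"TITLE": "spectrum A"}, ordinal=4))  # 4
```

Under the suggestion above, the function would simply return `ordinal` in every case, so a reader would not need to know which header format the MGF file used.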