wkumler / RaMS

R-based access to Mass-Spectrometry data
Other
20 stars 7 forks source link

Can't read mzML files written by OpenChrom because of missing namespace declaration #17

Closed ethanbass closed 8 months ago

ethanbass commented 1 year ago

Hi, The mzML files written by OpenChrom are missing a namespace declaration that is required for RaMS to parse the files correctly. It throws an error: Error in xml_find_first.xml_node(xml_data, paste0("//d1:", node_to_check)) : xmlXPathEval: evaluation failed". If I manually add the namespace declaration to the file, it parses fine.

I'm not sure what the best fix would be here. I'm not too familiar with mzML so I don't know if these namespace declarations are "required" for the mzML files to be valid. In any case I am wondering if there is a way to add a check in RaMS for the namespace declaration and maybe modify the xpaths accordingly so that RaMS can read the OpenChrom files without modification. Alternatively, I guess it would be possible to add the namespace declaration to the XML after importing it to R. I'm not sure which way would be a better approach here?

I'm attaching an example file that was produced by OpenChrom (with the .txt extension because GitHub wouldn't let me upload mzML). alkanes 0.25 mg-ml_2-10-2021.txt

Thanks! Ethan

wkumler commented 1 year ago

Hi Ethan, thanks for the bug report - it's good to know that this is happening. Namespaces are definitely one of those things I don't understand about XML very well so I'm not super surprised that we ran into an issue with it at some point. It looks like the main issue is that I wrote RaMS with the expectation that the default namespaces (d1/d2) are the only ones used, but the OpenChrom mzMLs only have the xsi namespace:

> xml_ns(xml_data_blk) # A random blank file from msconvert, read in using read_xml
d1   <-> http://psi.hupo.org/ms/mzml
d2   <-> http://psi.hupo.org/ms/mzml
xsi  <-> http://www.w3.org/2001/XMLSchema-instance
xsi1 <-> http://www.w3.org/2001/XMLSchema-instance
> xml_ns(xml_data_alkanes) # Your alkanes file, read in using read_xml
xsi <-> http://www.w3.org/2001/XMLSchema-instance

The good news is that the fix is relatively simple: we just need to drop the namespaces from the xpath expressions:

# Throws an error:
xml_find_first(xml_data_alkanes, '//d1:cvParam[@accession="MS:1000574"]')
# Works great:
xml_find_first(xml_data_alkanes, '//cvParam[@accession="MS:1000574"]')

I've implemented this in the namespace_free branch and it works perfectly on the demo file you shared, but is then broken for mzMLs from other places.

The bad news is that this is super pervasive - I've basically hard-coded these namespaces everywhere in the code and it won't be easy to carefully strip them out and create a new switch that handles both types. I think the real solution is more along the lines of the second strategy you propos and will probably be for me to learn more about XML namespaces and do the proper handling when the file is initially read in, but it may be a little while before I can set aside time for that. In the meantime, I hope the namespace_free branch will work for your purposes. You can install it with

devtools::install_github("https://github.com/wkumler/RaMS/tree/namespace_free")
ethanbass commented 1 year ago

Thanks for looking into this! It took me stupidly long to figure this out, but I think I actually found a pretty easy solution. It turns out you can just set the namespace as an attribute of the xml. I think a function like this will fix the problem (to be called right after reading in the xml with read_xml:

checkNamespace <- function(xml_data){
  if (is.na(xml2::xml_attr(xml_data,"xmlns"))){
    xml2::xml_attr(xml_data,"xmlns") <- "http://psi.hupo.org/ms/mzml"
  }
}

I could probably put together a pull request later this afternoon if you like

wkumler commented 10 months ago

Hi Ethan, just wanted to let you know I haven't forgotten about this! I'm a bit swamped with other tasks right now and have had to de-prioritize package development but I'm hoping to get v1.4 our by the end of this year or early 2024 and this would be the perfect thing to include at that point. Your PR looks great and I'll probably just implement it as-is, I just don't have the time to go through the updates and versioning process at the moment.

ethanbass commented 10 months ago

Sounds good. Thanks for the update and no worries about the delay -- I am quite familiar with being swamped.

wkumler commented 8 months ago

Completed with #24