wkumler / RaMS

R-based access to Mass-Spectrometry data
Other
20 stars 7 forks source link

tmzML document MS2 voltage encoding is incorrect #9

Closed wkumler closed 1 year ago

wkumler commented 1 year ago
grabMSdata(system.file("extdata", "DDApos_2.mzML.gz", package="RaMS"), grab_what = "MS2")$MS2[premz%between%pmppm(118.0865)]

produces

            rt    premz    fragmz         int voltage         filename
  1:  4.182333 118.0864  51.81098    3809.649      35 DDApos_2.mzML.gz
  2:  4.182333 118.0864  58.06422   10133.438      35 DDApos_2.mzML.gz
  3:  4.182333 118.0864  58.06590  390179.500      35 DDApos_2.mzML.gz
  4:  4.182333 118.0864  59.07371  494165.156      35 DDApos_2.mzML.gz
  5:  4.182333 118.0864  59.56195    4696.181      35 DDApos_2.mzML.gz
 ---                                                                  
584: 14.897500 118.0865 115.38483    2501.394      35 DDApos_2.mzML.gz
585: 14.897500 118.0865 118.08650 5328035.500      35 DDApos_2.mzML.gz
586: 14.897500 118.0865 118.12283   59140.699      35 DDApos_2.mzML.gz
587: 14.897500 118.0865 119.08417    9048.057      35 DDApos_2.mzML.gz
588: 14.897500 118.0865 119.08983  161270.016      35 DDApos_2.mzML.gz

but

tmzmlMaker(system.file("extdata", "DDApos_2.mzML.gz", package="RaMS"), output_filename = "~/../Desktop/test_tmz.tmzML")
grabMSdata("~/../Desktop/test_tmz.tmzML")$MS2[premz%between%pmppm(118.0865)]

produces

            rt    premz    fragmz      voltage         int       filename
  1:  4.182333 118.0864  51.81098 1.72923e-322    3809.649 test_tmz.tmzML
  2:  4.182333 118.0864  58.06422 1.72923e-322   10133.438 test_tmz.tmzML
  3:  4.182333 118.0864  58.06590 1.72923e-322  390179.500 test_tmz.tmzML
  4:  4.182333 118.0864  59.07371 1.72923e-322  494165.156 test_tmz.tmzML
  5:  4.182333 118.0864  59.56195 1.72923e-322    4696.181 test_tmz.tmzML
 ---                                                                     
584: 14.897500 118.0865 115.38483 1.72923e-322    2501.394 test_tmz.tmzML
585: 14.897500 118.0865 118.08650 1.72923e-322 5328035.500 test_tmz.tmzML
586: 14.897500 118.0865 118.12283 1.72923e-322   59140.699 test_tmz.tmzML
587: 14.897500 118.0865 119.08417 1.72923e-322    9048.057 test_tmz.tmzML
588: 14.897500 118.0865 119.08983 1.72923e-322  161270.016 test_tmz.tmzML

Pretty clearly an encoding error but I'm not sure where exactly it's coming from, will look into soon.

wkumler commented 1 year ago

The fact that it's off by such a wild amount implies that this is an encoding issue of some kind. I've seen enormous exponents like that before when doing all the fancy compression stuff incorrectly so that's where I suspect the bug is. The problem is that I don't know whether it's on the encoding end (in tmzMLmaker) or the decoding end (in msdata_connection nonsense). It doesn't seem like either location has an obvious bug - both getEncoded and giveEncoding use compression_type = "gzip", bin_precision = 8, and endi_enc = "little". Which was intentional when designing this so that I wouldn't have exactly this issue. I wonder if it has anything to do with "voltage" being the only one that's encoded as an integer rather than a float, and thus requiring a different bin_precision (4?).

wkumler commented 1 year ago

It doesn't seem to be on the decoding end, at least initially.

Breaking apart grabMSdata("~/../Desktop/test_tmz.tmzML")$MS2[premz%between%pmppm(90.0555)]:

msdata_obj <- list(MS1=NULL, MS2=NULL, connection=list(
  files="~/../Desktop/test_tmz.tmzML", grab_what="MS2", verbosity=0
))
class(msdata_obj) <- "msdata_connection"

isub <- substitute(mz%between%pmppm(90.0555, 10))
function_name <- as.character(isub[[1]])
col_name <- as.character(isub[[2]])
mz_lims <- eval.parent(isub[[3]])

filename <- msdata_obj$connection$files[1]
tmzml <- xml2::read_xml(filename)
mz_selectors <- paste0("@minmz<", max(mz_lims), " and @maxmz>", min(mz_lims))
mz_xpath <- paste0("data/", msdata_obj[["connection"]][["grab_what"]],
                   "/dubset[", mz_selectors, "]")
dubset_node <- xml2::xml_find_all(tmzml, mz_xpath)
raw_encoded <- xml2::xml_text(xml2::xml_children(dubset_node))

mzint_nodes <- raw_encoded[4]
compression_type="gzip"
bin_precision = 8
endi_enc="little"

decoded_mzs <- base64enc::base64decode(mzint_nodes)
decomp_mzs <- memDecompress(decoded_mzs, type = compression_type)
readBin(decomp_mzs, what = "double", n=length(decomp_mzs)/bin_precision,
        size = bin_precision, endian = endi_enc)
wkumler commented 1 year ago

yEP seems to be an encoding problem because "voltage" is an integer:

This one works just fine:

voltage_vals <- rep(35, 34)
voltage_enc <- giveEncoding(mzint_vals = voltage_vals, compression_type = "gzip", bin_precision = 8, endi_enc = "little")
getEncoded(mzint_nodes = voltage_enc, compression_type = "gzip", bin_precision = 8, endi_enc = "little")

This doesn't seem fixable by just changing the precision to 4 - we just get different, equally wrong numbers that way. Time to coerce the integer to numeric? while THIS one returns the problem we saw before:

voltage_vals <- rep(35L, 34)
voltage_enc <- giveEncoding(mzint_vals = voltage_vals, compression_type = "gzip", bin_precision = 8, endi_enc = "little")
getEncoded(mzint_nodes = voltage_enc, compression_type = "gzip", bin_precision = 8, endi_enc = "little")
wkumler commented 1 year ago

Closed with #14