wkumler / RaMS

R-based access to Mass-Spectrometry data

mzXML files can have multiple binary encodings #35

Open wkumler opened 7 months ago

wkumler commented 7 months ago

The DDA file from the Skyline folks in their DIA tutorial throws a bunch of warnings when run. Prying into these revealed that the mzXML file sometimes has a compressionType of "none" and sometimes a compressionType of "zlib" (which R's memDecompress() reads with type = "gzip").

Scan 1: none
Scan 2: zlib
Scan 3: none
Scan 4: none
Scan 5: none
Scan 6: none
Scan 7: zlib

with no clear pattern. I had always assumed that the same encoding would be applied to every scan's binary data (and RaMS only reads the encoding from the first peaks node). I'll have to switch it over to per-scan encodings, which is super annoying.

<scan num="1" msLevel="1" peaksCount="17" polarity="+" scanType="Full" filterLine="FTMS + p NSI Full ms [400.00-2000.00]" retentionTime="PT0.1821S" injectionTime="PT0.1000S" lowMz="423.914" highMz="1765.56" basePeakMz="1765.56" basePeakIntensity="779.182" totIonCurrent="10071.9">
  <peaks precision="32" byteOrder="network" contentType="m/z-int" compressionType="none" compressedLen="0">Q9P1BUPwxShD3HUiRAlb1EPe6yBD+KoQRAOuX0PvpUdEBuBDRAdZKkQH489EAu6lRAmIZEQfVKJEGxU5RBL2OUQj+/pEIzpVRDzChUQky15EXYj5RAlMgERrSNREDDeVRHfOj0QNYvNEfzuWRDAJVESVvzREGRwVRK8DZUQsofNE3LIIRELLow==</peaks>
</scan>
<scan num="2" msLevel="1" peaksCount="352" polarity="+" scanType="Full" filterLine="FTMS + p NSI Full ms [400.00-2000.00]" retentionTime="PT0.6401S" injectionTime="PT0.1000S" lowMz="400.251" highMz="1753.41" basePeakMz="445.117" basePeakIntensity="2.62395e+006" totIonCurrent="1.77468e+007">
  <peaks precision="32" byteOrder="network" contentType="m/z-int" compressionType="zlib" compressedLen="2622">eJwN1vlfzWkfx/GcFnXqtKAmKqQI2UY0JrJ93tf1PaftFJP1niTLhAnNiDv3GLK1nTqRUBNZk5GkRbJlG057Km2kkjZLU9ZKy+2n1+P5H7xINXoi3/bDDFJN2s6uyB+TalcCN3ObQ6pT63Bk8G1SJZfhWFwMqfoOsaT9CZS7c4CtNhVR7q6xaLZwotyvRthgOYvydh5GkEJJeRFOXD7bjfISRfxobyPlpfewhXNXUr7Ehy3dkkr547awxzeaKD/gLA5tEFO+ohdHeB3lZ7Zhl8dMKsh6jpD9elQoXYqfxfepMCZBCGCnqDC1Gxs8t1Nh2nlWoN1GRSYWPNA+j4pSNFhlzU4qulKHhv3mVDyFIdBXl4r3ewhrVUoq0XDiR4sLqCQog7m1r6aSfX3cvuU6lYSlsev631PJ+VY0PE6mJxrq7KBlID0RJyEhbhI92S3iVavK6Mk+d8TlCPQkSV16s0dJTzJrEPiqk54MiLh2+xcqHe8vtMp6qPTPWNZ+JpxKk8r5nj+6qTTrLd48r6LS/jJ8WGVPZZbZzGdBKZVNmMlc3e5S2Q0R81nsQuUXtkPlfI/KU4ayCMer9HTKdpRP1aOnCR8Qtmo3PU014B/yMujp1Q8oun2WKqYlsQbJIqrwz5KJo/2oIuyrTPNXRhUJW+Fv5UkV760ETx8zqtTVkC78okOVHptx+8B3VOkvl/3p6EmVIUuFaQFrqDIsQfjQ3kWVndlC8QM9qtJezhfbelGVOEP4bvUBqtraKnP17qCqAxe49solVBXcyU1zs6gqbD4fNOccVUVZI66rkaou9gh+ojNUldWL0Ht2VNW5WEh89pmqNVW4ZexH1YPv4OFrXaoWH0CO3J+qJ3rx1Ta+VL11rzSl/5uV5TjsakrV2S047jqWqjs6Bb2fd1PNVnPhQvYFqsmah8tLa6jmZg+7clJENR0nuNGjWnq2pYzPW5xGzzrcsD5oMj1P00W6ZwfVHkoURB4nqTZDgwXsNaQXRmncyIDohZ8L+29wBb3YsgVFK5vpRWS302gbB3pxJYsVz9hIL96F4vKs+1RnoCGLuLuN6qbeQNnoLqpTrBY2nGigushjsrGNi6juUqv0bE0u1eu58v8Z7qJ6/d3S6Ws9qH7SEuHhL01Ub3cVmzbLqT70KpsyrpTqw5u5d0A41UfOF+xeiag+WcUDZFnUoPMcvtYO1KD7luHB39Sg78Ht445Rw9i9qAv+Sg2TLfBl1StqiHiJ/Iyh9PIvBVf9HUgv2+wRnFFNjWYvuYmBNzWmN+HoMRW9iqnAwwmf6NW5fI77f1DT2MlMrq5LTTHxPDl2BjUbF3CLm98auAPHpg5Qc4wnS/Jqp+bLH7npL97UMsRTmLOtl1qmbMSXO7HUEsVkmiuyqSVNgYOZN6jV0F3qWrmcWpUNQk+lH7XmVKFZ6xS1GZQJ+imrqU35Gx9+5yi9zuxn6zNy6M2xOgRNLqR3e3RYzqPj1K62AKX2SmofLmG/GY2i9t0PmW/LSmqPnY70R+b074Q0tqJ2FHVkvkHYA0vqvK7Jrfzr6f2sYPTVjKCPB9O4us08+qR1mx2fVk+f9nXjhH84fTq4G2+2jqMvp2NZYnU/dVl2o2JdHHWlnkLFqEPUlf2etwT/RN0OPnxkYCF138hFSHY9fY2QCpWmVvQ1YwhbFvmOeiXLhAXXBlGvqy/+9L9HvWHxgmXkaeqNkPCvQ65Tb/QedmRtAvWJb/CQW87UJ7Fmt9cFUt/QJ8zVyZb6Qobio3ET9YUt5dNb7ahPsQQv//OM+qKHIHyhGfWL3VhMWzT1h2ZjV1oEDcQqmG/pFRq42SuM8X4MtZG2fHW2Empe3vhHqMMgWT9PiZ+BQaH9XC3RDINU52XOsgoMqrXAif5EiLRTpY
8qN0Ek3Sn1fxgFUbAT8zAbDFFIoHDFzAGif/5li/zvQPTYjIn7cyBSyXl72Amoa09kFY2VUPdqZ6pKB6gnWOOO3nRouG7lWaG3oRGxh1XMfwCNwrnspbsXNIqiEPd9EzQlTogfroSm82dhoNEUmgo1IbRxAjTzEtjOy1ehmV8gDO3+AVriFfwF7KClW84Pnv4ILdki9CbsgpbTcs7tvtk5hf1xfz20wqaz9DsHoRVejovzMqGV54Doe0OhswDImb8DOhHX2Lt5b6BTeJIFnboNsUSJnm1rIBb24NbNh9B7GIwUr0+QaOxEx4j5kJAhriZJIBH2orPsOgyFEmHTHicYBtfzkubZMHwUwL237Ybh42x+1GsbjLQ2s4v1FjAafI+9uTQVRtyHHd++HEYHfdDtPoCh+SP5ktHWGKY7jWXnj8EwWYk0epKAYaFVUvO7+himOsqebp2NYbnpQot6N4y1I9mIthkw1kkVRn5aAWPpAra4wQ7GstPc45A/jEOD2PmvCTDR9+Pl/66EidsB7mz6DiaR3qw8eR5MilbhuO9IfFdoKU33HAZTyWTpmqvnYeoyIB0W1wzT8ABuFbMfphHagp9JFkzzd7DzNeUwLWjix5LnYLjuOjavbQDD9bpYwwQZzPXzeeZIe5i7FfHlY6JhHpnO9XQ0YV6Uzf7AZFjoX0JJfjtG6tnj47LfMUb1lf8TuxZWOhIu1bgGKyGKbTySDSuZIVdnF2EV/CtKms1gFfKWrRl8AtbOwTj9KgzjFGuRem0SxuV1SYfUpcBGV024kvoGNk5NwmHzMNiEtfAXw1phk/eQz1iWhfHiRub3Kg8T3WRS69mOmBi5RKj4/AATi8YLyht6sNVTsOjBI2CrP1PQv+sLW5d1KPRKga3bHM7GK2Cr8MNftybANnIGW9R5GFOUHdxYTQtTSky4oXgXphp0s9jvN2Cq/NsPDFGDnUQPipaNsHNpQMVuJWY6WbEyrfGYGWbB4nkpZubZsB3BnbAX6+HE+0T86GIpfJ4VhR8VU4T8gHj8WKAvhPuNhYNuH681T4SDcyUPXLYdDgoJqzibDof8t8jMmIY5xf/jTR71cDQI4dHLRsBRvoZfyDgDx4gHXOSZCkflXa6Rq4Rj4VXmmecMx+LDUH3JwVzJc6Yw8sACww6kDLfEgqi3CHH9CxDacGRUOfDoN5yRb4aQ746cB71wLnQVXB54wkUiF+aec4TLhPfY8PwNXFw9eJ98LlwifLhCZAiXQkuuu78LrhILpK/dDrl7Mte4dw3yqIssNfcj5CWJrLbtJdz17zH3xn1wN/yAv+9Zw92tB/F7f4Z7ZC8KFu7DCsUBJPZ0YEXBKubfcAkr9ZSIN+uDl1sAH2gwgFdkAK81bIZX0a+8hGVilf42PpudxSq3n1hH4iv4RM3E4ZZNWGM4GzErfoefJAaBP6zH1mIl14wUw9/gEh8tNMFfHs+nvk6CvzKevetygH+xN1vq8AFBSXLmm6SJIFUAW5d8HkGtXuz0jJPYq72GedfOw75iJyjV3LF/YwIOLZyOsNbXyLwchPAJYjRdP4VYk2c4dP0Qzn7Oxp6ENOQ+2odzz7XwbLINjiRb/B8ZS+lz</peaks>
</scan>
<scan num="3" msLevel="1" peaksCount="84" polarity="+" scanType="Full" filterLine="FTMS + p NSI Full ms [400.00-2000.00]" retentionTime="PT0.9075S" injectionTime="PT0.0023S" lowMz="407.983" highMz="1790.98" basePeakMz="445.12" basePeakIntensity="3.00796e+006" totIonCurrent="1.64408e+007">
  <peaks precision="32" byteOrder="network" contentType="m/z-int" compressionType="none" compressedLen="0">Q8v920asXxtDzZYNRwcRFEPPhMhHQR1XQ8/i7kagqMtD0aWgRqjUyEPRqHNIwsWoQ9Io30fvKtBD1nIdSQBwQkPWi2pJLuT8Q9bykUexHVZD1wttSF/ySUPXccBJcLopQ9eLAkg0avND1/InR0Jy3UPYcVZJMPqXQ9jx1UfkrS5D2XEDSKTaGEPacJNHqFcdQ96Pa0o3l1ND3w94SYGzm0PfjNZHMKAUQ9+O9UlKvzZD36yESJBP00PgDvJIKL3bQ+As6kedu99D55DHSRq49kPnksFGptqKQ+ffWEamyqND6BDLSGjWckPokHtH7zsSQ+kQnkcFr3RD+LpLRw7pfEP7jcZHzBPnQ/wOBkcmXgpD/7xoR7XF/EQBil9G0CXGRAHI7UlzCfREAgjtSKzMwEQCSLdIZwmfRAKIxUeX5FxEBUw3SCa2UEQFjFZHHFy2RAWvRUbFIHpEBcwiRsxJ+kQLNWJGsVU6RAvMcEbNe0FEEMbIRvEJKUQUMsJGr9ZvRBRKKEiE0AhEFIosSExdg0QUygJH74qPRBUJ/0cuEa1EF817SMdhy0QYDX9Iioq7RBhNNUgOsMVEGI02R2ylHEQgUlpGoqg8RCbLZUfS0XBEJwtbR8oaPEQnS0dHQpPURCpOoUgT0eJEKo66R7HmVEQqzshHJfxdRCsOhUbtkfhENQntRsdXuUQ14uRGuu73RDZ3XkbuMw9ENxdNRql+CUQ4zf1GySnfRDlMkUe1PBJEOYyaR53SZUQ5zKxG3sT1RDoMYkbon1NEOvsNRqo4MkRLFOVGyFTLREtyJkbE4DFES4PARrhXdERLzbFHHdvDREwN60cPEzxElaiKRtVP1ESa4JNG6GD2RKnHHUbZld5EuR0xRuc/PkTf34JG/H+s</peaks>
</scan>
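
For anyone wanting to check their own file, here is a minimal sketch of how the per-scan compression types could be listed with xml2. The three-scan fragment is a made-up stand-in with placeholder payloads and no namespace; a real mzXML file would need the d1: prefix in the XPath, and the variable names are illustrative rather than anything RaMS defines.

```r
library(xml2)

# Hypothetical mzXML-like fragment: only the attributes matter here,
# the base64 payloads are placeholders and the namespace is omitted
mzxml_text <- '<msRun>
  <scan num="1"><peaks precision="32" compressionType="none">AAAA</peaks></scan>
  <scan num="2"><peaks precision="32" compressionType="zlib">AAAA</peaks></scan>
  <scan num="3"><peaks precision="32" compressionType="none">AAAA</peaks></scan>
</msRun>'
doc <- read_xml(mzxml_text)

# One <peaks> child per <scan>; xml_find_first() keeps the two vectors aligned
scan_nodes <- xml_find_all(doc, "//scan")
peak_nodes <- xml_find_first(scan_nodes, "./peaks")

scan_nums <- xml_attr(scan_nodes, "num")
scan_encs <- xml_attr(peak_nodes, "compressionType")

# Per-scan listing plus a quick check for mixed encodings
enc_df <- data.frame(scan = scan_nums, compression = scan_encs)
mixed_encs <- length(unique(scan_encs)) > 1
print(enc_df)
```

If mixed_encs comes back TRUE, the file mixes compression types and a single file-level setting will not decode every scan.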

So instead of

  vals <- lapply(all_peak_nodes, function(binary){
    if(!nchar(binary))return(matrix(ncol = 2, nrow = 0))
    decoded_binary <- base64enc::base64decode(binary)
    raw_binary <- as.raw(decoded_binary)
    decomp_binary <- memDecompress(raw_binary, type = file_metadata$compression)
    final_binary <- readBin(decomp_binary, what = "numeric",
                            n=length(decomp_binary)/file_metadata$precision,
                            size = file_metadata$precision,
                            endian = file_metadata$endi_enc)
    matrix(final_binary, ncol = 2, byrow = TRUE)
  })

I'll have to do something like

peak_nodes <- xml2::xml_find_all(xml_nodes, xpath = "d1:peaks")
all_peak_nodes <- xml2::xml_text(peak_nodes)
all_peak_encs <- xml2::xml_attr(peak_nodes, "compressionType")
# memDecompress() has no "zlib" type; its "gzip" type reads zlib-format streams
all_peak_encs[all_peak_encs %in% "zlib"] <- "gzip"
vals <- mapply(function(binary, encoding_i){
    if(!nchar(binary))return(matrix(ncol = 2, nrow = 0))
    decoded_binary <- base64enc::base64decode(binary)
    raw_binary <- as.raw(decoded_binary)
    # Decompress with this scan's own encoding rather than a file-level one
    decomp_binary <- memDecompress(raw_binary, type = encoding_i)
    final_binary <- readBin(decomp_binary, what = "numeric",
                            n=length(decomp_binary)/file_metadata$precision,
                            size = file_metadata$precision,
                            endian = file_metadata$endi_enc)
    matrix(final_binary, ncol = 2, byrow = TRUE)
# SIMPLIFY = FALSE keeps one matrix per scan instead of collapsing the result
}, all_peak_nodes, all_peak_encs, SIMPLIFY = FALSE, USE.NAMES = FALSE)

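
To sanity-check the per-scan approach without a real file, a toy round trip along these lines might help. It is only a sketch: decode_peaks and enc_map are hypothetical names, the two "scans" are built in memory, and it assumes 32-bit big-endian floats (4 bytes per value), matching precision="32" and byteOrder="network" in the sample above.

```r
library(base64enc)

# Map mzXML compressionType values onto memDecompress() type names
# (R's "gzip" type reads zlib-format, RFC 1950, streams)
enc_map <- c(none = "none", zlib = "gzip")

# Hypothetical helper: decode one <peaks> payload using its own encoding
decode_peaks <- function(binary, encoding_i, size = 4, endian = "big") {
  if (!nchar(binary)) return(matrix(numeric(0), ncol = 2, nrow = 0))
  raw_binary <- base64decode(binary)
  decomp_binary <- memDecompress(raw_binary, type = enc_map[[encoding_i]])
  vals <- readBin(decomp_binary, what = "numeric",
                  n = length(decomp_binary) / size,
                  size = size, endian = endian)
  # mzXML stores m/z and intensity interleaved, so fill by row
  matrix(vals, ncol = 2, byrow = TRUE)
}

# Two toy scans holding the same peaks: one uncompressed, one zlib-compressed
peaks <- c(100, 10, 200, 20)  # values exactly representable as 32-bit floats
raw_peaks <- writeBin(peaks, raw(), size = 4, endian = "big")
scan_none <- base64encode(raw_peaks)
scan_zlib <- base64encode(memCompress(raw_peaks, type = "gzip"))

# Each payload is paired with its own compressionType
vals <- mapply(decode_peaks,
               c(scan_none, scan_zlib), c("none", "zlib"),
               SIMPLIFY = FALSE, USE.NAMES = FALSE)
```

Both list elements should decode to the same two-column matrix, which is the behavior wanted from a file that mixes "none" and "zlib" scans.
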
wkumler commented 7 months ago

This has been implemented on the enc_hotfix branch - I wonder if it's easier to just convert to mzML though??

wkumler commented 7 months ago

A possible implementation would be to request all the encodings in grabEncodings and, if there is more than one (maybe length(unique(encodings)) > 1?), switch to the per-scan method, and otherwise use the original one?
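
A rough sketch of that dispatch, assuming the per-scan compressionType values are already available as a character vector. choose_decode_method is a hypothetical helper, not part of RaMS, and how grabEncodings would actually return the encodings is left open.

```r
# Hypothetical helper: decide between the original single-encoding path and
# the per-scan path based on the compression types seen in the file
choose_decode_method <- function(scan_encs) {
  # Ignore scans whose <peaks> node lacks a compressionType attribute
  uniq_encs <- unique(scan_encs[!is.na(scan_encs)])
  if (length(uniq_encs) > 1) "per_scan" else "single"
}

mixed_method <- choose_decode_method(c("none", "zlib", "none", "none"))
uniform_method <- choose_decode_method(c("zlib", "zlib", "zlib"))
```

Files with a single encoding would keep using the original lapply() code, and only mixed files would go through the mapply() version.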