wkumler / mzsql

A repository of data and code showing the efficiency of databases relative to existing mass-spectrometry database formats

Decide whether to access mzML files with pyOpenMS, pyteomics, pymzml, or spectrum_utils #11

Closed wkumler closed 6 days ago

wkumler commented 2 weeks ago

There are lots of different ways to get at the MS data. I'm not sure which ones are fastest or easiest, so it could be worth doing a direct comparison.

Looks like spectrum_utils uses Numba for optimization after the fact. I'm not sure I love this, since we might only be making a single call. The spectrum_utils documentation says "IO functionality to read spectra from MS data files is not directly included in spectrum_utils. Instead you can use excellent libraries to read a variety of mass spectrometry data formats such as Pyteomics or pymzML." So maybe it's not meant to read the spectra itself? But it certainly seems to do that... I'm confused.

Looks like pyOpenMS is just a Python wrapper around the OpenMS C++ library

Unclear what pyteomics is at all

There's a comparison between pyOpenMS and pymzml here:

and a comparison between pymzml, pyOpenMS, and spectrum_utils here:

wkumler commented 2 weeks ago

Preliminary results from multi_py_package_comp are shown below for a single spectrum access:

[image: preliminary timing results for single-spectrum access]

This suggests that pyOpenMS is the main competitor and pymzml comes second, so the remaining question is whether they're also optimized for chromatogram extraction and RT-range extraction.
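For reference, the three access patterns mentioned above (single spectrum, chromatogram extraction, RT-range extraction) can be sketched without any of the packages. This is only an illustration on made-up in-memory data: the `scans` list, its values, and the function names are hypothetical, and it assumes each scan's m/z values are sorted. It is not how pyOpenMS or pymzml do it internally.

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory scans: (retention time, sorted m/z list, intensity list)
scans = [
    (1.0, [100.0, 118.086, 200.0], [10.0, 500.0, 20.0]),
    (2.0, [100.0, 118.087, 200.0], [12.0, 800.0, 25.0]),
    (3.0, [100.0, 150.0, 200.0], [11.0, 40.0, 22.0]),
]


def get_spectrum(scans, index):
    """Single-spectrum access: the (m/z, intensity) pair for one scan."""
    _, mz, intensity = scans[index]
    return mz, intensity


def extract_chromatogram(scans, mz_min, mz_max):
    """Chromatogram extraction: summed intensity in an m/z window, per scan."""
    chrom = []
    for rt, mz, intensity in scans:
        # Binary search works because m/z values are assumed sorted
        lo = bisect_left(mz, mz_min)
        hi = bisect_right(mz, mz_max)
        chrom.append((rt, sum(intensity[lo:hi])))
    return chrom


def extract_rt_range(scans, rt_min, rt_max):
    """RT-range extraction: all scans with rt_min <= retention time <= rt_max."""
    return [scan for scan in scans if rt_min <= scan[0] <= rt_max]
```

Single-spectrum access touches one scan, while chromatogram extraction has to visit every scan in the file, which is why a package that wins on the first pattern may not win on the others.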

import pyopenms
import pymzml
from pyteomics import mzml

# Demo file shared by all three readers
MZML_PATH = "demo_data/180205_Poo_TruePoo_Full1.mzML"


def pyopenms_fun():
    # Load the whole file into memory, then collect the peaks of every spectrum
    exp = pyopenms.MSExperiment()
    pyopenms.MzMLFile().load(MZML_PATH, exp)
    # get_peaks() returns an (m/z, intensity) pair, so call it once per spectrum
    return [tuple(spec.get_peaks()) for spec in exp]


def pyteomics_fun():
    # The with block closes the file once every spectrum has been read
    with mzml.MzML(MZML_PATH) as reader:
        return [(spec["m/z array"], spec["intensity array"]) for spec in reader]


def pymzml_fun():
    # Iterate over the reader and collect the m/z and intensity arrays
    run = pymzml.run.Reader(MZML_PATH)
    return [(spec.mz, spec.i) for spec in run]

import timeit

import matplotlib.pyplot as plt

# One full-file read per measurement, repeated 10 times for each package
pyopenms_times = timeit.repeat('pyopenms_fun()', globals=globals(), number=1, repeat=10)
pyteomics_times = timeit.repeat('pyteomics_fun()', globals=globals(), number=1, repeat=10)
pymzml_times = timeit.repeat('pymzml_fun()', globals=globals(), number=1, repeat=10)

# tick_labels needs matplotlib >= 3.9 (older versions call this argument labels)
plt.boxplot([pyopenms_times, pyteomics_times, pymzml_times], tick_labels=['pyOpenMS', 'pyteomics', 'pymzml'])
plt.ylabel('Time per full-file read (s)')
plt.show()
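The boxplot could also be backed up with a few numbers per package. Below is a minimal sketch of one way to summarize a list returned by `timeit.repeat`. The `summarize_times` helper and the `placeholder_fun` workload are hypothetical stand-ins so that it runs without an mzML file. The `timeit` documentation suggests the minimum is usually the most informative figure, since larger values mostly reflect interference from other processes.

```python
import statistics
import timeit


def summarize_times(times):
    """Return the min, median, and max of a list of timings in seconds."""
    return {
        "min": min(times),
        "median": statistics.median(times),
        "max": max(times),
    }


def placeholder_fun():
    # Stand-in workload; swap in pyopenms_fun, pyteomics_fun, or pymzml_fun
    return sum(range(10_000))


# Same settings as the benchmark: one call per measurement, 10 repeats
placeholder_times = timeit.repeat(placeholder_fun, number=1, repeat=10)
summary = summarize_times(placeholder_times)
print(summary)
```

Calling `summarize_times(pyopenms_times)` and likewise for the other two lists would give figures that are easier to quote in a comment than the plot alone.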