wfondrie / depthcharge

A deep learning toolkit for mass spectrometry
https://wfondrie.github.io/depthcharge/
Apache License 2.0

Obtaining the retention time and ActivationType (e.g., HCD) from mgf files using Depthcharge #55

Closed junxia97 closed 2 months ago

junxia97 commented 3 months ago

Thanks for your excellent contributions to the computational mass spectrum community. Can we obtain the retention time and ActivationType (e.g., HCD) from mgf files using the depthcharge package? And how?

wfondrie commented 3 months ago

Hi @junxia97 👋

Can we obtain the retention time and ActivationType (e.g., HCD) from mgf files using the depthcharge package?

Yes, as long as they are specified in the MGF file itself. Under the hood, Depthcharge uses the MGF parser from Pyteomics, and you can access any available field using the custom_fields parameter within the depthcharge parsers.

And how?

Here's a small example - we'll start by creating an example MGF file:

mgf_contents = """\
BEGIN IONS
PEPMASS=1000.0
CHARGE=3+
NCE=25.0
FOO=BAR
10.0 1234.5
20.0 6789.0
END IONS
"""

with open("example.mgf", "w+") as mgf_out:
    mgf_out.write(mgf_contents)

We can then look at how Pyteomics will parse the spectrum:

from pyteomics.mgf import MGF

with MGF("example.mgf") as mgf_in:
    print(next(mgf_in))

# {   'charge array': masked_array(data=[--, --],
#             mask=[ True,  True],
#       fill_value=0,
#            dtype=int64),
#    'intensity array': array([1234.5, 6789. ]),
#    'm/z array': array([10., 20.]),
#    'params': {   'charge': [3],
#                  'foo': 'BAR',
#                  'nce': '25.0',
#                  'pepmass': (1000.0, None)}}

If we want to parse NCE and foo in this case, we now know that we need to access the nce and foo keys within the params key of the spectrum dictionary. So if the spectrum is spec, then we need spec["params"]["nce"] and spec["params"]["foo"]. Also note that all extra fields in MGF files are parsed as strings.
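Because these extra fields arrive as strings, a numeric one such as NCE may be more useful as a float. Here is a minimal sketch, assuming a spectrum dictionary shaped like the Pyteomics output above. Note that `param_as_float` is a hypothetical helper written for illustration, not part of Depthcharge or Pyteomics:

```python
from typing import Any, Callable, Optional


def param_as_float(key: str) -> Callable[[dict[str, Any]], Optional[float]]:
    """Build an accessor that reads spec["params"][key] and converts it to float."""

    def accessor(spec: dict[str, Any]) -> Optional[float]:
        value = spec["params"].get(key)
        # Return None when the field is absent from this spectrum.
        return None if value is None else float(value)

    return accessor


# A plain dictionary mimicking the parsed spectrum shown above.
spec = {
    "params": {
        "charge": [3],
        "foo": "BAR",
        "nce": "25.0",
        "pepmass": (1000.0, None),
    }
}

get_nce = param_as_float("nce")
nce_value = get_nce(spec)  # the string "25.0" converted to a float
missing_value = param_as_float("rtinseconds")(spec)  # key not present here
```

An accessor like `get_nce` could presumably be used in place of a lambda when defining a custom field, paired with a float dtype such as `pa.float64()` rather than `pa.string()`. Whether Depthcharge accepts `None` for a missing value is an assumption worth checking against your version.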

Now we can tell Depthcharge how we want to parse these as custom fields:

import pyarrow as pa
import depthcharge as dc
from depthcharge.data import CustomField

df = dc.data.spectra_to_df(
    "example.mgf", 
    custom_fields=[
        CustomField("NCE", lambda s: s["params"]["nce"], pa.string()),
        CustomField("foo", lambda s: s["params"]["foo"], pa.string())
    ],
)

print(df)

#┌─────────────┬─────────┬──────────┬──────────────┬───┬───────────┬─────────────────┬──────┬─────┐
#│ peak_file   ┆ scan_id ┆ ms_level ┆ precursor_mz ┆ … ┆ mz_array  ┆ intensity_array ┆ NCE  ┆ foo │
#│ ---         ┆ ---     ┆ ---      ┆ ---          ┆   ┆ ---       ┆ ---             ┆ ---  ┆ --- │
#│ str         ┆ str     ┆ u8       ┆ f64          ┆   ┆ list[f64] ┆ list[f64]       ┆ str  ┆ str │
#╞═════════════╪═════════╪══════════╪══════════════╪═══╪═══════════╪═════════════════╪══════╪═════╡
#│ example.mgf ┆ 0       ┆ 2        ┆ 1000.0       ┆ … ┆ [20.0]    ┆ [1.0]           ┆ 25.0 ┆ BAR │
#└─────────────┴─────────┴──────────┴──────────────┴───┴───────────┴─────────────────┴──────┴─────┘

Now you have df, which is a Polars DataFrame 🎉
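Coming back to the original question, the same pattern should extend to retention time and activation type. The sketch below makes several assumptions:

- The MGF file stores retention time under the standard RTINSECONDS key.
- The activation type is stored under an ACTIVATIONTYPE key. That key name is hypothetical, so check how your file actually labels it.
- Pyteomics lowercases both keys, as it did for NCE and FOO.

The helper names and the column names `rt` and `activation` are also made up for illustration.

```python
from typing import Any


def get_rt(spec: dict[str, Any]) -> float:
    """Retention time in seconds, assuming an RTINSECONDS line in the MGF."""
    return float(spec["params"]["rtinseconds"])


def get_activation(spec: dict[str, Any]) -> str:
    """Activation type, assuming a (hypothetical) ACTIVATIONTYPE=HCD line."""
    return str(spec["params"]["activationtype"])


def load_with_metadata(mgf_path: str):
    """Parse an MGF file with both fields added as extra columns."""
    # Imported here so the accessors above can be tried without Depthcharge.
    import pyarrow as pa
    import depthcharge as dc
    from depthcharge.data import CustomField

    return dc.data.spectra_to_df(
        mgf_path,
        custom_fields=[
            CustomField("rt", get_rt, pa.float64()),
            CustomField("activation", get_activation, pa.string()),
        ],
    )


# Check the accessors against a dictionary shaped like Pyteomics output.
example = {"params": {"rtinseconds": "123.4", "activationtype": "HCD"}}
rt_seconds = get_rt(example)
activation = get_activation(example)
```

If a key is missing from a spectrum, these accessors raise a KeyError. That is probably the quickest way to find out that the file uses a different name for the field.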