wkumler / RaMS

R-based access to Mass-Spectrometry data
Other
20 stars 7 forks source link

Add convenience functions trapz, mz_group, and qplotMSdata #22

Closed wkumler closed 8 months ago

wkumler commented 8 months ago

There's a couple functions that I find myself repeatedly needing to use alongside RaMS that should really just be included in the package. I'm calling these "convenience" functions because I don't really need them but they're convenient to have pre-written (and unit-tested) and this'll make it easier to share them. These would be much like the existing pmppm function that I've found very useful.

The first is one for integrating raw MS data using trapezoidal Riemann sums. This is the core step in moving from the mz/rt/int matrix data frame to a feature/area data frame. The general consensus seems to be that trapezoidal integration is the way to go since that 1) nicely handles the uneven spacing between retention times and 2) calculates the exact area under the data points. I quibble with this a little bit and wonder whether an absolute sum would be more accurate/precise because 1) the trapezoidal rule technically underestimates the area under the curve because the chromatographic peaks are mostly concave-down, 2) the detector is in fact measuring counts that could be directly summed, and 3) trapezoids can cause falsely inflate the signal of low-quality peaks by linearly interpolating from a single spike across to the next low data point. But I'm in the minority here so I think the default of trapezoidal integration is still the way to go. I currently just use some code stolen from pracma::trapz but I'd like to do some more careful handling of this. The function should just take the rt/int values and the user should be responsible for filtering those out ahead of time.

  1. The function should check (and warn?) if there's more than one entry per retention time
    • While it's technically possible to perform integration on a chromatogram that has multiple entries per RT this is usually not what's intended
  2. The function should perform baseline subtraction if requested (options = "square", "linear")
    • "Square" subtracts the lowest value in the peak from each RT, essentially removing the area in a rectangle below the lowest point on the peak
    • "Linear" would remove the trapezoid composed of the leftmost point, the rightmost point, and the zero-values at each end.
  3. The function should warn if there's a lot of data points(?)
  4. The function should warn if the RTs are not ordered (reluctant to forcibly order them, I want the user to do this organization)

The second one is a more heuristic function I wrote because I wanted to pull out EICs from the raw data by grouping things in m/z space. When I extract a large m/z window, I often end up with multiple clear chromatograms in m/z space that can clearly be grouped but the only obvious way to do this is to specify repeated ifelse or a case_when with lots of m/z cutoffs. Instead, we can use the algorithm currently implemented in ADAP and start with the largest intensity mass in the data, identify an m/z window around it, then group all of those points into a single mean value. We then remove those points from consideration and repeat the process until there's no points left. The tricky part was figuring out how to do this in a tidy fashion but I think I'm happy with the implementation that returns a vector in the same order as the data with integer values corresponding to the EICs detected. This then can be passed neatly into a mutate statement to create a new column (usually called mz_group) which I can then group or color by. More advanced functions could accept the RTs as well and do some clever ROI detection but I think I'd rather leave this one simple.

  1. The function should accept the m/z vector and a ppm window
  2. Returns an integer vector? "Cluster X"?
  3. Option to drop EICs below a certain (contiguous?) length

The final one is a plotting option - I originally avoided including this because I wanted people to learn/use the actual ggplot2 syntax but my poor fingers are tired of typing the exact same ggplot() + geom_line(aes(x=rt, y=int, group=filename)) over and over again. I think instead I'm going to add a qplotMSdata function that is basically just that line of code. I think it would also be neat to have arguments for color= and facet= because I often color/facet by additional columns too. Unfortunately, adding ggplot2 would add a bunch of dependencies and I'm trying to keep RaMS lightweight so I think it'll be a mimic for now. I wonder if I can add ggplot2 code and only run it if ggplot2 is already loaded or if that would cause issues with CRAN checks?

wkumler commented 8 months ago

Completed with #24