wkumler / RaMS

R-based access to Mass-Spectrometry data
Other
20 stars 7 forks source link

feature request: dia data #25

Open RaynerQueiroz opened 7 months ago

RaynerQueiroz commented 7 months ago

Hi @wkumler, I have been looking into ways I could plot XICs from precursors and their respective fragments from DIA data independently of the search software. You package seems very promising, but I can see current functions are restricted to DDA and SRM. I was wondering if you have plans to implement functions to support DIA and specifically XICs from MS1 and MS2 for precursor and fragments?

wkumler commented 7 months ago

Hi @RaynerQueiroz, RaMS should still be able to pull out the data associated with DIA scans, although I'll be the first to admit it's super messy and a totally unstructured long-format data.table probably isn't the best way to explore DIA data. RaMS specifically doesn't do any deconvolution or anything for the DIA samples so the fragment XICs probably won't look great either.

As an example, here's some things you can do with RaMS for an example file provided by the Skyline group. You'll have to download the files (~4.5GB) from the website and unzip them manually, then run msconvert to get them from .raw into .mzML format.

# msconvert.exe ~/../Downloads/DIA-20_2/DIA-20_2/20130311_DIA_Pit01.raw --filter "peakPicking true 1-" --filter "polarity positive"

After that, you should be able to read in the data directly to R with the typical grabMSdata function, though note that this R object is very large (~2GB in RAM)

library(RaMS)
msdata <- grabMSdata("~/../Downloads/DIA-20_2/DIA-20_2/20130311_DIA_Pit01.mzML")

That allows you to then explore the data like you might with typical DDA methods, i.e. plotting a BPC and pulling out one of the larger masses for inspection:

library(tidyverse)
msdata$BPC %>%
  ggplot() +
  geom_line(aes(x=rt, y=int))

BPC

msdata$MS1[order(int, decreasing = TRUE)]
# Second-largest mass is 657.8411
msdata$MS1[mz%between%pmppm(657.8411, 15)] %>%
  ggplot() +
  geom_line(aes(x=rt, y=int))

Single-mass chrom

If you're looking for the fragments "associated" with this mass, however, you're going to be disappointed with RaMS in DIA space because there's no deconvolution performed, so all that RaMS can tell you is the fragments that were detected within the ~20Da window that includes this mass. The best you can probably do is identifying the RT range of the peak (I do this by eyeballing the edges in the above plot) and then subsetting

# Find the DIA window that contains the mass of interest
dia_windows <- unique(msdata$MS2$premz)
given_bin <- dia_windows[which.min(abs(dia_windows-657.8411))]

# Plot the fragments from the above window within the peak RT bounds
msdata$MS2[premz==given_bin][rt%between%c(45.7,47.1)] %>%
  ggplot() +
  geom_point(aes(x=rt, y=fragmz, color=log10(int)), alpha=0.5) +
  scale_color_viridis_c()

all MS2 data within DIA window and RT range of the peak

Or maybe plotting a fragment spectra for the scans that were collected at the very top of the peak:

msdata$MS2[premz==given_bin][rt%between%c(46, 46.2)] %>%
  ggplot(aes(x=fragmz, y=int)) +
  geom_point() +
  geom_segment(aes(xend=fragmz, yend=0)) +
  facet_wrap(~rt, ncol = 1)

image

Or performing some rudimentary grouping by m/z window (here, the mz_group function in the development version of RaMS) and then plotting various fragments:

msdata$MS2[premz==given_bin][rt%between%c(45.7,47.1)] %>%
  arrange(desc(int)) %>%
  mutate(mz_group=factor(mz_group(fragmz, 5))) %>%
  group_by(rt, mz_group) %>%
  summarise(int=max(int), medmz=round(median(fragmz), 4)) %>%
  ggplot() +
  geom_line(aes(x=rt, y=int, color=medmz, group=mz_group)) +
  scale_color_viridis_c() +
  theme_bw()

image

Of course, if you know of a few specific fragments you can pull those out of the DIA data as well whether or not you've subset for a specific peak, and this is a place where DIA data often looks nicer than DDA data because the coverage across the run is so much better:

msdata$MS2[fragmz%between%pmppm(118.0865)] %>%
  ggplot() +
  geom_line(aes(x=rt, y=int))

image

Hopefully that answers a few of the questions you have about accessing DIA data with RaMS, but please let me know if you had a different vision in mind for a feature that you'd like to see implemented.

RaynerQueiroz commented 7 months ago

Hi @wkumler, thank you for the thorough reply! Yes I have indeed been trying something more or less on those lines. I have searched my data before with DIANN which conveniently includes the column "Fragment.Info" with the fragments, charges and their m/z if I add the --report-lib-info command, and I can compile the fragments I wish to extract for a given peptide from that.

peptide_lib <- diann_out%>%
               dplyr::select(Precursor.Id, Fragment.Info) %>%
               unique() %>%
               separate_rows(Fragment.Info, sep = ";") %>%
               separate(Fragment.Info, into = c("Ion", "MZ"), sep = "/") %>%
               mutate(MZ = round(as.numeric(MZ), 4)) %>%
               separate(Ion, into = c("Fragment", "Charge"), sep = "\\^", remove = FALSE) %>%
               na.omit()

precursor <- diann_output %>% 
             mutate(Ion = "Precursor") %>%
             mutate(Fragment = "Intact_Peptide") %>%
             dplyr::select(Precursor.Id, Ion, Fragment, Precursor.Charge, Precursor.Mz) %>%
             unique() %>%
             rename(all_of(c(Charge = "Precursor.Charge" , MZ = "Precursor.Mz")) )

peptide_lib <- rbind(peptide_lib,precursor)

I have attempted to generate a function to extract the relevant fragments, but I must say I got stuck... I will try to explain it here: 1) Since "premz" will not match to my target peptide in DIA data, I first loaded a csv with the swath windows so I can match down the line the fragments that arose from a precursors in the correct m/z range ("premz" seem to show the center m/z of each window)

2) Create the mzML files. MetabolomicsBasics package has a nice function to run it straight in R

library(MetabolomicsBasics)
##It is a quick and dodgy function to show how to convert vendor MS data into an open format (mzML). 
##You will have to download/install MSConvert prior to usage, and probably adjust the arguments 
##according to your needs. Arguments are documented here https://proteowizard.sourceforge.io/tools/msconvert.html. 
##If you don't know where the msconvert.exe is installed you can check for the correct path using
##list.files(path="C:/", pattern="^msconvert.exe$", recursive = TRUE).

msconvert(raw_files, msc_exe = "C:\\Program Files\\ProteoWizard\\ProteoWizard\\msconvert.exe", 
          args = c("--filter \"scanTime [0,9000]\"", # MS run length in seconds
                   "-o E:\\mzML_Files", # Output folder
                   "--64",  # 64-bit precision
                   "--filter \"zeroSamples removeExtra\""  # Remove extra zero samples
          ))  

3) Load the data

## 'pep' is a given peptide sequence and 'diann_output$Name' is a column crated to simplify the file names and make another matching/filtering easier for me 
msdata <- grabMSdata(file_mzML, ## The mzXL file I want to extract info from 
                     grab_what = c("MS1","MS2"), 
                     rtrange = c(diann_output[diann_output$Name == diann_output[diann_output$Precursor.Id == pep,]$Name[ i ] &
                                              diann_output$Precursor.Id == pep,]$RT.Start - 2.5, 

                                diann_output[diann_output$Name == diann_output[diann_output$Precursor.Id == pep,]$Name[ i ] &
                                              diann_output$Precursor.Id == pep,]$RT.Stop + 2.5) ## the retention time range I want to extract the XIC. The can be eliminate if I want to check for peak splitting 
                     )

4)Then is the function I am trying to develop to extract the fragments to plot their XICs, but not very successfully yet.

extractFragMZ <- function(data, pep_lib, ppm = 5, swath_windows, mz = "mz") {
  # Define the ppm calculation function
  calculate_ppm <- function(mz, ppm) {
    return(mz * ppm / 1e6)
  }

  # Initialize lists to store matching rows
  matching_rows <- list()

  # Iterate over each row in peptide_lib
  #lib <- pep_lib[pep_lib$Precursor.Id == pep,]
  lib <- pep_lib
  for (r in 1:nrow(lib)) {
    mz_value <- lib$MZ[r]
    ion_value <- lib$Ion[r]
    fragment_value <- lib$Fragment[r]
    #premz_value <- lib[lib$Precursor.Id == pep & lib$Ion == "Precursor",]$MZ

    # Calculate ppm range
    ppm_range <- calculate_ppm(mz_value, ppm)

    # Filter rows in msdata based on ppm range
    #matching_rows[[r]] <- data[abs(data$fragmz - mz_value) <= ppm_range & abs(data$premz - premz_value) <= ppm_range, ]
    matching_rows[[r]] <- data[abs(data$fragmz - mz_value) <= ppm_range, ]

    # Add Ion and Fragment values
    matching_rows[[r]]$Ion <- ion_value
    matching_rows[[r]]$Fragment <- fragment_value
    matching_rows[[r]]$LibMZ <- mz_value
    matching_rows[[r]]$Precursor <- pep
    matching_rows[[r]]$PrecursorMZ <- premz_value
  }

  # Combine matching rows into a single data frame
  matched_data <- do.call(rbind, matching_rows)
  matched_data <- merge(matched_data, swath_windows, by.x = "premz", by.y = mz, all.y = F ) ## RQ note: 'mz' is the placeholder for the column name in the file with our method`s DIA windows

  # Filtering rows where the expected precursor m/z is inside the d=DIA method`s swath windows.  
  matched_data <- matched_data %>% dplyr::filter(PrecursorMZ >= window_start & PrecursorMZ <= window_end)

  # Return the filtered and updated data frame
  return(matched_data)
}

Maybe I am overthinking something and you can give help out?

wkumler commented 7 months ago

I'm afraid I'm not especially familiar with DIANN or the format of its outputs so I don't think I have a ton of advice to offer here. It sounds like you've got a peptide and some expected fragments from each in the peptide_lib and you'd like to compile the XICs for each fragment in comparison to the original peak. I agree that the tricky part is that the precursor m/z won't match exactly to the values in premz because they're DIA. The snippet below is what I came up with to match the known peptide to the DIA window in which it'll be found, which can then be used to subset the MS2 data. The numeric in the abs(dia_windows-657.8411) is the mass of the peptide that I think you're hoping to loop over for each peptide entry.

dia_windows <- unique(msdata$MS2$premz)
given_bin <- dia_windows[which.min(abs(dia_windows-657.8411))]
window_ms2 <- msdata$MS2[premz==given_bin]

If you have multiple fragments for each peptide and you're looping over multiple peptides, I would store the fragment masses in a nested list where the outer list is the peptide and the inner list is the fragments.

Making up some demo data:

pep_lib <- structure(list(pep_id = c("A", "B", "C", "D", "E"), pep_mz = c(686.4724, 
657.8403, 637.8713, 557.3064, 687.4749), pep_rt = c(100.13368, 
46.091647, 67.579531, 66.608362, 100.13368), pep_pred_frag_mzs = list(
    c(668.4605, 669.4641, 485.3347, 404.2767), c(214.1193, 659.3476, 
    643.4155), c(227.1761, 687.3588, 326.2448, 850.4219, 199.1812
    ), c(185.1654, 729.8756, 729.3746, 86.0972, 213.1604), c(668.4605, 
    669.4641, 485.3347, 404.2767))), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -5L))
# A tibble: 5 × 4
  pep_id pep_mz pep_rt pep_pred_frag_mzs
  <chr>   <dbl>  <dbl> <list>           
1 A        686.  100.  <dbl [4]>        
2 B        658.   46.1 <dbl [3]>        
3 C        638.   67.6 <dbl [5]>        
4 D        557.   66.6 <dbl [4]>        
5 E        687.  100.  <dbl [3]> 

which can then be used to search for fragments like so:

# Outer loop for each peptide
pep_frag_DIA_data <- list()
for(pep_row_idx in seq_len(nrow(pep_lib))){
  pep_row_data <- pep_lib[pep_row_idx,]

  given_bin <- dia_windows[which.min(abs(dia_windows-pep_row_data$pep_mz))]
  window_ms2 <- msdata$MS2[premz==given_bin][rt%between%(pep_rt+c(-2.5, 2.5))]

  # Inner loop for each fragment
  frag_masses <- pep_row_data$pep_pred_frag_mzs[[1]]
  frag_xics <- list()
  for(frag_mz_i in seq_along(frag_masses)){
    frag_xics[[frag_mz_i]] <- window_ms2[fragmz%between%pmppm(frag_masses[frag_mz_i], 10)]
  }
  pep_frag_DIA_data[[pep_row_idx]] <- do.call(what = rbind, frag_xics)
}
pep_lib$pep_pred_frag_xics <- pep_frag_DIA_data

which can then be plotted like so:

pep_lib$pep_pred_frag_xics[2][[1]] %>%
  mutate(frag_group=mz_group(fragmz, 10)) %>%
  ggplot() +
  geom_line(aes(x=rt, y=int, color=frag_group, group=frag_group))

Not sure whether that's helpful or not - you may need to assemble a reprex and share some example data if that doesn't help since I'm still not sure what exactly the issue is with the code you shared.

RaynerQueiroz commented 7 months ago

This is helpful. Thanks @wkumler! I am trying to implement your scripts now. Also, I would like to check if you are open for consultancy work on this matter should I fail to implement this? If so please let me know how to contact you to discuss that

wkumler commented 7 months ago

No problem! Glad folks are finding the package useful. Unfortunately I'll be defending my general exam in February and time is scarce for me at the moment so I'm about tapped out now that the new 1.3.4 release is on CRAN.