yafeng / DEqMS

DEqMS is a tool for quantitative proteomic analysis
22 stars 3 forks source link

Error in smooth.spline(x, logVAR, cv = FALSE) :missing or infinite values in inputs are not allowed #2

Closed MiguelCos closed 3 years ago

MiguelCos commented 4 years ago

Hello,

I am testing the package to analyze a time-course proteomics data set produced by MaxQuant. The output was processed using MSstats for protein summarization and normalization and then translated into wide format to use it as input in limma + DEqMS. The spectral count data was produced from the msms.txt file. After following the steps in the vignette to modify the limma::lmFit output object to be used as input for the function DEqMS::spectraCounteBayes() I am receiving the error as written in the tittle. I am attaching here the necessary files and the code to reproduce the error. Every row with missing values was filtered-out from the expression data set and the same set of proteins were used to created the count-data matrix.

I would really appreciate if you could offer any support to understand what can be happening.

annotation_liver.txt logexpr_msstats_norm_liver_wide.txt count_liver_msms.txt

Code below

## Load packages ----

library(readr)
library(dplyr)
library(magrittr)
library(DEqMS)
library(limma)

## Annotation for experimental design ----

annot_liv <- read_delim("annotation_liver.txt",delim = " ") %>% 
          mutate(Raw.file = str_trim(Raw.file))

## Prep count data -----

wide_counts2 <- read_delim("count_liver_msms.txt", delim = " ")

wide_counts3 <- dplyr::select(wide_counts2, -Protein)

countsmat <- data.frame(count = rowMins(as.matrix(wide_counts3)),
                        row.names = wide_counts2$Protein)

## Prep expression data ----

liv_reshaped2woNA <- read_delim("logexpr_msstats_norm_liver_wide.txt", delim = " ")

## Formating for limma ----

liv_reshaped3 <- dplyr::select(liv_reshaped2woNA, -Protein)

liv_resh2matrx <- as.matrix(liv_reshaped3)

row.names(liv_resh2matrx) <- liv_reshaped2woNA$Protein

#liv_resh2matrx[1:6, 1:5]

## Prep design matrix ----

levs <- c("6h", "12h","18h", 
          "24h", "48h", "72h", "96h")

factors <- factor(x = annot_liv$Condition, levels = levs)

## design matrix with time as continuos value ----

time <- str_remove(factors, "h") %>% as.numeric()

# Cubic spline fitting ----
library(splines)
cspl <- ns(time, df = 3)

design_cub <- model.matrix(~cspl)

## Fitting model with limma ----

fit_cub1 <- lmFit(liv_resh2matrx, design = design_cub)
fit_cub2 <- eBayes(fit_cub1, trend = TRUE, robust = TRUE)

## Adapting object for DEqMS ----

fit_cub2$count = countsmat[rownames(fit_cub2$coefficients),"count"]
fit_cub3 = spectraCounteBayes(fit_cub2, coef_col = 2:4, fit.method = "spline")

I also tested:

All of these with the same result.

Many thanks in advance!

yafeng commented 4 years ago

fit_cub2$count can't have any zero values in it. try adding a pseudo count 1 to all values before using spectraCounteBayes function. fit_cub2$count = fit_cub2$count +1.

on the other hand, Is there specific reason that you use splines:ns before creating model matrix? is there a difference if you just create design matrix like this: design_cub <- model.matrix(~factors)

I also looked how the variance~count fit look like in your data using VarianceBoxplot. as I can see, the dependence is not as evident. Maybe it is due to the fewer number of proteins.

MiguelCos commented 4 years ago

Hello, many thanks for checking into this.

The tool is now going through but as you noticed it seems that I am not gaining much in terms of detection (of significantly affected genes) capabilities using this adjustment based on spectral counts.

I have indeed a lot of proteins in my dataset that were identified in most samples but not in one or two (I used MaxQuant with match between runs), that leads to a lot of 0s in spectral counts when getting the minimum per row. This, summed with the fact that I have around 500 proteins to work with could be the reason of the small relationship between variance and spectral count numbers.

The reason I used splines::ns is because I am modelling my data as if it has a time-course behavior. I want to see proteins that show a trend of decreasing or increasing values over-time and I understand that this kind of cubic spline model would help me to get that result while allowing for non-linear increase/decrease. So far I can find a handful of proteins following this behavior, using this model.

When I use design_cub <- model.matrix(~factors), I treat my time points as factors (categorical variables). This model (as I understand) detects any kind of effect of time on the expression level. This F test gives me an 'all proteins are affected' result which is not particularly insightful.