wwylab / DeMixT

GNU General Public License v3.0
32 stars 14 forks source link

Error when using DeMixT to process RNA-seq data from TCGA #19

Closed Jwenyi closed 2 years ago

Jwenyi commented 2 years ago

Dear DeMixT team,

I'm trying to use DeMixT to get purified tumor expression profiles from pan-cancer RNA-seq data of TCGA. However, DemixT and DeMixT_GS functions always report error of Error in if (sum(obj == 0) > 1) { : missing value where TRUE/FALSE needed . I attached my workfolw below. And, I would be very grateful if you could give me any suggestion to fix it.

My input data is kind of TPM expression matrix (without log2 convert), which contains over 25k genes and hundreds samples. Normal tissue expression matrix is also obtained from TCGA. In most cases, TCGA provides several normal data (2 to 50 samples) and lots of tumor data (hundreds).

I found that this error is raised by function "Optimum_KernelC", which is used in "DeMixT_GS ". The objreturn NaN NaN, so if(sum(obj == 0)>1) failed.

rres <- .C("Tdemix", dataarray1, as.integer(groupid), as.integer(nsub), 
             as.integer(wgenes), as.integer(nspikein), as.integer(setting.pi), 
             pi01, pi02, givenpi1, givenpi2, 
             givenpi3, as.integer(nCid), as.integer(niter), 
             as.integer(ninteg), tol, as.integer(nthread), 
             s0, m0, rep(0, 2 * intx), 
             rep(0, intx * wgenes), rep(0, niter * wgenes), 
             rep(0, niter * wgenes), rep(0, niter * intx), 
             rep(0, niter * intx), rep(0,niter), 
             rep(0, intx * wgenes), rep(0, intx * wgenes))

  obj<-rres[[25]]
 if(sum(obj == 0)>1){
     niter1 <- which(obj==0)[1]-1
   }else{
     niter1 <- length(obj)
   }
**Error in if (sum(obj == 0) > 1) { : missing value where TRUE/FALSE needed**

obj
[1] NaN NaN
sum(obj == 0)>1
[1] NA

Preview of my input data for "DemixT "function

data.Y = SummarizedExperiment(list(TPM=as.matrix(tumor_exprs)))
data.Y
class: SummarizedExperiment 
dim: 34432 167 
metadata(0):
assays(1): TPM
rownames(34432): TSPAN6 TNMD ... SNORA50A LINC01144
rowData names(0):
colnames(167): TCGA-CI-6621-01A-11R-1830-07 TCGA-EF-5831-01A-01R-1660-07 ... TCGA-AG-3587-01A-01R-0821-07
  TCGA-DY-A1DE-01A-11R-A155-07
colData names(0):

data.N1 = SummarizedExperiment(list(TPM=as.matrix(normal_exprs)))
data.N1
class: SummarizedExperiment 
dim: 34432 10 
metadata(0):
assays(1): TPM
rownames(34432): TSPAN6 TNMD ... SNORA50A LINC01144
rowData names(0):
colnames(10): TCGA-AG-3732-11A-01R-1660-07 TCGA-AF-2689-11A-01R-A32Z-07 ... TCGA-AF-3400-11A-01R-A32Z-07
  TCGA-AF-2692-11A-01R-A32Z-07
colData names(0):

My full code as shown below:

#set work environments
setwd("D:/BioI/2.RNA_seq")
ps <- c("data.table","tidyverse","limma","org.Hs.eg.db","AnnotationDbi",
        "ComplexHeatmap", "TCGAbiolinks","SummarizedExperiment","DT",
        "foreach","doParallel","ggplot2","cowplot","DeMixT")
lapply(ps, function(x){
  suppressMessages(library(x, character.only = T))}) 

# set functions
fpkmToTpm <- function(fpkm)
{
  exp(log(fpkm) - log(sum(fpkm)) + log(1e6))
}

# download TCGA and TARGET pan-cancer RNA-seq data (FPKM exprs)
cl <- makeCluster(8) 
pkgs <- c("foreach","tidyverse","TCGAbiolinks","SummarizedExperiment","DT","data.table") 
registerDoParallel(cl)

cancer_type <- TCGAbiolinks:::getGDCprojects()$project_id
cancer_type <- grep("TCGA|TARGET",cancer_type,value = T)

foreach (i = cancer_type,.packages=pkgs) %dopar% {

query <- GDCquery(project = i, 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM",
                  legacy = F)
 # remove duplicate samples
 tmp <- query$results[[1]]
 tmp <- tmp[which(!duplicated(tmp$cases)),]
 query$results[[1]] <- tmp
GDCdownload(query, method = "api", files.per.chunk = 1000)
expdat <- GDCprepare(query = query)
fpkm_matrix <- assay(expdat)
fpkm_matrix <- cbind(id = row.names(fpkm_matrix),fpkm_matrix) %>% data.table()
fwrite(fpkm_matrix,file = paste0("D:/BioI/2.RNA_seq/data/",i,"FPKM.txt"),sep = "\t",row.names = F,col.names = T,quote = F)
}
stopCluster(cl)

files <- dir("D:/BioI/2.RNA_seq/data/",full.names = F)
files <- gsub("FPKM.txt","",files)
setdiff(cancer_type,files)
 #TCGAbiolinks is unable to download TARGET-AML which need to download manually

rm(list = c("cancer_type","expdat","fpkm_matrix","i","tmp","query"))
gc()

# data processing

files <- dir("D:/BioI/2.RNA_seq/data/",full.names = T)
i = files[33]

# FPKM to TPM
  fpkm <- fread(i,header = T) %>% as.matrix()
  rownames(fpkm) <- fpkm[,1]
  fpkm <- fpkm[,2:ncol(fpkm)]
  dimnames1 <- list(rownames(fpkm),colnames(fpkm))
  fpkm <- matrix(as.numeric(as.matrix(fpkm)),nrow=nrow(fpkm),dimnames=dimnames1)
  TPM_exprs <- apply((fpkm+0.001),2,fpkmToTpm)

  # Ensemble id to gene symbol
  TPM_exprs <- cbind(id = rownames(TPM_exprs),TPM_exprs)
  ID <- rownames(TPM_exprs)
  gene.id <- AnnotationDbi::select(org.Hs.eg.db, keytype = "ENSEMBL", keys =  ID, columns = c("ENTREZID", "SYMBOL", "GENENAME"))
  gene.id <- gene.id[,c(1,3)]
  colnames(gene.id) <- c("id","symbol")
  gene.id <- gene.id[complete.cases(gene.id),] %>% as.matrix()
  TPM_exprs <- merge(gene.id,TPM_exprs,by="id")[,-1] %>% as.matrix()

  # Averaging duplicate genes
  rownames(TPM_exprs) <- TPM_exprs[,1]
  TPM_exprs <- TPM_exprs[,2:ncol(TPM_exprs)]
  dimnames2 <- list(rownames(TPM_exprs),colnames(TPM_exprs))
  TPM_exprs <- matrix(as.numeric(as.matrix(TPM_exprs)),nrow=nrow(TPM_exprs),dimnames=dimnames2)
  TPM_exprs <- avereps(TPM_exprs)

  # extract normal and cancer samples' exprs matrix
  barcode <- colnames(TPM_exprs)

  if(grepl("TARGET",i)){
    if(grepl("TARGET-ALL-P3FPKM",i)){
      tumor_sample_ids <- which(substr(barcode,25,26) %in% c('01',"02","03","04","05","06","07","08","09"))
      tumor_exprs <- TPM_exprs[,tumor_sample_ids]
      normal_exprs <- TPM_exprs[,-tumor_sample_ids]
    }else{
    tumor_sample_ids <- which(substr(barcode,18,19) %in% c('01',"02","03","04","05","06","07","08","09"))
    tumor_exprs <- TPM_exprs[,tumor_sample_ids]
    normal_exprs <- TPM_exprs[,-tumor_sample_ids]
    }
  }else{
    tumor_sample_ids <- which(substr(barcode,14,15) %in% c('01',"02","03","04","05","06","07","08","09"))
    tumor_exprs <- TPM_exprs[,tumor_sample_ids]
    normal_exprs <- TPM_exprs[,-tumor_sample_ids]
  }

# calculate purified tumor exprs from mixture RNA-exprs matrix of tumor samples

res.GS.TCGA = DeMixT(data.Y = SummarizedExperiment(list(TPM=as.matrix(tumor_exprs))),
                            data.N1 = SummarizedExperiment(list(TPM=as.matrix(normal_exprs))),
                            data.N2 = NULL,
                            niter = 10, nbin = 50, filter.sd = 0.5, nspikein = 49,
                            if.filter = TRUE, ngene.Profile.selected = 1500,
                            mean.diff.in.CM = 0.25, ngene.selected.for.pi = 1500,
                            tol = 10^(-5)) 

All the best, Wenyi Jin

jiyunmaths commented 2 years ago

@Jwenyi Thank you for using DeMixT in our study. Can I suggest you to use raw read counts instead of TPM/FPKM as the input for DeMixT? You can obtain the it by query <- GDCquery(project = i, data.category = "Transcriptome Profiling", data.type = "Gene Expression Quantification", workflow.type = "HTSeq - Counts", legacy = F)

and feed the counts directly to DeMixT. Please run DeMixT again and let me know if you have the error again. Thanks.

Jwenyi commented 2 years ago

Thanks @jiyunmaths I fed DeMixT with raw read counts data, but there's another ERROR raised. Error in inputdata[id1, ] : (subscript) logical subscript too long I checked original code of DeMixT_GS and found that this error might be ascribed to a code in Line 433 and 291 inputdatamat1 <- inputdata[id1, ] I thinks this code is to extract a subset of inputdata, in which genes expressions are less thanfilter.sd. So I guess that maybe the inputdata contains too many rows (over 52k), thus[] is unable to process subset operation. This is a very common case, because I also get an error when I use "[ ]" to take a subset of the very large dataframe. So I use inputdatamat1 <- subset(inputdata,row.names(inputdata) %in% names(id1)[id1]==TRUE) to replace inputdatamat1 <- inputdata[id1, ], and found that DemixT works well. However, I still wonder if whether my code changed the workflow of DemixT, if so, could you pls send me some advice? I will be very grateful of it.

And I also got two questions as below: 1) I found that the raw read counts of some genes are 0, could you kindly pls give me some suggestion about how should I to process it? Cuz I found that the original code of DemixT_GSneed to log2-convert inputdata, which will report NAwhen raw read counts is 0. 2) I found that the shape of deconvoluted tumor expression is different from original tumor expression. The inputdata has 56602 rows(genes), but the output data only has 27531 rows. Did I do something wrong? If not, what the reason that lead to genes reduction and is this reduction affects the downstream statistical analysis?

> View(res.S2)
> dim(res.S2[["decovExprT"]])
[1] 27531    36
> dim(tumor_exprs)
[1] 56602    36

Look forward to your kindly suggestion.

Wenyi Jin

Jwenyi commented 2 years ago

Sorry for boring you @jiyunmaths but DeMixT report the same error again (pls see the ERROR REPORT below) when I used modified DeMixT to process other read counts data (PRAD from TCGA)

Error in if (sum(obj == 0) > 1) { : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In theta.valid + sqrt(abs(qchisq(0.95, 1) * C)) :
  longer object length is not a multiple of shorter object length
2: In theta.valid - sqrt(abs(qchisq(0.95, 1) * C)) :

I attached my code and inputdata, could you pls give me some advice for fixing it?

MY CODE

i = "TCGA-PRAD_RNASeq_Counts.RData"
 load(i)
 # extract normal and cancer samples' exprs matrix
  barcode <- colnames(expDataCounts)

  if(grepl("TARGET",i)){
    if(grepl("TARGET-ALL-P3FPKM",i)){
      tumor_sample_ids <- which(substr(barcode,25,26) %in% c('01',"02","03","04","05","06","07","08","09"))
      tumor_exprs <- expDataCounts[,tumor_sample_ids]
      normal_exprs <- expDataCounts[,-tumor_sample_ids]
    }else{
      tumor_sample_ids <- which(substr(barcode,18,19) %in% c('01',"02","03","04","05","06","07","08","09"))
      tumor_exprs <- expDataCounts[,tumor_sample_ids]
      normal_exprs <- expDataCounts[,-tumor_sample_ids]
    }
  }else{
    tumor_sample_ids <- which(substr(barcode,14,15) %in% c('01',"02","03","04","05","06","07","08","09"))
    tumor_exprs <- expDataCounts[,tumor_sample_ids]
    normal_exprs <- expDataCounts[,-tumor_sample_ids]
  }
# calculate purified tumor exprs from mixture RNA-exprs matrix of tumor samples
  data.Y <- as.matrix(tumor_exprs) + 1
  data.N1 <- as.matrix(normal_exprs) + 1
  data.Y <- SummarizedExperiment(list(Counts=as.matrix(data.Y)))
  data.N1 <- SummarizedExperiment(list(Counts=as.matrix(data.N1)))

  res.TCGA = DeMixT(data.Y = data.Y,
                       data.N1 = data.N1,
                       data.N2 = NULL,
                       niter = 10, nbin = 50, filter.sd = 0.6, #nspikein = 6,
                       if.filter = TRUE, ngene.Profile.selected = 1500,
                       mean.diff.in.CM = 0.25, ngene.selected.for.pi = 1500,
                       tol = 10^(-5)) 

MY INPUTDATA Due to files size limitation of github (less than 25 MB), kindly access the inputdata (TCGA-PRAD_RNASeq_Counts.RData, size of 43MB) from Google Drive by the following link: https://drive.google.com/file/d/1Nh05Z3-4mPZuc-q6jmdZzko4B0h8zitd/view?usp=sharing

All the best, Wenyi Jin

jiyunmaths commented 2 years ago

@Jwenyi Sorry for the inconvenience. Your data may need a preprocessing before feeding into DeMixT. I am composing a tutorial and will release it to this repository soon. Will let you know when it is available. Thanks.

zhaoliang0302 commented 2 years ago

Thanks @jiyunmaths I fed DeMixT with raw read counts data, but there's another ERROR raised. Error in inputdata[id1, ] : (subscript) logical subscript too long I checked original code of DeMixT_GS and found that this error might be ascribed to a code in Line 433 and 291 inputdatamat1 <- inputdata[id1, ] I thinks this code is to extract a subset of inputdata, in which genes expressions are less thanfilter.sd. So I guess that maybe the inputdata contains too many rows (over 52k), thus[] is unable to process subset operation. This is a very common case, because I also get an error when I use "[ ]" to take a subset of the very large dataframe. So I use inputdatamat1 <- subset(inputdata,row.names(inputdata) %in% names(id1)[id1]==TRUE) to replace inputdatamat1 <- inputdata[id1, ], and found that DemixT works well. However, I still wonder if whether my code changed the workflow of DemixT, if so, could you pls send me some advice? I will be very grateful of it.

And I also got two questions as below:

  1. I found that the raw read counts of some genes are 0, could you kindly pls give me some suggestion about how should I to process it? Cuz I found that the original code of DemixT_GSneed to log2-convert inputdata, which will report NAwhen raw read counts is 0.
  2. I found that the shape of deconvoluted tumor expression is different from original tumor expression. The inputdata has 56602 rows(genes), but the output data only has 27531 rows. Did I do something wrong? If not, what the reason that lead to genes reduction and is this reduction affects the downstream statistical analysis?
> View(res.S2)
> dim(res.S2[["decovExprT"]])
[1] 27531    36
> dim(tumor_exprs)
[1] 56602    36

Look forward to your kindly suggestion.

Wenyi Jin

The TCGA raw count data contains numerous genes, over 50000 genes in the expression matrix I guess. I pre-filtered out the extremely low expressed genes with counts less than 5 in half of the entire samples, and it works. Also, the DeMixT built-in function will log2 the count data. So I used count = count +1 before performing the algorithm. Hope it works for you.