wwylab / DeMixT

GNU General Public License v3.0
32 stars 14 forks source link

need little help on DemixT processing #25

Closed nan5895 closed 1 month ago

nan5895 commented 1 year ago

hello, Thank you for wonderful tool to deconvoluting tumor specific expression count!

i have 84 patient of match Normal and tumor RNA-seq raw count.

so, 84 normal and 84 tumor .. with 55882 of genes...

currently, i am following demixT tutorial from https://wwylab.github.io/DeMixT/tutorial.html

But i found out there are DeMIXT_GS and DeMIXT_DE. what are differences between tutorial method and DeMIXT_GS and DeMIXT_DE?

also, in the tutorial

# In practice, we set the maximum number of spike-in as min(n/3, 200), 
# where n is the number of samples. 
nspikesin_list = c(0, 50, 100, 150)

TCGA has a larger sample than I have. should I follow instructions above or is it okay to put nspikein = 84?

ngene.selected_list = c(500, 1000, 1500, 2500)

for ngene_selected_list part, since i do not know the optimal parameter combination, is it good to try ngene list above ? or can be larger than that ?

Lastly, when i try DeMIXT_GS or DeMIXT_DE function as example parameter

res.GS = DeMixT_GS(data.Y = test.data.2comp$data.Y,
                   data.N1 = test.data.2comp$data.N1,
                   niter = 10, nbin = 50, nspikein = 50,
                   if.filter = TRUE, ngene.Profile.selected = 150,
                   mean.diff.in.CM = 0.25, ngene.selected.for.pi = 150,
                   tol = 10^(-5))

it said threshold should be larger than provide... why is that happen?

jiyunmaths commented 1 year ago

Hi @nan5895, thanks for using DeMixT! For your first question: DeMixT_DE and DeMixT_GS are the two functions for estimating the proportion of each component from mixed samples (i.e., tumor RNAseq data), but based on different gene selection methods: the former selects genes for the estimation using differential expressions between tumor and reference normal, while the latter selects genes using a new and more effective approach that utilizes a profile likelihood method and was proven to be able to select most identifiable genes and achieve better model fitting quality. Generally, DeMixT_GS outperforms DeMixT_DE. Can I recommend you to check the supplementary information in our paper https://www.nature.com/articles/s41587-022-01342-x for more information about the benchmarking between DeMixT_DE and DeMixT_GS? Your second and third questions: I would like to recommend you to set a list of the number of spike-ins, like in the tutorial nspikesin_list = c(0, 50, 100, 150), and run DeMixT several times with each is combined with different nspikein and ngene.selected (e.g., ngene.selected_list = c(500, 1000, 1500, 2500)), then select the result using the correlation based method which is introduced in the tutorial. Your last question: for RNAseq data, ngene.Profile.selected and ngene.selected.for.pi should be set between 500-2500. I am not sure what confused you. If you can provide more information, I am happy to help. Thanks again.

nan5895 commented 1 year ago

Thank you for your answer !! i am know understood most of the concept i guess..

i have another question on pre-processing part.

We first select ~9000 genes before running DeMixT with the GS (Gene Selection) method so that our model-based gene selection maintains good statistical properties. DeMixT_preprocessing finds the a range of variance in genes from normal samples (cutoff_normal_range) and from tumor samples (cutoff_tumor_range) which results in roughly 9,000 genes. DeMixT_preprocessing outputs a list object preprocessed_data:

Preprocess filters out the gene with large deviation both in normal and mixed tumor sample So,

### describe remaining gene # based on set normal cutoff range and tumor cutoff range ###
num_gene_remaining_different_cutoffs <- subset_sd_gene_remaining(count.matrix, label, 
                                                                     cutoff_normal_range, 
                                                                     cutoff_tumor_range,
                                                                     cutoff_step)

### and list the cutoff range based on gene # - 9000 > 0 ###
num_gene_remaining_different_cutoffs_filter <- num_gene_remaining_different_cutoffs[which(num_gene_remaining_different_cutoffs$num.gene.remaining - 9000 > 0), ]

### order them into small to large  and  pick 1st row cutoff range of the subset ####
    num_gene_remaining_different_cutoffs_filter <- num_gene_remaining_different_cutoffs[which(num_gene_remaining_different_cutoffs$num.gene.remaining - 9000 > 0), ]
    num_gene_remaining_different_cutoffs_filter <- num_gene_remaining_different_cutoffs_filter[order(num_gene_remaining_different_cutoffs_filter$num.gene.remaining), ]
    sd_cutoff_normal <- c(num_gene_remaining_different_cutoffs_filter$normal.cutoff.low[1], num_gene_remaining_different_cutoffs_filter$normal.cutoff.high[1])
    sd_cutoff_tumor <- c(num_gene_remaining_different_cutoffs_filter$tumor.cutoff.low[1], num_gene_remaining_different_cutoffs_filter$tumor.cutoff.high[1])

### using picked subset as filtering sd  to create filtered matrix ###
count.matrix <- subset_sd(inputdata = count.matrix, label = label, cutoff_normal = sd_cutoff_normal, cutoff_tumor = sd_cutoff_tumor)

I was looking at pre-processing method. I found out I could have a loss of genes of interest... This is because a certain sample has a large gene expression while others don't ...

In our sample, we have samples that have genomic instability, which causes extrachromosomal elements that are unevenly segregated during mitosis.. so their copy number is very high. Its sample mRNA raw count is very high compared to other samples... but the majority of samples do not have high expression, so I think it filters out during pre-processing. I tried to manually pick different cutoffs from the output of subset_sd_gene_remaining, but it still filters out the gene of interest.

Do you think it is okay to manually keep the gene of interest during preprocessing by modifying the pre_process R script? or is there any possible way that you recommend?

jiyunmaths commented 1 year ago

Hi @nan5895. Yes, you can manually select the genes for deconvolution. In the function subset_sd from DeMixT_preprocessing.R you can change the cutoffs to see if the genes of your interest are included. You can also manually include the genes of interest by slicing your input expression data with only these genes and concatenate it with the output from subset_sd_gene_remaining. Let me know if you have any questions. Thanks.

nan5895 commented 1 year ago

After i manually include the genes of interest by slicing your input expression data with only these genes and concatenate it with the output from subset_sd_gene_remaining, I ran Demix with code below


data.Y = SummarizedExperiment(assays = list(counts = filtered_concatenated_data[, Tumor.id]))
data.N1 <- SummarizedExperiment(assays = list(counts = filtered_and_concatenated_data[, Normal.id]))

nspikesin_list = c(0, 50, 100, 150)

ngene.selected_list = c(500, 1000, 1500, 2500)

for(nspikesin in nspikesin_list){
  for(ngene.selected in ngene.selected_list){
    name = paste("Projects_res_nspikesin", nspikesin, "ngene.selected", 
                 ngene.selected,  sep = "_");
    name = paste(name, ".RData", sep = "");
    res = DeMixT(data.Y = data.Y,
                 data.N1 = data.N1,
                 ngene.selected.for.pi = ngene.selected,
                 ngene.Profile.selected = ngene.selected,
                 filter.sd = 1.8, # same upper bound of gene expression standard deviation 
                 # for normal reference. i.e., preprocessed_data$sd_cutoff_normal[2]
                 gene.selection.method = "GS",
                 nspikein = nspikesin)
    save(res, file = name)
  }
}

PiT_GS_PRAD <- c()
row_names <- c()

for(nspikesin in nspikesin_list){
  for(ngene.selected in ngene.selected_list){
    name_simplify <- paste(nspikesin, ngene.selected,  sep = "_")
    row_names <- c(row_names, name_simplify)

    name = paste("Projects_GS_res_nspikesin", nspikesin, 
                 "ngene.selected", ngene.selected,  sep = "_");
    name = paste(name, ".RData", sep = "")
    load(name)
    PiT_GS_PRAD <- cbind(PiT_GS_PRAD, res$pi[2, ])
  }
}
colnames(PiT_GS_PRAD) <- row_names

After DeMixT runs, it still filters out some of the genes that I am interested in ... i think those genes are filtered out during the below part.

 ## filter out genes with constant value across all samples
  if (is.null(data.N2)) {
    if (dim(inputdata)[1] == 1) {
      inputdata <- t(as.matrix(inputdata[apply(data.N1,
                                  1, function(x) length(unique(x)) > 1), ]))
    }
    else {
      inputdata <- inputdata[apply(data.N1, 1, function(x) 
        length(unique(x)) > 1), ]
    }
  }else{
    if (dim(inputdata)[1] == 1) {
      inputdata <- t(as.matrix(inputdata[apply(data.N1, 
      1, function(x) length(unique(x)) > 1) & apply(data.N2,
      1, function(x) length(unique(x)) > 1), ]))
    }
    else {
      inputdata <- inputdata[apply(data.N1, 1, function(x) 
        length(unique(x)) > 1) & apply(data.N2, 1, function(x) 
          length(unique(x)) > 1), ]
    }
  }

  filter2 <- function(inputdata1r, ngene.selected.for.pi, n = 1) {
    if ((ngene.selected.for.pi > 1) & (ngene.selected.for.pi%%1 == 
                                       0)) {
      id2 <- order(inputdata1r, decreasing = TRUE)
      id2 <- id2[seq(1, min(n * ngene.selected.for.pi, 
                            length(inputdata1r)))]
    }
    else if ((ngene.selected.for.pi < 1) & (ngene.selected.for.pi > 
                                            0)) {
      id2 <- (inputdata1r > quantile(inputdata1r, probs = 1 - 
                                       n * ngene.selected.for.pi))
    }
    else {
      stop("The argument ngene.selected.for.pi can only be 
           an integer or a percentage between 0 and 1")
    }
    if (sum(id2) < 20) 
      stop("there are too few genes for filtering stage 1.\n
           Please relax threshold for filtering ")
    return(inputdatamat1[id2, ])
  }

Do you think that I set filter.sd = 1.8 too low? That cut-off was from DeMixT_preprocessing.R results though The raw count of one of my genes of interest (that is actually loss during DemixT) is below.

 0001-N                            0002-N                            0003-N                            0004-N                            0005-N                            0006-N 
    1                                 0                                 2                                 0                                 0                                11 
0007-N                            0008-N                            0009-N                            0010-N                            0011-N                            0012-N 
    0                                 1                                 0                                61                                 6                                 0 
0013-N                            0014-N                            0015-N                            0016-N                            0017-N                            0018-N 
    0                                 0                                 0                                 6                                 0                                 1 
0019-N                            0020-N                            0021-N                            0022-N                            0023-N                            0024-N 
    9                                 3                                 4                                 2                                 1                                 0 
0025-N                            0026-N                            0027-N                            0028-N                            0029-N                            0030-N 
    24                                 0                                 1                                 1                                 4                                 2 
0031-N                            0032-N                            0033-N                            0034-N                            0035-N                            0036-N 
    1                                 3                                 6                                 0                                 5                                 0 
0037-N                            0038-N                            0039-N                            0040-N                            0041-N                            0042-N 
    4                                 1                                 2                                 0                                 3                                 0 
0043-N                            0044-N                            0045-N                            0046-N                            0047-N                            0048-N 
    3                                 2                                 2                                 3                                 0                                 3 
0049-N                            0050-N                            0051-N                            0052-N                            0053-N                            0054-N 
    1                                 1                               193                                 1                                 0                                 1 
0055-N                            0056-N                            0057-N                            0058-N                            0059-N                            0060-N 
    1                                24                                21                                23                                 2                                 5 
0061-N                            0062-N                            0063-N                            0064-N                            0065-N                            0066-N 
   11                                 0                                 2                                 1                                 2                                 4 
0067-N                            0068-N                            0069-N                            0070-N                            0071-N                            0072-N 
   20                                 1                                 6                                 6                                 3                               509 
0073-N                            0074-N                            0075-N                            0076-N                            0077-N                            0078-N 
    1                                 4                                 0                                 0                                 0                                 3 
0079-N                            0080-N                            0081-N                            0082-N                            0083-N                            0084-N 
    3                                11                                 2                                 2                                 0                                 4 
0001-T                            0002-T                            0003-T                            0004-T                            0005-T                            0006-T 
    1                                11                                 1                                19                                 7                                 7 
0007-T                            0008-T                            0009-T                            0010-T                            0011-T                            0012-T 
    9                                40                                50                                 0                                 3                                15 
0013-T                            0014-T                            0015-T                            0016-T                            0017-T                            0018-T 
    9                                 0                                 5                                18                               163                                 2 
0019-T                            0020-T                            0021-T                            0022-T                            0023-T                            0024-T 
    20                                 1                               286                                25                                34                                21 
0025-T                            0026-T                            0027-T                            0028-T                            0029-T                            0030-T 
    58                                 3                                 4                                 0                                 3                                99 
0031-T                            0032-T                            0033-T                            0034-T                            0035-T                            0036-T 
    1                                 4                                 7                                 1                                30                                 0 
0037-T                            0038-T                            0039-T                            0040-T                            0041-T                            0042-T 
    49                                3                                 0                                 5                                 4                                20 
0043-T                            0044-T                            0045-T                            0046-T                            0047-T                            0048-T 
    3                              3441                                11                                90                                10                               180 
0049-T                            0050-T                            0051-T                            0052-T                            0053-T                            0054-T 
    5                                 0                                 0                               110                                43                               106 
0055-T                            0056-T                            0057-T                            0058-T                            0059-T                            0060-T 
   16                                 4                                 9                                 0                                 7                                 0 
0061-T                            0062-T                            0063-T                            0064-T                            0065-T                            0066-T 
    0                                 5                                 8                                11                                 5                                 5 
0067-T                            0068-T                            0069-T                            0070-T                            0071-T                            0072-T 
    1                                18                                 9                               142                                 1                                13 
0073-T                            0074-T                            0075-T                            0076-T                            0077-T                            0078-T 
    9                                 6                                 1                                 0                                 1                                 1 
0079-T                            0080-T                            0081-T                            0082-T                            0083-T                            0084-T 
    97                               161                                 4                                 3                               270                                 0  
jiyunmaths commented 1 year ago

@nan5895 the parameter filter.sd in DeMixT_GS function is similar to subset_sd_gene_remaining. You may set filter.sd to a very high number, for example filter.sd=10, you should be able to keep all the genes selected in the preprocessing step.