zhangyuqing / ComBat-seq

Batch effect adjustment based on negative binomial regression for RNA sequencing count data
154 stars · 39 forks

ComBat-Seq correction creating very large count values--cannot feed into DESeq2 #20

Open cdharshika opened 3 years ago

cdharshika commented 3 years ago

Hey,

Thanks for making this software; it seems really helpful, and I'd love to use it if I can get this to work. The matrix generated by ComBat-Seq cannot be read into DESeq2. My current guess is that the batch correction is creating very large values in the matrix: looking at the highest entries, some very low raw counts (<10) are being corrected to very large numbers (on the order of 1e+17).

When DESeq2 tries to convert these to integers it fails, because the values are too large for it to handle. The integer overflow itself isn't really my concern, though: I can't imagine a correction should ever change a count value that drastically, so I may be doing something incorrectly in my use of ComBat-Seq.

A little information on the project: these are technical replicates where the same library was sequenced twice, with the second batch at a much higher read depth. Each batch was analyzed separately, and counts generated by HTSeq-count were used for batch correction. I used the covar_mod argument to account for my two treatment variables.
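For reference, a minimal sketch of the setup described above, on simulated counts (all object names, dimensions, and the two 0/1 treatment vectors are placeholders for my real data):

```r
library(sva)

set.seed(1)
# Simulated HTSeq-style count matrix: 200 genes x 8 samples (placeholder for real data)
count_matrix <- matrix(rnbinom(200 * 8, mu = 20, size = 1), nrow = 200)

# Same libraries sequenced twice: batch 2 is the deeper re-sequencing run
batch <- rep(c(1, 2), each = 4)

# Two treatment variables passed together via covar_mod
treatment1 <- rep(c(0, 1), 4)
treatment2 <- rep(c(0, 0, 1, 1), 2)
covar_mod <- cbind(treatment1, treatment2)

adjusted <- ComBat_seq(count_matrix, batch = batch, group = NULL,
                       covar_mod = covar_mod)
```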

Please let me know what you think or what other information would be helpful to provide here.

Thanks in advance--I really appreciate your help.

Christine

joshua-theisen commented 3 years ago

I have a similar issue. For one gene, ComBat-seq adjusts raw counts from ~15 to ~3e+13, but not for all of the samples. Below is R code showing the raw counts and adjusted counts for this gene across all samples:

> counts %>% 
+     dplyr::filter(feature == "Sgcz")
  feature IM.MR01_S1_L001 IM.MR02_S2_L001 IM.MR03_S3_L001 IM.MR04_S4_L001 IM.MR05_S5_L001 IM.MR06_S6_L001 IM.MR07_S7_L001
1    Sgcz              13               7              15               0               0               0               0
  IM.MR08_S8_L001 IM.MR09_S9_L001 IM.MR10_S10_L001 IM.MR11_S11_L001 IM.MR12_S12_L001 IM.MR13_S13_L001 IM.MR14_S14_L001 IM.MR15_S15_L001
1               0               0                0                0                0                0                0                0
  IM.MR16_S16_L001 IM.MR17_S17_L001 IM.MR18_S18_L001 IM.MR19_S19_L001 IM.MR20_S20_L001 IM.MR21_S21_L001 IM.MR22_S22_L001
1                0                0                0                0                0                0                0
  IM.MR23_S23_L001 IM.MR24_S24_L001 IM.MR25_S25_L001 IM.MR26_S26_L001 IM.MR27_S27_L001 IM.MR28_S28_L001 IM.MR29_S29_L001
1                0                0                0                0                0                0                0
  IM.MR30_S30_L001 IM.MR31_S31_L001 IM.MR32_S32_L001 IM.MR33_S33_L001 IM.MR34_S34_L001 IM.MR35_S35_L001 IM.MR36_S36_L001 IM.AR.30
1                0                0                0                0                0                0                0        0
  IM.AR.31 IM.AR.32 IM.AR.33 IM.AR.34 IM.AR.35 IM.AR10_S1_L005 IM.AR11_S2_L005 IM.AR12_S3_L005 IM.AR13_S4_L005 IM.AR14_S5_L005
1        0        1        0        0        0               0               0               1               0               1
  IM.AR15_S6_L005 IM.AR16_S7_L005 IM.AR17_S8_L005 IM.AR.36 IM.AR.37 IM.AR.38 IM.AR.39 IM.AR.40 IM.AR.41 IM.AR51_S1_L006 IM.AR52_S2_L006
1               1               2               0        0        0        0        1        4        0               3               1
  IM.AR53_S3_L006 IM.AR54_S4_L006 IM.AR55_S5_L006 IM.AR56_S6_L006 IM.AR57_S7_L006 IM.AR58_S8_L006
1               0               0              10              16              16              18
> adjusted_counts %>% 
+     dplyr::filter(feature == "Sgcz")
      feature IM.MR01_S1_L001 IM.MR02_S2_L001 IM.MR03_S3_L001 IM.MR04_S4_L001 IM.MR05_S5_L001 IM.MR06_S6_L001 IM.MR07_S7_L001
18818    Sgcz    3.682673e+13    2.440158e+13     3.70514e+13               0               0               0               0
      IM.MR08_S8_L001 IM.MR09_S9_L001 IM.MR10_S10_L001 IM.MR11_S11_L001 IM.MR12_S12_L001 IM.MR13_S13_L001 IM.MR14_S14_L001
18818               0               0                0                0                0                0                0
      IM.MR15_S15_L001 IM.MR16_S16_L001 IM.MR17_S17_L001 IM.MR18_S18_L001 IM.MR19_S19_L001 IM.MR20_S20_L001 IM.MR21_S21_L001
18818                0                0                0                0                0                0                0
      IM.MR22_S22_L001 IM.MR23_S23_L001 IM.MR24_S24_L001 IM.MR25_S25_L001 IM.MR26_S26_L001 IM.MR27_S27_L001 IM.MR28_S28_L001
18818                0                0                0                0                0                0                0
      IM.MR29_S29_L001 IM.MR30_S30_L001 IM.MR31_S31_L001 IM.MR32_S32_L001 IM.MR33_S33_L001 IM.MR34_S34_L001 IM.MR35_S35_L001
18818                0                0                0                0                0                0                0
      IM.MR36_S36_L001 IM.AR.30 IM.AR.31 IM.AR.32 IM.AR.33 IM.AR.34 IM.AR.35 IM.AR10_S1_L005 IM.AR11_S2_L005 IM.AR12_S3_L005
18818                0        0        0        1        0        0        0               0               0               1
      IM.AR13_S4_L005 IM.AR14_S5_L005 IM.AR15_S6_L005 IM.AR16_S7_L005 IM.AR17_S8_L005 IM.AR.36 IM.AR.37 IM.AR.38 IM.AR.39 IM.AR.40
18818               0               1               1               1               0        0        0        0        1        1
      IM.AR.41 IM.AR51_S1_L006 IM.AR52_S2_L006 IM.AR53_S3_L006 IM.AR54_S4_L006 IM.AR55_S5_L006 IM.AR56_S6_L006 IM.AR57_S7_L006
18818        0               1               1               0               0               1               1               1
      IM.AR58_S8_L006
18818               1

Any advice on how to avoid this would be welcome. Thanks! Josh

J-Lye commented 3 years ago

I'm also experiencing this issue.

This is the first small section of compiled raw counts from STAR (my actual data is 45 samples by 55k genes):

Gene_ID SRB000C1 SRB000C2 SRB0001 SRB0002 SRB0003
ENSG00000242268 0 0 0 0 0 0
ENSG00000270112 0 0 6 5 1 2
ENSG00000280143 0 2 2 3 0 7


> pid_snd_data <- read.table("C:/Users/my_data_file.txt", dec = ".", row.names = 1, header=TRUE, sep ="\t")
> count_matrix <-as.matrix(pid_snd_data)
> batch <- c(1,1,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,4,3,3,3,3,3,4,4,5,5,5,5,5,5,5,5,5) 
> Disease <-c(0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
> Sex <- c(0,1,0,1,1,0,0,0,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,1)
> covar_mat <- cbind(Disease, Sex)
> adjusted_counts <- ComBat_seq(count_matrix, batch=batch, group=NULL, covar_mod=covar_mat)
Found 5 batches
Using null model in ComBat-seq.
Adjusting for 2 covariate(s) or covariate level(s)
Estimating dispersions
Fitting the GLM model
Shrinkage off - using GLM estimates for parameters
Adjusting the data
> write.table(adjusted_counts, file='ComBat_seq_adjusted_counts_pid_snd', quote=FALSE, sep='\t', row.names=TRUE)

This is the same section after applying ComBat-seq:

SRB000C1 SRB000C2 SRB0001 SRB0002
ENSG00000242268 0 0 0 0
ENSG00000270112 0 0 5 4
ENSG00000280143 0 2995420 119 208

Also, I have noticed that the program seems to have a row/column alignment issue. Whenever I use it, my output is always missing a column name in row one. Even when I delete all row names and column names and just load the numerical data, the output loses its final row name and the first cell (A1) vanishes; see syntax and output below.

library(sva)
pid_snd_data <-read.table("C:/Users/my_data.txt", dec = ".",  sep ="\t")
count_matrix <-as.matrix(pid_snd_data)

batch <- c(1,1,2,2,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,4,3,3,3,3,3,4,4,5,5,5,5,5,5,5,5,5) 
Disease <-c(0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
##Sex <- c(F,M,F,M,M,F,F,F,M,M,M,F,M,M,M,M,M,M,F,M,M,M,M,M,F,F,F,F,F,F,F,F,F,F,F,M,F,F,F,F,M,F,F,M,M)
Sex <- c(0,1,0,1,1,0,0,0,1,1,1,0,1,1,1,1,1,1,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,1)
covar_mat <- cbind(Disease, Sex)
adjusted_counts <- ComBat_seq(count_matrix, batch=batch, group=NULL, covar_mod=covar_mat)
write.table(adjusted_counts, file='ComBat_seq_adjusted_counts_pid_snd_norownames_v6.txt', quote=FALSE, sep='\t')

V1 | V2 | V3 | V4 | ... | V43 | V44 | V45
1 | 0 | 0 | 0 | ... | 0 | 4 | 0 | 2
2 | 0 | 0 | 5 | ... | 0 | 35547 | 0 | 1
3 | 0 | 2995420 | 119 | ... | 0 | 1 | 1 | 1

Note that there are 3 header entries at the end (V43, V44, V45) but 4 values underneath; this is because the row names all shift to the left. I'm unsure whether the two issues are related, but if others could check their results for the same bug it might help untangle this.
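The missing first column name looks like standard base-R write.table behaviour rather than a ComBat-seq bug: when row names are written, the header row has one fewer field than the data rows, so everything appears shifted left on re-import. Passing col.names = NA writes a blank corner cell so header and data align. A minimal sketch on a toy matrix:

```r
# Toy matrix with row and column names
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("gene1", "gene2"), c("s1", "s2")))

f <- tempfile(fileext = ".txt")

# Default: header row has one fewer field than the data rows (the shift)
write.table(m, f, quote = FALSE, sep = "\t")

# col.names = NA adds a blank corner cell, so header and data line up
write.table(m, f, quote = FALSE, sep = "\t", col.names = NA)

# Re-import with row.names = 1 restores the original matrix
m2 <- as.matrix(read.table(f, sep = "\t", header = TRUE, row.names = 1))
```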

J-Lye commented 3 years ago

So I've looked into this further. It seems to give reliable numbers if you only correct for batch, but adding in covariates is what introduces the inflated values. So you can use it to correct for the batch effect alone reliably.

Regarding covariates: I'm going to pick this apart, the investigation continues...

joshua-theisen commented 3 years ago

@J-Lye , I am also able to avoid the very large counts issue by using group = group_vector instead of covar_mod = covar_matrix. However, I need to account for both time and treatment in my data.
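In case it helps, one way to encode both variables in a single group vector is to collapse them with interaction(). This is a sketch with hypothetical time/treatment factors; whether collapsing them into one factor is appropriate for a given design is a separate statistical question:

```r
# Hypothetical sample attributes; names and levels are placeholders
time      <- factor(c("d0", "d0", "d7", "d7", "d0", "d0", "d7", "d7"))
treatment <- factor(c("ctl", "trt", "ctl", "trt", "ctl", "trt", "ctl", "trt"))

# Collapse the two factors into one grouping factor, e.g. "d0.ctl"
group_vector <- interaction(time, treatment, drop = TRUE)

# Then: adjusted <- ComBat_seq(counts, batch = batch, group = group_vector)
```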

Regarding the missing column, have you checked that pid_snd_data and count_matrix have the correct number of columns? read.table sometimes causes these issues.

benostendorf commented 2 years ago

In my case unreasonably large numbers got introduced for genes that were expressed in few samples only. Removing these genes fixed the issue.

As an example, to keep only the genes with zero counts in fewer than a third of the samples you can do counts_filt <- counts[apply(counts, 1, function(x) sum(x == 0)) < ncol(counts) / 3, ]
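A self-contained sketch of that filter on a toy matrix (dimensions and the one-third threshold are arbitrary):

```r
set.seed(42)
# Toy count matrix: 5 genes x 9 samples
counts <- matrix(rpois(45, lambda = 5), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("s", 1:9)))
counts[1, ] <- 0        # gene1: zero in every sample
counts[2, 2:9] <- 0     # gene2: expressed in at most one sample

# Keep genes whose number of zero counts is below a third of the samples
counts_filt <- counts[apply(counts, 1, function(x) sum(x == 0)) < ncol(counts) / 3, ]
```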

hoonghim commented 2 years ago

In my case unreasonably large numbers got introduced for genes that were expressed in few samples only. Removing these genes fixed the issue.

As an example, to keep only the genes with zero counts in fewer than a third of the samples you can do counts_filt <- counts[apply(counts, 1, function(x) sum(x == 0)) < ncol(counts) / 3, ]

Dear Benjamin Ostendorf,

Thank you for your great suggestion.

Could I ask one more question?

My question is: is it okay to run ComBat-seq after filtering out all genes that are expressed in fewer than a certain proportion of samples?

I am analyzing bulk RNA-seq dataset of pediatric tumor samples.

I filtered genes based on the following criteria.

  1. non-expressing genes in all samples

  2. Genes without at least 1 TPM in 20% of the samples (I referred to this paper: Dreyer, Stephan B., et al. "Genomic and molecular analyses identify molecular subtypes of pancreatic cancer recurrence." Gastroenterology 162.1 (2022): 320-324.)

I found the author's comment about filtering low expressed genes in here: https://github.com/zhangyuqing/ComBat-seq/issues/3

"Filtering low expressed genes is recommended before using ComBat-Seq, although the latest version should identify genes with only 0s in any batch, and keep them unchanged."

I am confused about the distinction between a low-expressed gene and a gene that is expressed in fewer than a certain proportion of samples (but expressed very highly in one or a few samples).

Any comments would be of great help.

Thank you in advance.

Sincerely,

Seunghoon

khughitt commented 2 years ago

Can confirm the issue when using covar_mod to specify covariates:

> max(raw_counts)
[1] 1861813

> max(adj_counts)
[1] 1.437247e+16

sva: v3.42.0

Ilarius commented 1 year ago

Has anyone figured this out? I have the same problem even when just using group and not covar_mod. I am getting values as large as 898183362094!

Aravind-sunda commented 3 months ago

Are there any workarounds that can be done for the genes exhibiting a large number of counts before feeding the batch-corrected matrix to DESeq2? Has this issue been solved in any other way?
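One pragmatic workaround before handing the matrix to DESeq2 (not from the ComBat-seq authors; treat it as a stopgap, and the 100x threshold is an arbitrary choice) is to flag genes whose adjusted counts exploded relative to their raw maximum and fall back to the raw counts for those genes:

```r
# Toy stand-ins for the real raw/adjusted matrices; same dimensions assumed
raw_counts <- matrix(c(13, 7, 15, 0,
                       100, 120, 90, 110), nrow = 2, byrow = TRUE,
                     dimnames = list(c("geneA", "geneB"), paste0("s", 1:4)))
adjusted_counts <- raw_counts
adjusted_counts["geneA", 1] <- 3.7e13   # simulate an exploded adjustment

# Flag genes whose adjusted maximum hugely exceeds the raw maximum
exploded <- apply(adjusted_counts, 1, max) > 100 * apply(raw_counts, 1, max)

# Revert the flagged genes to their raw (unadjusted) counts
adjusted_counts[exploded, ] <- raw_counts[exploded, ]
```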

f12345zxcvbnm2 commented 3 months ago

Are there any workarounds that can be done for the genes exhibiting a large number of counts before feeding the batch-corrected matrix to DESeq2? Has this issue been solved in any other way?

The same thing happened to me. In the end, how should this be dealt with?