zhangyuqing / ComBat-seq

Batch effect adjustment based on negative binomial regression for RNA sequencing count data
163 stars 39 forks source link

How to adjust by multiple variables using ComBat-Seq? #33

Open evaesquinas opened 2 years ago

evaesquinas commented 2 years ago

Dear @zhangyuqing ,

I am trying to adjust my RNA data using ComBat-Seq since I realised that there are 3 batches that I need to adjust for:

I have 960 samples and around 62000 genes.

In my biological matrix, I have: Age, Sex, Group (cases,controls..) and WBC counts.

biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + LYMPH + MONO + NEUT, data=phenotype)

In the tutorial of Combat-Seq appears how to adjust by 1 variable but it doesn't tell you how to adjust by more than 1.

I have seen a lot of posts using combat that the only way is to adjust by 1 variable and then, with those results, adjust again by the 2nd variable and so on.

That would be:

Adjust by library prep.

raw.cts_adjustedLibPrep <- ComBat_seq(counts = raw_cts_matrix, batch=batch_libraryprep, group=NULL, covar_mod = biol_mat)

Adjust by library prep + type of tube.

raw.cts_adjusted_LibPrep_TypeTube <- ComBat_seq(counts = raw.cts_adjustedLibPrep, batch=batch_type_tube, group=NULL, covar_mod = biol_mat)

Adjust by library prep + type of tube + place

 raw.cts_adjusted_LibPrep_TypeTube_Place <- ComBat_seq(counts = raw.cts_adjusted_LibPrep_TypeTube, batch=batch_place, group=NULL, covar_mod = biol_mat)

For the first adjustment (library prep) it takes around 15min, but for the second... it has been running for more than 2 days.. I stopped it and launch it again, changing the order of the adjustment but it is taking too much time and it seems that something is wrong...

Here I attach a similar example of my data:

Covariates (variables to adjust + biological information)

set.seed(11344)
covariates_df <- data.frame(
  Sample = paste0("A", seq(1,960)),
  Age = sample(seq(18, 70), size = 960, replace = TRUE),
  Group = sample(seq(1, 3), size = 960, replace = TRUE),
  Sex = sample(seq(1, 2), size = 960, replace = TRUE),
  WBC = sample(seq(0.5, 35, by=0.3), size = 960, replace = TRUE),
  Place = sample(seq(1, 2), size = 960, replace = TRUE),
  Tubes = sample(seq(1, 2), size = 960, replace = TRUE),
  LibPrep = sample(seq(1, 16), size = 960, replace = TRUE)
)
rownames(covariates_df) <- covariates_df$Sample

categorical_variables <- c("Sample", "Group", "Sex", "Place", "Tubes", "LibPrep")
covariates_df[,categorical_variables] <- lapply(covariates_df[,categorical_variables],as.factor)

>str(covariates_df)

'data.frame':   960 obs. of  8 variables:
 $ Sample : Factor w/ 960 levels "A1","A10","A100",..: 1 112 223 334 445 556 667 778 889 2 ...
 $ Age    : int  40 66 37 40 25 42 46 51 45 61 ...
 $ Group  : Factor w/ 3 levels "1","2","3": 2 2 2 3 2 2 3 3 3 3 ...
 $ Sex    : Factor w/ 2 levels "1","2": 1 2 2 1 1 2 1 1 2 1 ...
 $ WBC    : num  12.8 29.9 25.7 0.8 30.8 5.9 30.5 19.1 2.9 12.2 ...
 $ Place  : Factor w/ 2 levels "1","2": 1 1 2 1 2 1 2 2 1 2 ...
 $ Tubes  : Factor w/ 2 levels "1","2": 1 1 2 1 2 2 2 2 1 2 ...
 $ LibPrep: Factor w/ 16 levels "1","2","3","4",..: 14 1 13 4 10 5 1 8 7 5 ...

The 3 variables that I want to adjust for will be: Place, Tubes and LibPrep. On the other hand, the biological information that I want to preserve would be Age, Group, Sex and WBC.

RNA raw counts

exp_df <- data.frame(replicate(960,sample(0:5216979,66000,rep=TRUE)))
colnames(exp_df) <- covariates_df$Sample
rownames(exp_df) <- paste0("Gene", rownames(exp_df))
exp_df_mat <- as.matrix(exp_df)

> str(exp_df)
'data.frame':   66000 obs. of  960 variables:
 $ A1  : int  3687756 4638393 4315502 316404 2209492 831342 4261323 1283200 1522808 1088673 ...
 $ A2  : int  4645779 3495763 4782397 2869628 2402288 1012125 1267979 1361720 1919367 4859438 ...
 $ A3  : int  2883976 4770076 228011 915940 19038 4650368 112632 1316246 1926498 484384 ...
 $ A4  : int  3496840 3676764 2763012 2723528 944224 3809955 5054054 1122139 116722 4090191 ...
 $ A5  : int  3140122 650422 2075888 2987814 1656462 2863317 155120 1086391 3163073 625809 ...
 $ A6  : int  4084796 1932277 3491059 4898410 4183070 4470479 4193882 928271 3282841 4418068 ...
 $ A7  : int  765797 3153554 5075853 4775395 3582194 4274642 1530455 1433179 4757168 2209519 ...
 $ A8  : int  1652346 3505656 111027 3170748 5087383 912180 2693545 1907366 3135340 548296 ...
 $ A9  : int  441053 5132021 1857530 2413007 1218852 3614836 4388581 106105 3270886 3840996 ...
 $ A10 : int  995597 2076013 1576689 1857073 1435772 3788330 3983860 816969 733090 1357226 ...
 $ A11 : int  4944929 5182067 4415296 3224862 145068 3533346 4174175 2406340 4572529 4820674 ...
 $ A12 : int  2731408 1896439 5063233 4621405 2686720 1507796 4764591 887296 2257893 384470 ...

The biological matrix would be: biol_mat = model.matrix(~Age + as.factor(Sex) + as.factor(Group) + WBC, data=covariates_df)

head(biol_mat)

 (Intercept) Age as.factor(Sex)2 as.factor(Group)2 as.factor(Group)3  WBC
A1           1  40               0                 1                 0 12.8
A2           1  66               1                 1                 0 29.9
A3           1  37               1                 1                 0 25.7
A4           1  40               0                 0                 1  0.8
A5           1  25               0                 1                 0 30.8
A6           1  42               1                 1                 0  5.9

Could you kindly help me, please? I have tried all the possibilities and I do not really know if it is something wrong my code or it is because combat_seq cannot handle too much amount of data.

Thanks very much in advance Regards

punyung commented 1 year ago

@evaesquinas Hi, I've tried your example data and it took me roughly 13 hours to finish the adjustment of 3 batches. I hope this can help you.

maxnest commented 1 year ago

@evaesquinas hi, did you manage to figure out what is the best way to deal with the adjustment of multiple batches? Have you tried any other programs for this task? Faced the same problem on my data and still can't continue the analysis as the adjustment takes too long and seems to be endless. I would be grateful for any help and advice

evaesquinas commented 1 year ago

@maxnest Hi! No, I did not sorry. I ended adjusting only by 1 batch. And I did not find any other way to solve the problem. Sorry.