zhangyuqing / ComBat-seq

Batch effect adjustment based on negative binomial regression for RNA sequencing count data

ComBat (negative value) and ComBat_seq (lib-size) #9

Closed (bio-zs closed 3 years ago)

bio-zs commented 4 years ago

Hello, I have some questions about ComBat and ComBat_seq. My analysis is based on CPM, so I use ComBat to adjust the logCPM data and then convert the result back to CPM. But this produces a lot of negative numbers, which I can't use for subsequent analysis. What should I do with these negative numbers? I've also tried using ComBat_seq to adjust the count data, but then the library sizes of the samples vary greatly. Some library sizes become really big, and the CPM calculated from them will be inaccurate. How can I get accurate CPM values? Thank you.

zhangyuqing commented 3 years ago

Hi @bio-zs, when converting ComBat-adjusted data back to CPM, are you using exp(adjusted data)? I'm confused about why that would lead to negative numbers.

bio-zs commented 3 years ago

@zhangyuqing I use log(CPM+1) as the input to ComBat and exp(adjusted data)-1 to convert the result back to CPM. For my data it is necessary to subtract 1 after taking the exponent. I found online that many people have the same problem, even when they don't do a log conversion, and they solved it by removing the negative numbers or setting them to 0. I guess the ComBat function treats the data purely mathematically and doesn't take into account that negative numbers are meaningless in some cases. For now, all I can do is remove these negative numbers. Is this operation correct, or is there a better way to handle it? Thank you.

zhangyuqing commented 3 years ago

@bio-zs Well, I suppose taking the log and transforming back with exp is also dealing with the data mathematically. The reason for log-transforming the data is often to make it closer to normally distributed. ComBat was designed for normally distributed data, in which case it makes sense to have negative values.
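
To illustrate where the negative CPMs can come from, here is a minimal sketch in plain Python (not the R code of ComBat itself). The adjusted values are made up for illustration; the only assumption is that a ComBat-style adjustment on the log(CPM+1) scale can push some values below 0, which is legitimate for normally distributed data.

```python
import math

def back_transform_minus_one(adjusted_log_value):
    """Invert log(CPM + 1) via exp(x) - 1.

    The result is negative whenever the adjusted value x is below 0.
    """
    return math.exp(adjusted_log_value) - 1.0

# Hypothetical adjusted values on the log(CPM + 1) scale (not real ComBat output).
adjusted = [2.0, 0.0, -0.3]

back = [back_transform_minus_one(x) for x in adjusted]

for x, cpm in zip(adjusted, back):
    print(f"adjusted log value {x:+.1f} -> back-transformed CPM {cpm:.4f}")
```

Any adjusted value below 0 on the log(CPM+1) scale ends up as a negative "CPM" after subtracting 1, even though nothing went wrong numerically.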

For your situation, in logCPM you added 1 to each value (to address CPM=0, I think?). ComBat will not change that 1 in the same way for all numbers, so subtracting 1 from every value is not necessarily the best choice. One solution I can think of is to calculate log(CPM + a very small number) to address the zero CPMs, then use exp(adjusted data) directly without subtracting 1.
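
A rough sketch of that suggestion, again in plain Python rather than R. The offset `EPS = 1e-3`, the helper names `to_log` and `from_log`, and the shift of `-1.5` standing in for a batch adjustment are all hypothetical choices for illustration, not values or functions from ComBat.

```python
import math

EPS = 1e-3  # hypothetical "very small number"; choose it to suit your data

def to_log(cpm, eps=EPS):
    """Log-transform a CPM value, using a small offset so CPM = 0 is defined."""
    return math.log(cpm + eps)

def from_log(adjusted_log_value):
    """Back-transform with exp only, without subtracting anything."""
    return math.exp(adjusted_log_value)

cpm_values = [0.0, 0.5, 10.0]
log_values = [to_log(c) for c in cpm_values]

# Stand-in for a batch adjustment on the log scale (hypothetical shift).
adjusted = [v - 1.5 for v in log_values]

back = [from_log(v) for v in adjusted]
print(back)
```

Because the back-transform is exp alone, the result stays positive for these inputs, including the sample that started at CPM = 0. Note that the back-transformed values still carry the small offset, so an unadjusted value returns as CPM + EPS rather than exactly CPM.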

As for whether that operation is correct or a good way to handle the data: I know it is common practice to log-transform the data, but whenever you transform the data, you already lose some information from the original data. I think it is always best to keep the number of transformations to a minimum and not to transform at all when possible. But if you have to transform, and the transformation does not hurt your downstream analysis, then it's fine, in my opinion.

bio-zs commented 3 years ago

@zhangyuqing Thanks for your reply. My supervisor told me that taking the log and transforming back with exp is more convincing than adjusting CPM directly. Just as you say, ComBat was designed for normally distributed data. I have tried several methods and identified the most suitable one. Maybe the need for CPM in my study is really particular, and the ComBat function solves at least some of my problems.