wdl2459 / ConQuR

Batch effects removal for microbiome data via conditional quantile regression
GNU General Public License v3.0

Possible inflated covariate results? #17

Open jkcopela opened 1 year ago

jkcopela commented 1 year ago

Hello,

I have some general questions, so I apologize if this page is meant more for specific technical issues.

I have run ConQuR and everything worked well and the batch effect was moderated. I selected the method with the most improvement, as suggested in your vignette.

However, when I use the ConQuR-corrected data, I detect a lot of possible false positives during differential abundance analysis (taxa identified using multiple verified methods such as MaAsLin2 and ALDEx2). The data also become extremely segregated (Bray-Curtis distances) by the covariates specified, much more so than the uncorrected data. It appears as though the difference due to the covariates wasn't just maintained, but was substantially increased.

When I look at the top hits from these differential abundance results, I can see in the corrected table that the counts do segregate with the covariate of interest, understandably. But when I go back and look at my uncorrected data set, I find that a lot of these hits come from only 2 or 3 samples, with relatively low read counts per sample. What ConQuR has then done is inflate the zeros from other samples within the covariate group (as it should be doing, as I understand it), but it seems that it may be going too far and creating false positives?
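One way to quantify how much support a hit has in the uncorrected table is to count, per covariate group, how many samples carry a nonzero count and how many reads they contribute. The following is a minimal sketch in plain Python, not part of ConQuR; the counts, the group labels, and the helper name `nonzero_support` are all hypothetical.

```python
from collections import defaultdict

def nonzero_support(counts, groups):
    """Return {group: (n_nonzero, n_samples, total_reads)} for one taxon."""
    if len(counts) != len(groups):
        raise ValueError("counts and groups must have the same length")
    summary = defaultdict(lambda: [0, 0, 0])
    for count, group in zip(counts, groups):
        entry = summary[group]
        entry[0] += 1 if count > 0 else 0  # samples with a nonzero count
        entry[1] += 1                      # all samples in the group
        entry[2] += count                  # reads contributed by the group
    return {g: tuple(v) for g, v in summary.items()}

# Hypothetical uncorrected counts for one taxon across eight samples.
taxon_counts = [0, 0, 3, 0, 0, 5, 0, 0]
covariate = ["case", "case", "case", "case",
             "control", "control", "control", "control"]

support = nonzero_support(taxon_counts, covariate)
for group, (n_nonzero, n_samples, total) in sorted(support.items()):
    print(f"{group}: {n_nonzero}/{n_samples} samples nonzero, {total} reads")
```

A taxon whose signal rests on one or two nonzero samples per group, as in this made-up example, is a candidate for closer inspection before trusting a post-correction hit.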

I understand that I can be more stringent in filtering the table by abundance and prevalence, but I was wondering if there is a way to do this within ConQuR. Am I not understanding the options and settings properly? I am running this on data that have been pretty aggressively filtered for contaminants. Should I instead run it on raw, unfiltered data, so that the overall distribution of the taxa is not affected by filtering? The dataset itself is very sparse, so perhaps that is also the issue?
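For reference, the kind of abundance and prevalence filtering mentioned above can be sketched as follows. This is a plain-Python illustration applied to the table outside of ConQuR; the function name `filter_taxa`, the taxon names, and both thresholds are assumptions chosen for the example, not recommended values.

```python
def filter_taxa(table, min_prevalence=0.1, min_total_reads=10):
    """Keep taxa seen in enough samples and with enough reads overall.

    table maps taxon name -> list of per-sample counts (same sample order).
    """
    kept = {}
    for taxon, counts in table.items():
        n_samples = len(counts)
        if n_samples == 0:
            continue
        # Prevalence: fraction of samples in which the taxon is observed.
        prevalence = sum(1 for c in counts if c > 0) / n_samples
        if prevalence >= min_prevalence and sum(counts) >= min_total_reads:
            kept[taxon] = counts
    return kept

# Hypothetical counts for three taxa across six samples.
counts_table = {
    "taxon_A": [12, 0, 30, 8, 0, 21],  # common and abundant
    "taxon_B": [0, 0, 2, 0, 0, 0],     # present in a single sample
    "taxon_C": [0, 1, 0, 1, 0, 1],     # prevalent but very low counts
}

filtered = filter_taxa(counts_table, min_prevalence=0.3, min_total_reads=10)
print(sorted(filtered))  # ['taxon_A']
```

Here `taxon_B` fails the prevalence threshold and `taxon_C` fails the total-read threshold, so only `taxon_A` would be passed on to batch correction.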

Thanks, Julia

wdl2459 commented 1 year ago

Thanks for your interest in our tool!

ConQuR aims to reduce batch effects while maintaining or even amplifying the biological effect of interest. Therefore, some signals that are false negatives in the uncorrected data can be detected after correction, while a moderately inflated false positive rate is common. However, false negatives and false positives can only be called in simulated data, where we know the truth. In real data, it is hard to draw conclusions when a non-significant signal becomes significant, or a significant signal becomes non-significant, after batch correction.

Regarding the moderately inflated false positive rate: like many other batch correction tools, ConQuR uses the metadata twice, in both the correction and the subsequent analyses, which theoretically leads to over-optimism in association analysis. However, in practice, this bias is modest relative to the batch effects, and the inclusion of metadata is often helpful for estimating conditional distributions when the taxon is uncommon or imbalanced among batches.

Also, before using ConQuR, please check (1) whether the batch completely confounds the key variable, and (2) whether there are many small batches (limited information for estimation) or small library sizes, i.e., few sequences per sample (poor data quality). If these problems exist, ConQuR works poorly or may not work at all. I guess "very sparse" indicates poor data quality, e.g., small library sizes? If so, unfortunately the current version of ConQuR cannot work well on such data.
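The two checks above can be run on the sample metadata before correction. The sketch below is plain Python and independent of ConQuR; the helper names, the sample data, and the thresholds (`min_size=3`, `min_reads=1000`) are hypothetical, and "completely confounded" is taken here to mean that every batch contains only one level of the key variable.

```python
from collections import Counter

def is_completely_confounded(batches, variable):
    """True if every batch contains only one level of the key variable."""
    levels_per_batch = {}
    for b, v in zip(batches, variable):
        levels_per_batch.setdefault(b, set()).add(v)
    return all(len(levels) == 1 for levels in levels_per_batch.values())

def small_batches(batches, min_size):
    """Return {batch: n_samples} for batches with fewer than min_size samples."""
    return {b: n for b, n in Counter(batches).items() if n < min_size}

def low_depth_samples(library_sizes, min_reads):
    """Return sample IDs whose total read count is below min_reads."""
    return [s for s, n in library_sizes.items() if n < min_reads]

# Hypothetical metadata for six samples.
batches = ["b1", "b1", "b1", "b2", "b2", "b3"]
disease = ["case", "control", "case", "control", "control", "case"]
library_sizes = {"s1": 15000, "s2": 800, "s3": 22000,
                 "s4": 450, "s5": 9000, "s6": 12000}

print(Counter(zip(batches, disease)))              # batch-by-variable table
print(is_completely_confounded(batches, disease))  # False: b1 has both levels
print(small_batches(batches, min_size=3))          # {'b2': 2, 'b3': 1}
print(low_depth_samples(library_sizes, min_reads=1000))  # ['s2', 's4']
```

In this made-up example the key variable is not completely confounded with batch, but two batches are very small and two samples have shallow sequencing depth, which are the situations described above as problematic.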