wdl2459 / ConQuR

Batch effects removal for microbiome data via conditional quantile regression
GNU General Public License v3.0

How can corrected read counts be increased from 0? #19

Open mhjonathan opened 1 year ago

mhjonathan commented 1 year ago

Hi, I've found your tool very useful, and I'm now reviewing my analysis from the beginning since I got a very robust outcome.

I'm not a statistician but a biologist who applies bioinformatics tools and data analysis, so please let me know if this is related to specific statistical principles :) I'll study them while reviewing my data.

After correcting for batch effects, I got a corrected taxa count file, and I found that quite a few taxa with no detected reads in the original files are detected in the corrected file.

The same happens in your vignette (https://wdl2459.github.io/ConQuR/ConQuR.Vignette.html): in the preview, Acetitomaculum is present only in sample 277, but after correction all 5 samples in the preview have reads for this taxon. How can samples gain reads that were not detected in the original file? From my point of view, whether we are correcting batch effects or confounder effects, the taxon was not detected in the real data, so how can reads just pop up from nowhere?

wdl2459 commented 1 year ago

The underlying idea of ConQuR is to match the conditional distributions of the other batches to that of the reference batch, including the proportion of zeros and the distribution of the positive part. Specifically, the framework handles zero inflation: it calibrates unwanted presence–absence differences among batches, recovering non-zero counts for under-sampled observations and forcing over-sampled ones to zero. Either conversion may seem odd to researchers in different fields.

However, it is helpful to keep in mind that zeros in microbiome data may be classified as sampling zeros (due to undersampling) or structural zeros (due to taxon absence), and to understand the introduced zeros as sampling zeros rather than structural zeros. In microbiome studies, there is no way to differentiate between the two kinds of zeros, so we assume that differences in the rate of taxon presence between batches are primarily due to a higher rate of sampling zeros (not structural zeros) in the sparser batches. Instead of recovering the “truth”, which will never be fully feasible given the limitations of the data, ConQuR aims to align all batches’ distributions, including the presence–absence likelihood.
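To make the distribution-matching idea above more concrete, here is a minimal Python sketch of plain (unconditional) quantile matching for a single taxon. It is not ConQuR's implementation: ConQuR works conditionally on covariates with a two-part model, whereas this toy version ignores covariates and breaks ties among equal counts arbitrarily by sample order. The function names and the count vectors are hypothetical and chosen only for illustration. It shows how a zero can become non-zero when the reference batch has a lower proportion of zeros, and how non-zero counts can become zero in the opposite case.

```python
import math


def empirical_quantile(sorted_vals, p):
    """Return the smallest value v such that at least a fraction p of values are <= v."""
    n = len(sorted_vals)
    idx = min(n - 1, max(0, math.ceil(p * n) - 1))
    return sorted_vals[idx]


def match_to_reference(batch, reference):
    """Map each count in `batch` to the reference-batch count at the same percentile.

    Tied counts (typically the zeros) are spread over their percentile range
    in sample order. This is an arbitrary toy rule, not what ConQuR does.
    """
    ref_sorted = sorted(reference)
    n = len(batch)
    # Stable sort: indices of samples ordered by their count.
    order = sorted(range(n), key=lambda i: batch[i])
    corrected = [0] * n
    for rank, i in enumerate(order):
        p = (rank + 0.5) / n  # mid-rank percentile of sample i within its batch
        corrected[i] = empirical_quantile(ref_sorted, p)
    return corrected


# Hypothetical counts of one taxon in two batches.
reference = [0, 0, 3, 5, 8, 12, 20, 30, 41, 55]   # 20% zeros
sparse_batch = [0, 0, 0, 0, 0, 0, 0, 2, 4, 9]     # 70% zeros

corrected = match_to_reference(sparse_batch, reference)
zeros_before = sum(1 for c in sparse_batch if c == 0)
zeros_after = sum(1 for c in corrected if c == 0)
print(zeros_before, zeros_after)  # 7 zeros before, 2 after

# Opposite direction: a dense batch matched to a sparse reference gains zeros.
dense_batch = [1, 2, 0, 7, 3, 9, 4, 6, 5, 8]      # 10% zeros
forced = match_to_reference(dense_batch, sparse_batch)
print(sum(1 for c in forced if c == 0))  # 7 zeros after matching
```

In this toy example, five of the seven zeros in the sparser batch are replaced by positive counts simply because the reference batch has fewer zeros at the same percentiles. That is the same kind of change the question describes, although in ConQuR which samples change depends on the fitted conditional model rather than on sample order.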