wdl2459 / ConQuR

Batch effects removal for microbiome data via conditional quantile regression
GNU General Public License v3.0

Is there data leakage because labels are used in the batch effect process? #15

Open JiaLonghao1997 opened 1 year ago

JiaLonghao1997 commented 1 year ago

In the tutorial, the key factor (systolic blood pressure) is used as a covariate. For a study-to-study validation, there is a data leakage problem if the label of the test set is used as a covariate; if it is not used, the variance explained by the label may be masked by other covariates. How can we deal with this issue?

Figure 4.c shows the cross-validated area under the receiver operating characteristic curve (ROC-AUC) for predicting HIV status from the taxa read counts via random forest. How can we avoid data leakage in that analysis?

tommyfuu commented 1 year ago

Hi, I get what you are saying about the data leakage. The biological information in the test set has surely been used by ConQuR (or by any other batch correction method you might use) for the particular task of correcting out the batch effect. However, in theory all of these batch correction methods attempt to recover the real distribution of the data, and the label information is part of that real distribution. One cannot hope to recover the realistic structure of the data without giving the model that information, so it is almost a chicken-and-egg question. Note that here the models are for recovering data structure, not for label prediction, so the problem with data leakage should be minimal, as these models are not designed to predict these variables anyway.
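For the study-to-study setting raised above, one way to see the trade-off is a toy sketch in which the held-out study is corrected from its features only, so its labels never enter the correction step. This is not ConQuR: the correction below is a plain per-batch mean centering swapped in for illustration, the classifier is a nearest-centroid rule rather than a random forest, and every name (`make_study`, `center_batch`, and so on) is hypothetical.

```python
import random

def make_study(n, shift, rng):
    """Toy study: feature 0 carries the label, both features carry a batch shift."""
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced classes (an assumption of this sketch)
        X.append([label + shift + rng.gauss(0, 0.3), shift + rng.gauss(0, 0.3)])
        y.append(label)
    return X, y

def center_batch(X):
    """Label-free correction: subtract the per-feature mean of the batch."""
    mu = [sum(col) / len(col) for col in zip(*X)]
    return [[v - m for v, m in zip(row, mu)] for row in X]

def fit_centroids(X, y):
    """Nearest-centroid 'model': one mean vector per class, from training data only."""
    centroids = {}
    for c in set(y):
        rows = [r for r, l in zip(X, y) if l == c]
        centroids[c] = [sum(col) / len(col) for col in zip(*rows)]
    return centroids

def predict(centroids, X):
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda c: dist2(row, centroids[c])) for row in X]

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

rng = random.Random(0)
X_train, y_train = make_study(200, shift=0.0, rng=rng)  # training study
X_test, y_test = make_study(200, shift=5.0, rng=rng)    # held-out study

# No correction: the batch shift dominates and the model fails on the new study.
acc_raw = accuracy(y_test, predict(fit_centroids(X_train, y_train), X_test))

# Label-free correction: each study is centered using its own features only.
model = fit_centroids(center_batch(X_train), y_train)
acc_corrected = accuracy(y_test, predict(model, center_batch(X_test)))

print(acc_raw, acc_corrected)
```

The sketch works only because both toy studies have the same class balance. If the held-out study had a different proportion of cases, centering would remove part of the biological signal along with the batch shift, which is the masking concern from the original question and the reason correction methods ask for the key variable as a covariate.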

@wdl2459 adding Wodan to the discussion in case she has additional insights!