wdl2459 / ConQuR

Batch effects removal for microbiome data via conditional quantile regression
GNU General Public License v3.0

Does Tune_ConQuR function test all options I input? #18

Open mhjonathan opened 1 year ago

mhjonathan commented 1 year ago

Hi, thank you for the amazing tool and it helps me a lot now :)

I want to ask one thing about the Tune_ConQuR function. So far the default setting of ConQuR works best for my data, and I want to check whether fine-tuning could give even better performance.

I first ran Tune_ConQuR with the same code as in the vignette (https://wdl2459.github.io/ConQuR/ConQuR.Vignette.html), but it gave worse performance than the default setting (the batch effect decreased less, and the key covariate effect also decreased).

So my question is: if I give Tune_ConQuR all possible parameters as below, does it test every combination of the given parameters and return the best one at the end?

result_tuned = Tune_ConQuR(tax_tab=total_taxa, batchid=batchid, covariates=covar,
                           batch_ref_pool=c("0", "1"),
                           logistic_lasso_pool=T,
                           quantile_type_pool=c("standard", "lasso", "composite"),
                           simple_match_pool=T,
                           lambda_quantile_pool=c(NA, "2p/n"),
                           interplt_pool=T,
                           frequencyL=0,
                           frequencyU=1)

For example, if I run the code above, does it test every combination with "logistic_lasso_pool=T" and also every combination with "logistic_lasso_pool=F"?

And is it possible for this fine-tuned ConQuR result to perform worse than the default setting?

wdl2459 commented 1 year ago

Thanks for your interest in our tool! Your code tries all combinations of the values you supplied, but logistic_lasso_pool is always T. The following code tries all possible combinations:

taxa_tuned = Tune_ConQuR(tax_tab=total_taxa, batchid=batchid, covariates=covar,
                         batch_ref_pool=c("0", "1"),
                         logistic_lasso_pool=c(T, F),
                         quantile_type_pool=c("standard", "lasso", "composite"),
                         simple_match_pool=c(T, F),
                         lambda_quantile_pool=c(NA, "2p/n", "2p/logn"),
                         interplt_pool=c(T, F),
                         frequencyL=0,
                         frequencyU=1,
                         cutoff=0.25)
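To see how much larger the second search is, the pools can be expanded into a grid with base R. This is only a sketch of the Cartesian product of the pool values quoted in this thread. It does not call ConQuR, the column names are illustrative, and whether Tune_ConQuR skips redundant combinations internally is not something this sketch can tell you.

```r
# Pools from the first call in this thread: three pools hold a single value.
grid_first <- expand.grid(
  batch_ref       = c("0", "1"),
  logistic_lasso  = TRUE,
  quantile_type   = c("standard", "lasso", "composite"),
  simple_match    = TRUE,
  lambda_quantile = c(NA, "2p/n"),
  interplt        = TRUE,
  stringsAsFactors = FALSE
)

# Pools from the reply: every logical pool holds both TRUE and FALSE.
grid_full <- expand.grid(
  batch_ref       = c("0", "1"),
  logistic_lasso  = c(TRUE, FALSE),
  quantile_type   = c("standard", "lasso", "composite"),
  simple_match    = c(TRUE, FALSE),
  lambda_quantile = c(NA, "2p/n", "2p/logn"),
  interplt        = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

n_first <- nrow(grid_first)  # 2 * 1 * 3 * 1 * 2 * 1 = 12
n_full  <- nrow(grid_full)   # 2 * 2 * 3 * 2 * 3 * 2 = 144
cat("first call:", n_first, "rows; full call:", n_full, "rows\n")
```

A pool given as a single value contributes a factor of 1, so that setting is never varied.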

It is very unlikely that the fine-tuned result is worse, because tuning finds the local optimum for each group of taxa within a specific prevalence range; still, it can happen because of specific data characteristics. Please make sure total_taxa contains raw counts, not relative abundances.
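A quick sanity check for the raw-count requirement might look like the sketch below. `looks_like_raw_counts` is a hypothetical helper, not part of ConQuR, and it only tests that every entry is a non-negative whole number.

```r
# Hypothetical helper (not part of ConQuR): TRUE when every entry of the
# table is a non-negative whole number, as expected for raw read counts.
looks_like_raw_counts <- function(tab) {
  m <- as.matrix(tab)
  is.numeric(m) && !anyNA(m) && all(m >= 0) && all(m == round(m))
}

# Toy data: a 2 x 3 count table and its row-normalized relative abundances.
counts <- matrix(c(10, 0, 3,
                    7, 0, 25), nrow = 2, byrow = TRUE)
rel_abund <- counts / rowSums(counts)

check_counts <- looks_like_raw_counts(counts)     # whole numbers
check_rel    <- looks_like_raw_counts(rel_abund)  # fractions
print(c(counts = check_counts, rel_abund = check_rel))
```

Running the helper on the real table before calling Tune_ConQuR would catch an accidentally normalized input.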

mhjonathan commented 1 year ago

Thank you for the answer.

I ran the fine-tuning code below on the test data (Sample_Data in the ConQuR library), using all parameters and combinations to get the best fine-tuned result, but the result does not look good in the RMSE boxplot shown below.

The PERMANOVA R2 test gives me the best outcome. Does it make sense that I get a negative value for batch in the standard column, as shown below?

PERMANOVA_R2(TAX=taxa_optimal, batchid=batchid, covariates=covar, key_index=4)
$tab_count
          standard  sqrt.dist=T      add=T
batch -0.004225372 0.0006665144 0.00464748
key    0.010340760 0.0072048730 0.00523369

$tab_rel
          standard sqrt.dist=T        add=T
batch 0.0009768619 0.004257547 0.0009768619
key   0.0112542480 0.007244982 0.0112542480

result_tuned = Tune_ConQuR(tax_tab=taxa, batchid=batchid, covariates=covar,
                           batch_ref_pool=c("0", "1","2"),
                           logistic_lasso_pool=c(T,F), 
                           quantile_type_pool=c("standard", "lasso","composite"),
                           simple_match_pool=c(T,F),
                           lambda_quantile_pool=c(NA, "2p/n"),
                           interplt_pool=c(T,F),
                           frequencyL=0,
                           frequencyU=1)
taxa_optimal = result_tuned$tax_final

sbp = covar[, 'sbp']
taxa_result = list(taxa, taxa_corrected1, taxa_corrected2, taxa_optimal)

pred_rmse = matrix(ncol=4, nrow=5)
colnames(pred_rmse) = c("Original", "ConQuR (Default)", "ConQuR (Penalized)", "ConQuR (Fine-Tuned)")

for (ii in 1:4){
  pred_rmse[, ii] = RF_Pred_Regression(TAX=taxa_result[[ii]], variable=sbp)$rmse_across_fold
}

par(mfrow=c(1,1))
boxplot(pred_rmse, main="RMSE of Predicting SBP")

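[Figure: boxplot "RMSE of Predicting SBP" for Original, ConQuR (Default), ConQuR (Penalized), and ConQuR (Fine-Tuned)]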

wdl2459 commented 1 year ago

It is likely that the kernel matrix (-1/2 H D^2 H, where H is the centering matrix) has negative eigenvalues. In that case, if the batch variable projects more strongly onto the eigenvectors associated with negative eigenvalues, the R^2 can be negative. To test the hypothesis that negative eigenvalues are causing the negative R^2 values, note that adonis2 permits adding a constant to the dissimilarities or taking their square root (add=TRUE or sqrt.dist=TRUE) to avoid negative eigenvalues. That is why the output of PERMANOVA_R2 has three columns. You can use one of the last two columns, depending on your case.

As for the prediction plot, yes, it is possible. The tuned version tries to achieve the smallest PERMANOVA R^2, and PERMANOVA R^2 and prediction accuracy do not imply each other. Also, ConQuR is sensitive to the choice of reference batch. Because the tuned version tries every candidate reference batch, it may sometimes pick an odd one. It is therefore better to choose a good reference batch based on prior information or domain knowledge. Otherwise, the default version could be the safest choice.
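The negative-eigenvalue effect can be reproduced on a toy dissimilarity matrix with base R. This is a sketch on made-up numbers, not on the data in this thread, and `gower_eigenvalues` is a hypothetical helper rather than a ConQuR or vegan function. Three points are pairwise 2 apart and a fourth is 1.1 from each of them; every triangle inequality holds, yet no Euclidean configuration has these distances.

```r
# Eigenvalues of the centered kernel matrix G = -1/2 * H %*% D^2 %*% H,
# where H is the centering matrix and D holds pairwise dissimilarities.
gower_eigenvalues <- function(d) {
  n <- nrow(d)
  h <- diag(n) - matrix(1 / n, n, n)   # centering matrix H
  g <- -0.5 * h %*% (d^2) %*% h
  # Symmetrize to remove floating-point asymmetry before decomposing.
  eigen((g + t(g)) / 2, symmetric = TRUE, only.values = TRUE)$values
}

# Toy dissimilarities: points 1-3 are 2 apart, point 4 is 1.1 from each.
d <- matrix(2, 4, 4)
d[4, ] <- 1.1
d[, 4] <- 1.1
diag(d) <- 0

eig_raw  <- gower_eigenvalues(d)        # raw dissimilarities
eig_sqrt <- gower_eigenvalues(sqrt(d))  # square-root dissimilarities

print(round(eig_raw, 4))
print(round(eig_sqrt, 4))
```

On this toy matrix the raw dissimilarities give one negative eigenvalue, while their square roots give none, which is the situation the sqrt.dist=TRUE column is meant to address.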

mhjonathan commented 1 year ago

Do you have any tips for picking a seemingly "best" reference batch from the beginning?

And is that because ConQuR treats the reference batch (or reference study in a meta-analysis) as the base environment and adjusts the other batches to it?

wdl2459 commented 1 year ago

This could be tricky. The short answer is the "highest quality" batch, determined from domain knowledge about the experiment, sample processing, etc. If no such information is available, try a batch that is neither extremely sparse nor extremely dense and whose empirical dispersion is not extreme. And yes, it is because ConQuR uses the reference batch (or reference study in a meta-analysis) as the base environment and adjusts the other batches to align with it.
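One way to screen batches along these lines is a per-batch summary of sparsity and dispersion. The sketch below uses base R on simulated counts; `summarize_batches` is a hypothetical helper, not part of ConQuR, and the median variance-to-mean ratio is only one crude proxy for "empirical dispersion" chosen here as an assumption.

```r
# Hypothetical helper (not part of ConQuR): summarize each batch so that
# unusually sparse, dense, or dispersed batches stand out.
summarize_batches <- function(tax_tab, batchid) {
  tax_tab <- as.matrix(tax_tab)
  batchid <- factor(batchid)
  rows <- lapply(levels(batchid), function(b) {
    x <- tax_tab[batchid == b, , drop = FALSE]
    taxa_mean <- colMeans(x)
    taxa_var  <- apply(x, 2, var)
    present   <- taxa_mean > 0          # taxa observed at least once in the batch
    data.frame(
      batch = b,
      n_samples = nrow(x),
      zero_fraction = mean(x == 0),     # share of zero entries (sparsity)
      median_var_to_mean = median(taxa_var[present] / taxa_mean[present])
    )
  })
  do.call(rbind, rows)
}

# Simulated toy data: 60 samples x 20 taxa, three batches of 20 samples each.
set.seed(1)
toy_counts <- matrix(rnbinom(60 * 20, size = 0.5, mu = 20), nrow = 60)
toy_batch  <- rep(c("0", "1", "2"), each = 20)

batch_summary <- summarize_batches(toy_counts, toy_batch)
print(batch_summary)
```

A batch whose zero fraction or variance-to-mean ratio sits far from the others would be a doubtful reference; domain knowledge about data quality should still take priority.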

mhjonathan commented 1 year ago

Thank you so much for your kind explanation!

I have one more question about the reference batch as the base environment. According to the ConQuR paper:

ConQuR assumes that for each microorganism, samples share the same conditional distribution if they have identical intrinsic characteristics (with the same values for key variables and important covariates, e.g., clinical, demographic, genetic, and other features), regardless of in which batch they were processed.

My question is: if I have 3 HIV studies from different countries, they will differ in diet, geographical location, and environment, which are the factors that contribute most to microbial composition. Then, if I set one of them as the reference and input Age as a covariate, ConQuR assumes that samples from the two other studies have the same distribution for each taxon as the reference.

Does this mean ConQuR ignores those factors and only considers the variables (Age) I input when I run the tool?

wdl2459 commented 1 year ago

This question concerns batch correction in general, beyond ConQuR. Let's first consider the case where batch confounds the important metadata but the confounding is not complete. Metadata that shape the distributions of microbiome abundance should be included in the model. Batch removal tools aim to preserve the effect/variability that comes from those metadata; if they are not included, that effect/variability will also be partially eliminated. In your example, I guess geographical location is the batch/study ID, while diet, environment, and age are metadata. Therefore, diet, environment, and age should be included.

However, consider complete confounding: for example, Study 1 recruits only men while Study 2 recruits only women. Then there is no way to separate the batch effect from the gender effect. Without further information, most existing tools, including ConQuR, cannot handle this problem.

mhjonathan commented 1 year ago

So if I have 10 different datasets and want to remove the batch effect with geographical_location/Gender/Environment as covariates, but one study has only females, I could end up with biased batch-corrected data, right?

Then do I also have to check in advance that the covariates for ConQuR are well distributed? At least that every dataset contains more than one value for each covariate (e.g., all datasets have both Male and Female, reasonably well distributed)?

wdl2459 commented 1 year ago

Yes, please check the distribution of the key variable and of the important covariates whose effects you would like to preserve.

Complete confounding is an extreme case, e.g., batch 1 has only females and batch 2 has only males. In your example, one batch has only females while the other batches have both females and males: this is not ideal, of course, but I think batch removal methods, including ConQuR, can still be applied.

Being well distributed is not a requirement. I guess "well distributed" means a balanced design, as in clinical trials, right? Batch removal methods remove batch effects that confound the biomedical variables, whereas with a balanced design the batch effect might not be a big problem.

In short, complete confounding is a problem for batch removal methods; other cases may not be ideal, but batch removal methods can still be applied.
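The distinction between complete and partial confounding can be checked before running any correction. The sketch below uses base R on made-up labels; `is_completely_confounded` is a hypothetical helper, not part of ConQuR. It flags the case where the design matrix for batch plus one covariate is rank deficient, and it assumes the covariate takes at least two values overall.

```r
# Hypothetical helper (not part of ConQuR): TRUE when batch and a covariate
# cannot be separated, i.e. the design matrix ~ batch + covariate loses rank.
# Assumes both inputs have at least two distinct values.
is_completely_confounded <- function(batch, covariate) {
  design <- model.matrix(~ factor(batch) + factor(covariate))
  qr(design)$rank < ncol(design)
}

# Extreme case from the thread: batch 1 is all female, batch 2 is all male.
batch_extreme <- rep(c("1", "2"), each = 4)
sex_extreme   <- rep(c("F", "M"), each = 4)

# Milder case: one batch is all female, the other batches contain both.
batch_mixed <- rep(c("1", "2", "3"), each = 4)
sex_mixed   <- c("F", "F", "F", "F",
                 "F", "M", "F", "M",
                 "M", "F", "M", "F")

print(table(batch = batch_mixed, sex = sex_mixed))  # inspect the cross-tab first

confounded_extreme <- is_completely_confounded(batch_extreme, sex_extreme)
confounded_mixed   <- is_completely_confounded(batch_mixed, sex_mixed)
print(c(extreme = confounded_extreme, mixed = confounded_mixed))
```

The cross-tabulation is the more informative output in practice: empty cells show which batches lack a covariate level, even when the rank check passes.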