zellerlab / siamcat

R package for Statistical Inference of Associations between Microbial Communities And host phenoType
https://siamcat.embl.de/

How to pass large output like HUMAnN features coming from several studies? #34

Closed drelo closed 1 year ago

drelo commented 2 years ago

Hi again,

I wonder whether any of you have a strategy for handling the huge number of features in an output like the gene families from a HUMAnN analysis, pooled across 4+ studies. Since I wanted to isolate the batch effect, I already have the output from MMUPHin. Could I select a subset of gene families by some criterion, or would this introduce a large bias into the analysis? I can think of two approaches:

(I) In MMUPHin, after obtaining meta_fits <- fit_lm_meta$meta_fits, I could filter the features at this step, e.g. meta_fits %>% filter(qval.fdr < 0.05) %>% arrange(coef), and pick some of the most relevant ones.
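To make (I) concrete, here is a minimal base-R sketch of that filtering step. The data frame is a toy stand-in for meta_fits with made-up values, and select_top_features, q_cutoff and n_top are hypothetical names, not part of MMUPHin. It ranks by absolute coefficient rather than the signed arrange(coef) above, which is an assumption about what "most relevant" means.

```r
# Toy stand-in for fit_lm_meta$meta_fits; values are made up for illustration.
meta_fits <- data.frame(
  feature  = c("UniRef90_A", "UniRef90_B", "UniRef90_C", "UniRef90_D", "UniRef90_E"),
  coef     = c(1.8, -0.4, -2.1, 0.9, 0.1),
  qval.fdr = c(0.001, 0.20, 0.01, 0.03, 0.60),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep features below the q-value cutoff,
# then return the n_top features with the largest absolute coefficient.
select_top_features <- function(fits, q_cutoff = 0.05, n_top = 2) {
  sig <- fits[!is.na(fits$qval.fdr) & fits$qval.fdr < q_cutoff, ]
  sig <- sig[order(abs(sig$coef), decreasing = TRUE), ]
  head(sig$feature, n_top)
}

top_features <- select_top_features(meta_fits)
print(top_features)
```

One caveat on the bias question: if the selected features later feed a classifier trained on the same samples, selecting them with the labels of the full dataset may leak information into the evaluation, so it is probably safer to do any such selection inside the cross-validation folds.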

(II) Another strategy would be to simply pick the most abundant features in the whole pooled/merged dataset, but this wouldn't account for features that are absent from certain studies (batches).
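One possible way to address the absence problem in (II) is to require a minimum prevalence in every study before ranking by abundance. The base-R sketch below assumes a features-by-samples matrix of relative abundances and a per-sample study label; the data, the 0.5 prevalence threshold and all variable names are made up for illustration.

```r
# Toy relative-abundance matrix: features in rows, samples in columns.
abund <- matrix(
  c(0.50, 0.40, 0.45, 0.55,   # geneA: abundant in both studies
    0.30, 0.35, 0.00, 0.00,   # geneB: abundant, but absent from study2
    0.05, 0.10, 0.15, 0.05,   # geneC: low abundance, present everywhere
    0.15, 0.15, 0.40, 0.40),  # geneD: present everywhere
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("s1", "s2", "s3", "s4"))
)
study <- c("study1", "study1", "study2", "study2")  # study label per sample

# Fraction of samples within each study in which a feature is non-zero
# (result: features in rows, studies in columns).
prevalence <- sapply(split(seq_along(study), study), function(idx) {
  rowMeans(abund[, idx, drop = FALSE] > 0)
})

# Keep features detected in at least min_prev of the samples of every study.
min_prev <- 0.5
in_all_studies <- apply(prevalence >= min_prev, 1, all)

# Among those, rank by mean abundance over the pooled samples.
mean_abund <- rowMeans(abund[in_all_studies, , drop = FALSE])
top_abundant <- head(names(sort(mean_abund, decreasing = TRUE)), 2)
print(top_abundant)
```

Here geneB is dropped despite its high abundance in study1, because it is never detected in study2.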

With either of these strategies, I think I would have to (re)normalize the data after removing rows/features, right? What is the best way to do this: within MMUPHin, or back in HUMAnN 3?
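For reference, this is the kind of renormalization I have in mind, sketched in base R on a made-up matrix: after subsetting, each sample (column) is divided by its new total so that the kept features sum to 1 again. The names keep, subset_abund and renorm are hypothetical, and this only shows the arithmetic; it is not a claim that this is the right thing to do after MMUPHin.

```r
# Toy relative abundances (features x samples); each column sums to 1.
abund <- matrix(
  c(0.50, 0.20,
    0.30, 0.30,
    0.20, 0.50),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2"))
)

keep <- c("geneA", "geneB")               # hypothetical feature selection
subset_abund <- abund[keep, , drop = FALSE]

# Per-sample totals after subsetting; a zero total would produce NaN below.
col_totals <- colSums(subset_abund)
stopifnot(all(col_totals > 0))

# Divide each column by its total so the kept features sum to 1 per sample.
renorm <- sweep(subset_abund, 2, col_totals, "/")
print(renorm)
```

Note that after this step the values are proportions of the kept subset only, not of the whole community, which changes how they should be interpreted.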

Thanks for your help.

Andrés