zhouhj1994 / LinDA

35 stars 3 forks source link

debias method #9

Open LunavdL opened 1 year ago

LunavdL commented 1 year ago

Hello,

I have been using LinDA on my microbiome dataset and have generated the effect size plots. In the effect size plots, there is a separation between "debiased" and "un-debiased". I was wondering what method is used to "debias"? And can I find both values in the linda.obj output?

I was also wondering if I can find the regression model estimates somewhere?

Cheers

zhouhj1994 commented 1 year ago

Hi,

Let a_i represent the regression coefficient for the centered log-ratio (CLR) transformed count data for taxon i. Let b_i represent the regression coefficient for the true abundance data for taxon i. We have observed that, approximately, a_i = b_i - b, where b is the average of b_1,...,b_m for the m taxa. So after we estimate a_i based on the CLT transformed count data, we need to debias it to obtain the estimate of b_i. Under the assumption that only a small portion of taxa are differential, i.e., most of b_i are zeros, we thus use the mode of a_i to estimate -b (the bias).

The debiased estimate of b_i can be found in linda.obj$output$x$log2FoldChange, x represents a specified regression variable.

The bias can be fould in linda.obj$bias. The un-debiased estimate is log2FoldChange + bias.

The regression estimates results are in linda.obj$output.

LunavdL commented 1 year ago

Thank you very much for the clarification!

In what part of the linda.obj$output can I find the regression estimates specifically? My linda.obj$output looks like this:

List of 1
 $ salinity:'data.frame':   163 obs. of  8 variables:
  ..$ baseMean      : num [1:163] 10920 16355 17078 12381 19091 ...
  ..$ log2FoldChange: num [1:163] -0.08247 -0.00935 0.00725 0.0286 -0.02719 ...
  ..$ lfcSE         : num [1:163] 0.0562 0.0397 0.0414 0.043 0.0427 ...
  ..$ stat          : num [1:163] -1.466 -0.236 0.175 0.666 -0.637 ...
  ..$ pvalue        : num [1:163] 0.146 0.814 0.862 0.507 0.526 ...
  ..$ padj          : num [1:163] 0.389 0.947 0.947 0.808 0.808 ...
  ..$ reject        : logi [1:163] FALSE FALSE FALSE FALSE FALSE FALSE ...
  ..$ df            : int [1:163] 89 89 89 89 89 89 89 89 89 89 ...
zhouhj1994 commented 1 year ago

The regression results are stored in linda.obj$output, which is a list with each element storing the regression results of each variable.

I observe that "salinity" is one of the variables in your regression model, so linda.obj$output$salinity will extract the regression results for salinity, which is a dataframe with columns 'baseMean', 'log2FoldChange', 'lfcSE', 'stat', 'pvalue', 'padj', 'reject', 'df'. Use ?linda to see more details.

In specific, 'log2FoldChange' is the regression estimate as I mentioned in last comment. In addition, based on the adjusted p-value 'pabj', you can determine whether salinity is significantly associated with the abundance of each taxon.