Open koenvandenberge opened 1 year ago
Hi,
Suppose the two treatments are called "g1" and "g2", and the two timepoints are "t1" and "t2". Denote by alpha1, alpha2, and alpha3 the regression coefficients of x, t, and x:t, respectively. Let beta = alpha1 + alpha2 + alpha3.
Then the effect of (g2, t1) is the regression coefficient of x (alpha1), the effect of (g1, t2) is the regression coefficient of t (alpha2), and the effect of (g2, t2) would be the sum of the regression coefficients of x, t and x:t (beta).
Let the z-score be s = beta/sd(beta), where sd(beta) = sqrt(var(alpha1+alpha2+alpha3)) = sqrt(var(alpha1) + var(alpha2) + var(alpha3) + 2cov(alpha1,alpha2) + 2cov(alpha1,alpha3) + 2cov(alpha2,alpha3)). These variances and covariances can be obtained from the LinDA output. The significance of beta can then be evaluated by the p-value converted from the z-score: p = 2P(Z <= -|s|), where Z follows the standard normal distribution.
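For a plain linear model fitted with `lm` (outside LinDA), this recipe can be sketched in base R; the data below are simulated purely for illustration:

```r
# Simulated 2x2 design: treatments g1/g2 crossed with timepoints t1/t2.
set.seed(1)
x  <- factor(rep(c("g1", "g2"), each = 20))
tp <- factor(rep(c("t1", "t2"), times = 20))
y  <- rnorm(40) + 0.5 * (x == "g2") + 0.3 * (tp == "t2") +
  0.2 * (x == "g2") * (tp == "t2")

fit <- lm(y ~ x * tp)
cf <- coef(fit)   # (Intercept), xg2, tpt2, xg2:tpt2
V  <- vcov(fit)   # variance-covariance matrix of the coefficients

# beta = alpha1 + alpha2 + alpha3: (g2, t2) versus (g1, t1)
idx     <- c("xg2", "tpt2", "xg2:tpt2")
beta    <- sum(cf[idx])
sd.beta <- sqrt(sum(V[idx, idx]))  # all variances plus twice each covariance
s <- beta / sd.beta
p <- 2 * pnorm(-abs(s))            # p = 2 P(Z <= -|s|)
```

Because the model is saturated for the 2x2 design, beta is exactly the difference between the (g2, t2) and (g1, t1) cell means.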
Thanks.
Thank you for the swift response. I have a couple of clarification questions.

- The output from LinDA does not contain an intercept, so I was unsure of the regression coefficients' interpretation. If there were no intercept, one would expect alpha1 to correspond to the expected value for the treatment at timepoint 1. If there were an intercept, it would be the difference (on the linear predictor scale) between treatment and control at timepoint 1, as seems to be suggested here. What is the design matrix being used in the LinDA models? Treatment coding, sum coding, or something else?
- Assuming we have an intercept model (treatment coding), the contrast I am interested in is not H0: α1+α2+α3 = 0, as this would answer the question whether the expected value differs at (g2,t2) versus (g1,t1), using your notation. I am rather interested in H0: α1+α3 = 0, comparing (g2,t2) to (g1,t2). Let β = α1+α3; then it seems Var(β) = var(α1) + var(α3) + 2cov(α1,α3). Correct?
- Assessing contrasts of linear combinations of the regression coefficients seems like a prominent use case that would often be of interest. It may be worthwhile to have a function that implements this for the user, with, e.g., a matrix denoting the linear combination(s) of interest as input. In our case it could be L = matrix(c(1,0,1), ncol = 1).
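LinDA itself does not (as of this thread) ship such a contrast function, but for a plain `lm` fit a minimal sketch could look like the following; `test_lincom` is a made-up name:

```r
# Test H0: L' beta = 0 for a linear combination L of the coefficients of an lm fit.
test_lincom <- function(fit, L) {
  cf  <- coef(fit)
  V   <- vcov(fit)
  est <- sum(L * cf)                        # L' beta-hat
  se  <- sqrt(drop(crossprod(L, V %*% L)))  # sqrt(L' V L)
  z   <- est / se
  c(estimate = est, se = se, z = z, p = 2 * pnorm(-abs(z)))
}

# Example: H0: alpha1 + alpha3 = 0 (treatment effect at the second timepoint)
set.seed(2)
x  <- factor(rep(c("g1", "g2"), each = 20))
tp <- factor(rep(c("t1", "t2"), times = 20))
y  <- rnorm(40)
fit <- lm(y ~ x * tp)
res <- test_lincom(fit, c(0, 1, 0, 1))  # selects xg2 and xg2:tpt2
```

As a sanity check, the same estimate can be obtained by releveling the timepoint factor so that t2 is the reference; the coefficient of x in that reparametrized fit equals α1 + α3.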
The baseMean in the output dataframe represents 2 to the power of the intercept coefficients (normalized to one million), since LinDA fits the linear models on the log2-transformed data. Let v be the vector of the intercept coefficients; then baseMean = 2^v / sum(2^v) * 1e6.

LinDA uses lm or lmer (for mixed models) to fit the linear model. The design matrix of the model y ~ x*t includes a vector of all 1s denoting the intercept, a 0-1 vector denoting the status of the treatment, a 0-1 vector denoting the status of the timepoint, and the product of these two 0-1 vectors. We can obtain this design matrix by model.matrix(fit), where fit = lm(y ~ x*t).

> I am also interested in a function that compares (g2,t2) to (g1,t2).
The hypothesis for comparing (g2,t2) to (g1,t2) is $H_0: \alpha_1+\alpha_3=0$, and the hypothesis for comparing (g2,t2) to (g2,t1) is $H_0: \alpha_2+\alpha_3=0$.
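The treatment (dummy) coding behind these hypotheses can be inspected directly on a toy fit; the data here are made up:

```r
# Four observations covering each (treatment, timepoint) cell once.
x  <- factor(c("g1", "g1", "g2", "g2"))
tp <- factor(c("t1", "t2", "t1", "t2"))
y  <- c(1, 2, 3, 4)
fit <- lm(y ~ x * tp)
X <- model.matrix(fit)
colnames(X)  # "(Intercept)" "xg2" "tpt2" "xg2:tpt2"
# The row for (g2, t2) is (1, 1, 1, 1), so its mean is
# alpha0 + alpha1 + alpha2 + alpha3; subtracting the (g1, t2) row
# (1, 0, 1, 0) leaves alpha1 + alpha3, and subtracting the (g2, t1) row
# (1, 1, 0, 0) leaves alpha2 + alpha3.
```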
Hi, sorry that this is quite a basic question, but I am not very familiar with mixed models.
If you take the smokers example dataset and instead formulate the model to have an interaction between Smoke and Sex:

```r
linda.obj <- linda(otu.tab, meta, formula = '~Smoke*Sex+(1|SubjectID)', alpha = 0.1,
                   prev.cut = 0.1, lib.cut = 1000, winsor.quan = 0.97)
```
This is my understanding of what the various outputs are comparing (is that correct?):
With the example model you get all these differentially abundant OTUs from linda.obj$output$Smokey:

```r
row.names(linda.obj$output$Smokey[linda.obj$output$Smokey$reject == TRUE, ])
#  [1] "3931"   "15555"  "70671"  "74391"  "86047"  "94166"  "149109" "185969" "237323" "239506" "428237"
# [12] "469920" "470738" "518865" "529659" "570119" "573384"
```

but with the interaction model, you get 0 differentially abundant OTUs for both Smokey and Smokey:Sexmale. Why is there this difference?
The effect of smoking on the outcome is enhanced (or reduced) when Sex=male compared to when Sex=female by the amount of the regression coefficient of Smokey:Sexmale.
If we look at the joint effect of Smokey and Smokey:Sexmale (the difference between (smoker, male) and (non-smoker, male)), we can discover some differential taxa:
```r
linda.obj <- linda(otu.tab, meta, formula = '~Smoke*Sex+(1|SubjectID)', alpha = 0.1,
                   prev.cut = 0.1, lib.cut = 1000, winsor.quan = 0.97)

alp1 <- linda.obj$output$Smokey$log2FoldChange
se1  <- linda.obj$output$Smokey$lfcSE
df1  <- linda.obj$output$Smokey$df
alp2 <- linda.obj$output$Sexmale$log2FoldChange
se2  <- linda.obj$output$Sexmale$lfcSE
df2  <- linda.obj$output$Sexmale$df
alp3 <- linda.obj$output$`Smokey:Sexmale`$log2FoldChange
se3  <- linda.obj$output$`Smokey:Sexmale`$lfcSE
df3  <- linda.obj$output$`Smokey:Sexmale`$df

cov12 <- linda.obj$covariance$Smokey$Sexmale
cov13 <- linda.obj$covariance$Smokey$`Smokey:Sexmale`
cov23 <- linda.obj$covariance$Sexmale$`Smokey:Sexmale`

stat   <- (alp1 + alp3) / sqrt(se1^2 + se3^2 + 2 * cov13)
pvalue <- 2 * pt(-abs(stat), (df1 + df3) / 2)
padj   <- p.adjust(pvalue, method = 'BH')
rownames(linda.obj$output[[1]])[padj <= 0.1]
#  [1] "74391"  "94166"  "149109" "237323" "239506" "428237" "518865"
#  [8] "570119" "573384"
```
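The computation above can be wrapped in a small helper that takes the extracted vectors directly; `joint_effect` is a made-up name, and the degrees of freedom (df_i + df_j)/2 simply mirror the snippet above:

```r
# Joint test of H0: alpha_i + alpha_j = 0 from per-coefficient estimates,
# standard errors, their covariance, and degrees of freedom.
joint_effect <- function(alp_i, alp_j, se_i, se_j, cov_ij, df_i, df_j,
                         alpha = 0.1) {
  stat   <- (alp_i + alp_j) / sqrt(se_i^2 + se_j^2 + 2 * cov_ij)
  pvalue <- 2 * pt(-abs(stat), (df_i + df_j) / 2)
  padj   <- p.adjust(pvalue, method = "BH")
  list(stat = stat, pvalue = pvalue, padj = padj, reject = padj <= alpha)
}

# Toy numbers (not real LinDA output):
res <- joint_effect(alp_i = c(1, 0.1), alp_j = c(1, -0.1),
                    se_i = c(1, 1), se_j = c(1, 1),
                    cov_ij = c(0, 0), df_i = c(30, 30), df_j = c(30, 30))
```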
Thanks for the example code - very helpful!
Hi, thanks for this useful software. I'm new to mixed-effects models, so I'm still confused about the result of this code: "LinDA::linda(otu.tab[, ind1], meta[ind1, ], formula = '~Smoke+Sex+(1|SubjectID)', feature.dat.type = 'count', prev.filter = 0.1, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.1)". I am not interested in the interaction "Smoke*Sex" but in the additive model "Smoke+Sex", so are the OTUs from linda.obj$output$Smokey obtained while fixing the gender to male or to female?

And how can I get this information when I create new models like "linda(taxa, meta, formula = '~Ethnicity (WHITE or BLACK) + Antibiotics Use (YES or NO) + Sex (MALE or FEMALE) + (1|Day)', zero.handling = 'pseudo-count', feature.dat.type = 'count', prev.filter = 0, is.winsor = TRUE, outlier.pct = 0.03, p.adj.method = "BH", alpha = 0.05)"?

Are the reject == TRUE OTUs from linda.obj$output$Ethnicity the differential OTUs while fixing "Antibiotics Use + Sex" at a particular status, e.g. "YES & MALE"? Or are they the OTUs that differ between black and white people regardless of gender and antibiotics use?

Thanks for your help!
The regression coefficient of a variable in a linear model means that, holding the other variables fixed, increasing this variable by one unit increases the response by the value of the coefficient. The "significance" or "rejection" result means that this coefficient is significantly nonzero; in other words, the corresponding variable is significantly associated with the response after adjusting for the other variables.
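A quick base-R illustration with made-up, noise-free data, constructed so the coefficient of x is exactly 3:

```r
# y = 2 + 3*x + 1*z exactly, so the fitted coefficient of x recovers 3:
# increasing x by one unit while holding z fixed raises y by 3.
x <- c(0, 1, 2, 3, 0, 1, 2, 3)
z <- c(0, 0, 0, 0, 1, 1, 1, 1)
y <- 2 + 3 * x + 1 * z
fit <- lm(y ~ x + z)
coef(fit)["x"]  # 3
```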
Hi,
When running the LinDA workflow, output is provided that focuses on inference for each individual regression coefficient separately. For example, if I had a design with two treatments x, each at two timepoints t, the standard LinDA output using a ~x*t design allows me to perform inference on x, t and x:t. However, say I am interested in assessing the treatment effect at the second timepoint; this would amount to testing the sum of the regression coefficients x + x:t. How should one go about this when using LinDA?