Statistical interpretation of Annotations

harryyiheyang commented 7 months ago

Hi Dr. Zheng,

Thank you very much for providing such an excellent method. I successfully implemented it yesterday without much effort, and the total computation time was less than five hours.

I would like to ask about the statistical interpretation of the annotation output files here. In your paper, you mainly focused on using annotations and increasing the number of SNPs in the reference panel to improve the prediction of PRS. However, in some fields, interpreting annotations and the contributions of each component of annotations to the variance in PRS is also an important statistical issue.

Here is an example of my annotation output:

_sbrc.AnnoPerSnpHsqEnrichment:

Annotation  Enrich       SD
Intercept   0.999999595  5.3140091708849e-07
Adrenal     3.3885779    0.0939944802537942
Artery      5.25641125   0.117194376273301
Heart       3.95274915   0.109830623827918
Kidney      2.7875685    0.072313031233526

As shown, I have only four annotations, each corresponding to machine-learned CREs from one of four tissues. I want to understand which of these tissues is most important (it seems to be the Artery), but comparing the variance each explains against the part not covered by any annotation (Intercept) and against the other three tissues seems challenging. I noted that in the section "Contributions of functional categories to prediction accuracy", per-SNP predictability is linear in per-SNP enrichment. Therefore, is it valid to conclude that the proportion of PRS variance due to Kidney tissue is

2.7875685/(1+3.3885779+5.25641125+3.95274915+2.7875685)=17%?

If you could provide some explanation, including another file _sbrc.AnnoJointProb, it would be greatly appreciated.

Best regards, Yihe

zhilizheng commented 7 months ago

Hi @harryyiheyang,

Your calculation is incorrect; sorry for the confusion. These values are per-SNP heritability enrichments. You can simply report: the per-SNP heritability enrichment for Artery is 5.26 (possibly also giving the SD). This means the causal variants for your trait are enriched in this category. You can't sum the enrichments. If you are interested in the variance explained by one category compared to the others, you can refer to ".mcmcsamples.AnnoTotalGenVar", which holds the variance explained in each MCMC sample (hence, take the mean of each column).
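As a minimal sketch of "take the mean of each column", the snippet below summarises a whitespace-delimited table of MCMC samples. It assumes the file has one header row of annotation names followed by one row per sample, which matches the excerpt quoted later in this thread; the helper name `column_means` and the inline sample text are illustrative and not part of SBayesRC.

```python
import io
import numpy as np

# Sample rows copied from the excerpt quoted in this thread. The layout
# (header row, then one row per MCMC sample) is an assumption.
sample = """Intercept Adrenal Artery Heart Kidney
0.23868 0.0654123 0.0775214 0.0663788 0.0602395
0.240164 0.0661823 0.0795995 0.0678905 0.0622744
0.238308 0.0654055 0.078188 0.0662605 0.0609089
"""

def column_means(handle):
    """Return {annotation: (mean, sd)} across MCMC samples (hypothetical helper)."""
    header = handle.readline().split()
    samples = np.loadtxt(handle, ndmin=2)
    means = samples.mean(axis=0)
    # Sample SD needs at least two rows; fall back to zeros otherwise.
    if samples.shape[0] > 1:
        sds = samples.std(axis=0, ddof=1)
    else:
        sds = np.zeros(len(header))
    return {n: (float(m), float(s)) for n, m, s in zip(header, means, sds)}

summary = column_means(io.StringIO(sample))
for name, (mean, sd) in summary.items():
    print(f"{name}\t{mean:.4f}\t{sd:.4f}")
```

For a real run, the same helper could be called on an opened file, for example `with open(path) as fh: column_means(fh)`, where `path` points to the `.mcmcsamples.AnnoTotalGenVar` output.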

Regards, Zhili

harryyiheyang commented 7 months ago

Dear Dr. Zheng,

Thank you for your prompt response. I have an additional question regarding the data from the file.

The top 3 lines in the file are:

Intercept  Adrenal    Artery     Heart      Kidney
0.23868    0.0654123  0.0775214  0.0663788  0.0602395
0.240164   0.0661823  0.0795995  0.0678905  0.0622744
0.238308   0.0654055  0.078188   0.0662605  0.0609089

and it appears that the column means are similar. However, all these values are larger than the total heritability (hsq) estimate provided:

"hsq 0.199155033007264 nnz 88178.195".

Could you please check (1) whether the column means are additive, and (2) whether the sum of these column means equals the total hsq estimate?

I apologize for troubling you with these questions, as I should be able to figure them out myself. I am very grateful for your assistance.

Best regards, Yihe

zhilizheng commented 6 months ago

Hi @harryyiheyang,

The variances here were estimated from the sum of beta squared. It's not a perfect method to calculate the variance, and the result is often a bit larger than the hsq (which we estimated from the model); however, it is the easiest way and is used by multiple methods. The intercept should be close to another variable in the output, ssq (sum of beta squared), which is proportional to hsq. The variance explained is highest for Artery (almost 0.08).

Sometimes there is some overlap between your annotations, so it's not good to sum them together: the overlapping part would inflate your results. If you are interested in Adrenal + Artery + Heart + Kidney, you can add one annotation column that merges them (new = 1 if any of those categories is 1). It's OK to run SBayesRC with the Intercept, each annotation, and the combination of annotations.
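The merged column described above could be built roughly as follows. This is only a sketch: the tiny 0/1 matrix, the column name `AnyTissue`, and the helper `add_union_column` are made up for illustration, and the real annotation file layout expected by SBayesRC should be checked against its documentation.

```python
import numpy as np

# Hypothetical 0/1 annotation matrix: rows are SNPs, columns follow `names`.
names = ["Intercept", "Adrenal", "Artery", "Heart", "Kidney"]
annot = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1],
])

def add_union_column(annot, names, members, new_name="AnyTissue"):
    """Append a column that is 1 when any of `members` is 1 for that SNP."""
    idx = [names.index(m) for m in members]
    union = (annot[:, idx].sum(axis=1) > 0).astype(annot.dtype)
    return np.column_stack([annot, union]), names + [new_name]

merged, merged_names = add_union_column(
    annot, names, ["Adrenal", "Artery", "Heart", "Kidney"]
)
print(merged_names)
print(merged)
```

Because the new column counts each SNP once regardless of how many tissues it falls in, its variance estimate is not inflated by overlap the way a sum of the four tissue columns would be.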

Hope it's helpful.

Regards, Zhili

harryyiheyang commented 6 months ago

Dear Zhili,

Congratulations on your publication in Nature Genetics.

I have read the supplementary online materials and noticed that the method for calculating per-SNP heritability enrichment is similar to that used in LDSC. This similarity has made it challenging for me to compute the contributions of different annotations, such as tissue types, based on this statistic.

One potential solution I considered is calculating the odds ratio based on alpha, although this may not provide a direct interpretation. Another solution you suggested was to include a column of annotation in the analysis. Regarding this approach, I would like to ask if it is necessary to retain the intercept in the model?

Additionally, I have a suggestion: when dealing with numerous annotations, have you considered applying annotation selection techniques? A simple lasso or Bayesian variable selection method for binary data might be effective.

Thank you very much for considering my inquiries and suggestions. I look forward to your thoughts. Yihe

zhilizheng commented 6 months ago

Hi @harryyiheyang,

Thanks.

Yes, this method is very similar to LDSC. Our alpha is on the conditional p, so the results may be difficult to interpret. My merging approach is independent of the intercept: it just sums over the SNPs belonging to one annotation category (just sum the beta^2). For comparison of annotations, here is a reference from when we contributed to the Zoonomia project: https://www.science.org/action/downloadSupplement?doi=10.1126%2Fscience.abn2937&file=science.abn2937_sm.pdf (section 13, Fig SM13.3)
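To illustrate "just sum the beta^2" and why overlapping categories should not be added together, here is a toy sketch. The effect sizes and memberships are invented, and any scaling or standardisation that SBayesRC applies internally is not shown, so this is not the package's actual computation.

```python
import numpy as np

# Hypothetical per-SNP effects from one MCMC sample, plus 0/1 memberships.
beta = np.array([0.10, -0.20, 0.05, 0.30])
annot = {
    "Artery": np.array([1, 1, 0, 0]),
    "Heart": np.array([0, 1, 1, 0]),  # the second SNP is in both categories
}

def sum_beta_sq(beta, mask):
    """Sum of squared effects over SNPs where mask is nonzero."""
    return float(np.sum(beta[mask.astype(bool)] ** 2))

per_category = {name: sum_beta_sq(beta, m) for name, m in annot.items()}

# Union of the categories: each SNP is counted once.
union_mask = np.any(list(annot.values()), axis=0)
union_total = sum_beta_sq(beta, union_mask)

# Naive sum across categories counts the shared SNP twice.
naive_total = sum(per_category.values())

print(per_category)
print("union:", union_total, "naive sum:", naive_total)
```

In this toy case the naive sum exceeds the union total by exactly the squared effect of the shared SNP, which is the inflation mentioned earlier in the thread.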

We are considering variable selection for the annotations. Our current setting is sufficient for prediction, so we stopped there, but we will extend it to a better approach. Keep in touch; we will continue exploring.

Regards, Zhili

zhilizheng / SBayesRC

Statistical interpretation of Annotations #24