yandorazhang / GENESIS

GENetic Effect-Size distribution Inference from Summary-level data
14 stars 10 forks source link

Analysis parameters #4

Open haschard opened 2 years ago

haschard commented 2 years ago

Hello, thank you for the nice work. I am trying to run GENESIS on a few GWASs, and I am a bit confused with some of the parameter setting:

1) the input should includes the effective sample size, defined as (#cases * #controls)/(#cases+#controls). It is unclear where this definition is coming from. In other studies, it is usually defined as 2/(1/#cases + 1/#controls) (e.g. PMID=24762786), resulting in a quite different number. Any feedback on this?

2) the main function genesis() requires multiple arguments (filter, LDcutoff...). Would you have a definition of each of them, including in particular

3) from the method section of the paper, I thought the heritability and intercept from the LDscore were used for the initialization. However, I cannot find where it should be incorporated. Is there a version of LDSC integrated within GENESIS?

4) For the heritability of binary outcome, people tend to use the liability threshold model, which requires in-sample and population prevalence. However, it looks like those parameters are not used in GENESIS. Could you comment on that?

Thanks a lot! Hugo

yandorazhang commented 2 years ago

1)

Denote n1= # of cases, n0= # of controls, f = minor allele frequency for a particular SNP

Our “effective sample size” is n1*n0/(n1+n0). This comes from the standard error of estimated log-odds-ratio.

As shown in the attached picture, the variance of estimated log-odds-ratio is 1/( f(1-f) ) (n1+n0)/(n1 n0). For a standard effect-size, i.e. standard estimated log-odds-ratio, which is \sqrt( f(1-f) \hat\beta ), where \hat\beta is the estimated log-odds-ratio returned from the logistic regression (PLINK). Then the variance for the standard estimated log-odds-ratio would be (n1+n0)/(n1n0). So the effective sample size should be 1/variance = n1n0/(n1+n0).

Or simply, you can run below R simulation example, the estimate of the slope term is just the log-odds-ratio, and the s.e. returned from R is equal to sqrt(1/n.case + 1/n.control).

------------------------

x = sample(c(0,1,2), size=1e4, replace=T) x = scale(x)

y = rbinom(n=1e4, size=1, prob=0.01)

n.case = sum(y) n.control = length(y)-sum(y)

fit = glm(y~x, family=binomial, data=data.frame(cbind(x,y))) summary(fit)

sqrt(1/n.case + 1/n.control)

------------------------

2) If you type help(genesis) in the R terminal, the definition of each parameter will be shown. For example,

LDcutoff: {a number from (0.05, 0.1, 0.2); indicating LD score is calculated based on the particular r^2 cutoff. By default, it is 0.1.} LDwindow: {a number from (0.5, 1, 2); indicating LD score is calculated based on the particular window size (MB). By default, it is 1 MB.} c0: {an assumed maximum number of underlying susceptibility SNPs tagged by any individual GWAS marker. By default, c0 is set at 10.}

3) Yes, in GENESIS function, there is one step of a simple version of LDSC, so that the intercept and heritability will be incorporated automatically. For more details, you can check the genesis.R function, line 113-166. Particularly, line 116-118 is conducting LDSC. (https://github.com/yandorazhang/GENESIS/blob/master/R/genesis.R)

4) For binary traits, genesis() returns heritability in log-odds-ratio scale. To change it to liability scale, we have a function called h2transfer() to transfer heritability in log-odds-ratio scale to liability scale. (https://github.com/yandorazhang/GENESIS/blob/master/R/h2transfer.R)

pjawinski commented 1 year ago

Hi & many thanks for your wonderful work! Just to reconfirm: On your README page, you state that the input GWAS summary statistics should contain the effective sample size; for disease traits, the sample size should be effective sample size, i.e., (# of cases)*(# of controls)/(total # of cases & controls). This is equivalent to the 'more traditional definition' of effective sample size per group divided by 2. Is that correct? Here's an R example:

n1=30000 # cases
n0=120000 # controls
n1*n0/(n1+n0) # your definition of effective sample size reveals 24000
2/(1/n1+1/n0) # traditional definition of effective sample size per group reveals 48000
1/(1/n1+1/n0) # traditional definition of effective sample size per group divided by 2 reveals 24000

So your formula of effective sample size reveals 24000, which falls below the sample size of both the cases and controls group. In fact, it's only half the effective sample size per group calculated via the traditional formula. Is that intended?

Many thanks for your help! Philippe