yuanzhongshang / GIFT

GNU General Public License v3.0

Why hasn't GIFT finished running after a long time? #7

Open HackerLZH opened 3 months ago

HackerLZH commented 3 months ago

Some regions contain only a few genes, yet GIFT_summary has been running on them for more than three days.

GIFT_summary(Zscore1, Zscore2, LDmatrix1, LDmatrix2, n1, n2, gene, pindex, R = R, maxiter = 1000, tol = 1e-4, pleio = 0, ncores = 2, in_sample_LD = FALSE)

LDmatrix1 and LDmatrix2 are both computed from the 1000 Genomes reference panel.

Regularize the LD matrix as (1-s1)*Sigma1+s1*E and (1-s2)*Sigma2+s2*E 
The likelihood failed to increase!: Success
The likelihood failed to increase!: Invalid argument
The likelihood failed to increase!: Invalid argument
The likelihood failed to increase!: Invalid argument
The likelihood failed to increase!: Invalid argument
The likelihood failed to increase!: Invalid argument
The likelihood failed to increase!: Invalid argument
The likelihood failed to increase!: Invalid argument
The likelihood failed to increase!: Invalid argument
The likelihood failed to increase!: Invalid argument
The likelihood failed to increase!: Invalid argument
...

While running, it printed many identical lines like those above. Is that the reason for the slow run? If not, I can provide the code I used to generate the input data. Thank you.

Also, does regularizing the LD matrix need a lot of memory? Some regions ran out of memory (OOM) during LD matrix regularization even though I allocated 30 GB.
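For readers unfamiliar with the regularization message in the log, here is a minimal sketch of what a shrinkage of the form (1-s)*Sigma+s*E does. It is written in plain Python for illustration only and is not GIFT's implementation. It assumes that E is the identity matrix and that s lies between 0 and 1; how GIFT chooses s1 and s2 is not shown in this thread. The function names `regularize_ld` and `det2` are hypothetical.

```python
def regularize_ld(sigma, s):
    """Return (1 - s) * sigma + s * E, with E ASSUMED to be the identity matrix."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    p = len(sigma)
    # A new p x p matrix is built, so the original and the result coexist in memory.
    return [
        [(1.0 - s) * sigma[i][j] + (s if i == j else 0.0) for j in range(p)]
        for i in range(p)
    ]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Two SNPs in perfect LD give a singular correlation matrix (determinant 0).
sigma = [[1.0, 1.0], [1.0, 1.0]]
shrunk = regularize_ld(sigma, s=0.1)  # s = 0.1 is a made-up value

print("determinant before:", det2(sigma))
print("determinant after: ", det2(shrunk))
```

The shrunken matrix keeps a unit diagonal but has smaller off-diagonal entries, so it is no longer singular. Because the result is a second dense matrix of the same size, a step like this can plausibly raise peak memory for regions with many SNPs.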

yuanzhongshang commented 3 months ago

Hi,

Thank you for your attention. It appears that the GIFT model may not be converging, which, as you mentioned, is likely causing the slowdown. We haven't encountered this issue before, so to better understand and resolve it, could you please share your code and input data with us?

I acknowledge that GIFT requires a relatively large amount of memory, especially when the number of SNPs in a region is large. I noticed that you used two cores to perform the gene-based test in parallel, which doubles the memory usage. This is a trade-off between running time and memory consumption. Could you try increasing the memory limit beyond 30 GB?
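As a rough back-of-the-envelope check on the memory point, the sketch below estimates the footprint of dense LD matrices stored as 8-byte doubles. It is plain Python for illustration and makes several assumptions: the matrices are dense, each core holds its own copy (which is how "doubles the memory usage" is read here), and temporary working copies are ignored, so the numbers are lower bounds. The function name `ld_memory_gib` and the SNP counts are hypothetical.

```python
def ld_memory_gib(n_snps, n_matrices=2, ncores=1, bytes_per_value=8):
    """Rough lower bound, in GiB, for dense n_snps x n_snps matrices of doubles."""
    total_bytes = bytes_per_value * n_snps ** 2 * n_matrices * ncores
    return total_bytes / 2 ** 30


# Made-up region sizes; two LD matrices (LDmatrix1 and LDmatrix2) per run.
for p in (5_000, 20_000, 40_000):
    one_core = ld_memory_gib(p, ncores=1)
    two_cores = ld_memory_gib(p, ncores=2)
    print(f"{p} SNPs: {one_core:.2f} GiB with 1 core, {two_cores:.2f} GiB with 2 cores")
```

Memory grows with the square of the SNP count, so halving the number of SNPs in a region cuts this estimate by a factor of four, while dropping from two cores to one only halves it.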

Thanks!

Best, Zhongshang

HackerLZH commented 3 months ago

specify_genes.R.txt: First, I use this script to select the genes in a region and restrict each gene's eQTL summary statistics to that region.

make_ld.R.txt: Then I use this script to build one large LD matrix for the region from the EUR samples of the 1000 Genomes reference panel.

gift_preprocess.R.txt: Then I use this script to generate the input files for GIFT_summary, ensuring that all SNPs are also present in the GWAS summary statistics for the same region.

gift.R.txt: Finally, I use this script to run GIFT (the GWAS sample size n2 is 484598).

other files:

geninfo.txt

The example GWAS summary statistics were downloaded from here, and I added a Z column computed as beta divided by se.
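The Z column described above can be sketched as follows. This is plain Python for illustration, not the script attached to this thread; the field names `snp`, `beta`, and `se` and all values are made up.

```python
def add_z_scores(rows):
    """Return copies of the rows with a Z column equal to beta / se."""
    out = []
    for row in rows:
        if row["se"] <= 0:
            # A zero or negative standard error would give a meaningless Z-score.
            raise ValueError(f"non-positive standard error for {row['snp']}")
        out.append({**row, "Z": row["beta"] / row["se"]})
    return out


# Made-up values for illustration only.
gwas = [
    {"snp": "rs_a", "beta": 0.02, "se": 0.01},
    {"snp": "rs_b", "beta": -0.03, "se": 0.02},
]

for row in add_z_scores(gwas):
    print(row["snp"], row["Z"])
```

The sign of Z follows the sign of beta, so it depends on which allele the GWAS treats as the effect allele.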

The example region spans positions 10240 to 694715 on chr4.

Thanks.

yuanzhongshang commented 3 months ago

Hi,

I have checked the code you provided. All the processing steps are clear. I would like to emphasize the following points:

  1. I have followed your code for this region, and no errors were found. Here are the results:

    #2024-05-28 17:43:48.271489 INFO::GIFT starting...
    #Regularize the LD matrix as (1-s1)*Sigma1+s1*E and (1-s2)*Sigma2+s2*E 
    #2024-05-29 10:23:40.527548 INFO::Done!
    #    gene causal_effect            p
    #1 ZNF718   0.001932508 6.862450e-02
    #2 ZNF732  -0.521545879 1.876842e-05
    #3 ZNF141  -0.439036822 1.268345e-04
    #4 ZNF721  -0.301864656 2.255578e-03
    #5   PIGG  -0.002627982 9.817354e-01
  2. I guess the gene_frame already uses the extended region, as some genes have a begin value lower than zero, such as

    4   -46821  188099  ZNF595  ENSG00000272602

    Please double-check how the extended regions are defined. The more SNPs a region contains, the slower GIFT runs.

  3. An additional task is to harmonize the SNPs. All SNP genotypes should be harmonized between the gene expression data and GWAS data to ensure that the reference and alternate alleles match those in the reference panel. I apologize for forgetting to provide the allele information from the GEUVADIS summary statistics. I have now uploaded the .bim file here.

  4. Because you used an external LD, there are inevitably some LD mismatch SNPs. We used the kriging_rss function in the susieR package to examine the LD mismatches. Fifteen SNPs were identified as LD mismatches. We recommend deleting these SNPs. Additionally, if you are using GWAS summary data from the UK Biobank, you might consider using a better reference panel called UK10K. Here is the link to download it. This reference panel can also make GIFT more accurate and all the 'likelihood failed to increase' issues will disappear. Below are the diagnostic plots using kriging_rss.

    [Diagnostic plots: one panel for the GWAS and one eQTL panel each for ZNF718, ZNF732, ZNF141, ZNF721, and PIGG]

  5. We note that there are no GWAS signals in this region. We recommend performing TWAS fine-mapping only for regions with at least marginal GWAS signals or TWAS signals.
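The allele harmonization described in point 3 can be sketched roughly as below. This is plain Python for illustration and not the procedure used by the GIFT authors. It assumes each SNP carries an effect allele, another allele, and a Z-score, and that the reference panel lists its two alleles in a fixed order. Strand flips and ambiguous A/T or C/G SNPs are deliberately not handled. The function name `harmonize` and all data are hypothetical.

```python
def harmonize(stats, reference):
    """Align Z-scores to the reference panel's allele coding.

    stats:     dict snp -> (effect_allele, other_allele, z)
    reference: dict snp -> (a1, a2) as coded in the reference panel
    Returns a dict snp -> z for SNPs whose alleles match directly or swapped.
    """
    aligned = {}
    for snp, (effect, other, z) in stats.items():
        if snp not in reference:
            continue  # SNP absent from the reference panel: drop it
        a1, a2 = reference[snp]
        if (effect, other) == (a1, a2):
            aligned[snp] = z       # same coding: keep the sign
        elif (effect, other) == (a2, a1):
            aligned[snp] = -z      # alleles swapped: flip the sign
        # Otherwise the alleles disagree with the panel and the SNP is dropped.
    return aligned


# Made-up example data.
reference = {"rs_a": ("A", "G"), "rs_b": ("C", "T"), "rs_c": ("G", "T")}
gwas = {
    "rs_a": ("A", "G", 1.2),   # matches the panel
    "rs_b": ("T", "C", -0.8),  # swapped relative to the panel
    "rs_c": ("G", "A", 2.0),   # alleles disagree with the panel
    "rs_d": ("A", "C", 0.5),   # not in the panel
}

print(harmonize(gwas, reference))
```

Applying the same alignment to both the eQTL and the GWAS Z-scores keeps their signs consistent with the LD matrix computed from the reference panel.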

Thanks!

Best, Zhongshang

yuanzhongshang commented 3 months ago

Hi,

Here is the result following the steps above. This result looks more convincing.

2024-05-29 17:02:34.432105 INFO::GIFT starting...
Regularize the LD matrix as (1-s1)*Sigma1+s1*E and (1-s2)*Sigma2+s2*E 
2024-05-30 00:53:22.45883 INFO::Done!
    gene causal_effect         p
1 ZNF718  0.0016685087 0.4367735
2 ZNF732  0.0898905695 0.5642811
3 ZNF141  0.0144310625 1.0000000
4 ZNF721  0.0002654957 1.0000000
5   PIGG -0.0180618620 0.9247493

Thank you.

Best, Zhongshang