Overlapping cis-regions

jokamoto97 commented 7 months ago

Hello,

I am interested in applying GIFT to an individual-level data set. The data are structured as follows:

X: A 706 x 39 matrix (expression matrix for 39 genes and 706 individuals)
Y: A 706-vector for the GWAS trait
Zx: A 706 x 195,176 matrix (genotype matrix for 195,176 SNPs and 706 individuals)
Zy: Same as Zx

The cis regions for the 39 genes overlap, so the genotype matrices contain some redundant SNPs (I concatenated the cis genotype matrix for each gene in the order of the genes in the expression matrix). The mean (SD) number of cis SNPs per gene is 5005 (750).

I tried to apply GIFT_individual to my data, but I am getting the error: Mat::init(): requested size is too large; suggest to enable ARMA_64BIT_WORD. I have two questions at this point:

Can GIFT handle scenarios where genes have overlapping cis regions, as in my case?
Can GIFT handle the size of the genomic region and the number of genes I am working with?

Best, Jeff

yuanzhongshang commented 7 months ago

Hi Jeff,

Thanks for your interest in GIFT.

GIFT is able to handle genes with overlapping cis-SNPs. You can concatenate the cis genotype matrix for each gene in the order of the genes in the expression matrix.
In the TWAS framework, GIFT analyzes one region at a time, and these regions are defined by LDetect, which we also provide here. The size of the genomic region may be fixed. You can also use your own definition of regions. In each region, GIFT performs well. The computational efficiency of GIFT is primarily influenced by the number of SNPs and the sample size of GWAS. I just quickly checked some regions in the real data analysis, and I found GIFT can handle at least 11,270 SNPs with a GWAS sample size of approximately 330,000.
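As an illustration of the concatenation described above, here is a minimal Python sketch. The gene names, SNP IDs, and the helper `concatenate_cis_snps` are hypothetical and not part of GIFT; the sketch only shows that SNPs shared between overlapping cis-regions are kept once per gene, so the concatenated matrix has more columns than there are unique SNPs.

```python
# Hypothetical cis-SNP sets for three genes whose cis windows overlap.
cis_snps_by_gene = {
    "geneA": ["rs1", "rs2", "rs3"],
    "geneB": ["rs2", "rs3", "rs4"],
    "geneC": ["rs4", "rs5"],
}
# Gene order must match the column order of the expression matrix.
gene_order = ["geneA", "geneB", "geneC"]


def concatenate_cis_snps(cis_snps_by_gene, gene_order):
    """Return the concatenated SNP column order and the SNP count per gene.

    Shared SNPs are deliberately NOT deduplicated: each gene contributes
    its full cis block, in the order given by gene_order.
    """
    columns = []
    snps_per_gene = []
    for gene in gene_order:
        snps = cis_snps_by_gene[gene]
        columns.extend(snps)
        snps_per_gene.append(len(snps))
    return columns, snps_per_gene


columns, snps_per_gene = concatenate_cis_snps(cis_snps_by_gene, gene_order)
n_unique = len(set(columns))
print(len(columns), n_unique, snps_per_gene)
```

The same column order would then be used to select genotype columns for both Zx and Zy.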

The error message suggests enabling ARMA_64BIT_WORD. Please check the version of your C++ compiler. Under C++11 and C++14, Armadillo now defaults to using int64_t for integers. One straightforward way to address the issue is to add the line "#define ARMA_64BIT_WORD 1" to RcppArmadilloConfig.h. This file is located in the include folder of the RcppArmadillo R package, for example /home/R/x86_64-pc-linux-gnu-library/4.0/RcppArmadillo/include.
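A quick way to see why the size check can fail is to count matrix elements against a 32-bit index limit. The Python sketch below assumes that the error is triggered when the number of elements exceeds 2^32 - 1 and that a SNP-by-SNP matrix is formed internally; both are assumptions for illustration, not statements about the GIFT source code.

```python
# Largest element count addressable with a 32-bit unsigned index (assumed limit).
MAX_32BIT_ELEMENTS = 2**32 - 1


def exceeds_32bit_index(n_rows: int, n_cols: int) -> bool:
    """Return True if an n_rows x n_cols matrix has more elements
    than a 32-bit unsigned index can address."""
    return n_rows * n_cols > MAX_32BIT_ELEMENTS


n_individuals, n_snps = 706, 195_176

# The genotype matrix itself (individuals x SNPs) fits comfortably.
genotype_too_big = exceeds_32bit_index(n_individuals, n_snps)

# A SNP-by-SNP matrix over the whole region does not.
snp_by_snp_too_big = exceeds_32bit_index(n_snps, n_snps)

print(genotype_too_big, snp_by_snp_too_big)
```

Under these assumptions the 706 x 195,176 genotype matrix is well within the limit, while a 195,176 x 195,176 matrix is not, which is consistent with the error appearing only for the full region.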

Please let me know if you have further questions.

Best, Zhongshang

jokamoto97 commented 6 months ago

Hi Zhongshang,

Thanks for the helpful response! I followed your suggestion to add #define ARMA_64BIT_WORD 1 to the RcppArmadilloConfig.h file, and I no longer get the Mat::init(): requested size is too large; suggest to enable ARMA_64BIT_WORD error.

The error I get now is Error: std::bad_alloc.

I am not sure if this will be helpful information, but I tried subsetting all the cis-regions to contain only 20 SNPs to see if the function would run (the region size is now 20 SNPs x 39 genes = 780 SNPs). The function ran without any errors in this case.

Do you know what might be causing this error?

Best, Jeff

yuanzhongshang commented 6 months ago

Hi Jeff,

This error occurs when a memory allocation fails. The most likely cause is insufficient memory: the system may not have enough available memory to fulfill the allocation request. Notably, the number of included SNPs is very large, so the LD matrix is a 195,176 by 195,176 matrix, and the computations involving this matrix take up a lot of memory. Here are some recommendations:

If you ran the analysis with parallel computing, please do not set ncores greater than 1, because the memory consumed by parallel computing is proportional to the number of cores.
I suggest using the rm() function to delete all variables other than the GIFT inputs. This can help free up memory.
Try to reduce the number of SNPs. Use alternative region divisions to decrease the region size: you can apply LDetect to your own data for region partitioning, or consider the shorter regions provided by McManus et al., Cell Genomics, 2023. The average number of SNPs per gene is 195,176/39 ≈ 5,005, and the window size of the cis-region for each gene determines the number of cis-SNPs. Following GIFT, you may extract the cis-SNPs of each gene that lie within either 100 kb upstream of its transcription start site or 100 kb downstream of its transcription end site. Also consider excluding SNPs in high LD using PLINK (e.g., r^2 > 0.9), excluding SNPs that are not marginally associated with the trait of interest (e.g., P > 0.05), and so on.
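To give a sense of scale for these recommendations, the following Python sketch estimates the memory needed to hold one dense SNP-by-SNP matrix. It assumes 8-byte double-precision entries and, for the multi-core case, one copy of the matrix per core; the helper name `dense_matrix_gib` is made up for illustration, and the actual memory use of GIFT may differ.

```python
BYTES_PER_DOUBLE = 8  # assumed storage per matrix entry


def dense_matrix_gib(n_rows, n_cols, n_copies=1, bytes_per_element=BYTES_PER_DOUBLE):
    """Approximate memory in GiB for n_copies dense n_rows x n_cols matrices."""
    return n_rows * n_cols * bytes_per_element * n_copies / 2**30


# Full region from this thread: 195,176 SNPs.
full_ld_gib = dense_matrix_gib(195_176, 195_176)

# Reduced run that completed without errors: 20 SNPs x 39 genes = 780 SNPs.
subset_ld_gib = dense_matrix_gib(780, 780)

# Same full region with 4 cores, assuming one matrix copy per core.
full_ld_4cores_gib = dense_matrix_gib(195_176, 195_176, n_copies=4)

print(f"full region:   {full_ld_gib:.1f} GiB")
print(f"780-SNP test:  {subset_ld_gib:.4f} GiB")
print(f"full, 4 cores: {full_ld_4cores_gib:.1f} GiB")
```

Under these assumptions a single 195,176 x 195,176 matrix needs roughly 284 GiB, while the 780-SNP test needs only a few MiB. Because the cost grows with the square of the SNP count, halving the number of SNPs cuts the requirement to about a quarter.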

Hope it helps! Please feel free to ask if you have further questions!

Best, Zhongshang

jokamoto97 commented 6 months ago

Hi Zhongshang,

Thanks for the help! I tried your recommendations, and I no longer get any errors.

Best, Jeff

yuanzhongshang / GIFT
