zkutalik / ssimp_software

GNU General Public License v3.0

build counts clarification #113

Open anujgoel1 opened 3 years ago

anujgoel1 commented 3 years ago

Hello, I am using version 0.5.5 of your software on GWAS data on build 19, and my reference panel is also build 19. However, the log file reports some numbers, and I am not sure whether they are something to be concerned about or can be considered a normal coincidence. Please advise. Many thanks in advance. Best wishes, Anuj. Log file:

[ssimp-0.5.5     (2018-12-26)]
file_name:../hg19.geno.gwas.txt
... loading the 1.7GB database of positions under three builds. This will take about a minute.  Loaded.
Estimating which build (hg18/hg19/hg37) of the reference panel and the GWAS file, in case it is necessary to modify the GWAS file to match the reference panel
some_records_from_each_chromosome.size():1000
(ref) :count_of_hg18_0based,count_of_hg19_0based,count_of_hg20_0based,count_of_hg18_1based,count_of_hg19_1based,count_of_hg20_1based:   0 , 0 , 0 , 0 , 975 , 975
gwas_all_chrpos.size():500536
(gwas):count_of_hg18_0based,count_of_hg19_0based,count_of_hg20_0based,count_of_hg18_1based,count_of_hg19_1based,count_of_hg20_1based:   1408 , 2238 , 1352 , 1409 , 41540 , 1464 ----> These are the counts I am concerned about. They don't add up either.
which_build_gwas,which_build_ref:       hg19_1 , hg19_1
gwas_count_known,gwas_count_unknown:    500536 , 0
Delete the SNPs with unknown position ...
gwas->number_of_snps():500536

zkutalik commented 3 years ago

Sorry, I have only seen your message today. The GWAS build is clearly determined, so there is no issue there. Can you tell me which reference panel you are using?
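For readers following along, the reasoning can be sketched in Python using the (gwas) counts copied from the log above. The label list mirrors the log header; choosing the label with the largest count is an assumption made for illustration, not a description of how ssimp actually selects the build.

```python
# Labels in the order they appear in the ssimp log line.
LABELS = ["hg18_0based", "hg19_0based", "hg20_0based",
          "hg18_1based", "hg19_1based", "hg20_1based"]

# (gwas) counts copied from the log in this issue.
gwas_counts = dict(zip(LABELS, [1408, 2238, 1352, 1409, 41540, 1464]))

def best_build(counts):
    """Return the label with the most matches (assumed selection rule)."""
    return max(counts, key=counts.get)

best = best_build(gwas_counts)
# Second-largest count, to show how far ahead the winner is.
runner_up = sorted(gwas_counts.values())[-2]
ratio = gwas_counts[best] / runner_up

print(best, round(ratio, 1))
```

Under that assumption, hg19 with 1-based coordinates leads the next candidate by a factor of more than 18, which agrees with the `which_build_gwas` value of `hg19_1` in the log.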

anujgoel1 commented 3 years ago

Thanks for getting back. My reference panel is derived from hg19 UK Biobank imputed data that I have converted to a high-imputation-quality-call best-guess genotype VCF file.

(ref) :count_of_hg18_0based,count_of_hg19_0based,count_of_hg20_0based,count_of_hg18_1based,count_of_hg19_1based,count_of_hg20_1based:   0 , 0 , 0 , 0 , 975 , 975 ---> Not getting mapped to hg19 (either 0 or 1 based)
(gwas):count_of_hg18_0based,count_of_hg19_0based,count_of_hg20_0based,count_of_hg18_1based,count_of_hg19_1based,count_of_hg20_1based:   1408 , 2238 , 1352 , 1409 , 41540 , 1464 ----> All over the place, and they don't add up either.

Looking at the above two rows in the log file, it seems that the "ref" is not getting mapped to hg19 (either 0- or 1-based). The "gwas" is genotyped data on hg19 (Illumina chip).

I am a bit confused by these counts, hence this issue. Many thanks for looking into it.

zkutalik commented 3 years ago

Thanks for sharing these details. Indeed there seems to be some problem with the ref panel. It is quite unusual to use UKB imputed data as ref panel. Just to be sure, could you try with 1KG as ref panel to see if the counts look better for that one?

The counts do not need to add up, since the same position can match both an existing hg18 SNP position and an existing hg19 SNP position. That's normal. The GWAS sample seems to be all fine.
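A minimal sketch of why the counts need not add up, using made-up toy positions (the `known_positions` sets and the `count_matches` helper are hypothetical and only stand in for the position database that ssimp loads):

```python
# Toy (chromosome, position) sets standing in for the per-build
# position database; the values are invented for illustration only.
known_positions = {
    "hg18_1based": {(1, 100), (1, 250), (2, 400)},
    "hg19_1based": {(1, 100), (1, 300), (2, 400), (2, 500)},
}

gwas_positions = [(1, 100), (1, 300), (2, 400), (2, 500), (3, 999)]

def count_matches(positions, known):
    """Count, per build, how many query positions exist in that build."""
    return {build: sum(p in sites for p in positions)
            for build, sites in known.items()}

counts = count_matches(gwas_positions, known_positions)
print(counts)  # {'hg18_1based': 2, 'hg19_1based': 4}
```

Here (1, 100) and (2, 400) are counted under both builds, while (3, 999) is counted under neither, so the per-build counts sum to 6 for 5 input positions. The counts are independent tallies, not a partition of the input.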

anujgoel1 commented 3 years ago

Thanks for the tip. I'll give 1KG a go. My only motivation for using UKB was to impute my genotyped summary statistics to the HRC panel, which is missing in 1KG. I am presuming that the lack of "phasing" in my UKB reference panel should not affect the imputed summary statistics?