zkutalik / ssimp_software

GNU General Public License v3.0
15 stars 10 forks source link

which_build_t:: unknown' failed. #61

Open jimmyzliu opened 6 years ago

jimmyzliu commented 6 years ago

Hi,

I am trying to impute a small region on chromosome 21 using a custom reference panel of ~500 individuals. I receive the following error:

$ ssimp-linux-0.3 \
> --gwas chr21chunk.txt \
> --ref custom.21.vcf.gz \
> --out test.txt

file_name:chr21chunk.txt
... loading the 1.7GB database of positions under three builds. This will take about a minute.  Loaded.
Estimating which build (hg18/hg19/hg37) of the reference panel and the GWAS file, in caseit is necessary to modify the GWAS file to match the reference panel
some_records_from_each_chromosome.size():0
count_of_hg18_0based,count_of_hg19_0based,count_of_hg20_0based,count_of_hg18_1based,count_of_hg19_1based,count_of_hg20_1based:    0 , 0 , 0 , 0 , 0 , 0
gwas_all_chrpos.size():4,913
count_of_hg18_0based,count_of_hg19_0based,count_of_hg20_0based,count_of_hg18_1based,count_of_hg19_1based,count_of_hg20_1based:    153 , 280 , 153 , 160 , 4,912 , 172
which_build_gwas,which_build_ref:       hg19_1 , unknown
ssimp-linux-0.3: src/ssimp.cc:713: int main(int, char**): Assertion `which_build_ref != ssimp:: which_build_t:: unknown' failed.
Aborted (core dumped)

Does the program think there are some SNPs on hg20? But this is definitely not the case. Moreover, I've ensured that all SNP IDs, positions and alleles in my GWAS file match those in the reference panel VCF.

Any help would be very much appreciated. Thank-you!

Jimmy

sinarueeger commented 6 years ago

Hi Jimmy We had similar issues since launching the new version (0.3), and I think it is tied to a database that was created to ensure compatibility with different position versions (hg20 etc). So indeed you probably did all correctly - the software is buggy. The developer has changed jobs and we are currently looking for a new developer that is familiar with C++. Apologies for all the extra work! Sina

jimmyzliu commented 6 years ago

Hi Sina,

Thanks for the response. Looking forward to the fix!

Jimmy

aaron-mcdaid-zalando commented 6 years ago

Hi @jimmyzliu , Does this bug still exist for you? Would you be able to send the files that you use? I know you might be reluctant to share your GWAS file, but perhaps you could replace the effect column ('z' or 'beta') with zeroes and share it with us? If I can replicate the bug on my laptop, then it is very likely that I will be able to fix it

sinarueeger commented 6 years ago

Update: @aaronmcdaid has fixed the bug and there is now a new version 0.4 available. Please let us know if the bug still exists for you when using the new version.

xtmgah commented 6 years ago

@sinarueeger Hello, I have the same issue using the updated version (0.4, the compiled linux version). All chromosome were completed except chromosome 22. Can you help to fix this? Thanks .

file_name:gwas_chr22.txt.gz ... loading the 1.7GB database of positions under three builds. This will take about a minute. Loaded. Estimating which build (hg18/hg19/hg37) of the reference panel and the GWAS file, in case it is necessary to modify the GWAS file to match the reference panel some_records_from_each_chromosome.size():100 count_of_hg18_0based,count_of_hg19_0based,count_of_hg20_0based,count_of_hg18_1based,count_of_hg19_1based,count_of_hg20_1based: 0 , 0 , 0 , 0 , 0 , 0 gwas_all_chrpos.size():366,826 count_of_hg18_0based,count_of_hg19_0based,count_of_hg20_0based,count_of_hg18_1based,count_of_hg19_1based,count_of_hg20_1based: 11,860 , 26,686 , 12,524 , 11,745 , 354,858 , 12,601 which_build_gwas,which_build_ref: hg19_1 , unknown ssimp: src/ssimp.cc:723: int main(int, char**): Assertion `which_build_ref != ssimp:: which_build_t:: unknown' failed.

aaronmcdaid commented 6 years ago

Hi @xtmgah , Another user had a similar problem, where it worked on every chromosome except chromosome 22, which was fixed by deleting the ~/reference_panels folder and allowing it to be downloaded again with:

  ssimp your_gwas.txt 1KG/EUR output.txt

See issue https://github.com/zkutalik/ssimp_software/issues/64. Perhaps this will work for you too? I agree this is a strange solution, I don't know what it applies to only one chromosome!

sinarueeger commented 6 years ago

@jimmyzliu @xtmgah is this still an issue or did the fix work?

jimmyzliu commented 6 years ago

Hi,

Thank you for looking into this and sorry for the delayed response. I've had a chance to try this again and I'm still getting the same error as my original post. I'm using v0.5 and re-downloaded database.of.builds.1kg.uk10k.hrc.2018.01.18.bin.

My GWAS file works fine with 1000 Genomes reference, so I suspect it has something to do with my custom reference panel VCF. @aaronmcdaid Would it be ok if I shared the first few lines of my reference VCF with you?

Thanks, Jimmy

aaronmcdaid commented 6 years ago

Hi @jimmyzliu ,

Would it be ok if I shared the first few lines of my reference VCF with you?

Certainly! Anything to help replicate the issue. If you could send some data (gwas and reference) and the command line you use, that would be great. I'm at aaron.mcdaid@gmail.com

(We recently release version 0.5.1, it's not very excited but it does have some slightly better checking and it might help)

Jeremy37 commented 5 years ago

Hi Aaron, have you managed to look into this? I work with Jimmy, and can send you some example data if that would help. I'm using version 0.5.2, and I am having the same problem.

aaronmcdaid commented 5 years ago

Hi Jeremy, Thanks for the data you have sent.

I can replicate the problem and I get the same error as you.

Unfortunately, there are very many different causes that lead to the which_build_ref != ssimp:: which_build_t error. We hope to improve the error messages in future!

In this case, ssimp is unable to read any data from the reference panel file. I don't yet know exactly why, I'm currently playing with the code a little bit to try to find out

aaronmcdaid commented 5 years ago

[On second thoughts, I think this comment by me is incorrect! But I'll leave it here for the record]

@Jeremy37 , The reference panel you provide has many entries with ./. I don't know what they are supposed to mean, but I think that might be part of the problem. Does this mean that the genotype is unknown at that position for that individual?

(I'm not 100% certain this is the exact cause of the problem, but it might be)

We currently discard SNPs from the reference panel if they have any missing data. (We do this by passing the DISCARD_MISSING_GT option to libStatGen') (And theDISCARD_FILTERED` rule also seems to be relevant to this problem, but I'm not sure I understand it!)

@sinarueeger , @Jeremy37 , do you have anything thoughts on this? We have designed ssimp to be able to handle missingness in the gwas file, but I don't think we considered missingness in the reference panel.

aaronmcdaid commented 5 years ago

I've got a fix now, @Jeremy37 . If I remove the DISCARD_FILTERED option (this option is passed by ssimp to libStatGen), then your example data works. I'm currently rerunning all our tests, in order to ensure that I haven't broken anything else.

To be honest, I don't know exactly what DISCARD_FILTERED did! I can see that your vcf file has . instead of PASS in the FILTER column. Previously, ssimp discarded such rows.

Do you think we should generally ignore the FILTER in the vcf file? Or should we use only rows with FILTER=PASS?

sinarueeger commented 5 years ago

@zkutalik do you have any thoughts on this?

aaronmcdaid commented 5 years ago

@sinarueeger , @zkutalik , (and anyone else of course!), I suggest that we continue the discussion in this pull request, which is focused exclusively on the question of how to interpret the FILTER field in the vcf

This symptom has a number of different causes, and that PR is focused on one possible cause

Jeremy37 commented 5 years ago

I assume that genotypes with ./. mean that there's no genotype call for that individual at that site. There could be various reasons I guess. I'll follow up with Jimmy since he's familiar with the UK Biobank data. I would have expected that all of these samples have already been imputed to a common reference panel, but maybe that's not the case. Regarding the filter field, I don't think that it has to be "PASS" to indicate that filters have passed. I'm pretty sure that I've seen "." previously, to mean that no filters are indicated for a given variant, so I would take that as equivalent to PASS.

So you're saying that the filter field was the main problem, and not the ./. genotypes in the reference panel? Also, I note that in the VCF I provided, genotypes are e.g. 0/1, and not 0|1. The / usually indicates unphased genotypes. That could relate to how we extracted these VCFs from the original Plink files. Do you think it's a problem?

Thanks for all your help, Jeremy

On Fri, Nov 16, 2018 at 9:31 AM Aaron McDaid notifications@github.com wrote:

@sinarueeger https://github.com/sinarueeger , @zkutalik https://github.com/zkutalik , (and anyone else of course!), I suggest that we continue the discussion in this pull request https://github.com/zkutalik/ssimp_software/pull/94, which is focused exclusively on the question of how to interpret the FILTER field in the vcf

This symptom has a number of different causes, and that PR is focused on one possible cause

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/zkutalik/ssimp_software/issues/61#issuecomment-439334837, or mute the thread https://github.com/notifications/unsubscribe-auth/AKj5aOkvD_QVHtd0-uNEZ4Wjs1tw0iRNks5uvoYJgaJpZM4UOj-E .

aaronmcdaid commented 5 years ago

So you're saying that the filter field was the main problem, and not the ./. genotypes in the reference panel?

Yes. It took a few experiments to figure this out. The ./. doesn't matter. Sorry for the confusion!

The only issue was the FILTER field. This morning, I edited the code so that it ignores the field.

Do you use Linux? If so, I can send an updated binary this afternoon. I don't have access to a Mac, so I can't be certain when a new Mac binary will be available

Aaron

On Fri, 16 Nov 2018, 11:14 Jeremy37 <notifications@github.com wrote:

I assume that genotypes with ./. mean that there's no genotype call for that individual at that site. There could be various reasons I guess. I'll follow up with Jimmy since he's familiar with the UK Biobank data. I would have expected that all of these samples have already been imputed to a common reference panel, but maybe that's not the case. Regarding the filter field, I don't think that it has to be "PASS" to indicate that filters have passed. I'm pretty sure that I've seen "." previously, to mean that no filters are indicated for a given variant, so I would take that as equivalent to PASS.

So you're saying that the filter field was the main problem, and not the ./. genotypes in the reference panel? Also, I note that in the VCF I provided, genotypes are e.g. 0/1, and not 0|1. The / usually indicates unphased genotypes. That could relate to how we extracted these VCFs from the original Plink files. Do you think it's a problem?

Thanks for all your help, Jeremy

On Fri, Nov 16, 2018 at 9:31 AM Aaron McDaid notifications@github.com wrote:

@sinarueeger https://github.com/sinarueeger , @zkutalik https://github.com/zkutalik , (and anyone else of course!), I suggest that we continue the discussion in this pull request https://github.com/zkutalik/ssimp_software/pull/94, which is focused exclusively on the question of how to interpret the FILTER field in the vcf

This symptom has a number of different causes, and that PR is focused on one possible cause

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/zkutalik/ssimp_software/issues/61#issuecomment-439334837 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AKj5aOkvD_QVHtd0-uNEZ4Wjs1tw0iRNks5uvoYJgaJpZM4UOj-E

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/zkutalik/ssimp_software/issues/61#issuecomment-439346902, or mute the thread https://github.com/notifications/unsubscribe-auth/AAD7XrEEOZDtCOVtcPCm_GARK_uEQnSFks5uvpAFgaJpZM4UOj-E .

aaronmcdaid commented 5 years ago

@Jeremy37 , @jimmyzliu , In the last few minutes, I have merged a fix (including an updated Linux binary, but not yet a Mac binary) to the master branch here. It simply ignores the FILTER field, and it worked for me with your sample data. Can you confirm it works?

(Or, if you use Mac, then maybe @sinarueeger can make a new Mac binary for you. Alternatively, you can run make with the new code:

git co master
git pull --ff-only
make
bin/ssimp # the recompiled binary, made by 'make', is here
Jeremy37 commented 5 years ago

Thanks for the fix, it seems to be working! (Currently processing...)

On Fri, Nov 16, 2018 at 12:37 PM Aaron McDaid notifications@github.com wrote:

So you're saying that the filter field was the main problem, and not the ./. genotypes in the reference panel?

Yes. It took a few experiments to figure this out. The ./. doesn't matter. Sorry for the confusion!

The only issue was the FILTER field. This morning, I edited the code so that it ignores the field.

Do you use Linux? If so, I can send an updated binary this afternoon. I don't have access to a Mac, so I can't be certain when a new Mac binary will be available

Aaron

On Fri, 16 Nov 2018, 11:14 Jeremy37 <notifications@github.com wrote:

I assume that genotypes with ./. mean that there's no genotype call for that individual at that site. There could be various reasons I guess. I'll follow up with Jimmy since he's familiar with the UK Biobank data. I would have expected that all of these samples have already been imputed to a common reference panel, but maybe that's not the case. Regarding the filter field, I don't think that it has to be "PASS" to indicate that filters have passed. I'm pretty sure that I've seen "." previously, to mean that no filters are indicated for a given variant, so I would take that as equivalent to PASS.

So you're saying that the filter field was the main problem, and not the ./. genotypes in the reference panel? Also, I note that in the VCF I provided, genotypes are e.g. 0/1, and not 0|1. The / usually indicates unphased genotypes. That could relate to how we extracted these VCFs from the original Plink files. Do you think it's a problem?

Thanks for all your help, Jeremy

On Fri, Nov 16, 2018 at 9:31 AM Aaron McDaid notifications@github.com wrote:

@sinarueeger https://github.com/sinarueeger , @zkutalik https://github.com/zkutalik , (and anyone else of course!), I suggest that we continue the discussion in this pull request https://github.com/zkutalik/ssimp_software/pull/94, which is focused exclusively on the question of how to interpret the FILTER field in the vcf

This symptom has a number of different causes, and that PR is focused on one possible cause

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <

https://github.com/zkutalik/ssimp_software/issues/61#issuecomment-439334837

, or mute the thread <

https://github.com/notifications/unsubscribe-auth/AKj5aOkvD_QVHtd0-uNEZ4Wjs1tw0iRNks5uvoYJgaJpZM4UOj-E

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub < https://github.com/zkutalik/ssimp_software/issues/61#issuecomment-439346902 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AAD7XrEEOZDtCOVtcPCm_GARK_uEQnSFks5uvpAFgaJpZM4UOj-E

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/zkutalik/ssimp_software/issues/61#issuecomment-439380661, or mute the thread https://github.com/notifications/unsubscribe-auth/AKj5aPRXTGdR22J5w3VlZi85IgeyHJ5lks5uvrGWgaJpZM4UOj-E .

Jeremy37 commented 5 years ago

@aaronmcdaid Do you know what happens to lines in the reference panel that include a ./.? From looking at a few examples, it looks to me as if all of these lines are dropped. When I do imputation with my UK biobank reference panel (a subsample of 10,000 individuals from the 500k), I only get imputed output for SNPs that have zero missingness in the reference panel. That is, lines from the input GWAS file are even dropped when any of the reference panel samples have a ./., I think.

aaronmcdaid commented 5 years ago

I think you're right, that such lines are completely dropped from the reference panel. When designing this software we considered missingness in the GWAS, but we didn't consider missingness in the reference panel.

In principle, we could compute correlation between such SNPs in the reference panel. But some other formulae would be difficult as we would no longer have a fixed number for "size of the reference panel".

So basically, I don't think this can be easily changed. We would need to revisit some of the formulae. Any thoughts, @sinarueeger @zkutalik ?

sinarueeger commented 5 years ago

Sample size of the reference panel is relevant for

zkutalik commented 5 years ago

If there is low missingness rate (and little variability across SNVs), we could (should in theory) calculate the correlations in the ref panel. But, as Sina said, it is not top priority. Do you have an estimate how many SNVs we lose due to this?

Jeremy37 commented 5 years ago

I'll follow up with Jimmy when he's back. If there is any missingness, then the degree will scale with the size of the reference panel. What I was doing is to subsample 10,000 samples from the 500k samples in UK biobank. With this subsample, about 95% of SNPs have nonzero missingness, so nearly everything is lost. But presumably there is some way for me to find UKBB data that has already been imputed, so that there should be no missingness. I'm not yet experienced with UK biobank (and don't have access myself), so I'll have to find out from others.

Jeremy

On Tue, Nov 20, 2018 at 8:54 AM zkutalik notifications@github.com wrote:

If there is low missingness rate (and little variability across SNVs), we could (should in theory) calculate the correlations in the ref panel. But, as Sina said, it is not top priority. Do you have an estimate how many SNVs we lose due to this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/zkutalik/ssimp_software/issues/61#issuecomment-440192994, or mute the thread https://github.com/notifications/unsubscribe-auth/AKj5aNfz2ZWPjz7tzPiPM0O-7L4EBC_Hks5uw8M-gaJpZM4UOj-E .

zkutalik commented 5 years ago

Good point, with increasing ref panel size, the problem becomes more burning. UKB data has been imputed to ~100M variants - so there it shouldn't be an issue.

Jeremy37 commented 5 years ago

Hi all. So even though the UKB data are imputed, not all variants have high confidence genotypes for all samples. You get genotype likelihoods for each SNP/individual. Does SSimp use genotype likelihoods (to get genotype dosages) from the VCF file for it's correlation matrix computation? Or do you only use the hard genotype calls? I currently have only hard genotype calls now, but in principle I could get genotype likelihoods to add to the reference panel VCF. It would be great if you could add functionality to compute SNP correlations for SNPs that have some missing genotypes. In the great majority of cases this should be quite safe, since only a small fraction of samples are missing for each SNP. For example, looking at 340,000 SNPs at one of my regions of interest, 95% have at least ONE sample with a missing genotype. However, only 0.8% of the SNPs have more than 5% of samples missing the genotype, and fewer than 0.1% have at least 30% missingness. What do you think?

zkutalik commented 5 years ago

I think , since the missingness is so low, there is no point imputing the missing values, simply use a pairwise.complete.obs type of correlation calculation.

@aaronmcdaid would it be time-consuming to change the correlation computation?

aaronmcdaid commented 5 years ago

It seems that GSL does not directly support missing values in the correlation computation (here is the GSL function I use . I tested it and it just said the correlation nan

Just to confirm: in every case where the correlation between two vectors is being computed, only use individuals with non-missing values in both vectors? I can change that easily enough I think; but I'll have to write some function to delete those entries explicitly.

So yes, it's doable

One risk is that it leaves a matrix which is not positive-definite

Jeremy37 commented 5 years ago

Yes, that's what would be ideal - using the "pairwise complete observations" as one would say in R. So this might be a different set of individuals for each pair of SNPs. I can see that in C++ this is a bit of a pain. E.g. iterate over the two vectors to get the set of non-missing indices, and then copy those items into two new vectors before doing the correlation. The efficient way to do it I guess would be to preallocate two vectors (of length N individuals) once, and copy into them for each correlation computation, to avoid overhead of allocating / deallocating memory. You would probably want to set some cutoff, e.g. automatically exclude SNPs with more than 50% or 25% missingness or something.

On Fri, Dec 7, 2018 at 1:30 PM Aaron McDaid notifications@github.com wrote:

It seems that GSL does not directly support missing values in the correlation computation (here is the GSL function https://www.gnu.org/software/gsl/manual/html_node/Correlation.html I use . I tested it and it just said the correlation nan

Just to confirm: in every case where the correlation between two vectors is being computed, only use individuals with non-missing values in both vectors? I can change that easily enough I think; but I'll have to write some function to delete those entries explicitly.

So yes, it's doable

One risk is that it leaves a matrix which is not positive-definite

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/zkutalik/ssimp_software/issues/61#issuecomment-445233121, or mute the thread https://github.com/notifications/unsubscribe-auth/AKj5aEy-wlxmhV_Qh3JJik0hGofs0XT1ks5u2m2DgaJpZM4UOj-E .

zkutalik commented 5 years ago

Thanks, Aaron, that would be great. Indeed, it may not be pos def. If we set missingness rate <5%, you could simply replace missing values with the mean genotype (2*AF) and go ahead as before. Otherwise, would have a second matrix storing the non-missing positions and replace all NAs with zeros, then rescale based on the product of the non-missingness matrix... I don't know what's better.