zkutalik / ssimp_software

GNU General Public License v3.0
15 stars 10 forks source link

chr 23 / X?? #22

Closed sinarueeger closed 6 years ago

aaronmcdaid commented 6 years ago

I did do something about chr23/X. But I forget exactly what. In certain places in the code, an 'X' is converted to (or from) '23'. But I don't think I made a comprehensive test. So maybe it already works fully, or maybe it's incomplete

sinarueeger commented 6 years ago

Will do some testing.

sinarueeger commented 6 years ago

The sex chromosomes cannot be imputed currently. There are two problems to this:

imputing an rs number, using RSID GWAS

./compiled/ssimp-linux-0.1 gwas/small.random.x.rsid.csv /home/srueger/reference_panels/UK10K/_EGAZ00001016630_X.TWINSUK.beagle.anno.csq.shapeit.20131101.vcf.gz output.txt --impute.snps <(echo rs41297271 | tr ' ' '\n')

Imputing POS:CHR, using pos:chr GWAS

./compiled/ssimp-linux-0.1 gwas/small.random.x.pos.csv /home/srueger/reference_panels/UK10K/_EGAZ00001016630_X.TWINSUK.beagle.anno.csq.shapeit.20131101.vcf.gz output.txt --impute.snps <(echo X:2700089 | tr ' ' '\n')

Currently, the X chromosome is coded as X, and not as a number, so that might need to change.

aaronmcdaid commented 6 years ago

The build does have a few rs-numbers from X and Y. But it's only 66 from X and 9 from Y. I have no idea why that is so low. A total of zero would make more sense!

Perhaps SNPs on those chromosomes are not typically given rs numbers?

Also, we used only SNPs present in 1KG, HRC, or UK10K. Perhaps that is relevant? Perhaps those have very few X/Y SNPs

So, if we want full X/Y support, I guess we need to find more of them and recreate the build database

I'll ignore this for now, as you said this is less urgent. We'd really need to talk on Skype about it

sinarueeger commented 6 years ago

Sina will check why so few rs-numbers from X and Y using the bed files in her public.data/dbsnp/ folder

aaronmcdaid commented 6 years ago

Aaron is rerunning the script which makes the build database, making sure the input files are in the right place, and writing a brief note to document how the system works. He won't improve or extend it, just document the existing system

sinarueeger commented 6 years ago

the file public.data/dbsnp/hg18_hg19_hg20 (pooled version of all the bed files) contains 11'402'643 SNPs on the x chromsome (!!).

aaronmcdaid commented 6 years ago

I've cleaned up the script a little, pushed to a new branch clean.up.the.builddb.script , including a note at the top of rs.chr.pos.database.stu to explain the system in more detail. I'm pretty sure it's all correct, but I'd like to allow it to fully complete and then verify the final resulting database is identical to the one we have made available online.

To follow up the Xs, I suggest getting a few hundred X rs-numbers from 1KG, then verify they are in my /data/sgg2/aaron/homedir/Code/MyCode/ssimp_software/rs.chr.pos.database/rss.in.1kg.uk10k.hrc.txt and then verify that at least one of your three liftover files has them (and has them on X).

Also, my system discards any SNP where two builds disagree about which chromosome the SNP is on. I do not, of course, require that all three builds have every SNP. But if a SNP is in two builds, but they disagree on the chromosome, then I discard that SNP entirely. Maybe that might explain why we have so few X SNPs

Anyway, after checking tomorrow that the resulting database is identical, (update a few hours later: it is identical) that's enough on this issue for me for a few days :-)

aaronmcdaid commented 6 years ago

I've worked it out! At last :-)

I know why the build database had so few X SNPs, and I have now generated a new build database which has 1.4 million SNPs (a small number, but still bigger than chr21 and chr22).

On hpc1, the reference_panels/1KG folder that I was using only has the 22 autosomes. Therefore the list of ~90 million SNPs didn't include any X SNPs from 1kg

I am now rerunning all my tests, just to confirm that all the existing (autosomal) results are unchanged by the new, larger, database . Then, I will email you and Zoltán to suggest replacing the exist 'switch' file with this new database. Longer term, we'll have to encourage existing users to download the new file if they want to do X imputation

sinarueeger commented 6 years ago

Great! maybe we should add a version number to the file, so its clear that the new version is new

sinarueeger commented 6 years ago

I should add examples too.

aaronmcdaid commented 6 years ago

X imputation on now on branch 'chrX23', you can pull that. The eight commits there, roughly in order from oldest to newest, are:

It includes an update to the scripts that I used to create the enlarged database.

It now uses Zoltán's new link to download the file, saving it to database.of.builds.1kg.uk10k.hrc.2018.01.18.bin instead of database.of.builds.1kg.uk10k.hrc.bin in the ~/reference_panels folder

I've made a change to where I call something called sort_my_entries, it's not really relevant to the X chromosome, but it simplifies some things and should speed them up.

Then two commits simply so it can load in the X data

Then, actually do the imputation by including chromosome 23 in the set of chromosomes to impute

Finally, commit a test - called via stu tests/full1KG/first.small.test.on.chrX/imputations.txt, to actually test this. This runs on the full 1kg data, but it's quick as it only has a few tags

I haven't added this test to the list of @all.tests, as we might want to ensure that the tests can work even if somebody hasn't downloaded 1KG. What do you think

aaronmcdaid commented 6 years ago

(Actually, there's a problem. It now doesn't like a reference panel that doesn't include X. I'll fix that shortly

aaronmcdaid commented 6 years ago

Problem fixed, and pushed to the same branch (chrX23). So I think you can merge it now:

The last thing I did was simply to skip a chromosome if the reference file (e.g. X) is missing for it

sinarueeger commented 6 years ago

Thanks Aaron!

I updated version (now 0.3), a comment about the new bin file, manual and examples.

About the test - you created ref/sub1KG-tiny/ for all autosomes. can you do this for chr x too?

BTW, all the 4 last commits are from me. 2 from my mac (my name), 2 from hpc (that have your account). This is the strange thing I told you about once. I think it has to do with the keys or so.

#46 # pastedgraphic-1

aaronmcdaid commented 6 years ago

After that email reply from Zoltán, I've pushed a tiny change to the same branch (chrX23). It's just one commit for you to merge in

(I'll think more about your other points in that email)

Aaron

aaronmcdaid commented 6 years ago

I've pushed a commit which adds X to the ref/sub1KG-tiny. Also to chrX23

Actually, should I do this another way, perhaps by sending you a pull request instead of simply sending a message like this?

Aaron

sinarueeger commented 6 years ago

I did a small example, but it wont run.

./compiled/ssimp-osx-0.3 --gwas gwas/small.random.x.txt --ref ref/sub1KG-tiny/chrX.vcf.gz --out output.txt --impute.snp <(echo rs183055800 rs146115300 rs150092800 | tr ' ' '\n')

Says: WARNING: Ignoring this line, problem with the chromosome and/or position [SNPname:rs189711500], can't parse this int

I guess this is because of the build bin file.

I looked into your x-chr example, but could not find the gwas file.

sinarueeger commented 6 years ago

Works now! #50