Closed aaronmcdaid closed 6 years ago
@aaronmcdaid looks good!
Option _debug_build_chr: does this mean we could use this instead (or together) with the option --impute.range, e.g. ... --impute.range=22 --_debug_build_chr 22?
Improvements regarding option N_max: looks fine.
Btw - I am now using SSIMP on our HPC. The --download.1KG option works fine!
That also means that I can start running tests there with stu/stu @all.tests or stu/stu -j4 @all.tests.
On hpc1, perhaps you could use more than 4. I forget how many CPUs it has. Is it just four?
... does this mean we could use this instead (or together) ...
Yes, this can now be simplified a lot. If the user specifies --impute.range or --tag.range, then ssimp should identify the set of chromosomes and then load only those. Once that works, we can delete the _debug_build_chr option.
I only added _debug_build_chr in order to have something quick and dirty for the testing. I hadn't really planned it well, and it's good you noticed that it can be simplified and improved!
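The simplification discussed above (working out the chromosome set from --impute.range and --tag.range) could look roughly like the following. This is a minimal Python sketch, not ssimp's code: the function names are made up, and the range syntax is an assumption (either a bare chromosome such as "22", or "CHR:START-CHR:END"). Only the form --impute.range=22 appears in this thread.

```python
def _chrom_span(range_text):
    """Chromosomes covered by one range string.

    Assumed syntax (hypothetical): "22" or "22:100-22:200".
    """
    start, _, end = range_text.partition("-")
    first = int(start.split(":")[0])
    last = int(end.split(":")[0]) if end else first
    return range(first, last + 1)


def chromosomes_to_load(impute_range=None, tag_range=None):
    """Union of chromosomes named by the given ranges.

    Returns None when neither range is given, meaning
    "load the full build database".
    """
    given = [r for r in (impute_range, tag_range) if r is not None]
    if not given:
        return None
    wanted = set()
    for r in given:
        wanted.update(_chrom_span(r))
    return wanted
```

With something like this in place, the chromosome filter would follow from the ranges the user already supplies, so a separate debug option would no longer be needed.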
HPC1 has many more CPUs, I believe. But I am using a different HPC now.
This PR speeds up the tests a lot (although it's an ugly solution!).
The build database takes up approximately 7 GB of RAM. Most of our tests only use a subset of the chromosomes, so this PR allows loading only a subset of the build database, by specifying which chromosomes to load. Most tests use at most three chromosomes, which cuts the memory usage down from 7.0 GB to between 0.5 and 2.0 GB. Some of our tests did use all 22 chromosomes, but I changed those so that no test uses more than three chromosomes.
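As a rough illustration of the idea (in Python rather than ssimp's own code), a repeatable option can collect a set of chromosomes, and the loader can then skip every build record outside that set. Only the option name --_debug_build_chr comes from this PR; parse_args, load_build and the (chromosome, position, rsid) record layout are assumptions made for the sketch.

```python
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Each occurrence of the option appends one chromosome to the list.
    # If the option never appears, build_chr stays None.
    parser.add_argument(
        "--_debug_build_chr",
        action="append",
        type=int,
        default=None,
        dest="build_chr",
    )
    return parser.parse_args(argv)


def load_build(records, chromosomes=None):
    """Keep only records on the requested chromosomes.

    records: iterable of (chromosome, position, rsid) tuples (hypothetical layout).
    chromosomes: None means "load the full database".
    """
    wanted = None if chromosomes is None else set(chromosomes)
    return [rec for rec in records if wanted is None or rec[0] in wanted]


if __name__ == "__main__":
    args = parse_args(["--_debug_build_chr", "1", "--_debug_build_chr", "22"])
    build = [(1, 100, "rs1"), (2, 200, "rs2"), (22, 300, "rs3")]
    print(load_build(build, args.build_chr))
```

Filtering while reading, rather than after loading everything, is what would give the memory saving, since the discarded chromosomes never need to be held in RAM.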
In itself, this doesn't make a huge speed improvement, but the memory savings mean that multiple tests can easily be run in parallel, even on a normal laptop. My laptop has four processors, so I can run stu -j4 @all.tests to complete all the tests in 15 minutes, where previously I think it took at least an hour.

Details:
A new --_debug_build_chr option specifies a single chromosome. You can specify this option multiple times to specify a set of chromosomes. The build database is then only loaded for that set of chromosomes. If this option is never specified, we load the full database, of course. (See the diffs in this PR for examples of usage.)

I moved the calculation of N_max such that, in some cases, it gives a different number. Now, it simply looks up the largest sample size in the GWAS file. Previously, it discarded problem SNPs from the GWAS file (e.g. unknown position) before computing N_max. Is it OK to change this, @sinarueeger? I can change it back. I did this in order to ensure that the --_debug_build_chr option didn't also affect N_max.

In the future, we could check if there is a --impute.range or --tag.range argument, and then use that to identify which set of chromosomes to load. This would speed things up for normal users of the software. That would be a nicer solution, and then we could probably remove this --_debug_build_chr option.
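To make the N_max change concrete, here is a small Python sketch of the two behaviours described above. It is not ssimp's implementation: the row layout, the "N" and "pos" field names, and both function names are hypothetical. The point is that the old calculation depends on which SNPs survive filtering, whereas the new one depends only on the GWAS file itself.

```python
def n_max_new(gwas_rows):
    """New behaviour: largest sample size over every row of the GWAS file."""
    return max(row["N"] for row in gwas_rows)


def n_max_old(gwas_rows):
    """Old behaviour: drop problem SNPs (here: unknown position) first."""
    return max(row["N"] for row in gwas_rows if row["pos"] is not None)


if __name__ == "__main__":
    # "pos" is None when the SNP's position is unknown.
    gwas = [
        {"snp": "rs1", "pos": 100, "N": 5000},
        {"snp": "rs2", "pos": None, "N": 8000},
    ]
    print(n_max_new(gwas), n_max_old(gwas))
```

If SNP positions come from the build database (an assumption here), then loading fewer chromosomes would leave more SNPs without a position, and the old calculation could change with --_debug_build_chr. Taking the maximum over the unfiltered file avoids that coupling.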