Test.with.less.memory

aaronmcdaid commented 6 years ago

This PR speeds up the tests a lot (although it's an ugly solution!).

The build database takes up approximately 7 GB of RAM. Most of our tests use only a subset of the chromosomes, so this PR makes it possible to load only part of the build database by specifying which chromosomes to load. Most tests use at most three chromosomes, which cuts memory usage from 7.0 GB to between 0.5 and 2.0 GB. Some of our tests did use all 22 chromosomes, but I changed those so that no test now uses more than three.

In itself, this doesn't make a huge speed improvement, but the memory savings mean that multiple tests can easily be run in parallel, even on a normal laptop. My laptop has four processors, so I can run stu -j4 @all.tests to complete all the tests in 15 minutes, where previously I think it took at least an hour.

Details:

Added a --_debug_build_chr option to specify a single chromosome. You can specify this option multiple times to select a set of chromosomes. The build database is then loaded only for that set. If the option is never specified, we load the full database, of course. (See the diffs in this PR for examples of usage.)
Moved the calculation of N_max, so in some cases it gives a different number. It now simply looks up the largest sample size in the GWAS file. Previously, it discarded problem SNPs from the GWAS file (e.g. unknown position) before computing N_max. Is it OK to change this, @sinarueeger? I can change it back. I did this to ensure that the --_debug_build_chr option doesn't also affect N_max.
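
To make the two changes above concrete, here is a minimal Python sketch (not the actual ssimp code, which is not shown in this thread). It assumes the build database and the GWAS file can be treated as lists of records; the function names, the record fields (`chr`, `n`, `snp`) and the toy data are all hypothetical. Only the option name `--_debug_build_chr` comes from this PR.

```python
import argparse


def parse_build_chrs(argv):
    """Collect every --_debug_build_chr value into a set (None if never given)."""
    parser = argparse.ArgumentParser()
    # action="append" lets the option be repeated, one chromosome per use.
    parser.add_argument("--_debug_build_chr", dest="build_chrs",
                        action="append", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)
    return None if args.build_chrs is None else set(args.build_chrs)


def load_build(build_rows, build_chrs):
    """Keep only the requested chromosomes; None means load everything."""
    if build_chrs is None:
        return list(build_rows)
    return [row for row in build_rows if row["chr"] in build_chrs]


def n_max(gwas_rows):
    """Largest sample size in the GWAS file, taken before any SNP is discarded."""
    return max(row["n"] for row in gwas_rows)


# Hypothetical toy data: rsX has no known position in the build.
build = [{"snp": "rs1", "chr": 1},
         {"snp": "rs2", "chr": 2},
         {"snp": "rs3", "chr": 22}]
gwas = [{"snp": "rs1", "n": 900},
        {"snp": "rs3", "n": 1000},
        {"snp": "rsX", "n": 1200}]

chrs = parse_build_chrs(["--_debug_build_chr", "1", "--_debug_build_chr", "2"])
loaded = load_build(build, chrs)
largest_n = n_max(gwas)
```

Because `n_max` reads the raw GWAS rows, its result does not depend on which chromosomes were loaded, which is the property the second point is after.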

In the future, we could check if there is a --impute.range or --tag.range argument, and then use that to identify which set of chromosomes to load. This would speed things up for normal users of the software. That would be a nicer solution, and then we could probably remove this --_debug_build_chr option.
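
A rough Python sketch of that idea follows. The range syntax is an assumption: a range is taken to be either a bare chromosome number (such as 22) or a hypothetical CHR:pos-CHR:pos form. The function names are invented for illustration and are not part of ssimp. The sketch also assumes that loading the union of the chromosomes named by the two ranges is sufficient.

```python
def chromosomes_in_range(range_arg):
    """Return the set of chromosomes covered by one range argument.

    Assumed syntax: "22" or "1:5000-3:100" (CHR:pos-CHR:pos).
    """
    start, sep, end = range_arg.partition("-")
    first = int(start.split(":")[0])
    # Without a "-" the range names a single chromosome.
    last = int(end.split(":")[0]) if sep else first
    return set(range(first, last + 1))


def chromosomes_to_load(impute_range=None, tag_range=None):
    """Union of chromosomes named by the given ranges; None means load all."""
    given = [r for r in (impute_range, tag_range) if r is not None]
    if not given:
        return None
    result = set()
    for r in given:
        result |= chromosomes_in_range(r)
    return result
```

With no range given, the function returns None, which would correspond to loading the full build database as before.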

sinarueeger commented 6 years ago

@aaronmcdaid looks good!

Regarding the _debug_build_chr option: does this mean we could use it instead of (or together with) the option --impute.range, e.g. ... --impute.range=22 --_debug_build_chr 22?

The change to how N_max is calculated looks fine.

sinarueeger commented 6 years ago

Btw - I am now using SSIMP on our HPC. The --download.1KG option works fine!

That also means that I can start running tests there with stu/stu @all.tests or stu/stu -j4 @all.tests.

aaronmcdaid commented 6 years ago

On hpc1, perhaps you could use more than 4. I forget how many CPUs it has. Is it just four?

... does this mean we could use this instead (or together) ...

Yes, this can now be simplified a lot. If the user specifies --impute.range or --tag.range, then ssimp should identify the set of chromosomes and then load only those. Once that works, we can delete the _debug_build_chr option.

I only added _debug_build_chr in order to have something quick and dirty for the testing. I hadn't really planned it well, and it's good you noticed that it can be simplified and improved!

sinarueeger commented 6 years ago

HPC1 has many more CPUs, I believe. But I am using a different HPC now.

zkutalik / ssimp_software

Test.with.less.memory #84