zjshi / gt-pro

MIT License

reduce RAM requirement and DB size >5x #12

Closed boris-dimitrov closed 5 years ago

boris-dimitrov commented 5 years ago

This is a substantial rewrite of the DB builder and query engine that reduces both the database size and the query engine's RAM requirement by more than 5x. The most important changes are as follows.

  1. The DB builder output format has changed as follows:

    (a) The byte order is now (snp, kmer) rather than (kmer, snp).

    (b) The snp encodes additional information, such as whether the kmer is forward or reverse-complement, and the offset of the SNP within the kmer.

    • The 56 most-significant bits encode what was called snp in the previous version.

    • The next bit encodes forward/rc.

    • The 5 least-significant bits encode the SNP offset within the kmer.

    (c) The endianness of kmer nucleotide encoding has changed. It is now little-endian, which improves performance on Intel architectures. That is, encoding a sequence of nucleotides S as an integer N now uses the least significant bits of N to represent the initial characters of S, so the byte order of N in memory matches the byte order of S in memory.

  2. The DB builder input format has changed. The DB builder can now directly read the sckmer profile files, without requiring a shell script to process them first.

  3. The DB builder now runs 5x faster thanks to various optimizations. This saves about an hour when building a DB for 974 species with ~7 billion kmers.

  4. The optimized DB, constructed the first time you run the query engine, is 6x smaller than the original DB and consists of 2 files:

    _optimized_db_snps.bin
    _optimized_db_kmer_index.bin

    These files are constructed automatically when you provide the original db as a command line argument of the query engine. You may choose to distribute these much smaller files instead of the much larger original DB, and still provide the same command line argument (referring to a phantom file that will never be accessed if the optimized DB files above are found).

    In addition, the first time this program runs, it will build indexes

    _optimized_db_mmer_bloom_35.bin
    _db_lmer_index_30.bin

    as appropriate for the -l and -m command line parameters. You may ship/download those indexes or allow the query engine to build them the first time it is invoked. If you choose the latter, it is best to then quit the query engine immediately after those indexes are built, and then restart it to run queries, to realize the most compact memory footprint for running queries.
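To make the snp layout from item 1(b) concrete, here is a minimal Python sketch; it is not the actual builder code. It assumes the snp is stored in a 64-bit word, in which case the 56 + 1 + 5 bits listed above leave bits 5 and 6 unaccounted for, and the sketch simply treats them as unused. The names `pack_snp` and `unpack_snp` are hypothetical.

```python
# Sketch of the snp word layout from item 1(b).
# ASSUMPTION: the word is 64 bits wide, so the fields fall at
#   bits 63..8  (56 most-significant bits): snp from the previous version
#   bit  7      ("the next bit"):           forward / reverse-complement flag
#   bits 6..5:                              not described; treated as unused here
#   bits 4..0   (5 least-significant bits): SNP offset within the kmer

SNP_ID_SHIFT = 8      # 64 - 56
RC_BIT = 7
OFFSET_MASK = 0x1F    # 5 bits, values 0..31


def pack_snp(snp_id, is_rc, offset):
    """Combine the three fields into one integer word (hypothetical helper)."""
    if not 0 <= snp_id < (1 << 56):
        raise ValueError("snp_id must fit in 56 bits")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 5 bits")
    return (snp_id << SNP_ID_SHIFT) | (int(is_rc) << RC_BIT) | offset


def unpack_snp(word):
    """Split a packed word back into (snp_id, is_rc, offset)."""
    snp_id = word >> SNP_ID_SHIFT
    is_rc = bool((word >> RC_BIT) & 1)
    offset = word & OFFSET_MASK
    return snp_id, is_rc, offset


if __name__ == "__main__":
    word = pack_snp(123456, True, 17)
    print(hex(word), unpack_snp(word))
```

Under these assumptions, shifting the word right by 8 bits recovers the value that the previous version called snp.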

Note that the results after all these changes are identical to those produced by the previous version. In particular, the bug in the previous version that fails to output the last SNP for each input is still present in this version. That bug will be fixed in a separate future change.
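The little-endian kmer encoding from item 1(c) can be sketched in a few lines of Python. This is an illustration only: the 2-bits-per-nucleotide width and the A=0, C=1, G=2, T=3 mapping are assumptions, and `encode_kmer` and `decode_kmer` are hypothetical names, not functions from this change.

```python
# Sketch of little-endian kmer encoding as described in item 1(c):
# the first nucleotide of S occupies the least significant bits of N.
# ASSUMPTIONS: 2 bits per nucleotide and the A/C/G/T -> 0/1/2/3 mapping.

BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}


def encode_kmer(seq):
    """Encode a nucleotide string as an integer, first base in the low bits."""
    n = 0
    for i, base in enumerate(seq):
        n |= CODE[base] << (2 * i)
    return n


def decode_kmer(n, k):
    """Recover the k-nucleotide string from its integer encoding."""
    return "".join(BASES[(n >> (2 * i)) & 3] for i in range(k))


if __name__ == "__main__":
    n = encode_kmer("CAAAT")
    # Stored little-endian, the first byte of N holds the first four bases
    # of S and the second byte holds the next four.
    print(n, n.to_bytes(8, "little").hex(), decode_kmer(n, 5))
```

With this layout, on a little-endian machine the bytes of N appear in memory in the same order as the characters of S they encode, which is the property item 1(c) describes.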