zhaoxiaofei / bindash

Fast and precise comparison of genomes and metagenomes (in the order of terabytes) on a typical personal laptop
Other
56 stars 7 forks source link

bindash sketch fails assertion at runtime #3

Closed dnbaker closed 5 years ago

dnbaker commented 5 years ago

Regardless of which kmer or sketch size parameters I use, for --minhashtype=2, I can't get the program to run without failing an internal assert. Is this something I can safely ignore? Results from GNU time -v are below.

bindash: BinDash/src/hashutils.hpp:276: const uint64_t estimate_genome_size2(const std::vector<long unsigned int>&, size_t): Assertion `signs[i] >= binmin' failed.
Command terminated by signal 6
    Command being timed: "bindash sketch --minhashtype=2 --nthreads=100 --bbits=16 --sketchsize64=512 --kmerlen=31 --outfname=bindash.31.16.bdsh --listfname=fnames.txt"
    User time (seconds): 15669.92
    System time (seconds): 11.96
    Percent of CPU this job got: 9961%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 2:37.42
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 391640
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 0
    Minor (reclaiming a frame) page faults: 251835
    Voluntary context switches: 13045
    Involuntary context switches: 62319
    Swaps: 0
    File system inputs: 920
    File system outputs: 40
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0
zhaoxiaofei commented 5 years ago

Hi Daniel Baker,

Thank you for finding this anomaly, and sorry for responding so lately. Yes. You can safely ignore this assertion failure in practice.

Although this assertion failure can be ignored in practice, its root cause indicates that there is a better way to handle it.

The assertion failure is caused by the densification of an empty bin (a bin is associated with a contiguous range of hash values). When a bin is empty, the value from another bin is inserted into the empty bin. The insertion results in the densification of the otherwise empty bin. The inserted value is not within the expected range of the bin that was empty before the insertion. Therefore, the assertion would fail in this case. If the assertion fail, then the subsequent calculation of genome size is supposed to give grossly inaccurate results.

However, the subsequent calculation of genome size is still approximately correct even if the assertion fails. The reason is that the density of the bin with only one inserted value is very small, so this bin does not significantly affect the median bin density. And only the median bin density is used for estimating genome size. Therefore, the calculation of genome size is still approximately correct even if this assertion fails.

Also, --sketchsize64=512 implies that sketch size is 64*512, which seems to be pretty big.

Btw, I can reproduce this bug with the following command: bindash sketch --minhashtype=2 --nthreads=100 --bbits=16 --sketchsize64=512 --kmerlen=3 --outfname=danielbaker-1 ../benchmark/testdata/testgenome1.fna which results in: ./bindash revision 65df81d () Warning: the genome ../benchmark/testdata/testgenome1.fna is densified with flag 1 bindash: /home/zhaoxiaofei/bindash/src/hashutils.hpp:276: const uint64_t estimate_genome_size2(const std::vector&, size_t): Assertion `signs[i] >= binmin' failed. Aborted (core dumped)

dnbaker commented 5 years ago

I don't seem to be seeing the warning being thrown anymore.

However, on some pairs of genomes it is failing to emit output.

Typical behavior:

Running bindash commit 55af51e-dirty (1 file changed, 1 insertion(+), 1 deletion(-))
Max number of genome comparions per cached data: 2048x1024
Processed up to the cached chunk at (1,1) in 0 CPU seconds and 0 wall-clock seconds.
ref/bacteria/GCF_900148245.1_LRJ08_genomic.fna.gz   ref/bacteria/GCF_900119575.1_IMG-taxon_2599185161_annotated_assembly_genomic.fna.gz 1.6725e-01  4.2492e-170 146/4096
Program ran for 0 CPU seconds and 0 real seconds.

On some pairs of genomes, however, nothing is emitted:

[dbaker49@jhu.edu@langmead-bigmem02 dnb]$ bindash dist GCF_000331305.1_ASM33130v1_genomic.fna.gz__.s256.k16.bdsh GCF_900112965.1_IMG-taxon_2616644814_annotated_assembly_genomic.fna.gz__.s256.k16.bdsh
Running bindash commit 55af51e-dirty (1 file changed, 1 insertion(+), 1 deletion(-))
Max number of genome comparions per cached data: 2048x1024
Processed up to the cached chunk at (1,1) in 0 CPU seconds and 0 wall-clock seconds.
Program ran for 0 CPU seconds and 0 real seconds.

When I redirect stderr, one line is written on the first invocation, while none are written for the second.

dnbaker commented 5 years ago

That was a consequence of filtering by default. Thank you for all your help! I'm closing the issue, as all of my concerns have been addressed.

zhaoxiaofei commented 5 years ago

You are welcome!