Closed dnbaker closed 5 years ago
Hi Daniel Baker,
Thank you for finding this anomaly, and sorry for responding so lately. Yes. You can safely ignore this assertion failure in practice.
Although this assertion failure can be ignored in practice, its root cause indicates that there is a better way to handle it.
The assertion failure is caused by the densification of an empty bin (a bin is associated with a contiguous range of hash values). When a bin is empty, the value from another bin is inserted into the empty bin. The insertion results in the densification of the otherwise empty bin. The inserted value is not within the expected range of the bin that was empty before the insertion. Therefore, the assertion would fail in this case. If the assertion fail, then the subsequent calculation of genome size is supposed to give grossly inaccurate results.
However, the subsequent calculation of genome size is still approximately correct even if the assertion fails. The reason is that the density of the bin with only one inserted value is very small, so this bin does not significantly affect the median bin density. And only the median bin density is used for estimating genome size. Therefore, the calculation of genome size is still approximately correct even if this assertion fails.
Also, --sketchsize64=512 implies that sketch size is 64*512, which seems to be pretty big.
Btw, I can reproduce this bug with the following command:
bindash sketch --minhashtype=2 --nthreads=100 --bbits=16 --sketchsize64=512 --kmerlen=3 --outfname=danielbaker-1 ../benchmark/testdata/testgenome1.fna
which results in:
./bindash revision 65df81d ()
Warning: the genome ../benchmark/testdata/testgenome1.fna is densified with flag 1
bindash: /home/zhaoxiaofei/bindash/src/hashutils.hpp:276: const uint64_t estimate_genome_size2(const std::vector
I don't seem to be seeing the warning being thrown anymore.
However, on some pairs of genomes it is failing to emit output.
Typical behavior:
Running bindash commit 55af51e-dirty (1 file changed, 1 insertion(+), 1 deletion(-))
Max number of genome comparions per cached data: 2048x1024
Processed up to the cached chunk at (1,1) in 0 CPU seconds and 0 wall-clock seconds.
ref/bacteria/GCF_900148245.1_LRJ08_genomic.fna.gz ref/bacteria/GCF_900119575.1_IMG-taxon_2599185161_annotated_assembly_genomic.fna.gz 1.6725e-01 4.2492e-170 146/4096
Program ran for 0 CPU seconds and 0 real seconds.
On some pairs of genomes, however, nothing is emitted:
[dbaker49@jhu.edu@langmead-bigmem02 dnb]$ bindash dist GCF_000331305.1_ASM33130v1_genomic.fna.gz__.s256.k16.bdsh GCF_900112965.1_IMG-taxon_2616644814_annotated_assembly_genomic.fna.gz__.s256.k16.bdsh
Running bindash commit 55af51e-dirty (1 file changed, 1 insertion(+), 1 deletion(-))
Max number of genome comparions per cached data: 2048x1024
Processed up to the cached chunk at (1,1) in 0 CPU seconds and 0 wall-clock seconds.
Program ran for 0 CPU seconds and 0 real seconds.
When I redirect stderr, one line is written on the first invocation, while none are written for the second.
That was a consequence of filtering by default. Thank you for all your help! I'm closing the issue, as all of my concerns have been addressed.
You are welcome!
Regardless of which kmer or sketch size parameters I use, for
--minhashtype=2
, I can't get the program to run without failing an internal assert. Is this something I can safely ignore? Results from GNUtime -v
are below.