zhaoxiaofei / bindash

Fast and precise comparison of genomes and metagenomes (in the order of terabytes) on a typical personal laptop

The input data format #22

Open bingyinglee opened 1 week ago

bingyinglee commented 1 week ago

Hi, I tried to use bindash to process my FASTA data. I first ran the following command:

./bindash sketch mydata.fas --outfname=genomeA.sketch

The mydata.fas file is about 50 MB and contains more than 20,000 nucleotide sequences, but the generated .sketch file is only 1 KB. Something seems to be wrong, but I don't know what to change. Are there any requirements for the input data format?

jianshu93 commented 1 week ago

Hi @bingyinglee,

The output file size depends only on the sketch size (the --sketchsize64 M and --bbits N options), not on the size of the input, if your purpose is to compute genomic distances among your files. A sketch is just the first N bits of each of M 64-bit integers, so it is not that big. You can increase --sketchsize64 to 200, or even several thousand, if you want accurate estimates at 99% or even 99.99% ANI and above (ANI, average nucleotide identity, is a widely used metric for genomic distance). This tool is only for genomic distance estimation, not for FASTQ/FASTA quality control or similar tasks.
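To illustrate why the sketch stays small no matter how large the input is, here is a minimal Python sketch of the general idea. It is a toy bottom-k MinHash, not bindash's actual algorithm or file format; the function names and the `num_hashes` and `bbits` parameters are hypothetical stand-ins for the sketch-size settings, and the sequences are randomly generated.

```python
import hashlib
import random


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def toy_sketch(seq, k=21, num_hashes=128, bbits=14):
    """Toy bottom-k MinHash (an illustration, not bindash's method).

    Hash every k-mer to a 64-bit integer, keep the num_hashes smallest
    distinct values, and truncate each one to its low bbits bits.
    """
    hashes = set()
    for kmer in kmers(seq, k):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    mask = (1 << bbits) - 1
    return [h & mask for h in sorted(hashes)[:num_hashes]]


def sketch_bits(sketch, bbits=14):
    """Payload size in bits: number of kept values times bits per value."""
    return len(sketch) * bbits


# Two made-up inputs whose lengths differ by a factor of 100.
rng = random.Random(0)
small_seq = "".join(rng.choice("ACGT") for _ in range(2_000))
large_seq = "".join(rng.choice("ACGT") for _ in range(200_000))

small_sketch = toy_sketch(small_seq)
large_sketch = toy_sketch(large_seq)

print(len(small_seq), sketch_bits(small_sketch))
print(len(large_seq), sketch_bits(large_sketch))
```

Both sketches hold 128 values of 14 bits each, so the payload is identical for the short and the long input. Only raising `num_hashes` or `bbits` makes it larger, which is the same trade-off between file size and accuracy described above.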

Let me know if anything is unclear.

Best,

Jianshu