Closed: kriestof closed this issue 1 year ago
My initial guess is that noodles is slower because it parses every record upfront.
This is very likely the case; noodles does fully parse and validate records.
Can you provide an example input for me to investigate further? If not, anything with a similar number of INFO and genotype fields would be helpful.
Since you're reading very few fields, a lazy record would be useful here, but noodles currently does not implement one.
Thanks for the prompt reply. That was also my guess, that lazy parsing is needed here. You can download the data used for the test here. Basically it's from the 1kg dataset, but I limited the number of variants to a few thousand.
I guess writing a naive parser in Rust for that single purpose should not be too difficult, as VCF has a quite clear structure and I only need very basic data.
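(For reference, a rough sketch of such a single-purpose parser using only the standard library. It assumes an uncompressed VCF where GT is the first FORMAT key, as the spec requires when GT is present; the "sample.vcf" path is just a placeholder.)

```rust
use std::{
    fs::File,
    io::{self, BufRead, BufReader},
};

fn main() -> io::Result<()> {
    let reader = BufReader::new(File::open("sample.vcf")?);
    let mut sample_names: Vec<String> = Vec::new();

    for result in reader.lines() {
        let line = result?;

        if line.starts_with("#CHROM") {
            // Column header line: sample names follow the 9 fixed columns
            // (#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT).
            sample_names = line.split('\t').skip(9).map(String::from).collect();
            continue;
        } else if line.starts_with('#') {
            continue; // other meta lines
        }

        let mut fields = line.split('\t');
        let id = fields.nth(2).unwrap_or("."); // skip CHROM and POS, take ID

        print!("{id}");

        // Skip REF, ALT, QUAL, FILTER, and FORMAT is skipped too (6 fields);
        // the remaining columns are per-sample values, and GT is the first
        // `:`-delimited key in each.
        for sample in fields.skip(6) {
            print!(" {}", sample.split(':').next().unwrap_or("."));
        }

        println!();
    }

    eprintln!("{} samples", sample_names.len());

    Ok(())
}
```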
noodles 0.48.0 / noodles-vcf 0.36.0 now has a lazy record and reader, which should help considerably with raw performance in your example. I compared the bcftools command from the original post with an equivalent program that uses the lazy records:
vcf_193.rs
I'm still experimenting with the VCF lazy record API. It's currently more of a raw record and only does structural parsing.
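(For reference, a rough sketch of such a lazy read loop; this is not the attached vcf_193.rs. It assumes the read_lazy_record API added in noodles-vcf 0.36, and the ids()/genotypes() accessors shown are assumptions about the lazy record's raw string-slice getters, so check the vcf::lazy::Record docs for the exact names in your version.)

```rust
use std::{
    fs::File,
    io::{self, BufReader, BufWriter, Write},
};

use noodles_vcf as vcf;

fn main() -> io::Result<()> {
    let mut reader = File::open("sample.vcf")
        .map(BufReader::new)
        .map(vcf::Reader::new)?;

    reader.read_header()?;

    let mut writer = BufWriter::new(io::stdout().lock());

    // A lazy record is only structurally parsed; fields stay as raw strings.
    let mut record = vcf::lazy::Record::default();

    // read_lazy_record returns the number of bytes read; 0 means EOF.
    while reader.read_lazy_record(&mut record)? != 0 {
        // Assumed accessor returning the raw ID field as a string slice.
        write!(writer, "{}", record.ids())?;

        for sample in record.genotypes().iter() {
            // GT is the first `:`-delimited key when present, mirroring
            // `bcftools query -f "%ID [%GT ]\n"`.
            write!(writer, " {}", sample.split(':').next().unwrap_or("."))?;
        }

        writeln!(writer)?;
    }

    Ok(())
}
```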
Wow, I really didn't expect that! Thank you for the blazingly quick resolution! It saved me a lot of time on some naive implementation.
I have implemented a new reader on top of that parser and it works perfectly. I do not have a minimal benchmark, so I cannot give results as accurate as yours. Anyway, it is surely fast enough and faster than bcftools.
Great! If you run into any problems, please submit a new issue.
Hello, I'm trying to load a VCF file into my program. Basically, what I need from the VCF file is IDs, sample names, and genotypes. I've got working code; however, it's a bit slow. On a 170 MB file it takes around 30 sec to read. I can process the same file with
bcftools query -f "%ID [%GT ]\n"
in around 6 sec. My initial guess is that noodles is slower because it parses every record upfront. I have checked with perf, and indeed reader::record::parse_record
alone consumes around 60% of the time. I provide my function below. Is there any way to improve performance, or am I left with using bcftools or writing my own parser? [Hopefully without multithreading. I guess it may help, but then I would probably need to pull in additional dependencies like tokio.]
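(For illustration only, not the attached function: the general shape of an eager noodles-vcf read loop of that era, in which every record is fully parsed. Exact method names and return types vary between noodles-vcf versions, and "sample.vcf" is a placeholder path.)

```rust
use std::{
    fs::File,
    io::{self, BufReader},
};

use noodles_vcf as vcf;

fn main() -> io::Result<()> {
    let mut reader = File::open("sample.vcf")
        .map(BufReader::new)
        .map(vcf::Reader::new)?;

    // Reading the header also gives access to the sample names.
    let header = reader.read_header()?;
    let n_samples = header.sample_names().len();

    let mut ids = Vec::new();

    // Each iteration fully parses and validates a record; this is where the
    // time reported by perf under parse_record goes.
    for result in reader.records(&header) {
        let record = result?;
        ids.push(record.ids().to_string());
        // Parsed genotypes are also available on the record (via its
        // genotypes accessor); the per-sample details are omitted here to
        // avoid guessing exact method names.
    }

    eprintln!("{} samples, {} records", n_samples, ids.len());

    Ok(())
}
```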