zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License

Performance issues with vcf reader #193

Closed · kriestof closed 1 year ago

kriestof commented 1 year ago

Hello, I'm trying to load a VCF file into my program. Basically, what I need from the VCF file are the IDs, sample names, and genotypes. I've got working code, but it's a bit slow: a 170 MB file takes around 30 s to read, while I can process the same file with `bcftools query -f "%ID [%GT ]\n"` in around 6 s. My initial guess is that noodles is slower because it parses every record up front. I checked with perf, and indeed `reader::record::parse_record` alone consumes around 60% of the time. My function is below. Is there any way to improve performance, or am I left with using bcftools or writing my own parser? [Hopefully without multithreading. I guess it may help, but then I would probably need to pull in dependencies like tokio.]

fn vcf_read(file_name: PathBuf) -> Result<(XData, SampleNames), VcfReadError> {
    let mut reader = vcf::reader::Builder::default().build_from_path(file_name)?;
    let header = reader.read_header()?;

    let mut res_vec = vec![];
    let mut res_ids = vec![];

    for result in reader.records(&header) {
        let record = result?;
        let mut col = vec![];

        if record.ids().len() != 1 {
            return Err(VcfReadError::VcfIdError);
        }
        let id = (*record.ids()[0]).to_string();
        res_ids.push(id.clone());

        for gens in record.genotypes().genotypes() {
            for gen in gens {
                let pos1 = gen.as_ref().unwrap()[0].position();
                let pos2 = gen.as_ref().unwrap()[1].position();

                if pos1.is_none() || pos2.is_none() {
                    return Err(VcfReadError::VcfNaAlleles(id));
                }

                let val: i8 = match (pos1.unwrap(), pos2.unwrap()) {
                    (0, 0) => 0,
                    (1, 0) => 1,
                    (0, 1) => 1,
                    (1, 1) => 2,
                    (x, y) => return Err(VcfReadError::VcfMultisite {id, x, y})
                };
                col.push(val);
            }
            res_vec.push(MultiX::ThreeVal(ThreeValCol::new(col.as_slice())));
        }
    }
    return Ok(((XDf::new(res_vec), res_ids), header.sample_names().clone()));
}
zaeleus commented 1 year ago

> My initial guess is that noodles is slower because it parses every record up front.

This is very likely the case; noodles does fully parse and validate records.

Can you provide an example input for me to investigate further? If not, anything with a similar number of INFO and genotype fields would be helpful.

Since you're reading very few fields, a lazy record would be useful here, but noodles currently does not implement one.

kriestof commented 1 year ago

Thanks for the prompt reply. That was also my guess, that lazy parsing is needed here. You can download the data used for the test here. Basically, it's from the 1kg dataset, but I limited the number of variants to a few thousand.

I guess writing a naive parser in Rust for that single purpose should not be too difficult, as VCF has quite a clear structure and I only need very basic data.
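For a sense of what that might look like, here is a minimal sketch of such a single-purpose parser. It is not part of noodles and the function name is made up; it also assumes an uncompressed VCF (a bgzipped file would first need a decoder such as noodles-bgzf or flate2) and that GT is the first FORMAT key.

```rust
use std::{
    fs::File,
    io::{self, BufRead, BufReader},
    path::Path,
};

// Hypothetical single-purpose scan: collect the ID column and the GT
// subfield of every sample column. Assumes an uncompressed VCF and that
// GT is the first FORMAT key.
fn scan_vcf<P: AsRef<Path>>(src: P) -> io::Result<Vec<(String, Vec<String>)>> {
    let reader = BufReader::new(File::open(src)?);
    let mut rows = Vec::new();

    for line in reader.lines() {
        let line = line?;

        // Skip meta lines and the #CHROM header line.
        if line.starts_with('#') {
            continue;
        }

        let mut fields = line.split('\t');
        // Columns are CHROM, POS, ID, ...; `nth(2)` lands on ID.
        let id = fields.nth(2).unwrap_or(".").to_string();

        // Skip REF, ALT, QUAL, FILTER, INFO, and FORMAT; the remaining
        // columns are samples, whose first `:`-delimited subfield is GT.
        let gts = fields
            .skip(6)
            .map(|sample| sample.split(':').next().unwrap_or(".").to_string())
            .collect();

        rows.push((id, gts));
    }

    Ok(rows)
}
```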

zaeleus commented 1 year ago

noodles 0.48.0 / noodles-vcf 0.36.0 now has a lazy record and reader, which should help considerably with raw performance in your example. I compared the bcftools command above with an equivalent program that uses the lazy records:

vcf_193.rs

```rust
use std::{
    env,
    io::{self, BufWriter, Write},
};

use noodles_vcf as vcf;

fn main() -> io::Result<()> {
    let src = env::args().nth(1).expect("missing src");

    let mut reader = vcf::reader::Builder.build_from_path(src)?;
    reader.read_header()?;

    let stdout = io::stdout().lock();
    let mut writer = BufWriter::new(stdout);

    let mut record = vcf::lazy::Record::default();

    while reader.read_lazy_record(&mut record)? != 0 {
        writer.write_all(record.ids().as_bytes())?;
        writer.write_all(b" ")?;

        let genotypes = record.genotypes();

        let Some("GT") = genotypes.keys()?.next() else {
            panic!("missing GT column");
        };

        for sample in genotypes.samples()?.flatten() {
            let gt = sample
                .values()
                .next()
                .expect("missing GT field")
                .expect("missing GT value");

            writer.write_all(gt.as_bytes())?;
            writer.write_all(b" ")?;
        }

        writer.write_all(b"\n")?;
    }

    Ok(())
}
```
Comparison

```
$ bcftools version | head -2
bcftools 1.18
Using htslib 1.18
$ samtools version
[...]
HTSlib compilation details:
    Features:       build=configure libcurl=no S3=no GCS=no libdeflate=yes lzma=yes bzip2=yes plugins=no htscodecs=1.5.1
    CC:             gcc
    CPPFLAGS:
    CFLAGS:         -Wall -g -O2 -fvisibility=hidden
    LDFLAGS:        -fvisibility=hidden
[...]
$ bcftools query -f "%ID [%GT ]\n" tmp/issues/193/sample_031.vcf.gz > /tmp/out.bcftools.txt
$ cargo build --release --features libdeflate --example vcf_193
$ target/release/examples/vcf_193 tmp/issues/193/sample_031.vcf.gz > /tmp/out.noodles.txt
$ diff /tmp/out.*.txt
$ echo $?
0
$ hyperfine --warmup 3 \
    'bcftools query -f "%ID [%GT ]\n" tmp/issues/193/sample_031.vcf.gz' \
    "target/release/examples/vcf_193 tmp/issues/193/sample_031.vcf.gz"
Benchmark 1: bcftools query -f "%ID [%GT ]\n" tmp/issues/193/sample_031.vcf.gz
  Time (mean ± σ):      6.508 s ±  0.113 s    [User: 6.374 s, System: 0.041 s]
  Range (min … max):    6.320 s …  6.703 s    10 runs

Benchmark 2: target/release/examples/vcf_193 tmp/issues/193/sample_031.vcf.gz
  Time (mean ± σ):      2.931 s ±  0.145 s    [User: 2.535 s, System: 0.356 s]
  Range (min … max):    2.783 s …  3.173 s    10 runs

Summary
  'target/release/examples/vcf_193 tmp/issues/193/sample_031.vcf.gz' ran
    2.22 ± 0.12 times faster than 'bcftools query -f "%ID [%GT ]\n" tmp/issues/193/sample_031.vcf.gz'
```

I'm still experimenting with the VCF lazy record API. It's currently more of a raw record and only does structural parsing.
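In practice that means the GT values come back as raw strings (e.g. `0|1`) rather than parsed allele positions, so the 0/1/2 mapping from the original `vcf_read` has to be done on the string itself. A minimal, hypothetical helper (not part of noodles) might look like the sketch below, assuming diploid calls separated by `/` or `|`:

```rust
/// Maps a raw diploid GT string (e.g. "0/1" or "1|1") to the 0/1/2 encoding
/// used in `vcf_read`. Returns `None` for missing alleles ("."), extra
/// alleles, or allele indexes other than 0 and 1. Illustrative helper only.
fn gt_to_dosage(gt: &str) -> Option<i8> {
    let mut alleles = gt.split(|c: char| c == '/' || c == '|');

    let a: u8 = alleles.next()?.parse().ok()?;
    let b: u8 = alleles.next()?.parse().ok()?;

    // Reject anything that is not a biallelic diploid call.
    if alleles.next().is_some() || a > 1 || b > 1 {
        return None;
    }

    Some((a + b) as i8)
}
```

Each `gt` pulled from `sample.values().next()` in the example above could be passed through such a helper before being pushed into `col`, with `None` mapped back to the original `VcfReadError` variants.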

kriestof commented 1 year ago

Wow, I really wasn't expecting that! Thank you for the blazingly quick resolution! It saved me a lot of time on a naive implementation of my own.

I have implemented a new reader on top of that parser, and it works perfectly. I don't have a minimal benchmark, so I can't give results as accurate as yours, but it is certainly fast enough and faster than bcftools.

zaeleus commented 1 year ago

Great! If you run into any problems, please submit a new issue.