zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

util/variant: Add an indexed reader #188

Closed zaeleus closed 1 year ago

zaeleus commented 1 year ago

Is there an abstraction / trait that unifies reading from an indexed BCF or VCF where the index is either CSI or TBI?

Originally posted by @brentp in https://github.com/zaeleus/noodles/discussions/170#discussioncomment-6529163


E.g.,

let mut reader = variant::indexed_reader::Builder::default().build_from_path("src.vcf.gz")?;
let header = reader.read_header()?;
let region = Region::new("sq0", ..);

for result in reader.query(&header, &region)? {
    let record = result?;
    // ...
}
zaeleus commented 1 year ago

variant::IndexedReader is now available in noodles-util 0.18.0. See noodles-util/examples/util_variant_query.rs for an example.

brentp commented 1 year ago

Thanks very much!

brentp commented 1 year ago

Hi @zaeleus , I'm making use of this, but as it goes, now I could use variant::indexed_reader::Builder::build_from_reader in addition to build_from_path I see that's available for variant::reader, but not variant::indexed_reader. Is there a design reason for this or is it just not yet implemented? thanks again.

zaeleus commented 1 year ago

Not yet implemented due to the BCF's indexed reader builder not having it implemented either. I'll work on adding this functionality this week.

zaeleus commented 1 year ago

noodles-util 0.19.0 now includes building from a reader (variant::indexed_reader::Builder::build_from_reader). Don't forget that this requires the caller to first set the index. Let me know if you have any new issues or questions.

brentp commented 1 year ago

Hi Michael, thanks very much for implementing! I'm having some trouble with this. I modified noodles-util/examples/util_variant_query.rs with the diff at end (also attached) and run as:

cargo run --features variant --example util_variant_query ~/src/slivar/tests/ashk-trio.vcf.gz ~/src/slivar/tests/ashk-trio.vcf.gz.csi 2:200-900000 | grep -v ^# | head

but I see (at the read_header() call)

Error: Custom { kind: InvalidData, error: "invalid BGZF header" }

so either I'm doing something wrong, or the file detection is consuming some of the file?

Here is the diff:

diff --git a/noodles-util/examples/util_variant_query.rs b/noodles-util/examples/util_variant_query.rs
index dc02a0e7..b9b9f5d8 100644
--- a/noodles-util/examples/util_variant_query.rs
+++ b/noodles-util/examples/util_variant_query.rs
@@ -6,19 +6,37 @@

 use std::{
     env,
-    io::{self, BufWriter},
+    fs::File,
+    io::{self, BufReader, BufWriter},
 };

+use noodles_csi as csi;
 use noodles_util::variant;
 use noodles_vcf as vcf;

 fn main() -> Result<(), Box<dyn std::error::Error>> {
     let mut args = env::args().skip(1);

-    let src = args.next().expect("missing src");
-    let region = args.next().expect("missing region").parse()?;
+    let src = args
+        .next()
+        .expect("missing src (args are: src index region)");
+    let index = args
+        .next()
+        .expect("missing index (args are: src index region)");
+    let region = args
+        .next()
+        .expect("missing region (args are: src index region)")
+        .parse()?;
+
+    let f = File::open(&src)?;
+    let r = BufReader::new(f);
+
+    let index = csi::read(index).expect("error reading index");
+
+    let mut reader = variant::indexed_reader::Builder::default()
+        .set_index(index)
+        .build_from_reader(r)?;

-    let mut reader = variant::indexed_reader::Builder::default().build_from_path(src)?;
     let header = reader.read_header()?;

     let query = reader.query(&header, &region)?;

vq.diff.gz

zaeleus commented 1 year ago

The detection was indeed discarding the start of file buffer, sorry! This is fixed (with test) in noodles-util 0.19.1. Thank you for both your feedback and patience .