zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
512 stars 53 forks source link

csi: tabix header indices are off by one #215

Closed zaeleus closed 1 year ago

zaeleus commented 1 year ago

@zaeleus , I am trying to do this with the following code, but the index does not seem to be correct. what do I miss? Thanks in advance:

Example ```rust use noodles::core; use noodles::csi::{self as csi, index::reference_sequence::bin::Chunk}; use noodles::vcf::header::record::value::{map::Contig, Map}; use noodles::vcf::record::Position; use tempfile::tempdir; fn write_vcf( temp_dir: &PathBuf, chrom: String, start: usize, reference: &str, alternate: &str, ) -> String { let vcf_path = temp_dir.join("test.vcf.gz"); let fs = File::create(&vcf_path).expect("error creating file"); let mut wtr = bgzf::Writer::new(&fs); let mut writer = vcf::Writer::new(&mut wtr); let id = chrom.parse().expect("error parsing chrom"); let contig = Map::::new(); let header = vcf::Header::builder().add_contig(id, contig.clone()).build(); writer.write_header(&header).expect("error writing header"); let start_position = writer.get_ref().virtual_position(); let chrom = chrom.parse().expect("error parsing chrom"); let record = vcf::Record::builder() .set_chromosome(chrom) .set_position(Position::from(start)) .set_reference_bases(reference.parse().expect("error parsing reference")) .set_alternate_bases(alternate.parse().expect("error parsing alternate")) .set_info("AF=0.1".parse().expect("error parsing AF")) .build() .expect("error building record"); writer.write_record(&header, &record).expect("error writing test record"); let end_position = writer.get_ref().virtual_position(); let chunk = Chunk::new(start_position, end_position); let alignment_context = Some(( 0 as usize, core::Position::try_from(start).expect("bad start in test"), core::Position::try_from(start + reference.len()).expect("bad end in test"), true, )); let mut indexer = csi::index::Indexer::default().set_header(csi::index::header::Builder::vcf().build()); indexer.add_record(alignment_context, chunk).expect("error indexing"); let index = indexer.build(1 as usize); let index_path = vcf_path.clone().into_os_string().into_string().unwrap() + ".csi"; let index_file = File::create(&index_path).expect("error creating index file"); let mut index_writer = csi::Writer::new(index_file); index_writer.write_index(&index).expect("error writing index"); wtr.finish().expect("error finishing bgzf"); return vcf_path.into_os_string().into_string().unwrap(); } ```

Originally posted by @brentp in https://github.com/zaeleus/noodles/issues/214#issuecomment-1808084159


When using noodles-csi, the column indices when reading and writing a tabix header are off by one. The error comes from the tabix header using 1-based column indices, whereas noodles normalizes them to be 0-based.

This only affects the CSI reader and writer, not tabix.

zaeleus commented 1 year ago

@brentp, this issue is fixed in noodles 0.58.0 / noodles-csi 0.27.0.

The only other issue in the given example is that you must track the reference sequence names, in the case of VCF. E.g., since there's only one record and reference sequence name:

-    let mut indexer =
-        csi::index::Indexer::default().set_header(csi::index::header::Builder::vcf().build());
+    let reference_sequence_names = [chrom].into_iter().collect();
+    let header = csi::index::header::Builder::vcf()
+        .set_reference_sequence_names(reference_sequence_names)
+        .build();
+    let mut indexer = csi::index::Indexer::default().set_header(header);

Alternatively, use tabix::index::Indexer. This works on reference sequence names instead of IDs and adds them to the index header on build.

brentp commented 1 year ago

thanks very much @zaeleus ! Indeed it's working with latest version and that change.