zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

How to index a VCF statically or on the fly #214

Open nh13 opened 10 months ago

nh13 commented 10 months ago

I'd like to programmatically generate VCF records as part of my unit tests for various tools. In some cases I'll create new records then write them to disk to test the tool end to end (versus just passing methods the variant records themselves). The VCF needs to be indexed, and I was wondering how to do that after I've written the VCF as well as ideally on the fly.

I think to create the index, you can get the virtual position from the writer and perhaps something like this: https://github.com/zaeleus/noodles/blob/bd1098699a5a0ed87dab0c3e6c496ae13faa358d/noodles-tabix/examples/tabix_write.rs#L29-L38?

Any guidance would be greatly appreciated!

zaeleus commented 10 months ago

The tabix_write example can indeed be applied to create a tabix index for VFC on the fly. noodles-vcf/examples/vcf_index.rs shows how to create one from a file and what record information is need to build the index.

For the latter case, we could also add a fn index<P: AsRef<Path>>(src: P) -> io::Result<csi::Index> convenience function, if that would be useful.

brentp commented 10 months ago

@zaeleus , I am trying to do this with the following code, but the index does not seem to be correct. what do I miss? Thanks in advance:

    use noodles::core;
    use noodles::csi::{self as csi, index::reference_sequence::bin::Chunk};
    use noodles::vcf::header::record::value::{map::Contig, Map};
    use noodles::vcf::record::Position;

    use tempfile::tempdir;

    fn write_vcf(
        temp_dir: &PathBuf,
        chrom: String,
        start: usize,
        reference: &str,
        alternate: &str,
    ) -> String {
        let vcf_path = temp_dir.join("test.vcf.gz");
        let fs = File::create(&vcf_path).expect("error creating file");

        let mut wtr = bgzf::Writer::new(&fs);

        let mut writer = vcf::Writer::new(&mut wtr);
        let id = chrom.parse().expect("error parsing chrom");

        let contig = Map::<Contig>::new();
        let header = vcf::Header::builder().add_contig(id, contig.clone()).build();

        writer.write_header(&header).expect("error writing header");

        let start_position = writer.get_ref().virtual_position();

        let chrom = chrom.parse().expect("error parsing chrom");
        let record = vcf::Record::builder()
            .set_chromosome(chrom)
            .set_position(Position::from(start))
            .set_reference_bases(reference.parse().expect("error parsing reference"))
            .set_alternate_bases(alternate.parse().expect("error parsing alternate"))
            .set_info("AF=0.1".parse().expect("error parsing AF"))
            .build()
            .expect("error building record");

        writer.write_record(&header, &record).expect("error writing test record");

        let end_position = writer.get_ref().virtual_position();
        let chunk = Chunk::new(start_position, end_position);

        let alignment_context = Some((
            0 as usize,
            core::Position::try_from(start).expect("bad start in test"),
            core::Position::try_from(start + reference.len()).expect("bad end in test"),
            true,
        ));

        let mut indexer =
            csi::index::Indexer::default().set_header(csi::index::header::Builder::vcf().build());

        indexer.add_record(alignment_context, chunk).expect("error indexing");
        let index = indexer.build(1 as usize);

        let index_path = vcf_path.clone().into_os_string().into_string().unwrap() + ".csi";

        let index_file = File::create(&index_path).expect("error creating index file");
        let mut index_writer = csi::Writer::new(index_file);
        index_writer.write_index(&index).expect("error writing index");

        wtr.finish().expect("error finishing bgzf");
        return vcf_path.into_os_string().into_string().unwrap();
    }
zaeleus commented 10 months ago

I am trying to do this with the following code, but the index does not seem to be correct. what do I miss?

Thanks for the example. The output is indeed incorrect, and the issues for these are being tracked in #215 and #216.

zaeleus commented 10 months ago

For the latter case, we could also add a fn index<P: AsRef<Path>>(src: P) -> io::Result<csi::Index> convenience function, if that would be useful.

@nh13, this is added in noodles 0.58.0 / noodles-vcf 0.46.0 as vcf::index.