zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

BCF and VCF `record_bufs` return different `RecordBuf`s #266

Closed tedil closed 2 months ago

tedil commented 3 months ago

noodles 0.74.0

Given the following example VCF file:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1,length=249250621>
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample
1       1   .       AC      A       42.42   .       AC=1;AF=0.5     GT:AD:AF        0/1:1,2:0.34

and its corresponding bcf conversion (bcftools view example.vcf -Ob > example.bcf) the following snippet fails:

fn main() -> Result<(), anyhow::Error> {
    let mut vcf_reader = std::fs::File::open("example.vcf").map(std::io::BufReader::new).map(noodles::vcf::io::Reader::new)?;
    let mut bcf_reader = std::fs::File::open("example.bcf").map(noodles::bcf::io::Reader::new)?;
    let vcf_header = vcf_reader.read_header()?;
    let bcf_header = bcf_reader.read_header()?;
    let vcf_record = vcf_reader.record_bufs(&vcf_header).next().unwrap()?;
    let bcf_record = bcf_reader.record_bufs(&bcf_header).next().unwrap()?;
    assert_eq!(vcf_record, bcf_record);
    Ok(())
}

The reason is that the RecordBuf returned by the bcf reader uses single values (when Number=A happens to be Number=1), while the RecordBuf returned by the vcf reader (correctly) uses arrays (with just a single value)

zaeleus commented 2 months ago

Great observation. This occurs because all BCF values are tagged with physical types, which is how the decoder determined field value types. Info field and samples values decoders now use the header type definitions to resolve the logical type. Note that record buffers from different formats can only be decoded into the same types iff the type definitions in the header are available, which is not guaranteed in VCF. Thanks for reporting!

Fixed in https://github.com/zaeleus/noodles/compare/2fe58ff969e9a5c0a2f451434c3727d11e8985fd...052db767de37ce708c5f5bb7f479411ed2ea54e7.

zaeleus commented 2 months ago

noodles 0.75.0/noodles-bcf 0.55.0 now uses header type definitions when decoding info and samples series values. This is stricter than the previous approach, so let me know if anything surprising happens. Thanks!