zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
467 stars 53 forks source link

Parse pedigree field more completely #134

Closed Euphrasiologist closed 1 year ago

Euphrasiologist commented 1 year ago

Hi, great crate, and thank you for your hard work on this.

I have a question about VCF pedigree parsing. I know you can parse a DB link but I couldn't see anywhere you can parse something like this:

It is possible to record relationships between genomes using the following syntax:
##PEDIGREE=<Name_0=G0-ID,Name_1=G1-ID,...,Name_N=GN-ID>

As per the VCF spec. Is there plans to include this? I might be able to pull together a PR if this functionality is missing... Cheers, M

zaeleus commented 1 year ago

The parser for PEDIGREE header records was removed because they're too poorly defined in the spec. You can still access them through vcf::Header::other_records as a Map<Other>, e.g.,

use noodles_vcf::{self as vcf, header::record::value::Other};

const RAW_HEADER: &str = r#"##fileformat=VCFv4.3
##PEDIGREE=<ID=SampleID,Name_1=Ancestor_1,Name_N=Ancestor_N>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
"#;

fn main() -> Result<(), vcf::header::ParseError> {
    let header: vcf::Header = RAW_HEADER.parse()?;

    let key = vcf::header::record::Key::from("PEDIGREE");
    let records = header
        .other_records()
        .get(&key)
        .expect("missing PEDIGREE records");

    for record in records {
        let Other::Map(map) = record else {
            panic!("record is not a map");
        };

        let fields = map.other_fields();

        eprintln!(
            "ID={} Name_1={:?} Name_N={:?} DNE={:?}",
            map.id(),
            fields.get("Name_1"),
            fields.get("Name_N"),
            fields.get("DNE"),
        );
    }

    Ok(())
}
ID=SampleID Name_1=Some("Ancestor_1") Name_N=Some("Ancestor_N") DNE=None
Euphrasiologist commented 1 year ago

Thank you for explaining this! Makes perfect sense :)