zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

How to access FORMAT fields in VCF records? #192

Closed rickymagner closed 1 year ago

rickymagner commented 1 year ago

Hi, I'm trying to parse and use these fields in VCF records but it doesn't seem like the API provides a straightforward way to do it. Here is some sample code that doesn't seem to behave the way I expected:

use noodles::vcf;
use noodles::vcf::record::genotypes::keys::key;

pub fn test_record(rec: vcf::Record) {
    let x = rec.format().get(&key::READ_DEPTH).map(|k| k.to_string()).unwrap();
    println!("Value for DP is {}", x);
}

Instead of seeing the DP values from the VCF, it seems like this just produces Value for DP is DP. I suppose the types hint at this, since I'd expect the DP values to be a Vec<_> of some kind indexed by the samples, but I can't seem to figure out how to even get something returned in this type of usable format using .format().get(). What's the intended way to generally access this type of data from a record in a VCF to use?

malthesr commented 1 year ago

You can use the noodles_vcf::Record::genotypes method to get a noodles_vcf::record::Genotypes, which holds the information you're looking for.

Here's an example with a small test VCF:

use noodles_vcf as vcf;
use vcf::record::genotypes::{keys::key, sample::Value};

pub fn test_record(record: vcf::Record) -> Result<(), Box<dyn std::error::Error>> {
    for value in record
        .genotypes()
        .values()
        .map(|sample| sample.get(&key::READ_DEPTH))
    {
        let dp = match value {
            Some(Some(Value::Integer(dp))) => dp,
            _ => panic!(),
        };

        println!("Value for DP is {dp}");
    }

    Ok(())
}

pub fn main() -> Result<(), Box<dyn std::error::Error>> {
    let data = br#"##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
1   1   .   A   .   999 PASS    .   GT:DP   0/0:10  ./0:2   1/1:1"#;

    let mut reader = vcf::reader::Builder::default().build_from_reader(&data[..])?;
    let header = reader.read_header()?;

    for result in reader.records(&header) {
        let record = result?;

        test_record(record)?;
    }

    Ok(())
}

This should print:

Value for DP is 10
Value for DP is 2
Value for DP is 1
rickymagner commented 1 year ago

Thanks for the clarification! I didn't realize the Genotypes type would also contain information about other FORMAT fields. It might help to add an example like this to the list of examples since many of them don't get into the details about how to access some of these kinds of information from the record.