zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

Fixed-length `Character` fields #235

Open holtgrewe opened 6 months ago

holtgrewe commented 6 months ago

The glnexus tool writes out a FORMAT/RNC field with the following header.

##FORMAT=<ID=RNC,Number=2,Type=Character,Description="Reason for No Call in GT: . = n/a, M = Missing data, P = Partial data, I = gVCF input site is non-called, D = insufficient Depth of coverage, - = unrepresentable overlapping deletion, L = Lost/unrepresentable allele (other than deletion), U = multiple Unphased variants present, O = multiple Overlapping variants present, 1 = site is Monoallelic, no assertion about presence of REF or ALT allele">

In the data rows this looks as follows:

GT:RNC      ./.:II     1/1:15:1,14:1:16,4,0:..

In other words, it writes it out as strings of length 2. This is not allowed by noodles that expects I,I or .,..

Is this a special case that could be supported by nodles-vcf?

Also see:

zaeleus commented 6 months ago

There is indeed no consensus on how fixed-sized character arrays are encoded/decoded. I'll try to push samtools/hts-specs#631 to see if can be resolved first.

As a workaround in noodles, you can modify the RNC format header record type definition to be a single string value, e.g.,

use noodles::vcf::{self, header::{record::value::map::format::Type, Number};

let mut header = reader.read_header()?;

// In a write context, write the original header before the modification to preserve the type
// definition.
writer.write_header(&header);

if let Some(format) = header.formats_mut().get_mut("RNC") {
    *format.number_mut() = Number::Count(1);
    *format.type_mut() = Type::String;
}

let mut record = vcf::Record::default();
reader.read_record(&header, &mut record)?;

writer.write_record(&header, &record)?;
holtgrewe commented 6 months ago

Thank you very much!