zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
477 stars 53 forks source link

Parsing VCF fails for ID in header #241

Closed apraga closed 2 months ago

apraga commented 5 months ago

Hi,

Thanks for you package. Some data (Clinvar in my case) has a ID in header ##ID=<Description="ClinVar Variation ID">

Minimal working example

##fileformat=VCFv4.1
##ID=<Description="ClinVar Variation ID">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   1   2   A   G   .   .   .

Parsing fails with

calledResult::unwrap()on anErrvalue: Custom { kind: InvalidData, error: InvalidRecord(InvalidValue(InvalidOtherMap(Other("ID"), ParseError { id: None, kind: MissingId }))) }

Removing the offending header line solve the issue. I'm not sure if that should be managed by noodles but figured it would help to report it nevertheless.

Thanks,

zaeleus commented 5 months ago

Thanks for reporting, @apraga. This case is undefined in VCF 4.1/4.2. I asked for clarification in https://github.com/samtools/hts-specs/issues/760.

zaeleus commented 2 months ago

This behavior remains undefined, but for inputs that are VCF < 4.3, noodles will now read other header record values that start with a map prefix (<) as a string if there is no map identifier (i.e., ID=), e.g.,

##fileformat=VCFv4.2
##A=<a>
##B=<ID=b>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
Header {
    file_format: FileFormat { major: 4, minor: 2 },
    other_records: {
        Other("A"): Unstructured(["<a>"]),
        Other("B"): Structured({
            "b": Map { inner: Other { id_tag: Standard(Id) }, other_fields: {} },
        }),
    },
    // ...
}

This is explicitly invalid in VCF >= 4.3 and will continue to fail, e.g.,

##fileformat=VCFv4.3
##A=<a>
##B=<ID=b>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
Error: Custom {
    kind: InvalidData,
    error: InvalidRecord(InvalidValue(InvalidOtherMap(
        Other("A"),
        ParseError {
            id: None,
            kind: InvalidField(InvalidKey(UnexpectedEof)),
        },
    ))),
}

These rules also apply to the serializer when writing a VCF header.

apraga commented 2 months ago

Thanks @zaeleus for this fix, that solves my issue !