zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

Noodles-vcf generates invalid record error on PEDIGREE header #201

Closed xiamaz closed 12 months ago

xiamaz commented 12 months ago

Illumina DRAGEN generates an PEDIGREE vcf header of the following format:

##PEDIGREE=<Child=XXX,Mother=XXX,Father=XXX>

This is an valid format for VCFv4.2, but has been superseded in VCFv4.3 with the following equivalent entry:

##PEDIGREE=<ID=XXX,Mother=XXX,Father=XXX>

Expected behavior should be parsing the Child-Entry or First entry always as ID.

This was tested on noodles-vcf 0.34. Will test on a newer version.

xiamaz commented 12 months ago

Confirmed this also fails on the current master.

zaeleus commented 12 months ago

Thanks for reporting! noodles doesn't claim VCF 4.2 support, but it is largely compatible with it.

Expected behavior should be parsing the Child-Entry or First entry always as ID.

Reviewing § 5.4.11 "Clonal derivation relationships" (4.2 (2022-08-23) and 4.3 (2022-11-27)), I think it's a little more nuanced than that.

It was agreed upon in https://github.com/samtools/hts-specs/issues/96#issuecomment-123861420 that Derived and Child are naturally the IDs of pedigree records. The "general form" including "Name_0" is ambiguous, particularly when the fields are unordered. E.g., the following records don't have IDs in the first position:

##PEDIGREE=<Original=GermlineID,Derived=DerivedID>
##PEDIGREE=<Mother=MotherID,Child=ChildID,Father=FatherID>
##PEDIGREE=<Name_1=G1-ID,Name_0=G0-ID,Name_N=GN-ID>

One strategy would be to reintroduce a pedigree record type back into the header to track the ID key. In this case, it would only support Derived and Child as ID keys in < 4.3 and ID in >= 4.3.

xiamaz commented 12 months ago

Thanks for looking into this further. Yes, in this case parsing should follow the rules laid out there. If you are fine with adding more VCF 4.2 support, I can also look into extending the PR I already created.

Unfortunately VCF version conversion tools don't really exist, so extending compatibility would be great if that were possible!

zaeleus commented 12 months ago

de55838e4211227b6e4e7b7fce83bb03b9b79dc8 adds a specialized PEDIGREE parser that allows Child or Derived to act as ID when the file format is VCF 4.2. It also preserves the tag used for the ID field when reserializing the header. While this won't cover all (ab)uses of the record, it does solve the issue for your original example.