zaeleus / noodles

Bioinformatics I/O libraries in Rust
MIT License
482 stars 52 forks source link

Allow multiple values per attribute entry for GFF #183

Closed J-Wall closed 1 year ago

J-Wall commented 1 year ago

According to the GFF spec, multiple values can be defined for a given key in the attributes column using a comma-separated list:

Multiple attributes of the same type are indicated by separating the values with the comma "," character, as in: Parent=AF2312,AB2812,abc-3 In addition to Parent, the Alias, Note, Dbxref and Ontology_term attributes can have multiple values.

I am trying to write out a GFF file which has multiple values associated with the "Alias" tag, but from what I can tell this is impossible to do, because all the commas get converted to "%2C".

@zaeleus, you comment in #97,

The uniqueness of the tags is undefined, in which I made the assumption that duplicate tags are allowed; and therefore, this cannot be a conventional map.

Given the above specification for multiple values, I'm not sure that your assumption was correct. I think it makes more sense to disallow duplicate tags in favor of comma-separated value lists.

Regarding #97, I'm wondering if would it make more sense to make Attributes something like a HashMap<String, Vec<String>>? If you still want to enable parsing of duplicate tags, you could just append the value to the relevant value in the HashMap (it would then be correctly written out as a comma-separated list).

zaeleus commented 1 year ago

You're right; good catch! A multimap, as suggested, does seem like a better and more general approach, but it will introduce a less elegant API for fields with single values (e.g., attributes.get("ID").and_then(|values| values.first())). Collapsing duplicate fields is also a nice fallback. I'll work on this change.

(It's a shame that the GFF3 spec is seemingly abandoned. It's in dire need of corrections, clarifications, and cleanups.)

J-Wall commented 1 year ago

Thanks! I had the same thought about the single value case (which would be the case most of the time). I'm not sure if this would be more or less elegant, but single values and multiple values could be handled separately.. something like

enum AttributeValue {
    Single(String),
    Multiple(Vec<String>),
}

This might also make it a bit easier to explicitly disallow having no value?

The other option would be to only use a Vec for the attributes which are explicitly allowed to have multiples as per the spec:

In addition to Parent, the Alias, Note, Dbxref and Ontology_term attributes can have multiple values.

but given user-created attributes are allowed, maybe that's a bad idea...

(It's a shame that the GFF3 spec is seemingly abandoned. It's in dire need of corrections, clarifications, and cleanups.)

Totally agree!

zaeleus commented 1 year ago

I ended up changing Attributes to a typed map, i.e., key-value pairs where each value can either be a String or Array. When parsing, value types are inferred by whether it has a comma literal. When building, conversions to a Value are implemented from String and Vec<String>. Examples:

"Parent=AF2312".parse()?;
// => {"Parent": Value::String("AF2312")}

"Parent=AF2312,AB2812".parse()?;
// => {"Parent": Value::Array(["AF2312", "AB2812"])}

// Duplicate tags have their values merged.
"Parent=AF2312;Parent=AB2812".parse()?;
// => {"Parent": Value::Array(["AF2312", "AB2812"])}

let mut attributes = Attributes::default();

// Insert a string.
attributes.insert(Tag::from("Parent"), Value::from("AF2312"));
// => "Parent=AF2312"

// The caller knows `Parent` is a string.
let parent = attributes.get("Parent").and_then(|value| value.as_string()).unwrap();
// => Some("AF2312")

// If not, `Value` implements `IntoIterator`. For strings, it returns a single item.
if let Some(value) = attributes.get("Parent") {
    for v in value { /* ... */ }
}
// => v[0] = "AF2312"

// Insert an array.
attributes.insert(Tag::from("Parent"), Value::from(vec![String::from("AF2312"), String::from("AB2812")]));
// => "Parent=AF2312,AB2812"

// The caller knows `Parent` is an array.
let parent = attributes.get("Parent").and_then(|value| value.as_array()).unwrap();
// => Some(["AF2312", "AB2812"])

// If not, `IntoIterator` can be conveniently used.
if let Some(value) = attributes.get("Parent") {
    for v in value { /* ... */ }
}
// => v[0] = "AF2312"
// => v[1] = "AB2812"

// The serializer still handles percent-encoding.
attributes.insert(Tag::from("Parent"), Value::from("AF2312,AB2812"))
// => "Parent=AF2312%2CAB2812"

This was originally implemented a multimap, but

1) allocating a Vec<String> for the common case (single string values) was expensive and 2) this reduces the chance the caller discards values with multiple items.