Closed J-Wall closed 1 year ago
You're right; good catch! A multimap, as suggested, does seem like a better and more general approach, but it will introduce a less elegant API for fields with single values (e.g., attributes.get("ID").and_then(|values| values.first())
). Collapsing duplicate fields is also a nice fallback. I'll work on this change.
(It's a shame that the GFF3 spec is seemingly abandoned. It's in dire need of corrections, clarifications, and cleanups.)
Thanks! I had the same thought about the single value case (which would be the case most of the time). I'm not sure if this would be more or less elegant, but single values and multiple values could be handled separately.. something like
enum AttributeValue {
Single(String),
Multiple(Vec<String>),
}
This might also make it a bit easier to explicitly disallow having no value?
The other option would be to only use a Vec
for the attributes which are explicitly allowed to have multiples as per the spec:
In addition to Parent, the Alias, Note, Dbxref and Ontology_term attributes can have multiple values.
but given user-created attributes are allowed, maybe that's a bad idea...
(It's a shame that the GFF3 spec is seemingly abandoned. It's in dire need of corrections, clarifications, and cleanups.)
Totally agree!
I ended up changing Attributes
to a typed map, i.e., key-value pairs where each value can either be a String
or Array
. When parsing, value types are inferred by whether it has a comma literal. When building, conversions to a Value
are implemented from String
and Vec<String>
. Examples:
"Parent=AF2312".parse()?;
// => {"Parent": Value::String("AF2312")}
"Parent=AF2312,AB2812".parse()?;
// => {"Parent": Value::Array(["AF2312", "AB2812"])}
// Duplicate tags have their values merged.
"Parent=AF2312;Parent=AB2812".parse()?;
// => {"Parent": Value::Array(["AF2312", "AB2812"])}
let mut attributes = Attributes::default();
// Insert a string.
attributes.insert(Tag::from("Parent"), Value::from("AF2312"));
// => "Parent=AF2312"
// The caller knows `Parent` is a string.
let parent = attributes.get("Parent").and_then(|value| value.as_string()).unwrap();
// => Some("AF2312")
// If not, `Value` implements `IntoIterator`. For strings, it returns a single item.
if let Some(value) = attributes.get("Parent") {
for v in value { /* ... */ }
}
// => v[0] = "AF2312"
// Insert an array.
attributes.insert(Tag::from("Parent"), Value::from(vec![String::from("AF2312"), String::from("AB2812")]));
// => "Parent=AF2312,AB2812"
// The caller knows `Parent` is an array.
let parent = attributes.get("Parent").and_then(|value| value.as_array()).unwrap();
// => Some(["AF2312", "AB2812"])
// If not, `IntoIterator` can be conveniently used.
if let Some(value) = attributes.get("Parent") {
for v in value { /* ... */ }
}
// => v[0] = "AF2312"
// => v[1] = "AB2812"
// The serializer still handles percent-encoding.
attributes.insert(Tag::from("Parent"), Value::from("AF2312,AB2812"))
// => "Parent=AF2312%2CAB2812"
This was originally implemented a multimap, but
1) allocating a Vec<String>
for the common case (single string values) was expensive and
2) this reduces the chance the caller discards values with multiple items.
According to the GFF spec, multiple values can be defined for a given key in the attributes column using a comma-separated list:
I am trying to write out a GFF file which has multiple values associated with the "Alias" tag, but from what I can tell this is impossible to do, because all the commas get converted to "%2C".
@zaeleus, you comment in #97,
Given the above specification for multiple values, I'm not sure that your assumption was correct. I think it makes more sense to disallow duplicate tags in favor of comma-separated value lists.
Regarding #97, I'm wondering if would it make more sense to make Attributes something like a
HashMap<String, Vec<String>>
? If you still want to enable parsing of duplicate tags, you could just append the value to the relevant value in the HashMap (it would then be correctly written out as a comma-separated list).