Open riedelcastro opened 9 years ago
Is there an advantage to having the EntityMention know its character offset and finding token offsets when needed, vs. having token offsets and finding character offsets when needed? I feel like the majority of the structures one posits as interacting with entities (constituent spans, dependency arcs, coreference links, and relations) are more naturally thought of and operated on at the token level.
On Sat, Aug 15, 2015 at 12:41 PM, Sebastian Riedel <notifications@github.com> wrote:
This makes them unique within a document, and useful even with other tokenizations. It would also make implementing NavigableEntityMentions (see NavigableDocument) trivial. The token offsets can be derived from the character offsets, given a tokenization.
— Reply to this email directly or view it on GitHub https://github.com/wolfe-pack/wolfe/issues/164.
The navigable entity mention would make it easy to still have token offsets; that is, the entity mentions would know their token offsets as well. I would like to make the entity mentions navigable because I like to write things like 'mention.text' and 'mention.sentence' via 'import doc.navigable'. This would be super easy if the mention knows its character offset instead of its token offset, because the navigable document has a very efficient data structure to map from character offsets to everything else. I also like the idea of every object in the document graph being uniquely defined through its grounding in the raw text.
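To make the character-offset idea concrete, here is a minimal Python sketch of deriving token offsets from a character span, given a tokenization. It is not Wolfe's API: `CharSpan`, `tokenize`, and `token_range` are hypothetical names, the tokenizers are toys, and the binary search over sorted token boundaries only stands in for whatever structure NavigableDocument actually uses.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class CharSpan:
    """Half-open character range [begin, end) into the raw document text."""
    begin: int
    end: int


def tokenize(text, is_token_char=str.isalnum):
    """Toy tokenizer (assumption): maximal runs of token characters."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if is_token_char(ch):
            if start is None:
                start = i
        elif start is not None:
            spans.append(CharSpan(start, i))
            start = None
    if start is not None:
        spans.append(CharSpan(start, len(text)))
    return spans


def token_range(mention, tokens):
    """Half-open index range of the tokens that overlap `mention`.

    Assumes `tokens` are sorted and non-overlapping, so both boundary
    lists are sorted and can be binary-searched.
    """
    ends = [t.end for t in tokens]
    begins = [t.begin for t in tokens]
    first = bisect_right(ends, mention.begin)  # skip tokens ending at or before the mention
    last = bisect_left(begins, mention.end)    # stop at tokens starting at or after its end
    return first, last


text = "Sebastian Riedel works at UCL-London."
begin = text.index("UCL-London")
mention = CharSpan(begin, begin + len("UCL-London"))

# The same mention resolved against two different tokenizations.
by_alnum = tokenize(text)                               # splits "UCL-London" in two
by_space = tokenize(text, lambda c: not c.isspace())    # keeps "UCL-London." whole

mention_text = text[mention.begin:mention.end]          # analogue of 'mention.text'
```

Here `token_range(mention, by_alnum)` gives `(4, 6)` and `token_range(mention, by_space)` gives `(4, 5)`: the mention itself never changes, only the token view of it does.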
Alternatively, we can give each entity mention the index of the sentence that contains it. This would also enable navigation. It's not quite as clean to me, but fine.
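For comparison, a rough sketch of the sentence-index alternative, again in Python with hypothetical names (`Sentence`, `SentenceMention`, `mention_sentence`, `mention_tokens`) that are not taken from Wolfe. Navigation becomes a plain list lookup, but the mention is only meaningful relative to one particular sentence split and tokenization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    tokens: List[str]


@dataclass
class SentenceMention:
    sentence_index: int  # index of the sentence containing the mention
    token_begin: int     # half-open token range within that sentence
    token_end: int


def mention_sentence(doc, m):
    """Navigate from a mention to its sentence."""
    return doc[m.sentence_index]


def mention_tokens(doc, m):
    """Navigate from a mention to the tokens it covers."""
    return mention_sentence(doc, m).tokens[m.token_begin:m.token_end]


doc = [
    Sentence(["Sebastian", "Riedel", "works", "at", "UCL", "."]),
    Sentence(["He", "likes", "Scala", "."]),
]
m = SentenceMention(sentence_index=1, token_begin=2, token_end=3)
covered = mention_tokens(doc, m)
```

With this document, `covered` is `["Scala"]`. If the document is re-tokenized or the sentences are split differently, the stored indices no longer point at the same text, which is the trade-off against character offsets.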