Should be able to evaluate (the micro-average over documents of) within-document coreference resolution performance. With the current implementation the following approaches exist:
- append document ID to entity ID manually (or using `prepare-conll-coref`)
- score each document individually by splitting the input, then aggregate
Note that the former approach breaks for the `pairwise_negative` aggregate, as true negatives from across the corpus will be counted.
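A rough sketch of the first approach, assuming mentions are held as (doc ID, span, entity ID) tuples; the tuple layout and function name are just illustrative assumptions, not neleval's actual annotation format:

```python
# Minimal sketch of the doc-ID-prefixing workaround. The mention tuples
# and field names here are assumptions for illustration, not neleval's
# actual input format.

def prefix_entity_ids(mentions):
    """Rewrite entity IDs so they are unique across documents.

    `mentions` is an iterable of (doc_id, mention_span, entity_id) tuples.
    """
    for doc_id, span, entity_id in mentions:
        # e.g. entity "E7" in document "doc3" becomes "doc3/E7"
        yield doc_id, span, "{}/{}".format(doc_id, entity_id)

# Caveat from above: even after prefixing, pair-based measures such as
# pairwise_negative still see every cross-document mention pair as a
# true negative once the documents are pooled, inflating that count.
```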
My currently preferred solution is to add an option to `evaluate` specifying which fields to break the calculation down by: ordinarily 'doc', though 'type' might also be of interest. `evaluate` would then calculate all measures over each group, and add micro-averaged and macro-averaged results. This would also mean we could rename the `sets-micro` aggregate to `sets`.
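A minimal sketch of what the group-by evaluation could look like, assuming per-group true-positive/false-positive/false-negative counts are already computed; the function names and count structure are assumptions for illustration, not the existing `evaluate` implementation:

```python
# Sketch of grouping results by a field (e.g. 'doc') and reporting both
# micro- and macro-averages. The count inputs are assumed; this is not
# the existing `evaluate` code.
from collections import namedtuple

Counts = namedtuple("Counts", "tp fp fn")

def prf(c):
    """Precision, recall and F1 from raw counts."""
    p = c.tp / (c.tp + c.fp) if c.tp + c.fp else 0.0
    r = c.tp / (c.tp + c.fn) if c.tp + c.fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate_by_group(counts_by_group):
    """counts_by_group maps a group key (e.g. doc ID) to Counts."""
    per_group = {key: prf(c) for key, c in counts_by_group.items()}
    # Micro-average: pool the raw counts across groups, then score once.
    pooled = Counts(*(sum(getattr(c, f) for c in counts_by_group.values())
                      for f in Counts._fields))
    micro = prf(pooled)
    # Macro-average: score each group separately, then average the scores.
    n = len(per_group)
    macro = tuple(sum(s[i] for s in per_group.values()) / n
                  for i in range(3))
    return per_group, micro, macro

# Example: two documents with different counts
scores, micro, macro = evaluate_by_group({
    "doc1": Counts(tp=8, fp=2, fn=1),
    "doc2": Counts(tp=3, fp=3, fn=4),
})
```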
Thanks for expressing the need for this, @shyamupa