References directories to compare apples and apples

jnothman commented 10 years ago

I propose that under references/ we divide the system outputs into directories representing the different task settings, split as follows:

references/gold-mentions: the system attempted to link all gold mentions, including NILs (?schwa-linkable)
references/gold-linked-mentions: the system attempted to link only gold linked mentions (aida, houlsby)
references/system-mentions: the system identified its own mentions (schwa, tagme)
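As a rough illustration of the split above, here is a minimal Python sketch. The `SETTINGS` table and the helpers `reference_path` and `comparable` are hypothetical names made up for this example, not part of conll03_nel_eval; the system names are the ones listed above, with schwa-linkable still tentative.

```python
from pathlib import PurePosixPath

# Task settings and example systems as listed in the proposal.
# The "?" on schwa-linkable in the proposal marks it as tentative.
SETTINGS = {
    "gold-mentions": ["schwa-linkable"],
    "gold-linked-mentions": ["aida", "houlsby"],
    "system-mentions": ["schwa", "tagme"],
}


def reference_path(system, root="references"):
    """Return the proposed setting-qualified path for a system's output."""
    for setting, systems in SETTINGS.items():
        if system in systems:
            return PurePosixPath(root) / setting / system
    raise KeyError(f"no task setting recorded for {system!r}")


def comparable(a, b):
    """Two outputs are directly comparable only if they share a task setting."""
    return reference_path(a).parent == reference_path(b).parent


print(reference_path("aida"))
print(comparable("aida", "houlsby"), comparable("aida", "tagme"))
```

The point of the layout is that a comparison is only apples-to-apples when both outputs sit in the same setting directory.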

There's still the potential for the entries in the directories not to be altogether comparable with one another. For example, we could subdivide system-mentions into those that generate NEs only (schwa), and those that include other wikilinks (tagme); we could subdivide gold-mentions according to whether the system had access to CoNLL 2003 type annotations (although this may be harder to infer).

There is also the question of whether the directory structure should similarly be utilised to label (a) the corpus being evaluated (e.g. CoNLL vs ?IITB; testa vs testb), and (b) the ID mapping.

benhachey commented 10 years ago

Also:

references/gold-linked-aidacandidates: Same as references/gold-linked-mentions, but uses aida_means.tsv.bz2 for candidate generation. I.e., the precise Hoffart et al. (2011) task setting.

jnothman commented 10 years ago

I still don't see the difference between that and the setting where a system's input is those mentions in the gold that are linked... assuming this version of the gold, which for now is all we have.

On 23 June 2014 21:27, Ben Hachey notifications@github.com wrote:

> Also:
>
> references/gold-linked-aidacandidates: Same as references/gold-linked-mentions, uses YAGO means/label relationships for candidate generation. I.e., the precise Hoffart et al. (2011) task setting.

— Reply to this email directly or view it on GitHub https://github.com/wikilinks/conll03_nel_eval/issues/53#issuecomment-46922515 .

wejradford commented 10 years ago

I agree with the first structure points.

I think we keep the means dataset, as the goal is to demystify the evaluation (and its knobs and levers).

> There is also the question of whether the directory structure should similarly be utilised to label (a) the corpus being evaluated (e.g. CoNLL vs ?IITB; testa vs testb), and (b) the ID mapping.

I favour putting in conll or similar, but am not sure about ID mappings. They're nice regression test fodder, but we shouldn't really need them, as a user can run the appropriate commands to generate them.

benhachey commented 10 years ago

@jnothman - The difference is in the candidates (not the mentions).
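A toy sketch of that distinction, under loud assumptions: the `MEANS` dictionary below is hypothetical stand-in data (not the real contents or format of aida_means.tsv.bz2), and `own_generator` is a made-up system-specific candidate generator. The mentions are identical in both settings; only the candidate sets differ.

```python
# Hypothetical stand-in for a means dictionary: mention string -> candidate entities.
MEANS = {
    "Sydney": {"Sydney", "Sydney_FC"},
    "Kent": {"Kent", "Kent_County_Cricket_Club"},
}


def candidates_unrestricted(mention, system_candidates):
    """gold-linked-mentions: the system uses whatever candidates it generates."""
    return set(system_candidates(mention))


def candidates_aida(mention, means=MEANS):
    """gold-linked-aidacandidates: candidates come only from the means dictionary."""
    return set(means.get(mention, ()))


def own_generator(mention):
    # Hypothetical system-specific candidate generator.
    return {mention, mention + "_(disambiguation)"}


mentions = ["Sydney", "Kent"]  # the same gold linked mentions in both settings
for m in mentions:
    print(m, sorted(candidates_unrestricted(m, own_generator)), sorted(candidates_aida(m)))
```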

On Tue, Jun 24, 2014 at 2:34 PM, jnothman notifications@github.com wrote:

> I still don't see the difference between that and the setting where a system's input is those mentions in the gold that are linked... assuming this version of the gold, which for now is all we have.
