prepare-conll-coref does not convert AIDA-YAGO2-dataset

userofgithub1 commented 6 years ago

I tried running prepare-conll-coref on the AIDA-YAGO2 dataset:

$ neleval prepare-conll-coref /path/to/AIDA-YAGO2-dataset.tsv

No file is generated. How can I convert the CoNLL-AIDA dataset format to the neleval format?

Thanks in advance,

jnothman commented 6 years ago

CoNLL coref is not the same as CoNLL-AIDA. I have clarified this in the docs. Please remind me what the AIDA dataset looks like?

userofgithub1 commented 6 years ago

Oops, sorry, I missed that in the docs. Here is the CoNLL-AIDA format:

-DOCSTART- (1 EU)
EU  B   EU  --NME--
rejects
German  B   German  Germany http://en.wikipedia.org/wiki/Germany    /m/0345h
call
to
boycott
British B   British United_Kingdom  http://en.wikipedia.org/wiki/United_Kingdom /m/07ssc
lamb
.

Peter   B   Peter Blackburn --NME--
Blackburn   I   Peter Blackburn --NME--

BRUSSELS    B   BRUSSELS    Brussels    http://en.wikipedia.org/wiki/Brussels   /m/0177z
1996-08-22

The
European    B   European Commission European_Commission http://en.wikipedia.org/wiki/European_Commission    /m/02q9k
Commission  I   European Commission European_Commission http://en.wikipedia.org/wiki/European_Commission    /m/02q9k
said
on
Thursday
it
disagreed

And this is the system output format, which I believe neleval evaluate accepts:

1164testb RUGBY 1474    1491    en.wikipedia.org/wiki/Andrea_Castellani 1.0 PERSON
1164testb RUGBY 1452    1471    en.wikipedia.org/wiki/Alessandro_Moscardi   1.0 PERSON
1164testb RUGBY 1433    1449    en.wikipedia.org/wiki/Nicola_Mazzucato  1.0 ORG
1164testb RUGBY 1416    1430    en.wikipedia.org/wiki/Gianluca_Guidi    1.0 PERSON

Thank you so much,

jnothman commented 6 years ago

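def aida_to_neleval(f, iob_col=1, kbid_col=3):
    # Columns: 0 = token, 1 = B/I tag, 2 = mention text, 3 = KB identifier.
    # Prints docid, start, end, kbid per mention. Offsets count tokens from 1
    # within each document, and end is exclusive.
    def emit(end):
        # Flush the mention currently being built, if there is one.
        if 'start' not in cur:
            return
        kbid = cur['kbid']
        if kbid == '--NME--':
            kbid = 'NIL0000000'
        print(docid, cur['start'], end, kbid, sep='\t')
        del cur['start']
        del cur['kbid']

    cur = {}
    docid = None
    offset = 0
    for l in f:
        l = l.rstrip()
        if not l:
            continue
        elif l.startswith('-DOCSTART-'):
            # A mention still open here ended on the previous document's last token.
            emit(offset + 1)
            docid = l[len('-DOCSTART-'):].strip().replace(' ', '_')
            offset = 0
        else:
            offset += 1
            cols = l.split('\t')
            if len(cols) == 1:
                # A plain token closes any open mention.
                emit(offset)
                continue
            if cols[iob_col] == 'B' or cols[kbid_col] != cur.get('kbid'):
                emit(offset)
                cur['start'] = offset
                cur['kbid'] = cols[kbid_col]
    # Flush a mention that runs to the end of the input.
    emit(offset + 1)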
if __name__ == '__main__':
    import io

    f = io.StringIO('''
-DOCSTART- (1 EU)
EU\tB\tEU\t--NME--
rejects
German\tB\tGerman\tGermany\thttp://en.wikipedia.org/wiki/Germany\t/m/0345h
call
to
boycott
British\tB\tBritish\tUnited_Kingdom\thttp://en.wikipedia.org/wiki/United_Kingdom\t/m/07ssc
lamb
.

Peter\tB\tPeter Blackburn\t--NME--
Blackburn\tI\tPeter Blackburn\t--NME--

BRUSSELS\tB\tBRUSSELS\tBrussels\thttp://en.wikipedia.org/wiki/Brussels\t/m/0177z
1996-08-22

The
European\tB\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
Commission\tI\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
said
on
Thursday
it
disagreed
    ''')
    aida_to_neleval(f)

This outputs:

(1_EU)  1   2   NIL0000000
(1_EU)  3   4   Germany
(1_EU)  7   8   United_Kingdom
(1_EU)  10  12  NIL0000000
(1_EU)  12  13  Brussels
(1_EU)  15  17  European_Commission

I'll look into making a new command out of it.

userofgithub1 commented 6 years ago

Thank you so much. Sorry for the late reply; I've been super busy with other tasks. I will test your code as soon as I get back to this task.

Thanks again :)

userofgithub1 commented 6 years ago

Awesome. I just had to decode before splitting to resolve this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 6: ordinal not in range(128). So I changed cols = l.split('\t') to cols = l.decode('utf-8').split('\t').
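An alternative to decoding each line, sketched here on the assumption that the file is UTF-8 encoded, is to open it with an explicit encoding so that every line already arrives as text. io.open accepts the encoding argument on both Python 2 and Python 3. The helper name read_aida_lines is hypothetical and not part of neleval.

```python
import io


def read_aida_lines(path, encoding='utf-8'):
    """Yield decoded text lines from an AIDA TSV file (hypothetical helper)."""
    # io.open decodes while reading, so no per-line .decode() call is needed.
    with io.open(path, encoding=encoding) as f:
        for line in f:
            yield line
```

Since the converter above only iterates over its input, the generator can be passed straight in, for example aida_to_neleval(read_aida_lines('/path/to/AIDA-YAGO2-dataset.tsv')).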

I forgot to mention that in some rows the kbid contains escaped characters, for example People\u0027s_Republic_of_China where it should be People's_Republic_of_China. Complete rows look like this:

.

But
Le  B   Le Matin    Le_Matin_\u0028France\u0029 http://en.wikipedia.org/wiki/Le_Matin_(France)  /m/03nrccn
Matin   I   Le Matin    Le_Matin_\u0028France\u0029 http://en.wikipedia.org/wiki/Le_Matin_(France)  /m/03nrccn
newspaper
,
quoting
witnesses

How can I fix the format if I've decoded in UTF-8 when splitting?
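One possible fix, assuming the only escapes in the kbid column are literal \uXXXX sequences with exactly four hex digits, is to replace each sequence with the character it names after the line has been split. The sketch below assumes Python 3 (on Python 2, unichr would take the place of chr), and the name unescape_kbid is hypothetical. It uses a regex substitution rather than a round trip through the unicode_escape codec, because that round trip can mangle non-ASCII characters that have already been decoded from UTF-8.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
_ESCAPE_RE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_kbid(kbid):
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), kbid)


print(unescape_kbid(r'People\u0027s_Republic_of_China'))  # People's_Republic_of_China
print(unescape_kbid(r'Le_Matin_\u0028France\u0029'))      # Le_Matin_(France)
```

In the converter this would be applied to cols[kbid_col] before the value is stored; identifiers without escapes pass through unchanged.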

Also, could you take a look at this related issue: https://github.com/wikilinks/nel/issues/21 ? It concerns Conll.py, where the offsets are calculated completely wrong. I tried many different approaches, including applying your method there, but the offsets still come out wrong. Many thanks :)

wikilinks/neleval, issue #45