Open userofgithub1 opened 6 years ago
CoNLL coref is not the same as CoNLL-AIDA. I have clarified this in the docs. Please remind me what the AIDA dataset looks like?
Oops, sorry, I missed that in the docs. Here is the format of CoNLL-AIDA:
-DOCSTART- (1 EU)
EU B EU --NME--
rejects
German B German Germany http://en.wikipedia.org/wiki/Germany /m/0345h
call
to
boycott
British B British United_Kingdom http://en.wikipedia.org/wiki/United_Kingdom /m/07ssc
lamb
.
Peter B Peter Blackburn --NME--
Blackburn I Peter Blackburn --NME--
BRUSSELS B BRUSSELS Brussels http://en.wikipedia.org/wiki/Brussels /m/0177z
1996-08-22
The
European B European Commission European_Commission http://en.wikipedia.org/wiki/European_Commission /m/02q9k
Commission I European Commission European_Commission http://en.wikipedia.org/wiki/European_Commission /m/02q9k
said
on
Thursday
it
disagreed
And this is the format of the system output, which I believe is accepted by $ neleval evaluate:
1164testb RUGBY 1474 1491 en.wikipedia.org/wiki/Andrea_Castellani 1.0 PERSON
1164testb RUGBY 1452 1471 en.wikipedia.org/wiki/Alessandro_Moscardi 1.0 PERSON
1164testb RUGBY 1433 1449 en.wikipedia.org/wiki/Nicola_Mazzucato 1.0 ORG
1164testb RUGBY 1416 1430 en.wikipedia.org/wiki/Gianluca_Guidi 1.0 PERSON
Thank you so much,
def aida_to_neleval(f, iob_col=1, kbid_col=3):
    # Columns in a mention row (tab-separated, 0-indexed):
    # 0 token, 1 B/I tag, 2 mention text, 3 KB id, 4 Wikipedia URL, 5 Freebase id.
    # Offsets are token indices within a document; each span is [start, end).
    def emit(end):
        if 'start' not in cur:
            return
        kbid = cur['kbid']
        if kbid == '--NME--':
            kbid = 'NIL0000000'
        print(docid, cur['start'], end, kbid, sep='\t')
        del cur['start']
        del cur['kbid']

    cur = {}
    docid = None
    offset = 0
    for l in f:
        l = l.rstrip()
        if not l:
            continue
        elif l.startswith('-DOCSTART-'):
            # Flush a mention that ends on the previous document's last token.
            emit(offset + 1)
            docid = l[len('-DOCSTART-'):].strip().replace(' ', '_')
            offset = 0
        else:
            offset += 1
            cols = l.split('\t')
            if len(cols) == 1:
                emit(offset)
                continue
            if cols[iob_col] == 'B' or cols[kbid_col] != cur.get('kbid'):
                emit(offset)
                cur['start'] = offset
                cur['kbid'] = cols[kbid_col]
    # Flush a mention that ends on the last token of the file.
    emit(offset + 1)


if __name__ == '__main__':
    import io
    f = io.StringIO('''
-DOCSTART- (1 EU)
EU\tB\tEU\t--NME--
rejects
German\tB\tGerman\tGermany\thttp://en.wikipedia.org/wiki/Germany\t/m/0345h
call
to
boycott
British\tB\tBritish\tUnited_Kingdom\thttp://en.wikipedia.org/wiki/United_Kingdom\t/m/07ssc
lamb
.
Peter\tB\tPeter Blackburn\t--NME--
Blackburn\tI\tPeter Blackburn\t--NME--
BRUSSELS\tB\tBRUSSELS\tBrussels\thttp://en.wikipedia.org/wiki/Brussels\t/m/0177z
1996-08-22
The
European\tB\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
Commission\tI\tEuropean Commission\tEuropean_Commission\thttp://en.wikipedia.org/wiki/European_Commission\t/m/02q9k
said
on
Thursday
it
disagreed
''')
    aida_to_neleval(f)
outputs
(1_EU) 1 2 NIL0000000
(1_EU) 3 4 Germany
(1_EU) 7 8 United_Kingdom
(1_EU) 10 12 NIL0000000
(1_EU) 12 13 Brussels
(1_EU) 15 17 European_Commission
I'll look into making a new command out of it.
Thank you so much. Sorry for the late reply; I've been super busy with other tasks. I will test your code as soon as I get back to this task.
Thanks again :)
Awesome. I just had to decode each line before splitting, to resolve UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 6: ordinal not in range(128)
so I changed cols = l.split('\t')
to cols = l.decode('utf-8').split('\t')
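For anyone hitting the same UnicodeDecodeError: an alternative to decoding each line by hand is to open the file with an explicit encoding, so every line is already text by the time it is split. This is only a sketch; the helper name read_rows and the throwaway demo file are made up for illustration and are not part of neleval.

```python
import io
import os
import tempfile


def read_rows(path):
    """Yield each non-empty line of a UTF-8 file as a list of tab-separated columns."""
    # io.open decodes while reading, so every line is already text (not bytes).
    with io.open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\r\n')
            if line:
                yield line.split('\t')


# Demo on a throwaway file with a non-ASCII token
# (0xc5 is the first byte of "Ł" in UTF-8).
with tempfile.NamedTemporaryFile('wb', suffix='.tsv', delete=False) as tmp:
    tmp.write(u'Łódź\tB\tŁódź\tŁódź\nwon\n'.encode('utf-8'))
rows = list(read_rows(tmp.name))
os.remove(tmp.name)
print(rows)
```

Passing a file object opened this way to aida_to_neleval should make the per-line decode unnecessary, since l.split('\t') then operates on text.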
I forgot to mention that in some rows the kbid contains escaped Unicode code points, so it looks something like: People\u0027s_Republic_of_China
while it should be People's_Republic_of_China
Complete rows look like this:
.
But
Le B Le Matin Le_Matin_\u0028France\u0029 http://en.wikipedia.org/wiki/Le_Matin_(France) /m/03nrccn
Matin I Le Matin Le_Matin_\u0028France\u0029 http://en.wikipedia.org/wiki/Le_Matin_(France) /m/03nrccn
newspaper
,
quoting
witnesses
How can I fix the format if I've decoded in UTF-8 when splitting?
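One possible way to handle this, sketched under the assumption that the escapes are always a literal backslash, a "u", and exactly four hex digits (as in the rows above), is to replace each escape with the character it encodes after the line has been decoded. The function name unescape_kbid is hypothetical, and the sketch assumes Python 3 (on Python 2, unichr would be needed instead of chr). It does not combine surrogate pairs, so code points outside the Basic Multilingual Plane would need extra handling.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits, e.g. \u0027.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_kbid(kbid):
    """Replace literal \\uXXXX sequences with the characters they encode."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), kbid)


print(unescape_kbid(r'People\u0027s_Republic_of_China'))  # People's_Republic_of_China
print(unescape_kbid(r'Le_Matin_\u0028France\u0029'))      # Le_Matin_(France)
```

A regular expression is used here rather than the unicode_escape codec because that codec assumes Latin-1 input and can mangle non-ASCII characters that are already decoded. The function could be applied to cols[kbid_col] just before the id is stored.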
Also, could you take a look at this related issue: https://github.com/wikilinks/nel/issues/21 ? It's in Conll.py: the offsets are calculated completely wrong. I tried many different ways, and also tried to apply your method there, but it still doesn't calculate them correctly. Many, many thanks :)
I tried running the prepare-conll-coref command:
$ neleval prepare-conll-coref /path/to/AIDA-YAGO2-dataset.tsv
But no file is generated. How can I convert the CoNLL-AIDA dataset format to the neleval format?
Thanks in advance,