wikilinks / neleval

Entity disambiguation evaluation and error analysis tool
Apache License 2.0

segfault? #2

Closed krivard closed 10 years ago

krivard commented 10 years ago

Hi, I'm getting a segfault on evaluation:

$ ./nel evaluate -m all -f tab -g ../scoring/gold/e54_v11.kbp_train.combined.tab ../scoring/0.4/kbp_train.combined.tsv 
neleval/evaluate.py:173: StrictMetricWarning: Strict P/R defaulting to zero score for zero denominator
  StrictMetricWarning)
./nel: line 2: 11358 Segmentation fault      (core dumped) python -m neleval.__main__ "$@"

I've tried this on two different machines, and on both my train and test splits of the LDC2014E54 data, without luck. Oddly, it will run on the first half and the last half of a file, even on the first two-thirds and the last two-thirds, but not on the whole thing. Any ideas?

jnothman commented 10 years ago

Hi @krivard, a segfault is a bit surprising, but I guess this is what one pays for speed.

Could you please tell us which versions of Python, numpy and scipy you are using?

You could find the numpy and scipy versions with:

pip freeze | grep -e numpy -e scipy

Otherwise, it might be best to debug using a copy of your data. Please forward it to joel.nothman@gmail.com.

jnothman commented 10 years ago

Also, if the versions of numpy and scipy are not their latest stable releases, are you able to upgrade and test again?

krivard commented 10 years ago

Python: 2.7.3 numpy: 1.8.0 scipy: 0.13.1

It looks like numpy 1.8.1 is available, is that what you're using? I can try that Monday.

I'll send along my data too.

jnothman commented 10 years ago

I'd be more surprised if the segfault was in numpy than in scipy. The numpy operations are relatively innocuous. But the error message you show doesn't give me enough detail to tell where in the Python code this is breaking.
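
One way to get that detail is Python's faulthandler module, which dumps the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. The following is a minimal sketch, not something neleval ships: it assumes a Python 3 interpreter (faulthandler is in the standard library from 3.3; for 2.7 it was only available as a separate backport package), and the helper name run_module_with_traceback and its log_file parameter are made up for illustration.

```python
import faulthandler
import runpy
import sys


def run_module_with_traceback(module_name, argv, log_file=None):
    """Run a module as if by `python -m module_name argv...`.

    faulthandler is enabled first, so a segfault inside a C extension
    writes the Python traceback of every thread to log_file
    (sys.stderr when log_file is None) before the process dies.
    Hypothetical helper, for illustration only.
    """
    target = log_file if log_file is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    old_argv = sys.argv
    sys.argv = [module_name] + list(argv)
    try:
        # alter_sys=True makes the module see itself as __main__.
        runpy.run_module(module_name, run_name="__main__", alter_sys=True)
    finally:
        sys.argv = old_argv
```

Assuming the package is importable, the failing command would then be started as run_module_with_traceback("neleval.__main__", ["evaluate", "-m", "all", "-f", "tab", "-g", gold_path, system_path]), where gold_path and system_path stand for the two files in the original report.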

jnothman commented 10 years ago

So far I've identified that:

jnothman commented 10 years ago

The segfault is the system's very bad way of saying there are mentions assigned to more than one cluster (in the gold standard!):

NYT_ENG_20090922.0042   315     322     E0264963        1.0     GPE
NYT_ENG_20090922.0042   315     322     NIL0835 1.0     ORG
...
eng-NG-31-142064-9994272        1628    1637    NIL0032 1.0     PER
eng-NG-31-142064-9994272        1628    1637    NIL0340 1.0     PER
...
eng-NG-31-142064-9994272        673     680     E0339633        1.0     GPE
eng-NG-31-142064-9994272        673     680     E0477916        1.0     GPE

It is possible that this is related to a bug in our conversion of the gold standard to the tab format, as I see other duplicate rows, such as:

eng-NG-31-142064-9994272        2267    2271    NIL0107 1.0     GPE
eng-NG-31-142064-9994272        2267    2271    NIL0107 1.0     GPE
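
Both kinds of problem row can be found before evaluation with a short script. This is a rough sketch rather than neleval's own validation: it assumes the first four whitespace-separated fields of each row are document ID, start offset, end offset and KB/NIL ID, as in the rows shown above, and the function name find_span_conflicts is hypothetical.

```python
from collections import defaultdict


def find_span_conflicts(lines):
    """Group rows by (doc_id, start, end) and flag repeated spans.

    Returns two dicts keyed by span:
      conflicts   - spans linked to more than one distinct ID
      exact_dupes - spans where the same ID is listed more than once
    Assumes fields are whitespace-separated and contain no spaces.
    """
    ids_by_span = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        doc_id, start, end, kb_id = fields[:4]
        ids_by_span[(doc_id, int(start), int(end))].append(kb_id)

    conflicts = {s: ids for s, ids in ids_by_span.items()
                 if len(set(ids)) > 1}
    exact_dupes = {s: ids for s, ids in ids_by_span.items()
                   if len(ids) > len(set(ids))}
    return conflicts, exact_dupes


# Rows copied from the gold file excerpts in this thread.
sample_rows = [
    "NYT_ENG_20090922.0042\t315\t322\tE0264963\t1.0\tGPE",
    "NYT_ENG_20090922.0042\t315\t322\tNIL0835\t1.0\tORG",
    "eng-NG-31-142064-9994272\t2267\t2271\tNIL0107\t1.0\tGPE",
    "eng-NG-31-142064-9994272\t2267\t2271\tNIL0107\t1.0\tGPE",
]

conflicts, exact_dupes = find_span_conflicts(sample_rows)
for span, ids in conflicts.items():
    print("conflicting IDs for", span, ids)
for span, ids in exact_dupes.items():
    print("repeated row for", span, ids)
```

To check a whole file, pass an open file object instead of sample_rows, since the function only needs an iterable of lines.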

We shall have to resolve it by:

Thanks Kathryn/@krivard for bringing this to our attention!

jnothman commented 10 years ago

(and for giving me a better idea of what to do if someone else reports a bug)

benhachey commented 10 years ago

The duplicates are in the source data (as of E54v1.0), e.g.:

<query id="EDL14_ENG_TRAINING_1578">
    <name>Nebraska</name>
    <docid>NYT_ENG_20090922.0042</docid>
    <beg>315</beg>
    <end>322</end>
</query>
<query id="EDL14_ENG_TRAINING_4055">
    <name>Nebraska</name>
    <docid>NYT_ENG_20090922.0042</docid>
    <beg>315</beg>
    <end>322</end>
</query>

I will chase this with the organisers to see how it should be handled.

benhachey commented 10 years ago

I believe this is resolved by the updated release.

See #6 for related data validation.