krivard closed this issue 10 years ago
Hi @krivard, a segfault is a bit surprising, but I guess this is what one pays for speed.
Could you please tell us which versions of:
you are using?
You could find the latter with:
pip freeze | grep -e numpy -e scipy
Otherwise, it might be best to debug using a copy of your data. Please forward it to joel.nothman@gmail.com.
Also, if the versions of numpy and scipy are not their latest stable releases, are you able to upgrade and test again?
Python: 2.7.3
numpy: 1.8.0
scipy: 0.13.1
It looks like numpy 1.8.1 is available. Is that what you're using? I can try that on Monday.
I'll send along my data too.
I'd be more surprised if the segfault were in numpy than in scipy. The numpy operations are relatively innocuous. But the error message you show doesn't give me enough detail to tell where in the Python code this is breaking.
So far I've identified that:
The segfault is the system's very bad way of saying there are mentions assigned to more than one cluster (in the gold standard!):
NYT_ENG_20090922.0042 315 322 E0264963 1.0 GPE
NYT_ENG_20090922.0042 315 322 NIL0835 1.0 ORG
...
eng-NG-31-142064-9994272 1628 1637 NIL0032 1.0 PER
eng-NG-31-142064-9994272 1628 1637 NIL0340 1.0 PER
...
eng-NG-31-142064-9994272 673 680 E0339633 1.0 GPE
eng-NG-31-142064-9994272 673 680 E0477916 1.0 GPE
It is possible that this is related to a bug in our conversion of the gold standard to the tab format, as I also see exact duplicate rows, such as:
eng-NG-31-142064-9994272 2267 2271 NIL0107 1.0 GPE
eng-NG-31-142064-9994272 2267 2271 NIL0107 1.0 GPE
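A check along these lines could flag both kinds of duplicate before evaluation. This is only a rough sketch, not the evaluator's own validation: it assumes whitespace-separated rows whose first four fields are document ID, start offset, end offset and cluster/KB ID, as in the rows quoted above, and the function name `find_duplicate_mentions` is made up for illustration.

```python
from collections import defaultdict


def find_duplicate_mentions(lines):
    """Report mention spans that appear on more than one row.

    Returns two dicts keyed by (doc_id, start, end):
      conflicting: spans assigned to more than one cluster ID
      repeated:    spans repeated with the same cluster ID (row count)
    """
    by_span = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            # Skip blank or elided lines; assumed not to be mention rows.
            continue
        doc_id, start, end, cluster = fields[:4]
        by_span[(doc_id, int(start), int(end))].append(cluster)

    conflicting, repeated = {}, {}
    for span, clusters in by_span.items():
        if len(clusters) < 2:
            continue
        if len(set(clusters)) > 1:
            conflicting[span] = sorted(set(clusters))
        else:
            repeated[span] = len(clusters)
    return conflicting, repeated
```

Passing an open file object (or any iterable of lines) works; spans in `conflicting` are the ones that would put a single mention in several clusters.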
We shall have to resolve it by:
Thanks Kathryn/@krivard for bringing this to our attention!
(and for giving me a better idea of what to do if someone else reports a bug)
The duplicates are in the source data (as of E54v1.0), e.g.:
<query id="EDL14_ENG_TRAINING_1578">
<name>Nebraska</name>
<docid>NYT_ENG_20090922.0042</docid>
<beg>315</beg>
<end>322</end>
</query>
<query id="EDL14_ENG_TRAINING_4055">
<name>Nebraska</name>
<docid>NYT_ENG_20090922.0042</docid>
<beg>315</beg>
<end>322</end>
</query>
I will chase this with the organisers to see how it should be handled.
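For anyone wanting to list the affected queries in the source XML, something like the following might do. It is a sketch under assumptions: the `<query>` elements are taken to sit under a single root element (the root is not shown in the excerpt above), and `find_duplicate_queries` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def find_duplicate_queries(xml_text):
    """Map each (docid, beg, end) span to its query IDs, keeping only
    spans that are referenced by more than one query."""
    root = ET.fromstring(xml_text)
    by_span = defaultdict(list)
    for query in root.iter("query"):
        span = (
            query.findtext("docid"),
            int(query.findtext("beg")),
            int(query.findtext("end")),
        )
        by_span[span].append(query.get("id"))
    return {span: ids for span, ids in by_span.items() if len(ids) > 1}
```

Each returned entry lists the query IDs that share one span, which is the information the organisers would need.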
I believe this is resolved by the updated release.
See #6 for related data validation.
Hi, I'm getting a segfault on evaluation:
I've tried this on two different machines, and on both my train and test splits of the LDC2014E54 data, without luck. Oddly, it will run on the first half and the last half of a file, even on the first two-thirds and the last two-thirds, but not on the whole thing. Any ideas?