There was indeed a bug in reading in the Genia training corpus: I had accidentally used the sample corpus as the training corpus. The correct training corpus contains over 20k sentences. Wolfe seems to be quite memory-hungry at the moment — I needed to provide 12 GB of RAM. Results after 20 epochs of averaged perceptron learning (~1h for training and testing) look quite good, although there is still much room for improvement. The performance is head-to-head with the winner of the 2004 shared task (72.5 F₁):
http://acl.ldc.upenn.edu/coling2004/W1/pdf/19.pdf
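Wolfe's actual training code isn't shown here, but the core update of averaged perceptron learning can be sketched as follows (a hypothetical minimal binary-classification version with lazy averaging, not the structured NER model used above):

```python
from collections import defaultdict

class AveragedPerceptron:
    """Minimal averaged perceptron over sparse feature dicts.
    Hypothetical sketch for illustration, not Wolfe's implementation."""

    def __init__(self):
        self.weights = defaultdict(float)     # current weight vector
        self.totals = defaultdict(float)      # accumulated weights for averaging
        self.timestamps = defaultdict(int)    # step at which each weight last changed
        self.step = 0

    def score(self, features):
        return sum(self.weights[f] * v for f, v in features.items())

    def update(self, features, label):
        # label is +1 or -1; the perceptron updates only on mistakes
        self.step += 1
        if label * self.score(features) <= 0:
            for f, v in features.items():
                # lazily accumulate the old weight before overwriting it
                self.totals[f] += (self.step - self.timestamps[f]) * self.weights[f]
                self.timestamps[f] = self.step
                self.weights[f] += label * v

    def average(self):
        # replace each weight with its running average over all steps,
        # which is what makes the averaged variant more stable than plain perceptron
        for f in self.weights:
            self.totals[f] += (self.step - self.timestamps[f]) * self.weights[f]
            self.timestamps[f] = self.step
            self.weights[f] = self.totals[f] / max(self.step, 1)

# toy usage: 20 epochs, mirroring the epoch count in the experiment above
data = [({"x": 1.0}, 1), ({"x": -1.0}, -1), ({"x": 2.0}, 1), ({"x": -2.0}, -1)]
model = AveragedPerceptron()
for _ in range(20):
    for feats, y in data:
        model.update(feats, y)
model.average()
```

The lazy-averaging trick (timestamps plus accumulated totals) keeps each update proportional to the number of active features rather than the full vocabulary, which matters with the feature counts a 20k-sentence corpus produces.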
Train:
  Total Gold:  109588
  Total Guess: 120081
  Precision:   0.852466
  Recall:      0.934089
  F1:          0.891413

Test:
  Total Gold:  19392
  Total Guess: 22570
  Precision:   0.673859
  Recall:      0.784292
  F1:          0.724894
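The scores above are the standard exact-match precision/recall/F1 over entity counts; a minimal sketch of the computation (the matched count 15209 is back-derived from precision × guess on the test set, not taken from the actual evaluation log):

```python
def prf1(n_correct, n_gold, n_guess):
    """Exact-match precision, recall, and F1 from entity counts."""
    p = n_correct / n_guess if n_guess else 0.0   # fraction of guesses that are right
    r = n_correct / n_gold if n_gold else 0.0     # fraction of gold entities found
    f1 = 2 * p * r / (p + r) if p + r else 0.0    # harmonic mean of the two
    return p, r, f1

# test-set counts from the table above; 15209 correct matches is a
# back-derived figure (precision x total guess), hypothetical
p, r, f1 = prf1(15209, 19392, 22570)
print(f"P={p:.6f} R={r:.6f} F1={f1:.6f}")
```

Over-generation shows up directly in these counts: with ~3k more guesses than gold entities on the test set, recall runs well ahead of precision.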