wpm / mycroft

Text classifier
MIT License

Reproduce published results from Yoon Kim 2014 #11

Closed wpm closed 7 years ago

wpm commented 7 years ago

Use this code to reproduce published results for RNNs for text classification. Put the results and references in the README.

Nothing exhaustive, just reproduce something to show that it's working.

wpm commented 7 years ago

Maybe the Stanford sentiment analysis, or IMDB dataset.

wpm commented 7 years ago

This Keras tutorial gets 95% accuracy on 20 Newsgroups.

I'm nowhere near this. I must have a bug.
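For comparison, the model in that tutorial is roughly the stack below. This is a hedged sketch, not Mycroft's code: the tutorial initializes the embedding layer from pretrained GloVe vectors (omitted here), and `MAX_LEN`, `VOCAB_SIZE`, and `EMBED_DIM` are placeholder values.

```python
# Rough sketch of the Keras tutorial's 20 Newsgroups model (not Mycroft code).
# The tutorial seeds the Embedding layer with GloVe weights; that is omitted here.
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, MaxPooling1D, GlobalMaxPooling1D, Dense

MAX_LEN, VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 1000, 20000, 100, 20  # placeholders

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
    Conv1D(128, 5, activation="relu"),
    MaxPooling1D(5),
    Conv1D(128, 5, activation="relu"),
    MaxPooling1D(5),
    Conv1D(128, 5, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(128, activation="relu"),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="rmsprop", loss="categorical_crossentropy", metrics=["accuracy"])
```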

wpm commented 7 years ago

I was confused about what the data set for the Stanford Sentiment Treebank actually is. I think the answer is that published results only consider full sentences.

wpm commented 7 years ago

Yoon Kim 2014, "Convolutional neural networks for sentence classification", gives results for a convolutional neural network classifier over several different data sets.

The first step is to assemble all the data sets described in this paper. Reproducing this paper's experiments with Mycroft's models would be the right thing to do for this issue.
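For orientation, Kim's base model is small. A minimal Keras sketch of that architecture (not Mycroft's implementation) could look like the following; the filter widths (3/4/5), 100 feature maps per width, dropout of 0.5, and Adadelta follow the paper, while `VOCAB_SIZE`, `EMBED_DIM`, `MAX_LEN`, and `NUM_CLASSES` are placeholders.

```python
# Hedged sketch of the Kim 2014 sentence-classification CNN (Keras functional API).
from keras.models import Model
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Concatenate, Dropout, Dense

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 20000, 300, 60, 5  # placeholders

words = Input(shape=(MAX_LEN,), dtype="int32")
embedded = Embedding(VOCAB_SIZE, EMBED_DIM)(words)
# One convolution per filter width, each max-pooled over time.
pooled = [GlobalMaxPooling1D()(Conv1D(100, width, activation="relu")(embedded))
          for width in (3, 4, 5)]
merged = Concatenate()(pooled)
output = Dense(NUM_CLASSES, activation="softmax")(Dropout(0.5)(merged))

model = Model(inputs=words, outputs=output)
model.compile(optimizer="adadelta", loss="categorical_crossentropy", metrics=["accuracy"])
```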

wpm commented 7 years ago

The datasets from Yoon Kim's paper:

- MR (movie reviews)
- SST-1 (Stanford Sentiment Treebank, fine-grained)
- SST-2 (Stanford Sentiment Treebank, binary)
- Subj (subjectivity)
- TREC (question classification)
- CR (customer reviews)
- MPQA (opinion polarity)

wpm commented 7 years ago

The CR link is 404. (And the paper is written in Microsoft Word. Ick.)

wpm commented 7 years ago

Data Set Statistics

SST

| Partition | Longest Text | Vocabulary Size |
|-----------|--------------|-----------------|
| Train     | 55           | 1,864,603       |
| Dev       | 49           | 20,432          |
| Test      | 56           | 41,724          |

Set the sequence length to 60. Maybe increase the vocabulary size to 40,000.

MR

| Longest Text | Vocabulary Size |
|--------------|-----------------|
| 2,649        | 2,242,761       |

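A minimal sketch of the preprocessing those numbers suggest, assuming Keras's `Tokenizer` and `pad_sequences`: cap the vocabulary at 40,000 words and pad or truncate every text to 60 tokens. The `texts` list is a placeholder for the real training sentences.

```python
# Minimal sketch: vocabulary capped at 40,000, sequences padded/truncated to 60 tokens.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 40000
SEQUENCE_LENGTH = 60

texts = ["a placeholder sentence", "another placeholder sentence"]  # stand-in data

tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(texts)
x = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=SEQUENCE_LENGTH)
print(x.shape)  # (number of texts, 60)
```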
wpm commented 7 years ago

To call this done I need training time, loss, and accuracy for my best run of each model on each of the data sets below, plus the best accuracy I can find in the literature (probably just from the Yoon Kim paper) for each of these data sets.

wpm commented 7 years ago

| Model            | Data Set   | Training Time | Loss    | Accuracy |
|------------------|------------|---------------|---------|----------|
| Conv             | SST Fine   | 52:04         | 1.55702 | 49.412   |
| RNN              | SST Fine   | 1d 12:35:07   | 2.13949 | 51.200   |
| BOW              | SST Fine   | < 20 min      | 1.60353 | 22.965   |
| Paragraph-Vec    | SST Fine   |               |         | 48.7     |
| Conv             | SST Coarse | 27:47         | 0.27419 | 92.796   |
| RNN              | SST Coarse | 17:51:46      | 0.58275 | 92.567   |
| BOW              | SST Coarse | < 20 min      | 0.48611 | 76.730   |
| CNN-multichannel | SST Coarse |               |         | 88.1     |
| Conv             | SUBJ       |               | 0.16319 | 93.600   |
| RNN              | SUBJ       |               | 0.15511 | 94.200   |
| BOW              | SUBJ       | < 20 min      | 0.31200 | 88.200   |
| F-Dropout        | SUBJ       |               |         | 93.6     |

wpm commented 7 years ago

Parameter grid directions to explore:
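
One way such a sweep could be driven is scikit-learn's `ParameterGrid`, which enumerates every combination; the parameter names and ranges below are illustrative placeholders, not the actual grid.

```python
# Illustrative sketch of a hyperparameter sweep with scikit-learn's ParameterGrid.
# The parameter names and ranges here are placeholders.
from sklearn.model_selection import ParameterGrid

grid = ParameterGrid({
    "dropout": [0.25, 0.5],
    "embedding_size": [100, 300],
    "sequence_length": [60, 100],
})

for params in grid:
    # Train and evaluate one model per setting; here we just list them.
    print(params)
```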

wpm commented 7 years ago

Training runs with both train and dev data sets.
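
Assuming this means fitting the final model on the train and dev sets combined, a minimal sketch would be the following; `x_train`, `y_train`, `x_dev`, `y_dev`, and `model` are placeholders standing in for the padded arrays and compiled model from the sketches above.

```python
# Hedged sketch: concatenate train and dev, then fit the model on the combined data.
import numpy as np

x_all = np.concatenate([x_train, x_dev])
y_all = np.concatenate([y_train, y_dev])
model.fit(x_all, y_all, epochs=10, batch_size=64)
```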