Maybe the Stanford sentiment analysis data set, or the IMDB data set.
This Keras tutorial gets 95% accuracy on 20 Newsgroups.
I'm nowhere near this. I must have a bug.
I was confused about what the data set for the Stanford Sentiment Treebank actually is. I think the answer is that published results only consider full sentences.
Yoon Kim 2014 "Convolutional neural networks for sentence classification" gives results for a sequential neural net classifier over several different data sets.
The first step is to assemble all the data sets described in this paper. Reproducing the paper's experiments with Mycroft's models would be the right thing to do for this issue; a sketch of the model appears after the dataset notes below.
MR Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales).
Movie reviews are grouped by author, with 3-class and 4-class sentiment rankings and 10-fold cross-validation. Results are reported per author, with accuracy in the 60-70% range.
SST-1 Stanford Sentiment Treebank—an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by (Socher et al. 2013, Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank).
SST-2 Same as SST-1 but with neutral reviews removed and binary labels.
SUBJ Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004, A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts).
2 classes, 10-fold cross-validation, accuracies in the 80-90% range.
TREC TREC question dataset—task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002, Learning Question Classifiers).
CR Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004, Mining and Summarizing Customer Reviews).
MPQA Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005, Annotating Expressions of Opinions and Emotions in Language).
Looks like the data set contains GATE parses instead of just plain text. This is a little trickier to work with.
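If it comes to that, here is a minimal sketch of pulling plain text back out of MPQA-style standoff annotations. This assumes the corpus layout I remember from the MPQA documentation (raw documents in one file, tab-separated annotation lines with byte spans like `123,145` in another); the paths and field positions are assumptions, not verified against this download.

```python
import re
from pathlib import Path

def read_mpqa_spans(doc_path, ann_path):
    """Extract raw text spans from an MPQA-style standoff annotation file.

    Assumes tab-separated annotation lines whose second field is a byte
    span of the form "start,end" into the raw document.
    """
    raw = Path(doc_path).read_bytes()
    spans = []
    for line in Path(ann_path).read_text().splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        match = re.match(r"(\d+),(\d+)", fields[1])
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            spans.append(raw[start:end].decode("utf-8", errors="replace"))
    return spans
```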
The CR link is 404. (And the paper is written in Microsoft Word. Ick.)
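For reference, a minimal sketch of the Kim (2014) architecture in Keras: parallel convolutions with filter widths 3, 4, and 5 over an embedding layer, max-over-time pooling, dropout, and a softmax. The 100 feature maps per width and 0.5 dropout follow the paper; the vocabulary size, sequence length, and optimizer here are placeholders, not the paper's exact training setup.

```python
from keras.layers import (Input, Embedding, Conv1D, GlobalMaxPooling1D,
                          Concatenate, Dropout, Dense)
from keras.models import Model

def kim_cnn(vocab_size=20000, seq_len=60, embed_dim=300, n_classes=5):
    """Kim (2014) CNN: parallel filter widths with max-over-time pooling."""
    tokens = Input(shape=(seq_len,), dtype="int32")
    embedded = Embedding(vocab_size, embed_dim)(tokens)
    # One convolution per filter width, each followed by max-over-time pooling.
    pooled = [GlobalMaxPooling1D()(Conv1D(100, width, activation="relu")(embedded))
              for width in (3, 4, 5)]
    features = Dropout(0.5)(Concatenate()(pooled))
    outputs = Dense(n_classes, activation="softmax")(features)
    model = Model(tokens, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```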
Partition | Longest Text (tokens) | Vocabulary Size |
---|---|---|
Train | 55 | 1,864,603 |
Dev | 49 | 20,432 |
Test | 56 | 41,724 |
Set the sequence length to 60. Maybe increase the vocabulary size to 40,000.
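A sketch of that preprocessing with the Keras text utilities; `train_texts` is a placeholder for whatever the data loader returns.

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 40000   # cap the vocabulary
SEQ_LEN = 60        # longest text is 56 tokens, so 60 leaves headroom

# train_texts is assumed to be a list of strings from the data loader.
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train_texts)
x_train = pad_sequences(tokenizer.texts_to_sequences(train_texts),
                        maxlen=SEQ_LEN)
```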
Longest Text (tokens) | Vocabulary Size |
---|---|
2,649 | 2,242,761 |
To call this done I need training time, loss, and accuracy for my best run of each model on the following data sets.
Plus the best accuracy I can find in the literature (probably just from the Yoon Kim paper) for each of these data sets.
Model | Data Set | Training Time | Loss | Accuracy (%) |
---|---|---|---|---|
Conv | SST Fine | 52:04 | 1.55702 | 49.412 |
RNN | SST Fine | 1d 12:35:07 | 2.13949 | 51.200 |
BOW | SST Fine | < 20 min | 1.60353 | 22.965 |
Paragraph-Vec | SST Fine | | | 48.7 |
Conv | SST Coarse | 27:47 | 0.27419 | 92.796 |
RNN | SST Coarse | 17:51:46 | 0.58275 | 92.567 |
BOW | SST Coarse | < 20 min | 0.48611 | 76.730 |
CNN-multichannel | SST Coarse | | | 88.1 |
Conv | SUBJ | | 0.16319 | 93.600 |
RNN | SUBJ | | 0.15511 | 94.200 |
BOW | SUBJ | < 20 min | 0.31200 | 88.200 |
F-Dropout | SUBJ | | | 93.6 |
Parameter grid directions to explore:
Training runs with both train and dev data sets.
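A minimal sketch of how I'd sweep a grid; the parameter names and values are hypothetical placeholders, not a committed grid, and `train_and_evaluate` is an assumed hook into the training code.

```python
from itertools import product

# Hypothetical grid; the actual directions to explore are still open.
grid = {
    "dropout": [0.25, 0.5],
    "filters": [50, 100],
    "batch_size": [32, 64],
}

for dropout, filters, batch_size in product(*grid.values()):
    print(f"run: dropout={dropout} filters={filters} batch_size={batch_size}")
    # train_and_evaluate(dropout=dropout, filters=filters, batch_size=batch_size)
```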
Use this code to reproduce published results for RNNs for text classification. Put the results and references in the README.
Nothing exhaustive; just reproduce something to show that it's working.
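For the RNN baseline, something like this minimal Keras LSTM classifier should be enough to sanity-check against published numbers; the layer sizes are assumptions, not tuned values.

```python
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential

def rnn_classifier(vocab_size=40000, seq_len=60, n_classes=2):
    """Minimal LSTM text classifier for reproducing published baselines."""
    model = Sequential([
        Embedding(vocab_size, 128, input_length=seq_len),
        LSTM(128, dropout=0.5),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```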