Maybe the Stanford sentiment analysis data set, or the IMDB data set.
This Keras tutorial gets 95% accuracy on 20 Newsgroups.
I'm nowhere near this. I must have a bug.
I was confused about what the data set for the Stanford Sentiment Treebank actually is. I think the answer is that published results only consider full sentences.
Yoon Kim 2014 "Convolutional neural networks for sentence classification" gives results for a sequential neural net classifier over several different data sets.
The first step is to assemble all the data sets described in this paper. Reproducing the paper's experiments with Mycroft's models would be the right thing to do for this issue; a sketch of the model appears after the dataset notes below.
MR Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005, Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales).
Movie reviews are grouped by author, with 3-class and 4-class sentiment rankings and 10-fold cross-validation. Results are reported per author, with accuracy in the 60-70% range.
SST-1 Stanford Sentiment Treebank—an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by (Socher et al. 2013, Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank).
SST-2 Same as SST-1 but with neutral reviews removed and binary labels.
SUBJ Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004, A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts).
2 classes, 10-fold cross-validation, accuracies in the 80-90% range.
TREC TREC question dataset—task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002, Learning Question Classifiers).
CR Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004, Mining and Summarizing Customer Reviews).
MPQA Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005, Annotating Expressions of Opinions and Emotions in Language).
Looks like the data set contains GATE parses instead of just plain text. This is a little trickier to work with.
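If it comes to that, here is a minimal sketch of pulling plain text back out of MPQA-style standoff annotations. This assumes the corpus layout I remember from the MPQA documentation (raw documents in one file, tab-separated annotation lines with byte spans like `123,145` in another); the paths and field positions are assumptions, not verified against this download.

```python
import re
from pathlib import Path

def read_mpqa_spans(doc_path, ann_path):
    """Extract raw text spans from an MPQA-style standoff annotation file.

    Assumes tab-separated annotation lines whose second field is a byte
    span of the form "start,end" into the raw document.
    """
    raw = Path(doc_path).read_bytes()
    spans = []
    for line in Path(ann_path).read_text().splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        match = re.match(r"(\d+),(\d+)", fields[1])
        if match:
            start, end = int(match.group(1)), int(match.group(2))
            spans.append(raw[start:end].decode("utf-8", errors="replace"))
    return spans
```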
The CR link is 404. (And the paper is written in Microsoft Word. Ick.)
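For reference, a minimal sketch of the Kim (2014) architecture in Keras: parallel convolutions with filter widths 3, 4, and 5 over an embedding layer, max-over-time pooling, dropout, and a softmax. The 100 feature maps per width and 0.5 dropout follow the paper; the vocabulary size, sequence length, and optimizer here are placeholders, not the paper's exact training setup.

```python
from keras.layers import (Input, Embedding, Conv1D, GlobalMaxPooling1D,
                          Concatenate, Dropout, Dense)
from keras.models import Model

def kim_cnn(vocab_size=20000, seq_len=60, embed_dim=300, n_classes=5):
    """Kim (2014) CNN: parallel filter widths with max-over-time pooling."""
    tokens = Input(shape=(seq_len,), dtype="int32")
    embedded = Embedding(vocab_size, embed_dim)(tokens)
    # One convolution per filter width, each followed by max-over-time pooling.
    pooled = [GlobalMaxPooling1D()(Conv1D(100, width, activation="relu")(embedded))
              for width in (3, 4, 5)]
    features = Dropout(0.5)(Concatenate()(pooled))
    outputs = Dense(n_classes, activation="softmax")(features)
    model = Model(tokens, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```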
Partition | Longest Text (tokens) | Vocabulary Size |
---|---|---|
Train | 55 | 1,864,603 |
Dev | 49 | 20,432 |
Test | 56 | 41,724 |
Set the sequence length to 60. Maybe increase the vocabulary size to 40,000.
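A sketch of that preprocessing with the Keras text utilities; `train_texts` is a placeholder for whatever the data loader returns.

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_WORDS = 40000   # cap the vocabulary
SEQ_LEN = 60        # longest text is 56 tokens, so 60 leaves headroom

# train_texts is assumed to be a list of strings from the data loader.
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(train_texts)
x_train = pad_sequences(tokenizer.texts_to_sequences(train_texts),
                        maxlen=SEQ_LEN)
```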
Longest Text (tokens) | Vocabulary Size |
---|---|
2,649 | 2,242,761 |
To call this done I need training time, loss, and accuracy for my best run of each model on the following data sets.
Plus the best accuracy I can find in the literature (probably just from the Yoon Kim paper) for each of these data sets.
Model | Data Set | Training Time | Loss | Accuracy (%) |
---|---|---|---|---|
Conv | SST Fine | 52:04 | 1.55702 | 49.412 |
RNN | SST Fine | 1d 12:35:07 | 2.13949 | 51.200 |
BOW | SST Fine | < 20 min | 1.60353 | 22.965 |
Paragraph-Vec | SST Fine | | | 48.7 |
Conv | SST Coarse | 27:47 | 0.27419 | 92.796 |
RNN | SST Coarse | 17:51:46 | 0.58275 | 92.567 |
BOW | SST Coarse | < 20 min | 0.48611 | 76.730 |
CNN-multichannel | SST Coarse | | | 88.1 |
Conv | SUBJ | | 0.16319 | 93.600 |
RNN | SUBJ | | 0.15511 | 94.200 |
BOW | SUBJ | < 20 min | 0.31200 | 88.200 |
F-Dropout | SUBJ | | | 93.6 |
Parameter grid directions to explore:
Training runs with both train and dev data sets.
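A minimal sketch of how I'd sweep a grid; the parameter names and values are hypothetical placeholders, not a committed grid, and `train_and_evaluate` is an assumed hook into the training code.

```python
from itertools import product

# Hypothetical grid; the actual directions to explore are still open.
grid = {
    "dropout": [0.25, 0.5],
    "filters": [50, 100],
    "batch_size": [32, 64],
}

for dropout, filters, batch_size in product(*grid.values()):
    print(f"run: dropout={dropout} filters={filters} batch_size={batch_size}")
    # train_and_evaluate(dropout=dropout, filters=filters, batch_size=batch_size)
```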
Use this code to reproduce published results for RNNs for text classification. Put the results and references in the README.
Nothing exhaustive; just reproduce something to show that it's working.
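For the RNN baseline, something like this minimal Keras LSTM classifier should be enough to sanity-check against published numbers; the layer sizes are assumptions, not tuned values.

```python
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential

def rnn_classifier(vocab_size=40000, seq_len=60, n_classes=2):
    """Minimal LSTM text classifier for reproducing published baselines."""
    model = Sequential([
        Embedding(vocab_size, 128, input_length=seq_len),
        LSTM(128, dropout=0.5),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```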