yao8839836 / text_gcn

Graph Convolutional Networks for Text Classification. AAAI 2019

Comment for the paper #1

Closed · nzw0301 closed this 6 years ago

nzw0301 commented 6 years ago

I read the paper "Graph Convolutional Networks for Text Classification" on arXiv. I noticed a questionable number in Table 2 and a typo, so I am reporting them here.


Summary: the test accuracy reported for fastText on 20NG is implausibly low.

Pre-processing

I used the script below to normalize train and test data.

#!/usr/bin/perl

# Script to clean text files based on https://raw.githubusercontent.com/facebookresearch/fastText/9fbc035bac1899a2f4c508e361b39f3047819ef3/wikifil.pl
# Each line represents one training sample such as sentence/document

while (<>) {
    # Remove any text not normally visible

    # s/<.*>//;               # remove xml tags
    s/&amp;/&/g;            # decode HTML entities
    s/&lt;/</g;
    s/&gt;/>/g;
    # s/<ref[^<]*<\/ref>//g;  # remove references <ref...> ... </ref>
    # s/<[^>]*>//g;           # remove xhtml tags
    s/\[http:[^] ]*/[/g;    # remove normal url, preserve visible text
    s/\|thumb//ig;          # remove images links, preserve caption
    s/\|left//ig;
    s/\|right//ig;
    s/\|\d+px//ig;
    s/\[\[image:[^\[\]]*\|//ig;
    s/\[\[category:([^|\]]*)[^]]*\]\]/[[$1]]/ig;  # show categories without markup
    s/\[\[[a-z\-]*:[^\]]*\]\]//g;  # remove links to other languages
    s/\[\[[^\|\]]*\|/[[/g;  # remove wiki url, preserve visible text
    s/\{\{[^\}]*\}\}//g;         # remove {{icons}} and {tables}
    s/\{[^\}]*\}//g;
    s/\[//g;                # remove [ and ]
    s/\]//g;
    s/&[^;]*;/ /g;          # remove remaining HTML entities

    # convert to lowercase letters and spaces, spell digits
    $_=" $_ ";
    tr/A-Z/a-z/;
    s/0/ zero /g;
    s/1/ one /g;
    s/2/ two /g;
    s/3/ three /g;
    s/4/ four /g;
    s/5/ five /g;
    s/6/ six /g;
    s/7/ seven /g;
    s/8/ eight /g;
    s/9/ nine /g;
    tr/a-z/ /cs;
    # chop;
    print $_;
    print "\n";
}

Training

$ ./fasttext supervised -input ./news20.train.txt -output model-news -dim 10 -epoch 100 -minCount 5
Read 3M words
Number of words:  27502
Number of labels: 20
...

Evaluate

$ ./fasttext test model-news.bin news20.test.txt

N    7532
P@1    0.769
R@1    0.769
Number of examples: 7532

So the accuracy is 0.769, while Table 2 reports 0.1138. Even though I did not remove stop words with NLTK, the reported number is far too low in my experience. For the same reason, the fastText (bigrams) result also seems unfair.

The default of 5 epochs is too few to train a good classifier, because 20NG is a small corpus. I suggest using a larger number of epochs for a fair comparison, for example a run like the one sketched below.
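For reference, a minimal sketch of such a run using the fasttext Python bindings (the file names are the ones above, the keyword arguments mirror the CLI flags, and wordNgrams=2 is my assumption for what "fastText (bigrams)" means):

import fasttext

# Retrain with more epochs; wordNgrams=2 should correspond to the bigram variant.
model = fasttext.train_supervised(
    input="news20.train.txt",
    dim=10,
    epoch=100,      # instead of the default 5
    minCount=5,
    wordNgrams=2,   # assumed setting for "fastText (bigrams)"
)
n, precision, recall = model.test("news20.test.txt")
print(f"N {n}  P@1 {precision:.3f}  R@1 {recall:.3f}")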


Misc

The BibTeX entry for the Adam paper (Kingma & Ba) seems to be wrong.

yao8839836 commented 6 years ago

@nzw0301 Hi, thank you very much for your valuable comments! I just tried 100 epochs with my preprocessing (cleaning and tokenizing text as in Kim (2014), removing stop words using NLTK, and removing words appearing fewer than 5 times).
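For reference, a minimal Python sketch of this preprocessing (the helper names are mine; it assumes Kim (2014)-style cleaning and NLTK's English stop word list):

import re
from collections import Counter
from nltk.corpus import stopwords   # needs nltk.download('stopwords') once

def clean_str(s):
    # Tokenization/cleaning in the style of Kim (2014).
    s = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", s)
    s = re.sub(r"'s", " 's", s)
    s = re.sub(r"'ve", " 've", s)
    s = re.sub(r"n't", " n't", s)
    s = re.sub(r"'re", " 're", s)
    s = re.sub(r"'d", " 'd", s)
    s = re.sub(r"'ll", " 'll", s)
    s = re.sub(r"([(),!?])", r" \1 ", s)   # pad punctuation with spaces
    s = re.sub(r"\s{2,}", " ", s)
    return s.strip().lower()

def preprocess(docs, min_count=5):
    # Drop NLTK stop words and words occurring fewer than min_count times.
    stop = set(stopwords.words("english"))
    tokenized = [clean_str(d).split() for d in docs]
    freq = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if w not in stop and freq[w] >= min_count]
            for doc in tokenized]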

Now the accuracy for fastText is 0.7876 and the accuracy for fastText (bigrams) is 0.7978.

Thanks again for your comments and correction!

nzw0301 commented 6 years ago

Thank you for your quick response and experiment. My pleasure.