yao8839836 / text_gcn

Graph Convolutional Networks for Text Classification. AAAI 2019

Comment for the paper #1

Closed · nzw0301 closed this 6 years ago

nzw0301 commented 6 years ago

I read the paper "Graph Convolutional Networks for Text Classification" on arXiv. I noticed a questionable number in Table 2 and a typo, so I am reporting them here.


Summary: the test accuracy reported for fastText on 20NG is implausibly low.

Pre-processing

I used the script below to normalize train and test data.

#!/usr/bin/perl

# Script to clean text files based on https://raw.githubusercontent.com/facebookresearch/fastText/9fbc035bac1899a2f4c508e361b39f3047819ef3/wikifil.pl
# Each line represents one training sample such as sentence/document

while (<>) {
    # Remove any text not normally visible

    # s/<.*>//;               # remove xml tags
    s/&amp;/&/g;            # decode HTML entities
    s/&lt;/</g;
    s/&gt;/>/g;
    # s/<ref[^<]*<\/ref>//g;  # remove references <ref...> ... </ref>
    # s/<[^>]*>//g;           # remove xhtml tags
    s/\[http:[^] ]*/[/g;    # remove normal url, preserve visible text
    s/\|thumb//ig;          # remove images links, preserve caption
    s/\|left//ig;
    s/\|right//ig;
    s/\|\d+px//ig;
    s/\[\[image:[^\[\]]*\|//ig;
    s/\[\[category:([^|\]]*)[^]]*\]\]/[[$1]]/ig;  # show categories without markup
    s/\[\[[a-z\-]*:[^\]]*\]\]//g;  # remove links to other languages
    s/\[\[[^\|\]]*\|/[[/g;  # remove wiki url, preserve visible text
    s/\{\{[^\}]*\}\}//g;         # remove {{icons}} and {tables}
    s/\{[^\}]*\}//g;
    s/\[//g;                # remove [ and ]
    s/\]//g;
    s/&[^;]*;/ /g;          # remove remaining HTML entities

    # convert to lowercase letters and spaces, spell digits
    $_=" $_ ";
    tr/A-Z/a-z/;
    s/0/ zero /g;
    s/1/ one /g;
    s/2/ two /g;
    s/3/ three /g;
    s/4/ four /g;
    s/5/ five /g;
    s/6/ six /g;
    s/7/ seven /g;
    s/8/ eight /g;
    s/9/ nine /g;
    tr/a-z/ /cs;
    # chop;
    print $_;
    print "\n";
}

Training

$ ./fasttext supervised -input ./news20.train.txt -output model-news -dim 10 -epoch 100 -minCount 5
Read 3M words
Number of words:  27502
Number of labels: 20
...

Evaluate

$ ./fasttext test model-news.bin news20.test.txt

N    7532
P@1    0.769
R@1    0.769
Number of examples: 7532

So the accuracy is 0.769, while Table 2 reports 0.1138. Even though I did not remove stop words with NLTK, the reported number is far too low in my experience. For the same reason, the fastText (bigrams) result also seems unfair.

The default of 5 epochs is too few to train a good classifier, because 20NG is a small corpus. I suggest using a larger number of epochs for a fair comparison, for example a run like the one sketched below.
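For reference, a minimal sketch of such a run using the fasttext Python bindings (the file names are the ones above, the keyword arguments mirror the CLI flags, and wordNgrams=2 is my assumption for what "fastText (bigrams)" means):

import fasttext

# Retrain with more epochs; wordNgrams=2 should correspond to the bigram variant.
model = fasttext.train_supervised(
    input="news20.train.txt",
    dim=10,
    epoch=100,      # instead of the default 5
    minCount=5,
    wordNgrams=2,   # assumed setting for "fastText (bigrams)"
)
n, precision, recall = model.test("news20.test.txt")
print(f"N {n}  P@1 {precision:.3f}  R@1 {recall:.3f}")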


Misc

The BibTeX entry for the Adam paper (Kingma & Ba) seems to be wrong.

yao8839836 commented 6 years ago

@nzw0301 Hi, thank you very much for your valuable comments! I just tried 100 epochs with my preprocessing (cleaning and tokenizing text as in Kim (2014), removing stop words using NLTK, and removing words appearing fewer than 5 times).
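For reference, a minimal Python sketch of this preprocessing (the helper names are mine; it assumes Kim (2014)-style cleaning and NLTK's English stop word list):

import re
from collections import Counter
from nltk.corpus import stopwords   # needs nltk.download('stopwords') once

def clean_str(s):
    # Tokenization/cleaning in the style of Kim (2014).
    s = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", s)
    s = re.sub(r"'s", " 's", s)
    s = re.sub(r"'ve", " 've", s)
    s = re.sub(r"n't", " n't", s)
    s = re.sub(r"'re", " 're", s)
    s = re.sub(r"'d", " 'd", s)
    s = re.sub(r"'ll", " 'll", s)
    s = re.sub(r"([(),!?])", r" \1 ", s)   # pad punctuation with spaces
    s = re.sub(r"\s{2,}", " ", s)
    return s.strip().lower()

def preprocess(docs, min_count=5):
    # Drop NLTK stop words and words occurring fewer than min_count times.
    stop = set(stopwords.words("english"))
    tokenized = [clean_str(d).split() for d in docs]
    freq = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if w not in stop and freq[w] >= min_count]
            for doc in tokenized]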

Now the accuracy for fastText is 0.7876 and the accuracy for fastText (bigrams) is 0.7978.

Thanks again for your comments and correction!

nzw0301 commented 6 years ago

Thank you for your quick response and experiment. My pleasure.