zhongkaifu / RNNSharp

RNNSharp is a toolkit of deep recurrent neural networks, widely used for many different kinds of tasks, such as sequence labeling, sequence-to-sequence modeling and so on. It is written in C# and based on .NET Framework 4.6 or above. RNNSharp supports many different types of networks, such as forward and bi-directional networks and sequence-to-sequence networks, and different types of layers, such as LSTM, Softmax, sampled Softmax and others.
BSD 3-Clause "New" or "Revised" License

Poor performance in Sequential Tagging #4

Closed bratao closed 8 years ago

bratao commented 8 years ago

Hello !

Thanks again for this awesome project !

From my understanding, the performance for text sequence tagging should be equal to or better than CRFSharp or CRF++, right? My problem is to semantically tag a single, continuous large text (hundreds of pages).

In CRF++ I get a token error rate of about 0.5%. In RNNSharp I can't get it below 40%, a gigantic difference. I tried LSTM and BPTT, with CRF both on and off. No luck.

Is this expected for my use case, or am I doing something wrong?

zhongkaifu commented 8 years ago

Hi Bratao,

Which feature set did you use? I suggest you use both lexical features and dense features, such as word embeddings. Could you please share your configuration file and command line with me? I can look into it.

CRFSharp and CRF++ should both have similar performance. In addition, both of them are able to generate huge feature sets according to unigram and bigram feature templates, so their number of features can be much larger than what RNNSharp has.
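As a point of reference, the unigram/bigram templates mentioned here follow the standard CRF++/CRFSharp template syntax, where `%x[row,col]` refers to a token relative to the current position. A minimal sketch (illustrative only, not taken from the actual configuration in this thread):

```
# Unigram templates: each line is expanded once per token in the corpus,
# so a large corpus yields a very large feature set.
U00:%x[-1,0]           # token in the previous row, column 0
U01:%x[0,0]            # token in the current row, column 0
U02:%x[1,0]            # token in the next row, column 0
U03:%x[-1,0]/%x[0,0]   # conjunction of previous and current tokens

# Bigram template: adds features over pairs of adjacent output tags.
B
```

Each `U` template is instantiated for every position in the training data, which is why the resulting feature count can dwarf what RNNSharp uses.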

Did you try your data set with CRFSharp or CRF++?

zhongkaifu commented 8 years ago

1: How did you generate vector.bin for the word embedding features?

2: I saw you are using U02:%x[0,1] in your template. How many columns does your training corpus "bruno-data.txt" have? Can you share a few example lines with me?
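For context, `%x[0,1]` selects column 1 of the current row, so it only makes sense if the corpus has at least two columns per token. A hypothetical fragment of a CRF++-style training file with three columns (token, part-of-speech, gold tag) might look like this; the actual contents of bruno-data.txt are not shown in this thread:

```
Confidence   NN   B-NP
in           IN   B-PP
the          DT   B-NP
pound        NN   I-NP
is           VBZ  B-VP
```

With this layout, `%x[0,0]` is the token, `%x[0,1]` is the POS column, and the last column is the tag to predict; sentences are separated by blank lines.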

3: I suggest you try these parameters first:

.\Bin\RNNSharpConsole.exe -mode train -trainfile bruno-data.txt -modelfile .\bruno-model.bin -validfile bruno-valid.txt -ftrfile .\config_bruno.txt -tagfile .\bruno-tags.txt -modeltype 0 -layersize 200 -alpha 0.1 -crf 0 -maxiter 20 -savestep 200K -dir 1

zhongkaifu commented 8 years ago

I think a 1M file for word embedding training is not enough. You can try "Txt2VecConsole.exe -mode distance..." to verify the quality of vector.bin.

Since word embedding training is completely unsupervised, you can use a bigger corpus to train it.

bratao commented 8 years ago

@zhongkaifu , Oh, thank you for clarifying this. I will try with a bigger corpus !

My understanding was that word embeddings were just an extra set of features, and that without them the performance would be comparable to CRFSharp.

Thank you again so much for the help !!! I will report if I get any success !