Closed: bratao closed this issue 8 years ago
Hi Bratao,
What feature set did you use? I suggest you use both lexical features and dense features, such as word embeddings. Could you please share your configuration file and command line with me? I can look into it.
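For reference, a minimal sketch of a feature configuration that combines the two kinds of features is shown below. The key names (TFEATURE_*, WORDEMBEDDING_*) are written from memory of the RNNSharp README, so please verify them against the documentation for your version; the file names are placeholders.

    # Sparse lexical features generated from a CRF++-style template file (key names assumed)
    TFEATURE_FILENAME: tfeatures
    TFEATURE_CONTEXT: -1,0,1

    # Dense features from a word embedding model trained with Txt2Vec (key names assumed)
    WORDEMBEDDING_FILENAME: vector.bin
    WORDEMBEDDING_CONTEXT: -1,0,1
    WORDEMBEDDING_COLUMN: 0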
CRFSharp and CRF++ should have similar performance. In addition, both of them can generate a huge number of features from unigram and bigram feature templates, so their feature set can be much larger than the one RNNSharp has.
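As a concrete illustration, a typical CRF++/CRFSharp feature template looks like the sketch below: each U line expands into unigram features built from the tokens at the given row/column offsets, and B generates bigram features over the previous and current tags. The offsets here are only an example, not taken from any particular setup.

    # Unigram features: %x[row,col] selects the token at the given row offset and column
    U00:%x[-2,0]
    U01:%x[-1,0]
    U02:%x[0,0]
    U03:%x[1,0]
    U04:%x[2,0]
    # Conjunction of the previous and current token
    U05:%x[-1,0]/%x[0,0]
    # Bigram features over the previous and current output tags
    B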
Did you try your data set with CRFSharp or CRF++?
.\Bin\RNNSharpConsole.exe -mode train -trainfile bruno-data.txt -modelfile .\bruno-model.bin -validfile bruno-valid.txt -ftrfile .\config_bruno.txt -tagfile .\bruno-tags.txt -modeltype 0 -layersize 200 -alpha 0.1 -crf 0 -maxiter 20 -savestep 200K -dir 1
I think a 1M file is not enough for word embedding training. You can try "Txt2VecConsole.exe -mode distance..." to verify the quality of vector.bin.
Since word embedding training is completely unsupervised, you can use a bigger corpus to train it.
@zhongkaifu Oh, thank you for clarifying this. I will try with a bigger corpus!
My understanding was that word embeddings were just an extra set of features, and that without them the results would be comparable to CRFSharp.
Thank you again so much for the help! I will report back if I have any success!
Hello!
Thanks again for this awesome project!
From my understanding, the performance for sequential text tagging should be equal to or better than CRFSharp or CRF++, right? My problem is to semantically tag a single, big, continuous text (hundreds of pages).
With CRF++ I get an error token ratio of about 0.5%. With RNNSharp I can't get it better than 40%, a gigantic difference. I tried LSTM and BPTT, with CRF on and off. No luck.
Is this expected for my use case, or am I doing something wrong?
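For what it's worth, both CRF++/CRFSharp and RNNSharp read the training corpus as one token per line, with whitespace-separated feature columns, the tag in the last column, and a blank line between sequences. The sketch below only illustrates that layout; the tokens, the feature column, and the tags are made up.

    Aspirin     NN    B-DRUG
    100         CD    B-DOSE
    mg          NN    I-DOSE
    daily       RB    O

    The         DT    O
    patient     NN    O
    improved    VB    O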