zhongkaifu / RNNSharp

RNNSharp is a toolkit for deep recurrent neural networks that is widely used for many different kinds of tasks, such as sequence labeling, sequence-to-sequence and so on. It's written in C# and based on .NET Framework 4.6 or above. RNNSharp supports many different types of networks, such as forward and bi-directional networks and sequence-to-sequence networks, and different types of layers, such as LSTM, softmax, sampled softmax and others.
BSD 3-Clause "New" or "Revised" License

Training takes too long!! #34

Open andy-soft opened 7 years ago

andy-soft commented 7 years ago

Hello, I was wondering what the training times are for the demonstrations. I just tried the English sequence labeler, and it took 1 hour to process 10% of the corpus! (Is this normal?) It's known that deep learning is CPU hungry; I have only 2 cores and 8 GB RAM (sorry). Do I need to change the PC, or acquire a CUDA card to help with the computation? Is there a way to stop learning manually, or programmatically after reaching a certain error rate?

I am wondering if you ever tried sequence labeling on highly inflectional languages (like Spanish), which have a lot of inflectional power (complexity): the words as whole strings are useless, the vocabulary explodes into >300M words, and the "examples" found in text become too sparse. Even with negative sampling you never get certain combinations, because most verbs have over 200 versions of themselves (inflections), including time/tense, person, gender, plurality, mood, etc. So there is a need to train on higher-level features without losing the "semantic" sense. Do you think this could be possible, for example by decomposing the words (by means of controlled independent lemmatization) into parts/chunks (prefix, root, suffix, as well as modal information and semantic features of the parts)? My intuition is that this might lower the training cost and maybe improve the generalization power with a less extensive corpus. Like capturing higher-level syntax rules, and by the way generating semantic content constraints (maybe even some common sense)...

It's just a question, on theory!

zhongkaifu commented 7 years ago

Hi @andy-soft,

For your labeling task, how many categories do you want to label? Could you please share the configuration file you are using with me? Then I will estimate whether the current performance is reasonable. Currently, RNNSharp doesn't support GPU training. It supports CPU training with SIMD instructions only, so you need a powerful CPU with a modern SIMD instruction set, such as AVX, AVX2 and so on.

I did use RNNSharp for sequence labeling tasks on inflectional languages such as English, for example POS tagging, named entity recognition and so on. Usually, the number of labeling categories is no more than 50. If there are too many labeling categories, it will definitely affect performance, and you should optimize them, for example by splitting them into a few basic units for labeling. If it's really hard to reduce their number, you could use SampledSoftmax as the output layer type. For each token, it randomly samples some categories, plus the categories in the current labeling sentence, for training, instead of using the entire category set.
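For example, in the training configuration file the output layer can be switched like this (the negative sample size of 20 is just an illustration):

```
#Sampled softmax output layer; its negative sample size is 20
OUTPUT_LAYER = SampledSoftmax:20
```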

andy-soft commented 7 years ago

Hello, thanks for the reply.

Actually I was only testing the labels on the English NER dataset (I think there are only 5-6 entity types, and with the BIO labels there are at most 12-15 labels). My level of classification is about the same, maybe at most 20 total labels (B, I, O, S, with 10-15 entity types). The problem is the many labels for each word: the variability is huge, more than 900 different POS labels (EAGLES 2 version).

I can sub-sample them, creating a 2-level problem, but I don't know how the LSTM will behave on this, or even how to compose the output layer. The problem is too complex; I hope to discover how to combine so many labels and POS parts into one problem, as I told you before. But thank you for answering and for doing such good programming.

I used another word2vec encoder from Google and got severely low performance, then I refactored it, but something went wrong: the resulting system trains well, but the cosine distance is always too close to one, and I could not find the problem I might have introduced. I sped the thing up 20 times by programming some parts well. Maybe I will help you also; I am some sort of skilled programmer in C# (30 years programming).

& thanks again!


zhongkaifu commented 7 years ago

It would be really appreciated if you could make a contribution to RNNSharp. :)

For word2vec, you can try my version: https://github.com/zhongkaifu/Txt2Vec It has higher performance than the original word2vec and supports incremental training.

For "the problem is the many labels of each word, the variability is huge, more than 900 different POS labels, (EAGLES 2 version)", could you please make a specified example about it ? Sorry that I don't understand about it.

andy-soft commented 7 years ago

Hi Zhongkai

I tried your Txt2Vec version, and its performance is similar to (even slightly lower than) word2vec (a C# port I've modified). The threaded part and the logger may slow down the system, and even the double[] vector calculus makes it a bit slower, I guess. I will send you my modified version; I have made some optimizations manually inside the code (well commented).

The 900-label problem lies in the fact that Spanish, like many European languages, swallows many aspects of the statement into the word itself: number, gender, diminutives, augmentatives; in the case of verbs even the person and mood are inside the morphology of each word. Worst of all, people use many prefixes that modify only the semantics of the word, which results in a new word for the vocabulary if you take only the written form, not the decomposed one, which I can split using my libraries based upon a huge Spanish word corpus (>300M words) I've collected over the last 12 years. Each prefix and suffix adds semantic information to the word, and I guess this can be used to train several "aspects" of an RNN, allowing a better comprehension of the phrase for NLU-based systems.

For example, the way a named entity like a place is prepended and the way a sentence construct treats it is similar to an organization, but slightly different from a proper name or person. But a person is also addressed inside a sentence as something different, and the "semantic" properties of the verbs involved, as well as the adjectives, carry information to determine whether an adjective, a simple noun, or a pronoun is referring to a person. So anaphora detection should be done inside this "smarter" named entity detector which I am seeking to build, maybe with several parallel stages trained upon "nameability" or "placeability" (sorry for the OOV, but this is what I mean).

For example, in Spanish the word "hiperrecontrabuenísimo" is an OOV (not in conventional dictionaries), but for a native speaker it clearly means the prefix+suffix+root meanings, that is "hiper" (augmentative), "recontra" (another augmentative), "buen" (good) + "ísimo" (another augmentative). So this word, as well as many others of this type, is used in colloquial chat/conversation but never ends up in any dictionary!

So my idea is to train a network capable of extracting the sense relationships embedded inside syntax relations, like the verb (root) to direct object (root) relation, with semantic features.

The tags have a string representation; just search for the EAGLES format. It's an extended POS tag set, much more complete than English tag sets (Penn Treebank and the like). The length is variable and they look like this: NCMS for Common Noun, Masculine, Singular. You can imagine the thousands of combinations for each POS class, as well as the sub-classifications.

Just this; I hope you understood my ideas. If you have any idea or question, it will be addressed and responded to quickly!

best regards

Andrés


zhongkaifu commented 7 years ago

Hi Andrés,

Thanks for your detailed explanation. It's really helpful. For your task, to improve performance and reduce the number of output categories, you could try sub-word-level segmentation and labeling, or character-level segmentation and labeling. For the example you mentioned above, "hiperrecontrabuenísimo", if you have a sub-word dictionary for training, you could build a training corpus like this:

hiper \t S_Aug1
recontra \t S_Aug2
buen \t S_CorePart
ísimo \t S_Aug3

So, the label "Aug1Aug2CorePartAug3" is split into four basic tags. Or you could try character-level labeling, such as:

h \t B_Aug1
i \t M_Aug1
p \t M_Aug1
e \t M_Aug1
r \t E_Aug1

This way, it will significantly reduce the number of output categories.
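A rough sketch (a hypothetical helper, not part of RNNSharp) of how such character-level B/M/E tags could be generated from a sub-word segmentation:

```csharp
using System;
using System.Collections.Generic;

static class CharLevelTagger
{
    // Expands ("hiper", "Aug1") into ('h', "B_Aug1"), ('i', "M_Aug1"), ..., ('r', "E_Aug1").
    // Single-character segments get an "S" prefix (an assumption of this sketch).
    public static IEnumerable<(char Ch, string Tag)> Expand(IEnumerable<(string Segment, string Label)> segments)
    {
        foreach (var (segment, label) in segments)
        {
            for (int i = 0; i < segment.Length; i++)
            {
                string prefix = segment.Length == 1 ? "S"
                              : i == 0 ? "B"
                              : i == segment.Length - 1 ? "E"
                              : "M";
                yield return (segment[i], $"{prefix}_{label}");
            }
        }
    }
}
```

Calling Expand with { ("hiper", "Aug1"), ("recontra", "Aug2"), ("buen", "CorePart"), ("ísimo", "Aug3") } would then produce one character/tag pair per line of the training file.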

Thanks Zhongkai Fu

zhongkaifu commented 7 years ago

In addition, did you try the latest RNNSharp code (checked out from the master branch)? It's much faster than the released version, since I have not updated the release package yet.

andy-soft commented 7 years ago

No, I'll try tomorrow, thanks. I am still trying to understand all you told me, but I'm too tired now! Until tomorrow!


andy-soft commented 7 years ago

I am training a NER model for Spanish with a small corpus (8k tokens), and it takes too long and gets stuck; the token error rate has stopped dropping (it stays below 9%):

info,31/5/2017 3:02:41 p. m. Progress = 8K/8,323K
info,31/5/2017 3:02:41 p. m. Train cross-entropy = 0,320236720036713
info,31/5/2017 3:02:41 p. m. Error token ratio = 7,50999862854771%
info,31/5/2017 3:02:41 p. m. Error sentence ratio = 64,1830228778597%
info,31/5/2017 3:03:06 p. m. Iter 33 completed
info,31/5/2017 3:03:06 p. m. Sentences = 8323, time escape = 00:09:20.6314274s, speed = 14,8457606784532
info,31/5/2017 3:03:06 p. m. In training: log probability = -27240,7618759363, cross-entropy = 0,321621874395561, perplexity = 1,24973470831842
info,31/5/2017 3:03:06 p. m. Verify model on validated corpus.
info,31/5/2017 3:03:06 p. m. Start validation ...
info,31/5/2017 3:03:24 p. m. In validation: error token ratio = 7,13251598951747% error sentence ratio = 66,4469347396177%
info,31/5/2017 3:03:24 p. m. In training: log probability = -5154,61463169968, cross-entropy = 0,313802466020867, perplexity = 1,24297946835413

It has been training since yesterday; when will it stop?

This was the .bat:

SET CorpusPath=.\Data\Corpus\NER_ES
SET ModelsPath=.\Data\Models\NER_ES
SET BinPath=..\Bin

REM Build template feature set from training corpus
%BinPath%\TFeatureBin.exe -mode build -template %CorpusPath%\template.txt -inputfile %CorpusPath%\train.txt -ftrfile %ModelsPath%\tfeatures -minfreq 1

REM Encoding LSTM-BiRNN-CRF model
%BinPath%\RNNSharpConsole.exe -mode train -trainfile %CorpusPath%\train.txt -validfile %CorpusPath%\valid.txt -cfgfile .\config_ner_enu.txt -tagfile %CorpusPath%\tags.txt -alpha 0.1 -maxiter 0 -savestep 200K

I can send you the training samples, also.

should I buy a CUDA thing!?

best thanks


zhongkaifu commented 7 years ago

According to the RNN output lines, you are still using an older RNNSharp. Please sync the latest source code (not the released demo package, since I have not updated it yet), build it and train your model.

It's okay to send me your training examples, configuration file and the command line you ran.

andy-soft commented 7 years ago

Thanks Zhongkaifu

I'll download the latest and build them ASAP, & tell you the result

Also I saw some of my mistakes:

I missed an absolute reference in the configuration files pointing towards the English version "xxx_enu", and I need to train a Txt2Vec model; now I have replaced it as:

WORDEMBEDDING_FILENAME = D:\RNNSharpDemoPackage\WordEmbedding\wordvec_es.bin

And I am going to generate an embedding file for Spanish as well!

Andrés

PS: The bad run (based on English resources... bad!) ended its training 2 days later and generated a 1.0 GB model file. I think these lengthy files need to be pruned somehow; it's not practical to have a 1.0 GB parameter file! The resulting model would be memory and resource hungry; might it not be useful for production?


andy-soft commented 7 years ago

Hi, now I am training on real words, with the new routines (just downloaded).

I trained a Spanish corpus of 380 megabytes of raw text using your Txt2Vec and created the *.bin for the training, then redirected all the documents in the configuration, corrected the files with the new syntax from GitHub, and started to train it, but it is still running after over 1 day.

When should it stop?

Wouldn't it be worth being able to break the training and resume it later, using some kind of console or whatever? If I abort the training, I lose all the work done!

Any clue? (I guessed it would stop after the 20th iteration, but the show still goes on...)

The config file is this:

#Working directory
CURRENT_DIRECTORY = .

#Model type. Sequence labeling (SEQLABEL) and sequence-to-sequence (SEQ2SEQ) are supported.
MODEL_TYPE = SEQLABEL

!Model direction. Forward and BiDirectional are supported
!MODEL_DIRECTION = BiDirectional

#Network type. Four types are supported:
#For sequence labeling tasks, we could use: Forward, BiDirectional, BiDirectionalAverage
#For sequence-to-sequence tasks, we could use: ForwardSeq2Seq
#BiDirectional type concatenates outputs of forward layer and backward layer as final output
#BiDirectionalAverage type averages outputs of forward layer and backward layer as final output
NETWORK_TYPE = BiDirectional

#Model file path
MODEL_FILEPATH = Data\Models\NER_ES\model.bin

#Hidden layers settings. LSTM and Dropout are supported. Here are examples of these layer types
#Dropout: Dropout:0.5 -- Drop out ratio is 0.5
#If the model has more than one hidden layer, each layer's settings are separated by a comma. For example:
#"LSTM:300, LSTM:200" means the model has two LSTM layers. The first layer size is 300, and the second layer size is 200
HIDDEN_LAYER = LSTM:200

#Output layer settings. Simple, softmax and sampled softmax are supported.
#Here is an example of sampled softmax:
#"SampledSoftmax:20" means the output layer is a sampled softmax layer and its negative sample size is 20
#"Simple" means the final result is the raw output of the layer.
#"Softmax" means the final result is based on the "Simple" layer and run through softmax
OUTPUT_LAYER = Simple

#CRF layer settings
CRF_LAYER = True

#The file name for template feature set
TFEATURE_FILENAME = Data\Models\NER_ES\tfeatures

#The context range for template feature set. In below, the context is current token, next token and next after next token
TFEATURE_CONTEXT = 0,1,2
TFEATURE_WEIGHT_TYPE = Binary

PRETRAIN_TYPE = Embedding

#The word embedding data file name generated by Txt2Vec (https://github.com/zhongkaifu/Txt2Vec)
WORDEMBEDDING_FILENAME = Data\WordEmbedding\wordvec_es.bin

#The context range for word embedding.
WORDEMBEDDING_CONTEXT = 0

#The column index applied word embedding feature
WORDEMBEDDING_COLUMN = 0

#The run time feature
RTFEATURE_CONTEXT: -1,-2,-3


The .bat file is here:

SET CorpusPath=.\Data\Corpus\NER_ES
SET ModelsPath=.\Data\Models\NER_ES
SET BinPath=..\Bin

REM Encoding LSTM-BiRNN-CRF model
%BinPath%\RNNSharpConsole.exe -mode train -trainfile %CorpusPath%\train.txt -validfile %CorpusPath%\valid.txt -cfgfile .\config_ner_es.txt -tagfile %CorpusPath%\tags.txt -alpha 0.1 -maxiter 0 -savestep 200K

As I have set maxiter to 0, will it only stop when the system no longer improves its training?
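If I understand the -maxiter switch correctly, I could cap the run instead, e.g. (same command as above, only the iteration limit changed):

```
REM Stop after at most 20 iterations instead of waiting for convergence
%BinPath%\RNNSharpConsole.exe -mode train -trainfile %CorpusPath%\train.txt -validfile %CorpusPath%\valid.txt -cfgfile .\config_ner_es.txt -tagfile %CorpusPath%\tags.txt -alpha 0.1 -maxiter 20 -savestep 200K
```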

How long does your trainer take for the English corpus in the downloaded sample, and on what kind of PC (RAM, processor, OS)?

Actually my process is consuming 1.9 GB of RAM and 100% CPU (2 cores) on a Windows 10 x64 dual-core G2020 (2.66 GHz Pentium), not any beast or numeric workhorse, and no GPU installed. The process now shows these readings in the logfile:

info,2/6/2017 2:58:13 p. m. End 28 iteration. Time duration = 00:13:43.3734915
info,2/6/2017 2:58:13 p. m.
info,2/6/2017 2:58:13 p. m. Verify model on validated corpus.
info,2/6/2017 2:58:29 p. m. Progress = 1K/1,517K
info,2/6/2017 2:58:29 p. m. Error token ratio = 3,60980155684684%
info,2/6/2017 2:58:29 p. m. Error sentence ratio = 47,7%
info,2/6/2017 2:58:37 p. m. End model verification.
info,2/6/2017 2:58:37 p. m.
info,2/6/2017 2:58:37 p. m.
info,2/6/2017 2:58:37 p. m. Start to training 29 iteration. learning rate = 0,00078125
info,2/6/2017 3:01:26 p. m. Progress = 2K/8,323K
info,2/6/2017 3:01:26 p. m. Error token ratio = 0,0864513976309284%
info,2/6/2017 3:01:26 p. m. Error sentence ratio = 1,65%
info,2/6/2017 3:01:26 p. m. Progress = 2K/8,323K
info,2/6/2017 3:01:26 p. m. Error token ratio = 0,0864513976309284%
info,2/6/2017 3:01:26 p. m. Error sentence ratio = 1,65%
info,2/6/2017 3:06:03 p. m. Progress = 5K/8,323K
info,2/6/2017 3:06:03 p. m. Error token ratio = 0,091533180778032%
info,2/6/2017 3:06:03 p. m. Error sentence ratio = 1,78%
info,2/6/2017 3:06:03 p. m. Progress = 5K/8,323K
info,2/6/2017 3:06:03 p. m. Error token ratio = 0,091533180778032%
info,2/6/2017 3:06:03 p. m. Error sentence ratio = 1,78%
info,2/6/2017 3:07:35 p. m. Progress = 6K/8,323K
info,2/6/2017 3:07:35 p. m. Error token ratio = 0,0901244410551492%
info,2/6/2017 3:07:35 p. m. Error sentence ratio = 1,81666666666667%
info,2/6/2017 3:09:01 p. m. Progress = 7K/8,323K
info,2/6/2017 3:09:01 p. m. Error token ratio = 0,0972418293405895%
info,2/6/2017 3:09:01 p. m. Error sentence ratio = 1,9%
info,2/6/2017 3:09:01 p. m. Progress = 7K/8,323K
info,2/6/2017 3:09:01 p. m. Error token ratio = 0,0972418293405895%
info,2/6/2017 3:09:01 p. m. Error sentence ratio = 1,9%
info,2/6/2017 3:10:41 p. m. Progress = 8K/8,323K
info,2/6/2017 3:10:41 p. m. Error token ratio = 0,100398246377297%
info,2/6/2017 3:10:41 p. m. Error sentence ratio = 1,9125%
info,2/6/2017 3:11:11 p. m. End 29 iteration. Time duration = 00:12:34.1796782
info,2/6/2017 3:11:11 p. m.
info,2/6/2017 3:11:11 p. m. Verify model on validated corpus.
info,2/6/2017 3:11:27 p. m. Progress = 1K/1,517K
info,2/6/2017 3:11:27 p. m. Error token ratio = 3,53827053870236%
info,2/6/2017 3:11:27 p. m. Error sentence ratio = 46,4%
info,2/6/2017 3:11:34 p. m. End model verification.
info,2/6/2017 3:11:34 p. m.
info,2/6/2017 3:11:34 p. m.
info,2/6/2017 3:11:34 p. m. Start to training 30 iteration. learning rate = 0,00078125


zhongkaifu commented 7 years ago

First of all, your CPU has only two cores; this is the main reason why training is slow.

Secondly, I don't know if your CPU supports the AVX and AVX2 instructions used by SIMD to speed up training. You could share the first few log lines with me, and I will take a look.

Finally, you could set TFEATURE_CONTEXT=0 to reduce the number of sparse features to speed up training.
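That is, in the configuration file:

```
#Use only the current token for template features (instead of 0,1,2) to speed up training
TFEATURE_CONTEXT = 0
```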

andy-soft commented 7 years ago

It just ended running; I'll attach the log file (complete).

My CPU is a G2020, here is the FULL Spec: [image: Inline image 1]

I will probably make a Forms application to control the training-run process, to visualize it, and eventually stop and/or change parameters on the run; maybe this could give better control for guys who, like me, have slow CPUs.

On the other hand, have you seen the approach taken by the people at Facebook with fastText? They claim to have lowered training time by many orders of magnitude by means of subsampling; I attach the link to the git repo (maybe I can port it to C#, improving the overall methods). I plan to use this for a real-world NER application in Spanish, which, as I told you before, has a very rich inflectional morphology: the words are agglutinative and, unlike Chinese, they contain lots of semantic, grammatical and modal information inside their structure, generating a very large number of inflected word forms from only a small set of root words. This makes a trainer struggle if it doesn't use this info. I have built a morphological analyzer which is able to "strip down" the words into a large set of variables: some are semantic, some are gender, number, root, prefixes, suffixes, tense/person/mood/colloquial register in the case of verbs, among many others. Is there a way to train an RNN with these sparse features representing the same word and position? I guess the feature extractor should be tailored; I can do this!
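Something like this is what I have in mind for the training file, tab-separated as in the earlier examples; the feature columns (lemma, POS, person, tense, ...) are made up for illustration, and the last column stays the NER tag:

```
comimos    comer      V     1PL   Past   O
en         en         PREP  -     -      O
Barcelona  barcelona  NP    -     -      S-LOC
```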

Also, I guess a NER chunker could benefit from a multi-stage classifier: one stage to detect the entity boundaries and another to classify the entity; once you have the segment, you can re-classify it with another, better and simpler, in-segment classifier.

For example, for time-related entities the variability in Spanish is so huge that any classifier will probably go nuts trying to set the boundaries, and there will never be enough samples of these named entities, due to the multiple variations, to train a complete system. I will try to do this; even with other named entities the problem is much the same.

Thanks for your collaboration.

If you want to take a look at my work in NLP, I can send you a PDF in English (but I don't want it public on this thread) to a private email address.

best regards

Andrés


zhongkaifu commented 7 years ago

Hi Andrés

It would be really appreciated if you would like to contribute to the RNNSharp project. :)

I cannot get your inline image for the CPU G2020. According to the information at http://www.cpu-world.com/CPUs/Pentium_Dual-Core/Intel-Pentium%20G2020.html, it seems this CPU doesn't support AVX and AVX2 instructions, so RNNSharp cannot emit SIMD instructions to speed up.

andy-soft commented 7 years ago

Hi Zhongkai Fu

I'll see about getting an Intel Core i5-3570 ASAP if the speedup is worth it (at least 2 more cores!). An i7 is the same, as hyper-threading does not help too much; each thread gets only half a time slot!!!

I am also looking at the new AMD Ryzen 7 chips, but I still fear their compatibility with MS .NET technologies for now; there have been some awkward reports!

One more question: do you explicitly instruct RNNSharp to issue SIMD instructions, or is this handled by the internal .NET CLR JIT compiler? The latter, I guess.

I have to check whether the .NET runtime reports anything on this! I guessed (as on many information boards) that Ivy Bridge series chips were AVX compatible, but it seems Intel has crippled that down in the chip to make their Core i5/i7/i9 series shine, as they do to tailor our budgets.
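I guess something like this would tell me; a minimal check using System.Numerics (the two properties are from the standard library, the rest is just a throwaway console app):

```csharp
using System;
using System.Numerics;

class SimdCheck
{
    static void Main()
    {
        // True when the JIT can map Vector<T> operations onto SSE/AVX instructions
        Console.WriteLine($"Hardware accelerated: {Vector.IsHardwareAccelerated}");
        // Number of floats per Vector<float> operation: 8 with AVX/AVX2, 4 with plain SSE
        Console.WriteLine($"Vector<float>.Count:  {Vector<float>.Count}");
    }
}
```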

thanks for the info! Andrés

On Sat, Jun 3, 2017 at 3:02 AM, Zhongkai Fu notifications@github.com wrote:

Hi Andrés

It's really appreciated if you would like to contribute RNNSharp project. :)

I cannot get your inline image for CPU G2020. According information at http://www.cpu-world.com/CPUs/Pentium_Dual-Core/Intel-Pentium%20G2020.html, it seems this CPU doesn't support AVX and AVX2 instructions, so RNNSharp cannot emit SIMD instruction to speed up.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/zhongkaifu/RNNSharp/issues/34#issuecomment-305954159, or mute the thread https://github.com/notifications/unsubscribe-auth/ANCPcVSAjeTKf76TWQ8UoLY4kzvZNqRdks5sAPbpgaJpZM4NE58d .

-- ing. Andrés T. Hohendahl director PandoraBox www.pandorabox.com.ar web.fi.uba.ar/~ahohenda

zhongkaifu commented 7 years ago

I'm using System.Numerics.Vectors, which is a component of .NET Core, to emit SIMD instructions (AVX and AVX2) for RNNSharp.

If that AMD CPU supports these AVX instructions, RNNSharp can leverage them as well.
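A minimal sketch (not RNNSharp's actual code) of the pattern: the dense dot products inside the network are written with Vector<float>, and the JIT turns each vector operation into one AVX/AVX2 instruction when the CPU supports it.

```csharp
using System.Numerics;

static class DenseOps
{
    // Dot product of two equal-length float arrays using Vector<float>.
    public static float Dot(float[] a, float[] b)
    {
        int width = Vector<float>.Count;                 // 8 floats per op with AVX/AVX2
        var acc = Vector<float>.Zero;
        int i = 0;
        for (; i <= a.Length - width; i += width)
        {
            acc += new Vector<float>(a, i) * new Vector<float>(b, i);
        }
        float sum = Vector.Dot(acc, Vector<float>.One);  // horizontal sum of the accumulator
        for (; i < a.Length; i++)                        // scalar tail for leftover elements
        {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```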

andy-soft commented 5 years ago

Hi there, I just got a CPU with 16 cores and 128 GB of RAM. Ready to train hard!!

zhongkaifu commented 5 years ago

Cool! I recently introduced MKL into Seq2SeqSharp and got a significant improvement in performance; if you like, you could try it in RNNSharp.

andy-soft commented 5 years ago

Great, which files should I download to test it? Also, did you try to make it into a service so it works as a REST endpoint or something else (to be consumed from other apps)? BTW, I am going to send you some improvements to Txt2Vec soon (speed and flexibility).

& + thanks


andy-soft commented 5 years ago

I just started training the English sequence-labeling (NER) sample from your package: a 143 MB flat text file, 2.2M words. I got a 32-core Xeon 3500 server with 128 GB of RAM, and... it took >24 hours to reach a mere 0.89% token error and 8.89% sequence error (about 40% of the total training time; then I aborted it). I am scared of the unusual time it takes to train those sets.... The binary model file is 1.8 GB!! Are those normal training times and model sizes, or should I go and purchase a multi-core CUDA card and use another LSTM library?