Problem with all non-English corpora

yoonkim / lstm-char-cnn

LSTM language model with CNN over characters

MIT License

Problem with all non-English corpora #7

Closed vseledkin closed 9 years ago

vseledkin commented 9 years ago

Great code/model, but I see one problem: I think the results for the character models are only valid for the English corpus (PTB). For all other languages (especially Russian, where every letter is a 2-byte sequence in UTF-8), you actually have models over sequences of bytes, not sequences of characters. Am I right? Or did you convert the corpora to a language-specific one-byte encoding before processing?
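To make the byte-versus-character concern concrete, here is a minimal Python sketch. It is not taken from the repository (which is written in Lua/Torch); the helper names `split_bytes` and `split_chars` and the sample words are hypothetical, chosen only for illustration. It assumes the corpus is stored as UTF-8 and shows how splitting a word byte by byte differs from splitting it by Unicode code point.

```python
# Illustrative only: hypothetical helpers, not code from lstm-char-cnn.
# Assumption: the corpus text is encoded as UTF-8.

def split_bytes(word: str) -> list:
    """Split a word into its UTF-8 bytes (what a byte-level model sees)."""
    return list(word.encode("utf-8"))


def split_chars(word: str) -> list:
    """Split a word into Unicode code points (what a character-level model should see)."""
    return list(word)


english = "model"
russian = "модель"

# ASCII letters occupy one byte each, so both views have the same length.
print(len(split_chars(english)), len(split_bytes(english)))  # 5 5

# Cyrillic letters occupy two bytes each in UTF-8, so the byte view is twice as long.
print(len(split_chars(russian)), len(split_bytes(russian)))  # 6 12
```

For the English word the two views coincide, which would explain why the problem does not show up on PTB. For the Russian word, a byte-level split hands the model twelve symbols for six letters, so a character CNN with a fixed filter width would cover only half as many actual letters per filter.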

yoonkim commented 9 years ago

Thanks for letting us know! We've uploaded a fix for this. We will also be updating the paper with the new results (they are largely the same, but we do a little better on Russian).