wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License

Make MAX_LENGTH an options parameter #76

Closed · porkopek closed this 4 years ago

porkopek commented 4 years ago

Hello!

First of all, thank you for this wonderful project.

It seems that franc limits the text sample it analyses to a hard-coded 2048 characters, in these lines:

https://github.com/wooorm/franc/blob/5842af9c1a74ffb47ebe3307bfc61cf29b6e842e/packages/franc/index.js#L21
https://github.com/wooorm/franc/blob/5842af9c1a74ffb47ebe3307bfc61cf29b6e842e/packages/franc/index.js#L93

Could this MAX_LENGTH const be part of options? It seems to me this is due to speed reasons, but I care more about accuracy than speed.

I am reading web pages that contain parts in more than one language, and I need to detect the most-used language, but the first 2048 characters may be in the less-used one.

Sorry if I misinterpreted the code and it is not doing what I thought.
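
For illustration, the request amounts to something like this (assuming franc 6's ESM export; the `maxLength` option is hypothetical, it is what this issue proposes, not something franc accepts today):

```js
import {franc} from 'franc'

// Today: everything past the hard-coded 2048 characters is ignored.
franc(longWebPageText)

// Hypothetical: let callers trade speed for accuracy.
// NOTE: `maxLength` is NOT a real franc option; it is the proposal here.
franc(longWebPageText, {maxLength: 10000})
```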

wooorm commented 4 years ago

Interesting, yeah that could be an option!

If you have documents with mixed languages though, maybe it’s better to first parse paragraphs, and pass each paragraph to franc, and use that to figure out what languages are used?
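
A rough sketch of that idea, assuming franc 6's ESM export and plain text in which paragraphs are separated by blank lines:

```js
import {franc} from 'franc'

// Detect each paragraph separately and tally results by text length,
// so the dominant language of a mixed document wins.
function languageTally(text) {
  const tally = {}
  for (const paragraph of text.split(/\n{2,}/)) {
    const language = franc(paragraph) // 'und' when franc can't tell.
    tally[language] = (tally[language] || 0) + paragraph.length
  }
  return tally
}

// Example result: {eng: 5210, fra: 812, und: 40}
```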

porkopek commented 4 years ago

I'm already parsing chunks of 2048 chars and then returning the most used language.

Paragraphs are not an exact measure, because I'm treating every block element as a paragraph (which makes sense), so an ul>li>a often does not have enough text to determine the language, and, what is worse, short paragraphs may be detected as the wrong language.

I need to improve accuracy, because my project relies heavily on detecting the language correctly.

Do you think that if, in a franc fork, I create more trigrams for the languages I support based on a list of the most-used words, and remove the languages I don't support, accuracy would improve? I don't care if the detection takes more time.
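
As an aside, the short-fragment problem has a real knob in franc already: the `minLength` option (default: 10) makes franc answer `und` for anything shorter instead of guessing. A sketch:

```js
import {franc} from 'franc'

// Raise the floor so tiny fragments (an `ul > li > a` label, say)
// come back as 'und' rather than as a wrong language.
franc('Home', {minLength: 30}) // => 'und'
```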

wooorm commented 4 years ago

Sounds like you’re dealing with HTML. In that case, can the lang attribute be used?
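
For instance, in a browser or with a DOM parser such as jsdom, the nearest declared language can be looked up before falling back to detection (a sketch; `declaredLanguage` is just an illustrative helper):

```js
// Prefer an explicit `lang` attribute on the element or an ancestor;
// fall back to content-based detection only when none is declared.
function declaredLanguage(element) {
  const carrier = element.closest('[lang]')
  return carrier ? carrier.getAttribute('lang') : undefined
}
```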

> I need to improve accuracy, because my project relies heavily on detecting the language correctly.

That’s always going to be hard. Franc supports 300+ languages. What languages are your documents in?

> Do you think that if, in a franc fork, I create more trigrams for the languages I support based on a list of the most-used words, and remove the languages I don't support, accuracy would improve? I don't care if the detection takes more time.

More trigrams won’t help. More trigrams aren’t available (we use the UDHR, and you can get about 300 interesting trigrams from that).

Have you thought about weighing in speaker counts?
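
Assuming franc 6's `francAll` export, which returns every candidate with a score, one way to weigh in speakers is to re-rank with a prior. A sketch; the speaker counts below are made-up placeholders, not data shipped with franc:

```js
import {francAll} from 'franc'

// Placeholder speaker counts; a real map would cover all candidate codes.
const speakers = {eng: 1.5e9, fra: 3.0e8, cat: 1.0e7}

function detectWeighted(text) {
  return francAll(text)
    .map(([code, score]) => [code, score * Math.log10(speakers[code] || 1)])
    .sort((a, b) => b[1] - a[1])[0][0]
}
```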

porkopek commented 4 years ago

Yeah, I'm parsing HTML and extracting text. As you may already know, HTML is not a perfect world and you cannot rely on anything :-D; attributes are often missing or wrong.

I'm supporting only the European Union languages (24), and I have a super large corpus, already classified by language, from which to extract trigrams (https://eur-lex.europa.eu/homepage.html, the official European Union site; this is not the site I have to identify languages from). Maybe I can improve the trigrams for the languages my app supports, or are more trigrams really not useful at all?
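
Given a known set like that, franc's `only` option (named `whitelist` before franc 5) can already restrict detection to the 24 official EU languages; a sketch with their ISO 639-3 codes (the smaller ones, such as Maltese, may require the `franc-all` build):

```js
import {franc} from 'franc'

// ISO 639-3 codes for the 24 official EU languages.
const euLanguages = [
  'bul', 'ces', 'dan', 'deu', 'ell', 'eng', 'est', 'fin',
  'fra', 'gle', 'hrv', 'hun', 'ita', 'lav', 'lit', 'mlt',
  'nld', 'pol', 'por', 'ron', 'slk', 'slv', 'spa', 'swe'
]

// Languages outside the list are never returned.
franc(text, {only: euLanguages})
```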

wooorm commented 4 years ago

In that case, you can see how wooorm/udhr and franc itself are built!

If the text you train on is more similar to the text you want to detect, it’ll probably help! I do think that 300 trigrams should be fine tho, but maybe more will help?
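
A simplified sketch of what building such a profile looks like, in the spirit of the Cavnar and Trenkle n-gram method that franc follows (this is not franc's actual build script):

```js
// Count character trigrams in a training corpus and keep the most
// frequent ones as that language's fingerprint (franc keeps ~300).
function trigramProfile(corpus, size = 300) {
  const counts = new Map()
  const cleaned = ' ' + corpus
    .toLowerCase()
    .replace(/[^\p{L}\p{M}\s]/gu, '')
    .replace(/\s+/g, ' ')
    .trim() + ' '
  for (let index = 0; index + 3 <= cleaned.length; index++) {
    const trigram = cleaned.slice(index, index + 3)
    counts.set(trigram, (counts.get(trigram) || 0) + 1)
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, size)
    .map(([trigram]) => trigram)
}
```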

wooorm commented 4 years ago

I think a) this proposal doesn’t solve the actual problem at hand, and b) I don’t see a clear need for this proposal, therefore I’m closing this issue.

porkopek commented 4 years ago

Yes, after thinking about this since we talked here, I think you are right.

The real problem I had with the library is that it recognizes some texts wrongly, especially between French and English, and tends to recognize a lot of Romance-language texts as Catalan, when Catalan is not that significant.

I think that the problem, maybe, is that the source of the trigrams is not ideal. I saw you took it from the UN's Universal Declaration of Human Rights, and it is a text of some 2000 words, with a lot of repeated words. I suppose you did it that way because it's a document with a lot of translations.

I am wondering if I can improve the recognition by supplying another trigram source, with a more realistic distribution, but at the moment I'm working on another part of my program, so I will do this at the end. I'll let you know if I get better results with a "better" trigram source.

Thank you

wooorm commented 4 years ago

Yeah let me know what your findings are!