wooorm / franc

Natural language detection
https://wooorm.com/franc/
MIT License
4.11k stars 173 forks

Getting weird results #41

Closed by zeke 7 years ago

zeke commented 7 years ago

Hey @wooorm am I doing something wrong here?

> apps.forEach(app => console.log(franc(app.description), app.description))

eng A universal clipboard managing app that makes it easy to access your clipboard from anywhere on any device
fra 5EPlay CSGO Client
nob Open-source Markdown editor built for desktop
eng Communication tool to optimize the connection between people
vmw Wireless HDMI
eng An RSS and Atom feed aggregator
eng A work collaboration product that brings conversation to your files.
src Pristine Twitter app
dan A Simple Friendly Markdown Note.
nno An open source trading platform
eng A hackable text editor for the 21 st Century
eng One workspace open to all designers and developers
nya A place to work + a way to work
cat An experimental P2P browser
sco Focused team communications
sco Bitbloq is a tool to help children to learn and create programs for a microcontroller or robot, and to load them easily.
eng A simple File Encryption application for Windows. Encrypt your bits.
eng Markdown editor witch clarity +1
eng Text editor with the power or Markdown
eng Open-sourced note app for programmers
sco Web browser that automatically blocks ads and trackers
bug Facebook Messenger app
dan Markdown editor for Mac / Windows / Linux
fra Desktop build status notifications
sco Group chat for global teams
src Your rubik's cube solves
sco Orthodox web file manager with console and editor
cat Game development tools
sco RPG style coding application
deu Modern browser without tabs
eng Your personal galaxy of inspiration
sco A menubar/taskbar Gmail App for Windows, macOS and Linux.

wooorm commented 7 years ago

Hey sorry about that @zeke. I don't have much time now, so I'll try to respond more extensively later. Essentially: franc is good at many languages, which means it needs bigger input to get better results! 😞😐

zeke commented 7 years ago

No worries! Unfortunately these short strings are all I have.

I'm really just trying to answer the question, "Is this string in English?"

Do you know of any alternatives?

I guess I could look for each word in https://github.com/zeke/an-array-of-english-words, and if most of them are found, call it English. ¯\_(ツ)_/¯
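
A rough sketch of that word-lookup idea, for illustration only. The helper names, the 0.5 threshold, and the tiny `sampleWords` set are all assumptions made so the snippet runs on its own; in practice the set would presumably be built from the array that an-array-of-english-words exports.

```javascript
// Hypothetical helper: fraction of the words in `text` that appear in a
// known-word set. `englishWords` is expected to be a Set of lowercase words.
function englishWordRatio(text, englishWords) {
  // Lowercase, then keep only runs of ASCII letters; everything else
  // (digits, punctuation, accented letters) acts as a separator.
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  if (words.length === 0) return 0;
  const hits = words.filter((word) => englishWords.has(word)).length;
  return hits / words.length;
}

// Hypothetical helper: call the text English when more than `threshold`
// of its words are known. The default of 0.5 is an arbitrary assumption.
function looksEnglish(text, englishWords, threshold = 0.5) {
  return englishWordRatio(text, englishWords) > threshold;
}

// Tiny stand-in vocabulary (an assumption, not the real word list).
const sampleWords = new Set(['an', 'and', 'atom', 'feed', 'aggregator']);

console.log(looksEnglish('An RSS and Atom feed aggregator', sampleWords));
console.log(looksEnglish('Ein Texteditor für Entwickler', sampleWords));
```

Product names and abbreviations such as "RSS" or "HDMI" are unlikely to be in a dictionary, so very short descriptions can still land on the wrong side of the threshold.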

wooorm commented 7 years ago

Could you use franc.all and, when the English score is bigger than .95 (for example), call it English? Maybe that'll work?

wooorm commented 7 years ago

This is a problem inherent to the algorithm: more languages means you need bigger documents for better guessing. I’ve noted that in the readme.

You could use franc-min, which supports fewer languages and therefore guesses better, if you’re only dealing with the most widely spoken languages.

Finally, this problem sounds more like asserting that something is English. Franc solves a slightly different problem: out of all languages, which one is the most likely? To assert that something is probably English, I suggest using franc.all and checking if eng has a certainty of 0.9 or higher.
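
A minimal sketch of that check, assuming franc.all returns a list of `[languageCode, score]` pairs. `isProbablyEnglish` is a hypothetical helper, not part of franc, and the scores in `example` are made up so the snippet runs without franc installed.

```javascript
// Hypothetical helper, not part of franc.
// `ranked` is assumed to be a list of [languageCode, score] pairs,
// which is the shape franc.all is assumed to return.
function isProbablyEnglish(ranked, threshold = 0.9) {
  // Find the entry for English, wherever it sits in the ranking.
  const entry = ranked.find(([code]) => code === 'eng');
  return entry !== undefined && entry[1] >= threshold;
}

// Made-up scores for illustration; real values would come from
// something like franc.all(app.description).
const example = [['sco', 1], ['eng', 0.97], ['fra', 0.62]];

console.log(isProbablyEnglish(example)); // true
```

With this check, a string that franc labels `sco` can still count as English as long as `eng` scores close behind. The threshold is a judgment call and may need tuning on real descriptions.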

zeke commented 7 years ago

Thanks for following up. I ended up using https://github.com/dachev/node-cld which has a binary dependency but the results are very accurate.

wooorm commented 7 years ago

Cool project! Yeah, there are definitely other algorithms that do better on smaller input!