yoeo / guesslang

Detect the programming language of a source code
https://guesslang.readthedocs.io
MIT License
773 stars 110 forks

Add all github/linguist extensions and code samples #32

ghost closed 3 years ago

ghost commented 3 years ago

There's a wealth of code samples and extension mappings over at github/linguist that could be used in this repository.
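As a rough illustration of what reusing those extension mappings might involve: Linguist keeps its language definitions in a YAML file (`lib/linguist/languages.yml`), where each language entry lists its file extensions. The sketch below inverts a small, hand-written excerpt in that shape into an extension-to-languages lookup. The names `LINGUIST_EXCERPT` and `extensions_to_languages` are hypothetical, the excerpt is abridged, and none of this is Guesslang's actual code.

```python
from collections import defaultdict

# Hand-written, abridged excerpt in the shape of Linguist's languages.yml,
# shown here as an already-parsed dict. Illustrative only, not the full data.
LINGUIST_EXCERPT = {
    "C": {"type": "programming", "extensions": [".c", ".h"]},
    "C++": {"type": "programming", "extensions": [".cpp", ".h", ".hpp"]},
    "Python": {"type": "programming", "extensions": [".py"]},
}


def extensions_to_languages(languages):
    """Invert {language: {"extensions": [...]}} into {extension: [languages]}."""
    mapping = defaultdict(list)
    for name, info in languages.items():
        # Some entries may have no "extensions" key, so default to an empty list.
        for ext in info.get("extensions", []):
            mapping[ext].append(name)
    return dict(mapping)


mapping = extensions_to_languages(LINGUIST_EXCERPT)
print(mapping[".py"])  # ['Python']
print(mapping[".h"])   # ['C', 'C++']
```

Extensions shared by several languages, such as `.h`, show why an extension mapping alone cannot always settle which language a file is written in.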

yoeo commented 3 years ago

Hi @4086606,

It will be quite challenging to support all the 500+ languages because Guesslang's machine learning model needs a lot of sample files for training.

In fact, to reach 70% to 80% correct language predictions, you'll have to train the model with around 1k sample files per language. And to reach 90% to 95% prediction accuracy, you'll need up to 25k samples for each language.
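To put those figures in perspective, here is a back-of-the-envelope estimate. It assumes exactly 500 languages (the lower bound of "500+") and treats the per-language numbers above as flat requirements; the function name `total_samples` is hypothetical.

```python
# Rough estimate of the total number of training files needed.
# Assumption: 500 languages, each needing the same number of samples.

def total_samples(num_languages: int, samples_per_language: int) -> int:
    """Total files needed if every language requires the same sample count."""
    return num_languages * samples_per_language


LANGUAGES = 500  # lower bound of the "500+" languages mentioned above

for label, per_language in [("70-80% accuracy", 1_000), ("90-95% accuracy", 25_000)]:
    print(f"{label}: ~{total_samples(LANGUAGES, per_language):,} files")
# 70-80% accuracy: ~500,000 files
# 90-95% accuracy: ~12,500,000 files
```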

I'm working on supporting 14 new languages (https://github.com/yoeo/guesslang/issues/29#issuecomment-863867962), and I'll definitely check Linguist to see how they manage to handle so many languages.

Thank you.

ghost commented 3 years ago

1 THOUSAND!? I greatly underestimated the model sizes. My lack of familiarity with ML brought this on, sorry.

Thanks for your hard work 👍

yoeo commented 3 years ago

No problem, and thanks for the support!!!