xhluca / dl-translate

Library for translating between 200 languages. Built on 🤗 transformers.
https://xhluca.github.io/dl-translate/
MIT License
432 stars 45 forks

Detect source language with langdetect package #37

Open awalker88 opened 3 years ago

awalker88 commented 3 years ago

The langdetect package has worked well for me in the past for language detection problems. How would you feel about allowing users to pass 'auto' as an option for source? I could see some pros and cons:

Pros

Cons

I'm a little new to open source but I would love to contribute 🙂 Of course, if you feel this doesn't fit this package's mission that's totally understandable.

xhluca commented 3 years ago

Hey, langdetect is cool! However, it seems there are many options for language detection, including fasttext and langid.py. Each option has its own accuracy (none of them are 100%) and speed, so I feel it might be difficult for the end user to choose.

Also, since we are now using m2m100 by default, it might confuse users who try to auto-detect a language that's not available with the chosen detection algorithm (but is available in m2m100).

I think a good option would be to start with a section in the user guide showing how to use any (or all) of the language detection libraries. Then from there, we could build a util function along the lines of:

src = dlt.lang.detect(source_text, backend="fasttext")  # or backend="langdetect" or backend="langid"
mt.translate(source_text, source=src,...)

This would raise an error asking the user to install the corresponding library if they want to use a specific backend.

awalker88 commented 3 years ago

Those are some good points. I agree it would be confusing to have the library detect a language but not translate it. I'll look into writing something that could potentially be put into the user guide.

xhluca commented 3 years ago

Thank you. Once we have something in the user guide I'd welcome another PR that'd update dlt.utils or dlt.lang as well, if you wish!

banyous commented 2 years ago

Hi, any updates on this issue? Is there any hint for auto-detecting the source language?

xhluca commented 2 years ago

@banyous Feel free to contribute a section in the user guide about using language detection, and from there, if we feel a wrapper around fasttext would make life easier, then I'm happy to welcome a PR to add language detection to dlt.utils or dlt.lang.

I think this is a decent starting point: https://fasttext.cc/docs/en/language-identification.html
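As a rough sketch of how that fasttext starting point might be wrapped: the function names `load_lid_model` and `detect_fasttext` are hypothetical, and the model file name is an assumption (a pretrained language identification model downloaded separately from the fasttext site). The code relies on `fasttext.load_model` and `model.predict`, which return labels prefixed with `__label__`.

```python
LABEL_PREFIX = "__label__"


def load_lid_model(path="lid.176.ftz"):
    """Load a fasttext language identification model.

    The default path is an assumption: the model file must be
    downloaded separately before calling this.
    """
    import fasttext  # imported lazily so fasttext stays optional

    return fasttext.load_model(path)


def detect_fasttext(text, model):
    """Return (language_code, probability) for the top prediction."""
    # fasttext predicts one line at a time, so newlines are replaced
    single_line = text.replace("\n", " ")
    labels, probs = model.predict(single_line, k=1)
    label = labels[0]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, float(probs[0])
```

The returned code could then be passed as `source` to `mt.translate`, provided the translation model in use supports that language.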