Support for sentence splitting

xhluca commented 3 years ago

Right now TranslationModel.translate translates each input string as is, which can be extremely slow for longer sequences due to the quadratic runtime of the architecture. The currently recommended workaround is to split the text into sentences with nltk first:

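import nltk
import dl_translate as dlt

nltk.download("punkt")  # fetch the punkt sentence tokenizer data

model = dlt.TranslationModel()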

text = "Mr. Smith went to his favorite cafe. There, he met his friend Dr. Doe."
sents = nltk.tokenize.sent_tokenize(text, "english")  # don't use dlt.lang.ENGLISH
" ".join(model.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))

This works well, but punkt doesn't cover all of the languages we support. It would be interesting to train a punkt model on each of the languages made available (though we'd need a very large dataset for that). Once that's done, the snippet above could become a simple argument, e.g. model.translate(..., max_length="sentence"). With some more effort, the max_length parameter could also be an integer n between 0 and 512, representing the maximum number of tokens per sequence. Moreover, rather than truncating at that length, we could break the input text into sequences of n tokens or fewer, each made up of whole sentences aggregated together.
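To make the aggregation idea concrete, here is a minimal sketch of packing already-split sentences into sequences of at most n tokens. Note that chunk_sentences and its count_tokens argument are hypothetical names, not part of dl-translate, and the default whitespace word count is only a stand-in for the model tokenizer's real token count:

```python
from typing import Callable, List


def chunk_sentences(
    sentences: List[str],
    max_tokens: int,
    # Assumption: whitespace word count as a rough proxy for model tokens.
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack consecutive sentences into chunks of at most max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sent in sentences:
        n = count_tokens(sent)
        # Flush the current chunk if this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than max_tokens is kept whole, not truncated.
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sents = [
    "Mr. Smith went to his favorite cafe.",
    "There, he met his friend Dr. Doe.",
    "They talked.",
]
chunks = chunk_sentences(sents, max_tokens=10)
```

The resulting chunks could then be passed to model.translate in place of the individual sentences. Packing greedily keeps sentence boundaries intact while reducing the number of sequences, at the cost of leaving oversized sentences for the caller to handle.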

xhluca commented 3 years ago

stanza might be a good option

xhluca commented 3 years ago

Might be worth training punkt on CC-100 or mC4 (which is the dataset behind mT5)

fbaeumer commented 3 years ago

What do you think about https://pypi.org/project/sentence-splitter/ ?

xhluca / dl-translate

Support for sentence splitting #8