Right now TranslationModel.translate will translate each input string as is, which can be extremely slow for longer sequences due to the quadratic runtime of the architecture. The currently recommended workaround is to use nltk to split the text into sentences first:
import nltk
import dl_translate as dlt

nltk.download("punkt")
model = dlt.TranslationModel()

text = "Mr. Smith went to his favorite cafe. There, he met his friend Dr. Doe."
sents = nltk.tokenize.sent_tokenize(text, "english")  # sent_tokenize expects "english", not dlt.lang.ENGLISH
" ".join(model.translate(sents, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))
This works well but doesn't cover all possible languages. It would be interesting to train the punkt model on each of the languages made available (though we'd need a very large dataset for that). Once that's done, the snippet above could be replaced with a simple argument, e.g. model.translate(..., max_length="sentence"). With some more effort, the max_length parameter could also accept an integer n between 0 and 512, representing the maximum number of tokens per sequence. Moreover, rather than truncating at that length, we could break the input text down into sequences of at most n tokens, built by aggregating consecutive sentences.
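As a rough illustration of that chunking behaviour, here is a sketch of what the logic behind an integer max_length could look like. None of this exists in dl-translate today: split_into_chunks and its count_tokens argument are hypothetical names, and the default token count is approximated with a simple word count.

import nltk

nltk.download("punkt")

def split_into_chunks(text, lang="english", max_tokens=512, count_tokens=None):
    """Greedily aggregate consecutive sentences into chunks of at most max_tokens tokens."""
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())  # crude approximation of the token length
    chunks, current = [], ""
    for sent in nltk.tokenize.sent_tokenize(text, lang):
        candidate = (current + " " + sent).strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)  # close the current chunk before it exceeds the budget
            current = sent
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

# A hypothetical max_length=n could then translate each chunk and rejoin the results:
# chunks = split_into_chunks(text, "english", max_tokens=128)
# " ".join(model.translate(chunks, source=dlt.lang.ENGLISH, target=dlt.lang.FRENCH))

Note that a single sentence longer than n tokens would still form its own oversized chunk in this sketch, so a real implementation would also need to decide whether to truncate or further split such sentences.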