Open Nickwiz opened 3 months ago
I'm a bit confused, why does it say that it is true by default, then immediately after it is false by default?
As I read it, it is set to True by default for now, under patches to the current version 4.44, since the code as-is in effect gives a result as if it were. When they move to version 4.45 it is going to be set to False by default, as that would make the tokenized string reversible into the original.
Would True or False be better in this case? Perhaps it is useful to have a reversible string, but that would be bad if it affects the final results (regression).
I do not know the code well enough to answer that. What I found, from some simple tests, is that formatting is not preserved. E.g.:
Input text:
Text with extra spaces and line terminators.
End.
mt.translate(text, source="en", target="en")
Output:
Text with extra spaces and line terminators. End.
Setting "clean_up_tokenization_spaces" to False does not change the result.
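For what it is worth, here is a rough pure-Python sketch of what I understand the cleanup step to do. The replacement list is my assumption, modelled on the library's behaviour and not copied from it, and `clean_up_tokenization` here is a stand-alone illustration. It only touches spaces in front of punctuation and contractions, which would explain why the flag makes no difference for text like the one above.

```python
# Illustrative re-implementation; the exact replacement list is an assumption.
REPLACEMENTS = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
    (" ' ", "'"), (" n't", "n't"), (" 'm", "'m"),
    (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
]


def clean_up_tokenization(text: str) -> str:
    """Remove the spaces a tokenizer may leave before punctuation/contractions."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


# Spaces before punctuation are removed, so the original string
# cannot be recovered afterwards (not reversible).
print(clean_up_tokenization("Hello , world !"))

# Text without such spaces passes through unchanged; line terminators
# and repeated spaces are not handled by this step at all.
print(clean_up_tokenization("Text with extra spaces. End."))
```

So with cleanup on, "Hello , world !" and "Hello, world!" decode to the same string, which is the non-reversible part; with it off, the decoded string keeps whatever spacing the tokens carried.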
As the project stands now, I do not see any feature for reverting, and I am not sure that is something that would be needed. Perhaps in some sub-component?
Preserving formatting, on the other hand, would be nice, but it is a rather huge challenge given the nature of translations. E.g.:
Input: line terminators
no: linjeavslutninger
fi: linjapäätteet
ru: терминаторы линии (terminator lines)
...
Where, if anywhere, would one add the extra white-space etc.? It is a rather big beast to handle, unless this is already a feature or being worked on in the models.
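A partial workaround, sketched below, could be to translate line by line and put the original line terminators and leading/trailing whitespace back afterwards. This is only an idea, not something the project does as far as I know: `translate_preserving_lines` is a hypothetical helper, and `translate` is whatever callable does the actual translation. Spacing inside a line is still left to the model.

```python
import re
from typing import Callable


def translate_preserving_lines(text: str, translate: Callable[[str], str]) -> str:
    """Translate each line separately, keeping line terminators as they were."""
    # The capturing group makes re.split keep the terminators in the result.
    parts = re.split(r"(\r\n|\n|\r)", text)
    out = []
    for part in parts:
        if not part.strip():
            # Line terminators, blank lines and empty pieces pass through untouched.
            out.append(part)
            continue
        # Keep indentation and trailing spaces outside the translated text.
        lead = part[: len(part) - len(part.lstrip())]
        trail = part[len(part.rstrip()):]
        out.append(lead + translate(part.strip()) + trail)
    return "".join(out)


# Stand-in "translator" for demonstration only: upper-cases the text.
print(translate_preserving_lines("  Text with spaces\n\nEnd.\n", str.upper))
```

With the project's function this could be called along the lines of `translate_preserving_lines(text, lambda s: mt.translate(s, source="en", target="fi"))`. The trade-off is that the model loses context across lines, which may hurt translation quality for sentences that wrap.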
Hmm, I see. In this case, I'm happy to merge a PR if you are interested in creating one!
FYI
Changes in the transformers tokenizer give a deprecation warning.
Something like this can be used:
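A minimal sketch, assuming the warning goes away once `clean_up_tokenization_spaces` is passed explicitly when the tokenizer is loaded. The helper names `tokenizer_kwargs` and `load_tokenizer` are made up for illustration, and the model name is left to the caller.

```python
def tokenizer_kwargs(clean_up: bool = True) -> dict:
    """Keyword arguments that set the flag explicitly instead of relying on the default."""
    return {"clean_up_tokenization_spaces": clean_up}


def load_tokenizer(model_name: str, clean_up: bool = True):
    """Load a tokenizer with the cleanup flag set explicitly (hypothetical helper)."""
    # Imported here so the helper above works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name, **tokenizer_kwargs(clean_up))
```

Passing True keeps the current 4.44 behaviour; passing False opts in early to what is announced as the 4.45 default.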
I have not found any difference in the result between True and False, but then again I have only just started looking at this project.