Open ybracke opened 11 months ago
Idea for an update to the train configs (here in yaml instead of toml):
```yaml
tokenizer:
  from-model: true|false
  path: str                  # only if from-model == false
  input-transliterator: str  # optional
language-models:
  model-type: str            # must be one of {T5, separate, ...}
  encoder:                   # only if model-type in {separate, ...}
    checkpoint: str
  decoder:                   # only if model-type in {separate, ...}
    checkpoint: str
  encoder-decoder:           # only if model-type in {T5, ...}
    checkpoint: str
```
It is allowed to use either a byte-based encoder-decoder as the base model or two BPE-based models (a separate encoder and decoder). However, the implementation is still a bit awkward, with multiple `if/else` structures concerning (1) the choice of the tokenizer and transliterator (see here), (2) the choice of the model class and loading function (see here), and (3) the correct special tokens (see here).

Update the implementation. Perhaps use an abstract base class for the models with a `load` function that is defined in the subclasses for `T5ForConditionalGeneration`, `EncoderDecoderModel`, and perhaps others.
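A minimal sketch of what such an abstract base class could look like. All class and key names here are placeholders (the config keys follow the proposed schema above, with `language-models` hyphenated); the `transformers` imports are deferred into the `load` methods so that dispatch logic stays independent of the heavy model classes:

```python
from abc import ABC, abstractmethod


class BaseModelLoader(ABC):
    """Hypothetical base class: each subclass knows how to build its HF model."""

    @classmethod
    @abstractmethod
    def load(cls, config: dict):
        """Instantiate the underlying HuggingFace model from the train config."""


class ByteBasedLoader(BaseModelLoader):
    """Single byte-based encoder-decoder checkpoint (e.g. a ByT5 model)."""

    @classmethod
    def load(cls, config: dict):
        from transformers import T5ForConditionalGeneration

        ckpt = config["language-models"]["encoder-decoder"]["checkpoint"]
        return T5ForConditionalGeneration.from_pretrained(ckpt)


class SeparateLoader(BaseModelLoader):
    """Two BPE-based checkpoints warm-started into one encoder-decoder."""

    @classmethod
    def load(cls, config: dict):
        from transformers import EncoderDecoderModel

        lm = config["language-models"]
        return EncoderDecoderModel.from_encoder_decoder_pretrained(
            lm["encoder"]["checkpoint"], lm["decoder"]["checkpoint"]
        )


# model-type -> loader subclass; replaces the scattered if/else branches
MODEL_TYPES = {"T5": ByteBasedLoader, "separate": SeparateLoader}


def load_model(config: dict):
    """Single entry point: pick the loader from model-type and delegate."""
    model_type = config["language-models"]["model-type"]
    return MODEL_TYPES[model_type].load(config)
```

The tokenizer and special-token choices could be handled the same way, as further methods on the subclasses, so all three `if/else` sites collapse into one dispatch table.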