ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

Improve implementation for different model types #67

Open · ybracke opened this issue 11 months ago

ybracke commented 11 months ago

Either a byte-based encoder-decoder model or two BPE-based models (one as encoder, one as decoder) can be used as the base model. However, the implementation is still a bit awkward, with multiple if/else structures concerning (1) the choice of the tokenizer and transliterator (see here), (2) the choice of the model class and loading function (see here), and (3) the correct special tokens (see here).

Update the implementation. Perhaps use an abstract base class for the models, with a load function defined in the subclasses for T5ForConditionalGeneration, EncoderDecoderModel, and perhaps others.
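A minimal sketch of what that could look like; the class and attribute names here are illustrative assumptions, not the current code:

```python
# Hypothetical sketch: an abstract base wrapper whose subclasses encapsulate
# the model-specific loading logic (names are illustrative, not the actual API).
from abc import ABC, abstractmethod

from transformers import EncoderDecoderModel, PreTrainedModel, T5ForConditionalGeneration


class NormalizerModel(ABC):
    """Wraps a seq2seq base model and hides type-specific loading."""

    @abstractmethod
    def load(self) -> PreTrainedModel:
        """Instantiate and return the underlying Hugging Face model."""


class ByteBasedModel(NormalizerModel):
    """Single byte-based encoder-decoder, e.g. a ByT5 checkpoint."""

    def __init__(self, checkpoint: str) -> None:
        self.checkpoint = checkpoint

    def load(self) -> PreTrainedModel:
        return T5ForConditionalGeneration.from_pretrained(self.checkpoint)


class SeparateEncoderDecoderModel(NormalizerModel):
    """Two BPE-based models combined into one EncoderDecoderModel."""

    def __init__(self, encoder_checkpoint: str, decoder_checkpoint: str) -> None:
        self.encoder_checkpoint = encoder_checkpoint
        self.decoder_checkpoint = decoder_checkpoint

    def load(self) -> PreTrainedModel:
        return EncoderDecoderModel.from_encoder_decoder_pretrained(
            self.encoder_checkpoint, self.decoder_checkpoint
        )
```

Callers would then only deal with `NormalizerModel.load()`, and the type-specific branching disappears into the subclasses.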

ybracke commented 11 months ago

Idea for an update to the train configs (here in YAML instead of TOML):

```yaml
tokenizer:
  from-model: true|false
  path: str                  # only if from-model == false
  input-transliterator: str  # optional
language-models:
  model-type: str            # must be one of {T5, separate, ...}
  encoder:                   # only if model-type in {separate, ...}
    checkpoint: str
  decoder:                   # only if model-type in {separate, ...}
    checkpoint: str
  encoder-decoder:           # only if model-type in {T5, ...}
    checkpoint: str
```