ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0

Evaluation: Allow ignoring types #92

Open ybracke opened 7 months ago


Concerns file(s): src/transnormer/evaluation/align_levenshtein.py

Note: Perhaps this update should be made in the original package rather than here.

Adjust the align functions so that tokens of a certain type, e.g. tokens that contain only punctuation symbols, are excluded from the alignment.

Desired behavior:

    >>> regex = r"..."  # should be a regex that matches strings that only contain punctuation
    >>> align(['Sie bekommen ferner --'], ['bekommen ferner an —'], exclude=regex)
    [
        [
            ("Sie", "░", 4),
            ("bekommen", "bekommen", 0),
            ("ferner", "ferner▁an", 3.5999999999999996),
            # not here: ("--", "—", 2)
        ],
    ]
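
One possible way to get this behavior is to drop matching tokens before they reach the alignment step. The sketch below is not the package's API: `drop_excluded` and `PUNCT_ONLY` are hypothetical names, the pattern is only an approximation of "punctuation only" (it also matches other non-word symbols such as `$`), and whitespace tokenization is assumed.

```python
import re
from typing import List, Pattern, Union

# Hypothetical pattern (an assumption, not taken from the issue):
# one or more characters that are neither word characters nor whitespace.
PUNCT_ONLY = r"[^\w\s]+"


def drop_excluded(tokens: List[str], exclude: Union[str, Pattern[str]]) -> List[str]:
    """Return the tokens that do NOT fully match the `exclude` regex."""
    pattern = re.compile(exclude)
    # fullmatch ensures the whole token consists of excluded characters,
    # so a token like "an," is kept while "--" is dropped.
    return [tok for tok in tokens if not pattern.fullmatch(tok)]


# Assumed whitespace tokenization of the sentences from the example above.
src_tokens = "Sie bekommen ferner --".split()
trg_tokens = "bekommen ferner an —".split()

print(drop_excluded(src_tokens, PUNCT_ONLY))  # ['Sie', 'bekommen', 'ferner']
print(drop_excluded(trg_tokens, PUNCT_ONLY))  # ['bekommen', 'ferner', 'an']
```

Filtering ahead of the alignment keeps the alignment code itself unchanged. Whether the excluded tokens should instead be skipped inside `align` is a design question this sketch leaves open.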