ybracke / transnormer

A lexical normalizer for historical spelling variants using a transformer architecture.

GNU General Public License v3.0

Dev refactor loading #24

Closed. ybracke closed this 1 year ago

ybracke commented 1 year ago

Adds more test data.
Updates loader.py with more and better functions for loading data, specifically:
- Separates reading data (with read_* functions) from loading it into a datasets.Dataset
- Adds reader functions for different data sets: DTAEval (the original XML version), RIDGES (as provided by M. Bollmann), Leipzig Corpora
- Future reader functions should follow the structure of the ones above
- Adds an option to include metadata such as filename or publication year
- Adds detokenization with NLTK
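
For illustration, here is a minimal sketch of how the split between reading and loading could look. It is not the code in loader.py: the function names (read_tsv_pairs, detokenize, load_dataset_from_columns), the column names, and the tab-separated input format are all assumptions made for this example. Only datasets.Dataset.from_dict and NLTK's TreebankWordDetokenizer are existing library calls.

```python
from pathlib import Path
from typing import Dict, List


def read_tsv_pairs(path: str, metadata: bool = False) -> Dict[str, List]:
    """Hypothetical reader: parse a file with one 'orig<TAB>norm' token
    pair per line and a blank line between sentences.

    Returns a column-oriented dict and does not touch `datasets`, so the
    reading step stays independent of the loading step.
    """
    columns: Dict[str, List] = {"orig": [], "norm": []}
    if metadata:
        columns["filename"] = []

    text = Path(path).read_text(encoding="utf-8")
    for block in text.split("\n\n"):
        # Keep only well-formed lines that contain a tab separator.
        pairs = [line.split("\t") for line in block.splitlines() if "\t" in line]
        if not pairs:
            continue
        columns["orig"].append([p[0] for p in pairs])
        columns["norm"].append([p[1] for p in pairs])
        if metadata:
            # Optional metadata column, one value per sentence.
            columns["filename"].append(Path(path).name)
    return columns


def detokenize(tokens: List[str]) -> str:
    """Join a token list back into a plain string with NLTK."""
    # Imported lazily so the reader above works without NLTK installed.
    from nltk.tokenize.treebank import TreebankWordDetokenizer

    return TreebankWordDetokenizer().detokenize(tokens)


def load_dataset_from_columns(columns: Dict[str, List]):
    """Loading step: wrap the columns produced by any reader in a Dataset."""
    # Imported lazily so the reader above works without `datasets` installed.
    import datasets

    return datasets.Dataset.from_dict(columns)
```

Under these assumptions, a caller would run something like `load_dataset_from_columns(read_tsv_pairs("sample.tsv", metadata=True))`, and a reader for another corpus would only need to return a dict with the same column layout.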