ybracke/transnormer
A lexical normalizer for historical spelling variants using a transformer architecture.
GNU General Public License v3.0
Dev refactor loading #24
Closed
ybracke closed 1 year ago
ybracke commented 1 year ago
- Adds more test data
- Updates `loader.py` to contain more and better functions for loading data, specifically:
  - Reading data (with `read_*` functions) is now separated from loading it into a `datasets.Dataset`
  - Reader functions for different data sets: DTAEval (the original XML version), RIDGES (as provided by M. Bollmann), Leipzig Corpora
    - Future reader functions should follow the structure of the ones above
  - Option to include metadata like filename or publication year
  - Detokenization with NLTK
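
For illustration, the read/load split described above could look roughly like the sketch below. This is not the code in `loader.py`: the function names (`read_token_pairs`, `nltk_detokenize`, `load_as_dataset`), the column names, and the input format (one `orig<TAB>norm` token pair per line, blank line between sentences) are all assumptions made for the example. Only `datasets.Dataset.from_dict` and NLTK's `TreebankWordDetokenizer` are real library calls, and both are imported lazily so the reader itself needs only the standard library.

```python
import os
from typing import Callable, Dict, List, Optional


def nltk_detokenize(tokens: List[str]) -> str:
    """Join tokens into a raw string with NLTK's Treebank detokenizer."""
    # Imported lazily so the reader below works without NLTK installed.
    from nltk.tokenize.treebank import TreebankWordDetokenizer

    return TreebankWordDetokenizer().detokenize(tokens)


def read_token_pairs(
    path: str,
    metadata: bool = False,
    detokenizer: Optional[Callable[[List[str]], str]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical reader: parse a file into a dict of columns.

    Assumed format: one `orig<TAB>norm` token pair per line, with a blank
    line between sentences. A line without a tab raises ValueError.
    """
    # Fall back to naive whitespace joining if no detokenizer is passed.
    join = detokenizer if detokenizer is not None else " ".join
    columns: Dict[str, List[str]] = {"orig": [], "norm": []}
    orig_tokens: List[str] = []
    norm_tokens: List[str] = []

    def flush() -> None:
        # Emit the sentence collected so far, then reset the buffers.
        if orig_tokens:
            columns["orig"].append(join(list(orig_tokens)))
            columns["norm"].append(join(list(norm_tokens)))
            orig_tokens.clear()
            norm_tokens.clear()

    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line.strip():
                flush()  # blank line marks a sentence boundary
                continue
            orig, norm = line.split("\t", 1)
            orig_tokens.append(orig)
            norm_tokens.append(norm)
    flush()  # last sentence if the file has no trailing blank line

    if metadata:
        # Optional metadata column: the source filename, once per sentence.
        columns["filename"] = [os.path.basename(path)] * len(columns["orig"])
    return columns


def load_as_dataset(columns: Dict[str, List[str]]):
    """Loading step, kept apart from reading: wrap the columns in a Dataset."""
    import datasets  # Hugging Face `datasets`, imported lazily

    return datasets.Dataset.from_dict(columns)
```

With this shape, a reader for another corpus only has to return the same dict of columns, and `load_as_dataset(read_token_pairs(path, metadata=True, detokenizer=nltk_detokenize))` would chain the two steps.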