welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Fix split introductions #429

MansMeg closed this issue 8 months ago

MansMeg commented 8 months ago

Some introductions in the records are split at line breaks. We should fix this so that there is only one line break. Jesper looked at this in his thesis and might have both good training data and even a model we could use/try. Otherwise we could fix the simple cases right away programmatically.

BobBorges commented 8 months ago

There is a script already, but I'm missing a file it looks for: input/segmentation/join_intro_patterns.csv. If no one has that file, I think I can just do without it...

Two sequential speaker intros are merged: if intros[0].text ends with "-", remove the "-" and join with no space; otherwise join with a single space after intros[0].text. Caveman style, but it should work in most cases I think.
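
For reference, the join rule described above could look roughly like the sketch below. This is not the existing script: the function name merge_intros is hypothetical, it works on plain strings rather than the corpus's intro elements, and trimming whitespace around the break is an added assumption.

```python
def merge_intros(first: str, second: str) -> str:
    """Join the text of two consecutive intro fragments into one string.

    Hypothetical helper illustrating the rule from this thread; the
    whitespace trimming is an assumption, not part of the description.
    """
    # Assumption: whitespace around the line break carries no meaning.
    first = first.rstrip()
    second = second.lstrip()

    if first.endswith("-"):
        # Word hyphenated across the break: drop "-" and join with no space.
        return first[:-1] + second

    # Otherwise join the two fragments with a single space.
    return first + " " + second


if __name__ == "__main__":
    print(merge_intros("Herr An-", "dersson:"))
    print(merge_intros("Herr Andersson i", "Stockholm:"))
```

One known weakness of this rule: a fragment that ends in a genuine hyphen (for example the first half of a double-barrelled name) loses that hyphen, which is the kind of case the trained model might handle better.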

MansMeg commented 8 months ago

At least that would be a first run to get the obvious errors fixed.

MansMeg commented 8 months ago

Here is the model: https://huggingface.co/jesperjmb/MergeIntrosNSP