welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

take into account various spelling reforms #273

Open fredrik1984 opened 1 year ago

fredrik1984 commented 1 year ago

We may need to take into account the different spelling reforms, since they could affect our algorithms, for instance the ones that identify speaker introductions.

There seem to have been two bigger reforms during the Swerik period: around 1869 and 1906 (when we got the “modern” Swedish). After 1906, the “hv” was replaced by “v” (e.g. “hvad” became “vad”). Already after 1890, “qv” started to be replaced by “kv” (e.g. “qvinna” became “kvinna”).
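To make the two examples above concrete, here is a minimal sketch of a normalization pass that could be applied to a copy of the text before matching. The rule list and the function name `normalize_spelling` are assumptions for illustration and are not taken from the corpus code. The rules cover only the "hv" and "qv" cases mentioned here, and a blanket rewrite like this would also touch proper names that kept the old spelling.

```python
import re

# Hypothetical rewrite rules, covering only the two examples in this comment.
RULES = [
    (re.compile(r"\bhv", re.IGNORECASE), "v"),   # word-initial "hv" -> "v"
    (re.compile(r"qv", re.IGNORECASE), "kv"),    # "qv" -> "kv"
]


def _match_case(replacement, original):
    """Carry the capitalization of the matched text over to the replacement."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement[0].upper() + replacement[1:]
    return replacement


def normalize_spelling(text):
    """Return a copy of text with the older spellings rewritten."""
    for pattern, replacement in RULES:
        text = pattern.sub(
            lambda m, r=replacement: _match_case(r, m.group(0)), text
        )
    return text


print(normalize_spelling("Hvad sade qvinnan?"))
```

Running the rules on a copy keeps the original protocol text untouched, so the older spelling is still available for anyone studying the reforms themselves.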

I did a little bit of searching, and Svenska Akademien’s dictionary (SAOL) seems to be a good source for tracing these changes, for example by comparing the 1874, 1889, and 1923 editions.

https://spraakbanken.gu.se/saolhist/

http://runeberg.org/saol/

BobBorges commented 1 year ago

From what I can see (@ninpnin correct me if I'm wrong here), intro detection relies primarily on herr/fru/fröken/talman + name + ":". The spelling of these is consistent in the 1874 SAOL, except for talman, which is not listed. We may want to look out for a double <l>, by analogy with similar words (there are 61 instances of "tallman" up to 1891, but none of them appear to be intros at first glance).
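As a rough illustration of that pattern (not the regexp actually used in the repository), a title + name + colon matcher could look like the sketch below. The names `INTRO` and `match_intro` are hypothetical, and the `tall?man` alternative is only there to cover the double-l variant mentioned above.

```python
import re

# Title (case-insensitive), then a name starting with a capital letter,
# then a colon that ends the line. "tall?man" accepts talman and tallman.
INTRO = re.compile(
    r"^\s*(?P<title>(?i:herr|fru|fröken|tall?man))\s+"
    r"(?P<name>[A-ZÅÄÖ][^\s:]*(?:\s+[^\s:]+)*)\s*:\s*$"
)


def match_intro(line):
    """Return (title, name) if the line looks like an intro, else None."""
    m = INTRO.match(line)
    if m is None:
        return None
    return m["title"], m["name"]


for line in ["Herr Andersson:", "Tallman Berg:", "Talmannen meddelade detta."]:
    print(line, "->", match_intro(line))
```

Note that this sketch requires the name to start with a capital letter, so a name such as "von Otter" would need an extra rule.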

ninpnin commented 1 year ago

The intro mapping prioritizes finding Herr/fru/fröken + Name. It also tries to find names by capitalization, which should not break even if the spelling is weird:

[Screenshot, 2023-04-13: the list of intro patterns; image not preserved]

(the list is in priority order)

MansMeg commented 1 year ago

How are the introductions predicted in Jesper's thesis included? Is that model used from 1920 onward, and regexp before that?

fredrik1984 commented 1 year ago

Hm, although "talman" is not listed in the 1874 dictionary, the term was still in use in the parliament. It is not common in intros from what I can see, though. Rather, it often appears as "talmannen" in descriptions of what is going on in the chamber.

The older spelling will impact other intros like:

"Grefve Hamilton:" (Count Hamilton; later spelled "Greve")

"Chefen för Kongl. Ecklesiastik-departementet, Herr Statsrådet Wennerberg:" (here "Kongl." later became "Kungl.", an abbreviation of Kungliga, i.e. Royal)

There also seem to be many "Friherrar" in the parliament in the 19th century... I guess we have to adapt the algorithm for that!
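One possible way to adapt it, sketched under the assumption that a simple variant table is enough: map each title spelling to a modern form and build the title alternatives from the table. `TITLE_VARIANTS` and `parse_titled_intro` are hypothetical names, and the table only holds the titles mentioned in this thread.

```python
import re

# Hypothetical variant table: spelling found in the text -> modern spelling.
TITLE_VARIANTS = {
    "grefve": "greve",
    "greve": "greve",
    "friherre": "friherre",
    "herr": "herr",
    "fru": "fru",
    "fröken": "fröken",
}

# Longest alternatives first, so a short title never shadows a longer one.
_ALTERNATIVES = "|".join(
    sorted(map(re.escape, TITLE_VARIANTS), key=len, reverse=True)
)
TITLED_INTRO = re.compile(
    rf"^\s*(?P<title>(?i:{_ALTERNATIVES}))\s+(?P<name>[^\s:][^:]*?)\s*:\s*$"
)


def parse_titled_intro(line):
    """Return (modern_title, name) for a 'Title Name:' line, else None."""
    m = TITLED_INTRO.match(line)
    if m is None:
        return None
    return TITLE_VARIANTS[m["title"].lower()], m["name"]


print(parse_titled_intro("Grefve Hamilton:"))
```

Adding a newly discovered spelling then only means adding one entry to the table.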

ninpnin commented 1 year ago

However, both the intro detection and segment classification rely on neural networks, and they have only been trained on data from 1920-1989. I.e. the training data does not match the data we use it on.

ninpnin commented 1 year ago

@fredrik1984 can we safely assume all friherrar are MPs, or could a minister be called a friherre too?

fredrik1984 commented 1 year ago

Yes, and there are different ways to present MPs before 1920, especially when it comes to titles: Greve, Friherre etc

fredrik1984 commented 1 year ago

@ninpnin I guess a friherre could be a minister as well. In the 19th century, if the speaker of the house was a count he was introduced as "Herr greve and talman".

I guess that a person who is a "friherre" could also be a minister who is not an MP, although most are of course MPs. In that case I suppose they are also introduced as a minister, like this: Chefen för Kongl. Ecklesiastik-departementet, Friherre Statsrådet Wennerberg

"Friherre" is some kind of lord: https://en.wikipedia.org/wiki/Freiherr

ninpnin commented 1 year ago

It looks like they have both titles then, e.g. Herr Statsrådet Friherre von Otter. In that case it shouldn't be an issue for us.
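A small sketch of why stacked titles need not be a problem: if the leading title tokens are peeled off one by one, the name is whatever remains. This is an illustration only, not the mapping code in the repository; `TITLE_TOKENS` and `split_titles_and_name` are hypothetical, and the token set is limited to titles quoted in this thread.

```python
# Hypothetical, lower-cased title vocabulary taken from the examples above.
TITLE_TOKENS = {
    "herr", "fru", "fröken", "statsrådet", "friherre", "grefve", "greve",
}


def split_titles_and_name(intro):
    """Split an intro line into its leading titles and the remaining name."""
    # Drop the trailing colon, then keep only the part after the last comma,
    # which skips an office description such as "Chefen för ...,".
    core = intro.strip().rstrip(":").rsplit(",", 1)[-1]
    tokens = core.split()
    titles = []
    while tokens and tokens[0].lower() in TITLE_TOKENS:
        titles.append(tokens.pop(0).lower())
    return titles, " ".join(tokens)


print(split_titles_and_name("Herr Statsrådet Friherre von Otter:"))
```

Because the titles are returned as a list, a later step can still check whether "statsrådet" is among them when deciding between minister and MP.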

fredrik1984 commented 1 year ago

Ok, good!

MansMeg commented 1 year ago

How are the introductions predicted in Jesper's thesis included? Is that model used from 1920 onward, and regexp before that?

Ping @ninpnin. Is this correct? Or is the regexp the way to identify the individual person from an intro segment?

salgo60 commented 1 year ago

OT: SPARQL query on Wikidata for Swedish MPs with P97 (noble title) set to Friherre (Q1338119); quality unknown.

ninpnin commented 1 year ago

@MansMeg NN for intro detection, regexp for intro mapping

ninpnin commented 1 year ago

Friherrar pull request https://github.com/welfare-state-analytics/riksdagen-corpus/pull/274