134 Intros replying to minister/speaker
Description:
Detecting who is speaking in an intro that mentions multiple people is a problem in the corpus that will likely be solved by a language model in the long run. The problem has several levels of difficulty, which the algorithm implicitly handles differently.
If more than one person of the same category (member of parliament, minister, speaker) is mentioned, I think it picks the person mentioned first.
If people of different categories are mentioned, they are matched hierarchically, in the order minister, speaker, member of parliament.
If several people from several categories are mentioned, a combination of the first two rules is used: matching is hierarchical, and the first person mentioned within each category is picked. The exception is when both a speaker and a minister are mentioned; I am not sure who is identified then.
A further problem is that speakers and ministers are matched aggressively, and this is unavoidable because they are often referred to by title alone, for example "Första kammarens vice talman:".
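The matching rules described above can be sketched roughly as follows. This is only an illustration, not the corpus code: the function name `pick_speaker`, the category labels and the `PRIORITY` table are assumptions, and the unclear speaker-plus-minister case is simply resolved by the stated order.

```python
from typing import List, Optional, Tuple

# Assumed priority order taken from the description: lower value wins.
PRIORITY = {"minister": 0, "speaker": 1, "member": 2}


def pick_speaker(mentions: List[Tuple[str, str]]) -> Optional[str]:
    """Pick one name from (name, category) pairs given in order of mention.

    The highest-priority category wins; within a category the person
    mentioned first wins.
    """
    if not mentions:
        return None
    # Sort key: (category priority, position in the intro).
    position, (name, _category) = min(
        enumerate(mentions),
        key=lambda item: (PRIORITY[item[1][1]], item[0]),
    )
    return name


example = [("Andersson", "member"), ("Sträng", "minister"), ("Lidbom", "minister")]
chosen = pick_speaker(example)
```

Here `chosen` is the first minister in the list, even though a member of parliament is mentioned earlier.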
Examples:
prot: 197778/prot-197778--5.xml, hash: ba402f42
113 Intro detection regex bugs
Description:
The regex-based introduction detection currently misclassifies introductions in some systematic ways (both type-1 and type-2 errors).
Examples:
OCR splitting
Type 1 error: "Herr statsrådet Stadener avlämnade Kungl. Maj:"
Type 2 error: "Fru Sjöström-Bengtsson: Herr talman! Min blanka reservation vid detta utlåtande föranleder mig att, utan att ha något yrkande, ändå säga ett par ord i denna sak."
After going through the digitised era manually, it seems that a large majority of the misses are caused by incorrect splitting of the intro, such as "Anf. 35 Näringsminister BJÖRN ROSEN" \n "GREN (s):".
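One possible mitigation, sketched here under assumptions, is to rejoin a split intro before running detection. The heuristic and the names `STARTS_LIKE_INTRO` and `merge_split_intros` are hypothetical; the pieces are joined without a space because the break in the example above falls inside a name, which will not hold for every split.

```python
import re
from typing import List

# Hypothetical heuristic: a paragraph that starts like a numbered speech
# header ("Anf. 35 ...") but does not end with ":" was probably cut in half.
STARTS_LIKE_INTRO = re.compile(r"^Anf\.\s*\d+\s")


def merge_split_intros(paragraphs: List[str]) -> List[str]:
    """Rejoin an intro that was split across two consecutive paragraphs."""
    merged: List[str] = []
    i = 0
    while i < len(paragraphs):
        current = paragraphs[i].strip()
        following = paragraphs[i + 1].strip() if i + 1 < len(paragraphs) else None
        if (
            following is not None
            and STARTS_LIKE_INTRO.match(current)
            and not current.endswith(":")
            and following.endswith(":")
        ):
            # Assumption: the break fell inside a word, so no space is added.
            merged.append(current + following)
            i += 2
        else:
            merged.append(current)
            i += 1
    return merged


rejoined = merge_split_intros(["Anf. 35 Näringsminister BJÖRN ROSEN", "GREN (s):"])
```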
Protocol dates ending with ":"
"1977/78:" is classified as an intro with an unknown speaker.
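A small guard could filter out such strings before the intro check. The sketch below is an assumption about what these labels look like (a four-digit year, a slash, and a two- or four-digit year), and `is_session_label` is a hypothetical helper rather than an existing function.

```python
import re

# Matches a bare session label such as "1977/78:" and nothing else.
SESSION_LABEL = re.compile(r"^\s*\d{4}/\d{2}(?:\d{2})?\s*:\s*$")


def is_session_label(text: str) -> bool:
    """Return True if the text is only a session year followed by a colon."""
    return SESSION_LABEL.match(text) is not None
```

A detector could call this first and skip the paragraph when it returns True.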
Anföranden (speeches; I believe there are also type-1 errors)
Type 2 error: "Anf. 107 STEN ANDERSSON i Malmö (m):"
Type 2 error: "Anf. 108 Statsrådet BENGT K. Å. JOHANSSON:"
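For illustration, a pattern that accepts both of these examples could look like the sketch below. It is not the regex used in the project, and the group names `number`, `who` and `party` are made up here; it has only been checked against the two strings above.

```python
import re

# "Anf." + running number + free-form name/title + optional "(party)" + ":".
ANF_INTRO = re.compile(
    r"^Anf\.\s*(?P<number>\d+)\s+"
    r"(?P<who>.+?)\s*"
    r"(?:\((?P<party>[^()]+)\))?\s*:\s*$"
)

first = ANF_INTRO.match("Anf. 107 STEN ANDERSSON i Malmö (m):")
second = ANF_INTRO.match("Anf. 108 Statsrådet BENGT K. Å. JOHANSSON:")
```

Both calls return a match object; the `party` group is `None` for the second string because no party is given.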
Unclear reason
Type 2 error: "Herr statsrådet LIDBOM erhöll ordet för (...)"
Type 2 error: "Herr finansministern STRÄNG erhöll ordet för att besvara fru (...)"
Short comments
(FÖRSTE VICE TALMANNEN: Debatten handlar inte om Förbifart Helsingfors.Jag får be talaren att hålla sig till ämnet.)
We will most likely switch to BERT for introduction detection, but it is still useful to keep this list of bugs so that we can double-check that BERT solves them.
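The list could be kept as a small regression check that any detector, regex or BERT, can be run against. The sketch below assumes that a type-1 error means a false positive and a type-2 error a false negative, so the expected labels are derived from the examples above. `failing_cases` and `naive_is_intro` are hypothetical names, and the naive detector is only a stand-in.

```python
from typing import Callable, List, Tuple

# (text, expected_is_intro) pairs copied from the examples in this issue.
KNOWN_CASES: List[Tuple[str, bool]] = [
    ("Herr statsrådet Stadener avlämnade Kungl. Maj:", False),
    ("1977/78:", False),
    ("Anf. 107 STEN ANDERSSON i Malmö (m):", True),
    ("Anf. 108 Statsrådet BENGT K. Å. JOHANSSON:", True),
]


def failing_cases(is_intro: Callable[[str], bool]) -> List[str]:
    """Return the texts for which the detector disagrees with the expected label."""
    return [text for text, expected in KNOWN_CASES if is_intro(text) != expected]


def naive_is_intro(text: str) -> bool:
    # Stand-in detector: anything ending with a colon counts as an intro.
    return text.rstrip().endswith(":")


still_broken = failing_cases(naive_is_intro)
```

With the stand-in detector, `still_broken` contains the two false positives; a real model would be passed in place of `naive_is_intro`.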