welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Megaissue for corpus bugs #135

Closed by rbbby 1 year ago

rbbby commented 2 years ago

#134 Intros replying to minister/speaker

Description: Detecting who is speaking in an intro that mentions multiple people is a known problem in the corpus, and in the long run it will likely be solved by a language model. The problem has several levels of difficulty, which the current algorithm handles differently, but only implicitly.

Examples:

#113 Intro detection regex bugs

Description: The regex-based introduction detection currently misclassifies introductions in some systematic ways (both type-1 and type-2 errors, i.e. false positives and false negatives).

Examples:

OCR splitting

Protocol dates ending with ":"

Anföranden (I believe there are also type-1 errors)

Unclear reason

Short comments
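To make two of the categories above concrete, here is a minimal sketch of how a regex-based detector can fail in both directions. The pattern `NAIVE_INTRO` and the sample lines are hypothetical illustrations, not the regex or data actually used in the corpus pipeline.

```python
import re

# Hypothetical pattern (an assumption, not the corpus regex):
# a capitalised first token, one to five further tokens, then a trailing colon.
NAIVE_INTRO = re.compile(r"^[A-ZÅÄÖ]\S*(?:\s+\S+){1,5}\s*:$")


def looks_like_intro(line: str) -> bool:
    """Return True if the line matches the naive intro pattern."""
    return NAIVE_INTRO.match(line.strip()) is not None


# A well-formed intro line is detected.
print(looks_like_intro("Herr ANDERSSON i Stockholm:"))  # True

# Type-1 error (false positive): a date line that happens to end with ":".
print(looks_like_intro("Onsdagen den 12 mars:"))  # True

# Type-2 error (false negative): OCR has split the intro over two lines,
# so neither fragment matches on its own.
print(looks_like_intro("Herr ANDERSSON i Stock-"))  # False
print(looks_like_intro("holm:"))  # False
```

Tightening the pattern to reject the date line tends to make it miss more real intros, which is part of why a purely regex-based approach is hard to get right here.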

rbbby commented 2 years ago

We are most likely switching to BERT for introduction detection, but it could be good to keep this list of bugs so we can double-check that BERT solves them.
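One possible way to do that double check is to keep the known bugs as labelled cases and run any candidate classifier over them. The sketch below is only an illustration: `Case`, `failing_cases`, the sample lines, and the stand-in classifier are all hypothetical, and a real check would plug in the BERT-based classifier and actual lines from the protocols.

```python
from typing import Callable, Iterable, List, NamedTuple


class Case(NamedTuple):
    text: str       # line taken from a protocol (hypothetical samples here)
    is_intro: bool  # expected label
    bug: str        # which bug category the case belongs to


def failing_cases(classify: Callable[[str], bool],
                  cases: Iterable[Case]) -> List[Case]:
    """Return the cases where the classifier disagrees with the expected label."""
    return [c for c in cases if classify(c.text) != c.is_intro]


# Hypothetical cases, one per bug category we want to keep an eye on.
CASES = [
    Case("Herr ANDERSSON i Stockholm:", True, "baseline"),
    Case("Onsdagen den 12 mars:", False, "protocol date ending with ':'"),
    Case("Det är riktigt.", False, "short comment"),
]


def colon_rule(text: str) -> bool:
    """Stand-in classifier (an assumption): any line ending with ':' is an intro."""
    return text.endswith(":")


for case in failing_cases(colon_rule, CASES):
    print(f"still failing [{case.bug}]: {case.text}")
```

Because `failing_cases` only needs a function from text to bool, the same list can be reused unchanged for the old regex and for the new model.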

ninpnin commented 1 year ago

We use BERT now.