welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

review of chairs document #463

Closed SimonHallen closed 4 months ago

SimonHallen commented 5 months ago

Review of the chairs document: finding duplicates, etc.

MansMeg commented 5 months ago

There seems to be a huge number of edits, and I'm not sure why. My guess is that the file is formatted incorrectly somehow. We need to fix this. @BobBorges do you see what went wrong?

MansMeg commented 5 months ago

Yes, and the chair test also fails.

BobBorges commented 5 months ago

Simon added a column, which I didn't anticipate or warn about. I will fetch this file, try to isolate the meaningful changes, and remove the extra column.

ninpnin commented 5 months ago

Maybe my git-comma-diff would come in handy?

git diff --word-diff-regex='[^[:space:],]+' $argv
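
To illustrate what that word regex does, here is a minimal Python sketch that tokenizes two versions of a CSV row the same way (runs of characters that are neither whitespace nor commas) and reports only the fields that differ. The function name `changed_fields` and the sample row are hypothetical, and `\s` is used as a stand-in for the POSIX class `[:space:]`; this is not part of git-comma-diff itself.

```python
import difflib
import re

# Same token definition as the --word-diff-regex above: runs of characters
# that are neither whitespace nor commas.
TOKEN = re.compile(r"[^\s,]+")


def changed_fields(old_line, new_line):
    """Return (old_tokens, new_tokens) pairs for the spans that differ."""
    old = TOKEN.findall(old_line)
    new = TOKEN.findall(new_line)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical row: only the year field changed.
print(changed_fields("c1,1867,ak", "c1,1868,ak"))
```

Because only commas and whitespace separate tokens here, a row whose delimiter was changed to something else would show up as entirely different, which may be part of why the diff looked so large.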

BobBorges commented 5 months ago

@fredrik1984 please have a look at 50-random-edited-rows.csv

MansMeg commented 5 months ago

The test still fails? Shouldn't we fix that first?

BobBorges commented 5 months ago

The test fails because the delimiter changed, so it doesn't find the columns. I want to make sure that we can see what Simon actually changed and that those edits are reasonable (hence the CSV and tagging Fredrik) before I start messing with the file. Right now the edits look reasonable, but I'd like a second opinion; after that it's just a few minutes to fix the formatting.
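
As a rough sketch of that failure mode and the formatting fix: the helper below tries a few candidate delimiters, accepts the first one under which the header contains all expected columns, and keeps only those columns, which drops any extra one. The function name, the candidate list, and the column names in the example are assumptions for illustration, not the actual layout of the chairs file or the project's test code.

```python
import csv
import io


def read_with_expected_columns(text, expected, candidates=(",", ";", "\t")):
    """Parse CSV text with the first delimiter whose header has all expected columns.

    Returns (delimiter, rows), with each row reduced to the expected columns,
    so any extra column is dropped.
    """
    for delimiter in candidates:
        reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
        header = reader.fieldnames or []
        if all(col in header for col in expected):
            rows = [{col: row[col] for col in expected} for row in reader]
            return delimiter, rows
    raise ValueError(f"none of {candidates!r} yields the columns {expected!r}")


# Hypothetical file content: semicolon-delimited, with an extra "note" column.
sample = "chair_id;chamber;note\nc1;fk;checked\n"
delimiter, rows = read_with_expected_columns(sample, ["chair_id", "chamber"])
print(delimiter, rows)
```

With a comma as the delimiter, the whole header parses as a single field, so none of the expected columns are found; that is the situation the failing test ran into.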

MansMeg commented 5 months ago

It is always better to fix the tests first. Otherwise, we might find new bugs after @fredrik1984 has done his checks.

MansMeg commented 5 months ago

Why not just remove the column? You could do a separate PR to see the diffs?

BobBorges commented 5 months ago

@fredrik1984 -- the chairs test, as it was before Simon's work, passes.

MansMeg commented 5 months ago

Yay!

BobBorges commented 5 months ago

Don't merge yet...

BobBorges commented 5 months ago

The tests that were already running on the chairs data are now passing, but there are three skipped tests that still fail. I think we can merge the data as it is and potentially continue fixing inaccuracies. What do you all say? @MansMeg @ninpnin

@SimonHallen I'm attaching some files from unit tests here. Do you think you will have time to look at some of these?

Also, @LaurineMir went through some of the places where the matriklar didn't line up with the Swerik metadata. I'm attaching a list of issues that might be related to OCR: for example, the year is wrong, and in some cases the MP was long dead by the date of the seat assignment. Maybe you could have a look at some of those, as it's relevant for the result of your project.

20240206-1551_ChairHogs.csv 20240206-1551_EmptySeats.csv 20240206-1551_LoveSeats.csv probable_ocr-err_in_matriklarna.csv
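
One of the checks mentioned above (an MP assigned a seat after their recorded death) could be sketched roughly as follows. The field names `person_id` and `start`, the `death_dates` mapping, and the sample values are hypothetical and do not reflect the actual Swerik metadata schema or the attached files.

```python
from datetime import date


def seated_after_death(seatings, death_dates):
    """Return seatings whose start date is later than the person's recorded death date.

    People without a recorded death date are skipped.
    """
    return [
        s
        for s in seatings
        if s["person_id"] in death_dates
        and s["start"] > death_dates[s["person_id"]]
    ]


# Hypothetical data: p2 is seated three years after the recorded death date.
seatings = [
    {"person_id": "p1", "chair_id": "c1", "start": date(1900, 1, 15)},
    {"person_id": "p2", "chair_id": "c2", "start": date(1903, 1, 15)},
]
death_dates = {"p1": date(1910, 5, 1), "p2": date(1900, 3, 2)}
print(seated_after_death(seatings, death_dates))
```

Rows flagged this way are only candidates for OCR errors; the year in the matriklar or the death date in the metadata could each be the wrong value.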

SimonHallen commented 5 months ago

I'll take a look at it!