BobBorges closed 1 year ago
Great! Let's leave this as an issue for now. I guess you can use the union of all dates to test the MP database in the meantime? Then we can fix this further down the line.
No. These are not the same date issues that were addressed in the last PR (date-issues).
The issue that was fixed in that PR relates to underspecified dates (e.g. a whole year of protocols dated to Jan 1).
The issue here is related to protocols that list multiple dates (including the actual date of the protocol) in the protocol metadata. In the cases I investigated, it happens either because a cover sheet lists dates for a bundle of protocols, or because the date-finding algorithm picks up other dates mentioned in the notes/utterances. It's not an urgent problem for checking MP completeness or identifying Parliament days, but maybe something to fix in the long term.
I think this might NOT be an issue. I've just been looking at high-frequency dates (those occurring in 8 and 9 protocols) and it seems like they're there for good reason: those dates appear in the notes of the protocols whose metadata lists them.
It seems like the issue is my assumption that each protocol has one date. Do we want to spend more time on this?
No, if you don't think it is an issue, let's close this, and then we can reopen it later if needed.
In relation to #332, some issues remain relating to protocol dates. I checked how often individual dates occur in the protocols of a given chamber. The expectation is that a date occurs in one protocol per chamber, except where two protocols have start/end points on the same page. This however isn't the case -- here's a summary:
In the list, you see counts of dates occurring in N protocols -- 22161 dates occur once, 3025 dates twice, etc. Per-chamber date counts attached. ak_date-count.txt ek_date-count.txt fk_date-count.txt
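For reference, the kind of count described above can be sketched roughly as follows. This is only an illustration, not the script that produced the attached files: the input mapping `protocol_dates`, the function names, the protocol IDs and the dates are all made up, and how dates are actually read from the protocol metadata is left out.

```python
from collections import Counter, defaultdict


def date_protocol_counts(protocol_dates):
    """Return chamber -> Counter(date -> number of protocols listing that date).

    `protocol_dates` is a hypothetical mapping of
    (chamber, protocol_id) -> list of dates found in that protocol's metadata.
    """
    per_chamber = defaultdict(Counter)
    for (chamber, _protocol_id), dates in protocol_dates.items():
        # set() so a date repeated inside one protocol is counted once
        for date in set(dates):
            per_chamber[chamber][date] += 1
    return per_chamber


def occurrence_histogram(per_chamber):
    """Return N -> number of (chamber, date) pairs that occur in exactly N protocols."""
    hist = Counter()
    for counts in per_chamber.values():
        hist.update(counts.values())
    return hist


# Toy data only; IDs and dates are invented for the example.
protocol_dates = {
    ("fk", "prot-1"): ["1982-10-04", "1982-10-05"],
    ("fk", "prot-2"): ["1982-10-05"],
    ("ak", "prot-1"): ["1982-10-04"],
}

per_chamber = date_protocol_counts(protocol_dates)
hist = occurrence_histogram(per_chamber)
```

Dates that land in a bucket with N > 1 are the ones worth inspecting by hand, e.g. by listing the protocols behind each such date per chamber.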
There may be legit reasons for it, but I think this is something to look at in detail, as it may be indicative of other issues. For instance, looking at those 8 dates that occur in 9 protocols, we find that the first few pages of 198283 prot 1--9 are the same and incidentally list a range of dates (like a cover sheet).
I didn't look too much more into it because automating a test doesn't seem straightforward at this point (re duplicate pages: element IDs are still unique due to curation, so line-by-line comparison doesn't help much), but the issue I'm raising here is twofold: