welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Renamed protocol files 1875-2021 to get zero-padded running numbers with 3 positions, fixes issue #100 #390

Closed ljo closed 9 months ago

ljo commented 10 months ago
BobBorges commented 9 months ago

I've cloned @ljo's branch and the file names look fine.

I'm worried about the merge causing conflicts etc. I did a test with a dummy repo and renamed three files (file-one --> file_1) using `git mv`. When I merged with the master branch, only one of the three was recognized as 'renamed'. The other two were marked as "deleted by them"/"new file".

Some googling gets me a stackoverflow answer "Git will automatically detect the move/rename if your modification is not too severe. ... 'not too severe' means that the new file and old file are >50% 'similar' based on some similarity indexes that git uses" (my emphasis). Probably padding numbers in our file names is similar enough, but it's not completely transparent what's going on under the hood.

To avoid potential issues with other ongoing work, I suggest we merge this only when there are no branches with ongoing work involving protocol files -- i.e. when all other branches are merged into dev.
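
The similarity threshold quoted above can be illustrated with a small sketch. It uses Python's `difflib.SequenceMatcher` as a rough stand-in for Git's similarity index, which is an assumption made for illustration and not Git's actual algorithm; the function names and sample contents are hypothetical. The 50% value is Git's default rename threshold.

```python
from difflib import SequenceMatcher

# Git's default rename-detection threshold is 50% similarity.
RENAME_THRESHOLD = 0.5


def similarity(old: str, new: str) -> float:
    """Rough stand-in for Git's similarity index (NOT Git's real algorithm)."""
    return SequenceMatcher(None, old, new).ratio()


def looks_like_rename(old: str, new: str, threshold: float = RENAME_THRESHOLD) -> bool:
    """Return True if the two contents are similar enough to count as a rename."""
    return similarity(old, new) >= threshold


# Hypothetical file contents, for illustration only.
original = "<TEI>\n<text>\n<body>protocol text</body>\n</text>\n</TEI>\n"
pure_rename = original              # file moved, content untouched
unrelated = "0123456789\n" * 5      # content almost entirely replaced

print(looks_like_rename(original, pure_rename))
print(looks_like_rename(original, unrelated))
```

Under this reading, a pure rename leaves the content 100% similar, so detection would mainly be at risk where another branch has also rewritten a large part of the same file.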

MansMeg commented 9 months ago

I think this sounds like a plan, i.e. take this as the last PR. It might mean some additional work for @ljo. Are you OK with this?

ljo commented 9 months ago

> I think this sounds like a plan, i.e. take this as the last PR. It might mean some additional work for @ljo. Are you OK with this?

Yes, with the amendment of the decision on today's meeting.

BobBorges commented 9 months ago

I was a little premature in the meeting today. I want to raise a couple points for discussion on this issue:

BobBorges commented 9 months ago

If we can decide on höst and get those 1892/1905 urtima protocols zero padded, then I think this can be merged today. All my stuff touching protocols has been merged.

ljo commented 9 months ago

> I was a little premature in the meeting today. I want to raise a couple points for discussion on this issue:
>
> * @ljo, you removed `höst` from fall sessions' file names. I don't think we should do that. (Getting the `ö` out of the file name is good though). I looked for other instances of these 'specifiers' getting removed from file names, but I didn't see others -- did you take out anything else from file names?

No, I only removed `höst`, since it was the only one of these specifiers that was in the general sequence; all the others had their own sequences.

> * 1892, 1905: zero padding didn't take effect on urtima and urtima2 sessions.

Fixing now
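
The zero padding discussed here can be sketched as follows. This is a minimal illustration, assuming the running number is the last group of digits right before the `.xml` extension; the file names in the example and the helper name `zero_pad` are hypothetical and not taken from the PR.

```python
import re

# Assumption: the running number is the digit group directly before ".xml".
_TRAILING_NUMBER = re.compile(r"(\d+)(\.xml)$")


def zero_pad(filename: str, width: int = 3) -> str:
    """Pad the trailing running number of a file name to `width` digits."""
    return _TRAILING_NUMBER.sub(
        lambda m: m.group(1).zfill(width) + m.group(2), filename
    )


# Hypothetical file names, for illustration only.
print(zero_pad("prot-1905-urtima--ak--1.xml"))   # prot-1905-urtima--ak--001.xml
print(zero_pad("prot-1892-urtima2--fk--12.xml"))  # prot-1892-urtima2--fk--012.xml
```

Because only the digits directly before the extension are touched, a digit inside a specifier such as urtima2 is left alone, and names that are already padded are returned unchanged.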

BobBorges commented 9 months ago

I just wonder if that can be relevant info for the mandate periods of the MPs. I had been using these specifiers categorically to determine the 'standard' start/end of parliament sessions. @fredrik1984 @MansMeg, what do you say? I could pull this info from elsewhere if we really want höst out of the filenames.

fredrik1984 commented 9 months ago

@BobBorges I am not sure what you mean here?

ljo commented 9 months ago

Preferably, I would also like urtima2 out of these specifiers, since it looks like the other years with more than one urtima still keep them in the same sequence. But I did not change this now. For höst there is also the robustness perspective, which speaks in favour of its removal.

BobBorges commented 9 months ago

@fredrik1984 I just mean that I treated it the same way as urtima -- that höst sessions have their own start and end dates.

@ljo. I guess so long as we don't get rid of this info completely it can be removed from the file names. Right now it's still in the TEI/text/front/div/head element and pb facs attrib....

fredrik1984 commented 9 months ago

Yes – höst/vår/lagtima/urtima/a/b riksdag meetings should be treated in the same way.

ljo commented 9 months ago

For mandate periods of MPs, governments, etc., I think the metadata should be used. Currently, different types of specifiers are used in the filenames, with some small variations. Yes, the identifiers in the documents are still the same. I only want the filenames changed, for the previously stated reasons: a) robustness, b) clarity, and, very minor, c) keeping the documents in date order without having to look at the dates inside (which requires parsing).
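
Point c) can be illustrated with a short sketch: a plain lexicographic sort puts unpadded running numbers out of order, while zero-padded ones sort as expected. The file names and the helper `running_number` are hypothetical and only serve the illustration.

```python
import re


def running_number(filename: str) -> int:
    """Extract the trailing running number before '.xml' (assumed pattern)."""
    match = re.search(r"(\d+)\.xml$", filename)
    if match is None:
        raise ValueError(f"no running number in {filename!r}")
    return int(match.group(1))


# Hypothetical file names, for illustration only.
unpadded = ["prot-1905--ak--10.xml", "prot-1905--ak--2.xml", "prot-1905--ak--1.xml"]
padded = ["prot-1905--ak--010.xml", "prot-1905--ak--002.xml", "prot-1905--ak--001.xml"]

# Lexicographic order compares character by character, so "10" sorts before "2".
print([running_number(name) for name in sorted(unpadded)])  # [1, 10, 2]
print([running_number(name) for name in sorted(padded)])    # [1, 2, 10]
```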

ljo commented 9 months ago

> Yes – höst/vår/lagtima/urtima/a/b riksdag meetings should be treated in the same way.

@fredrik1984 Could you please elaborate a bit on this?

fredrik1984 commented 9 months ago

We should add start/end dates for höst/vår/lagtima/urtima/a/b riksdag meetings using Lotta's curated list (attached here).

Riksmöten_def.xlsx

BobBorges commented 9 months ago

I don't have a strong feeling about the filenames, but I have been using the file names as a convenient way to fetch these periods -- I think we should either keep all the specifiers or none (in which case we need some additional means to store the specifier data -- either in xml or in csv files).
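
One way to keep the specifier data if it leaves the file names would be to extract it once and store it in a CSV file. The sketch below assumes a naming pattern of the form `prot-<year>-<specifier>--<rest>.xml` with an optional specifier; that pattern, the column names, and the function names are assumptions for illustration, not the project's actual convention.

```python
import csv
import io
import re

# Assumed pattern: "prot-<year>" optionally followed by "-<specifier>", then "--".
_NAME = re.compile(r"^prot-(?P<year>\d{4,6})(?:-(?P<specifier>[^-]+))?--")


def session_specifier(filename: str) -> tuple[str, str]:
    """Return (year, specifier); the specifier is '' when the name has none."""
    match = _NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return match.group("year"), match.group("specifier") or ""


def specifiers_to_csv(filenames: list[str]) -> str:
    """Build CSV text with one row per file: file, year, specifier."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "year", "specifier"])
    for name in filenames:
        year, specifier = session_specifier(name)
        writer.writerow([name, year, specifier])
    return buffer.getvalue()


# Hypothetical file names, for illustration only.
print(specifiers_to_csv(["prot-1905-urtima--ak--001.xml", "prot-1905--ak--001.xml"]))
```

Running this before a rename would preserve the specifier of each file in a table that later code can join against, rather than re-parsing file names.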

ljo commented 9 months ago

> We should add start/end dates for höst/vår/lagtima/urtima/a/b riksdag meetings using Lotta's curated list (attached here).
>
> Riksmöten_def.xlsx

OK, yes, so the very first date I picked out of the list seems wrong though. 1967 vår has start value 1967-06-10 but should be 1967-01-10.

BobBorges commented 9 months ago

@MansMeg GH review seems impossible b/c the diffs for ∞ files won't load in the browser. Aside from that, the question of whether or not to keep höst in the file names has not really been answered clearly for me. If you're OK with that, please merge.

I've cloned @ljo's fork and looked at the changes locally -- changes are what we expect.

MansMeg commented 9 months ago

Okay, great! Then I'll wait for the tests to run and then merge.