Add a paragraph-level ID in the Parla-CLARIN files

welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Add a paragraph-level ID in the Parla-CLARIN files #146

Closed: rbbby closed this issue 2 years ago

rbbby commented 2 years ago

The XML schema has not been updated for a long while, and we have found both bugs to fix and improvements to make.

[x] #138 solve bug regarding pointers for unknown speakers
[ ] Add an OCR-block-level ID

And more...

MansMeg commented 2 years ago

I think we should split this into two issues.

MansMeg commented 2 years ago

Based on the discussion, we should add a paragraph ID of the form "protocol_id" + a random hash of 8 characters.
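
A minimal sketch of the proposed format, for illustration only. The function name `make_paragraph_id`, the `-` separator, the example protocol ID, and the reading of "random hash" as a random hexadecimal string are all assumptions, not something the thread specifies.

```python
import secrets


def make_paragraph_id(protocol_id: str, n_chars: int = 8) -> str:
    """Return protocol_id plus a random hex suffix of n_chars characters.

    Hypothetical helper: the "-" separator and the hex alphabet are assumptions.
    """
    # token_hex(n) yields 2*n hex characters, so slice down to n_chars.
    suffix = secrets.token_hex(n_chars)[:n_chars]
    return f"{protocol_id}-{suffix}"


# Hypothetical protocol ID, used only as an example.
example_id = make_paragraph_id("prot-1933--fk--5")
print(example_id)
```

Eight hex characters carry 32 random bits, so such a suffix is only unique with high probability within a protocol, not globally.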

MansMeg commented 2 years ago

After discussions with others I suggest that we use a UUID as ID here: https://en.wikipedia.org/wiki/Universally_unique_identifier

This can be done with the uuid package in R (and I assume there is an equivalent library for Python). This should make the IDs globally unique, which was my worry.
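
For reference, a short sketch of the Python side, assuming a random (version 4) UUID is what is meant. The helper name `new_paragraph_uuid` is hypothetical; the `uuid` module itself is part of the Python standard library.

```python
import uuid


def new_paragraph_uuid() -> str:
    """Return a random (version 4) UUID in its canonical text form."""
    return str(uuid.uuid4())


paragraph_id = new_paragraph_uuid()
print(paragraph_id)       # 32 hex digits in five hyphen-separated groups
print(len(paragraph_id))  # 36
```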

ninpnin commented 2 years ago

I would probably go with this: https://github.com/skorokithakis/shortuuid . It generates a UUID and formats it in base57, so the IDs aren't going to be awfully long in text.

>>> import shortuuid
>>> shortuuid.uuid()
'AzwhqneGDoe2EgepXtWusu'

MansMeg commented 2 years ago

Ok. Does this exist in languages other than Python? The benefit of UUID is that more or less any language seems to support generating one.

ninpnin commented 2 years ago

It is a UUID (you get the same number of random bits); the only difference is how you format it.
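
To illustrate the point, here is a standard-library-only sketch that writes a UUID's 128-bit integer in a 57-symbol alphabet and reads it back. The alphabet (digits and letters minus the look-alikes 0, 1, I, O, l), the digit order, and the padding are assumptions made for this example; its output is not guaranteed to match what shortuuid produces.

```python
import string
import uuid

# Assumed alphabet: 62 alphanumerics minus five look-alike characters = 57 symbols.
ALPHABET = "".join(
    c
    for c in string.digits + string.ascii_uppercase + string.ascii_lowercase
    if c not in "01IOl"
)
BASE = len(ALPHABET)
WIDTH = 22  # 57**22 > 2**128, so 22 symbols always suffice


def encode(u: uuid.UUID) -> str:
    """Write the UUID's 128-bit integer in base 57, left-padded to WIDTH."""
    n = u.int
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)).rjust(WIDTH, ALPHABET[0])


def decode(s: str) -> uuid.UUID:
    """Invert encode(): rebuild the integer, then the UUID."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return uuid.UUID(int=n)


u = uuid.uuid4()
short = encode(u)
assert decode(short) == u  # same bits, different text form
```

Because the encoding is reversible, the short form can always be converted back to a standard UUID in any language that has arbitrary-precision integers.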

MansMeg commented 2 years ago

Ok. Is the formatting easy to do? I can't find a quick R package that does this. UUID has the benefit that it is very much a standard. Although, I could probably code up an R package to do this if it is beneficial. How much space would we save by using short UUID compared to UUID?
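
As a rough answer to the space question, the sketch below counts how many characters are needed to write 128 bits in a given base and compares that with the canonical UUID text form. The helper name `min_chars` is hypothetical; the arithmetic is independent of any particular library.

```python
import uuid


def min_chars(base: int, bits: int = 128) -> int:
    """Smallest k such that base**k can represent every value of `bits` bits."""
    k = 0
    while base ** k < 2 ** bits:
        k += 1
    return k


canonical_len = len(str(uuid.uuid4()))  # 32 hex digits + 4 hyphens
print(canonical_len)   # 36
print(min_chars(16))   # 32
print(min_chars(57))   # 22
print(min_chars(58))   # 22
```

So a base57 or base58 form saves 14 characters per ID compared with the 36-character canonical form.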

ninpnin commented 2 years ago

Done in https://github.com/welfare-state-analytics/riksdagen-corpus/commit/aa17c1c0e3bab4fa19926ab691dcc8928452a20e . The IDs are encoded in base58.

MansMeg commented 2 years ago

Great!