Closed by rbbby 2 years ago
I think we should split this into two issues.
Based on the discussion, we should add a paragraph ID of the form "protocol_id" + a random hash of 8 characters.
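For illustration, a minimal sketch of that ID scheme in Python. The helper name `make_paragraph_id`, the hyphen separator, and the use of hex characters for the "hash" are all assumptions on my part; the discussion only fixes the protocol ID prefix and the 8-character length.

```python
import secrets


def make_paragraph_id(protocol_id: str, n_chars: int = 8) -> str:
    """Hypothetical helper: protocol ID followed by a random hex suffix.

    The hyphen separator and the hex alphabet are assumptions, not
    something decided in this thread.
    """
    if n_chars <= 0 or n_chars % 2 != 0:
        raise ValueError("n_chars must be a positive even number")
    # token_hex(k) returns 2*k hex characters drawn from a secure RNG.
    suffix = secrets.token_hex(n_chars // 2)
    return f"{protocol_id}-{suffix}"


# "example-protocol" is a placeholder, not a real protocol ID.
paragraph_id = make_paragraph_id("example-protocol")
print(paragraph_id)
```

Note that 8 hex characters carry only 32 random bits, so uniqueness is only plausible within a single protocol, not globally.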
After discussions with others I suggest that we use a UUID as ID here: https://en.wikipedia.org/wiki/Universally_unique_identifier
This can be done with the uuid package in R (and I assume there is an equivalent library for Python). A UUID should be globally unique, which was my main worry.
I would probably go with this: https://github.com/skorokithakis/shortuuid . It generates a UUID and formats it in base57, so the IDs aren't going to be awfully long in text.
>>> import shortuuid
>>> shortuuid.uuid()
'AzwhqneGDoe2EgepXtWusu'
Ok. Does this exist in languages other than Python? The benefit of UUID is that more or less any language seems to support generating UUIDs.
It is a UUID (you get the same amount of random bits); only the formatting differs.
Ok. Is the formatting easy to do? I can't find an R package that does this. UUID has the benefit that it is very much a standard. Although, I could probably code up an R package to do this if it is beneficial. How much space would we save by using short UUID compared to UUID?
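To give a rough idea of both the formatting and the space question, here is a sketch using only the Python standard library. It treats the UUID as a 128-bit integer and rewrites it in base 57. The alphabet (digits and letters without the look-alikes 0, 1, I, O, l) and the function names `encode`/`decode` are assumptions for illustration; the output is not guaranteed to match what shortuuid itself produces.

```python
import math
import uuid

# Assumed 57-symbol alphabet: digits and letters minus 0, 1, I, O, l.
ALPHABET = "23456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
BASE = len(ALPHABET)
# Number of base-57 digits needed to hold any 128-bit value.
WIDTH = math.ceil(128 / math.log2(BASE))


def encode(u: uuid.UUID) -> str:
    """Write the UUID's 128-bit integer in base 57, padded to fixed width."""
    n = u.int
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    # Most significant digit first, left-padded with the zero symbol.
    return "".join(reversed(digits)).rjust(WIDTH, ALPHABET[0])


def decode(s: str) -> uuid.UUID:
    """Inverse of encode: rebuild the UUID from its base-57 string."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return uuid.UUID(int=n)


u = uuid.uuid4()
short = encode(u)
print(len(str(u)), len(short))
assert decode(short) == u
```

Under these assumptions the standard hyphenated form is 36 characters and the base-57 form is 22, so about 14 characters saved per ID, and the conversion is lossless in both directions.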
Done in https://github.com/welfare-state-analytics/riksdagen-corpus/commit/aa17c1c0e3bab4fa19926ab691dcc8928452a20e . The IDs are encoded in base58.
Great!
The XML schema has not been updated for a long while, and we have found both bugs to fix and improvements to make.
[x] #138 solve bug regarding pointers for unknown speakers
[ ] Add an OCR-block level id
And more...