Closed roger-mahler closed 2 years ago
Great that you are bringing this up. I am not an expert on the schema, and @ninpnin, who designed it, is sick at the moment. But it seems like this is something we may want to correct.
When doing topic modelling on the corpus I have so far used the `speech_iterator` function from the pyparlaclarin package. The function has been updated to work with the 0.4.X version of the corpus, but the update is not yet in the module. Below I added an example of how I gather concatenated speeches with the updated function, as well as an example of how to get metadata for unknown speakers.
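For reference, here is a minimal sketch of the kind of aggregation involved. This is not the actual pyparlaclarin code; `concatenated_speeches` is a hypothetical stand-in using only the standard library, and it assumes the convention that a `<u>` element carrying a `prev` attribute continues the speech started by the nearest preceding `<u>` without one:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def concatenated_speeches(root):
    """Return a list of (first_utterance_id, text) pairs, merging
    utterances chained together with prev/next attributes.

    Hypothetical helper: a new speech starts at every <u> that has
    no prev attribute; a <u> with prev is appended to the current one.
    """
    speeches = []
    for u in root.iter(TEI_NS + "u"):
        # Concatenate the text of the <seg> children of this utterance.
        text = " ".join(seg.text.strip() for seg in u if seg.text)
        if u.get("prev") is not None and speeches:
            # Continuation of the previous utterance's speech.
            first_id, so_far = speeches[-1]
            speeches[-1] = (first_id, so_far + " " + text)
        else:
            speeches.append((u.get(XML_ID), text))
    return speeches


example = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <u xml:id="u-1" next="u-2"><seg>First part.</seg></u>
  <u xml:id="u-2" prev="u-1"><seg>Second part.</seg></u>
  <u xml:id="u-3"><seg>Another speech.</seg></u>
</TEI>"""

print(concatenated_speeches(ET.fromstring(example)))
# Two speeches: u-1/u-2 are merged, u-3 stands alone.
```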
I.e., now speeches where the speaker is unknown are linked with next and prev attributes, too.
In many instances (in fact, all that I have seen), sequences of unknown-speaker utterances are not chained together with prev/next pointers (see the example below). Since the prev/next pointers are used to aggregate utterances into speeches, the number of speeches explodes when every single utterance becomes its own speech. The segment prot-1958-b-fk--12 illustrates the problem.
A consequence (for the notebooks pipeline) of not merging consecutive unknowns after a `<note type="speaker">` tag is an increase in the number of speeches. Furthermore, the document level is systematically different: for unknown speakers a document is a single utterance, while for known speakers documents are made up of a sequence of utterances. Is this behaviour by design, or an issue that can be corrected in future versions of the metadata?
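To make the effect on speech counts concrete, here is a toy comparison on synthetic XML (not taken from the corpus). It assumes the same boundary rule as above, namely that a new speech starts at each `<u>` without a `prev` attribute:

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def count_speeches(xml_string):
    """Count speeches, assuming each <u> without a prev attribute
    opens a new speech (hypothetical aggregation rule)."""
    root = ET.fromstring(xml_string)
    return sum(1 for u in root.iter(TEI_NS + "u") if u.get("prev") is None)


# Three utterances chained with prev/next: aggregated into one speech.
chained = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <note type="speaker">Okänd talare</note>
  <u xml:id="u-1" next="u-2"/>
  <u xml:id="u-2" prev="u-1" next="u-3"/>
  <u xml:id="u-3" prev="u-2"/>
</TEI>"""

# The same utterances without chaining: each one becomes its own speech.
unchained = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <note type="speaker">Okänd talare</note>
  <u xml:id="u-1"/>
  <u xml:id="u-2"/>
  <u xml:id="u-3"/>
</TEI>"""

print(count_speeches(chained))    # 1
print(count_speeches(unchained))  # 3
```

With real protocols containing long runs of unknown-speaker utterances, this is exactly how the speech count explodes.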