wdjacca / SherlockHolmesRadioScripts

https://radioholmes.newtfire.org/

Editorial Methodology #6

Open wdjacca opened 2 years ago

wdjacca commented 2 years ago

All of the radio scripts came from the Generic Radio Workshop, which provides a downloadable TXT file for each script. The TXT file includes only the body of the script and excludes the metadata that is commonly shown at the top of the webpage. To generate the corpus, I copied the metadata portion into the TXT files that I downloaded. While doing so, I noticed that the metadata almost always starts with "series", "show", "date" and "cast". For this reason, I made the Relax NG schema restrict the order of the corresponding metadata elements (series, show, date, and cast). There were some discrepancies in layout between files, the most troubling being in "Murder in Casbah".
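The ordering constraint can also be checked outside the schema. The following is a minimal Python sketch, not the project's Relax NG code. The element names `series`, `show`, `date`, and `cast` and the `metadata` wrapper are assumptions based on the metadata labels above, and the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed element names, taken from the metadata labels; the real schema may differ.
EXPECTED_ORDER = ["series", "show", "date", "cast"]

def metadata_in_order(xml_text, wrapper="metadata"):
    """Return True if the wrapper's children are exactly the expected elements, in order."""
    root = ET.fromstring(xml_text)
    # The wrapper element name is hypothetical; look for it at the root or below.
    meta = root if root.tag == wrapper else root.find(f".//{wrapper}")
    if meta is None:
        return False
    return [child.tag for child in meta] == EXPECTED_ORDER

# Invented sample documents: one in the expected order, one with two elements swapped.
good = "<script><metadata><series/><show/><date/><cast/></metadata><body/></script>"
bad = "<script><metadata><show/><series/><date/><cast/></metadata><body/></script>"
print(metadata_in_order(good))  # True
print(metadata_in_order(bad))   # False
```

A check like this only flags files that deviate; the schema remains the place where the order is actually enforced.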

Known differences between the files include:

Discrepancies also occur within individual files: not all spoken lines have a speaker tag, and some lines spoken by the same speaker are depicted as separate lines. To handle these, I added a <lineGrp> element that wraps around the split lines, which are tagged with <line> elements. The overall speech is wrapped in <ln>, consistent with all other speeches in the script, so that the scripts can be queried with relative ease.
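As a rough illustration of the resulting structure, here is a short Python sketch that builds one speech. The element names `ln`, `lineGrp`, and `line` come from the description above; the `speaker` attribute, the function name, and the sample dialogue are hypothetical and may not match the actual markup.

```python
import xml.etree.ElementTree as ET

def build_speech(speaker, parts):
    """Wrap one speech in <ln>; group split lines under <lineGrp> as <line> children."""
    ln = ET.Element("ln", {"speaker": speaker})  # "speaker" attribute is hypothetical
    if len(parts) == 1:
        # A speech that was not split needs no grouping element.
        ln.text = parts[0]
    else:
        group = ET.SubElement(ln, "lineGrp")
        for part in parts:
            ET.SubElement(group, "line").text = part
    return ln

# Invented sample dialogue, split across two lines in the source text.
speech = build_speech("HOLMES", ["Come, Watson.", "The game is afoot."])
print(ET.tostring(speech, encoding="unicode"))
```

Because every speech ends up inside an `ln` element either way, a query can select all speeches with one path and descend into `lineGrp` only where it exists.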

To regularize the scripts so that queries can be performed on them in the future, and after double-checking that the changes would not affect the analysis of the scripts, I moved the metadata elements into the same order throughout the corpus. I also removed the SOUND/MUSIC tags from the script text itself, since the <sound> and <music> element tags serve the same purpose and can be regularized.
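The SOUND/MUSIC cleanup could be sketched roughly as follows in Python. This assumes that a cue appears in the plain text as a line beginning with `SOUND:` or `MUSIC:`, which may not match every file; the function name and the sample cue are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed cue format: a line starting with "SOUND:" or "MUSIC:"; real files may vary.
CUE = re.compile(r"^(SOUND|MUSIC)\s*:\s*(.*)$")

def cue_to_element(line):
    """Return a <sound> or <music> element without the redundant label, or None."""
    match = CUE.match(line.strip())
    if match is None:
        return None  # not a sound or music cue
    element = ET.Element(match.group(1).lower())
    element.text = match.group(2)
    return element

# Invented sample cue.
cue = cue_to_element("MUSIC: Theme up and under")
print(ET.tostring(cue, encoding="unicode"))
```

The element name carries the information that the label used to carry, so dropping the label from the text loses nothing for later queries.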

wdjacca commented 2 years ago

Rewriting to publish on website

The radio scripts were sourced from the Generic Radio Workshop, which offers downloadable plain text (.txt) files that include the body of each script. For the purpose of this research, the metadata displayed at the top of each webpage, which is excluded from the plain text files, was also added to the files. The metadata followed a regular pattern of "series", "show", "date", and "cast", which led to the decision to have the Relax NG schema strictly enforce this order. There were some discrepancies within the files, most notably in "Murder in Casbah".

There were some major structural differences among the files in the radio script corpus:

The content of the files in the corpus had to be regularized so that queries could be performed on them easily in the future, while ensuring that no change would affect the contextual or structural analysis of the project.

The following changes to the corpus were made: