All of the radio scripts came from the Generic Radio Workshop, which provides downloadable TXT files that include only the body of each script and exclude the metadata commonly shown at the top of the webpage. To build the corpus, I copied the metadata portion into the TXT files that I downloaded. While doing so, I noticed that the metadata almost always begins with "series", "show", "date", and "cast", in that order. For this reason, I wrote the Relax NG schema to enforce that order for the corresponding metadata elements. The layout was not fully consistent across files; the most troublesome case was "Murder in Casbath".
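The ordering constraint can also be checked outside the schema. The following is a minimal sketch in Python, not the project's actual tooling: it assumes the metadata elements are named after the labels above (`series`, `show`, `date`, `cast`) and sit directly under the root, and the root name `script`, the `body` element, and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed element names, mirroring the metadata labels described above.
EXPECTED_ORDER = ["series", "show", "date", "cast"]


def metadata_in_order(xml_text):
    """Return True if the root's first children are the metadata
    elements in the expected order."""
    root = ET.fromstring(xml_text)
    leading_tags = [child.tag for child in root][:len(EXPECTED_ORDER)]
    return leading_tags == EXPECTED_ORDER


# Hypothetical sample documents, not taken from the corpus.
good = ("<script><series>S</series><show>T</show>"
        "<date>D</date><cast>C</cast><body/></script>")
bad = ("<script><show>T</show><series>S</series>"
       "<date>D</date><cast>C</cast><body/></script>")

print(metadata_in_order(good))
print(metadata_in_order(bad))
```

A check like this is only a convenience for spotting files that need reordering; the Relax NG schema remains the place where the order is actually enforced.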

Known differences between the files include:

Discrepancies also occur within individual files: not all spoken lines have a speaker tag, and some speeches by a single speaker are laid out as separate lines. To handle these cases, I added a <lineGrp> element that wraps the split lines, each of which is tagged with a <line> element. The whole speech is still wrapped in <ln>, consistent with every other speech in the script, so that the scripts can be queried with relative ease.
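Because every speech sits inside <ln>, a query can treat grouped and ungrouped speeches the same way. The sketch below illustrates this under stated assumptions: the `speaker` attribute, the `script` root, and the sample text are hypothetical, and only the <ln>, <lineGrp>, and <line> element names come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the speaker attribute is an assumption about
# how speakers are recorded, not something confirmed by the corpus.
sample = """
<script>
  <ln speaker="ANNOUNCER">Good evening.</ln>
  <ln speaker="INSPECTOR">
    <lineGrp>
      <line>Nobody leaves this room.</line>
      <line>Not until I say so.</line>
    </lineGrp>
  </ln>
</script>
"""


def speeches(xml_text):
    """Return (speaker, text) pairs, one per <ln>, joining any split
    <line> children into a single whitespace-normalized string."""
    root = ET.fromstring(xml_text)
    result = []
    for ln in root.iter("ln"):
        # itertext() walks all nested text, so <lineGrp>/<line>
        # wrappers need no special handling.
        text = " ".join(" ".join(ln.itertext()).split())
        result.append((ln.get("speaker"), text))
    return result


for speaker, text in speeches(sample):
    print(speaker, "|", text)
```

The point of the design is visible in the loop: the query iterates over <ln> only and never needs to know whether a speech was split in the source file.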

To regularize the scripts so that they can be queried in the future, and after double-checking that the changes would not affect the analysis, I reordered the metadata elements so that they appear in the same order throughout the corpus. I also removed the SOUND/MUSIC labels from the script text itself, since the <sound> and <music> element tags serve the same purpose and regularize the markup.
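The label removal can be sketched as a small text transformation. This is an illustration only: it assumes the source cues look like `SOUND: ...` or `MUSIC: ...` at the start of a line, which may not match every file, and the function name `cue_to_element` and the sample cues are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Assumed cue format: a SOUND or MUSIC label followed by a colon.
LABEL = re.compile(r"^\s*(SOUND|MUSIC)\s*:\s*", re.IGNORECASE)


def cue_to_element(raw_line):
    """Turn a labelled cue into a <sound> or <music> element string,
    dropping the redundant label. Return None for non-cue lines."""
    match = LABEL.match(raw_line)
    if match is None:
        return None
    tag = match.group(1).lower()
    body = raw_line[match.end():].strip()
    # Escape &, <, and > so the cue text stays well-formed XML.
    return f"<{tag}>{escape(body)}</{tag}>"


print(cue_to_element("SOUND: Door slams"))
print(cue_to_element("MUSIC: Theme up & out"))
print(cue_to_element("JOHN: Hello"))
```

Lines that do not start with one of the two labels are left for other processing, so spoken lines are never converted by mistake.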

