Some disadvantages of re-running the curation script are that the hyperparameter search and the re-annotation of gold standards take time, and that any choice of XML schema is to some extent a matter of personal preference. So there are trade-offs either way.
After discussion at this morning's meeting: the researchers will run their queries on the corpus through functions written by us, so the XML schema may not be so problematic for them.
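For concreteness, a minimal sketch of what such a query function could look like; the function name, directory layout, and the `article`/`uuid` tag names are assumptions, not the actual interface:

```python
from pathlib import Path
import xml.etree.ElementTree as ET

def iter_articles(corpus_dir):
    """Yield (uuid, text) for every article element in the corpus.

    Hypothetical helper: if researchers go through functions like this,
    the underlying XML schema stays an internal detail of the corpus.
    """
    for xml_file in Path(corpus_dir).glob("**/*.xml"):
        root = ET.parse(xml_file).getroot()
        for article in root.iter("article"):  # tag name is an assumption
            words = " ".join(article.itertext()).split()
            yield article.get("uuid"), " ".join(words)
```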
Hi!
The BLM corpus currently has a number of issues relating to segmentation and schema. These can be fixed by running a newly written curation script, with the downside that this assigns new UUIDs to every element in the corpus, going against the 'working on data' philosophy.
Although this seems worrisome, upon inspecting all metadata files I found none that reference individual element tags. The only files currently containing individual tags inside each edition file are the unprocessed, manually written register.csv and toc.csv; they include a page column that points to the element from which the register or table-of-contents information was extracted. From my understanding, this information is redundant.
In the near future (issue #33) we will create a metadata file articles.csv that maps each article, identified by its element UUID, to toc.csv and page_headers. This file will therefore contain UUIDs of specific tags, and that matters: if we want to re-run the curation script (which assigns new UUIDs), it must be done before this metadata is extracted.
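To make the ordering constraint concrete, a toy sketch; the column names and values below are placeholders, not the final articles.csv format:

```python
import pandas as pd

# Hypothetical shape of the planned articles.csv; column names and
# values are placeholders, not the final format.
articles = pd.DataFrame({
    "article_uuid": ["b-1", "b-2"],            # element uuids from the corpus
    "toc_entry":    ["Entry one", "Entry two"],
})

corpus_uuids = {"b-1", "b-2"}  # stand-in for the uuids actually in the corpus xml

# Re-running the curation script regenerates every uuid, so a mapping
# extracted beforehand would stop resolving; hence run curation first.
assert articles["article_uuid"].isin(corpus_uuids).all()
```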
Some issues that I am hoping to solve by re-running the curation script include:
Lack of an XML schema. The current schema is hard to understand and inconsistent, which makes it hard for researchers to extract specific elements and hard to curate. The new curation script produces an easy-to-understand schema with a logical hierarchy.
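As an illustration of why a consistent hierarchy helps extraction, a minimal sketch; the edition/page/block nesting and all tag names are assumptions about what the new schema could look like, not its actual definition:

```python
import xml.etree.ElementTree as ET

sample = """
<edition uuid="ed-1">
  <page number="1">
    <block uuid="b-1"><text>First block of page one.</text></block>
    <block uuid="b-2"><text>Second block of page one.</text></block>
  </page>
</edition>
"""

root = ET.fromstring(sample)
# With a predictable hierarchy, one path expression finds every block.
for block in root.findall("./page/block"):
    print(block.get("uuid"), block.find("text").text)
```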
Moreover, after looking through the corpus, the segmentation accuracy of page_headers is not as high as reported in the README. In the newly written curation script, the segmentation will include a grid search over the hyper-parameters of the segmentation algorithms, both for headers and page_headers; this was missing in the first curation. Gold-standard datasets will be annotated and evaluated against.
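Roughly what I have in mind for the grid search, as a sketch; `segment_pages`, the parameter names, and the toy gold standard are stand-ins for the actual segmentation code and annotations:

```python
from itertools import product

def segment_pages(min_gap, header_height):
    """Stub segmenter; the real algorithm lives in the curation script."""
    return ["header" if header_height > 0.07 else "body"] * 4

gold_standard = ["header", "body", "body", "header"]  # toy annotation

param_grid = {
    "min_gap": [5, 10, 15],          # hypothetical gap threshold between blocks
    "header_height": [0.05, 0.10],   # hypothetical page fraction for headers
}

def accuracy(params):
    predicted = segment_pages(**params)
    return sum(p == g for p, g in zip(predicted, gold_standard)) / len(gold_standard)

# Exhaustively score every combination and keep the best one.
best = max(
    (dict(zip(param_grid, values)) for values in product(*param_grid.values())),
    key=accuracy,
)
print("best hyper-parameters:", best, "accuracy:", accuracy(best))
```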
Lastly, the current corpus does not preserve block-level information for each segmented block: blocks are merged together into larger elements. This is worrisome for merging the new OCR files, which operate on blocks, with the current corpus, and it may introduce new errors. If we instead handle this at the curation step, the merge becomes much easier.
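A minimal sketch of what preserving blocks could look like, assuming block children are kept inside each element; the tag names and the `ocr_text` lookup are hypothetical:

```python
import xml.etree.ElementTree as ET

# Keep blocks as child elements instead of merging their text away.
paragraph = ET.Element("p", uuid="p-1")
for block_uuid, text in [("b-1", "First block."), ("b-2", "Second block.")]:
    block = ET.SubElement(paragraph, "block", uuid=block_uuid)
    block.text = text

# New OCR output also arrives per block, so aligning it becomes a direct
# uuid lookup instead of a fuzzy match against merged text.
ocr_text = {"b-1": "First block,", "b-2": "Second block."}  # hypothetical OCR result
for block in paragraph.iter("block"):
    block.text = ocr_text.get(block.get("uuid"), block.text)
```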
What do you think? Is it worth it? @MansMeg @ninpnin