liamtabib opened 1 year ago
Yes, that sounds like a good idea. I have been thinking about how to store this information. We have unique IDs for individual articles, right? Then we should probably create a CSV file mapping these IDs to whether they are reviews or not. Like:
article_id,table_of_content_id
article_id,page_header
article_id, register_id
where table_of_content_id is a mapping to the TOC files?
Then we could extract the book reviews from there. Is this reasonable, or do you have a better idea? @ninpnin, any takes?
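A minimal sketch of what writing such a mapping file could look like. Everything here is illustrative: the function name, the input dict, and the IDs are all hypothetical, assuming we already have some match between article IDs and rows in toc.csv.

```python
import csv

def write_article_mapping(matches, out_path="article_to_toc.csv"):
    """Write an article_id -> table_of_content_id mapping as CSV.

    `matches` is a dict {article_id: table_of_content_id or None};
    unmatched articles get an empty second column.
    """
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["article_id", "table_of_content_id"])
        for article_id, toc_id in sorted(matches.items()):
            writer.writerow([article_id, "" if toc_id is None else toc_id])

# Hypothetical example data: one matched article, one unmatched.
write_article_mapping({"a-001": "toc-17", "a-002": None})
```

Keeping unmatched articles in the file (with an empty second column) makes the gaps visible rather than silently dropping them.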
I think that sounds good. One could also mark that inside the files by adding a 'type' attribute to the 'article' element.
Yes. I think we might want to do this long term. Although I think this is a second step. Simply because it is not clear how to categorize article types.
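For the longer term, the attribute idea above could be sketched like this. This is only a sketch assuming the corpus files are XML with `<article id="...">` elements; the element and attribute names are taken from the discussion, everything else is hypothetical.

```python
import xml.etree.ElementTree as ET

def mark_reviews(xml_text, review_ids):
    """Set type="review" on every <article> whose id is in review_ids."""
    root = ET.fromstring(xml_text)
    for article in root.iter("article"):
        if article.get("id") in review_ids:
            article.set("type", "review")
    return ET.tostring(root, encoding="unicode")

# Hypothetical two-article edition; only a-001 is a known review.
sample = '<edition><article id="a-001" /><article id="a-002" /></edition>'
out = mark_reviews(sample, {"a-001"})
```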
The three metadata files above have been created, i.e. mappings between articles and each of toc.csv, register.csv, and all page_headers tags inside the corpus. What is the next step? It may be good to work on the topic modelling pipeline for some time instead of BLM.
So the next step here is to
Just quick thoughts - but at the demo Friday we will know more. Good work!
Here are visualisations for the mappings
As one can see, the register file has gaps. I will check the workflow to make sure no mistakes were made on my part. Otherwise, it should be remembered that the segmentation algorithm used is by no means the gold standard, and one should find a way to combine the information from these sources into an accurate article segmentation.
One thing noticed in the procedure was that some articles in our corpus span far too many pages, which explains the many-to-one mapping in the metadata.
Here is the file which is the source for the table above: article_pointers.csv
There are two sources of error that I have spotted: actual headers mistaken for page headers, and articles that are too short. Fixing the first issue will add more articles and, in the process, shorten articles that are too long.
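The second error source could be screened for mechanically. A minimal sketch, assuming articles are dicts with `id` and `text` fields and a hypothetical word-count threshold (the threshold value is made up for illustration):

```python
def flag_short_articles(articles, min_words=20):
    """Return ids of articles whose text falls under `min_words` words."""
    return [a["id"] for a in articles if len(a["text"].split()) < min_words]

# Hypothetical data: one suspiciously short article, one normal one.
articles = [
    {"id": "a-001", "text": "word " * 5},
    {"id": "a-002", "text": "word " * 50},
]
short = flag_short_articles(articles)
```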
I'm not sure I follow this fully. What is the y-axis?
Also, can you explain the table? I don't follow it. What are the frequencies, and what does n=0 mean?
The first plot is the cumulative sum of the second plot; the y-axis indicates the number of articles per file/segmentation for each edition, sorted by date. For the segmentation algorithm, I just counted how many article tags we have for each edition. For the table of contents/register metadata files, I iterated over each edition, filtered each file on rows matching that edition, and counted the number of rows remaining. Each row in a metadata file is assumed to be a unique article, so the number of rows pointing to an edition is the number of articles in that edition.
It is simply a way to visualise how much information each file gives. Is this what you were asking for?
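The per-edition counting and the cumulative curve described above can be sketched in a few lines. The row data here is hypothetical; only the counting logic follows the description.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical metadata rows, each carrying the edition it points to.
rows = [
    {"edition": "1840-01-04"}, {"edition": "1840-01-04"},
    {"edition": "1840-01-11"}, {"edition": "1840-01-18"},
]

per_edition = Counter(r["edition"] for r in rows)
editions = sorted(per_edition)                  # sorted by date
counts = [per_edition[e] for e in editions]     # second plot: articles per edition
cumulative = list(accumulate(counts))           # first plot: running total
```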
The table stems from the mapping between our segmented articles and the files.
So if you recall, we have created the mapping CSVs, where we tried to map each row inside the table of contents or register to an article in the corpus. This mapping has a many-to-many relation (as our segmentation algorithm has faults; ideally it should be one-to-one, and this is what we aim to achieve). Therefore I iterated over each article tag in the corpus and looked at how many rows inside the table of contents/register point to that article. For most articles in the corpus, 0 rows point to them, which is not good. Again, the ideal here is to have exactly 1 row point to each article. Is this clear now?
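A small sketch of how the frequency table (the n=0, n=1, ... buckets) would come out of such a mapping. The article and register IDs below are hypothetical; only the counting step matches the description above.

```python
from collections import Counter

# Hypothetical corpus articles and mapping rows (article_id, register_id).
corpus_articles = ["a-001", "a-002", "a-003"]
mapping_rows = [("a-001", "r-10"), ("a-001", "r-11"), ("a-003", "r-12")]

# How many register rows point to each article (0 for unmapped articles).
pointers = Counter(article_id for article_id, _ in mapping_rows)
n_per_article = {a: pointers.get(a, 0) for a in corpus_articles}

# Frequency table: how many articles have n = 0, 1, 2, ... pointing rows.
freq = Counter(n_per_article.values())
```

Here n=0 means "no row in the register points to this article", and the ideal distribution would put every article in the n=1 bucket.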
@ninpnin Feel free to give input
There seem to be page headers 'Bokrecensioner' on all pages where a review appears. Tests will be run to check this. If it holds, one can use this as a heuristic to mark all articles on those pages as type 'review'.
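The heuristic could look roughly like this. A sketch under assumed structures: each page is represented as a dict with its header strings and the IDs of the articles on it (both fields hypothetical); only the 'Bokrecensioner' header string comes from the discussion.

```python
def review_article_ids(pages):
    """Ids of articles on pages whose headers mention 'Bokrecensioner'."""
    ids = set()
    for page in pages:
        if any("Bokrecensioner" in h for h in page["headers"]):
            ids.update(page["articles"])
    return ids

# Hypothetical pages: one review page, one news page.
pages = [
    {"headers": ["Bokrecensioner"], "articles": ["a-001", "a-002"]},
    {"headers": ["Nyheter"], "articles": ["a-003"]},
]
reviews = review_article_ids(pages)
```

The resulting set could then feed straight into the article_id-to-review mapping CSV proposed earlier in the thread.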