welfare-state-analytics / blm-corpus

Code and issues related to Bonniers litterära magasin at KB lab

Segment reviews #33

Open liamtabib opened 1 year ago

liamtabib commented 1 year ago

There seem to be page headers 'Bokrecensioner' on all pages where a review appears. Tests will be run to check this. If this holds, one can use it as a heuristic to signal that all articles on those pages should be of type 'review'.
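A minimal sketch of that heuristic. The per-page structure here (page number, header text, article ids) is made up for illustration; the real corpus format may differ:

```python
# Hypothetical sketch: collect ids of articles that appear on pages
# whose page header contains the review marker 'Bokrecensioner'.
def find_review_articles(pages, marker="Bokrecensioner"):
    """pages: iterable of (page_number, header_text, article_ids)."""
    review_ids = set()
    for page_number, header_text, article_ids in pages:
        if header_text and marker in header_text:
            review_ids.update(article_ids)
    return review_ids

# Toy data, not from the actual corpus:
pages = [
    (12, "Bokrecensioner", ["a1", "a2"]),
    (13, "Bokrecensioner", ["a2", "a3"]),
    (14, "Dikter", ["a4"]),
]
print(sorted(find_review_articles(pages)))  # ['a1', 'a2', 'a3']
```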

MansMeg commented 1 year ago

Yes. That sounds like a good idea. I have been thinking about how to store this information. We have unique IDs for individual articles, right? Then we should probably create a CSV file mapping these IDs to whether they are reviews or not. Like:

article_id,table_of_content_id
article_id,page_header
article_id,register_id

where table_of_content_id is a mapping to the TOC-files?

Then we could extract the book reviews from there? Is this reasonable or do you have a better idea? @ninpnin , any takes?
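The proposed mapping file could be written with the standard csv module. The ids and column names below are assumptions for illustration, not actual corpus identifiers:

```python
import csv

# Hypothetical rows mapping article ids to table-of-contents entries;
# both id schemes are made up here.
rows = [
    ("blm_1950_01_a001", "toc_1950_01_r003"),
    ("blm_1950_01_a002", "toc_1950_01_r004"),
]

# newline="" is required so the csv module controls line endings itself.
with open("toc.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["article_id", "table_of_content_id"])
    writer.writerows(rows)
```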

liamtabib commented 1 year ago

I think that sounds good. One could also mark this inside the files themselves by adding a 'type' attribute to the 'article' element.
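Marking the files in place could look something like this, assuming a simple `<edition><article id="..."/>` layout (the real schema may well differ):

```python
import xml.etree.ElementTree as ET

# Toy edition XML; the actual corpus markup is likely richer than this.
xml = '<edition><article id="a1"/><article id="a2"/></edition>'
root = ET.fromstring(xml)

# Ids of articles identified as reviews (made up for the example).
review_ids = {"a2"}

# Set type="review" on matching article elements.
for article in root.iter("article"):
    if article.get("id") in review_ids:
        article.set("type", "review")

print(ET.tostring(root, encoding="unicode"))
```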

MansMeg commented 1 year ago

Yes. I think we might want to do this long term, although I think it is a second step, simply because it is not clear how to categorize article types.

liamtabib commented 1 year ago

The above three metadata files are created, i.e. a mapping between articles and each of toc.csv, register.csv, and the page_header tags inside the corpus. What is the next step? It may be good to work on the topic modelling pipeline for some time instead of BLM.

MansMeg commented 1 year ago

So the next step here is to

  1. add these files to the BLM corpus (and then we will get a demo on Friday)
  2. check the quality of this mapping (i.e. how large a proportion of articles is covered by each file) - e.g. plot a figure with time on the x-axis and the percentage of articles covered by each file
  3. (but I think we should do this after the demo) map the categories in these files to whether they are defined as reviews or not by Alexandra. That would require you to list the unique categories in these files.

Just quick thoughts - but at the demo on Friday we will know more. Good work!
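The coverage check in step 2 could be sketched like this. The edition ids and counts below are invented; the real numbers would come from the corpus and the metadata files:

```python
from collections import Counter

# Hypothetical input: number of segmented articles per edition,
# and one metadata row per article pointing at an edition.
corpus_counts = {"1950:1": 40, "1950:2": 38}
toc_rows = ["1950:1"] * 30 + ["1950:2"] * 38

toc_counts = Counter(toc_rows)

# Percentage of segmented articles covered by the metadata file,
# per edition; this is what would go on the y-axis of the plot.
coverage = {
    edition: 100 * toc_counts.get(edition, 0) / n_articles
    for edition, n_articles in corpus_counts.items()
}
print(coverage)  # {'1950:1': 75.0, '1950:2': 100.0}
```

Plotting `coverage` against edition date (e.g. with matplotlib) would give the time-vs-percentage figure suggested above.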

liamtabib commented 1 year ago

Here are visualisations for the mappings (pointer-1):

[Plots: total_articles_number, articles_number_edition]

As one can see, the register file has gaps. I will check the workflow to make sure no mistakes were made on my part. Otherwise, it should be remembered that the segmentation algorithm used is by no means a gold standard, and one should find a way to combine the information from these sources into an accurate article segmentation.

One thing noticed in the procedure was that some articles in our corpus span far too many pages, which explains the many-to-one mapping in the metadata.

liamtabib commented 1 year ago

Here is the file that is the source for the table above: article_pointers.csv

There are two sources of error that I have spotted: actual headings mistaken for page headers, and articles that are too short. Fixing the first issue will add more articles and, in the process, shorten articles that are too long.

MansMeg commented 1 year ago

I'm not sure I follow this fully. What is the y-axis?

Also, can you explain the table? I don't follow it. What are the frequencies, and what does n=0 mean?

liamtabib commented 1 year ago

The first plot is the cumulative sum of the second plot; the y-axis indicates the number of articles per edition for each file/segmentation, sorted by date. For the segmentation algorithm, I just counted how many article tags we have in each edition. For the table-of-contents/register metadata files, I iterated over each edition, filtered each file on rows matching that edition, and counted the number of remaining rows. Each row in a metadata file is assumed to be a unique article, so the number of rows pointing to an edition is the number of articles in that edition.

It is simply a way to visualise how much information each file gives. Is this what you were asking for?

The table stems from the mapping of our segmented articles with the files.

So, if you recall, we have created the mapping CSVs, where we tried to map each row inside the table of contents or register to an article in the corpus. This mapping has a many-to-many relation (since our segmentation algorithm has faults; ideally it should be one-to-one, and this is what we aim to achieve). Therefore, I iterated over each article tag in the corpus and looked at how many rows inside the table of contents/register point to that article. For most articles in the corpus, 0 rows point to them, which is not good. Again, the ideal here is to have exactly 1 row point to each article. Is this clear now?
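The frequency table can be reproduced with a small sketch. The article and row ids here are made up; the idea is just to count, for each corpus article, how many metadata rows point at it (n=0 means no row maps to that article):

```python
from collections import Counter

# Toy data: four segmented articles, three metadata rows, each row
# pointing at one article id.
corpus_articles = ["a1", "a2", "a3", "a4"]
mapping_rows = [("r1", "a1"), ("r2", "a1"), ("r3", "a3")]

# How many rows point at each article.
pointers = Counter(article for _, article in mapping_rows)

# n per article, including articles no row points at (n=0).
n_per_article = [pointers.get(a, 0) for a in corpus_articles]

# The frequency table: how many articles have n = 0, 1, 2, ... rows.
frequencies = Counter(n_per_article)
print(dict(frequencies))  # {2: 1, 0: 2, 1: 1}
```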

@ninpnin Feel free to give input