welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today

Work with corpus in R / Create dataframe #315

Open vincentheddesheimer opened 1 year ago

vincentheddesheimer commented 1 year ago

Hi, great work!

I was wondering whether there is a way of loading protocols into R. I tried the read_parla_clarin_xml_file function, but it did not read the notes correctly.

read_parla_clarin_xml_file <- function(x, ...){
  checkmate::assert_file_exists(x)
  pc <- xml2::read_xml(x, ...)
  pc <- xml2::as_list(pc)
  pc
}
x1 <- read_parla_clarin_xml_file(x = "1946/prot-1946--ak--1.xml")

When running unlist(x1$teiCorpus$TEI$text$body$div$note), only one note appears: "RIKSDAGENS PROTOKOLL."

Basically, the end goal would be to read specific files (or all files in one folder) and construct a dataframe that looks like this:

SpeechID          SpeakerID  Text  Date
prot-1946--ak--1  Q6201372   ...   1946-01-10
...               ...        ...   ...

I am definitely more proficient in R, so I would appreciate help there, but I also can't work out from your documentation how I would do this in Python.

Thanks a lot in advance!

MansMeg commented 1 year ago

Hi!

I'm happy that you found the corpus. The R package is still in beta, since everyone else in the project works in Python. That said, I'm happy to get it to work. What do you want to extract from the corpus? Speeches in the parliament?

vincentheddesheimer commented 1 year ago

Understandable - sorry for my Python-ignorance!

Yes! I would love to get each speech into one row of a dataframe and the respective speech id, date, and the speaker id (to link with metadata file) in additional columns (as in the example table above).

Thanks!

BobBorges commented 1 year ago

In Python, you could use the pyriksdagen module (from pyriksdagen.utils import protocol_iterators, elem_iter) to iterate over each element in each protocol. The protocol documents are Parla-CLARIN XML, so you'd have to decide which elements and element attributes you want to target, write a few conditions to extract the relevant content, and put it in a data frame. Speeches are in <u> and <seg> (utterance and segment) elements.

You can see one example of all these things in action in the scripts/KWIC-iter-search.py script. This is definitely not a minimal working example, but I think everything you would need is in that script.
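To make the idea of targeting <u> and <seg> elements concrete, here is a minimal sketch that uses only Python's standard library rather than pyriksdagen. The sample XML is a hypothetical, heavily shortened protocol, the speech_rows helper is an illustrative name, and the protocol ID and date are passed in by hand because this sketch does not assume where the real files store them.

```python
import xml.etree.ElementTree as ET

# Namespaced names as ElementTree stores them after parsing.
TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, heavily shortened protocol used only for illustration.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><text><body><div>
    <note>RIKSDAGENS PROTOKOLL.</note>
    <u who="Q6201372" xml:id="i-AAA">
      <seg xml:id="i-AAA-1">Herr talman!</seg>
      <seg xml:id="i-AAA-2">Jag yrkar
        bifall.</seg>
    </u>
    <u who="unknown" xml:id="i-BBB">
      <seg xml:id="i-BBB-1">Tack.</seg>
    </u>
  </div></body></text></TEI>
</teiCorpus>"""


def speech_rows(root, protocol_id, date):
    """Return one dict per <u> element, joining the text of its <seg> children."""
    rows = []
    for u in root.iter(f"{TEI}u"):
        # Collapse line breaks and indentation inside each segment.
        segments = [" ".join("".join(seg.itertext()).split())
                    for seg in u.iter(f"{TEI}seg")]
        rows.append({
            "protocol": protocol_id,
            "speech_id": u.get(XML_ID),
            "speaker_id": u.get("who"),
            "text": " ".join(segments),
            "date": date,  # supplied by the caller in this sketch
        })
    return rows


rows = speech_rows(ET.fromstring(SAMPLE), "prot-1946--ak--1", "1946-01-10")
for row in rows:
    print(row)
```

A list of dicts like this can be handed to pandas.DataFrame(rows) to get one row per utterance. For real files, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE).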

vincentheddesheimer commented 1 year ago

Ok thanks a lot! I take from this that there is no easy way to do this in R?

ahrawu commented 1 year ago

Hi, I am one of Vincent's coauthors for this project. I reviewed the scripts/KWIC-iter-search.py script. From what I understand, the script generates a data set that includes N characters to the left and right of the search term. However, as Vincent mentioned earlier, we are more interested in extracting the complete speech ID-level data (or concatenated data from the same speaker) within a specific timeframe. I wonder if the examples/corpus-walkthrough.ipynb script might be more suitable for our purpose. Could you please confirm if I am on the right track?

Also, I have a couple of questions about the examples/corpus-walkthrough.ipynb script:

  1. Is the "hash" variable equivalent to the speech ID (e.g., <u xml:id="i-GRCSGuiWNchTwGtpg41nnk"...>)? I have played with a few examples so far but the elem.get('n') argument in the script does not seem to return anything, so I would like to double-check.
  2. If my assumption in question 1 is correct, can we utilize the aforementioned speech IDs for unknown speakers to match their party affiliations, using input/matching/unknowns.csv?

Thank you for your assistance! I greatly appreciate it.

BobBorges commented 1 year ago

Hi! You've got the right idea about the KWIC-iter-search script. I didn't mean that it will do what you want it to, rather that all the functionality you need (iterating over protocols and XML elements, finding utterances, etc.) is in that script and you can look at it as an example to write your own.

> Is the "hash" variable equivalent to the speech ID (e.g., <u xml:id="i-GRCSGuiWNchTwGtpg41nnk"...>)?

That 'hash' is a UUID for that particular element, i.e. the <u> tag (u stands for utterance), so in that sense it is an ID for that particular part of a speech.

> I have played with a few examples so far but the elem.get('n') argument in the script does not seem to return anything, so I would like to double-check.

I think the notebook is out of date (it predates my involvement here) and may not be functional. You can look at the KWIC-iter script or other scripts in the scripts/ directory for inspiration.
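For reading the ID itself, a small sketch with the standard library's ElementTree may help. It assumes an utterance shaped like the one quoted in the question, with an xml:id attribute and no n attribute. Note that ElementTree stores xml:id under its expanded namespace name, so a lookup with the literal string "xml:id" also returns None.

```python
import xml.etree.ElementTree as ET

# Expanded name under which ElementTree stores the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical utterance modelled on the element quoted in the question.
u = ET.fromstring('<u xml:id="i-GRCSGuiWNchTwGtpg41nnk" who="Q2254365">...</u>')

missing = u.get("n")       # None: this element has no "n" attribute
plain = u.get("xml:id")    # None: the prefixed form is not the stored key
speech_id = u.get(XML_ID)  # the identifier carried by xml:id

print(missing, plain, speech_id)
```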

> can we utilize the aforementioned speech IDs for unknown speakers to match their party affiliations, using input/matching/unknowns.csv?

You should match speaker to party by the who attribute, e.g. <u who="Q2254365">. Current party info is in corpus/metadata/party_affiliation.csv, where you will find that same ID (from Wikidata) alongside the party info. Don't build on anything in the input folder: it is short-term scrap storage and will eventually be removed altogether.
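The matching step could look roughly like the sketch below. The inline CSV is a hypothetical stand-in for corpus/metadata/party_affiliation.csv, and the column names wiki_id and party, the party labels, and the load_party_lookup helper are all assumptions made for illustration, so check the real file's header before adapting it.

```python
import csv
import io

# Hypothetical stand-in for the party affiliation file; the column names
# "wiki_id" and "party" are assumptions, not confirmed against the corpus.
PARTY_CSV = """wiki_id,party
Q2254365,Example Party A
Q6201372,Example Party B
Q6201372,Example Party C
"""


def load_party_lookup(handle, id_column="wiki_id", party_column="party"):
    """Map each speaker ID to the list of parties recorded for it."""
    lookup = {}
    for record in csv.DictReader(handle):
        lookup.setdefault(record[id_column], []).append(record[party_column])
    return lookup


# Speech rows keyed by the value of the who attribute (illustrative data).
speeches = [
    {"speech_id": "i-AAA", "speaker_id": "Q2254365"},
    {"speech_id": "i-BBB", "speaker_id": "Q6201372"},
    {"speech_id": "i-CCC", "speaker_id": "unknown"},
]

lookup = load_party_lookup(io.StringIO(PARTY_CSV))
for speech in speeches:
    # Speakers without a match get an empty list rather than raising an error.
    speech["parties"] = lookup.get(speech["speaker_id"], [])
    print(speech)
```

Keeping a list per speaker is a deliberate choice in this sketch, since one person may have more than one affiliation on record. With a real file, open(path, newline="", encoding="utf-8") would replace the io.StringIO wrapper.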

ninpnin commented 1 year ago

A more up-to-date version of the notebook is available here: https://colab.research.google.com/drive/1C3e2gwi9z83ikXbYXNPfB6RF7spTgzxA?usp=sharing . It's linked in the Readme, but I'll also update the notebook in the examples folder to avoid confusion in the future!