Open vincentheddesheimer opened 1 year ago
Hi!
I'm happy that you found the corpus. The R package is still in beta, since everyone else in the project works in Python. That said, I'm happy to get it to work. What do you want to extract from the corpus? Speeches in the parliament?
Understandable - sorry for my Python-ignorance!
Yes! I would love to get each speech into one row of a dataframe and the respective speech id, date, and the speaker id (to link with metadata file) in additional columns (as in the example table above).
Thanks!
In Python, you could use the pyriksdagen module (from pyriksdagen.utils import protocol_iterators, elem_iter) to iterate over each element in each protocol. The protocol docs are in parlaclarin-specified XML, so you'd have to decide which elements / element attributes you want to target and write a few conditions to extract the relevant stuff and put it in a data frame -- speeches are <u> and <seg> (utterance and segment) elements.
You can see one example of all these things in action in the scripts/KWIC-iter-search.py script. This is definitely not a minimal working example, but I think everything you would need is in that script.
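For illustration, the idea above (walk the XML, pick out the <u> and <seg> elements, collect one row per utterance) can be sketched with only the Python standard library. This is a rough sketch, not the pyriksdagen API: the sample XML is a made-up miniature protocol, the function name utterance_rows is hypothetical, and taking the date from a docDate element's "when" attribute is an assumption about the protocol layout. Note also that a single speech may be split over several <u> elements, so rows may need to be concatenated afterwards.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical miniature protocol, invented for this example only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <docDate when="1975-01-10">10 januari 1975</docDate>
    <note>RIKSDAGENS PROTOKOLL.</note>
    <u xml:id="i-aaa" who="Q2254365">
      <seg xml:id="i-aaa-1">Herr talman!</seg>
      <seg xml:id="i-aaa-2">Jag yrkar bifall.</seg>
    </u>
    <u xml:id="i-bbb" who="unknown">
      <seg xml:id="i-bbb-1">Tack.</seg>
    </u>
  </div></body></text>
</TEI>"""


def utterance_rows(xml_text):
    """Return one dict per <u> element: id, date, speaker id, text."""
    root = ET.fromstring(xml_text)

    # Assumption: the protocol date sits in the first <docDate when="...">.
    date_elem = root.find(f".//{{{TEI_NS}}}docDate")
    date = date_elem.get("when") if date_elem is not None else None

    rows = []
    for u in root.iter(f"{{{TEI_NS}}}u"):
        # Join the text of all <seg> children, normalising whitespace.
        segs = [
            " ".join("".join(seg.itertext()).split())
            for seg in u.iter(f"{{{TEI_NS}}}seg")
        ]
        rows.append(
            {
                "id": u.get(f"{{{XML_NS}}}id"),
                "date": date,
                "who": u.get("who"),
                "text": " ".join(segs),
            }
        )
    return rows


rows = utterance_rows(SAMPLE)
```

The resulting list of dicts can be passed to pandas.DataFrame(rows) to get one utterance per row, or written to CSV and loaded into R.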
Ok thanks a lot! I take from this that there is no easy way to do this in R?
Hi, I am one of Vincent's coauthors for this project. I reviewed the scripts/KWIC-iter-search.py script. From what I understand, the script generates a data set that includes N characters to the left and right of the search term. However, as Vincent mentioned earlier, we are more interested in extracting the complete speech ID-level data (or concatenated data from the same speaker) within a specific timeframe. I wonder if the examples/corpus-walkthrough.ipynb script might be more suitable for our purpose. Could you please confirm if I am on the right track?
Also, I have a couple of questions about the examples/corpus-walkthrough.ipynb script:
Is the "hash" variable equivalent to the speech ID (e.g., <u xml:id="i-GRCSGuiWNchTwGtpg41nnk"...>)?
I have played with a few examples so far, but the elem.get('n') argument in the script does not seem to return anything, so I would like to double-check.
Thank you for your assistance! I greatly appreciate it.
Hi! You've got the right idea about the KWIC-iter-search script. I didn't mean that it will do what you want it to, rather that all the functionality you need (iterating over protocols and XML elements, finding utterances, etc.) is in that script and you can look at it as an example to write your own.
Is the "hash" variable equivalent to the speech ID (e.g., <u xml:id="i-GRCSGuiWNchTwGtpg41nnk"...>)?
That 'hash' is a UUID for that particular element, i.e. the <u> tag (u stands for utterance), so in that sense it's an ID for that particular part of a speech.
I have played with a few examples so far but the elem.get('n') argument in the script does not seem to return anything, so I would like to double-check.
I think it's out of date -- predates my involvement here -- maybe not functional. You can look at the KWIC-iter or other scripts in the scripts/ directory for inspiration.
Can we utilize the aforementioned speech IDs for unknown speakers to match their party affiliations, using input/matching/unknowns.csv?
You should match speaker to party by the who attribute, e.g. <u who="Q2254365"> -- current party info is in corpus/metadata/party_affiliation.csv, where you find that same ID (from Wikidata) with the party info. Don't build on anything in the input folder -- this is like short-term scrap storage and will eventually be removed altogether.
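As a rough sketch of that matching step: the column names person_id and party below are assumptions, not the confirmed header of party_affiliation.csv, and the CSV content and party name are invented. Check the real file's header before adapting this. The helper names load_parties and add_party are likewise hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for corpus/metadata/party_affiliation.csv.
# The column names are assumptions; the row is invented.
PARTY_CSV = """person_id,party,start,end
Q2254365,Example Party,1970-01-01,1979-12-31
"""


def load_parties(csv_file):
    """Map each person id to the list of parties recorded for it."""
    parties = defaultdict(list)
    for row in csv.DictReader(csv_file):
        parties[row["person_id"]].append(row["party"])
    return parties


def add_party(rows, parties):
    """Return copies of rows with a 'party' key looked up via 'who'."""
    out = []
    for row in rows:
        known = parties.get(row["who"], [])
        # Speakers without a match (e.g. who="unknown") get None.
        out.append({**row, "party": "; ".join(known) if known else None})
    return out


parties = load_parties(io.StringIO(PARTY_CSV))
speeches = [{"who": "Q2254365"}, {"who": "unknown"}]
matched = add_party(speeches, parties)
```

With the real file, io.StringIO(PARTY_CSV) would be replaced by an open file handle. A person can have several party rows over time, so for a strict match the speech date would also have to be compared against the affiliation period, which this sketch does not do.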
A more up-to-date version of the notebook is available here: https://colab.research.google.com/drive/1C3e2gwi9z83ikXbYXNPfB6RF7spTgzxA?usp=sharing . It's linked in the Readme, but I'll also update the notebook in the examples folder to avoid confusion in the future!
Hi, great work!
I was wondering whether there is a way of loading protocols into R. I tried to use the read_parla_clarin_xml_file command but it did not read the notes correctly. When running unlist(x1$teiCorpus$TEI$text$body$div$note), only one note appears: "RIKSDAGENS PROTOKOLL." I am definitely more proficient in R so would appreciate help here, but I also can't seem to understand how I would do this in Python from reading your documentation.
Thanks a lot in advance!