welfare-state-analytics / riksdagen-corpus

Swedish parliamentary proceedings - Riksdagens protokoll 1867-today


DEPRECATION NOTICE

The Swedish Parliament Corpus has moved to the SWERIK Project GitHub page. v0.14.0 was the last release under this repository. Please visit the new repo for the most up-to-date releases.

Swedish parliamentary proceedings, 1867-today, v0.14.0

Westac Project, 2020-2024 | Swerik Project, 2023-2025

The data set

The full data set consists of multiple parts:

Basic use

A full dataset is available under this download link. It has the following structure

The workflow to use the data is demonstrated in this Google Colab notebook.

Design choices of the project

The Riksdagen corpus is released iteratively: the corpus is continuously curated and expanded. Semantic versioning is used for the whole corpus, following the established major-minor-patch practices as they apply to data. For each major and minor release, a battery of unit tests is run, and a statistical sample is drawn, annotated and quantitatively evaluated to ensure the integrity and quality of the updated data. Errors are fixed as they are detected, in order of priority. Moreover, the edit history is kept as a traceable git repository.
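As a rough illustration of the major-minor-patch scheme, the sketch below classifies the step between two release tags, which would tell a maintainer whether the sampling-based evaluation described above applies. The helper names `parse_version` and `release_kind` are hypothetical and not part of the project's tooling; the tag format `vMAJOR.MINOR.PATCH` is assumed from the release names used in this README.

```python
def parse_version(tag):
    """Turn a tag such as 'v0.14.0' into a tuple of integers (0, 14, 0)."""
    major, minor, patch = (int(part) for part in tag.lstrip("v").split("."))
    return (major, minor, patch)


def release_kind(previous, current):
    """Return 'major', 'minor' or 'patch' for the step from previous to current."""
    old, new = parse_version(previous), parse_version(current)
    if new <= old:
        raise ValueError(f"{current} is not newer than {previous}")
    # The first component that differs determines the kind of release.
    for name, a, b in zip(("major", "minor", "patch"), old, new):
        if a != b:
            return name


kind_a = release_kind("v0.13.0", "v0.13.1")
kind_b = release_kind("v0.13.1", "v0.14.0")
print(kind_a, kind_b)
```

Under this reading, the step from v0.13.1 to v0.14.0 is a minor release and would trigger the full evaluation, while v0.13.0 to v0.13.1 is a patch.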

While the contents of the corpus will change due to curation and expansion, we aim to keep the deliverable API, the corpus/ folder, as stable as possible. This means we avoid relocating files or folders, changing formats, changing columns in metadata files, or any other changes that might break downstream scripts. Conversely, files outside the corpus/ folder are internal to the project. End users may find utility in them but we make no effort to keep them consistent.

The data in the corpus is delivered as TEI XML files to follow established practices. The metadata is delivered as CSV files, following a normal form database structure while allowing for a legible git history. A more detailed description of the data and metadata structure and formats can be found in the README files in the corpus/ folder.
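A minimal sketch of reading speeches from one TEI XML file with the Python standard library is shown below. The element layout (utterances as `u` elements with a `who` attribute, each holding `seg` children) is an assumption based on the Parla-Clarin convention, and the inline sample document is invented for illustration; consult the README files in the corpus/ folder for the actual structure.

```python
import xml.etree.ElementTree as ET

# The TEI namespace is standard; the element layout below is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample standing in for a real file from the corpus/ folder.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <note type="speaker">Herr EXAMPLE:</note>
    <u who="person-1"><seg>First segment.</seg><seg>Second segment.</seg></u>
    <u who="unknown"><seg>Another speech.</seg></u>
  </div></body></text>
</TEI>"""


def extract_utterances(xml_text):
    """Return a list of (speaker id, text) pairs, one per <u> element."""
    root = ET.fromstring(xml_text)
    result = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # Normalise whitespace inside each segment, then join the segments.
        segs = [" ".join((seg.text or "").split())
                for seg in u.iterfind("tei:seg", TEI_NS)]
        result.append((u.get("who"), " ".join(segs)))
    return result


utterances = extract_utterances(SAMPLE)
for who, text in utterances:
    print(who, "->", text)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`; the rest of the function would stay the same.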

Descriptive statistics at a glance

Currently, we have an extensive set of Parliamentary Records (Riksdagens Protokoll) from 1867 until now. We are in the process of preparing Motions for inclusion in the corpus and other document types will follow.

Metric                                  v0.14.0      v0.13.1      v0.13.0
Corpus size (GB)                        5.48         5.48         5.48
Number of parliamentary records         17642        17642        17642
Total parliamentary record pages*       1045458      1045458      1045458
Total parliamentary record speeches     1014214      1014214      1014214
Total parliamentary record words        442634322    442634322    442634322
Number of Motions                       0            0            0
Total motion pages                      0            0            0
Total motion words                      0            0            0
Number of people with MP role           5975         5975         5975
Number of people with minister role     546          546          546

* Digital original parliamentary records for some years in the 1990s are not paginated and thus do not contribute to the page count. See also the section "Number of Pages in Parliamentary Records".

Parliamentary Records over time

Number of Parliamentary Records

Number of Pages in Parliamentary Records

Number of Speeches in Parliamentary Records

Note: We are aware of an issue whereby speeches are overcounted in the data's current form for the years after 2014, and we are working on a fix. Until then, the following static graph is a better representation of the actual number of speeches in the Parliamentary Records for those years.

Static Number of Speeches

Number of Words in Parliamentary Records

Members of Parliament over time

Quality assessment

Speech-to-speaker mapping

For each release, we check how many speakers in the parliamentary records our algorithms identify.
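One simple way to summarise such a check is the share of speeches that carry an identified speaker. The sketch below is only an illustration of that idea, not the project's evaluation code: the function name `mapped_share` is hypothetical, and the use of `"unknown"` as the placeholder for unidentified speakers is an assumption.

```python
def mapped_share(speaker_ids, unknown_label="unknown"):
    """Return the fraction of speeches whose speaker id is set and not the placeholder."""
    total = len(speaker_ids)
    if total == 0:
        return 0.0
    # Missing ids (None or empty string) count as unmapped, like the placeholder.
    mapped = sum(1 for s in speaker_ids if s and s != unknown_label)
    return mapped / total


# Invented ids for four speeches: two mapped, one placeholder, one missing.
share = mapped_share(["person-1", "unknown", "person-2", None])
print(f"{share:.0%} of speeches have an identified speaker")
```

Note that this measures coverage only. Whether an identified speaker is the correct one is a separate question, which is what an accuracy estimate on an annotated sample would address.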

Estimate of mapping accuracy

Correct number of MPs over time

Ratio of MP to seats over time

Participate in the curation process

If you find errors in the corpus, you can submit corrections. The process is documented in the project wiki.

Acknowledgement of support


Last update: 2024-02-23, 15:37:59