Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Nikola Ljubešić, Kiril Simov, Andrej Pančur, Michał Rudolf, Matyáš Kopp, Starkaður Barkarson, Steinþór Steingrímsson, Çağrı Çöltekin, Jesse de Does, Katrien Depuydt, Tommaso Agnoloni, Giulia Venturi, María Calzada Pérez, Luciana D de Macedo, Costanza Navarretta, Giancarlo Luxardo, Matthew Coole, Paul Rayson, Vaidas Morkevičius, Tomas Krilavičius, Roberts Darģis, Orsolya Ring, Ruben van Heusden, Maarten Marx, Darja Fišer.
Abstract
This paper presents the ParlaMint corpora, which contain transcriptions of sessions of 17 European national parliaments, totalling half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project's GitHub repository, and the complete corpora are openly available for download via the CLARIN.SI repository, as well as through the NoSketch Engine and KonText concordancers and the Parlameter interface for on-line exploration and analysis.
Keywords: Comparable corpora; Parliamentary proceedings; TEI
Year: 2022 PMID: 35125984 PMCID: PMC8807381 DOI: 10.1007/s10579-021-09574-0
Source DB: PubMed Journal: Lang Resour Eval ISSN: 1574-020X Impact factor: 1.358
Basic information about the ParlaMint corpora including the corpus ID, the covered language(s), the houses and number of terms included, from and to months of included transcriptions, the number of years covered, the number of millions of words per year and in total
| ID | Lang | Houses | Ts | From | To | Yrs | Mw/Yr | Mw |
|---|---|---|---|---|---|---|---|---|
| BE | nl+fr | Lower | 2 | 2015–11 | 2020–08 | 4.8 | 6.50 | 31.37 |
| BG | bg | Unicameral | 2 | 2014–10 | 2020–07 | 5.8 | 3.42 | 20.02 |
| CZ | cs | Lower | 2 | 2013–11 | 2021–04 | 7.5 | 3.03 | 22.56 |
| DK | da | Unicameral | – | 2014–10 | 2020–09 | 6.1 | 4.85 | 29.40 |
| ES | es | Lower | 5 | 2015–01 | 2020–12 | 6.0 | 2.19 | 13.10 |
| FR | fr | Lower | 1 | 2017–07 | 2020–07 | 3.0 | 10.75 | 32.73 |
| GB | en | Lower+Upper | 4 | 2015–01 | 2021–03 | 6.3 | 17.25 | 109.30 |
| HR | hr | Unicameral | 1 | 2016–11 | 2020–05 | 3.6 | 5.81 | 20.65 |
| HU | hu | Unicameral | 2 | 2014–05 | 2020–12 | 6.7 | 0.13 | 0.87 |
| IS | is | Unicameral | 3 | 2015–01 | 2020–09 | 5.8 | 4.06 | 23.66 |
| IT | it | Upper | 2 | 2013–03 | 2020–11 | 7.8 | 3.46 | 26.94 |
| LT | lt | Unicameral | 2 | 2012–11 | 2020–11 | 8.1 | 1.82 | 14.78 |
| LV | lv | Unicameral | 2 | 2014–11 | 2021–02 | 6.3 | 1.02 | 6.48 |
| NL | nl | Lower+Upper | 5 | 2014–04 | 2020–11 | 6.6 | 7.74 | 51.45 |
| PL | pl | Lower+Upper | 4 | 2015–11 | 2020–08 | 4.9 | 5.66 | 27.45 |
| SI | sl | Lower | 2 | 2014–08 | 2020–07 | 6.0 | 3.34 | 20.19 |
| TR | tr | Unicameral | 4 | 2009–04 | 2021–02 | 12.0 | 3.65 | 43.99 |
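As a quick sanity check on the table above, the Mw column can be summed and compared with the "half a billion words" stated in the abstract, and each Mw/Yr value can be checked against Mw divided by Yrs. The following is a minimal sketch with the table values hard-coded; the names `ROWS`, `total_mw` and `rate_is_consistent` are illustrative, and the tolerance is an assumption that only allows for rounding in the printed figures (Yrs to one decimal, Mw and Mw/Yr to two).

```python
# (ID, Yrs, Mw/Yr, Mw) copied from the table above.
ROWS = [
    ("BE", 4.8, 6.50, 31.37), ("BG", 5.8, 3.42, 20.02),
    ("CZ", 7.5, 3.03, 22.56), ("DK", 6.1, 4.85, 29.40),
    ("ES", 6.0, 2.19, 13.10), ("FR", 3.0, 10.75, 32.73),
    ("GB", 6.3, 17.25, 109.30), ("HR", 3.6, 5.81, 20.65),
    ("HU", 6.7, 0.13, 0.87), ("IS", 5.8, 4.06, 23.66),
    ("IT", 7.8, 3.46, 26.94), ("LT", 8.1, 1.82, 14.78),
    ("LV", 6.3, 1.02, 6.48), ("NL", 6.6, 7.74, 51.45),
    ("PL", 4.9, 5.66, 27.45), ("SI", 6.0, 3.34, 20.19),
    ("TR", 12.0, 3.65, 43.99),
]


def total_mw(rows):
    """Sum the Mw column (millions of words) over all corpora."""
    return sum(mw for _, _, _, mw in rows)


def rate_is_consistent(yrs, rate, mw):
    """Check that rate could equal mw / yrs, allowing for rounding only.

    Assumption: yrs is rounded to one decimal, mw and rate to two.
    """
    lo = (mw - 0.005) / (yrs + 0.05) - 0.005
    hi = (mw + 0.005) / (yrs - 0.05) + 0.005
    return lo <= rate <= hi


inconsistent = [cid for cid, yrs, rate, mw in ROWS
                if not rate_is_consistent(yrs, rate, mw)]

print(f"Total size: {total_mw(ROWS):.2f} million words")
print("Rows outside rounding tolerance:", inconsistent)
```

With the values as printed, the total comes to about 494.94 million words, in line with the abstract, and no row falls outside the rounding tolerance.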
Metadata on speakers with the number of political parties and groups, coalition/opposition entries (C/O), other “organisations”, number of defined persons, those with assigned gender, how many are MPs, and how many have party affiliation(s), known date of birth, one or more associated URLs and link to their photo
| ID | Prts | C/O | Orgs | Prsns | Gender | MP | Affil | Birth | URL | IMG |
|---|---|---|---|---|---|---|---|---|---|---|
| BE | 63 | 10 | 2 | 775 | 548 | 548 | 548 | 548 | 0 | 548 |
| BG | 14 | 4 | 5 | 606 | 606 | 420 | 310 | 534 | 99 | 0 |
| CZ | 61 | 5 | 851 | 485 | 461 | 366 | 366 | 403 | 463 | 364 |
| DK | 19 | 4 | 2 | 454 | 454 | 446 | 454 | 454 | 0 | 0 |
| ES | 50 | 10 | 2 | 814 | 814 | 764 | 758 | 793 | 0 | 0 |
| FR | 16 | 0 | 100 | 670 | 670 | 609 | 585 | 664 | 0 | 0 |
| GB | 31 | 5 | 2 | 1901 | 1901 | 1865 | 1897 | 0 | 1901 | 1029 |
| HR | 16 | 2 | 2 | 322 | 322 | 182 | 186 | 168 | 0 | 0 |
| HU | 10 | 0 | 2 | 194 | 194 | 194 | 194 | 192 | 0 | 0 |
| IS | 10 | 6 | 2 | 205 | 205 | 113 | 201 | 205 | 0 | 0 |
| IT | 42 | 22 | 2 | 739 | 739 | 689 | 589 | 739 | 0 | 0 |
| LT | 13 | 20 | 214 | 799 | 799 | 247 | 233 | 247 | 0 | 0 |
| LV | 11 | 0 | 2 | 219 | 219 | 174 | 174 | 0 | 0 | 0 |
| NL | 29 | 12 | 3 | 492 | 492 | 454 | 457 | 0 | 0 | 0 |
| PL | 10 | 3 | 1 | 1123 | 1122 | 743 | 709 | 742 | 0 | 0 |
| SI | 15 | 8 | 5 | 377 | 377 | 167 | 163 | 193 | 78 | 0 |
| TR | 19 | 3 | 2 | 1236 | 1236 | 1223 | 1203 | 0 | 0 | 0 |
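The Prsns column can likewise be totalled to compare with the "11 thousand speakers" mentioned in the abstract, and the MP column shows how many of the defined persons are members of parliament. The sketch below hard-codes both columns from the table above; `SPEAKERS` and the derived names are illustrative, and summing across corpora assumes that each corpus defines its persons independently.

```python
# ID -> (Prsns, MP), copied from the table above.
SPEAKERS = {
    "BE": (775, 548), "BG": (606, 420), "CZ": (485, 366),
    "DK": (454, 446), "ES": (814, 764), "FR": (670, 609),
    "GB": (1901, 1865), "HR": (322, 182), "HU": (194, 194),
    "IS": (205, 113), "IT": (739, 689), "LT": (799, 247),
    "LV": (219, 174), "NL": (492, 454), "PL": (1123, 743),
    "SI": (377, 167), "TR": (1236, 1223),
}

total_persons = sum(persons for persons, _ in SPEAKERS.values())
total_mps = sum(mps for _, mps in SPEAKERS.values())

# Share of defined persons who are MPs, per corpus.
mp_share = {cid: mps / persons for cid, (persons, mps) in SPEAKERS.items()}
lowest = min(mp_share, key=mp_share.get)

print(f"Defined persons: {total_persons}, of which MPs: {total_mps}")
print(f"Lowest MP share: {lowest} ({mp_share[lowest]:.1%})")
```

The totals are 11,411 defined persons and 9,204 MPs. The corpus with the lowest MP share is LT, where fewer than a third of the defined persons are MPs.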
Overview of selected data from ParlaMint corpora with the number of speeches, of speeches with a known speaker, of those not spoken by the chair of the sessions, spoken by MPs, and the number of marked-up headings, notes, and incidents
| ID | Speeches | W.Spks | W.NCs | W.MPs | Heads | Notes | Incidents |
|---|---|---|---|---|---|---|---|
| BE | 148,425 | 147,940 | 116,214 | 141,340 | 0 | 140,512 | 865 |
| BG | 146,351 | 146,295 | 73,981 | 120,780 | 0 | 0 | 34,313 |
| CZ | 154,460 | 154,460 | 72,301 | 150,957 | 0 | 188,563 | 25,692 |
| DK | 287,144 | 287,144 | 137,210 | 277,835 | 10,544 | 10,544 | 0 |
| ES | 49,919 | 27,812 | 21,414 | 27,709 | 1728 | 46,965 | 0 |
| FR | 481,603 | 465,590 | 421,241 | 437,965 | 12,498 | 12,498 | 62,709 |
| GB | 552,103 | 549,710 | 537,928 | 547,305 | 31,389 | 165,648 | 0 |
| HR | 124,496 | 124,486 | 62,128 | 116,716 | 16 | 9 | 11,842 |
| HU | 3086 | 3086 | 3086 | 3086 | 0 | 2958 | 3752 |
| IS | 74,132 | 74,132 | 71,693 | 71,900 | 0 | 99 | 41,405 |
| IT | 79,373 | 79,373 | 50,735 | 78,269 | 11,585 | 192,855 | 61,607 |
| LT | 244,835 | 244,835 | 126,488 | 229,980 | 1752 | 35,406 | 30,155 |
| LV | 122,136 | 122,136 | 60,663 | 117,899 | 0 | 122,136 | 0 |
| NL | 474,964 | 474,964 | 351,789 | 463,629 | 0 | 191,113 | 0 |
| PL | 331,044 | 331,044 | 226,046 | 302,965 | 516 | 9453 | 112,786 |
| SI | 75,122 | 75,122 | 37,216 | 70,609 | 1240 | 85,111 | 2337 |
| TR | 692,161 | 660,239 | 432,618 | 660,239 | 0 | 142,415 | 0 |
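The relation between the Speeches and W.Spks columns shows how completely speakers are identified in each corpus. The following sketch computes that coverage from the hard-coded table values and lists the corpora below a threshold; the 99% cut-off and the names `SPEECHES` and `low_coverage` are arbitrary choices for illustration, not part of the source.

```python
# ID -> (Speeches, W.Spks), copied from the table above.
SPEECHES = {
    "BE": (148_425, 147_940), "BG": (146_351, 146_295),
    "CZ": (154_460, 154_460), "DK": (287_144, 287_144),
    "ES": (49_919, 27_812), "FR": (481_603, 465_590),
    "GB": (552_103, 549_710), "HR": (124_496, 124_486),
    "HU": (3_086, 3_086), "IS": (74_132, 74_132),
    "IT": (79_373, 79_373), "LT": (244_835, 244_835),
    "LV": (122_136, 122_136), "NL": (474_964, 474_964),
    "PL": (331_044, 331_044), "SI": (75_122, 75_122),
    "TR": (692_161, 660_239),
}

THRESHOLD = 0.99  # illustrative cut-off, not taken from the source

# Fraction of speeches with a known speaker, per corpus.
coverage = {cid: known / total for cid, (total, known) in SPEECHES.items()}
low_coverage = sorted(cid for cid, share in coverage.items()
                      if share < THRESHOLD)

for cid in low_coverage:
    print(f"{cid}: {coverage[cid]:.1%} of speeches have a known speaker")
```

With these values, only ES, FR and TR fall below the threshold. ES stands out, with a known speaker for roughly 56% of its speeches.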
Fig. 1 Encoding of the start of a corpus header
Fig. 2 Example of encoding of a legislature taxonomy category
Fig. 3 Example of parliament, political party and coalition/opposition encoding
Fig. 4 Example of a speaker encoding
Fig. 5 Example of encoding the start of a corpus component header
Fig. 6 Example of encoded text with speeches
Fig. 7 Example of a linguistically analysed text