Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 3.0

  • Additional Information
    • Publication Information:
      CLARIN ERIC
    • Subject:
      2023
    • Collection:
      OLAC: Open Language Archives Community
    • Abstract:
      ParlaMint-en 3.0 comprises the linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0 (http://hdl.handle.net/11356/1488), machine translated to English, with the translation linguistically annotated. Except for the translation to English, small changes in the metadata and the absence of the British parliament corpus, the corpora included in this entry are in all respects identical to the source language corpora, i.e. the entry comprises the same 26 European parliamentary corpora, which together contain over 1.1 billion words. The translation to English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) with OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done on the sentence level and covers both speeches and transcriber notes, including headings. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation, was done with Stanza (https://stanfordnlp.github.io/stanza/), using the English language model. For NER, the conll03 model with 4 NE classes was used. Note that the automatically produced translation to English contains errors typical of neural machine translation, including factual errors even when a high level of fluency is achieved, and any manual or automatic usage of this corpus should take the machine translation limitations into account. Note also that some metadata errors were noticed after the source 3.0 corpora were released and were corrected for the machine-translated corpus, so there are slight differences in the metadata between the two. The files associated with this entry include the linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata.
In contrast to the source language corpora, the CoNLL-U ...
    • Relation:
      http://hdl.handle.net/11356/1810
    • Rights:
      Creative Commons - Attribution 4.0 International (CC BY 4.0) ; https://creativecommons.org/licenses/by/4.0/
    • Accession Number:
      edsbas.A6402332
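
The abstract lists CoNLL-U as one of the distribution formats. As a minimal sketch of how such files could be read, the Python below parses the ten standard tab-separated CoNLL-U columns and counts UD PoS tags. It assumes only the generic CoNLL-U layout; the sample sentence and the names `read_conllu`, `upos_counts` and `SAMPLE` are invented for illustration and are not taken from the corpus or its tooling, and any corpus-specific conventions are not modelled.

```python
from collections import Counter

# The ten standard CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def read_conllu(text):
    """Yield each sentence as a list of token dicts from a CoNLL-U string."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # not a well-formed token line
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


def upos_counts(text):
    """Count UD part-of-speech tags over all tokens in a CoNLL-U string."""
    return Counter(tok["upos"] for sent in read_conllu(text) for tok in sent)


# Invented sample sentence (hypothetical, NOT taken from the corpus).
SAMPLE = (
    "# sent_id = example.1\n"
    "# text = The session is open.\n"
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tsession\tsession\tNOUN\t_\t_\t4\tnsubj\t_\t_\n"
    "3\tis\tbe\tAUX\t_\t_\t4\tcop\t_\t_\n"
    "4\topen\topen\tADJ\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "5\t.\t.\tPUNCT\t_\t_\t4\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for tag, n in sorted(upos_counts(SAMPLE).items()):
        print(tag, n)
```

For real use, the string passed to `read_conllu` would come from a downloaded file; since the translations are machine-produced, counts derived this way inherit the translation errors the abstract warns about.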