Dataset for evaluation of Slovene spell- and grammar-checking tools Šolar-Eval 1.0


Authors: Arhar Holdt, Špela; Gantar, Polona; Bon, Mija; Gapsa, Magdalena; Lavrič, Polona; Klemen, Matej
Source:
https://www.cjvt.si/prop/en/.
Subject:
evaluation; student writing; error annotation; grammatical error correction
Record type:
other/unknown material
Language:
Slovenian

Additional information
- Publication information:
  Centre for Language Resources and Technologies, University of Ljubljana
- Publication year:
  2023
- Collection:
  Linguistic Data and NLP Tools (CLARIN - Common Language Resources and Technology Infrastructure, Slovenia)
- Abstract:
  Šolar-Eval is a specialized dataset designed for the evaluation of Slovene spell- and grammar-checking tools and methodologies. It encompasses 109 essays authored by Slovene primary and secondary school students, featuring 9,808 language corrections meticulously annotated based on the identified language problems. The essays are sourced from the Šolar 3.0 corpus, which integrates authentic corrections from language teachers (http://hdl.handle.net/11356/1589). However, inconsistencies and heterogeneity are common in teacher corrections, particularly in style improvements, making this corpus suboptimal for evaluation tasks. For Šolar-Eval, the corrections were conducted by researchers aiming to ensure consistency, homogeneity, and minimal language intervention. The corrections are annotated according to the reference guidelines found in the attached document. The codes for language errors are structured hierarchically, facilitating robust or fine-grained evaluation. The dataset is accessible in JSON format as generated by the CJVT Svala 1.1 annotation tool (https://orodja.cjvt.si/svala/). The source and target texts are also available in the CoNLL-U format (https://universaldependencies.org/format.html). Furthermore, linguistic annotations were applied using the CLASSLA pipeline (https://github.com/clarinsi/classla/) across various levels, including tokenization, sentence segmentation, lemmatization, MULTEXT-East v6 MSD-tags, JOS-SYN dependency syntax, Universal Dependencies, and named entities (more about specific annotation layers: https://wiki.cjvt.si/shelves/linguistic-annotation-of-slovene-corpora). For better accessibility and wider usability, we provide versions with JOS-SYN as well as Universal Dependencies, and English as well as Slovene tags.
- File Description:
  text/plain; charset=utf-8; application/zip; application/pdf; downloadable_files_count: 4
- Relation:
  http://hdl.handle.net/11356/1902
- Rights:
  Creative Commons - Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) ; https://creativecommons.org/licenses/by-nc-sa/4.0/ ; PUB
- Accession number:
  edsbas.A3F672F8
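
The abstract states that the source and target text is distributed in the CoNLL-U format. As a rough illustration of what reading such a file involves, the sketch below parses CoNLL-U text using only the Python standard library. It relies on the general CoNLL-U conventions (ten tab-separated columns, `#` comment lines, blank lines between sentences); the function name `read_conllu` and the sample sentence are hypothetical and are not taken from the dataset or its documentation.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in the order defined by the Universal Dependencies format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        else:
            cols = line.split("\t")
            if len(cols) == len(FIELDS):
                sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


# Made-up sample for illustration only; not a sentence from Šolar-Eval.
SAMPLE = (
    "# text = Pes spi.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tspi\tspati\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
lemmas = [token["lemma"] for token in sentences[0]]
```

To use this on the actual data, one would read a downloaded `.conllu` file into a string and pass it to the function. The file names and any dataset-specific comment lines would need to be checked against the distributed archive.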
