Abstract: Bosnian, Croatian, Montenegrin and Serbian are the official standard linguistic varieties in Bosnia and Herzegovina, Croatia, Montenegro, and Serbia, respectively. When these four countries were part of the former Yugoslavia, the varieties were considered to share a single linguistic standard. After the individual countries were established, the national standards emerged. Today, a central question about these varieties remains the following: How different are they from each other? How hard is it to distinguish them? While this has been addressed in NLP as part of the task on Distinguishing Between Similar Languages (DSL), little is known about human performance, making it difficult to contextualize system results. We tackle this question by reannotating the existing BCMS dataset for DSL with annotators from all target regions. We release a new gold standard, replacing the original single-annotator, single-label annotation by a multi-annotator, multi-label one, thus improving annotation reliability and explicitly coding the existence of ambiguous instances. We reassess a previously proposed DSL system on the new gold standard and establish the human upper bound on the task. Finally, we identify sources of annotation difficulties and provide linguistic insights into the BCMS dialect continuum, with multiple indicators highlighting an intermediate position of Bosnian and Montenegrin. ; Peer reviewed
Relation: Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024; International conference on computational linguistics; LREC proceedings; We thank the anonymous reviewers for their feedback, as well as Mikko Aulamo, Yves Scherrer, and Amelie Wührl for their help in setting up the annotation platform. We are also indebted to our volunteer annotators, whose participation enabled this work: Vesna Arsenović, Katja Bilać, Bojana Damnjanović, Ljubomir Ivanović, Biljana Kaurin, Marijana Kaurin, Maida Kojić McAndrew, Irina Masnikosa, Snežana Naić, Marija Runić, Tibor Weigand, and the remaining 22 participants who wished to remain anonymous. Aleksandra Miletić was supported by Academy of Finland project number 342859. Filip Miletić was supported by DFG research grant SCHU 2580/5-1.; Miletić, A & Miletić, F 2024, A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian. in S Balloccu, A Belz, R Huidrom, E Reiter, J Sedoc & C Thomson (eds), Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024. International conference on computational linguistics, LREC proceedings, European Language Resources Association (ELRA), Paris, pp. 36-46, Workshop on Human Evaluation of NLP Systems, Torino, Italy, 21/05/2024.; workshop; http://hdl.handle.net/10138/577247; abe197ac-4301-4360-9294-807a2f3cf134; 85195208713