Macedonian-English parallel corpus MaCoCu-mk-en 2.0

Authors: Bañón, Marta; Chichirau, Malina; Esplà-Gomis, Miquel; Forcada, Mikel L.; Galiano-Jiménez, Aarón; García-Romero, Cristian; Kuzman, Taja; Ljubešić, Nikola; van Noord, Rik; Pla Sempere, Leopoldo; Ramírez-Sánchez, Gema; Rupnik, Peter; Suchomel, Vít; Toral, Antonio; Zaragoza-Bernabeu, Jaume
Subject:
parallel corpus; web corpus; multilingual
Record type:
text
Language:
Macedonian
English

Additional information
- Publisher information:
  Jožef Stefan Institute
  Prompsit
  Rijksuniversiteit Groningen
  Universitat d'Alacant
- Publication year:
  2023
- Collection:
  OLAC: Open Language Archives Community
- Abstract:
  The Macedonian-English parallel corpus MaCoCu-mk-en 2.0 was built by crawling the “.mk” and “.мкд” internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate, near-duplicated paragraphs, and documents that are not in one of the targeted languages. Document and segment alignment as implemented in Bitextor were carried out, and Bifixer (https://github.com/bitextor/bifixer) and Bicleaner AI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus. The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. TMX is an XML-based format and TXT is a tab-separated format. Both consist of pairs of source and target segments (one or several sentences) and additional metadata.
  The following metadata is included in both sentence-level formats:
  - source and target document URL;
  - paragraph ID, which includes information on the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3”, which means “paragraph 35 out of 77, sentence 1 out of 3”);
  - quality score as provided by the tool Bicleaner AI (the likelihood of a pair of sentences being mutual translations, given as a score between 0 and 1);
  - similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1);
  - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
  - translation direction and machine ...
- Relation:
  http://hdl.handle.net/11356/1513; http://hdl.handle.net/11356/1817
- Rights:
  CC0-No Rights Reserved ; https://creativecommons.org/publicdomain/zero/1.0/
- Accession number:
  edsbas.3824B691
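
The paragraph ID described in the abstract packs four numbers into one string. As an illustration only, the following Python sketch shows one way such an ID might be decoded. The pattern is inferred solely from the single example given in the abstract (“p35:77s1/3”), so the real corpus may contain variants it does not cover, and the names `ParagraphPosition` and `parse_paragraph_id` are hypothetical and not part of any MaCoCu tool.

```python
import re
from typing import NamedTuple

# Pattern inferred from the one example in the abstract ("p35:77s1/3").
# This is an assumption; the corpus itself may use other variants.
_PARAGRAPH_ID = re.compile(r"p(\d+):(\d+)s(\d+)/(\d+)")


class ParagraphPosition(NamedTuple):
    """Hypothetical container for the four numbers in a paragraph ID."""
    paragraph: int                # index of the paragraph in the document
    paragraphs_in_document: int   # total number of paragraphs in the document
    sentence: int                 # index of the sentence in the paragraph
    sentences_in_paragraph: int   # total number of sentences in the paragraph


def parse_paragraph_id(value: str) -> ParagraphPosition:
    """Split an ID such as 'p35:77s1/3' into its four integer parts."""
    match = _PARAGRAPH_ID.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised paragraph ID: {value!r}")
    return ParagraphPosition(*(int(group) for group in match.groups()))
```

For the example from the abstract, `parse_paragraph_id("p35:77s1/3")` yields paragraph 35 of 77 and sentence 1 of 3. Any string that does not match the assumed pattern raises `ValueError` rather than being silently accepted.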
