Macedonian-English parallel corpus MaCoCu-mk-en 2.0

  • Additional information
    • Publication information:
      Jožef Stefan Institute
      Prompsit
      Rijksuniversiteit Groningen
      Universitat d'Alacant
    • Subject:
      2023
    • Collection:
      OLAC: Open Language Archives Community
    • Abstract:
      The Macedonian-English parallel corpus MaCoCu-mk-en 2.0 was built by crawling the “.mk” and “.мкд” internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate, near-duplicate paragraphs, and documents that are not in one of the target languages. Document and segment alignment, as implemented in Bitextor, were carried out, and Bifixer (https://github.com/bitextor/bifixer) and Bicleaner AI (https://github.com/bitextor/bicleaner-ai) were used for fixing, cleaning, and deduplicating the final version of the corpus. The corpus is available in three formats: two sentence-level formats, TXT and TMX, and a document-level TXT format. TMX is an XML-based format and TXT is a tab-separated format. Both consist of pairs of source and target segments (one or several sentences) and additional metadata.
      The following metadata is included in both sentence-level formats:
      - source and target document URL;
      - paragraph ID, which includes information on the position of the sentence in the paragraph and in the document (e.g., “p35:77s1/3”, which means “paragraph 35 out of 77, sentence 1 out of 3”);
      - quality score as provided by the tool Bicleaner AI (a likelihood of a pair of sentences being mutual translations, provided with a score between 0 and 1);
      - similarity score as provided by the sentence alignment tool Bleualign (value between 0 and 1);
      - personal information identification (“biroamer-entities-detected”): segments containing personal information are flagged, so final users of the corpus can decide whether to use these segments;
      - translation direction and machine ...
    • Relation:
      http://hdl.handle.net/11356/1513; http://hdl.handle.net/11356/1817
    • Rights:
      CC0-No Rights Reserved ; https://creativecommons.org/publicdomain/zero/1.0/
    • Accession number:
      edsbas.3824B691
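
The paragraph ID described in the abstract (for example “p35:77s1/3”, read as paragraph 35 out of 77, sentence 1 out of 3) can be split into its four numbers with a short parser. The following is a minimal sketch that assumes every ID follows exactly the pattern of that single example; the names `ParagraphId` and `parse_paragraph_id` are hypothetical and are not part of the corpus or of any MaCoCu tool.

```python
import re
from typing import NamedTuple


class ParagraphId(NamedTuple):
    """Position of a sentence, as encoded in an ID such as 'p35:77s1/3'."""
    paragraph: int          # index of the paragraph in the document
    paragraphs_total: int   # number of paragraphs in the document
    sentence: int           # index of the sentence in the paragraph
    sentences_total: int    # number of sentences in the paragraph


# Assumed layout, inferred only from the example in the abstract:
# p<paragraph>:<paragraphs_total>s<sentence>/<sentences_total>
_ID_PATTERN = re.compile(r"p(\d+):(\d+)s(\d+)/(\d+)")


def parse_paragraph_id(value: str) -> ParagraphId:
    """Parse a paragraph ID string; raise ValueError if it does not match."""
    match = _ID_PATTERN.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unrecognised paragraph ID: {value!r}")
    return ParagraphId(*(int(group) for group in match.groups()))


if __name__ == "__main__":
    print(parse_paragraph_id("p35:77s1/3"))
```

IDs that deviate from the assumed pattern raise `ValueError` rather than being guessed at, so any variants in the actual files would surface immediately.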