
Macedonian-English parallel corpus MaCoCu-mk-en 1.0

  • Additional Information
    • Publication Information:
      Jožef Stefan Institute
      Prompsit
      Rijksuniversiteit Groningen
      Universitat d'Alacant
    • Subject:
      2022
    • Collection:
      OLAC: Open Language Archives Community
    • Abstract:
      The Macedonian-English parallel corpus MaCoCu-mk-en 1.0 was built by crawling the ".mk" and ".мкд" internet top-level domains in 2021, extending the crawl dynamically to other domains as well. The entire crawling process was carried out by the MaCoCu crawler (https://github.com/macocu/MaCoCu-crawler). Websites containing documents in both target languages were identified and processed using the tool Bitextor (https://github.com/bitextor/bitextor). Considerable effort was devoted to cleaning the extracted text to provide a high-quality parallel corpus. This was achieved by removing boilerplate, near-duplicate paragraphs, and documents that are not in one of the target languages. Document and segment alignment as implemented in Bitextor were carried out, and BicleanerAI (https://github.com/bitextor/bicleaner-ai) and Bifixer (https://github.com/bitextor/bifixer) were used for fixing, cleaning, and deduplicating the final version of the corpus.
      While the TXT format consists solely of pairs of source and target segments (each consisting of one or several sentences), each segment pair in the TMX format is accompanied by the following metadata:
      - source and target document URL;
      - quality score as provided by the tool BicleanerAI;
      - translation direction identification: the source segment in each segment pair was identified by using a probabilistic model;
      - personal information identification (“biroamer-entities”): segments containing personal information are flagged, so that end users of the corpus can decide whether to use these segments;
      - language variants: the language variant of English (British or American) was identified for every segment pair at the document and domain level.
      Notice and take down: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:
      (1) Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
      (2) Clearly identify the copyrighted work claimed ...
    • Relation:
      http://hdl.handle.net/11356/1817; http://hdl.handle.net/11356/1513
    • Rights:
      CC0-No Rights Reserved ; https://creativecommons.org/publicdomain/zero/1.0/
    • Accession Number:
      edsbas.3791E1DB
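
The abstract says that each TMX segment pair carries metadata such as a BicleanerAI quality score. The following is a minimal sketch, not the corpus authors' tooling, of how such a file might be streamed and filtered by score using only the Python standard library. The `tu`, `tuv`, `seg`, and `prop` elements and the `xml:lang` attribute come from the TMX standard; the property name `score-bicleaner-ai`, its placement directly under `tu`, the 0.5 threshold, and the sample sentences are assumptions made for illustration and should be checked against the actual corpus file.

```python
import io
import xml.etree.ElementTree as ET

# Fully qualified name that ElementTree uses for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_segment_pairs(tmx_source, score_prop="score-bicleaner-ai", min_score=0.5):
    """Yield (segments_by_language, score) for translation units scoring >= min_score.

    score_prop is a hypothetical property name; adjust it to the real file.
    """
    for _, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        # Collect unit-level properties (assumed to sit directly under <tu>).
        props = {p.get("type"): (p.text or "") for p in elem.findall("prop")}
        # Map each language code to the text of its segment.
        segments = {}
        for tuv in elem.findall("tuv"):
            seg = tuv.find("seg")
            segments[tuv.get(XML_LANG)] = "".join(seg.itertext()) if seg is not None else ""
        try:
            score = float(props.get(score_prop, "nan"))
        except ValueError:
            score = float("nan")
        # Units with a missing or unparsable score compare as NaN and are skipped.
        if score >= min_score:
            yield segments, score
        elem.clear()  # release memory for units already processed


# Invented two-unit sample standing in for a real corpus file.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header/>
  <body>
    <tu>
      <prop type="score-bicleaner-ai">0.91</prop>
      <tuv xml:lang="mk"><seg>Добар ден.</seg></tuv>
      <tuv xml:lang="en"><seg>Good day.</seg></tuv>
    </tu>
    <tu>
      <prop type="score-bicleaner-ai">0.12</prop>
      <tuv xml:lang="mk"><seg>Почетна</seg></tuv>
      <tuv xml:lang="en"><seg>Click here to subscribe</seg></tuv>
    </tu>
  </body>
</tmx>
"""

pairs = list(iter_segment_pairs(io.BytesIO(SAMPLE.encode("utf-8"))))
for segments, score in pairs:
    print(score, segments.get("mk"), "|", segments.get("en"))
```

Streaming with `iterparse` and clearing each unit after use keeps memory use low, which matters if the real TMX file is large; for a real file, pass its path or an open binary file object instead of the in-memory sample.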