Item request has been placed! ×
Item request cannot be made. ×
loading  Processing Request

Space and Time-Efficient Data Structures for Massive Datasets

Item request has been placed! ×
Item request cannot be made. ×
loading   Processing Request
  • المؤلفون: PIBIRI, GIULIO ERMANNO
  • المصدر:
    http://etd.adm.unipi.it/theses/available/etd-02262019-145646/.
  • الموضوع:
  • نوع التسجيلة:
    text
  • اللغة:
    Italian
  • معلومة اضافية
    • Contributors:
      Venturini, Rossano
    • بيانات النشر:
      Pisa University
    • الموضوع:
      2019
    • Collection:
      Università di Pisa: ETD (Electronic Theses and Dissertations)
    • نبذة مختصرة :
      This thesis concerns the design of compressed data structures for the efficient storage of massive datasets of integer sequences and short strings. The studied problems arise in several real-world scenarios, such as inverted index engineering, thus a consistent part of the thesis focuses on the experimental analysis of the proposed, practical, solutions. All the code used in the experimentation is publicly available in the hope of being useful for further research. For the problem of representing in compressed space the sorted integer sequences of inverted indexes, we propose three different solutions showing new interesting space/time trade-offs. The first proposal shows that clustering the inverted lists yields more compact indexes at the expense of a moderate slowdown in query processing speed. The core idea is that clusters of similar inverted lists, i.e., the ones sharing many integers, are compressed together, thus permitting to reduce their redundancy. The second proposal describes an optimal, linear-time, algorithm that splits an inverted list in variable-size partitions in order to minimize its space when the Variable-Byte encoding is used to compress its partitions. The proposed algorithm is fast and yields indexes that are more than twice smaller than regular Variable- Byte, without compromising their speed efficiency. The third proposal exploits the repetitiveness of the d-gapped inverted lists to achieve compact storage and very fast decoding speed. Specifically, we show that a dictionary of repeated integer patterns can be built to compress the inverted lists. This permits to represent the inverted lists as references to the dictionary’s patterns, thus allowing the two-fold advantage of: encoding many integers with a single reference identifier; decoding many integers with a single dictionary access. We then take into account strings of at most n words, i.e., n-grams and address the two fundamental problems of: indexing large and sparse n-gram datasets; esti- mating language models from a large ...
    • File Description:
      application/pdf
    • الدخول الالكتروني :
      http://etd.adm.unipi.it/theses/available/etd-02262019-145646/
    • Rights:
      info:eu-repo/semantics/openAccess ; Copyright information available at source archive
    • الرقم المعرف:
      edsbas.AF0301EF