
Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment

  • Additional Information
    • Contributors:
      Universitat Politècnica de Catalunya. Departament de Ciències de la Computació; Universitat Politècnica de Catalunya. ALBCOM - Algorismia, Bioinformàtica, Complexitat i Mètodes Formals
    • Subject:
      2007
    • Collection:
      Universitat Politècnica de Catalunya (UPC): Tesis Doctorals en Xarxa (TDX) / Theses and Dissertations Online
    • Abstract:
      Similarity of sequences is a key mathematical notion for classification and phylogenetic studies in biology. It is currently handled primarily using alignments. However, alignment methods seem inadequate for post-genomic studies, since they do not scale well with data set size and seem to be confined to genomic and proteomic sequences. Therefore, alignment-free similarity measures are actively pursued. Among those, USM (Universal Similarity Metric) has gained prominence. It is based on the deep theory of Kolmogorov complexity, and universality is its most novel and striking feature. Since it can only be approximated via data compression, USM is a methodology rather than a formula quantifying the similarity of two strings. Three approximations of USM are available, namely UCD (Universal Compression Dissimilarity), NCD (Normalized Compression Dissimilarity) and CD (Compression Dissimilarity). Their applicability and robustness are tested on various data sets, yielding a first massive quantitative estimate that the USM methodology and its approximations are of value. Despite the rich theory developed around USM, its experimental assessment has limitations: only a few data compressors have been tested in conjunction with USM, and mostly at a qualitative level; no comparison among UCD, NCD and CD is available; and no comparison of USM with existing methods, whether alignment-based or not, seems to be available.

      Results: We experimentally test the USM methodology by using 25 compressors, all three of its known approximations and six data sets of relevance to molecular biology. This offers the first systematic and quantitative experimental assessment of this methodology, which naturally complements the many theoretical and the preliminary experimental results available. Moreover, we compare the USM methodology with both alignment-based and alignment-free methods. We may group our experiments into two sets. The first one, performed via ROC (Receiver Operating Characteristic) analysis, aims at assessing the intrinsic ability of the ...
    • File Description:
      20 p.
    • ISSN:
      1471-2105
    • Relation:
      https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-252; Ferragina, P., Giancarlo, R., Greco, V., Manzini, G., Valiente, G. Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment. "BMC Bioinformatics", July 2007, vol. 8, no. 252, pp. 1-20; http://hdl.handle.net/2117/113030
    • Identifier:
      10.1186/1471-2105-8-252
    • Rights:
      Attribution 3.0 Spain ; http://creativecommons.org/licenses/by/3.0/es/ ; Open Access
    • Identifier:
      edsbas.20E25412
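
The abstract in this record notes that USM can only be approximated via data compression. As a rough illustration of that idea, the sketch below computes the commonly cited form of the normalized compression distance, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. Everything here is an assumption made for illustration: `zlib` stands in for a compressor and is not claimed to be among the 25 compressors the paper tested, the function names `compressed_size` and `ncd` are hypothetical, and the toy sequences are not from the paper's data sets.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance in its commonly cited form.

    Values closer to 0 suggest the compressor finds x and y similar;
    values closer to 1 suggest it finds little shared structure.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy inputs (illustrative only): a repetitive sequence, a near-copy of it
# with one substituted base, and a seeded pseudo-random sequence.
a = b"ACGTACGTTGCA" * 40
b = b"ACGTACGTTGCA" * 39 + b"ACGTACCTTGCA"
rng = random.Random(0)
c = bytes(rng.choice(b"ACGT") for _ in range(len(a)))

print("ncd(a, b) =", ncd(a, b))
print("ncd(a, c) =", ncd(a, c))
```

With these toy inputs the near-copy should score lower than the unrelated sequence, though the exact values depend on the compressor and its settings.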