
StackDPP: a stacking ensemble based DNA-binding protein prediction model.

  • Additional Information
    • Source:
      Publisher: BioMed Central; Country of Publication: England; NLM ID: 100965194; Publication Model: Electronic; Cited Medium: Internet; ISSN: 1471-2105 (Electronic); Linking ISSN: 14712105; NLM ISO Abbreviation: BMC Bioinformatics; Subsets: MEDLINE
    • Publication Information:
      Original Publication: [London] : BioMed Central, 2000-
    • Abstract:
      Background: DNA-binding proteins (DNA-BPs) are proteins that bind to and interact with DNA. DNA-BPs regulate and affect numerous biological processes, such as transcription, DNA replication and repair, and the organization of chromosomal DNA. However, very few proteins are DNA-binding in nature. Therefore, it is necessary to develop an efficient predictor for identifying DNA-BPs.
      Results: In this work, we propose new benchmark datasets for the DNA-binding protein prediction problem. We discovered several quality concerns with the widely used benchmark datasets, PDB1075 (for training) and PDB186 (for independent testing), which necessitated the preparation of new benchmark datasets. Our proposed datasets, UNIPROT1424 and UNIPROT356, can be used for model training and independent testing, respectively. We retrained selected state-of-the-art DNA-BP predictors on the new dataset and report their performance. We also trained a novel predictor on the new benchmark dataset. We extracted features from various feature categories, then used a Random Forest classifier and Recursive Feature Elimination with Cross-validation (RFECV) to select the optimal set of 452 features. We then proposed a stacking ensemble architecture as our final prediction model. Named Stacking Ensemble Model for DNA-binding Protein Prediction, or StackDPP for short, our model achieved accuracies of 0.92, 0.92, and 0.93 in 10-fold cross-validation, jackknife, and independent testing, respectively.
      Conclusion: StackDPP performed very well in cross-validation testing and outperformed all the state-of-the-art prediction models in independent testing. Its cross-validation scores generalized very well to the independent test set. The source code of the model is publicly available at https://github.com/HasibAhmed1624/StackDPP. Therefore, we expect that this generalized model can be adopted by researchers and practitioners to identify novel DNA-binding proteins.
      (© 2024. The Author(s).)
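The workflow summarized in the abstract (Random Forest with RFECV for feature selection, then a stacking ensemble scored by 10-fold cross-validation and on a held-out set) can be sketched with scikit-learn. This is a minimal illustration only, not the authors' implementation: the synthetic data, the choice of base learners (`rf`, `svc`), the logistic-regression meta-learner, and all hyperparameters are assumptions, since the abstract does not specify them. The helper names `select_features` and `build_stack` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC


def select_features(X, y, seed=0):
    """Fit RFECV with a Random Forest as the ranking estimator."""
    selector = RFECV(
        estimator=RandomForestClassifier(n_estimators=25, random_state=seed),
        step=2,  # drop two features per elimination round (assumed value)
        cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=seed),
        scoring="accuracy",
    )
    selector.fit(X, y)
    return selector


def build_stack(seed=0):
    """Build a stacking ensemble; the base learners here are assumptions."""
    base_learners = [
        ("rf", RandomForestClassifier(n_estimators=25, random_state=seed)),
        ("svc", SVC(random_state=seed)),
    ]
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,  # internal folds used to generate meta-features
    )


# Synthetic stand-in for a protein feature matrix (not the UNIPROT datasets).
X, y = make_classification(
    n_samples=240, n_features=20, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Feature selection is fit on the training split only.
selector = select_features(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# 10-fold cross-validation on the training split with the selected features.
model = build_stack()
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_train_sel, y_train, cv=folds, scoring="accuracy")

# Held-out evaluation, playing the role of the independent test set.
model.fit(X_train_sel, y_train)
test_accuracy = model.score(X_test_sel, y_test)

print(f"selected features: {selector.n_features_}")
print(f"10-fold CV accuracy: {cv_scores.mean():.3f}")
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Because the selector is fit before cross-validation, the cross-validation estimate in this sketch can be slightly optimistic; wrapping both steps in a scikit-learn `Pipeline` would remove that bias at the cost of longer run time.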
    • Contributed Indexing:
      Keywords: Classification; DNA-binding protein; Data imbalance; Recursive feature elimination; Sequence identity
    • Substance Nomenclature:
      0 (DNA-Binding Proteins)
      9007-49-2 (DNA)
    • Entry Date(s):
      Date Created: 20240315 Date Completed: 20240318 Latest Revision: 20240318
    • Update Code:
      20250114
    • PubMed Central ID:
      PMC10941422
    • DOI:
      10.1186/s12859-024-05714-9
    • PMID:
      38486135