Systematic characterizations of text similarity in full text biomedical publications.

  • Authors: Sun Z; Errami M; Long T; Renard C; Choradia N; Garner H
  • Source:
    PloS one [PLoS One] 2010 Sep 15; Vol. 5 (9), pp. e12704. Date of Electronic Publication: 2010 Sep 15.
  • Publication Type:
    Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't
  • Language:
    English
  • Additional Information
    • Source:
      Publisher: Public Library of Science; Country of Publication: United States; NLM ID: 101285081; Publication Model: Electronic; Cited Medium: Internet; ISSN: 1932-6203 (Electronic); Linking ISSN: 19326203; NLM ISO Abbreviation: PLoS One; Subsets: MEDLINE
    • Publication Information:
      Original Publication: San Francisco, CA : Public Library of Science
    • Subject:
    • Abstract:
      Background: Computational methods have been used to find duplicate biomedical publications in MEDLINE. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. Here, we quantitatively investigated the full text similarity of biomedical publications in PubMed Central.
      Methodology/Principal Findings: 72,011 full text articles from PubMed Central (PMC) were parsed to generate three different datasets: full texts, sections, and paragraphs. Text similarity comparisons were performed on these datasets using the text similarity algorithm eTBLAST. We measured the frequency of similar text pairs and compared it among different datasets. We found that high abstract similarity can be used to predict high full text similarity with a specificity of 20.1% (95% CI [17.3%, 23.1%]) and sensitivity of 99.999%. Abstract similarity and full text similarity have a moderate correlation (Pearson correlation coefficient: -0.423) when the similarity ratio is above 0.4. Among pairs of articles in PMC, method sections are found to be the most repetitive (frequency of similar pairs, methods: 0.029, introduction: 0.0076, results: 0.0043). In contrast, among a set of manually verified duplicate articles, results are the most repetitive sections (frequency of similar pairs, results: 0.94, methods: 0.89, introduction: 0.82). Repetition of introduction and methods sections is more likely to be committed by the same authors (odds of a highly similar pair having at least one shared author, introduction: 2.31, methods: 1.83, results: 1.03). There is also significantly more similarity in pairs of review articles than in pairs containing one review and one nonreview paper (frequency of similar pairs: 0.0167 and 0.0023, respectively).
      Conclusion/Significance: While quantifying abstract similarity is an effective approach for finding duplicate citations, a comprehensive full text analysis is necessary to uncover all potential duplicate citations in the scientific literature and is helpful when establishing ethical guidelines for scientific publications.
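The abstract reports sensitivity, specificity, and a Pearson correlation for abstract similarity as a predictor of full text similarity. The following is a minimal sketch of how such figures could be computed from pairs of similarity scores. It does not reproduce eTBLAST or the authors' pipeline: the function names, the 0.5 cutoffs, and the toy scores are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def sensitivity_specificity(pairs, abstract_cutoff, fulltext_cutoff):
    """Treat 'abstract similarity >= abstract_cutoff' as a prediction of
    'full text similarity >= fulltext_cutoff' and return
    (sensitivity, specificity). Both cutoffs are illustrative assumptions."""
    tp = fp = tn = fn = 0
    for abstract_sim, fulltext_sim in pairs:
        predicted = abstract_sim >= abstract_cutoff
        actual = fulltext_sim >= fulltext_cutoff
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up (abstract similarity, full text similarity) scores for five pairs.
toy_pairs = [(0.9, 0.8), (0.7, 0.2), (0.1, 0.1), (0.2, 0.9), (0.05, 0.3)]
sens, spec = sensitivity_specificity(toy_pairs, 0.5, 0.5)
print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

On real data the pairs would come from a text similarity tool, and the cutoffs would need to be chosen and justified for that tool's score scale.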
    • Grant Information:
      R01 LM009758 United States LM NLM NIH HHS; R01 LM009758-01 United States LM NLM NIH HHS
    • Entry Date(s):
      Date Created: 20100922; Date Completed: 20110218; Latest Revision: 20220316
    • Update Code:
      20221213
    • PubMed Central ID:
      PMC2939881
    • DOI:
      10.1371/journal.pone.0012704
    • PMID:
      20856807