Item request has been placed! ×
Item request cannot be made. ×
loading  Processing Request

Leveraging large language models for structured information extraction from pathology reports

Item request has been placed! ×
Item request cannot be made. ×
loading   Processing Request
  • معلومة اضافية
    • الموضوع:
      2025
    • Collection:
      Computer Science
    • نبذة مختصرة :
      Background: Structured information extraction from unstructured histopathology reports facilitates data accessibility for clinical research. Manual extraction by experts is time-consuming and expensive, limiting scalability. Large language models (LLMs) offer efficient automated extraction through zero-shot prompting, requiring only natural language instructions without labeled data or training. We evaluate LLMs' accuracy in extracting structured information from breast cancer histopathology reports, compared to manual extraction by a trained human annotator. Methods: We developed the Medical Report Information Extractor, a web application leveraging LLMs for automated extraction. We developed a gold standard extraction dataset to evaluate the human annotator alongside five LLMs including GPT-4o, a leading proprietary model, and the Llama 3 model family, which allows self-hosting for data privacy. Our assessment involved 111 histopathology reports from the Breast Cancer Now (BCN) Generations Study, extracting 51 pathology features specified in the study's data dictionary. Results: Evaluation against the gold standard dataset showed that both Llama 3.1 405B (94.7% accuracy) and GPT-4o (96.1%) achieved extraction accuracy comparable to the human annotator (95.4%; p = 0.146 and p = 0.106, respectively). While Llama 3.1 70B (91.6%) performed below human accuracy (p <0.001), its reduced computational requirements make it a viable option for self-hosting. Conclusion: We developed an open-source tool for structured information extraction that can be customized by non-programmers using natural language. Its modular design enables reuse for various extraction tasks, producing standardized, structured data from unstructured text reports to facilitate analytics through improved accessibility and interoperability.
      Comment: 29 pages, 6 figures
    • الرقم المعرف:
      edsarx.2502.12183