Optimizing unsupervised feature engineering and classification pipelines for differentiated thyroid cancer recurrence prediction

Authors: Emmanuel Onah; Uche Jude Eze; Abdullahi Salahudeen Abdulraheem; Ugochukwu Gabriel Ezigbo; Kosisochi Chinwendu Amorha; Fidele Ntie-Kang
Source:
BMC Med Inform Decis Mak
BMC Medical Informatics and Decision Making, Vol 25, Iss 1, Pp 1-22 (2025)
Subject:
Male; Adult; PCA; Recurrence prediction; Differentiated thyroid cancer (DTC); Research; Computer applications to medicine. Medical informatics; R858-859.7; Logistic regression; Middle Aged; Dimensionality reduction; Machine Learning; Machine learning; Humans; Female; Thyroid Neoplasms; Neural Networks, Computer; Neoplasm Recurrence, Local; Unsupervised Machine Learning
Record type:
Article
Other literature type
Language:
English
Online access:
https://pubmed.ncbi.nlm.nih.gov/40361143
https://doaj.org/article/e9dc652702124332821d7cb4d2a909ad

Additional information
- Publication data:
  Springer Science and Business Media LLC, 2025.
- Subject:
  2025
- Abstract:
  Differentiated thyroid cancer (DTC) is a common endocrine malignancy with rising incidence and frequent recurrence, despite a generally favorable prognosis. Accurate recurrence prediction is critical for guiding post-treatment strategies. This study aimed to enhance predictive performance by refining feature engineering and evaluating a diverse ensemble of machine learning models using the UCI DTC dataset.
  Unsupervised data engineering, specifically dimensionality reduction and clustering, was used to improve feature quality. Principal Component Analysis (PCA) and Truncated Singular Value Decomposition (t-SVD) were selected based on superior clustering metrics: Adjusted Rand Index (ARI > 0.55) and V-measure (> 0.45). These were integrated into classification pipelines using Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), Feedforward Neural Network (FNN), and Gradient Boosting (GB). Model performance was evaluated through bootstrapping on an independent test set, stratified 10-fold cross-validation (CV), and subgroup analyses. Metrics included balanced accuracy, F1 score, AUC, sensitivity, specificity, and precision, each reported with 95% confidence intervals (CIs). SHAP analysis supported model interpretability.
  The PCA-based LR pipeline achieved the best test set performance: balanced accuracy of 0.95 (95% CI: 0.90-0.99), AUC of 0.99 (95% CI: 0.97-1.00), and sensitivity of 0.94 (95% CI: 0.84-1.00). In stratified CV, it maintained strong results (balanced accuracy: 0.86; AUC: 0.97; sensitivity: 0.80), with consistent performance across clinically relevant subgroups. The t-SVD-based LR pipeline showed comparable performance on both test and CV sets. SVM and FNN pipelines also performed robustly (test AUCs > 0.99; CV AUCs > 0.96). RF and KNN had high specificity but slightly lower sensitivity (test: ~0.87; CV: 0.77-0.80). GB pipelines showed the lowest overall performance (test balanced accuracy: 0.86-0.88; CV: 0.85-0.88).
  Dimensionality reduction via PCA and t-SVD significantly improved model performance, particularly for LR, SVM, FNN, RF and KNN classifiers. The PCA-based LR pipeline showed the best generalizability, supporting its potential integration into clinical decision-support tools for personalized DTC management.
  Not applicable.
- ISSN:
  1472-6947
- Identifier:
  10.1186/s12911-025-03018-3
- Rights:
  CC BY NC ND
- Identifier:
  edsair.doi.dedup.....47eca219c4d6377d624e6cad30633d87
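
Illustrative note: the abstract reports each test-set metric with a 95% confidence interval obtained by bootstrapping. The record does not say which bootstrap variant the authors used, so the following is only a minimal sketch that assumes a simple percentile bootstrap over test cases. The function names, the resample count, and the labels are hypothetical and are not data or code from the study.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels where 1 = recurrence."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp, fn, tn, fp


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; None if a class is absent."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        return None
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample test cases with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        score = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if score is not None:  # skip resamples that lost one class entirely
            scores.append(score)
    scores.sort()
    lo = scores[int((alpha / 2) * len(scores))]
    hi = scores[min(len(scores) - 1, int((1 - alpha / 2) * len(scores)))]
    return lo, hi


# Made-up labels for illustration only (assumption, not study data):
# 20 recurrences (17 detected) and 60 non-recurrences (57 correctly cleared).
y_true = [1] * 20 + [0] * 60
y_pred = [1] * 17 + [0] * 3 + [0] * 57 + [1] * 3

point = balanced_accuracy(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred, balanced_accuracy)
print(f"balanced accuracy = {point:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

The same `bootstrap_ci` helper could be passed any other metric function with the same signature (for example sensitivity or specificity alone), which is one way the per-metric intervals listed in the abstract might be produced.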
