Publication
Leveraging Clinical Notes and Natural Language Processing for Dementia Detection (Preprint)
Publisher:
JMIR Publications Inc.
Date:
11-10-2022
DOI:
10.2196/PREPRINTS.43417
Abstract: Routinely collected data (e.g., coded hospital data and clinical notes) are widely used to develop dementia prevalence estimates. This approach is limited and often yields low sensitivity (around 0.14-0.26), because dementia is not always recorded in the hospital notes in a way that allows hospital coders to code for it. We aim to develop several NLP models on clinical notes that identify dementia more accurately than structured (coded) data. After filtering patients who were admitted before 2021 in Peninsula Health, the Australia National Care Healthy Ageing data team established three validation cohorts based on 1048 patients: (1) confirmed dementia, (2) confirmed non-dementia, and (3) possible/uncertain dementia. Medical experts identified 62 dementia-related terms and phrases; we extended the list with UMLS and generated 245 dementia-related key concepts. For each patient, all clinical notes were used. We applied a fine-tuned MedCAT model to the clinical notes and generated the corresponding UMLS annotations for each patient. We developed four statistical machine learning models (Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest) as well as a deep learning pipeline that combines text selection with a fine-tuned ClinicalBioBERT. We evaluated these models in two settings: fine-grained (dementia, non-dementia, uncertain) and binary (dementia vs non-dementia) classification. 10-fold cross-validation was used to evaluate the classifiers' performance. We found that dementia patients had three times more medical notes than non-dementia patients (241-283 vs 66-74). In the dementia group, % of documents were progress notes. Among the four statistical models, Random Forest performed best on binary classification and Linear SVM performed best on fine-grained classification.
The deep learning pipeline did not perform well; the main reason could be that the black-box encoding of ClinicalBioBERT results in heavy information loss. The 245 dementia-related UMLS concepts that we identified are useful for dementia detection based on medical notes. Random Forest and Support Vector Machine achieved the best performance for binary and fine-grained dementia classification, respectively. Deep learning models are not yet ready for low-resource biomedical classification tasks.
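
The evaluation protocol named in the abstract (10-fold cross-validation, with sensitivity as the headline metric for dementia detection) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy data, and the threshold classifier on per-patient counts of dementia-related concept mentions are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import random


def sensitivity(y_true, y_pred, positive="dementia"):
    """Share of truly positive patients that were predicted positive."""
    tp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
    return tp / (tp + fn) if (tp + fn) else 0.0


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def cross_validate(features, labels, fit, predict, k=10, seed=0):
    """Return one sensitivity score per held-out fold."""
    scores = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        preds = [predict(model, features[i]) for i in held_out]
        truth = [labels[i] for i in held_out]
        scores.append(sensitivity(truth, preds))
    return scores


# Hypothetical stand-in classifier: threshold halfway between the class means.
def fit_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == "dementia"]
    neg = [x for x, y in zip(xs, ys) if y != "dementia"]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return "dementia" if x > threshold else "non-dementia"


# Toy data (invented): concept-mention counts per patient, binary setting.
counts = [8 + i % 5 for i in range(20)] + [i % 4 for i in range(20)]
labels = ["dementia"] * 20 + ["non-dementia"] * 20

scores = cross_validate(counts, labels, fit_threshold, predict_threshold)
print(f"mean sensitivity over {len(scores)} folds: {sum(scores) / len(scores):.2f}")
```

Passing the classifier as `fit` and `predict` callables keeps the cross-validation loop independent of the model, so the same loop could in principle wrap any of the four statistical models mentioned above.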