ARDC Research Link Australia

Publication

Defining Patient-Oriented Natural Language Processing: A New Paradigm for Research and Development to Facilitate Adoption and Use by Medical Experts (Preprint)

Publisher: JMIR Publications Inc.

Date: 28-02-2020

Abstract: he capabilities of natural language processing (NLP) methods have expanded significantly in recent years, and progress has been particularly driven by advances in data science and machine learning. However, NLP is still largely underused in patient-oriented clinical research and care (POCRC). A key reason behind this is that clinical NLP methods are typically developed, optimized, and evaluated with narrowly focused data sets and tasks (eg, those for the detection of specific symptoms in free texts). Such research and development (R& D) approaches may be described as i roblem oriented /i , and the developed systems perform specialized tasks well. As standalone systems, however, they generally do not comprehensively meet the needs of POCRC. Thus, there is often a gap between the capabilities of clinical NLP methods and the needs of patient-facing medical experts. We believe that to increase the practical use of biomedical NLP, future R& D efforts need to be broadened to a new research paradigm—one that explicitly incorporates characteristics that are crucial for POCRC. We present our viewpoint about 4 such interrelated characteristics that can increase NLP systems’ suitability for POCRC (3 that represent NLP system properties and 1 associated with the R& D process)—(1) interpretability (the ability to explain system decisions), (2) patient centeredness (the capability to characterize erse patients), (3) customizability (the flexibility for adapting to distinct settings, problems, and cohorts), and (4) multitask evaluation (the validation of system performance based on multiple tasks involving heterogeneous data sets). By using the NLP task of clinical concept detection as an ex le, we detail these characteristics and discuss how they may result in the increased uptake of NLP systems for POCRC.

Publication

Defining Patient-Oriented Natural Language Processing: A New Paradigm for Research and Development to Facilitate Adoption and Use by Medical Experts

Publisher: JMIR Publications Inc.

Date: 28-09-2021

DOI: 10.2196/18471

Abstract: The capabilities of natural language processing (NLP) methods have expanded significantly in recent years, and progress has been particularly driven by advances in data science and machine learning. However, NLP is still largely underused in patient-oriented clinical research and care (POCRC). A key reason behind this is that clinical NLP methods are typically developed, optimized, and evaluated with narrowly focused data sets and tasks (eg, those for the detection of specific symptoms in free texts). Such research and development (R& D) approaches may be described as problem oriented, and the developed systems perform specialized tasks well. As standalone systems, however, they generally do not comprehensively meet the needs of POCRC. Thus, there is often a gap between the capabilities of clinical NLP methods and the needs of patient-facing medical experts. We believe that to increase the practical use of biomedical NLP, future R& D efforts need to be broadened to a new research paradigm—one that explicitly incorporates characteristics that are crucial for POCRC. We present our viewpoint about 4 such interrelated characteristics that can increase NLP systems’ suitability for POCRC (3 that represent NLP system properties and 1 associated with the R& D process)—(1) interpretability (the ability to explain system decisions), (2) patient centeredness (the capability to characterize erse patients), (3) customizability (the flexibility for adapting to distinct settings, problems, and cohorts), and (4) multitask evaluation (the validation of system performance based on multiple tasks involving heterogeneous data sets). By using the NLP task of clinical concept detection as an ex le, we detail these characteristics and discuss how they may result in the increased uptake of NLP systems for POCRC.

Publication

Can accurate demographic information about people who use prescription medications nonmedically be derived from Twitter?

Publisher: Proceedings of the National Academy of Sciences

Date: 14-02-2023

DOI: 10.1073/PNAS.2207391120

Abstract: Traditional substance use (SU) surveillance methods, such as surveys, incur substantial lags. Due to the continuously evolving trends in SU, insights obtained via such methods are often outdated. Social media-based sources have been proposed for obtaining timely insights, but methods leveraging such data cannot typically provide fine-grained statistics about subpopulations, unlike traditional approaches. We address this gap by developing methods for automatically characterizing a large Twitter nonmedical prescription medication use (NPMU) cohort (n = 288,562) in terms of age-group, race, and gender. Our natural language processing and machine learning methods for automated cohort characterization achieved 0.88 precision (95% CI:0.84 to 0.92) for age-group, 0.90 (95% CI: 0.85 to 0.95) for race, and 94% accuracy (95% CI: 92 to 97) for gender, when evaluated against manually annotated gold-standard data. We compared automatically derived statistics for NPMU of tranquilizers, stimulants, and opioids from Twitter with statistics reported in the National Survey on Drug Use and Health (NSDUH) and the National Emergency Department S le (NEDS). Distributions automatically estimated from Twitter were mostly consistent with the NSDUH [Spearman r : race: 0.98 ( P 0.005) age-group: 0.67 ( P 0.005) gender: 0.66 ( P = 0.27)] and NEDS, with 34/65 (52.3%) of the Twitter-based estimates lying within 95% CIs of estimates from the traditional sources. Explainable differences (e.g., overrepresentation of younger people) were found for age-group-related statistics. Our study demonstrates that accurate subpopulation-specific estimates about SU, particularly NPMU, may be automatically derived from Twitter to obtain earlier insights about targeted subpopulations compared to traditional surveillance approaches.

Publication

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid (Preprint)

Publisher: JMIR Publications Inc.

Date: 18-12-2020

DOI: 10.2196/PREPRINTS.26616

Abstract: he wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers’ perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the ersity of content in social media chatter. his study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services using Medicaid, the single largest source of health coverage in the United States, as an ex le. e collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website’s search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a s le of tweets in 5 predetermined categories or i other /i and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods. e manually annotated 11,379 tweets (Corpus 1: 9179 Corpus 2: 2200) and used 7930 (69.7%) for training, 1449 (12.7%) for validation, and 2000 (17.6%) for testing. A classifier based on BERT obtained the highest accuracies (81.7%, Corpus 1 80.7%, Corpus 2) and F sub /sub scores on consumer feedback (0.58, Corpus 1 0.90, Corpus 2), outperforming the second best classifiers in terms of accuracy (74.6%, RF on Corpus 1 69.4%, RF on Corpus 2) and F sub /sub score on consumer feedback (0.44, NN on Corpus 1 0.82, RF on Corpus 2). Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78%) and consumer feedback (15073/27337, 55.14%) tweets being the most frequent for Corpus 1 and Corpus 2, respectively. he broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies. >

Publication

Text classification models for the automatic detection of nonmedical prescription medication use from social media

Publisher: Springer Science and Business Media LLC

Date: 26-01-2021

DOI: 10.1186/S12911-021-01394-0

Abstract: Prescription medication (PM) misuse/abuse has emerged as a national crisis in the United States, and social media has been suggested as a potential resource for performing active monitoring. However, automating a social media-based monitoring system is challenging—requiring advanced natural language processing (NLP) and machine learning methods. In this paper, we describe the development and evaluation of automatic text classification models for detecting self-reports of PM abuse from Twitter. We experimented with state-of-the-art bi-directional transformer-based language models, which utilize tweet-level representations that enable transfer learning (e.g., BERT, RoBERTa, XLNet, AlBERT, and DistilBERT), proposed fusion-based approaches, and compared the developed models with several traditional machine learning, including deep learning, approaches. Using a public dataset, we evaluated the performances of the classifiers on their abilities to classify the non-majority “abuse/misuse” class. Our proposed fusion-based model performs significantly better than the best traditional model (F 1 -score [95% CI]: 0.67 [0.64–0.69] vs. 0.45 [0.42–0.48]). We illustrate, via experimentation using varying training set sizes, that the transformer-based models are more stable and require less annotated data compared to the other models. The significant improvements achieved by our best-performing classification model over past approaches makes it suitable for automated continuous monitoring of nonmedical PM use from Twitter. BERT, BERT-like and fusion-based models outperform traditional machine learning and deep learning models, achieving substantial improvements over many years of past research on the topic of prescription medication misuse/abuse classification from social media, which had been shown to be a complex task due to the unique ways in which information about nonmedical use is presented. Several challenges associated with the lack of context and the nature of social media language need to be overcome to further improve BERT and BERT-like models. These experimental driven challenges are represented as potential future research directions.

Publication

The Early Detection of Fraudulent COVID-19 Products From Twitter Chatter: Data Set and Baseline Approach Using Anomaly Detection

Publisher: JMIR Publications Inc.

Date: 14-03-2023

DOI: 10.2196/43694

Abstract: Social media has served as a lucrative platform for spreading misinformation and for promoting fraudulent products for the treatment, testing, and prevention of COVID-19. This has resulted in the issuance of many warning letters by the US Food and Drug Administration (FDA). While social media continues to serve as the primary platform for the promotion of such fraudulent products, it also presents the opportunity to identify these products early by using effective social media mining methods. Our objectives were to (1) create a data set of fraudulent COVID-19 products that can be used for future research and (2) propose a method using data from Twitter for automatically detecting heavily promoted COVID-19 products early. We created a data set from FDA-issued warnings during the early months of the COVID-19 pandemic. We used natural language processing and time-series anomaly detection methods for automatically detecting fraudulent COVID-19 products early from Twitter. Our approach is based on the intuition that increases in the popularity of fraudulent products lead to corresponding anomalous increases in the volume of chatter regarding them. We compared the anomaly signal generation date for each product with the corresponding FDA letter issuance date. We also performed a brief manual analysis of chatter associated with 2 products to characterize their contents. FDA warning issue dates ranged from March 6, 2020, to June 22, 2021, and 44 key phrases representing fraudulent products were included. From 577,872,350 posts made between February 19 and December 31, 2020, which are all publicly available, our unsupervised approach detected 34 out of 44 (77.3%) signals about fraudulent products earlier than the FDA letter issuance dates, and an additional 6 (13.6%) within a week following the corresponding FDA letters. Content analysis revealed misinformation, information, political, and conspiracy theories to be prominent topics. Our proposed method is simple, effective, easy to deploy, and does not require high-performance computing machinery unlike deep neural network–based methods. The method can be easily extended to other types of signal detection from social media data. The data set may be used for future research and the development of more advanced methods.

Publication

The Role of Natural Language Processing during the COVID-19 Pandemic: Health Applications, Opportunities, and Challenges

Publisher: MDPI AG

Date: 12-11-2022

DOI: 10.3390/HEALTHCARE10112270

Abstract: The COVID-19 pandemic is the most devastating public health crisis in at least a century and has affected the lives of billions of people worldwide in unprecedented ways. Compared to pandemics of this scale in the past, societies are now equipped with advanced technologies that can mitigate the impacts of pandemics if utilized appropriately. However, opportunities are currently not fully utilized, particularly at the intersection of data science and health. Health-related big data and technological advances have the potential to significantly aid the fight against such pandemics, including the current pandemic’s ongoing and long-term impacts. Specifically, the field of natural language processing (NLP) has enormous potential at a time when vast amounts of text-based data are continuously generated from a multitude of sources, such as health/hospital systems, published medical literature, and social media. Effectively mitigating the impacts of the pandemic requires tackling challenges associated with the application and deployment of NLP systems. In this paper, we review the applications of NLP to address erse aspects of the COVID-19 pandemic. We outline key NLP-related advances on a chosen set of topics reported in the literature and discuss the opportunities and challenges associated with applying NLP during the current pandemic and future ones. These opportunities and challenges can guide future research aimed at improving the current health and social response systems and pandemic preparedness.

Publication

Text Classification Models for the Automatic Detection of Nonmedical Prescription Medication Use from Social Media

Publisher: Research Square Platform LLC

Date: 12-01-2021

DOI: 10.21203/RS.3.RS-58679/V2

Abstract: Background Prescription medication (PM) misuse/abuse has emerged as a national crisis in the United States, and social media has been suggested as a potential resource for performing active monitoring. However, automating a social media-based monitoring system is challenging—requiring advanced natural language processing (NLP) and machine learning methods. In this paper, we describe the development and evaluation of automatic text classification models for detecting self-reports of PM abuse from Twitter. Methods We experimented with state-of-the-art bi-directional transformer-based language models, which utilize tweet-level representations that enable transfer learning (e.g., BERT, RoBERTa, XLNet, AlBERT, and DistilBERT), proposed fusion-based approaches, and compared the developed models with several traditional machine learning, including deep learning, approaches. Using a public dataset, we evaluated the performances of the classifiers on their abilities to classify the non-majority “abuse/misuse” class. Results Our proposed fusion-based model performs significantly better than the best traditional model (F1-score [95% CI]: 0.67 [0.64-0.69] vs. 0.45 [0.42-0.48]). We illustrate, via experimentation using differing training set sizes, that the transformer-based models are more stable and require less annotated data compared to the other models. The significant improvements achieved by our best-performing classification model over past approaches makes it suitable for automated continuous monitoring of nonmedical PM use from Twitter. Conclusions BERT, BERT-like and fusion-based models not only outperform traditional machine learning and deep learning models, but also show substantial improvements over many years of past research on the topic of prescription medication misuse/abuse classification from social media, which had been shown to be a complex task due to the unique ways in which information about nonmedical use is presented. Several challenges, such as lack of complete context and the nature of social media language, need to be overcome to further improve BERT and BERT-like models. These experimental driven challenges are represented as potential future research directions.

Publication

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid

Publisher: JMIR Publications Inc.

Date: 03-05-2021

DOI: 10.2196/26616

Abstract: The wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers’ perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the ersity of content in social media chatter. This study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services using Medicaid, the single largest source of health coverage in the United States, as an ex le. We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website’s search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a s le of tweets in 5 predetermined categories or other and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods. We manually annotated 11,379 tweets (Corpus 1: 9179 Corpus 2: 2200) and used 7930 (69.7%) for training, 1449 (12.7%) for validation, and 2000 (17.6%) for testing. A classifier based on BERT obtained the highest accuracies (81.7%, Corpus 1 80.7%, Corpus 2) and F1 scores on consumer feedback (0.58, Corpus 1 0.90, Corpus 2), outperforming the second best classifiers in terms of accuracy (74.6%, RF on Corpus 1 69.4%, RF on Corpus 2) and F1 score on consumer feedback (0.44, NN on Corpus 1 0.82, RF on Corpus 2). Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78%) and consumer feedback (15073/27337, 55.14%) tweets being the most frequent for Corpus 1 and Corpus 2, respectively. The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies.

Publication

Large-Scale Social Media Analysis Reveals Emotions Associated with Nonmedical Prescription Drug Use

Publisher: American Association for the Advancement of Science (AAAS)

Date: 2022

DOI: 10.34133/2022/9851989

Abstract: Background. The behaviors and emotions associated with and reasons for nonmedical prescription drug use (NMPDU) are not well-captured through traditional instruments such as surveys and insurance claims. Publicly available NMPDU-related posts on social media can potentially be leveraged to study these aspects unobtrusively and at scale. Methods. We applied a machine learning classifier to detect self-reports of NMPDU on Twitter and extracted all public posts of the associated users. We analyzed approximately 137 million posts from 87,718 Twitter users in terms of expressed emotions, sentiments, concerns, and possible reasons for NMPDU via natural language processing. Results. Users in the NMPDU group express more negative emotions and less positive emotions, more concerns about family, the past, and body, and less concerns related to work, leisure, home, money, religion, health, and achievement compared to a control group (i.e., users who never reported NMPDU). NMPDU posts tend to be highly polarized, indicating potential emotional triggers. Gender-specific analyses show that female users in the NMPDU group express more content related to positive emotions, anticipation, sadness, joy, concerns about family, friends, home, health, and the past, and less about anger than males. The findings are consistent across distinct prescription drug categories (opioids, benzodiazepines, stimulants, and polysubstance). Conclusion. Our analyses of large-scale data show that substantial differences exist between the texts of the posts from users who self-report NMPDU on Twitter and those who do not, and between males and females who report NMPDU. Our findings can enrich our understanding of NMPDU and the population involved.

Publication

The Early Detection of Fraudulent COVID-19 Products From Twitter Chatter: Data Set and Baseline Approach Using Anomaly Detection (Preprint)

Publisher: JMIR Publications Inc.

Date: 20-10-2022

DOI: 10.2196/PREPRINTS.43694

Abstract: ocial media has served as a lucrative platform for spreading misinformation and for promoting fraudulent products for the treatment, testing, and prevention of COVID-19. This has resulted in the issuance of many warning letters by the US Food and Drug Administration (FDA). While social media continues to serve as the primary platform for the promotion of such fraudulent products, it also presents the opportunity to identify these products early by using effective social media mining methods. ur objectives were to (1) create a data set of fraudulent COVID-19 products that can be used for future research and (2) propose a method using data from Twitter for automatically detecting heavily promoted COVID-19 products early. e created a data set from FDA-issued warnings during the early months of the COVID-19 pandemic. We used natural language processing and time-series anomaly detection methods for automatically detecting fraudulent COVID-19 products early from Twitter. Our approach is based on the intuition that increases in the popularity of fraudulent products lead to corresponding anomalous increases in the volume of chatter regarding them. We compared the anomaly signal generation date for each product with the corresponding FDA letter issuance date. We also performed a brief manual analysis of chatter associated with 2 products to characterize their contents. DA warning issue dates ranged from March 6, 2020, to June 22, 2021, and 44 key phrases representing fraudulent products were included. From 577,872,350 posts made between February 19 and December 31, 2020, which are all publicly available, our unsupervised approach detected 34 out of 44 (77.3%) signals about fraudulent products earlier than the FDA letter issuance dates, and an additional 6 (13.6%) within a week following the corresponding FDA letters. Content analysis revealed i misinformation /i , i information /i , i olitical, /i and i conspiracy theories /i to be prominent topics. ur proposed method is simple, effective, easy to deploy, and does not require high-performance computing machinery unlike deep neural network–based methods. The method can be easily extended to other types of signal detection from social media data. The data set may be used for future research and the development of more advanced methods.

Mohammed Al-Garadi

Researcher

Publications

Defining Patient-Oriented Natural Language Processing: A New Paradigm for Research and Development to Facilitate Adoption and Use by Medical Experts (Preprint)

Defining Patient-Oriented Natural Language Processing: A New Paradigm for Research and Development to Facilitate Adoption and Use by Medical Experts

Can accurate demographic information about people who use prescription medications nonmedically be derived from Twitter?

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid (Preprint)

Text classification models for the automatic detection of nonmedical prescription medication use from social media

The Early Detection of Fraudulent COVID-19 Products From Twitter Chatter: Data Set and Baseline Approach Using Anomaly Detection

The Role of Natural Language Processing during the COVID-19 Pandemic: Health Applications, Opportunities, and Challenges

Text Classification Models for the Automatic Detection of Nonmedical Prescription Medication Use from Social Media

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid

Large-Scale Social Media Analysis Reveals Emotions Associated with Nonmedical Prescription Drug Use

The Early Detection of Fraudulent COVID-19 Products From Twitter Chatter: Data Set and Baseline Approach Using Anomaly Detection (Preprint)

Related Organisations

University Of California, San Diego

Emory University

Vanderbilt University Medical Center

Related Funding Activities

Mohammed Al-Garadi

Researcher

Related Links

Publications

Defining Patient-Oriented Natural Language Processing: A New Paradigm for Research and Development to Facilitate Adoption and Use by Medical Experts (Preprint)

Defining Patient-Oriented Natural Language Processing: A New Paradigm for Research and Development to Facilitate Adoption and Use by Medical Experts

Can accurate demographic information about people who use prescription medications nonmedically be derived from Twitter?

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid (Preprint)

Text classification models for the automatic detection of nonmedical prescription medication use from social media

The Early Detection of Fraudulent COVID-19 Products From Twitter Chatter: Data Set and Baseline Approach Using Anomaly Detection

The Role of Natural Language Processing during the COVID-19 Pandemic: Health Applications, Opportunities, and Challenges

Text Classification Models for the Automatic Detection of Nonmedical Prescription Medication Use from Social Media

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid

Large-Scale Social Media Analysis Reveals Emotions Associated with Nonmedical Prescription Drug Use

The Early Detection of Fraudulent COVID-19 Products From Twitter Chatter: Data Set and Baseline Approach Using Anomaly Detection (Preprint)

Related Organisations

University Of California, San Diego

Emory University

Vanderbilt University Medical Center

Related Funding Activities

ARDC NEWSLETTER SIGNUP