ORCID Profile
0000-0001-7358-544X
Current Organisations
The University of Auckland, Emory University, Skin Cancer Doctors, The Insides Company, Skin Cancer College Australasia, Northland DHB
In Research Link Australia (RLA), "Research Topics" refer to ANZSRC FOR and SEO codes. These topics are either sourced from ANZSRC FOR and SEO codes listed in researchers' related grants or generated by a large language model (LLM) based on their publications.
Theoretical Physics | Optics And Opto-Electronic Physics | Theoretical and Computational Chemistry | Quantum Chemistry
Telecommunications | Chemical sciences | Physical sciences
Publisher: Springer Science and Business Media LLC
Date: 07-08-2023
DOI: 10.1186/S44247-023-00029-W
Abstract: Substance use, including the non-medical use of prescription medications, is a global health problem resulting in hundreds of thousands of overdose deaths and other health problems. Social media has emerged as a potent source of information for studying substance use-related behaviours and their consequences. Mining large-scale social media data on the topic requires the development of natural language processing (NLP) and machine learning frameworks customized for this problem. Our objective in this research is to develop a framework for conducting a content analysis of Twitter chatter about the non-medical use of a set of prescription medications. We collected Twitter data for four medications: fentanyl and morphine (opioids), alprazolam (benzodiazepine), and Adderall® (stimulant), and identified posts that indicated non-medical use using an automatic machine learning classifier. In our NLP framework, we applied supervised named entity recognition (NER) to identify other substances mentioned, symptoms, and adverse events. We applied unsupervised topic modelling to identify latent topics associated with the chatter for each medication. The quantitative analysis demonstrated the performance of the proposed NER approach in identifying substance-related entities from data with a high degree of accuracy compared to the baseline methods. The performance evaluation of the topic modelling was also notable. The qualitative analysis revealed knowledge about the use, non-medical use, and side effects of these medications in individuals and communities. NLP-based analyses of Twitter chatter associated with prescription medications belonging to different categories provide multi-faceted insights about their use and consequences. Our developed framework can be applied to chatter about other substances. Further research can validate the predictive value of this information on the prevention, assessment, and management of these disorders.
Publisher: Springer Science and Business Media LLC
Date: 21-12-2015
Publisher: Oxford University Press (OUP)
Date: 27-09-2019
DOI: 10.1093/JAMIA/OCZ156
Abstract: Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step toward incorporating Twitter data in pharmacoepidemiologic research is to automatically recognize medication mentions in tweets. Given that lexical searches for medication names suffer from low recall due to misspellings or ambiguity with common words, we propose a more advanced method to recognize them. We present Kusuri, an Ensemble Learning classifier able to identify tweets mentioning drug products and dietary supplements. Kusuri (薬, “medication” in Japanese) is composed of 2 modules: first, 4 different classifiers (lexicon based, spelling variant based, pattern based, and a weakly trained neural network) are applied in parallel to discover tweets potentially containing medication names; second, an ensemble of deep neural networks encoding morphological, semantic, and long-range dependencies of important words in the tweets makes the final decision. On a class-balanced (50-50) corpus of 15 005 tweets, Kusuri demonstrated performances close to human annotators with an F1 score of 93.7%, the best score achieved thus far on this corpus. On a corpus made of all tweets posted by 112 Twitter users (98 959 tweets, with only 0.26% mentioning medications), Kusuri obtained an F1 score of 78.8%. To the best of our knowledge, Kusuri is the first system to achieve this score on such an extremely imbalanced dataset. The system identifies tweets mentioning drug names with performance high enough to ensure its usefulness, and is ready to be integrated in pharmacovigilance, toxicovigilance, or more generally, public health pipelines that depend on medication name mentions.
Publisher: JMIR Publications Inc.
Date: 17-08-2020
DOI: 10.2196/18401
Abstract: Twitter is a potentially valuable tool for public health officials and state Medicaid programs in the United States, which provide public health insurance to 72 million Americans. We aim to characterize how Medicaid agencies and managed care organization (MCO) health plans are using Twitter to communicate with the public. Using Twitter’s public application programming interface, we collected 158,714 public posts (“tweets”) from active Twitter profiles of state Medicaid agencies and MCOs, spanning March 2014 through June 2019. Manual content analyses identified 5 broad categories of content, and these coded tweets were used to train supervised machine learning algorithms to classify all collected posts. We identified 15 state Medicaid agencies and 81 Medicaid MCOs on Twitter. The mean number of followers was 1784, the mean number of those followed was 542, and the mean number of posts was 2476. Approximately 39% of tweets came from just 10 accounts. Of all posts, 39.8% (63,168/158,714) were classified as general public health education and outreach; 23.5% (n=37,298) were about specific Medicaid policies, programs, services, or events; 18.4% (n=29,203) were organizational promotion of staff and activities; and 11.6% (n=18,411) contained general news and news links. Only 4.5% (n=7142) of posts were responses to specific questions, concerns, or complaints from the public. Twitter has the potential to enhance community building, beneficiary engagement, and public health outreach, but appears to be underutilized by the Medicaid program.
Publisher: Cold Spring Harbor Laboratory
Date: 20-06-2020
DOI: 10.1101/2020.06.19.20135962
Abstract: Methadone and buprenorphine-naloxone (Suboxone®) have been discussed and compared extensively in the medical literature as effective treatments for opioid use disorder (OUD). While the evidence basis for the use of these medications is very favorable, less is known about the perceptions of these medications within the general public. This study aimed to use social media, specifically Twitter, to assess the public perception of these medications, and to compare the discussion content between each medication based on theme, subtheme, and sentiment. We conducted a mixed methods descriptive study analyzing individual microposts (“tweets”) that mentioned “methadone” or “suboxone”. We then categorized these tweets into themes and subthemes, as well as by sentiment and personal experience, and compared the information posted about these two medications, including in tweets that mentioned both medications. We analyzed 900 tweets, most of which related to access (13.8% for methadone; 12.9% for suboxone®), stigma (15.3%; 14.0%), and OUD treatment (11.5%; 5.4%). Only a small proportion of tweets (16.4% for suboxone® and 9.3% for methadone) expressed positive sentiments about the medications, with few tweets describing personal experiences. Tweets mentioning both medications primarily discussed MOUD in general, rather than comparing the two medications directly. Twitter content about methadone and suboxone is similar, with the same major themes and similar sub-themes. Despite the proven effectiveness of these medications, there was little dialogue related to their benefits or efficacy in the treatment of opioid use disorder. Perceptions of these medications may contribute to their underutilization in combatting opioid use disorder.
Publisher: Elsevier BV
Date: 05-2023
Publisher: Cold Spring Harbor Laboratory
Date: 18-06-2021
DOI: 10.1101/2021.06.15.21259004
Abstract: To mine Reddit to discover long-COVID symptoms self-reported by users, compare symptom distributions across studies, and create a symptom lexicon. We retrieved posts from the /r/covidlonghaulers subreddit and extracted symptoms via approximate matching using an expanded meta-lexicon. We mapped the extracted symptoms to standard concept IDs, compared their distributions with those reported in recent literature and analyzed their distributions over time. From 42,995 posts by 4249 users, we identified 1744 users who expressed at least 1 symptom. The most frequently reported long-COVID symptoms were mental health-related symptoms (55.2%), fatigue (51.2%), general ache/pain (48.4%), brain fog/confusion (32.8%) and dyspnea (28.9%) amongst users reporting at least 1 symptom. Comparison with recent literature revealed a large variance in reported symptoms across studies. Temporal analysis showed several persistent symptoms up to 15 months after infection. The spectrum of symptoms identified from Reddit may provide early insights about long-COVID.
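The lexicon-based approximate matching described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three-entry lexicon, the function names, the 0.85 threshold, and the use of Python's standard-library difflib as the similarity measure. The abstract does not state which matching algorithm or lexicon format the authors used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-lexicon: surface expression -> concept label.
# The study's actual meta-lexicon is far larger; these entries are illustrative.
LEXICON = {
    "fatigue": "fatigue",
    "brain fog": "brain fog/confusion",
    "shortness of breath": "dyspnea",
}


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def extract_symptoms(text: str, threshold: float = 0.85) -> set:
    """Return concept labels whose lexicon expression approximately
    matches some word n-gram in the text (tolerates misspellings)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    found = set()
    for expression, concept in LEXICON.items():
        n = len(expression.split())
        # Slide an n-word window over the post and compare it to the expression.
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            if similarity(window, expression) >= threshold:
                found.add(concept)
                break
    return found


if __name__ == "__main__":
    print(extract_symptoms("Constant fatige and brain fogg for months."))
```

A fixed similarity threshold trades recall for precision: lowering it catches more misspellings but also more false matches, so in practice it would need tuning against annotated posts.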
Publisher: Springer Science and Business Media LLC
Date: 30-08-2018
Publisher: Oxford University Press (OUP)
Date: 27-06-2018
DOI: 10.1093/BIOINFORMATICS/BTY273
Abstract: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER. Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER’s capability to embed external features to further boost the system’s performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.
Publisher: Wiley
Date: 10-02-2020
DOI: 10.1111/CODI.14957
Publisher: JMIR Publications Inc.
Date: 28-09-2021
DOI: 10.2196/18471
Abstract: The capabilities of natural language processing (NLP) methods have expanded significantly in recent years, and progress has been particularly driven by advances in data science and machine learning. However, NLP is still largely underused in patient-oriented clinical research and care (POCRC). A key reason behind this is that clinical NLP methods are typically developed, optimized, and evaluated with narrowly focused data sets and tasks (eg, those for the detection of specific symptoms in free texts). Such research and development (R&D) approaches may be described as problem oriented, and the developed systems perform specialized tasks well. As standalone systems, however, they generally do not comprehensively meet the needs of POCRC. Thus, there is often a gap between the capabilities of clinical NLP methods and the needs of patient-facing medical experts. We believe that to increase the practical use of biomedical NLP, future R&D efforts need to be broadened to a new research paradigm, one that explicitly incorporates characteristics that are crucial for POCRC. We present our viewpoint about 4 such interrelated characteristics that can increase NLP systems’ suitability for POCRC (3 that represent NLP system properties and 1 associated with the R&D process): (1) interpretability (the ability to explain system decisions), (2) patient centeredness (the capability to characterize diverse patients), (3) customizability (the flexibility for adapting to distinct settings, problems, and cohorts), and (4) multitask evaluation (the validation of system performance based on multiple tasks involving heterogeneous data sets). By using the NLP task of clinical concept detection as an example, we detail these characteristics and discuss how they may result in the increased uptake of NLP systems for POCRC.
Publisher: Informa UK Limited
Date: 02-06-2016
Publisher: JMIR Publications Inc.
Date: 14-03-2023
DOI: 10.2196/43694
Abstract: Social media has served as a lucrative platform for spreading misinformation and for promoting fraudulent products for the treatment, testing, and prevention of COVID-19. This has resulted in the issuance of many warning letters by the US Food and Drug Administration (FDA). While social media continues to serve as the primary platform for the promotion of such fraudulent products, it also presents the opportunity to identify these products early by using effective social media mining methods. Our objectives were to (1) create a data set of fraudulent COVID-19 products that can be used for future research and (2) propose a method using data from Twitter for automatically detecting heavily promoted COVID-19 products early. We created a data set from FDA-issued warnings during the early months of the COVID-19 pandemic. We used natural language processing and time-series anomaly detection methods for automatically detecting fraudulent COVID-19 products early from Twitter. Our approach is based on the intuition that increases in the popularity of fraudulent products lead to corresponding anomalous increases in the volume of chatter regarding them. We compared the anomaly signal generation date for each product with the corresponding FDA letter issuance date. We also performed a brief manual analysis of chatter associated with 2 products to characterize their contents. FDA warning issue dates ranged from March 6, 2020, to June 22, 2021, and 44 key phrases representing fraudulent products were included. From 577,872,350 posts made between February 19 and December 31, 2020, which are all publicly available, our unsupervised approach detected 34 out of 44 (77.3%) signals about fraudulent products earlier than the FDA letter issuance dates, and an additional 6 (13.6%) within a week following the corresponding FDA letters. Content analysis revealed misinformation, information, political, and conspiracy theories to be prominent topics. 
Our proposed method is simple, effective, easy to deploy, and does not require high-performance computing machinery unlike deep neural network–based methods. The method can be easily extended to other types of signal detection from social media data. The data set may be used for future research and the development of more advanced methods.
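The intuition in this abstract, that a surge in chatter volume about a product shows up as an anomaly in its daily post counts, can be sketched as follows. The abstract does not name the specific anomaly detection algorithm, so the trailing-window z-score rule, the function name, and the parameter values here are assumptions chosen only because they are simple and need no special hardware.

```python
from statistics import mean, stdev


def first_anomaly(counts, window=7, z_threshold=3.0):
    """Return the index of the first day whose count exceeds the mean of the
    preceding `window` days by more than `z_threshold` standard deviations,
    or None if no such day exists. Illustrative rule, not the paper's method."""
    for t in range(window, len(counts)):
        history = counts[t - window:t]
        mu = mean(history)
        # Floor the spread at 1.0 so a perfectly flat history cannot
        # cause a division by zero or flag a negligible change.
        sigma = max(stdev(history), 1.0)
        if (counts[t] - mu) / sigma > z_threshold:
            return t
    return None


if __name__ == "__main__":
    # Hypothetical daily mention counts for one key phrase.
    daily_mentions = [10, 12, 11, 9, 10, 11, 12, 10, 11, 60, 58]
    print(first_anomaly(daily_mentions))
```

In an evaluation like the one described, the date corresponding to the returned index would be compared with the FDA warning letter date for the same product.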
Publisher: Association for Computational Linguistics
Date: 2017
DOI: 10.18653/V1/W17-2316
Publisher: Cold Spring Harbor Laboratory
Date: 30-09-2021
DOI: 10.1101/2021.09.28.21264253
Abstract: Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performances in many natural language processing (NLP) tasks. There is a need to benchmark such models for targeted NLP tasks, and to explore effective pretraining strategies to improve machine learning performance. In this work, we addressed the task of health-related social media text classification. We benchmarked five models (RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 tasks. We attempted to boost performance for the best models by comparing distinct pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and topic-specific pretraining (TSPT). RoBERTa and BERTweet performed comparably in most tasks, and better than others. For pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models, and significantly outperformed DAPT. SAPT+TSPT showed consistently high performance, with statistically significant improvement in one task. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and extended pretraining using SAPT and TSPT can further improve performance. Source code for our model and data preprocessing is available under the GitHub repository guo0102/transformer_dapt_sapt_tapt. Datasets must be obtained from original sources, as described in supplementary material. Supplementary data are available at Bioinformatics online.
Publisher: JMIR Publications Inc.
Date: 18-12-2020
Abstract: The wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers’ perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the diversity of content in social media chatter. This study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services using Medicaid, the single largest source of health coverage in the United States, as an example. We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website’s search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a sample of tweets in 5 predetermined categories or “other” and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods. We manually annotated 11,379 tweets (Corpus 1: 9179; Corpus 2: 2200) and used 7930 (69.7%) for training, 1449 (12.7%) for validation, and 2000 (17.6%) for testing.
A classifier based on BERT obtained the highest accuracies (81.7%, Corpus 1; 80.7%, Corpus 2) and F1 scores on consumer feedback (0.58, Corpus 1; 0.90, Corpus 2), outperforming the second best classifiers in terms of accuracy (74.6%, RF on Corpus 1; 69.4%, RF on Corpus 2) and F1 score on consumer feedback (0.44, NN on Corpus 1; 0.82, RF on Corpus 2). Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78%) and consumer feedback (15073/27337, 55.14%) tweets being the most frequent for Corpus 1 and Corpus 2, respectively. The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies.
Publisher: IEEE
Date: 12-2010
Publisher: Elsevier BV
Date: 02-2021
Publisher: Springer Science and Business Media LLC
Date: 25-05-2022
DOI: 10.1186/S12954-022-00628-2
Abstract: Despite recent rises in fatal overdoses involving multiple substances, there is a paucity of knowledge about stimulant co-use patterns among people who use opioids (PWUO) or people being treated with medications for opioid use disorder (PTMOUD). A better understanding of the timing and patterns in stimulant co-use among PWUO based on mentions of these substances on social media can help inform prevention programs, policy, and future research directions. This study examines stimulant co-mention trends among PWUO/PTMOUD on social media over multiple years. We collected publicly available data from 14 forums on Reddit (subreddits) that focused on prescription and illicit opioids, and medications for opioid use disorder (MOUD). Collected data ranged from 2011 to 2020, and we also collected timelines comprising past posts from a sample of Reddit users (Redditors) on these forums. We applied natural language processing to generate lexical variants of all included prescription and illicit opioids and stimulants and detect mentions of them on the chosen subreddits. Finally, we analyzed and described trends and patterns in co-mentions. Posts collected for 13,812 Redditors showed that 12,306 (89.1%) mentioned at least 1 opioid, opioid-related medication, or stimulant. Analyses revealed that the number and proportion of Redditors mentioning both opioids and/or opioid-related medications and stimulants steadily increased over time. Relative rates of co-mentions by the same Redditor of heroin and methamphetamine, the substances most commonly co-mentioned, decreased in recent years, while co-mentions of both fentanyl and MOUD with methamphetamine increased. Our analyses reflect increasing mentions of stimulants, particularly methamphetamine, among PWUO/PTMOUD, which closely resembles the growth in overdose deaths involving both opioids and stimulants.
These findings are consistent with recent reports suggesting increasing stimulant use among people receiving treatment for opioid use disorder. These data offer insights on emerging trends in the overdose epidemic and underscore the importance of scaling efforts to address co-occurring opioid and stimulant use including harm reduction and comprehensive healthcare access spanning mental-health services and substance use disorder treatment.
Publisher: Scitechnol Biosoft Pvt. Ltd.
Date: 2012
Publisher: Oxford University Press (OUP)
Date: 09-03-2015
DOI: 10.1093/JAMIA/OCU041
Abstract: Objective: Social media is becoming increasingly popular as a platform for sharing personal health-related information. This information can be utilized for public health monitoring tasks, particularly for pharmacovigilance, via the use of natural language processing (NLP) techniques. However, the language in social media is highly informal, and user-expressed medical concepts are often nontechnical, descriptive, and challenging to extract. There has been limited progress in addressing these challenges, and thus far, advanced machine learning-based NLP techniques have been underutilized. Our objective is to design a machine learning-based approach to extract mentions of adverse drug reactions (ADRs) from highly informal text in social media. Methods: We introduce ADRMine, a machine learning-based concept extraction system that uses conditional random fields (CRFs). ADRMine utilizes a variety of features, including a novel feature for modeling words’ semantic similarities. The similarities are modeled by clustering words based on unsupervised, pretrained word representation vectors (embeddings) generated from unlabeled user posts in social media using a deep learning technique. Results: ADRMine outperforms several strong baseline systems in the ADR extraction task by achieving an F-measure of 0.82. Feature analysis demonstrates that the proposed word cluster features significantly improve extraction performance. Conclusion: It is possible to extract complex medical concepts, with relatively high performance, from informal, user-generated content. Our approach is particularly scalable, suitable for social media mining, as it relies on large volumes of unlabeled data, thus diminishing the need for large, annotated training data sets.
Publisher: MDPI AG
Date: 24-06-2022
Abstract: Recently, hybrid fillers have been widely used to improve the properties of biopolymers. The synergistic effects of the hybrid fillers can have a positive impact on biopolymers, including thermoplastic corn starch film (TPCS). In this communication, we highlight the effectiveness of hybrid fillers in inhibiting the aging process of TPCS. The TPCS, thermoplastic corn starch composite films (TPCS-C), and hybrid thermoplastic corn starch composite film (TPCS-HC) were stored for 3 months to study the effect of hybrid filler on the starch retrogradation. TPCS-C and TPCS-HC were prepared by casting method with 5 wt% of fillers: nanocellulose (NC) and bentonite (BT). The alteration of the mechanical properties, aging behavior, and crystalline structure of the films were analyzed through the tensile test, Fourier transform infrared (FTIR), X-ray diffraction (XRD), differential scanning calorimetry (DSC), and water absorption analysis. The obtained data were correlated to each other to analyze the retrogradation of the TPCS, which is the main factor that contributes to the aging process of the biopolymer. Results signify that incorporating the hybrid filler (NC + BT) in the TPCS/4BT1NC films has effectively prevented retrogradation of the starch molecules after being stored for 3 months. On the contrary, the virgin TPCS film showed the highest degree of retrogradation resulting in a significant decrement in the film’s flexibility. These findings proved the capability of the green hybrid filler in inhibiting the aging of the TPCS.
Publisher: JMIR Publications Inc.
Date: 03-05-2023
DOI: 10.2196/48710
Publisher: Wiley
Date: 10-02-2020
DOI: 10.1111/ANS.15421
Publisher: Ovid Technologies (Wolters Kluwer Health)
Date: 04-07-2023
Abstract: The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing–based machine learning models to automatically detect Fontan cases from free texts in electronic health records, and compare their performances with ICD code–based classification. We included free‐text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non‐Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer‐based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code–based classification on 20% of the held‐out patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79–0.83), 0.95 (95% CI, 0.92–0.97), and 0.89 (95% CI, 0.88–0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<.05), and both natural language processing models outperformed ICD code–based classification (P<.05). The sliding window strategy improved performance over the base model (P<.05) but did not outperform support vector machines. ICD code–based classification produced more false positives. Natural language processing models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes, and the former demonstrated the possibility of further improvement.
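One common way to apply a length-limited transformer to long clinical notes is to split the token sequence into overlapping windows, score each window, and aggregate the scores. The sketch below shows that general idea only. The abstract does not describe the authors' strategy in detail, so the window size, the stride, the max-score aggregation, and the stand-in scoring function are all assumptions, and no real model is loaded.

```python
def sliding_windows(tokens, max_len=512, stride=256):
    """Split a token list into overlapping windows of at most `max_len`
    tokens, advancing by `stride` tokens each step."""
    if max_len <= 0 or not 0 < stride <= max_len:
        raise ValueError("require max_len > 0 and 0 < stride <= max_len")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the note
        start += stride
    return windows


def classify_document(tokens, score_window, max_len=512, stride=256, threshold=0.5):
    """Label a document positive if any window's score reaches `threshold`.
    `score_window` stands in for a model call returning P(positive)."""
    scores = [score_window(w) for w in sliding_windows(tokens, max_len, stride)]
    return max(scores) >= threshold


if __name__ == "__main__":
    # Toy scorer (hypothetical): positive only if the window contains token 7.
    toy_scorer = lambda window: 1.0 if 7 in window else 0.0
    print(sliding_windows(list(range(10)), max_len=4, stride=2))
    print(classify_document(list(range(10)), toy_scorer, max_len=4, stride=2))
```

Taking the maximum over windows suits a task where a single mention anywhere in the notes is decisive; averaging the window scores would be an alternative when evidence is spread across the text.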
Publisher: Informa UK Limited
Date: 24-08-2020
Publisher: Informa UK Limited
Date: 03-06-2022
DOI: 10.1080/17538157.2022.2082297
Abstract: Use of mobile health applications (mHealth apps) is becoming increasingly popular for the management of chronic illnesses, but mHealth-based intervention studies often have limitations associated with subject recruitment and retention. In this synopsis, we focus on targeted aspects of mHealth-based intervention studies, specifically: (i) subject recruitment, (ii) cohort sizes, and (iii) retention rates. We used the Google Scholar (meta-search) and Galileo search engines to identify sample articles focusing on
Publisher: American Medical Association (AMA)
Date: 06-11-2019
Publisher: Informa UK Limited
Date: 03-04-2023
Publisher: OMICS Publishing Group
Date: 10-2012
Publisher: JMIR Publications Inc.
Date: 28-02-2020
Abstract: The capabilities of natural language processing (NLP) methods have expanded significantly in recent years, and progress has been particularly driven by advances in data science and machine learning. However, NLP is still largely underused in patient-oriented clinical research and care (POCRC). A key reason behind this is that clinical NLP methods are typically developed, optimized, and evaluated with narrowly focused data sets and tasks (eg, those for the detection of specific symptoms in free texts). Such research and development (R&D) approaches may be described as problem oriented, and the developed systems perform specialized tasks well. As standalone systems, however, they generally do not comprehensively meet the needs of POCRC. Thus, there is often a gap between the capabilities of clinical NLP methods and the needs of patient-facing medical experts. We believe that to increase the practical use of biomedical NLP, future R&D efforts need to be broadened to a new research paradigm, one that explicitly incorporates characteristics that are crucial for POCRC. We present our viewpoint about 4 such interrelated characteristics that can increase NLP systems’ suitability for POCRC (3 that represent NLP system properties and 1 associated with the R&D process): (1) interpretability (the ability to explain system decisions), (2) patient centeredness (the capability to characterize diverse patients), (3) customizability (the flexibility for adapting to distinct settings, problems, and cohorts), and (4) multitask evaluation (the validation of system performance based on multiple tasks involving heterogeneous data sets). By using the NLP task of clinical concept detection as an example, we detail these characteristics and discuss how they may result in the increased uptake of NLP systems for POCRC.
Publisher: Oxford University Press (OUP)
Date: 04-07-2020
Abstract: To mine Twitter and quantitatively analyze COVID-19 symptoms self-reported by users, compare symptom distributions across studies, and create a symptom lexicon for future research. We retrieved tweets using COVID-19-related keywords, and performed semiautomatic filtering to curate self-reports of positive-tested users. We extracted COVID-19-related symptoms mentioned by the users, mapped them to standard concept IDs in the Unified Medical Language System, and compared the distributions to those reported in early studies from clinical settings. We identified 203 positive-tested users who reported 1002 symptoms using 668 unique expressions. The most frequently-reported symptoms were fever/pyrexia (66.1%), cough (57.9%), body ache/pain (42.7%), fatigue (42.1%), headache (37.4%), and dyspnea (36.3%) amongst users who reported at least 1 symptom. Mild symptoms, such as anosmia (28.7%) and ageusia (28.1%), were frequently reported on Twitter, but not in clinical studies. The spectrum of COVID-19 symptoms identified from Twitter may complement those identified in clinical settings.
Publisher: Association for Computational Linguistics
Date: 2021
Publisher: Elsevier BV
Date: 04-2015
Publisher: Research Square Platform LLC
Date: 17-01-2022
DOI: 10.21203/RS.3.RS-1255278/V1
Abstract: Background: Despite recent increasing focus on fatal overdoses involving multiple substances, there is a paucity of knowledge about stimulant co-use patterns among people who use opioids (PWUO) or people being treated with medication for opioid use disorder (PTMOUD). This study examines stimulant co-mention trends among PWUO/PTMOUD on social media. Methods: We collected publicly-available data from 14 prescription and illicit opioid and MOUD-related forums on Reddit (subreddits) between 2011-2020 and timelines comprising past posts from a sample of Reddit users (Redditors) on these forums. We applied natural language processing to detect mentions of opioids, opioid-related medications, and stimulants and described trends and patterns in co-mentions. Results: Posts collected for 13,812 Redditors indicated 12,306 (89.1%) mentioned ≥1 opioid, opioid-related medication or stimulant. Analyses showed the number and proportion of Redditors mentioning both opioids and/or opioid-related medications and stimulants steadily increased over time. Relative rates of co-mentions of heroin and methamphetamine, substances most commonly co-mentioned, decreased in recent years while those of fentanyl and MOUD with methamphetamine increased. Conclusion: Data from Reddit reflect increasing mentions of stimulants, particularly methamphetamine, among PWUO/PTMOUD and closely resemble the growth in overdose deaths involving both opioids and stimulants. These findings are consistent with recent reports suggesting increasing stimulant use among people receiving treatment for opioid use disorder. These data offer insights on emerging trends in the overdose epidemic and underscore the importance of scaling efforts to address co-occurring opioid and stimulant use including harm reduction and comprehensive healthcare access spanning mental-health services and substance use disorder treatment.
Publisher: MDPI AG
Date: 15-03-2021
Abstract: Thermoplastic starch (TPS) hybrid bio-composite films containing microcrystalline cellulose (C) and nano-bentonite (B) as hybrid fillers were studied to replace the conventional non-degradable plastic in packaging applications. Raw oil palm empty fruit bunch (OPEFB) was subjected to chemical treatment and acid hydrolysis to obtain C filler. B filler was ultra-sonicated for better dispersion in the TPS films to improve the filler–matrix interactions. The morphology and structure of fillers were characterized by scanning electron microscope (SEM), Fourier transform infrared spectroscopy (FTIR) and X-ray diffraction (XRD). TPS hybrid bio-composite films were produced by the casting method with different ratios of B and C fillers. The best ratio of B/C was determined through the data of the tensile test. FTIR analysis proved the molecular interactions between the TPS and the hybrid fillers due to the presence of polar groups in their structure. XRD analysis confirmed the intercalation of the TPS chains between the B inter-platelets as a result of well-developed interactions between the TPS and hybrid fillers. SEM images suggested that more plastic deformation occurred in the fractured surface of the TPS hybrid bio-composite film due to the higher degree of stretching after being subjected to tensile loading. Overall, the results indicate that incorporating the hybrid B/C fillers could tremendously improve the mechanical properties of the films. The best ratio of B/C in the TPS was found to be 4:1, in which the tensile strength (8.52 MPa), Young’s modulus (42.0 MPa), elongation at break (116.4%) and tensile toughness of the film were increased by 92%, 146%, 156% and 338%, respectively. The significantly improved strength, modulus, flexibility and toughness of the film indicate the benefits of using the hybrid fillers, since these features are useful for the development of sustainable flexible packaging film.
Publisher: Association for Computational Linguistics
Date: 2021
Publisher: Springer Science and Business Media LLC
Date: 10-01-2017
Publisher: arXiv
Date: 2022
Publisher: IEEE
Date: 06-2022
Publisher: Elsevier BV
Date: 02-2017
Publisher: Georg Thieme Verlag KG
Date: 2017
DOI: 10.15265/IY-2017-029
Abstract: Background: Natural Language Processing (NLP) methods are increasingly being utilized to mine knowledge from unstructured health-related texts. Recent advances in noisy text processing techniques are enabling researchers and medical domain experts to go beyond the information encapsulated in published texts (e.g., clinical trials and systematic reviews) and structured questionnaires, and obtain perspectives from other unstructured sources such as Electronic Health Records (EHRs) and social media posts. Objectives: To review the recently published literature discussing the application of NLP techniques for mining health-related information from EHRs and social media posts. Methods: Literature review included the research published over the last five years based on searches of PubMed, conference proceedings, and the ACM Digital Library, as well as on relevant publications referenced in papers. We particularly focused on the techniques employed on EHRs and social media data. Results: A set of 62 studies involving EHRs and 87 studies involving social media matched our criteria and were included in this paper. We present the purposes of these studies, outline the key NLP contributions, and discuss the general trends observed in the field, the current state of research, and important outstanding problems. Conclusions: Over the recent years, there has been a continuing transition from lexical and rule-based systems to learning-based approaches, because of the growth of annotated data sets and advances in data science. For EHRs, publicly available annotated data is still scarce and this acts as an obstacle to research progress. On the contrary, research on social media mining has seen a rapid growth, particularly because the large amount of unlabeled data available via this resource compensates for the uncertainty inherent to the data. 
Effective mechanisms to filter out noise and for mapping social media expressions to standard medical concepts are crucial and latent research problems. Shared tasks and other competitive challenges have been driving factors behind the implementation of open systems, and they are likely to play an imperative role in the development of future systems.
Publisher: Elsevier BV
Date: 12-2018
Publisher: Association for Computational Linguistics
Date: 2018
DOI: 10.18653/V1/W18-5904
Publisher: Cold Spring Harbor Laboratory
Date: 27-09-2021
DOI: 10.1101/2021.09.24.21264090
Abstract: Buprenorphine is an evidence-based treatment for Opioid Use Disorder (OUD). Standard buprenorphine induction requires a period of opioid abstinence to minimize risk of precipitated opioid withdrawal (POW). Our objective was to study the impact of the increasing presence of fentanyl and its analogs in the opioid supply of the United States, on buprenorphine induction and POW, using social media data from Reddit. This is a data-driven, mixed methods study of opioid-related forums, called subreddits, on Reddit to analyze posts related to fentanyl, POW, and buprenorphine induction. The posts were collected from seven subreddits using an application programming interface for Reddit. We applied natural language processing to identify subsets of salient posts relevant to buprenorphine induction, and performed manual, qualitative, thematic analyses of them. 267,136 posts were retrieved from seven subreddits. Fentanyl mentions increased from 3 in 2013 to 3870 in 2020, and POW mentions increased from 2 (2012) to 332 (2020). Manual review of 384 POW-mentioning posts and 106 posts mentioning the ‘Bernese method’ (a microdosing induction strategy) revealed common themes and people’s experiences. Specifically, presence of fentanyl caused POWs despite long abstinence durations, and alternative induction via microdosing was frequently recommended in peer-to-peer discussions. This study found that increased social media chatter on Reddit about POW correlated with fentanyl mentions. A subset of posts described microdosing as a self-management strategy to avoid POW. Reddit posts suggest that people are utilizing these strategies to initiate buprenorphine due to challenges arising from fentanyl prevalence in the opioid supply. Increase in mentions of precipitated opioid withdrawal (POW) on Reddit from 2012 to 2021 was closely correlated with the increase in fentanyl mentions. 
Experiences of precipitated opioid withdrawal (POW) were described by individuals who reported sufficient periods of abstinence by standard buprenorphine induction protocols. People with Opioid Use Disorder (OUD) on Reddit are using and recommending microdosing strategies with buprenorphine to avoid POW. People who used fentanyl report experiencing POW following statistically longer periods of abstinence than people who use heroin.
Publisher: Springer Science and Business Media LLC
Date: 09-09-2017
Publisher: Elsevier BV
Date: 02-2015
Publisher: WORLD SCIENTIFIC
Date: 17-11-2017
Publisher: Oxford University Press (OUP)
Date: 10-2018
DOI: 10.1093/JAMIA/OCY114
Abstract: We executed the Social Media Mining for Health (SMM4H) 2017 shared tasks to enable the community-driven development and large-scale evaluation of automatic text processing methods for the classification and normalization of health-related text from social media. An additional objective was to publicly release manually annotated data. We organized 3 independent subtasks: automatic classification of self-reports of 1) adverse drug reactions (ADRs) and 2) medication consumption, from medication-mentioning tweets, and 3) normalization of ADR expressions. Training data consisted of 15 717 annotated tweets for (1), 10 260 for (2), and 6650 ADR phrases and identifiers for (3) and exhibited typical properties of social-media-based health-related texts. Systems were evaluated using 9961, 7513, and 2500 instances for the 3 subtasks, respectively. We evaluated performances of classes of methods and ensembles of system combinations following the shared tasks. Among 55 system runs, the best system scores for the 3 subtasks were 0.435 (ADR class F1-score) for subtask-1, 0.693 (micro-averaged F1-score over two classes) for subtask-2, and 88.5% (accuracy) for subtask-3. Ensembles of system combinations obtained best scores of 0.476, 0.702, and 88.7%, outperforming individual systems. Among individual systems, support vector machines and convolutional neural networks showed high performance. Performance gains achieved by ensembles of system combinations suggest that such strategies may be suitable for operational systems relying on difficult text classification tasks (eg, subtask-1). Data imbalance and lack of context remain challenges for natural language processing of social media text. Annotated data from the shared task have been made available as reference standards for future studies (10.17632/rxwfb3tysd.1).
Publisher: WORLD SCIENTIFIC
Date: 18-11-2015
Publisher: Frontiers Media SA
Date: 04-12-2020
DOI: 10.3389/FDGTH.2020.585559
Abstract: As the volume of published medical research continues to grow rapidly, staying up-to-date with the best-available research evidence regarding specific topics is becoming an increasingly challenging problem for medical experts and researchers. The current COVID19 pandemic is a good example of a topic on which research evidence is rapidly evolving. Automatic query-focused text summarization approaches may help researchers to swiftly review research evidence by presenting salient and query-relevant information from newly-published articles in a condensed manner. Typical medical text summarization approaches require domain knowledge, and the performances of such systems rely on resource-heavy medical domain-specific knowledge sources and pre-processing methods (e.g., text classification) for deriving semantic information. Consequently, these systems are often difficult to speedily customize, extend, or deploy in low-resource settings, and they are often operationally slow. In this paper, we propose a fast and simple extractive summarization approach that can be easily deployed and run, and may thus aid medical experts and researchers obtain fast access to the latest research evidence. At runtime, our system utilizes similarity measurements derived from pre-trained medical domain-specific word embeddings in addition to simple features, rather than computationally-expensive pre-processing and resource-heavy knowledge bases. Automatic evaluation using ROUGE (a summary evaluation tool) on a public dataset for evidence-based medicine shows that our system's performance, despite the simple implementation, is statistically comparable with the state-of-the-art. Extrinsic manual evaluation based on recently-released COVID19 articles demonstrates that the summarizer performance is close to human agreement, which is generally low, for extractive summarization.
Publisher: Elsevier BV
Date: 08-2016
Publisher: Cold Spring Harbor Laboratory
Date: 11-05-2022
DOI: 10.1101/2022.05.09.22274776
Abstract: Social media have served as lucrative platforms for misinformation and for promoting fraudulent products for the treatment, testing and prevention of COVID-19. This has resulted in the issuance of many warning letters by the United States Food and Drug Administration (FDA). While social media continue to serve as the primary platform for the promotion of such fraudulent products, they also present the opportunity to identify these products early by employing effective social media mining methods. In this study, we employ natural language processing and time series anomaly detection methods for automatically detecting fraudulent COVID-19 products early from Twitter. Our approach is based on the intuition that increases in the popularity of fraudulent products lead to corresponding anomalous increases in the volume of chatter regarding them. We utilized an anomaly detection method on streaming COVID-19-related Twitter data to detect potentially anomalous increases in mentions of fraudulent products. Our unsupervised approach detected 34/44 (77.3%) signals about fraudulent products earlier than the FDA letter issuance dates, and an additional 6/44 (13.6%) within a week following the corresponding FDA letters. Our proposed method is simple, effective and easy to deploy, and does not require high performance computing machinery, unlike deep neural network-based methods.
Publisher: JMIR Publications Inc.
Date: 28-04-2020
Abstract: Methadone and buprenorphine-naloxone (Suboxone®) have been discussed and compared extensively in the medical literature as effective treatments for opioid use disorder (OUD). While the evidence basis for the use of these medications is very favorable, less is known about the perceptions of these medications within the general public. This study aimed to use social media, specifically Twitter, to assess the public perception of these medications, and to compare the discussion content between each medication based on theme, subtheme, and sentiment. We conducted a mixed methods descriptive study analyzing individual microposts (“tweets”) that mentioned “methadone” or “suboxone”. We then categorized these tweets into themes and subthemes, as well as by sentiment and personal experience, and compared the information posted about these two medications, including in tweets that mentioned both medications. We analyzed 900 tweets, most of which related to access (13.8% for methadone; 12.9% for suboxone®), stigma (15.3%; 14.0%), and OUD treatment (11.5%; 5.4%). Only a small proportion of tweets (16.4% for suboxone® and 9.3% for methadone) expressed positive sentiments about the medications, with few tweets describing personal experiences. Tweets mentioning both medications primarily discussed MOUD in general, rather than comparing the two medications directly. Twitter content about methadone and suboxone is similar, with the same major themes and similar sub-themes. Despite the proven effectiveness of these medications, there was little dialogue related to their benefits or efficacy in the treatment of opioid use disorder. Perceptions of these medications may contribute to their underutilization in combatting opioid use disorder.
Publisher: JMIR Publications Inc.
Date: 13-08-2019
Abstract: Social media data are being increasingly used for population-level health research because they provide near real-time access to large volumes of consumer-generated data. Recently, a number of studies have explored the possibility of using social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a paucity of annotated data or guidelines for data characterization that discuss how information related to abuse-prone medications is presented on Twitter. This study discusses the creation of an annotated corpus suitable for training supervised classification algorithms for the automatic classification of medication abuse–related chatter. The annotation strategies used for improving interannotator agreement (IAA), a detailed annotation guideline, and machine learning experiments that illustrate the utility of the annotated corpus are also described. We employed an iterative annotation strategy, with interannotator discussions held and updates made to the annotation guidelines at each iteration to improve IAA for the manual annotation task. Using the grounded theory approach, we first characterized tweets into fine-grained categories and then grouped them into 4 broad classes: abuse or misuse, personal consumption, mention, and unrelated. After the completion of manual annotations, we experimented with several machine learning algorithms to illustrate the utility of the corpus and generate baseline performance metrics for automatic classification on these data. Our final annotated set consisted of 16,443 tweets mentioning at least 20 abuse-prone medications including opioids, benzodiazepines, atypical antipsychotics, central nervous system stimulants, and gamma-aminobutyric acid analogs. Our final overall IAA was 0.86 (Cohen kappa), which represents high agreement. 
The manual annotation process revealed the variety of ways in which prescription medication misuse or abuse is discussed on Twitter, including expressions indicating coingestion, nonmedical use, nonstandard route of intake, and consumption above the prescribed doses. Among machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271). Our manual analysis and annotations of a large number of tweets have revealed types of information posted on Twitter about a set of abuse-prone prescription medications and their distributions. In the interests of reproducible and community-driven research, we have made our detailed annotation guidelines and the training data for the classification experiments publicly available, and the test data will be used in future shared tasks.
Publisher: Springer Science and Business Media LLC
Date: 02-01-2020
Publisher: Cold Spring Harbor Laboratory
Date: 11-12-2019
DOI: 10.1101/871608
Abstract: Opioid use disorder (OUD) is a public health emergency in the United States. Over 47,000 overdose-related deaths in 2017 involved opioids. Medication-assisted treatment (MAT), in particular, buprenorphine and buprenorphine combination products such as Suboxone®, is the most effective, evidence-based treatment for OUD. However, there are a limited number of conclusive scientific studies that provide guidance to medical professionals about strategies for using buprenorphine to achieve stable recovery. In this study, we used data-driven natural language processing methods to mine a total of 16,146 posts about buprenorphine from 1933 unique users on the anonymous social network Reddit. Analysis of a sample of these posts showed that 74% of the posts described users’ personal experiences and that the top three topics included advice on using Suboxone® (55.0%), Suboxone® dosage information (35.5%) and information about Suboxone® tapering (32.0%). Based on two models, one that incorporated ‘upvoting’ by other members and one that did not, we found that Reddit users reported more successful recovery with longer tapering schedules, particularly from 2.0 mg to 0.0 mg (median: 93 days; mean: 95 days), as compared to shorter tapering schedules investigated in past clinical trials. Diarrhea, insomnia, restlessness, and fatigue were commonly reported adverse events. Physical exercise, clonidine, and Imodium® were frequently reported to help during the recovery process. Due to the difficulties of conducting longer-term clinical trials involving patients with OUD, clinicians should consider other information sources including peer discussions from the abundant, real-time information available on Reddit. Opioid use disorder (OUD) is a national crisis in the United States and buprenorphine is one of the most effective evidence-based treatments. 
However, few studies have explored successful strategies for using and tapering buprenorphine to achieve stable recovery, particularly due to the difficulties of conducting long-term studies involving patients with OUD. In this study, we show that discussions on the anonymous social network Reddit may be leveraged, via automatic text mining methods, to discover successful buprenorphine use and tapering strategies. We discovered that longer tapering schedules, compared to those investigated in past clinical trials, may lead to (self-reported) sustained recovery. Furthermore, Reddit posts also provide key information regarding buprenorphine withdrawal, cravings, adjunct medications for withdrawal symptoms and relapse prevention strategies.
Publisher: Association for Computational Linguistics
Date: 2020
Publisher: Cold Spring Harbor Laboratory
Date: 14-03-2023
DOI: 10.1101/2023.03.13.23287215
Abstract: Xylazine is an alpha-2 agonist increasingly prevalent in the illicit drug supply. Our objectives were to curate information about xylazine through social media from People Who Use Drugs (PWUDs). Specifically, we sought to answer the following: 1) what are the demographics of Reddit subscribers reporting exposure to xylazine? 2) is xylazine a desired additive? and 3) what adverse effects of xylazine are PWUDs experiencing? Natural Language Processing (NLP) was used to identify mentions of “xylazine” from posts by Reddit subscribers who also posted on drug-related subreddits. Posts were qualitatively evaluated for xylazine-related themes. A survey was developed to gather additional information about the Reddit subscribers. This survey was posted on subreddits that were identified by NLP to contain xylazine-related discussions from March 2022 to October 2022. 76 posts mentioning xylazine were extracted via NLP from 765,616 posts by 16,131 Reddit subscribers (January 2018 to August 2021). People on Reddit described xylazine as an unwanted adulterant in their opioid supply. 61 participants completed the survey. Of those that disclosed their location, 25/50 (50%) participants reported locations in the Northeastern United States. The most common route of xylazine use was intranasal use (57%). 31/59 (53%) reported experiencing xylazine withdrawal. Frequent adverse events reported were prolonged sedation (81%) and increased skin wounds (43%). Among respondents on these Reddit forums, xylazine appears to be an unwanted adulterant. PWUDs may be experiencing adverse effects such as prolonged sedation and xylazine withdrawal. This appeared to be more common in the Northeast.
Publisher: Elsevier BV
Date: 08-2023
Publisher: Oxford University Press (OUP)
Date: 04-10-2019
DOI: 10.1093/JAMIA/OCZ162
Abstract: Prescription medication (PM) misuse and abuse is a major health problem globally, and a number of recent studies have focused on exploring social media as a resource for monitoring nonmedical PM use. Our objectives are to present a methodological review of social media–based PM abuse or misuse monitoring studies, and to propose a potential generalizable, data-centric processing pipeline for the curation of data from this resource. We identified studies involving social media, PMs, and misuse or abuse (inclusion criteria) from Medline, Embase, Scopus, Web of Science, and Google Scholar. We categorized studies based on multiple characteristics including but not limited to data size; social media source(s); medications studied; and primary objectives, methods, and findings. A total of 39 studies met our inclusion criteria, with 31 (∼79.5%) published since 2015. Twitter has been the most popular resource, with Reddit and Instagram gaining popularity recently. Early studies focused mostly on manual, qualitative analyses, with a growing trend toward the use of data-centric methods involving natural language processing and machine learning. There is a paucity of standardized, data-centric frameworks for curating social media data for task-specific analyses and near real-time surveillance of nonmedical PM use. Many existing studies do not quantify human agreements for manual annotation tasks or take into account the presence of noise in data. The development of reproducible and standardized data-centric frameworks that build on the current state-of-the-art methods in data and text mining may enable effective utilization of social media data for understanding and monitoring nonmedical PM use.
Publisher: Elsevier BV
Date: 02-2016
DOI: 10.1016/J.JBI.2015.11.010
Abstract: Evidence-based medicine practice requires medical practitioners to rely on the best available evidence, in addition to their expertise, when making clinical decisions. The medical domain boasts a large amount of published medical research data, indexed in various medical databases such as MEDLINE. As the size of this data grows, practitioners increasingly face the problem of information overload, and past research has established the time-associated obstacles faced by evidence-based medicine practitioners. In this paper, we focus on the problem of automatic text summarisation to help practitioners quickly find query-focused information from relevant documents. We utilise an annotated corpus that is specialised for the task of evidence-based summarisation of text. In contrast to past summarisation approaches, which mostly rely on surface level features to identify salient pieces of texts that form the summaries, our approach focuses on the use of corpus-based statistics, and domain-specific lexical knowledge for the identification of summary contents. We also apply a target-sentence-specific summarisation technique that reduces the problem of underfitting that persists in generic summarisation models. In automatic evaluations run over a large number of annotated summaries, our extractive summarisation technique statistically outperforms various baseline and benchmark summarisation models with a percentile rank of 96.8%. A manual evaluation shows that our extractive summarisation approach is capable of selecting content with high recall and precision, and may thus be used to generate bottom-line answers to practitioners' queries. Our research shows that the incorporation of specialised data and domain-specific knowledge can significantly improve text summarisation performance in the medical domain. 
Due to the vast amounts of medical text available, and the high growth of this form of data, we suspect that such summarisation techniques will address the time-related obstacles associated with evidence-based medicine.
Publisher: WORLD SCIENTIFIC
Date: 18-11-2015
Publisher: Cold Spring Harbor Laboratory
Date: 13-06-2020
DOI: 10.1101/2020.06.12.20129593
Abstract: Social media can be an effective but challenging resource for conducting close-to-real-time assessments of consumers’ perceptions about health services. Our objective was to develop and evaluate an automatic pipeline, involving natural language processing and machine learning, for automatically characterizing user-posted Twitter data about Medicaid. We collected Twitter data via the public API using Medicaid-related keywords (Corpus-1), and the website’s search option using agency-specific handles (Corpus-2). We manually labeled a sample of tweets into five pre-determined categories or other, and artificially increased the number of training posts from specific low-frequency categories. We trained and evaluated several supervised learning algorithms using manually-labeled data, and applied the best-performing classifier to collected tweets for post-classification analyses assessing the utility of our methods. We collected 628,411 and 27,377 tweets for Corpus-1 and -2, respectively. We manually annotated 9,571 (Corpus-1: 8,180; Corpus-2: 1,391) tweets, using 7,923 (82.8%) for training and 1,648 (17.2%) for evaluation. A BERT-based (bidirectional encoder representations from transformers) classifier obtained the highest accuracies (83.9%, Corpus-1; 86.4%, Corpus-2), outperforming the second-best classifier (SVMs: 79.6%; 76.4%). Post-classification analyses revealed differing inter-corpora distributions of tweet categories, with political (63%) and consumer-feedback (43%) tweets being most frequent for Corpus-1 and -2, respectively. The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed pipeline presents a feasible solution for automatic categorization, and can be deployed/generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies (LINK_TO_BE_AVAILABLE).
Publisher: Cold Spring Harbor Laboratory
Date: 17-04-2020
DOI: 10.1101/2020.04.13.20064089
Abstract: Prescription medication (PM) misuse/abuse has emerged as a national crisis in the United States, and social media has been suggested as a potential resource for performing active monitoring. However, automating a social media-based monitoring system is challenging, requiring advanced natural language processing (NLP) and machine learning methods. In this paper, we describe the development and evaluation of automatic text classification models for detecting self-reports of PM abuse from Twitter. We experimented with state-of-the-art bi-directional transformer-based language models, which utilize tweet-level representations that enable transfer learning (e.g., BERT, RoBERTa, XLNet, AlBERT, and DistilBERT), proposed fusion-based approaches, and compared the developed models with several traditional machine learning, including deep learning, approaches. Using a public dataset, we evaluated the performances of the classifiers on their abilities to classify the non-majority “abuse/misuse” class. Our proposed fusion-based model performs significantly better than the best traditional model (F1-score [95% CI]: 0.67 [0.64-0.69] vs. 0.45 [0.42-0.48]). We illustrate, via experimentation using differing training set sizes, that the transformer-based models are more stable and require less annotated data compared to the other models. The significant improvements achieved by our best-performing classification model over past approaches make it suitable for automated continuous monitoring of nonmedical PM use from Twitter.
Publisher: Cold Spring Harbor Laboratory
Date: 26-05-2020
DOI: 10.1101/2020.05.22.20110742
Abstract: The performances of current medical text summarization systems rely on resource-heavy domain-specific knowledge sources, and preprocessing methods (e.g., classification or deep learning) for deriving semantic information. Consequently, these systems are often difficult to customize, extend or deploy in low-resource settings, and are operationally slow. We propose a fast summarization system that can aid practitioners at point-of-care, and, thus, improve evidence-based healthcare. At runtime, our system utilizes similarity measurements derived from pre-trained domain-specific word embeddings in addition to simple features, rather than clunky knowledge bases and resource-heavy preprocessing. Automatic evaluation on a public dataset for evidence-based medicine shows that our system’s performance, despite the simple implementation, is statistically comparable with the state-of-the-art.
Publisher: Cold Spring Harbor Laboratory
Date: 22-08-2022
DOI: 10.1101/2022.08.20.22279021
Abstract: Due to the high economic and public health burden of chronic pain, and the risk of public health consequences of opioid-based treatments, there is a need to identify effective alternative therapies. The evidence basis for many alternative therapies is weak or nonexistent. Social media presents a unique opportunity to gather large-scale knowledge about such therapies self-reported by sufferers themselves. We attempted to (i) verify the presence of large-scale chronic pain-related chatter on Twitter, (ii) develop natural language processing (NLP) and machine learning for automatically detecting chronic pain sufferers, and (iii) identify the types of chronic pain-related information reported by them. We collected data from Twitter using chronic pain-related hashtags and keywords. We manually performed binary annotation of a sample of 4998 posts to indicate if they were self-reports of chronic pain experiences or not, and obtained inter-annotator agreement of 0.82 (Cohen’s kappa). We trained and evaluated several state-of-the-art transformer-based text classification models using the annotated data. The RoBERTa model outperformed all others (F1 score = 0.84; 95% CI: 0.80-0.89), and we used this model to classify a large number of unlabeled posts. We identified 22,795 self-reported chronic pain sufferers and collected their past posted data. Via manual and NLP-driven analyses, we found information about but not limited to alternative treatments, sufferers’ sentiments about treatments, side effects, and self-management strategies. Our social media-based approach will result in an automatically growing massive cohort over time, and the data can be leveraged to identify self-reported effective alternative therapies for diverse chronic pain types.
Publisher: Elsevier BV
Date: 06-2015
DOI: 10.1016/J.ARTMED.2015.04.001
Abstract: Evidence-based medicine practice requires practitioners to obtain the best available medical evidence, and appraise the quality of the evidence when making clinical decisions. Primarily due to the plethora of electronically available data from the medical literature, the manual appraisal of the quality of evidence is a time-consuming process. We present a fully automatic approach for predicting the quality of medical evidence in order to aid practitioners at point-of-care. Our approach extracts relevant information from medical article abstracts and utilises data from a specialised corpus to apply supervised machine learning for the prediction of the quality grades. Following an in-depth analysis of the usefulness of features (e.g., publication types of articles), they are extracted from the text via rule-based approaches and from the meta-data associated with the articles, and then applied in the supervised classification model. We propose the use of a highly scalable and portable approach using a sequence of high precision classifiers, and introduce a simple evaluation metric called average error distance (AED) that simplifies the comparison of systems. We also perform elaborate human evaluations to compare the performance of our system against human judgments. We test and evaluate our approaches on a publicly available, specialised, annotated corpus containing 1132 evidence-based recommendations. Our rule-based approach performs exceptionally well at the automatic extraction of publication types of articles, with F-scores of up to 0.99 for high-quality publication types. For evidence quality classification, our approach obtains an accuracy of 63.84% and an AED of 0.271. The human evaluations show that the performance of our system, in terms of AED and accuracy, is comparable to the performance of humans on the same data. 
The experiments suggest that our structured text classification framework achieves evaluation results comparable to those of human performance. Our overall classification approach and evaluation technique are also highly portable and can be used for various evidence grading scales.
Publisher: Springer Berlin Heidelberg
Date: 2013
Publisher: Springer Science and Business Media LLC
Date: 10-2019
DOI: 10.1038/S41746-019-0170-5
Abstract: Social media has recently been used to identify and study a small cohort of Twitter users whose pregnancies with birth defect outcomes (the leading cause of infant mortality) could be observed via their publicly available tweets. In this study, we exploit social media on a larger scale by developing natural language processing (NLP) methods to automatically detect, among thousands of users, a cohort of mothers reporting that their child has a birth defect. We used 22,999 annotated tweets to train and evaluate supervised machine learning algorithms, both feature-engineered and deep learning-based classifiers, that automatically distinguish tweets referring to the user’s pregnancy outcome from tweets that merely mention birth defects. Because 90% of the tweets merely mention birth defects, we experimented with under-sampling and over-sampling approaches to address this class imbalance. An SVM classifier achieved the best performance for the two positive classes: an F1-score of 0.65 for the “defect” class and 0.51 for the “possible defect” class. We deployed the classifier on 20,457 unlabeled tweets that mention birth defects, which helped identify 542 additional users for potential inclusion in our cohort. Contributions of this study include (1) NLP methods for automatically detecting tweets by users reporting their birth defect outcomes, (2) findings that an SVM classifier can outperform a deep neural network-based classifier for highly imbalanced social media data, (3) evidence that automatic classification can be used to identify additional users for potential inclusion in our cohort, and (4) a publicly available corpus for training and evaluating supervised machine learning algorithms.
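The class-imbalance handling mentioned in the abstract above can be sketched as plain random resampling. This is only an illustration under stated assumptions: the abstract does not say which under-sampling or over-sampling variant was used, and the function `resample`, its parameters, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def resample(examples, labels, strategy="under", seed=0):
    """Balance classes by random under- or over-sampling (illustrative only)."""
    if strategy not in ("under", "over"):
        raise ValueError("strategy must be 'under' or 'over'")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    sizes = [len(items) for items in by_class.values()]
    # Under-sampling shrinks every class to the smallest class size;
    # over-sampling grows every class to the largest class size.
    target = min(sizes) if strategy == "under" else max(sizes)
    balanced = []
    for y, items in by_class.items():
        if strategy == "under":
            chosen = rng.sample(items, target)  # without replacement
        else:
            extra = [rng.choice(items) for _ in range(target - len(items))]
            chosen = items + extra  # duplicates drawn with replacement
        balanced.extend((x, y) for x in chosen)
    rng.shuffle(balanced)
    return balanced


# Toy data mirroring a 90/10 split; the texts are placeholders.
texts = [f"tweet {i}" for i in range(100)]
tags = ["mention"] * 90 + ["defect"] * 10
under = resample(texts, tags, strategy="under")
over = resample(texts, tags, strategy="over")
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class distribution.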
Publisher: Cold Spring Harbor Laboratory
Date: 08-01-2021
DOI: 10.1101/2021.01.06.21249350
Abstract: Biomedical research involving social media (SM) data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, SM users’ demographic information (e.g., gender) is often not explicitly known from profiles. Here we present an automatic gender classification system for SM and we illustrate how gender information can be incorporated into a SM-based health-related study. We used two large Twitter datasets: (i) public, gender-labeled users (Dataset-1), and (ii) users who have self-reported nonmedical use of prescription medications (Dataset-2). Dataset-1 was used to train and evaluate the gender detection pipeline. We experimented with machine-learning algorithms including support vector machines (SVMs) and deep-learning models, and released packages including M3. We considered users’ information including profile and tweets for classification. We also developed a meta-classifier ensemble that strategically uses the predicted scores from the classifiers. We applied the best-performing pipeline to Dataset-2 to assess the system’s utility. We collected 67,181 and 176,683 users for Dataset-1 and Dataset-2, respectively. A meta-classifier involving SVM and M3 performed the best (Dataset-1 accuracy: 94.4% [95%-CI: 94.0%-94.8%]; Dataset-2: 94.4% [95%-CI: 92.0%-96.6%]). Including automatically-classified information in the analyses of Dataset-2 revealed gender-specific trends: proportions of females closely resemble data from the National Survey of Drug Use and Health 2018 (tranquilizers: 0.50 vs. 0.50; stimulants: 0.50 vs. 0.45), and the overdose Emergency Room Visit due to Opioids by CDC (pain relievers: 0.38 vs. 0.37). Our publicly-available, automated gender detection pipeline may aid cohort-specific social media data analyses (arkerlab/gender-detection-for-public).
Publisher: JMIR Publications Inc.
Date: 24-02-2020
Abstract: Twitter is a potentially valuable tool for public health officials and state Medicaid programs in the United States, which provide public health insurance to 72 million Americans. We aim to characterize how Medicaid agencies and managed care organization (MCO) health plans are using Twitter to communicate with the public. Using Twitter’s public application programming interface, we collected 158,714 public posts (“tweets”) from active Twitter profiles of state Medicaid agencies and MCOs, spanning March 2014 through June 2019. Manual content analyses identified 5 broad categories of content, and these coded tweets were used to train supervised machine learning algorithms to classify all collected posts. We identified 15 state Medicaid agencies and 81 Medicaid MCOs on Twitter. The mean number of followers was 1784, the mean number of those followed was 542, and the mean number of posts was 2476. Approximately 39% of tweets came from just 10 accounts. Of all posts, 39.8% (63,168/158,714) were classified as general public health education and outreach; 23.5% (n=37,298) were about specific Medicaid policies, programs, services, or events; 18.4% (n=29,203) were organizational promotion of staff and activities; and 11.6% (n=18,411) contained general news and news links. Only 4.5% (n=7142) of posts were responses to specific questions, concerns, or complaints from the public. Twitter has the potential to enhance community building, beneficiary engagement, and public health outreach, but appears to be underutilized by the Medicaid program.
Publisher: Cold Spring Harbor Laboratory
Date: 06-03-2023
DOI: 10.1101/2023.03.01.23286659
Abstract: The Fontan operation palliates single ventricle heart defects and is associated with significant morbidity and premature mortality. Native anatomy varies; thus, Fontan cases cannot always be identified by International Classification of Diseases, Ninth and Tenth Revision, Clinical Modification (ICD-9-CM and ICD-10-CM) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing (NLP) based machine learning (ML) models, which utilize free text notes of patients, to automatically detect Fontan cases, and compare their performances with ICD code based classification. We included free text notes of 10,935 manually validated patients, of whom 778 (7.1%) were Fontan and 10,157 (92.9%) non-Fontan patients, from two large, diverse healthcare systems. Using 5-fold cross validation, we trained and evaluated multiple ML models, namely support vector machines (SVM) and a transformer based model for language understanding named RoBERTa (2 versions), for automatically identifying Fontan cases based on free text notes. To optimize classifier performances, we experimented with different text representation techniques, including a sliding window strategy to overcome the length limit imposed by RoBERTa. We compared the performances of the ML models to ICD code based classification using the F1 score metric. The ICD classification model, SVM, and RoBERTa achieved F1 scores of 0.81 (95% CI: 0.79-0.83), 0.95 (95% CI: 0.92-0.97), and 0.89 (95% CI: 0.88-0.85) for the positive (Fontan) class, respectively. SVM obtained the best performance (p < .05), and both NLP models outperformed ICD code based classification (p < .05). The novel sliding window strategy improved performance over the base RoBERTa model (p < .05) but did not outperform SVM. ICD code based classification tended to have more false positives compared to both NLP models. 
Our proposed NLP models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes. Since the sensitivity of ICD codes is high but the positive predictive value is low, it may be beneficial to apply ICD codes as a filter prior to applying NLP/ML to achieve optimal performance.
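The sliding window idea mentioned in the abstract above can be sketched in a few lines: a long sequence of token IDs is cut into overlapping windows that each fit the model's length limit, each window is scored separately, and the window scores are combined. The window size, stride, and the max-score aggregation below are assumptions made for illustration; the abstract does not state the values or the aggregation rule the authors used, and `sliding_windows` and `aggregate_max` are hypothetical names.

```python
def sliding_windows(token_ids, window_size=510, stride=255):
    """Split a token sequence into overlapping windows of at most window_size."""
    if window_size <= 0 or stride <= 0 or stride > window_size:
        raise ValueError("require 0 < stride <= window_size")
    if len(token_ids) <= window_size:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window_size]))
        if start + window_size >= len(token_ids):
            break  # the current window already reaches the end of the note
        start += stride
    return windows


def aggregate_max(window_scores):
    """Assumed note-level rule: positive if any single window scores high."""
    return max(window_scores)


# Toy note of 1,200 token IDs; a real pipeline would obtain IDs from a tokenizer.
note = list(range(1200))
chunks = sliding_windows(note)
note_score = aggregate_max([0.1, 0.2, 0.9, 0.3])  # placeholder per-window scores
```

Requiring `stride <= window_size` guarantees that consecutive windows touch or overlap, so no token is dropped between windows.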
Publisher: Oxford University Press (OUP)
Date: 04-2021
DOI: 10.1093/JAMIAOPEN/OOAB042
Abstract: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media users’ demographic information (eg, gender) is often not explicitly known from profiles. Here, we present an automatic gender classification system for social media and we illustrate how gender information can be incorporated into a social media-based health-related study. We used a large Twitter dataset composed of public, gender-labeled users (Dataset-1) for training and evaluating the gender detection pipeline. We experimented with machine learning algorithms including support vector machines (SVMs) and deep-learning models, and public packages including M3. We considered users’ information including profile and tweets for classification. We also developed a meta-classifier ensemble that strategically uses the predicted scores from the classifiers. We then applied the best-performing pipeline to Twitter users who have self-reported nonmedical use of prescription medications (Dataset-2) to assess the system’s utility. We collected 67 181 and 176 683 users for Dataset-1 and Dataset-2, respectively. A meta-classifier involving SVM and M3 performed the best (Dataset-1 accuracy: 94.4% [95% confidence interval: 94.0–94.8%]; Dataset-2: 94.4% [95% confidence interval: 92.0–96.6%]). Including automatically classified information in the analyses of Dataset-2 revealed gender-specific trends: proportions of females closely resemble data from the National Survey of Drug Use and Health 2018 (tranquilizers: 0.50 vs 0.50; stimulants: 0.50 vs 0.45), and the overdose Emergency Room Visit due to Opioids by Nationwide Emergency Department Sample (pain relievers: 0.38 vs 0.37). Our publicly available, automated gender detection pipeline may aid cohort-specific social media data analyses (arkerlab/gender-detection-for-public).
Publisher: WORLD SCIENTIFIC
Date: 18-11-2015
Publisher: Cold Spring Harbor Laboratory
Date: 22-04-2020
DOI: 10.1101/2020.04.16.20067421
Abstract: To mine Twitter to quantitatively analyze COVID-19 symptoms self-reported by users, compare symptom distributions against clinical studies, and create a symptom lexicon for the research community. We retrieved tweets using COVID-19-related keywords, and performed semi-automatic filtering to curate self-reports of positive-tested users. We extracted COVID-19-related symptoms mentioned by the users, mapped them to standard concept IDs (UMLS), and compared the distributions to those reported in early studies from clinical settings. We identified 203 positive-tested users who reported 1002 symptoms using 668 unique expressions. The most frequently-reported symptoms were fever/pyrexia (66.1%), cough (57.9%), body ache/pain (42.7%), fatigue (42.1%), headache (37.4%), and dyspnea (36.3%) amongst users who reported at least 1 symptom. Mild symptoms, such as anosmia (28.7%) and ageusia (28.1%) were frequently reported on Twitter, but not in clinical studies. The spectrum of COVID-19 symptoms identified from Twitter may complement those identified in clinical settings.
Publisher: JMIR Publications Inc.
Date: 08-06-2017
Abstract: Pregnancy exposure registries are the primary sources of information about the safety of maternal usage of medications during pregnancy. Such registries enroll pregnant women in a voluntary fashion early on in pregnancy and follow them until the end of pregnancy or longer to systematically collect information regarding specific pregnancy outcomes. Although the model of pregnancy registries has distinct advantages over other study designs, they are faced with numerous challenges and limitations such as low enrollment rate, high cost, and selection bias. The primary objectives of this study were to systematically assess whether social media (Twitter) can be used to discover cohorts of pregnant women and to develop and deploy a natural language processing and machine learning pipeline for the automatic collection of cohort information. In addition, we also attempted to ascertain, in a preliminary fashion, what types of longitudinal information may potentially be mined from the collected cohort information. Our discovery of pregnant women relies on detecting pregnancy-indicating tweets (PITs), which are statements posted by pregnant women regarding their pregnancies. We used a set of 14 patterns to first detect potential PITs. We manually annotated a sample of 14,156 of the retrieved user posts to distinguish real PITs from false positives and trained a supervised classification system to detect real PITs. We optimized the classification system via cross validation, with features and settings targeted toward optimizing precision for the positive class. For users identified to be posting real PITs via automatic classification, our pipeline collected all their available past and future posts from which other information (eg, medication usage and fetal outcomes) may be mined. Our rule-based PIT detection approach retrieved over 200,000 posts over a period of 18 months. Manual annotation agreement for three annotators was very high at kappa (κ)=.79. 
On a blind test set, the implemented classifier obtained an overall F1 score of 0.84 (0.88 for the pregnancy class and 0.68 for the nonpregnancy class). Precision for the pregnancy class was 0.93, and recall was 0.84. Feature analysis showed that the combination of dense and sparse vectors for classification achieved optimal performance. Employing the trained classifier resulted in the identification of 71,954 users from the collected posts. Over 250 million posts were retrieved for these users, which provided a multitude of longitudinal information about them. Social media sources such as Twitter can be used to identify large cohorts of pregnant women and to gather longitudinal information via automated processing of their postings. Considering the many drawbacks and limitations of pregnancy registries, social media mining may provide beneficial complementary information. Although the cohort sizes identified over social media are large, future research will have to assess the completeness of the information available through them.
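As a quick arithmetic check on the figures reported above, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The helper name `f1_score` below is hypothetical; the precision and recall values are the pregnancy-class figures quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pregnancy-class precision and recall as reported in the abstract.
pregnancy_f1 = f1_score(0.93, 0.84)
```

With P = 0.93 and R = 0.84 this gives roughly 0.883, which rounds to the 0.88 reported for the pregnancy class.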
Publisher: Springer Science and Business Media LLC
Date: 26-01-2021
DOI: 10.1186/S12911-021-01394-0
Abstract: Prescription medication (PM) misuse/abuse has emerged as a national crisis in the United States, and social media has been suggested as a potential resource for performing active monitoring. However, automating a social media-based monitoring system is challenging, requiring advanced natural language processing (NLP) and machine learning methods. In this paper, we describe the development and evaluation of automatic text classification models for detecting self-reports of PM abuse from Twitter. We experimented with state-of-the-art bi-directional transformer-based language models, which utilize tweet-level representations that enable transfer learning (e.g., BERT, RoBERTa, XLNet, ALBERT, and DistilBERT), proposed fusion-based approaches, and compared the developed models with several traditional machine learning, including deep learning, approaches. Using a public dataset, we evaluated the performances of the classifiers on their abilities to classify the non-majority “abuse/misuse” class. Our proposed fusion-based model performs significantly better than the best traditional model (F1-score [95% CI]: 0.67 [0.64–0.69] vs. 0.45 [0.42–0.48]). We illustrate, via experimentation using varying training set sizes, that the transformer-based models are more stable and require less annotated data compared to the other models. The significant improvements achieved by our best-performing classification model over past approaches make it suitable for automated continuous monitoring of nonmedical PM use from Twitter. BERT, BERT-like and fusion-based models outperform traditional machine learning and deep learning models, achieving substantial improvements over many years of past research on the topic of prescription medication misuse/abuse classification from social media, which had been shown to be a complex task due to the unique ways in which information about nonmedical use is presented. 
Several challenges associated with the lack of context and the nature of social media language need to be overcome to further improve BERT and BERT-like models. These experimentally driven challenges are presented as potential future research directions.
Publisher: Oxford University Press (OUP)
Date: 07-11-2022
Abstract: Illicit or ‘designer’ benzodiazepines are a growing contributor to overdose deaths. We employed natural language processing (NLP) to study benzodiazepine mentions over 10 years on 270 online drug forums (subreddits) on Reddit. Using NLP, we automatically detected mentions of illicit and prescription benzodiazepines, including their misspellings and non-standard names, grouping relative mentions by quarter. On a collection of 17 861 755 posts between 2012 and 2021, we searched for 26 benzodiazepines (8 prescription, 18 illicit), detecting 173 275 mentions. The rate of posts about both prescription and illicit benzodiazepines increased consistently with increases in deaths involving both drug classes, illustrating the utility of surveillance via Reddit.
Publisher: Cold Spring Harbor Laboratory
Date: 23-09-2021
DOI: 10.1101/2021.09.20.21263856
Abstract: Nonmedical use of prescription drugs (NMPDU) is a global health concern. The extent of, behaviors and emotions associated with, and reasons for NMPDU are not well-captured through traditional instruments such as surveys, prescribing databases and insurance claims. Therefore, this study analyses ∼130 million public posts from 87,718 Twitter users in terms of expressed emotions, sentiments, concerns, and potential reasons for NMPDU via natural language processing. Our results show that users in the NMPDU group express more negative emotions and less positive emotions, more concerns about family, the past and body, and less concerns related to work, leisure, home, money, religion, health and achievement, compared to a control group (i.e., users who never reported NMPDU). NMPDU posts tend to be highly polarized, indicating potential emotional triggers. Gender-specific analysis shows that female users in the NMPDU group express more content related to positive emotions, anticipation, sadness, joy, concerns about family, friends, home, health and the past, and less about anger, compared to males. The findings of the study can enrich our understanding of NMPDU.
Publisher: Cold Spring Harbor Laboratory
Date: 28-04-2022
DOI: 10.1101/2022.04.27.22274390
Abstract: Traditional surveillance mechanisms for nonmedical prescription medication use (NPMU) involve substantial lags. Social media-based approaches have been proposed for conducting close-to-real-time surveillance, but such methods typically cannot provide fine-grained statistics about subpopulations. We address this gap by developing methods for automatically characterizing a large Twitter NPMU cohort (n=288,562) in terms of age-group, race, and gender. Our methods achieved 0.88 precision (95%-CI: 0.84-0.92) for age-group, 0.90 (95%-CI: 0.85-0.95) for race, and 0.94 accuracy (95%-CI: 0.92-0.97) for gender. We compared the automatically-derived statistics for the NPMU of tranquilizers, stimulants, and opioids from Twitter to statistics reported in traditional sources (e.g., the National Survey on Drug Use and Health). Our estimates were mostly consistent with the traditional sources, except for age-group-related statistics, likely caused by differences in reporting tendencies and representations in the population. Our study demonstrates that subpopulation-specific estimates about NPMU may be automatically derived from Twitter to obtain early insights.
Publisher: American Association for the Advancement of Science (AAAS)
Date: 2023
DOI: 10.34133/HDS.0078
Abstract: Background: Due to the high burden of chronic pain, and the detrimental public health consequences of its treatment with opioids, there is a high-priority need to identify effective alternative therapies. Social media is a potentially valuable resource for knowledge about self-reported therapies by chronic pain sufferers. Methods: We attempted to (a) verify the presence of large-scale chronic pain-related chatter on Twitter, (b) develop natural language processing and machine learning methods for automatically detecting self-disclosures, (c) collect longitudinal data posted by them, and (d) semiautomatically analyze the types of chronic pain-related information reported by them. We collected data using chronic pain-related hashtags and keywords and manually annotated 4,998 posts to indicate if they were self-reports of chronic pain experiences. We trained and evaluated several state-of-the-art supervised text classification models and deployed the best-performing classifier. We collected all publicly available posts from detected cohort members and conducted manual and natural language processing-driven descriptive analyses. Results: Interannotator agreement for the binary annotation was 0.82 (Cohen’s kappa). The RoBERTa model performed best (F1 score: 0.84; 95% confidence interval: 0.80 to 0.89), and we used this model to classify all collected unlabeled posts. We discovered 22,795 self-reported chronic pain sufferers and collected over 3 million of their past posts. Further analyses revealed information about, but not limited to, alternative treatments, patient sentiments about treatments, side effects, and self-management strategies. Conclusion: Our social media-based approach will result in an automatically growing large cohort over time, and the data can be leveraged to identify effective opioid-alternative therapies for diverse chronic pain types.
Publisher: Ovid Technologies (Wolters Kluwer Health)
Date: 23-12-2021
DOI: 10.1097/ADM.0000000000000940
Abstract: Opioid use disorder (OUD) is a major public health crisis for which buprenorphine-naloxone is an effective evidence-based treatment. Analysis of Reddit data yields detailed information about firsthand experiences with buprenorphine-naloxone that has the potential to inform treatment of OUD. We conducted a thematic analysis of posts about buprenorphine-naloxone from a Reddit forum in which Reddit users anonymously discuss topics related to opioid use. We used an application programming interface to retrieve posts about buprenorphine-naloxone, then applied natural language processing to generate meta-information and curate samples of salient posts. We manually categorized posts according to their content and conducted natural language processing-aided analysis of posts about buprenorphine tapering strategies, withdrawal symptoms, and adjunctive substances/behaviors useful in the tapering process. A total of 16,146 posts from 1933 redditors were retrieved from the /r/suboxone subreddit. Thematic analysis of sample posts (N = 200) revealed descriptions of personal experiences (74%), nonpersonal accounts (24%), and other content (2%). Among redditors who reported tapering to termination (N = 40), 0.063 mg and 0.125 mg were the most common termination doses. Fatigue, gastrointestinal disturbance, and mood disturbance were the most frequent adverse effects, and loperamide and vitamins/dietary supplements were the most frequently discussed adjunctive substances/behaviors. Discussions on Reddit are rich in information about buprenorphine-naloxone. Information derived from analysis of Reddit posts about buprenorphine-naloxone may not be available elsewhere and may help providers improve treatment of people with OUD through better understanding of the experiences of people who have used buprenorphine-naloxone.
Publisher: Association for Computational Linguistics
Date: 2015
DOI: 10.18653/V1/S15-2085
Publisher: MDPI AG
Date: 12-11-2022
DOI: 10.3390/HEALTHCARE10112270
Abstract: The COVID-19 pandemic is the most devastating public health crisis in at least a century and has affected the lives of billions of people worldwide in unprecedented ways. Compared to pandemics of this scale in the past, societies are now equipped with advanced technologies that can mitigate the impacts of pandemics if utilized appropriately. However, opportunities are currently not fully utilized, particularly at the intersection of data science and health. Health-related big data and technological advances have the potential to significantly aid the fight against such pandemics, including the current pandemic’s ongoing and long-term impacts. Specifically, the field of natural language processing (NLP) has enormous potential at a time when vast amounts of text-based data are continuously generated from a multitude of sources, such as health/hospital systems, published medical literature, and social media. Effectively mitigating the impacts of the pandemic requires tackling challenges associated with the application and deployment of NLP systems. In this paper, we review the applications of NLP to address diverse aspects of the COVID-19 pandemic. We outline key NLP-related advances on a chosen set of topics reported in the literature and discuss the opportunities and challenges associated with applying NLP during the current pandemic and future ones. These opportunities and challenges can guide future research aimed at improving the current health and social response systems and pandemic preparedness.
Publisher: JMIR Publications Inc.
Date: 03-05-2021
DOI: 10.2196/26616
Abstract: The wide adoption of social media in daily life renders it a rich and effective resource for conducting near real-time assessments of consumers’ perceptions of health services. However, its use in these assessments can be challenging because of the vast amount of data and the diversity of content in social media chatter. This study aims to develop and evaluate an automatic system involving natural language processing and machine learning to automatically characterize user-posted Twitter data about health services using Medicaid, the single largest source of health coverage in the United States, as an example. We collected data from Twitter in two ways: via the public streaming application programming interface using Medicaid-related keywords (Corpus 1) and by using the website’s search option for tweets mentioning agency-specific handles (Corpus 2). We manually labeled a sample of tweets in 5 predetermined categories or other and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including support vector machine, random forest (RF), naïve Bayes, shallow neural network (NN), k-nearest neighbor, bidirectional long short-term memory, and bidirectional encoder representations from transformers (BERT). We then applied the best-performing classifier to the collected tweets for postclassification analyses to assess the utility of our methods. We manually annotated 11,379 tweets (Corpus 1: 9179; Corpus 2: 2200) and used 7930 (69.7%) for training, 1449 (12.7%) for validation, and 2000 (17.6%) for testing. 
A classifier based on BERT obtained the highest accuracies (81.7%, Corpus 1; 80.7%, Corpus 2) and F1 scores on consumer feedback (0.58, Corpus 1; 0.90, Corpus 2), outperforming the second best classifiers in terms of accuracy (74.6%, RF on Corpus 1; 69.4%, RF on Corpus 2) and F1 score on consumer feedback (0.44, NN on Corpus 1; 0.82, RF on Corpus 2). Postclassification analyses revealed differing intercorpora distributions of tweet categories, with political (400778/628411, 63.78%) and consumer feedback (15073/27337, 55.14%) tweets being the most frequent for Corpus 1 and Corpus 2, respectively. The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization and can be deployed and generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies.
Publisher: American Association for the Advancement of Science (AAAS)
Date: 2022
Abstract: Background. The behaviors and emotions associated with and reasons for nonmedical prescription drug use (NMPDU) are not well-captured through traditional instruments such as surveys and insurance claims. Publicly available NMPDU-related posts on social media can potentially be leveraged to study these aspects unobtrusively and at scale. Methods. We applied a machine learning classifier to detect self-reports of NMPDU on Twitter and extracted all public posts of the associated users. We analyzed approximately 137 million posts from 87,718 Twitter users in terms of expressed emotions, sentiments, concerns, and possible reasons for NMPDU via natural language processing. Results. Users in the NMPDU group express more negative emotions and less positive emotions, more concerns about family, the past, and body, and less concerns related to work, leisure, home, money, religion, health, and achievement compared to a control group (i.e., users who never reported NMPDU). NMPDU posts tend to be highly polarized, indicating potential emotional triggers. Gender-specific analyses show that female users in the NMPDU group express more content related to positive emotions, anticipation, sadness, joy, concerns about family, friends, home, health, and the past, and less about anger than males. The findings are consistent across distinct prescription drug categories (opioids, benzodiazepines, stimulants, and polysubstance). Conclusion. Our analyses of large-scale data show that substantial differences exist between the texts of the posts from users who self-report NMPDU on Twitter and those who do not, and between males and females who report NMPDU. Our findings can enrich our understanding of NMPDU and the population involved.
Publisher: JMIR Publications Inc.
Date: 20-10-2022
Abstract: Social media has served as a lucrative platform for spreading misinformation and for promoting fraudulent products for the treatment, testing, and prevention of COVID-19. This has resulted in the issuance of many warning letters by the US Food and Drug Administration (FDA). While social media continues to serve as the primary platform for the promotion of such fraudulent products, it also presents the opportunity to identify these products early by using effective social media mining methods. Our objectives were to (1) create a data set of fraudulent COVID-19 products that can be used for future research and (2) propose a method using data from Twitter for automatically detecting heavily promoted COVID-19 products early. We created a data set from FDA-issued warnings during the early months of the COVID-19 pandemic. We used natural language processing and time-series anomaly detection methods for automatically detecting fraudulent COVID-19 products early from Twitter. Our approach is based on the intuition that increases in the popularity of fraudulent products lead to corresponding anomalous increases in the volume of chatter regarding them. We compared the anomaly signal generation date for each product with the corresponding FDA letter issuance date. We also performed a brief manual analysis of chatter associated with 2 products to characterize their contents. FDA warning issue dates ranged from March 6, 2020, to June 22, 2021, and 44 key phrases representing fraudulent products were included. From 577,872,350 posts made between February 19 and December 31, 2020, which are all publicly available, our unsupervised approach detected 34 out of 44 (77.3%) signals about fraudulent products earlier than the FDA letter issuance dates, and an additional 6 (13.6%) within a week following the corresponding FDA letters. Content analysis revealed misinformation, information, political, and conspiracy theories to be prominent topics.
Our proposed method is simple, effective, easy to deploy, and does not require high-performance computing machinery unlike deep neural network–based methods. The method can be easily extended to other types of signal detection from social media data. The data set may be used for future research and the development of more advanced methods.
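The intuition in the abstract above (an anomalous rise in chatter volume that can be compared against a letter issuance date) can be illustrated with a minimal sketch. The abstract does not name the anomaly detection algorithm, so the rolling z-score rule used here is an assumption, and the function names `first_anomaly` and `lead_time_days`, the window and threshold values, and all counts and dates are hypothetical illustrations rather than the authors' method or data.

```python
from datetime import date, timedelta
from statistics import mean, stdev


def first_anomaly(counts, window=7, threshold=3.0, min_sigma=1.0):
    """Return the index of the first day whose count exceeds the mean of the
    preceding `window` days by more than `threshold` standard deviations,
    or None if no such day exists. `min_sigma` is a floor that avoids
    division by zero when the history is flat (an assumed design choice)."""
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = max(stdev(history), min_sigma)
        if (counts[i] - mu) / sigma > threshold:
            return i
    return None


def lead_time_days(signal_date, letter_date):
    """Days by which the signal precedes the letter (negative if it follows)."""
    return (letter_date - signal_date).days


# Hypothetical daily post counts for one key phrase.
daily_counts = [10, 12, 11, 9, 10, 11, 10, 12, 11, 95, 40]
series_start = date(2020, 3, 1)   # hypothetical first day of the series
letter_date = date(2020, 3, 20)   # hypothetical letter issuance date

idx = first_anomaly(daily_counts)
if idx is not None:
    signal_date = series_start + timedelta(days=idx)
    print(signal_date, lead_time_days(signal_date, letter_date))
```

A positive lead time in this sketch corresponds to a signal generated earlier than the letter; a value between -7 and 0 would correspond to a signal within a week after it.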
Publisher: MDPI AG
Date: 05-08-2022
DOI: 10.3390/HEALTHCARE10081478
Abstract: Pretrained contextual language models proposed in the recent past have been reported to achieve state-of-the-art performances in many natural language processing (NLP) tasks, including those involving health-related social media data. We sought to evaluate the effectiveness of different pretrained transformer-based models for social media-based health-related text classification tasks. An additional objective was to explore and propose effective pretraining strategies to improve machine learning performance on such datasets and tasks. We benchmarked six transformer-based models that were pretrained with texts from different domains and sources (BERT, RoBERTa, BERTweet, TwitterBERT, BioClinical_BERT, and BioBERT) on 22 social media-based health-related text classification tasks. For the top-performing models, we explored the possibility of further boosting performance by comparing several pretraining strategies: domain-adaptive pretraining (DAPT), source-adaptive pretraining (SAPT), and a novel approach called topic-specific pretraining (TSPT). We also attempted to interpret the impacts of distinct pretraining strategies by visualizing document-level embeddings at different stages of the training process. RoBERTa outperformed BERTweet on most tasks, and both performed better than the other models. BERT, TwitterBERT, BioClinical_BERT and BioBERT consistently underperformed. For pretraining strategies, SAPT performed better than or comparably to the off-the-shelf models, and significantly outperformed DAPT. SAPT + TSPT showed consistently high performance, with statistically significant improvement in three tasks. Our findings demonstrate that RoBERTa and BERTweet are excellent off-the-shelf models for health-related social media text classification, and extended pretraining using SAPT and TSPT can further improve performance.
Publisher: Informa UK Limited
Date: 04-02-2022
DOI: 10.1080/15563650.2022.2032730
Abstract: Induction of buprenorphine, an evidence-based treatment for opioid use disorder (OUD), has been reported to be difficult for people with heavy use of fentanyl, the most prevalent opioid in many areas of the country. In this population, precipitated opioid withdrawal (POW) may occur even after individuals have completed a period of opioid abstinence prior to induction. Our objective was to study potential associations between fentanyl, buprenorphine induction, and POW, using social media data. This is a mixed methods study of data from seven opioid-related forums (subreddits) on Reddit. We retrieved publicly available data from the subreddits Reddit subscribers often associate POW with fentanyl and analog (F&A) use and describe self-managed buprenorphine induction strategies involving microdosing to avoid POW. Further objective studies in patients with fentanyl use and OUD initiating buprenorphine are needed to corroborate these findings. HIGHLIGHTS: Increase in mentions of precipitated opioid withdrawal (POW) on Reddit from 2012 to 2021 was associated with the increase in fentanyl and analog mentions. Experiences of precipitated opioid withdrawal (POW) were described by individuals despite reporting prolonged periods of abstinence compared to standard buprenorphine induction protocols. People with Opioid Use Disorder (OUD) on Reddit are using and recommending microdosing strategies with buprenorphine to avoid POW. People who used fentanyl report experiencing POW following statistically longer periods of abstinence than people who use heroin.
Publisher: Oxford University Press (OUP)
Date: 11-04-2019
DOI: 10.1093/JAMIA/OCZ013
Publisher: JMIR Publications Inc.
Date: 30-10-2017
DOI: 10.2196/JMIR.8164
Publisher: IEEE
Date: 06-2012
Publisher: Springer Nature Switzerland
Date: 2023
Publisher: Association for Computational Linguistics
Date: 2019
DOI: 10.18653/V1/W19-3203
Publisher: Cold Spring Harbor Laboratory
Date: 21-05-2020
DOI: 10.1101/2020.05.17.20104778
Abstract: Breast cancer patients often discontinue their long-term treatments, such as hormone therapy, increasing the risk of cancer recurrence. These discontinuations are often caused by adverse patient-centered outcomes (PCOs) due to hormonal drug side effects or other factors. PCOs are not detectable through laboratory tests and are sparsely documented in electronic health records. Thus, there is a need to explore other sources of information for PCOs associated with breast cancer treatments. Social media is a promising resource, but extracting true PCOs from it first requires the accurate detection of breast cancer patients. We describe a natural language processing (NLP) architecture for automatically detecting breast cancer patients from Twitter based on their self-reports. The architecture employs breast cancer-related keywords to collect streaming data from Twitter, applies NLP patterns to pre-filter noisy posts, and then employs a machine learning classifier trained using manually-annotated data (n=5019) for distinguishing firsthand self-reports of breast cancer from other tweets. A classifier based on bidirectional encoder representations from transformers (BERT) showed human-like performance and achieved an F1-score of 0.857 (inter-annotator agreement: 0.845 Cohen's kappa) for the positive class, considerably outperforming the next best classifier, a deep neural network (F1-score: 0.665). Qualitative analyses of posts from automatically-detected users revealed discussions about side effects, non-adherence, and mental health conditions, illustrating the feasibility of our social media-based approach for studying breast cancer-related PCOs from a large population.
Publisher: Ovid Technologies (Wolters Kluwer Health)
Date: 03-2020
DOI: 10.1097/DCR.0000000000001583
Abstract: Low anterior resection syndrome is pragmatically defined as disordered bowel function after rectal resection leading to a detriment in quality of life. This broad characterization does not allow for precise estimates of prevalence. The low anterior resection syndrome score was designed as a simple tool for clinical evaluation of low anterior resection syndrome. Although the low anterior resection syndrome score has good clinical utility, it may not capture all important aspects that patients may experience. The aim of this collaboration was to develop an international consensus definition of low anterior resection syndrome that encompasses all aspects of the condition and is informed by all stakeholders. This international patient-provider initiative used an online Delphi survey, regional patient consultation meetings, and an international consensus meeting. Three expert groups participated: patients, surgeons, and other health professionals from 5 regions (Australasia, Denmark, Spain, Great Britain and Ireland, and North America) and in 3 languages (English, Spanish, and Danish). The primary outcome measured was the priorities for the definition of low anterior resection syndrome. Three hundred twenty-five participants (156 patients) registered. The response rates for successive rounds of the Delphi survey were 86%, 96%, and 99%. Eighteen priorities emerged from the Delphi survey. Patient consultation and consensus meetings refined these priorities to 8 symptoms and 8 consequences that capture essential aspects of the syndrome. Sampling bias may have been present, in particular, in the patient panel because social media was used extensively in recruitment. There was also dominance of the surgical panel at the final consensus meeting despite attempts to mitigate this. This is the first definition of low anterior resection syndrome developed with direct input from a large international patient panel.
The involvement of patients in all phases has ensured that the definition presented encompasses the vital aspects of the patient experience of low anterior resection syndrome. The novel separation of symptoms and consequences may enable greater sensitivity to detect changes in low anterior resection syndrome over time and with intervention.
Publisher: Elsevier BV
Date: 11-2018
Publisher: JMIR Publications Inc.
Date: 03-05-2023
Abstract: Social media have emerged as important sources of information generated by large segments of the population, which can be particularly valuable during infectious disease outbreaks. By analyzing posts from Twitter (tweets), we aimed to identify the topics of public discourse, and knowledge and opinions about the monkeypox virus during the 2022 outbreak. We collected data from Twitter for English-language posts using the key phrases monkeypox, mpoxvirus, and monkey pox, and their hashtag equivalents from August to October 2022. We selected a small random sample from the collected posts, analyzed, coded, and manually categorized them first into topics, then into coarse-grained themes. 28,615 posts were collected in total; 200 tweets were selected and included for manual analyses. Eight themes were generated from the Twitter posts: monkeypox doubts, media, monkeypox transmission, effect of monkeypox, knowledge of monkeypox, politics, monkeypox vaccine, and general comments. The commonest themes from our study were monkeypox doubts and media, 22% each. The posts represented a mixture of useful information as new knowledge on the topic emerged, and also misinformation. Social networks, such as Twitter, are useful sources of information in the early stages of outbreaks. Close to real-time identification and analyses of misinformation may help authorities take the necessary steps in a timely manner.
Publisher: JMIR Publications Inc.
Date: 26-02-2020
DOI: 10.2196/15861
Abstract: Social media data are being increasingly used for population-level health research because it provides near real-time access to large volumes of consumer-generated data. Recently, a number of studies have explored the possibility of using social media data, such as from Twitter, for monitoring prescription medication abuse. However, there is a paucity of annotated data or guidelines for data characterization that discuss how information related to abuse-prone medications is presented on Twitter. This study discusses the creation of an annotated corpus suitable for training supervised classification algorithms for the automatic classification of medication abuse–related chatter. The annotation strategies used for improving interannotator agreement (IAA), a detailed annotation guideline, and machine learning experiments that illustrate the utility of the annotated corpus are also described. We employed an iterative annotation strategy, with interannotator discussions held and updates made to the annotation guidelines at each iteration to improve IAA for the manual annotation task. Using the grounded theory approach, we first characterized tweets into fine-grained categories and then grouped them into 4 broad classes—abuse or misuse, personal consumption, mention, and unrelated. After the completion of manual annotations, we experimented with several machine learning algorithms to illustrate the utility of the corpus and generate baseline performance metrics for automatic classification on these data. Our final annotated set consisted of 16,443 tweets mentioning at least 20 abuse-prone medications including opioids, benzodiazepines, atypical antipsychotics, central nervous system stimulants, and gamma-aminobutyric acid analogs. Our final overall IAA was 0.86 (Cohen kappa), which represents high agreement. 
The manual annotation process revealed the variety of ways in which prescription medication misuse or abuse is discussed on Twitter, including expressions indicating coingestion, nonmedical use, nonstandard route of intake, and consumption above the prescribed doses. Among machine learning classifiers, support vector machines obtained the highest automatic classification accuracy of 73.00% (95% CI 71.4-74.5) over the test set (n=3271). Our manual analysis and annotations of a large number of tweets have revealed types of information posted on Twitter about a set of abuse-prone prescription medications and their distributions. In the interests of reproducible and community-driven research, we have made our detailed annotation guidelines and the training data for the classification experiments publicly available, and the test data will be used in future shared tasks.
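The interannotator agreement figure in the abstract above is reported as Cohen kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch of that calculation for two annotators follows; the function name `cohen_kappa` and the toy labels are hypothetical, and only the four class names come from the abstract.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both annotators pick the same class
    # if each labelled independently according to their own class frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical) using the four broad classes named in the abstract.
annotator_1 = ["abuse", "abuse", "mention", "unrelated", "consumption", "mention"]
annotator_2 = ["abuse", "mention", "mention", "unrelated", "consumption", "mention"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

In this toy case the annotators agree on 5 of 6 items, but kappa is lower than 5/6 because part of that agreement would be expected by chance.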
Publisher: Proceedings of the National Academy of Sciences
Date: 14-02-2023
Abstract: Traditional substance use (SU) surveillance methods, such as surveys, incur substantial lags. Due to the continuously evolving trends in SU, insights obtained via such methods are often outdated. Social media-based sources have been proposed for obtaining timely insights, but methods leveraging such data cannot typically provide fine-grained statistics about subpopulations, unlike traditional approaches. We address this gap by developing methods for automatically characterizing a large Twitter nonmedical prescription medication use (NPMU) cohort (n = 288,562) in terms of age-group, race, and gender. Our natural language processing and machine learning methods for automated cohort characterization achieved 0.88 precision (95% CI: 0.84 to 0.92) for age-group, 0.90 (95% CI: 0.85 to 0.95) for race, and 94% accuracy (95% CI: 92 to 97) for gender, when evaluated against manually annotated gold-standard data. We compared automatically derived statistics for NPMU of tranquilizers, stimulants, and opioids from Twitter with statistics reported in the National Survey on Drug Use and Health (NSDUH) and the National Emergency Department Sample (NEDS). Distributions automatically estimated from Twitter were mostly consistent with the NSDUH [Spearman r: race: 0.98 (P < 0.005); age-group: 0.67 (P < 0.005); gender: 0.66 (P = 0.27)] and NEDS, with 34/65 (52.3%) of the Twitter-based estimates lying within 95% CIs of estimates from the traditional sources. Explainable differences (e.g., overrepresentation of younger people) were found for age-group-related statistics. Our study demonstrates that accurate subpopulation-specific estimates about SU, particularly NPMU, may be automatically derived from Twitter to obtain earlier insights about targeted subpopulations compared to traditional surveillance approaches.
Publisher: Informa UK Limited
Date: 06-04-2021
Publisher: Association for Computational Linguistics
Date: 2016
DOI: 10.18653/V1/S16-1031
Publisher: Elsevier BV
Date: 07-2022
Publisher: Springer Science and Business Media LLC
Date: 09-01-2016
Publisher: Oxford University Press (OUP)
Date: 27-11-2020
DOI: 10.1093/BIOINFORMATICS/BTAA995
Abstract: LexExp is an open-source, data-centric lexicon expansion system that generates spelling variants of lexical expressions in a lexicon using a phrase embedding model, lexical similarity-based natural language processing methods and a set of tunable threshold decay functions. The system is customizable, can be optimized for recall or precision and can generate variants for multi-word expressions. Code available at: sarker/lexexp data and resources available at: exexp. Supplementary data are available at Bioinformatics online.
Publisher: Wiley
Date: 14-12-2020
DOI: 10.1111/CODI.15465
Publisher: ACM
Date: 23-04-2018
Publisher: Wiley
Date: 12-02-2017
DOI: 10.1111/IJD.13506
Publisher: Elsevier BV
Date: 10-2019
DOI: 10.1016/J.JBI.2019.103268
Abstract: The assessment of written medical examinations is a tedious and expensive process, requiring significant amounts of time from medical experts. Our objective was to develop a natural language processing (NLP) system that can expedite the assessment of unstructured answers in medical examinations by automatically identifying relevant concepts in the examinee responses. Our NLP system, Intelligent Clinical Text Evaluator (INCITE), is semi-supervised in nature. Learning from a limited set of fully annotated examples, it sequentially applies a series of customized text comparison and similarity functions to determine if a text span represents an entry in a given reference standard. Combinations of fuzzy matching and set intersection-based methods capture inexact matches and also fragmented concepts. Customizable, dynamic similarity-based matching thresholds allow the system to be tailored for examinee responses of different lengths. INCITE achieved an average F Long and non-standard expressions are difficult for INCITE to detect, but the problem is mitigated by the use of dynamic thresholding (i.e., varying the similarity threshold for a text span to be considered a match). Annotation variations within exams and disagreements between annotators were the primary causes for false positives. Small amounts of annotated data can significantly improve system performance. The high performance and interpretability of INCITE will likely significantly aid the assessment process and also help mitigate the impact of manual assessment inconsistencies.
Publisher: Springer Science and Business Media LLC
Date: 05-03-2022
DOI: 10.1186/S13011-022-00442-W
Abstract: Timely data from official sources regarding the impact of the COVID-19 pandemic on people who use prescription and illegal opioids is lacking. We conducted a large-scale, natural language processing (NLP) analysis of conversations on opioid-related drug forums to better understand concerns among people who use opioids. In this retrospective observational study, we analyzed posts from 14 opioid-related forums on the social network Reddit. We applied NLP to identify frequently mentioned substances and phrases, and grouped the phrases manually based on their contents into three broad key themes: (i) prescription and/or illegal opioid use; (ii) substance use disorder treatment access and care; and (iii) withdrawal. Phrases that were unmappable to any particular theme were discarded. We computed the frequencies of substance and theme mentions, and quantified their volumes over time. We compared changes in post volumes by key themes and substances between pre-COVID-19 (1/1/2019 to 2/29/2020) and COVID-19 (3/1/2020 to 11/30/2020) periods. Seventy-seven thousand six hundred fifty-two and 119,168 posts were collected for the pre-COVID-19 and COVID-19 periods, respectively. By theme, posts about treatment and access to care increased by 300%, from 0.631 to 2.526 per 1,000 posts between the pre-COVID-19 and COVID-19 periods. Conversations about withdrawal increased by 812% between the same periods (0.026 to 0.235 per 1,000 posts). Posts about drug use did not increase (0.219 to 0.218 per 1,000 posts). By substance, among medications for opioid use disorder, methadone had the largest increase in conversations (20.751 to 56.313 per 1,000 posts; 171.4% increase). Among other medications, posts about diphenhydramine exhibited the largest increase (0.341 to 0.927 per 1,000 posts; 171.8% increase).
Conversations on opioid-related forums among people who use opioids revealed increased concerns about treatment and access to care along with withdrawal following the emergence of COVID-19. Greater attention to social media data may help inform timely responses to the needs of people who use opioids during COVID-19.
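The comparisons in the abstract above rest on two simple quantities: a mention rate per 1,000 posts and the percent change between periods. A minimal sketch of that arithmetic follows; the function names `rate_per_thousand` and `percent_change` are hypothetical, and the inputs are the rounded rates printed in the abstract.

```python
def rate_per_thousand(mentions, total_posts):
    """Mentions normalised to a rate per 1,000 posts."""
    if total_posts <= 0:
        raise ValueError("total_posts must be positive")
    return 1000.0 * mentions / total_posts


def percent_change(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    if before == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return 100.0 * (after - before) / before


# Rounded rates per 1,000 posts as printed in the abstract.
print(round(percent_change(20.751, 56.313), 1))  # methadone
print(round(percent_change(0.341, 0.927), 1))    # diphenhydramine
print(round(percent_change(0.631, 2.526)))       # treatment and access to care
```

Recomputing from the rounded rates may differ slightly from a reported percentage, presumably because the published figures were derived from unrounded values.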
Publisher: MDPI AG
Date: 20-08-2019
DOI: 10.3390/MTI3030060
Abstract: In the medical domain, user-generated social media text is increasingly used as a valuable complementary knowledge source to scientific medical literature. The extraction of this knowledge is complicated by colloquial language use and misspellings. However, lexical normalization of such data has not been addressed effectively. This paper presents a data-driven lexical normalization pipeline with a novel spelling correction module for medical social media. Our method significantly outperforms state-of-the-art spelling correction methods and can detect mistakes with an F1 of 0.63 despite extreme imbalance in the data. We also present the first corpus for spelling mistake detection and correction in a medical patient forum.
Publisher: Cold Spring Harbor Laboratory
Date: 26-11-2021
DOI: 10.1101/2021.11.24.21266793
Abstract: Intimate partner violence (IPV) is a preventable public health issue that affects millions of people worldwide. Approximately one in four women are estimated to be or have been victims of severe violence at some point in their lives, irrespective of their age, ethnicity, and economic status. Victims often report IPV experiences on social media, and automatic detection of such reports via machine learning may enable the proactive and targeted distribution of support and/or interventions for those in need. We collected posts from Twitter using a list of keywords related to IPV. We manually reviewed subsets of retrieved posts, and prepared annotation guidelines to categorize tweets into IPV-report or non-IPV-report. We manually annotated a random subset of the collected tweets according to the guidelines, and used them to train and evaluate multiple supervised classification models. For the best classification strategy, we examined the model errors, bias, and trustworthiness through manual and automated content analysis. We annotated a total of 6,348 tweets, with inter-annotator agreement (IAA) of 0.86 (Cohen’s kappa) among 1,834 double-annotated tweets. The dataset had substantial class imbalance, with only 668 (∼11%) tweets representing IPV-reports. The RoBERTa model achieved the best classification performance (accuracy: 95%; IPV-report F1-score: 0.76; non-IPV-report F1-score: 0.97). Content analysis of the tweets revealed that the RoBERTa model sometimes misclassified as it focused on IPV-irrelevant words or symbols during decision making. Classification outcome and word importance analyses showed that our developed model is not biased toward gender or ethnicity while making classification decisions. Our study developed an effective NLP model to identify IPV-reporting tweets automatically and in real time. The developed model can be an essential component for providing proactive social media based intervention and support for victims.
It may also be used for population-level surveillance and conducting large-scale cohort studies.
Publisher: WORLD SCIENTIFIC
Date: 11-2018
Publisher: Research Square Platform LLC
Date: 12-01-2021
DOI: 10.21203/RS.3.RS-58679/V2
Abstract: Background Prescription medication (PM) misuse/abuse has emerged as a national crisis in the United States, and social media has been suggested as a potential resource for performing active monitoring. However, automating a social media-based monitoring system is challenging—requiring advanced natural language processing (NLP) and machine learning methods. In this paper, we describe the development and evaluation of automatic text classification models for detecting self-reports of PM abuse from Twitter. Methods We experimented with state-of-the-art bi-directional transformer-based language models, which utilize tweet-level representations that enable transfer learning (e.g., BERT, RoBERTa, XLNet, AlBERT, and DistilBERT), proposed fusion-based approaches, and compared the developed models with several traditional machine learning, including deep learning, approaches. Using a public dataset, we evaluated the performances of the classifiers on their abilities to classify the non-majority “abuse/misuse” class. Results Our proposed fusion-based model performs significantly better than the best traditional model (F1-score [95% CI]: 0.67 [0.64-0.69] vs. 0.45 [0.42-0.48]). We illustrate, via experimentation using differing training set sizes, that the transformer-based models are more stable and require less annotated data compared to the other models. The significant improvements achieved by our best-performing classification model over past approaches makes it suitable for automated continuous monitoring of nonmedical PM use from Twitter. Conclusions BERT, BERT-like and fusion-based models not only outperform traditional machine learning and deep learning models, but also show substantial improvements over many years of past research on the topic of prescription medication misuse/abuse classification from social media, which had been shown to be a complex task due to the unique ways in which information about nonmedical use is presented. 
Several challenges, such as lack of complete context and the nature of social media language, need to be overcome to further improve BERT and BERT-like models. These experimentally driven challenges are presented as potential future research directions.
Location: United States of America
Location: New Zealand
Start Date: 2015
End Date: 2015
Funder: Health Research Council of New Zealand
Start Date: 2020
End Date: 2020
Funder: Maurice and Phyllis Paykel Trust
Start Date: 2015
End Date: 2015
Funder: Lottery Health Research
Start Date: 2015
End Date: 2015
Funder: Royal Australasian College of Surgeons
Start Date: 2016
End Date: 2018
Funder: Auckland Medical Research Foundation
Start Date: 2018
End Date: 2018
Funder: Auckland Medical Research Foundation
Start Date: 2018
End Date: 2022
Funder: National Institute on Drug Abuse
Start Date: 2002
End Date: 12-2003
Amount: $375,000.00
Funder: Australian Research Council