ARDC Research Link Australia

Publication

BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics.

Publisher: Oxford University Press (OUP)

Date: 2018

DOI: 10.1093/DATABASE/BAY122

Publication

Leveraging gene ontology annotations to improve a memory-based language understanding system

Publisher: IEEE

Date: 09-2010

DOI: 10.1109/ICSC.2010.62

Publication

The Evolution of Clinical Knowledge During COVID-19: Towards a Global Learning Health System.

Publisher: Georg Thieme Verlag KG

Date: 08-2021

DOI: 10.1055/S-0041-1726503

Abstract: Objectives: We examine the knowledge ecosystem of COVID-19, focusing on clinical knowledge and the role of health informatics as enabling technology. We argue for commitment to the model of a global learning health system to facilitate rapid knowledge translation supporting health care decision making in the face of emerging diseases. Methods and Results: We frame the evolution of knowledge in the COVID-19 crisis in terms of learning theory, and present a view of what has occurred during the pandemic to rapidly derive and share knowledge as an (underdeveloped) instance of a global learning health system. We identify the key role of information technologies for electronic data capture and data sharing, computational modelling, evidence synthesis, and knowledge dissemination. We further highlight gaps in the system and barriers to full realisation of an efficient and effective global learning health system. Conclusions: The need for a global knowledge ecosystem supporting rapid learning from clinical practice has become more apparent than ever during the COVID-19 pandemic. Continued effort to realise the vision of a global learning health system, including establishing effective approaches to data governance and ethics to support the system, is imperative to enable continuous improvement in our clinical care.

Publication

Use of a Victorian statewide surveillance programme to evaluate the burden of healthcare‐associated Staphylococcus aureus bacteraemia and Clostridioides difficile infection in patients with cancer

Publisher: Wiley

Date: 31-05-2022

DOI: 10.1111/IMJ.15301

Abstract: Patients with cancer are at high risk for infection, but the epidemiology of healthcare-associated Staphylococcus aureus bacteraemia (HA-SAB) and Clostridioides difficile infection (HA-CDI) in Australian cancer patients has not previously been reported. To compare the cumulative aggregate incidence and time trends of HA-SAB and HA-CDI in a predefined cancer cohort with a mixed statewide patient population in Victoria, Australia. All SAB and CDI events in patients admitted to Victorian healthcare facilities between 1 July 2010 and 31 December 2018 were submitted to the Victorian Healthcare Associated Infection Surveillance System Coordinating Centre. Descriptive analyses and multilevel mixed-effects Poisson regression modelling were applied to a standardised data extract. In total, 10 608 and 13 118 SAB and CDI events were reported across 139 Victorian healthcare facilities, respectively. Of these, 89 (85%) and 279 (88%) were healthcare-associated in the cancer cohort compared with 34% (3561/10 503) and 66% (8403/12 802) in the statewide cohort. The aggregate incidence was more than twofold higher in the cancer cohort compared with the statewide cohort for HA-SAB (2.25 (95% confidence interval (CI): 1.74-2.77) vs 1.11 (95% CI: 1.07-1.15) HA-SAB/10 000 occupied bed-days) and threefold higher for HA-CDI (6.26 (95% CI: 5.12-7.41) vs 2.31 (95% CI: 2.21-2.42) HA-CDI/10 000 occupied bed-days). Higher quarterly diminishing rates were observed in the cancer cohort than the statewide data for both infections. Our findings demonstrate a higher burden of HA-SAB and HA-CDI in a cancer cohort when compared with state data and highlight the need for cancer-specific targets and benchmarks to meaningfully support quality improvement.

Publication

A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning.

Publisher: Springer Science and Business Media LLC

Date: 12-2016

DOI: 10.1186/S12918-016-0371-3

Publication

Conceptualising health information seeking behaviours and exploratory search: result of a qualitative study

Publisher: Springer Science and Business Media LLC

Date: 10-02-2015

DOI: 10.1007/S12553-015-0096-0

Publication

Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters

Publisher: Springer Science and Business Media LLC

Date: 26-02-2014

DOI: 10.1186/1471-2105-15-59

Publication

Classifying literature mentions of biological pathogens as experimentally studied using natural language processing

Publisher: Research Square Platform LLC

Date: 09-2022

DOI: 10.21203/RS.3.RS-1996210/V1

Abstract: Background Information pertaining to mechanisms, management and treatment of disease-causing pathogens including viruses and bacteria is readily available from research publications indexed in MEDLINE. However, identifying the literature that specifically characterises these pathogens and their properties based on experimental research, important for understanding of the molecular basis of diseases caused by these agents, requires sifting through a large quantity of articles to exclude incidental mentions of the pathogens, or references to pathogens in other non-experimental contexts such as public health. Objective In this work, we lay the foundations for the development of automatic methods for characterising mentions of pathogens in scientific literature, focusing on the task of identifying research that involves active study of a pathogen in an experimental context. There are no manually annotated pathogen corpora available for this purpose, while such resources are necessary to support development of machine learning-based models. We therefore aim to fill this gap, producing a large data set automatically from MEDLINE, and using it to explore automatic methods that specifically support detection of experimentally studied pathogen mentions in research publications. Methods We developed a pathogen mention characterisation literature data set —READBiomed-Pathogens— automatically using NCBI resources, which we make available. Resources such as the NCBI Taxonomy, MeSH and GenBank can be used effectively to identify relevant literature about experimentally researched pathogens, more specifically using MeSH to link to MEDLINE citations including titles and abstracts with relevant pathogens. We experiment with several machine learning-based natural language processing (NLP) algorithms leveraging this dataset as training data to model the task of detecting papers that specifically describe active experimental study of a pathogen. Results We show that our data set READBiomed-Pathogens can be used to explore natural language processing configurations for experimental pathogen mention characterisation. READBiomed-Pathogens includes citations related to organisms including bacteria, viruses, and a small number of toxins and other disease-causing agents. Conclusions We studied the characterisation of experimentally studied pathogens in scientific literature, developing several natural language processing methods supported by an automatically developed data set. As a core contribution of the work, we presented a methodology to automatically construct a data set for pathogen identification using existing biomedical resources. The data set and the annotation code are made publicly available. Performance of the pathogen mention identification and characterisation algorithms were additionally evaluated on a small manually annotated data set shows that the data set that we have generated allows characterising pathogens of interest. Trial Registration: N/A

Publication

Chemtables: A Dataset for Semantic Classification on Tables in Chemical Patents

Publisher: Springer Science and Business Media LLC

Date: 12-2021

DOI: 10.1186/S13321-021-00568-2

Abstract: Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables , which consists of 788 chemical patent tables with labels of their content type. We introduce this data set in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on ChemTables . The best performing model, Table-BERT, achieves a performance of 88.66 micro-averaged $$F_1$$ F 1 score on the table classification task. The ChemTables dataset is publicly available at 0.17632/g7tjh7tbrj.3 , subject to the CC BY NC 3.0 license. Code/models evaluated in this work are in a Github repository enanz/ChemTables .

Publication

Exploring effective approaches for haplotype block phasing

Publisher: Springer Science and Business Media LLC

Date: 30-10-2019

DOI: 10.1186/S12859-019-3095-8

Abstract: Knowledge of phase, the specific allele sequence on each copy of homologous chromosomes, is increasingly recognized as critical for detecting certain classes of disease-associated mutations. One approach for detecting such mutations is through phased haplotype association analysis. While the accuracy of methods for phasing genotype data has been widely explored, there has been little attention given to phasing accuracy at haplotype block scale. Understanding the combined impact of the accuracy of phasing tool and the method used to determine haplotype blocks on the error rate within the determined blocks is essential to conduct accurate haplotype analyses. We present a systematic study exploring the relationship between seven widely used phasing methods and two common methods for determining haplotype blocks. The evaluation focuses on the number of haplotype blocks that are incorrectly phased. Insights from these results are used to develop a haplotype estimator based on a consensus of three tools. The consensus estimator achieved the most accurate phasing in all applied tests. In idually, EAGLE2, BEAGLE and SHAPEIT2 alternate in being the best performing tool in different scenarios. Determining haplotype blocks based on linkage disequilibrium leads to more correctly phased blocks compared to a sliding window approach. We find that there is little difference between phasing sections of a genome (e.g. a gene) compared to phasing entire chromosomes. Finally, we show that the location of phasing error varies when the tools are applied to the same data several times, a finding that could be important for downstream analyses. The choice of phasing and block determination algorithms and their interaction impacts the accuracy of phased haplotype blocks. This work provides guidance and evidence for the different design choices needed for analyses using haplotype blocks. The study highlights a number of issues that may have limited the replicability of previous haplotype analysis.

Publication

Performance of ICD-10-AM codes for quality improvement monitoring of hospital-acquired pneumonia in a haematology-oncology casemix in Victoria, Australia

Publisher: SAGE Publications

Date: 14-11-2022

DOI: 10.1177/18333583221131753

Abstract: The Australian hospital-acquired complication (HAC) policy was introduced to facilitate negative funding adjustments in Australian hospitals using ICD-10-AM codes. The aim of this study was to determine the positive predictive value (PPV) of the ICD-10-AM codes in the HAC framework to detect hospital-acquired pneumonia in patients with cancer and to describe any change in PPV before and after implementation of an electronic medical record (EMR) at our centre. A retrospective case review of all coded pneumonia episodes at the Peter MacCallum Cancer Centre in Melbourne, Australia spanning two time periods (01 July 2015 to 30 June 2017 [pre-EMR period] and 01 September 2020 to 28 February 2021 [EMR period]) was performed to determine the proportion of events satisfying standardised surveillance definitions. HAC-coded pneumonia occurred in 3.66% ( The current HAC definition is a poor-to-moderate classifier for hospital-acquired pneumonia in patients with cancer and, therefore, may not accurately reflect hospital-level quality improvement. Implementation of an EMR did enhance case detection, and future refinements to administratively coded data in support of robust monitoring frameworks should focus on EMR systems. Although ICD-10-AM data are readily available in Australian healthcare settings, these data are not sufficient for monitoring and reporting of hospital-acquired pneumonia in haematology-oncology patients.

Publication

Literature Consistency of Bioinformatics Sequence Databases is Effective for Assessing Record Quality

Publisher: Cold Spring Harbor Laboratory

Date: 23-01-2017

DOI: 10.1101/101873

Abstract: Bioinformatics sequence databases such as Genbank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from in idual laboratories, as well as from bulk submissions from large-scale sequencing centres their ersity and scale means that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness, and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of inconsistent records with respect to the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature, and then use query quality predictors. We then carry out an analysis that shows that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using Principal Component Analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area than records know to be inconsistent, we show that 1 record out of 4 is inconsistent with respect to the literature. This high density of inconsistent record opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records.

Publication

Text Mining Improves Prediction of Protein Functional Sites

Publisher: Public Library of Science (PLoS)

Date: 29-02-2012

DOI: 10.1371/JOURNAL.PONE.0032171

Publication

Burden and clinical outcomes of hospital-coded infections in patients with cancer: an 11-year longitudinal cohort study at an Australian cancer centre

Publisher: Springer Science and Business Media LLC

Date: 15-04-2020

DOI: 10.1007/S00520-020-05439-4

Publication

Automatic English-Chinese name transliteration for development of multilingual resources

Publisher: Association for Computational Linguistics

Date: 1998

DOI: 10.3115/980432.980789

Publication

“Note Bloat” impacts deep learning-based NLP models for clinical prediction tasks

Publisher: Elsevier BV

Date: 09-2022

DOI: 10.1016/J.JBI.2022.104149

Abstract: One unintended consequence of the Electronic Health Records (EHR) implementation is the overuse of content-importing technology, such as copy-and-paste, that creates "bloated" notes containing large amounts of textual redundancy. Despite the rising interest in applying machine learning models to learn from real-patient data, it is unclear how the phenomenon of note bloat might affect the Natural Language Processing (NLP) models derived from these notes. Therefore, in this work we examine the impact of redundancy on deep learning-based NLP models, considering four clinical prediction tasks using a publicly available EHR database. We applied two deduplication methods to the hospital notes, identifying large quantities of redundancy, and found that removing the redundancy usually has little negative impact on downstream performances, and can in certain circumstances assist models to achieve significantly better results. We also showed it is possible to attack model predictions by simply adding note duplicates, causing changes of correct predictions made by trained models into wrong predictions. In conclusion, we demonstrated that EHR text redundancy substantively affects NLP models for clinical prediction tasks, showing that the awareness of clinical contexts and robust modeling methods are important to create effective and reliable NLP systems in healthcare contexts.

Publication

Learning from Unlabelled Data for Clinical Semantic Textual Similarity

Publisher: Association for Computational Linguistics

Date: 2020

DOI: 10.18653/V1/2020.CLINICALNLP-1.25

Publication

Appraising the Quality of Systematic Reviews for Age-Related Macular Degeneration Interventions

Publisher: American Medical Association (AMA)

Date: 09-2018

DOI: 10.1001/JAMAOPHTHALMOL.2018.2620

Abstract: Age-related macular degeneration (AMD) is a leading cause of vision impairment. It is imperative that AMD care is timely, appropriate, and evidence-based. It is thus essential that AMD systematic reviews are robust however, little is known about the quality of this literature. To investigate the methodological quality of systematic reviews of AMD intervention studies, and to evaluate their use for guiding evidence-based care. This systematic review adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement. All studies that self-identified as a systematic review in their title or abstract or were categorized as a systematic review from a medical subject heading and investigated the safety, efficacy and/or effectiveness of an AMD intervention were included. Comprehensive electronic searches were performed in Ovid MEDLINE, Embase, and the Cochrane Library from inception to March 2017. Two reviewers independently assessed titles and abstracts, then full-texts for eligibility. Quality was assessed using the Assessing the Methodological Quality of Systematic Reviews (AMSTAR) tool. Study characteristics (publication year, type of intervention, journal, citation rate, and funding source) were extracted. Of 983 citations retrieved, 71 studies (7.6%) were deemed eligible. The first systematic review relating to an AMD intervention was published in 2003. More than half were published since 2014. Methodological quality was highly variable. The mean (SD) AMSTAR score was 5.8 (3.2) of 11.0, with no significant improvement over time (r = -0.03 95% CI, -0.26 to 0.21 P = .83). Cochrane systematic reviews were overall of higher quality than reviews in other journals (mean [SD] AMSTAR score, 9.9 [1.2], n = 15 vs 4.7 [2.2], n = 56 P < .001). Overall, there was poor adherence to referring to an a priori design (22 articles [31%]) and reporting conflicts of interest in both the review and included studies (16 articles [23%]). Reviews funded by government grants and/or institutions were generally of higher quality than industry-sponsored reviews or where the funding source was not reported. There are gaps in the conduct of systematic reviews in the field of AMD. Enhanced endorsement of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement by refereed journals may improve review quality and improve the dissemination of reliable evidence relating to AMD interventions to clinicians.

Publication

High-precision biological event extraction: Effects of system and of data

Publisher: Wiley

Date: 11-2011

DOI: 10.1111/J.1467-8640.2011.00405.X

Publication

Early prediction of incident liver disease using conventional risk factors and gut-microbiome-augmented gradient boosting

Publisher: Elsevier BV

Date: 05-2022

DOI: 10.1016/J.CMET.2022.03.002

Publication

Evaluation of consensus strategies for haplotype phasing

Publisher: Cold Spring Harbor Laboratory

Date: 14-07-2020

DOI: 10.1101/2020.07.13.175786

Abstract: Haplotype phasing is a critical step for many genetic applications but incorrect estimates of phase can negatively impact downstream analyses. One proposed strategy to improve phasing accuracy is to combine multiple independent phasing estimates to overcome the limitations of any in idual estimate. As such a strategy is yet to be thoroughly explored, this study provides a comprehensive evaluation of consensus strategies for haplotype phasing, exploring their performance, along with their constituent tools, across a range of real and simulated datasets with different data characteristics and on the downstream task of genotype imputation. Based on the outputs of existing phasing tools, we explore two different strategies to construct haplotype consensus estimators: voting across outputs from multiple phasing tools and multiple outputs of a single non-deterministic tool. We find the consensus approach from multiple tools reduces switch error by an average of 10% compared to any constituent tool when applied to European populations and has the highest accuracy regardless of population ethnicity, s le size, SNP-density or SNP frequency. Furthermore, a consensus provides a small improvement indirectly the downstream task of genotype imputation regardless of which genotype imputation tools were used. Our results provide guidance on how to produce the most accurate phasing estimates and the tradeoffs that a consensus approach may have. Our implementation of consensus haplotype phasing, consHap, is available freely at iadbkh/consHap .

Publication

Interoperability of text corpus annotations with the semantic web.

Publisher: Springer Science and Business Media LLC

Date: 06-08-2015

DOI: 10.1186/1753-6561-9-S5-A2

Publication

Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

Publisher: Cold Spring Harbor Laboratory

Date: 05-10-2019

DOI: 10.1101/788034

Abstract: The volume of biological database records is growing rapidly, populated by complex records drawn from heterogeneous sources. A specific challenge is duplication, that is, the presence of redundancy (records with high similarity) or inconsistency (dissimilar records that correspond to the same entity). The characteristics (which records are duplicates), impact (why duplicates are significant), and solutions (how to address duplication), are not well understood. Studies on the topic are neither recent nor comprehensive. In addition, other data quality issues, such as inconsistencies and inaccuracies, are also of concern in the context of biological databases. A primary focus of this paper is to present and consolidate the opinions of over 20 experts and practitioners on the topic of duplication in biological sequence databases. The results reveal that survey participants believe that duplicate records are erse that the negative impacts of duplicates are severe, while positive impacts depend on correct identification of duplicates and that duplicate detection methods need to be more precise, scalable, and robust. A secondary focus is to consider other quality issues. We observe that biocuration is the key mechanism used to ensure the quality of this data, and explore the issues through a case study of curation in UniProtKB/Swiss-Prot as well as an interview with an experienced biocurator. While biocuration is a vital solution for handling of data quality issues, a broader community effort is needed to provide adequate support for thorough biocuration in the face of widespread quality concerns.

Publication

The Use of Web-Based Technologies in Health Research Participation: Qualitative Study of Consumer and Researcher Experiences

Publisher: JMIR Publications Inc.

Date: 30-10-2018

DOI: 10.2196/12094

Publication

Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases

Publisher: ACM

Date: 22-10-2015

DOI: 10.1145/2811163.2811175

Publication

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Publisher: Springer Science and Business Media LLC

Date: 17-08-2012

DOI: 10.1186/1471-2105-13-207

Publication

Chemtables: A Dataset for Semantic Classification on Tables in Chemical Patents

Publisher: Research Square Platform LLC

Date: 21-05-2021

DOI: 10.21203/RS.3.RS-127219/V2

Abstract: Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables, which consists of 788 chemical patent tables with labels of their content type. We introduce this data set in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on ChemTables. The best performing model, Table-BERT, achieves a performance of 88.66 micro F1 score on the table classification task. Availability: The ChemTables dataset is publicly available at 0.17632/g7tjh7tbrj.1, subject to the CC BY NC 3.0 license. Code/models evaluated in this work are in a Github repository enanz/ChemTables.

Publication

Use and validation of text mining and cluster algorithms to derive insights from Corona Virus Disease-2019 (COVID-19) medical literature

Publisher: Elsevier BV

Date: 2021

DOI: 10.1016/J.CMPBUP.2021.100010

Publication

ChemTables: A Dataset for Semantic Classification of Tables in Chemical Patents

Publisher: Research Square Platform LLC

Date: 16-12-2020

DOI: 10.21203/RS.3.RS-127219/V1

Abstract: Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called ChemTables, which consists of 7,886 chemical patent tables with labels of their content type. We introduce this data set in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on ChemTables. The best performing model, Table-BERT, achieves a performance of 88.66 micro F1 score on the table classification task. Availability: A 10% s le of the ChemTables dataset has been made publicly available, subject to a data usage agreement.

Publication

Literature consistency of bioinformatics sequence databases is effective for assessing record quality.

Publisher: Oxford University Press (OUP)

Date: 2017

DOI: 10.1093/DATABASE/BAX021

Publication

Extracting structured information from free-text medication prescriptions using dependencies

Publisher: ACM

Date: 29-10-2012

DOI: 10.1145/2390068.2390076

Publication

Better health information exploration

Publisher: ACM

Date: 07-12-2015

DOI: 10.1145/2846439.2846444

Publication

Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

Publisher: Springer Science and Business Media LLC

Date: 2008

DOI: 10.1186/GB-2008-9-S2-S11

Publication

BioC: a minimalist approach to interoperability for biomedical text processing

Publisher: Oxford University Press (OUP)

Date: 18-09-2013

DOI: 10.1093/DATABASE/BAT064

Publication

MPVNN: Mutated Pathway Visible Neural Network architecture for interpretable prediction of cancer-specific survival risk.

Publisher: Oxford University Press (OUP)

Date: 19-09-2022

DOI: 10.1093/BIOINFORMATICS/BTAC636

Abstract: Survival risk prediction using gene expression data is important in making treatment decisions in cancer. Standard neural network (NN) survival analysis models are black boxes with a lack of interpretability. More interpretable visible neural network architectures are designed using biological pathway knowledge. But they do not model how pathway structures can change for particular cancer types. We propose a novel Mutated Pathway Visible Neural Network (MPVNN) architecture, designed using prior signaling pathway knowledge and random replacement of known pathway edges using gene mutation data simulating signal flow disruption. As a case study, we use the PI3K-Akt pathway and demonstrate overall improved cancer-specific survival risk prediction of MPVNN over other similar-sized NN and standard survival analysis methods. We show that trained MPVNN architecture interpretation, which points to smaller sets of genes connected by signal flow within the PI3K-Akt pathway that is important in risk prediction for particular cancer types, is reliable. The data and code are available at ourabghoshroy/MPVNN. Supplementary data are available at Bioinformatics online.

Publication

What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages

Publisher: Association for Computational Linguistics

Date: 2014

DOI: 10.3115/V1/D14-1096

Publication

Concept annotation in the CRAFT corpus

Publisher: Springer Science and Business Media LLC

Date: 09-07-2012

DOI: 10.1186/1471-2105-13-161

Publication

Protein annotation as term categorization in the gene ontology using word proximity networks

Publisher: Springer Science and Business Media LLC

Date: 05-2005

DOI: 10.1186/1471-2105-6-S1-S20

Abstract: We participated in the BioCreAtIvE Task 2, which addressed the annotation of proteins into the Gene Ontology (GO) based on the text of a given document and the selection of evidence text from the document justifying that annotation. We approached the task utilizing several combinations of two distinct methods: an unsupervised algorithm for expanding words associated with GO nodes, and an annotation methodology which treats annotation as categorization of terms from a protein's document neighborhood into the GO. The evaluation results indicate that the method for expanding words associated with GO nodes is quite powerful we were able to successfully select appropriate evidence text for a given annotation in 38% of Task 2.1 queries by building on this method. The term categorization methodology achieved a precision of 16% for annotation within the correct extended family in Task 2.2, though we show through subsequent analysis that this can be improved with a different parameter setting. Our architecture proved not to be very successful on the evidence text component of the task, in the configuration used to generate the submitted results. The initial results show promise for both of the methods we explored, and we are planning to integrate the methods more closely to achieve better results overall.

Publication

Quantifying Semantic Similarity of Clinical Evidence in the Biomedical Literature to Facilitate Related Evidence Synthesis

Publisher: Elsevier BV

Date: 12-2019

DOI: 10.1016/J.JBI.2019.103321

Abstract: Published clinical trials and high quality peer reviewed medical publications are considered as the main sources of evidence used for synthesizing systematic reviews or practicing Evidence Based Medicine (EBM). Finding all relevant published evidence for a particular medical case is a time and labour intensive task, given the breadth of the biomedical literature. Automatic quantification of conceptual relationships between key clinical evidence within and across publications, despite variations in the expression of clinically-relevant concepts, can help to facilitate synthesis of evidence. In this study, we aim to provide an approach towards expediting evidence synthesis by quantifying semantic similarity of key evidence as expressed in the form of in idual sentences. Such semantic textual similarity can be applied as a key approach for supporting selection of related studies. We propose a generalisable approach for quantifying semantic similarity of clinical evidence in the biomedical literature, specifically considering the similarity of sentences corresponding to a given type of evidence, such as clinical interventions, population information, clinical findings, etc. We develop three sets of generic, ontology-based, and vector-space models of similarity measures that make use of a variety of lexical, conceptual, and contextual information to quantify the similarity of full sentences containing clinical evidence. To understand the impact of different similarity measures on the overall evidence semantic similarity quantification, we provide a comparative analysis of these measures when used as input to an unsupervised linear interpolation and a supervised regression ensemble. In order to provide a reliable test-bed for this experiment, we generate a dataset of 1000 pairs of sentences from biomedical publications that are annotated by ten human experts. We also extend the experiments on an external dataset for further generalisability testing. The combination of all erse similarity measures showed stronger correlations with the gold standard similarity scores in the dataset than any in idual kind of measure. Our approach reached near 0.80 average Pearson correlation across different clinical evidence types using the devised similarity measures. Although they were more effective when combined together, in idual generic and vector-space measures also resulted in strong similarity quantification when used in both unsupervised and supervised models. On the external dataset, our similarity measures were highly competitive with the state-of-the-art approaches developed and trained specifically on that dataset for predicting semantic similarity. Experimental results showed that the proposed semantic similarity quantification approach can effectively identify related clinical evidence that is reported in the literature. The comparison with a state-of-the-art method demonstrated the effectiveness of the approach, and experiments with an external dataset support its generalisability.

Publication

Automatic Consistency Assurance for Literature-based Gene Ontology Annotation

Publisher: Cold Spring Harbor Laboratory

Date: 27-05-2021

DOI: 10.1101/2021.05.26.445910

Abstract: Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. We provide detailed error analysis for demonstrating that the method achieves high precision on more confident predictions. Our approach demonstrates clear value for human-in-the-loop curation scenarios. The synthetic dataset, and the code for generating it are available at iyuc/BioConsistency .

Publication

Roles for text mining in protein function prediction.

Publisher: Springer New York

Date: 2014

DOI: 10.1007/978-1-4939-0709-0_6

Abstract: The Human Genome Project has provided science with a hugely valuable resource: the blueprints for life the specification of all of the genes that make up a human. While the genes have all been identified and deciphered, it is proteins that are the workhorses of the human body: they are essential to virtually all cell functions and are the primary mechanism through which biological function is carried out. Hence in order to fully understand what happens at a molecular level in biological organisms, and eventually to enable development of treatments for diseases where some aspect of a biological system goes awry, we must understand the functions of proteins. However, experimental characterization of protein function cannot scale to the vast amount of DNA sequence data now available. Computational protein function prediction has therefore emerged as a problem at the forefront of modern biology (Ra ojac et al., Nat Methods 10(13):221-227, 2013).Within the varied approaches to computational protein function prediction that have been explored, there are several that make use of biomedical literature mining. These methods take advantage of information in the published literature to associate specific proteins with specific protein functions. In this chapter, we introduce two main strategies for doing this: association of function terms, represented as Gene Ontology terms (Ashburner et al., Nat Genet 25(1):25-29, 2000), to proteins based on information in published articles, and a paradigm called LEAP-FS (Literature-Enhanced Automated Prediction of Functional Sites) in which literature mining is used to validate the predictions of an orthogonal computational protein function prediction method.

Publication

Applied Text Mining

Publisher: Springer New York

Date: 2013

DOI: 10.1007/978-1-4419-9863-7_133

Publication

Why Bother Enabling Biomedical Literature Analysis with Semantics?

Publisher: ACM

Date: 25-04-2022

DOI: 10.1145/3487553.3527164

Publication

Use of cefovecin in dogs and cats attending first‐opinion veterinary practices in Australia

Publisher: Wiley

Date: 21-08-2020

DOI: 10.1136/VR.105997

Publication

Automated Detection of Records in Biological Sequence Databases that are Inconsistent with the Literature

Publisher: Cold Spring Harbor Laboratory

Date: 18-01-2017

DOI: 10.1101/101246

Abstract: We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as “confident” or “suspicious” . Our experiments on the PubMed Central collection show assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with literature they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data, and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.

Publication

Describing the antimicrobial usage patterns of companion animal veterinary practices; free text analysis of more than 4.4 million consultation records.

Publisher: Public Library of Science (PLoS)

Date: 13-03-2020

DOI: 10.1371/JOURNAL.PONE.0230049

Publication

The gut microbiome is a significant risk factor for future chronic lung disease

Publisher: Cold Spring Harbor Laboratory

Date: 23-03-2022

DOI: 10.1101/2022.03.22.22272736

Abstract: The gut-lung axis is generally recognized, but there are few large studies of the gut microbiome and incident respiratory disease in adults. To investigate the associations between gut microbiome and respiratory disease and to construct predictive models from baseline gut microbiome profiles for incident asthma or chronic obstructive pulmonary disease (COPD). Shallow metagenomic sequencing was performed for stool s les from a prospective, population-based cohort (FINRISK02 N=7,115 adults) with linked national administrative health register derived classifications for incident asthma and COPD up to 15 years after baseline. Generalised linear models and Cox regressions were utilised to assess associations of microbial taxa and ersity with disease occurrence. Predictive models were constructed using machine learning with extreme gradient boosting. Models considered taxa abundances in idually and in combination with other risk factors, including sex, age, body mass index and smoking status. A total of 695 and 392 significant microbial associations at different taxonomic levels were found with incident asthma and COPD, respectively. Gradient boosting decision trees of baseline gut microbiome predicted incident asthma and COPD with mean area under the curves of 0.608 and 0.780, respectively. For both incident asthma and COPD, the baseline gut microbiome had C-indices of 0.623 for asthma and 0.817 for COPD, which were more predictive than other conventional risk factors. The integration of gut microbiome and conventional risk factors further improved prediction capacities. Subgroup analyses indicated gut microbiome was significantly associated with incident COPD in both current smokers and non-smokers, as well as in in iduals who reported never smoking. The gut microbiome is a significant risk factor for incident asthma and incident COPD and is largely independent of conventional risk factors.

Publication

Automatic consistency assurance for literature-based gene ontology annotation

Publisher: Springer Science and Business Media LLC

Date: 25-11-2021

DOI: 10.1186/S12859-021-04479-9

Abstract: Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. We provide detailed error analysis for demonstrating that the method achieves high precision on more confident predictions. Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. Our approach demonstrates clear value for human-in-the-loop curation scenarios.

Publication

Assessing the impact of case sensitivity and term information gain on biomedical concept recognition.

Publisher: Public Library of Science (PLoS)

Date: 19-03-2015

DOI: 10.1371/JOURNAL.PONE.0119091

Publication

U-Compare bio-event meta-service: compatible BioNLP event extraction services

Publisher: Springer Science and Business Media LLC

Date: 12-2011

DOI: 10.1186/1471-2105-12-481

Publication

Exploring automatic inconsistency detection for literature-based gene ontology annotation

Publisher: Oxford University Press (OUP)

Date: 24-06-2022

DOI: 10.1093/BIOINFORMATICS/BTAC230

Abstract: Literature-based gene ontology annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in the primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies in between literature as evidence and annotated GO terms can be identified these have not been systematically studied at record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This article presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detect inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows. The data underlying this article are available in Github at iyuc/AutoGOAConsistency.

Publication

Evaluating the Performance of Various Machine Learning Algorithms to Detect Subclinical Keratoconus

Publisher: Association for Research in Vision and Ophthalmology (ARVO)

Date: 24-04-2020

DOI: 10.1167/TVST.9.2.24

Publication

Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study.

Publisher: Oxford University Press (OUP)

Date: 2017

DOI: 10.1093/DATABASE/BAW163

Publication

Rev at SemEval-2016 Task 2: Aligning Chunks by Lexical, Part of Speech and Semantic Equivalence

Publisher: Association for Computational Linguistics

Date: 2016

DOI: 10.18653/V1/S16-1120

Publication

Approaches to verb subcategorization for biomedicine

Publisher: Elsevier BV

Date: 04-2013

DOI: 10.1016/J.JBI.2012.12.001

Abstract: Information about verb subcategorization frames (SCFs) is important to many tasks in natural language processing (NLP) and, in turn, text mining. Biomedicine has a need for high-quality SCF lexicons to support the extraction of information from the biomedical literature, which helps biologists to take advantage of the latest biomedical knowledge despite the overwhelming growth of that literature. Unfortunately, techniques for creating such resources for biomedical text are relatively undeveloped compared to general language. This paper serves as an introduction to subcategorization and existing approaches to acquisition, and provides motivation for developing techniques that address issues particularly important to biomedical NLP. First, we give the traditional linguistic definition of subcategorization, along with several related concepts. Second, we describe approaches to learning SCF lexicons from large data sets for general and biomedical domains. Third, we consider the crucial issue of linguistic variation between biomedical fields (subdomain variation). We demonstrate significant variation among subdomains, and find the variation does not simply follow patterns of general lexical variation. Finally, we note several requirements for future research in biomedical SCF lexicon acquisition: a high-quality gold standard, investigation of different definitions of subcategorization, and minimally-supervised methods that can learn subdomain-specific lexical usage without the need for extensive manual work.

Publication

A Bayesian reanalysis of the Standard versus Accelerated Initiation of Renal-Replacement Therapy in Acute Kidney Injury (STARRT-AKI) trial

Publisher: Springer Science and Business Media LLC

Date: 25-08-2022

DOI: 10.1186/S13054-022-04120-Y

Abstract: Timing of initiation of kidney-replacement therapy (KRT) in critically ill patients remains controversial. The Standard versus Accelerated Initiation of Renal-Replacement Therapy in Acute Kidney Injury (STARRT-AKI) trial compared two strategies of KRT initiation (accelerated versus standard) in critically ill patients with acute kidney injury and found neutral results for 90-day all-cause mortality. Probabilistic exploration of the trial endpoints may enable greater understanding of the trial findings. We aimed to perform a reanalysis using a Bayesian framework. We performed a secondary analysis of all 2927 patients randomized in multi-national STARRT-AKI trial, performed at 168 centers in 15 countries. The primary endpoint, 90-day all-cause mortality, was evaluated using hierarchical Bayesian logistic regression. A spectrum of priors includes optimistic, neutral, and pessimistic priors, along with priors informed from earlier clinical trials. Secondary endpoints (KRT-free days and hospital-free days) were assessed using zero–one inflated beta regression. The posterior probability of benefit comparing an accelerated versus a standard KRT initiation strategy for the primary endpoint suggested no important difference, regardless of the prior used (absolute difference of 0.13% [95% credible interval [CrI] − 3.30% 3.40%], − 0.39% [95% CrI − 3.46% 3.00%], and 0.64% [95% CrI − 2.53% 3.88%] for neutral, optimistic, and pessimistic priors, respectively). There was a very low probability that the effect size was equal or larger than a consensus-defined minimal clinically important difference. Patients allocated to the accelerated strategy had a lower number of KRT-free days (median absolute difference of − 3.55 days [95% CrI − 6.38 − 0.48]), with a probability that the accelerated strategy was associated with more KRT-free days of 0.008. Hospital-free days were similar between strategies, with the accelerated strategy having a median absolute difference of 0.48 more hospital-free days (95% CrI − 1.87 2.72) compared with the standard strategy and the probability that the accelerated strategy had more hospital-free days was 0.66. In a Bayesian reanalysis of the STARRT-AKI trial, we found very low probability that an accelerated strategy has clinically important benefits compared with the standard strategy. Patients receiving the accelerated strategy probably have fewer days alive and KRT-free. These findings do not support the adoption of an accelerated strategy of KRT initiation.

Publication

Automating Quality Assessment of Medical Evidence in Systematic Reviews: Model Development and Validation Study (Preprint)

Publisher: JMIR Publications Inc.

Date: 12-12-2021

DOI: 10.2196/PREPRINTS.35568

Abstract: ssessment of the quality of medical evidence available on the web is a critical step in the preparation of systematic reviews. Existing tools that automate parts of this task validate the quality of in idual studies but not of entire bodies of evidence and focus on a restricted set of quality criteria. e proposed a quality assessment task that provides an overall quality rating for each body of evidence (BoE), as well as finer-grained justification for different quality criteria according to the Grading of Recommendation, Assessment, Development, and Evaluation formalization framework. For this purpose, we constructed a new data set and developed a machine learning baseline system (EvidenceGRADEr). e algorithmically extracted quality-related data from all summaries of findings found in the Cochrane Database of Systematic Reviews. Each BoE was defined by a set of population, intervention, comparison, and outcome criteria and assigned a quality grade (high, moderate, low, or very low) together with quality criteria (justification) that influenced that decision. Different statistical data, metadata about the review, and parts of the review text were extracted as support for grading each BoE. After pruning the resulting data set with various quality checks, we used it to train several neural-model variants. The predictions were compared against the labels originally assigned by the authors of the systematic reviews. ur quality assessment data set, Cochrane Database of Systematic Reviews Quality of Evidence, contains 13,440 instances, or BoEs labeled for quality, originating from 2252 systematic reviews published on the internet from 2002 to 2020. On the basis of a 10-fold cross-validation, the best neural binary classifiers for quality criteria detected risk of bias at 0.78 i F /i sub /sub ( i P /i =.68 R=0.92) and imprecision at 0.75 i F /i sub /sub ( i P /i =.66 R=0.86), while the performance on inconsistency, indirectness, and publication bias criteria was lower ( i F /i sub /sub in the range of 0.3-0.4). The prediction of the overall quality grade into 1 of the 4 levels resulted in 0.5 i F /i sub /sub . When casting the task as a binary problem by merging the Grading of Recommendation, Assessment, Development, and Evaluation classes (high+moderate vs low+very low-quality evidence), we attained 0.74 i F /i sub /sub . We also found that the results varied depending on the supporting information that is provided as an input to the models. ifferent factors affect the quality of evidence in the context of systematic reviews of medical evidence. Some of these (risk of bias and imprecision) can be automated with reasonable accuracy. Other quality dimensions such as indirectness, inconsistency, and publication bias prove more challenging for machine learning, largely because they are much rarer. This technology could substantially reduce reviewer workload in the future and expedite quality assessment as part of evidence synthesis.

Publication

Benchmarks for Measurement of Duplicate Detection Methods in Nucleotide Databases

Publisher: Oxford University Press (OUP)

Date: 08-01-2016

DOI: 10.1093/DATABASE/BAW164

Abstract: Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study, we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to two data sections within the UniProt Knowledgebase (UniProtKB), UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For ex le, the benchmark derived from UniProtKB/Swiss-Prot mappings identifies more erse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection or record linkage methods that are required to help maintain these essential resources. Database URL : iodbqual/benchmarks

Publication

PHENOstruct: Prediction of human phenotype ontology terms using heterogeneous data sources

Publisher: F1000 Research Ltd

Date: 16-07-2015

DOI: 10.12688/F1000RESEARCH.6670.1

Abstract: The human phenotype ontology (HPO) was recently developed as a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein coding genes have HPO annotations. But, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. In this work we demonstrate the performance advantage of the structured SVM approach which was shown to be highly effective for Gene Ontology term prediction in comparison to several baseline methods. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large scale literature mining data.

Publication

Mutation extraction tools can be combined for robust recognition of genetic variants in the literature

Publisher: F1000 Research Ltd

Date: 10-06-2014

DOI: 10.12688/F1000RESEARCH.3-18.V2

Abstract: As the cost of genomic sequencing continues to fall, the amount of data being collected and studied for the purpose of understanding the genetic basis of disease is increasing dramatically. Much of the source information relevant to such efforts is available only from unstructured sources such as the scientific literature, and significant resources are expended in manually curating and structuring the information in the literature. As such, there have been a number of systems developed to target automatic extraction of mutations and other genetic variation from the literature using text mining tools. We have performed a broad survey of the existing publicly available tools for extraction of genetic variants from the scientific literature. We consider not just one tool but a number of different tools, in idually and in combination, and apply the tools in two scenarios. First, they are compared in an intrinsic evaluation context, where the tools are tested for their ability to identify specific mentions of genetic variants in a corpus of manually annotated papers, the Variome corpus. Second, they are compared in an extrinsic evaluation context based on our previous study of text mining support for curation of the COSMIC and InSiGHT databases. Our results demonstrate that no single tool covers the full range of genetic variants mentioned in the literature. Rather, several tools have complementary coverage and can be used together effectively. In the intrinsic evaluation on the Variome corpus, the combined performance is above 0.95 in F-measure, while in the extrinsic evaluation the combined recall performance is above 0.71 for COSMIC and above 0.62 for InSiGHT, a substantial improvement over the performance of any in idual tool. Based on the analysis of these results, we suggest several directions for the improvement of text mining tools for genetic variant extraction from the literature.

Publication

Testing Contextualized Word Embeddings to Improve NER in Spanish Clinical Case Narratives

Publisher: Research Square Platform LLC

Date: 05-02-2020

DOI: 10.21203/RS.2.22697/V1

Abstract: Background: In the Big Data era there is an increasing need to fully exploit and analyse the huge quantity of information available about health. Natural Language Processing (NLP) technologies can contribute to extract relevant information from unstructured data contained in Electronic Health Records (EHR) such as clinical notes, patient’s discharge summaries and radiology reports among others. Extracted information could help in health-related decision making processes. Named entity recognition (NER) devoted to detect important concepts in texts (diseases, symptoms, drugs, etc.) is a crucial task in information extraction processes especially in languages other than English. In this work, we develop a deep learning-based NLP pipeline for biomedical entity extraction in Spanish clinical narrative. Methods: We explore the use of contextualized word embeddings to enhance named entity recognition in Spanish language clinical text, particularly of pharmacological substances, compounds, and proteins. Various combinations of word and sense embeddings were tested on the evaluation corpus of the PharmacoNER 2019 task, the Spanish Clinical Case Corpus (SPACCC). This data set consists of clinical case sections derived from open access Spanish-language medical publications. Results: NER system integrates in-domain pre-trained Flair and FastText word embeddings, byte-pairwise encoded and the bi-LSTM-based character word embeddings. The system yielded the best performance measure with F-score of 90.84%. Error analysis showed that the main source of errors for the best model is the newly detected false positive entities with the half of that amount of errors belonged to longer than the actual ones detected entities. Conclusions: Our study shows that our deep-learning-based system with domain-specific contextualized embeddings coupled with stacking of complementary embeddings yields superior performance over the system with integrated standard and general-domain word embeddings. With this system, we achieve performance competitive with the state-of-the-art.

Publication

Thematic issue of the Second combined Bio-ontologies and Phenotypes Workshop

Publisher: Springer Science and Business Media LLC

Date: 12-2016

DOI: 10.1186/S13326-016-0108-7

Publication

Better Health Explorer

Publisher: ACM

Date: 07-12-2015

DOI: 10.1145/2838739.2838772

Publication

Integration of polygenic and gut metagenomic risk prediction for common diseases

Publisher: Cold Spring Harbor Laboratory

Date: 05-08-2023

DOI: 10.1101/2023.07.30.23293396

Abstract: Multi-omics has opened new avenues for non-invasive risk profiling and early detection of complex diseases. Both polygenic risk scores (PRSs) and the human microbiome have shown promise in improving risk assessment of various common diseases. Here, in a prospective population-based cohort (FINRISK 2002 n=5,676) with ∼18 years of e-health record follow-up, we assess the incremental and combined value of PRSs and gut metagenomic sequencing as compared to conventional risk factors for predicting incident coronary artery disease (CAD), type 2 diabetes (T2D), Alzheimer’s disease (AD) and prostate cancer. We found that PRSs improved predictive capacity over conventional risk factors for all diseases (ΔC-indices between 0.010 – 0.027). In sex-stratified analyses, gut metagenomics improved predictive capacity over baseline age for CAD, T2D and prostate cancer however, improvement over all conventional risk factors was only observed for T2D (ΔC-index 0.004) and prostate cancer (ΔC-index 0.005). Integrated risk models of PRSs, gut metagenomic scores and conventional risk factors achieved the highest predictive performance for all diseases studied as compared to models based on conventional risk factors alone. We make our integrated risk models available for the wider research community. This study demonstrates that integrated PRS and gut metagenomic risk models improve the predictive value over conventional risk factors for common chronic diseases.

Publication

Evaluation of consensus strategies for haplotype phasing

Publisher: Oxford University Press (OUP)

Date: 25-11-2020

DOI: 10.1093/BIB/BBAA280

Abstract: Haplotype phasing is a critical step for many genetic applications but incorrect estimates of phase can negatively impact downstream analyses. One proposed strategy to improve phasing accuracy is to combine multiple independent phasing estimates to overcome the limitations of any in idual estimate. However, such a strategy is yet to be thoroughly explored. This study provides a comprehensive evaluation of consensus strategies for haplotype phasing. We explore the performance of different consensus paradigms, and the effect of specific constituent tools, across several datasets with different characteristics and their impact on the downstream task of genotype imputation. Based on the outputs of existing phasing tools, we explore two different strategies to construct haplotype consensus estimators: voting across outputs from multiple phasing tools and multiple outputs of a single non-deterministic tool. We find that the consensus approach from multiple tools reduces SE by an average of 10% compared to any constituent tool when applied to European populations and has the highest accuracy regardless of population ethnicity, s le size, variant density or variant frequency. Furthermore, the consensus estimator improves the accuracy of the downstream task of genotype imputation carried out by the widely used Minimac3, pbwt and BEAGLE5 tools. Our results provide guidance on how to produce the most accurate phasing estimates and the trade-offs that a consensus approach may have. Our implementation of consensus haplotype phasing, consHap, is available freely at iadbkh/consHap. Supplementary information: Supplementary data are available at Briefings in Bioinformatics online.

Publication

COVID-19 Drug Repurposing: A Network-Based Framework for Exploring Biomedical Literature and Clinical Trials for Possible Treatments

Publisher: MDPI AG

Date: 04-03-2022

DOI: 10.3390/PHARMACEUTICS14030567

Abstract: Background: With the Coronavirus becoming a new reality of our world, global efforts continue to seek answers to many questions regarding the spread, variants, vaccinations, and medications. Particularly, with the emergence of several strains (e.g., Delta, Omicron), vaccines will need further development to offer complete protection against the new variants. It is critical to identify antiviral treatments while the development of vaccines continues. In this regard, the repurposing of already FDA-approved drugs remains a major effort. In this paper, we investigate the hypothesis that a combination of FDA-approved drugs may be considered as a candidate for COVID-19 treatment if (1) there exists an evidence in the COVID-19 biomedical literature that suggests such a combination, and (2) there is match in the clinical trials space that validates this drug combination. Methods: We present a computational framework that is designed for detecting drug combinations, using the following components (a) a Text-mining module: to extract drug names from the abstract section of the biomedical publications and the intervention/treatment sections of clinical trial records. (b) a network model constructed from the drug names and their associations, (c) a clique similarity algorithm to identify candidate drug treatments. Result and Conclusions: Our framework has identified treatments in the form of two, three, or four drug combinations (e.g., hydroxychloroquine, doxycycline, and azithromycin). The identifications of the various treatment candidates provided sufficient evidence that supports the trustworthiness of our hypothesis.

Publication

Machine learning with a reduced dimensionality representation of comprehensive Pentacam tomography parameters to identify subclinical keratoconus

Publisher: Elsevier BV

Date: 11-2021

DOI: 10.1016/J.COMPBIOMED.2021.104884

Publication

Summary of the BioLINK SIG 2013 meeting at ISMB/ECCB 2013.

Publisher: Oxford University Press (OUP)

Date: 14-07-2015

DOI: 10.1093/BIOINFORMATICS/BTU412

Abstract: The ISMB Special Interest Group on Linking Literature, Information and Knowledge for Biology (BioLINK) organized a one-day workshop at ISMB/ECCB 2013 in Berlin, Germany. The theme of the workshop was ‘Roles for text mining in biomedical knowledge discovery and translational medicine’. This summary reviews the outcomes of the workshop. Meeting themes included concept annotation methods and applications, extraction of biological relationships and the use of text-mined data for biological data analysis. Availability and implementation: All articles are available at roceedings-online/ . Contact: karin.verspoor@unimelb.edu.au

Publication

Acquisition and evaluation of verb subcategorization resources for biomedicine

Publisher: Elsevier BV

Date: 04-2013

DOI: 10.1016/J.JBI.2013.01.001

Abstract: Biomedical natural language processing (NLP) applications that have access to detailed resources about the linguistic characteristics of biomedical language demonstrate improved performance on tasks such as relation extraction and syntactic or semantic parsing. Such applications are important for transforming the growing unstructured information buried in the biomedical literature into structured, actionable information. In this paper, we address the creation of linguistic resources that capture how in idual biomedical verbs behave. We specifically consider verb subcategorization, or the tendency of verbs to "select" co-occurrence with particular phrase types, which influences the interpretation of verbs and identification of verbal arguments in context. There are currently a limited number of biomedical resources containing information about subcategorization frames (SCFs), and these are the result of either labor-intensive manual collation, or automatic methods that use tools adapted to a single biomedical subdomain. Either method may result in resources that lack coverage. Moreover, the quality of existing verb SCF resources for biomedicine is unknown, due to a lack of available gold standards for evaluation. This paper presents three new resources related to verb subcategorization frames in biomedicine, and four experiments making use of the new resources. We present the first biomedical SCF gold standards, capturing two different but widely-used definitions of subcategorization, and a new SCF lexicon, BioCat, covering a large number of biomedical sub-domains. We evaluate the SCF acquisition methodologies for BioCat with respect to the gold standards, and compare the results with the accuracy of the only previously existing automatically-acquired SCF lexicon for biomedicine, the BioLexicon. Our results show that the BioLexicon has greater precision while BioCat has better coverage of SCFs. Finally, we explore the definition of subcategorization using these resources and its implications for biomedical NLP. All resources are made publicly available. The SCF resources we have evaluated still show considerably lower accuracy than that reported with general English lexicons, demonstrating the need for domain- and subdomain-specific SCF acquisition tools for biomedicine. Our new gold standards reveal major differences when annotators use the different definitions. Moreover, evaluation of BioCat yields major differences in accuracy depending on the gold standard, demonstrating that the definition of subcategorization adopted will have a direct impact on perceived system accuracy for specific tasks.

Publication

Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes

Publisher: Springer Science and Business Media LLC

Date: 03-2017

DOI: 10.1186/S12859-017-1546-7

Publication

Duplicates, redundancies, and inconsistencies in the primary nucleotide databases: a descriptive study

Publisher: Cold Spring Harbor Laboratory

Date: 03-11-2016

DOI: 10.1101/085019

Abstract: GenBank, the EMBL European Nucleotide Archive, and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different in iduals, by different teams, with a range of technologies and assumptions, and over a period of decades. As a consequence, they contain a great many duplicates, redundancies, and inconsistencies, but neither the prevalence nor the characteristics of various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics only address specific duplicate types, with inconsistent assumptions and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds, and impact of duplicates in bioinformatics databases, through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset consisting of duplicates manually identified in INSDC – a dataset of 67,888 merged groups with 111,823 duplicate pairs across 21 organisms from INSDC databases – in terms of the prevalence, types, and impacts of duplicates. (2) We categorise duplicates at both sequence and annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show that the presence of duplicates has practical impact via a simple case study on duplicates, in terms of GC content and melting temperature. We demonstrate that duplicates not only introduce redundancy, but can lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases.

Publication

Big Data in Medicine Is Driving Big Changes

Publisher: Georg Thieme Verlag KG

Date: 08-2014

DOI: 10.15265/IY-2014-0020

Abstract: Objectives: To summarise current research that takes advantage of “Big Data” in health and biomedical informatics applications. Methods:Survey of trends in this work, and exploration of literature describing how large-scale structured and unstructured data sources are being used to support applications from clinical decision making and health policy, to drug design and pharmacovigilance, and further to systems biology and genetics. Results: The survey highlights ongoing development of powerful new methods for turning that large-scale, and often complex, data into information that provides new insights into human health, in a range of different areas. Consideration of this body of work identifies several important paradigm shifts that are facilitated by Big Data resources and methods: in clinical and translational research, from hypothesis-driven research to data-driven research, and in medicine, from evidence-based practice to practice-based evidence. Conclusions: The increasing scale and availability of large quantities of health data require strategies for data management, data linkage, and data integration beyond the limits of many existing information systems, and substantial effort is underway to meet those needs. As our ability to make sense of that data improves, the value of the data will continue to increase. Health systems, genetics and genomics, population and public health all areas of biomedicine stand to benefit from Big Data and the associated technologies.

Publication

Text mining electronic hospital records to automatically classify admissions against disease: Measuring the impact of linking data sources

Publisher: Elsevier BV

Date: 12-2016

DOI: 10.1016/J.JBI.2016.10.008

Abstract: Text and data mining play an important role in obtaining insights from Health and Hospital Information Systems. This paper presents a text mining system for detecting admissions marked as positive for several diseases: Lung Cancer, Breast Cancer, Colon Cancer, Secondary Malignant Neoplasm of Respiratory and Digestive Organs, Multiple Myeloma and Malignant Plasma Cell Neoplasms, Pneumonia, and Pulmonary Embolism. We specifically examine the effect of linking multiple data sources on text classification performance. Support Vector Machine classifiers are built for eight data source combinations, and evaluated using the metrics of Precision, Recall and F-Score. Sub-s ling techniques are used to address unbalanced datasets of medical records. We use radiology reports as an initial data source and add other sources, such as pathology reports and patient and hospital admission data, in order to assess the research question regarding the impact of the value of multiple data sources. Statistical significance is measured using the Wilcoxon signed-rank test. A second set of experiments explores aspects of the system in greater depth, focusing on Lung Cancer. We explore the impact of feature selection analyse the learning curve examine the effect of restricting admissions to only those containing reports from all data sources and examine the impact of reducing the sub-s ling. These experiments provide better understanding of how to best apply text classification in the context of imbalanced data of variable completeness. Radiology questions plus patient and hospital admission data contribute valuable information for detecting most of the diseases, significantly improving performance when added to radiology reports alone or to the combination of radiology and pathology reports. Overall, linking data sources significantly improved classification performance for all the diseases examined. However, there is no single approach that suits all scenarios the choice of the most effective combination of data sources depends on the specific disease to be classified.

Publication

Crowdsourcing critical appraisal of research evidence (CrowdCARE) was found to be a valid approach to assessing clinical research quality

Publisher: Elsevier BV

Date: 12-2018

DOI: 10.1016/J.JCLINEPI.2018.07.015

Abstract: We developed a free, online tool (CrowdCARE: crowdcare.unimelb.edu.au) to crowdsource research critical appraisal. The aim was to examine the validity of this approach for assessing the methodological quality of systematic reviews. In this prospective, cross-sectional study, a s le of systematic reviews (N = 71), of heterogeneous quality, was critically appraised using the Assessing the Methodological Quality of Systematic Reviews (AMSTAR) tool, in CrowdCARE, by five trained novice and two expert raters. After performing independent appraisals, experts resolved any disagreements by consensus (to produce an "expert consensus" rating, as the gold-standard approach). The expert consensus rating was within ±1 (on an 11-point scale) of the in idual expert ratings for 82% of studies and was within ±1 of the mean novice rating for 79% of studies. There was a strong correlation (r Crowdsourcing can be used to assess the quality of systematic reviews. Novices can be trained to appraise systematic reviews and, on average, achieve a high degree of accuracy relative to experts. These proof-of-concept data demonstrate the merit of crowdsourcing, compared with current gold standards of appraisal, and the potential capacity for this approach to transform evidence-based practice worldwide by sharing the appraisal load.

Publication

Combining heterogeneous data sources for accurate functional annotation of proteins

Publisher: Springer Science and Business Media LLC

Date: 02-2013

DOI: 10.1186/1471-2105-14-S3-S10

Publication

Detecting evidence of invasive fungal infections in cytology and histopathology reports enriched with concept-level annotations

Publisher: Elsevier BV

Date: 03-2023

DOI: 10.1016/J.JBI.2023.104293

Publication

A close look at protein function prediction evaluation protocols

Publisher: Oxford University Press (OUP)

Date: 14-09-2015

DOI: 10.1186/S13742-015-0082-5

Publication

Natural Language Processing

Publisher: Springer New York

Date: 2013

DOI: 10.1007/978-1-4419-9863-7_158

Publication

A categorical analysis of coreference resolution errors in biomedical texts.

Publisher: Elsevier BV

Date: 04-2016

DOI: 10.1016/J.JBI.2016.02.015

Abstract: Coreference resolution is an essential task in information extraction from the published biomedical literature. It supports the discovery of complex information by linking referring expressions such as pronouns and appositives to their referents, which are typically entities that play a central role in biomedical events. Correctly establishing these links allows detailed understanding of all the participants in events, and connecting events together through their shared participants. As an initial step towards the development of a novel coreference resolution system for the biomedical domain, we have categorised the characteristics of coreference relations by type of anaphor as well as broader syntactic and semantic characteristics, and have compared the performance of a domain adaptation of a state-of-the-art general system to published results from domain-specific systems in terms of this categorisation. We also develop a rule-based system for anaphoric coreference resolution in the biomedical domain with simple modules derived from available systems. Our results show that the domain-specific systems outperform the general system overall. Whilst this result is unsurprising, our proposed categorisation enables a detailed quantitative analysis of the system performance. We identify limitations of each system and find that there remain important gaps in the state-of-the-art systems, which are clearly identifiable with respect to the categorisation. We have analysed in detail the performance of existing coreference resolution systems for the biomedical literature and have demonstrated that there clear gaps in their coverage. The approach developed in the general domain needs to be tailored for portability to the biomedical domain. The specific framework for class-based error analysis of existing systems that we propose has benefits for identifying specific limitations of those systems. This in turn provides insights for further system development.

Publication

Artificial intelligence for clinical decision support in neurology

Publisher: Oxford University Press (OUP)

Date: 2020

DOI: 10.1093/BRAINCOMMS/FCAA096

Abstract: Artificial intelligence is one of the most exciting methodological shifts in our era. It holds the potential to transform healthcare as we know it, to a system where humans and machines work together to provide better treatment for our patients. It is now clear that cutting edge artificial intelligence models in conjunction with high-quality clinical data will lead to improved prognostic and diagnostic models in neurological disease, facilitating expert-level clinical decision tools across healthcare settings. Despite the clinical promise of artificial intelligence, machine and deep-learning algorithms are not a one-size-fits-all solution for all types of clinical data and questions. In this article, we provide an overview of the core concepts of artificial intelligence, particularly contemporary deep-learning methods, to give clinician and neuroscience researchers an appreciation of how artificial intelligence can be harnessed to support clinical decisions. We clarify and emphasize the data quality and the human expertise needed to build robust clinical artificial intelligence models in neurology. As artificial intelligence is a rapidly evolving field, we take the opportunity to iterate important ethical principles to guide the field of medicine is it moves into an artificial intelligence enhanced future.

Publication

High-precision biological event extraction with a concept recognizer

Publisher: Association for Computational Linguistics

Date: 2009

DOI: 10.3115/1572340.1572348

Publication

Ontology quality assurance through analysis of term transformations

Publisher: Oxford University Press (OUP)

Date: 27-05-2009

DOI: 10.1093/BIOINFORMATICS/BTP195

Abstract: Motivation: It is important for the quality of biological ontologies that similar concepts be expressed consistently, or univocally. Univocality is relevant for the usability of the ontology for humans, as well as for computational tools that rely on regularity in the structure of terms. However, in practice terms are not always expressed consistently, and we must develop methods for identifying terms that are not univocal so that they can be corrected. Results: We developed an automated transformation-based clustering methodology for detecting terms that use different linguistic conventions for expressing similar semantics. These term sets represent occurrences of univocality violations. Our method was able to identify 67 ex les of univocality violations in the Gene Ontology. Availability: The identified univocality violations are available upon request. We are preparing a release of an open source version of the software to be available at bionlp.sourceforge.net. Contact: karin.verspoor@ucdenver.edu

Publication

Web Forum Retrieval and Text Analytics: A Survey

Publisher: now Publishers Inc

Date: 2018

DOI: 10.1561/9781680833515

Publication

Automated Detection of Records in Biological Sequence Databases that are Inconsistent with the Literature

Publisher: Elsevier BV

Date: 07-2017

DOI: 10.1016/J.JBI.2017.06.015

Abstract: We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with literature they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data, and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.

Publication

Tasks as needs: reframing the paradigm of clinical natural language processing research for real-world decision support

Publisher: Oxford University Press (OUP)

Date: 14-07-2022

DOI: 10.1093/JAMIA/OCAC121

Abstract: Electronic medical records are increasingly used to store patient information in hospitals and other clinical settings. There has been a corresponding proliferation of clinical natural language processing (cNLP) systems aimed at using text data in these records to improve clinical decision-making, in comparison to manual clinician search and clinical judgment alone. However, these systems have delivered marginal practical utility and are rarely deployed into healthcare settings, leading to proposals for technical and structural improvements. In this paper, we argue that this reflects a violation of Friedman’s “Fundamental Theorem of Biomedical Informatics,” and that a deeper epistemological change must occur in the cNLP field, as a parallel step alongside any technical or structural improvements. We propose that researchers shift away from designing cNLP systems independent of clinical needs, in which cNLP tasks are ends in themselves—“tasks as decisions”—and toward systems that are directly guided by the needs of clinicians in realistic decision-making contexts—“tasks as needs.” A case study ex le illustrates the potential benefits of developing cNLP systems that are designed to more directly support clinical needs.

Publication

Testing Contextualized Word Embeddings to Improve NER in Spanish Clinical Case Narratives

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 2020

DOI: 10.1109/ACCESS.2020.3018688

Publication

An expanded evaluation of protein function prediction methods shows an improvement in accuracy

Publisher: Springer Science and Business Media LLC

Date: 07-09-2016

DOI: 10.1186/S13059-016-1037-6

Publication

CommViz: Visualization of semantic patterns in large social communication networks

Publisher: SAGE Publications

Date: 02-2017

DOI: 10.1177/1473871617693039

Abstract: We introduce CommViz, an information visualization tool that enables complex communication networks to be explored, exposing trends and patterns in the data at a glance. We adapt a visualization approach known as hive plots to reflect the semantic structure of the networks, a generalization we call semantic hive plots. The method efficiently organizes and provides insight into complex, high-dimensional communication data such as email and messages on social media. We present the architecture of the CommViz tool and its application to the Enron email corpus as a case study, demonstrating how the structure of the visualization enables investigation of patterns and relationships in a large set of messages. We also provide a user study performed with Amazon Mechanical Turk that shows the value of the tool for certain complex data interrogations and further show how the incorporation of semantic structure on semantic coordinates can also be applied to parallel coordinates visualization. The integration of the social network characteristics with semantic attributes of the underlying data in a single visualization is, to our knowledge, a novel contribution of the work. The tool can be accessed at commviz.eng.unimelb.edu.au . Code is available at eadbiomed/commviz . The Enron email corpus is available from nron_email.html .

Publication

TOWARDS EARLY DISCOVERY OF SALIENT HEALTH THREATS: A SOCIAL MEDIA EMOTION CLASSIFICATION TECHNIQUE

Publisher: WORLD SCIENTIFIC

Date: 18-11-2015

DOI: 10.1142/9789814749411_0046

Publication

Automated assessment of biological database assertions using the scientific literature.

Publisher: Springer Science and Business Media LLC

Date: 29-04-2019

DOI: 10.1186/S12859-019-2801-X

Publication

A categorization approach to automated ontological function annotation

Publisher: Wiley

Date: 06-2006

DOI: 10.1110/PS.062184006

Publication

The randomized information coefficient: assessing dependencies in noisy data

Publisher: Springer Science and Business Media LLC

Date: 19-09-2018

DOI: 10.1007/S10994-017-5664-2

Publication

Pattern Learning through Distant Supervision for Extraction of Protein-Residue Associations in the Biomedical Literature

Publisher: IEEE

Date: 12-2011

DOI: 10.1109/ICMLA.2011.112

Publication

Search Effectiveness in Nonredundant Sequence Databases: Assessments and Solutions

Publisher: Mary Ann Liebert Inc

Date: 06-2019

DOI: 10.1089/CMB.2018.0198

Publication

Web Forum Retrieval and Text Analytics: A Survey

Publisher: Now Publishers

Date: 2018

DOI: 10.1561/1500000062

Publication

A two-tiered unsupervised clustering approach for drug repositioning through heterogeneous data integration

Publisher: Springer Science and Business Media LLC

Date: 11-04-2018

DOI: 10.1186/S12859-018-2123-4

Publication

Optimizing graph-based patterns to extract biomedical events from the literature.

Publisher: Springer Science and Business Media LLC

Date: 30-10-2015

DOI: 10.1186/1471-2105-16-S16-S2

Publication

Literature mining of genetic variants for curation: quantifying the importance of supplementary material

Publisher: Oxford University Press (OUP)

Date: 10-02-2014

DOI: 10.1093/DATABASE/BAU003

Publication

The ChEMU 2022 Evaluation Campaign: Information Extraction in Chemical Patents

Publisher: Springer International Publishing

Date: 2022

DOI: 10.1007/978-3-030-99739-7_50

Publication

Exploiting Term Relations for Semantic Hierarchy Construction

Publisher: IEEE

Date: 08-2008

DOI: 10.1109/ICSC.2008.68

Publication

Graph embedding-based link prediction for literature-based discovery in Alzheimer’s Disease

Publisher: Elsevier BV

Date: 09-2023

DOI: 10.1016/J.JBI.2023.104464

Publication

Visualization and Language Processing for Supporting Analysis across the Biomedical Literature

Publisher: Springer Berlin Heidelberg

Date: 2010

DOI: 10.1007/978-3-642-15384-6_45

Publication

Analysis of predictive performance and reliability of classifiers for quality assessment of medical evidence revealed important variation by medical area

Publisher: Elsevier BV

Date: 07-2023

DOI: 10.1016/J.JCLINEPI.2023.04.006

Publication

Exploiting graph kernels for high performance biomedical relation extraction.

Publisher: Springer Science and Business Media LLC

Date: 30-01-2018

DOI: 10.1186/S13326-017-0168-3

Publication

Learning Biological Sequence Types Using the Literature

Publisher: ACM

Date: 06-11-2017

DOI: 10.1145/3132847.3133051

Publication

Attention-based multimodal fusion with contrast for robust clinical prediction in the face of missing modalities

Publisher: Elsevier BV

Date: 09-2023

DOI: 10.1016/J.JBI.2023.104466

Publication

Unstructured Information Management Architecture (UIMA)

Publisher: Springer New York

Date: 2013

DOI: 10.1007/978-1-4419-9863-7_183

Publication

Detection of self-harm and suicidal ideation in emergency department triage notes

Publisher: Oxford University Press (OUP)

Date: 13-12-2021

DOI: 10.1093/JAMIA/OCAB261

Abstract: Accurate identification of self-harm presentations to Emergency Departments (ED) can lead to more timely mental health support, aid in understanding the burden of suicidal intent in a population, and support impact evaluation of public health initiatives related to suicide prevention. Given lack of manual self-harm reporting in ED, we aim to develop an automated system for the detection of self-harm presentations directly from ED triage notes. We frame this as supervised classification using natural language processing (NLP), utilizing a large data set of 477 627 free-text triage notes from ED presentations in 2012–2018 to The Royal Melbourne Hospital, Australia. The data were highly imbalanced, with only 1.4% of triage notes relating to self-harm. We explored various preprocessing techniques, including spelling correction, negation detection, bigram replacement, and clinical concept recognition, and several machine learning methods. Our results show that machine learning methods dramatically outperform keyword-based methods. We achieved the best results with a calibrated Gradient Boosting model, showing 90% Precision and 90% Recall (PR-AUC 0.87) on blind test data. Prospective validation of the model achieves similar results (88% Precision 89% Recall). ED notes are noisy texts, and simple token-based models work best. Negation detection and concept recognition did not change the results while bigram replacement significantly impaired model performance. This first NLP-based classifier for self-harm in ED notes has practical value for identifying patients who would benefit from mental health follow-up in ED, and for supporting surveillance of self-harm and suicide prevention efforts in the population.

Publication

Overview of ChEMU 2022 Evaluation Campaign: Information Extraction in Chemical Patents

Publisher: Springer International Publishing

Date: 2022

DOI: 10.1007/978-3-031-13643-6_30

Publication

Multi-field query expansion is effective for biomedical dataset retrieval.

Publisher: Oxford University Press (OUP)

Date: 2017

DOI: 10.1093/DATABASE/BAX062

Publication

Knowledge Integration in OpenWorlds: Utilizing the Mathematics of Hierarchical Structure

Publisher: IEEE

Date: 09-2007

DOI: 10.1109/ICSC.2007.82

Publication

Evaluating the dose, indication and agreement with guidelines of antimicrobial use in companion animal practice with natural language processing

Publisher: Oxford University Press (OUP)

Date: 02-2022

DOI: 10.1093/JACAMR/DLAB194

Abstract: As antimicrobial prescribers, veterinarians contribute to the emergence of MDR pathogens. Antimicrobial stewardship programmes are an effective means of reducing the rate of development of antimicrobial resistance. A key component of antimicrobial stewardship programmes is selecting an appropriate antimicrobial agent for the presenting complaint and using an appropriate dose rate for an appropriate duration. To describe antimicrobial usage, including dose, for common indications for antimicrobial use in companion animal practice. Natural language processing (NLP) techniques were applied to extract and analyse clinical records. A total of 343 668 records for dogs and 109 719 records for cats administered systemic antimicrobials from 1 January 2013 to 31 December 2017 were extracted from the database. The NLP algorithms extracted dose, duration of therapy and diagnosis completely for 133 046 (39%) of the records for dogs and 40 841 records for cats (37%). The remaining records were missing one or more of these elements in the clinical data. The most common reason for antimicrobial administration was skin disorders (n = 66 198, 25%) and traumatic injuries (n = 15 932, 19%) in dogs and cats, respectively. Dose was consistent with guideline recommendations in 73% of cases where complete clinical data were available. Automated extraction using NLP methods is a powerful tool to evaluate large datasets and to enable veterinarians to describe the reasons that antimicrobials are administered. However, this can only be determined when the data presented in the clinical record are complete, which was not the case in most instances in this dataset. Most importantly, the dose administered varied and was often not consistent with guideline recommendations.

Publication

Accuracy of Machine Learning Assisted Detection of Keratoconus: A Systematic Review and Meta-Analysis

Publisher: MDPI AG

Date: 18-01-2022

DOI: 10.3390/JCM11030478

Abstract: (1) Background: The objective of this review was to synthesize available data on the use of machine learning to evaluate its accuracy (as determined by pooled sensitivity and specificity) in detecting keratoconus (KC), and measure reporting completeness of machine learning models in KC based on TRIPOD (the transparent reporting of multivariable prediction models for in idual prognosis or diagnosis) statement. (2) Methods: Two independent reviewers searched the electronic databases for all potential articles on machine learning and KC published prior to 2021. The TRIPOD 29-item checklist was used to evaluate the adherence to reporting guidelines of the studies, and the adherence rate to each item was computed. We conducted a meta-analysis to determine the pooled sensitivity and specificity of machine learning models for detecting KC. (3) Results: Thirty-five studies were included in this review. Thirty studies evaluated machine learning models for detecting KC eyes from controls and 14 studies evaluated machine learning models for detecting early KC eyes from controls. The pooled sensitivity for detecting KC was 0.970 (95% CI 0.949–0.982), with a pooled specificity of 0.985 (95% CI 0.971–0.993), whereas the pooled sensitivity of detecting early KC was 0.882 (95% CI 0.822–0.923), with a pooled specificity of 0.947 (95% CI 0.914–0.967). Between 3% and 48% of TRIPOD items were adhered to in studies, and the average (median) adherence rate for a single TRIPOD item was 23% across all studies. (4) Conclusions: Application of machine learning model has the potential to make the diagnosis and monitoring of KC more efficient, resulting in reduced vision loss to the patients. This review provides current information on the machine learning models that have been developed for detecting KC and early KC. Presently, the machine learning models performed poorly in identifying early KC from control eyes and many of these research studies did not follow established reporting standards, thus resulting in the failure of these clinical translation of these machine learning models. We present possible approaches for future studies for improvement in studies related to both KC and early KC models to more efficiently and widely utilize machine learning models for diagnostic process.

Publication

Approximate Subgraph Matching-Based Literature Mining for Biomedical Events and Relations

Publisher: Public Library of Science (PLoS)

Date: 17-04-2013

DOI: 10.1371/JOURNAL.PONE.0060954

Publication

DETECTION OF PROTEIN CATALYTIC SITES IN THE BIOMEDICAL LITERATURE

Publisher: WORLD SCIENTIFIC

Date: 11-2012

DOI: 10.1142/9789814447973_0042

Publication

Automating Quality Assessment of Medical Evidence in Systematic Reviews: Model Development and Validation Study

Publisher: JMIR Publications Inc.

Date: 13-03-2023

DOI: 10.2196/35568

Abstract: Assessment of the quality of medical evidence available on the web is a critical step in the preparation of systematic reviews. Existing tools that automate parts of this task validate the quality of in idual studies but not of entire bodies of evidence and focus on a restricted set of quality criteria. We proposed a quality assessment task that provides an overall quality rating for each body of evidence (BoE), as well as finer-grained justification for different quality criteria according to the Grading of Recommendation, Assessment, Development, and Evaluation formalization framework. For this purpose, we constructed a new data set and developed a machine learning baseline system (EvidenceGRADEr). We algorithmically extracted quality-related data from all summaries of findings found in the Cochrane Database of Systematic Reviews. Each BoE was defined by a set of population, intervention, comparison, and outcome criteria and assigned a quality grade (high, moderate, low, or very low) together with quality criteria (justification) that influenced that decision. Different statistical data, metadata about the review, and parts of the review text were extracted as support for grading each BoE. After pruning the resulting data set with various quality checks, we used it to train several neural-model variants. The predictions were compared against the labels originally assigned by the authors of the systematic reviews. Our quality assessment data set, Cochrane Database of Systematic Reviews Quality of Evidence, contains 13,440 instances, or BoEs labeled for quality, originating from 2252 systematic reviews published on the internet from 2002 to 2020. On the basis of a 10-fold cross-validation, the best neural binary classifiers for quality criteria detected risk of bias at 0.78 F1 (P=.68 R=0.92) and imprecision at 0.75 F1 (P=.66 R=0.86), while the performance on inconsistency, indirectness, and publication bias criteria was lower (F1 in the range of 0.3-0.4). The prediction of the overall quality grade into 1 of the 4 levels resulted in 0.5 F1. When casting the task as a binary problem by merging the Grading of Recommendation, Assessment, Development, and Evaluation classes (high+moderate vs low+very low-quality evidence), we attained 0.74 F1. We also found that the results varied depending on the supporting information that is provided as an input to the models. Different factors affect the quality of evidence in the context of systematic reviews of medical evidence. Some of these (risk of bias and imprecision) can be automated with reasonable accuracy. Other quality dimensions such as indirectness, inconsistency, and publication bias prove more challenging for machine learning, largely because they are much rarer. This technology could substantially reduce reviewer workload in the future and expedite quality assessment as part of evidence synthesis.

Publication

The Use of Web-Based Technologies in Health Research Participation: Qualitative Study of Consumer and Researcher Experiences (Preprint)

Publisher: JMIR Publications Inc.

Date: 02-09-2018

DOI: 10.2196/PREPRINTS.12094

Abstract: ealth consumers are often targeted for their involvement in health research including randomized controlled trials, focus groups, interviews, and surveys. However, as reported by many studies, recruitment and engagement of consumers in academic research remains challenging. In addition, there is scarce literature describing what consumers look for and want to achieve by participating in research. nderstanding and responding to the needs of consumers is crucial to the success of health research projects. In this study, we aim to understand consumers’ needs and investigate the opportunities for addressing these needs with Web-based technologies, particularly in the use of Web-based research registers and social networking sites (SNSs). e undertook a qualitative approach, interviewing both consumer and medical researchers in this study. With the help from an Australian-based organization supporting people with musculoskeletal conditions, we successfully interviewed 23 consumers and 10 researchers. All interviews were transcribed and analyzed with thematic analysis methodology. Data collection was stopped after the data themes reached saturation. e found that consumers perceive research as a learning opportunity and, therefore, expect high research transparency and regular updates. They also consider the sources of the information about research projects, the trust between consumers and researchers, and the mobility of consumers before participating in any research. Researchers need to be aware of such needs when designing a c aign for recruitment for their studies. On the other hand, researchers have attempted to establish a rapport with consumer participants, design research for consumers’ needs, and use technologies to reach out to consumers. A systematic approach to integrating a variety of technologies is needed. n the basis of the feedback from both consumers and researchers, we propose 3 future directions to use Web-based technologies for addressing consumers’ needs and engaging with consumers in health research: (1) researchers can make use of consumer registers and Web-based research portals, (2) SNSs and new media should be frequently used as an aid, and (3) new technologies should be adopted to remotely collect data and reduce administrative work for obtaining consumers’ consent.

Publication

Bow-tie architecture of gene regulatory networks in species of varying complexity

Publisher: The Royal Society

Date: 06-2021

DOI: 10.1098/RSIF.2021.0069

Abstract: The gene regulatory network (GRN) architecture plays a key role in explaining the biological differences between species. We aim to understand species differences in terms of some universally present dynamical properties of their gene regulatory systems. A network architectural feature associated with controlling system-level dynamical properties is the bow-tie, identified by a strongly connected subnetwork, the core layer, between two sets of nodes, the in and the out layers. Though a bow-tie architecture has been observed in many networks, its existence has not been extensively investigated in GRNs of species of widely varying biological complexity. We analyse publicly available GRNs of several well-studied species from prokaryotes to unicellular eukaryotes to multicellular organisms. In their GRNs, we find the existence of a bow-tie architecture with a distinct largest strongly connected core layer. We show that the bow-tie architecture is a characteristic feature of GRNs. We observe an increasing trend in the relative core size with species complexity. Using studied relationships of the core size with dynamical properties like robustness and fragility, flexibility, criticality, controllability and evolvability, we hypothesize how these regulatory system properties have emerged differently with biological complexity, based on the observed differences of the GRN bow-tie architectures.

Publication

Comparative Analysis of Sequence Clustering Methods for Deduplication of Biological Databases

Publisher: Association for Computing Machinery (ACM)

Date: 30-09-2017

DOI: 10.1145/3131611

Abstract: The massive volumes of data in biological sequence databases provide a remarkable resource for large-scale biological studies. However, the underlying data quality of these resources is a critical concern. A particular challenge is duplication, in which multiple records have similar sequences, creating a high level of redundancy that impacts database storage, curation, and search. Biological database deduplication has two direct applications: for database curation, where detected duplicates are removed to improve curation efficiency, and for database search, where detected duplicate sequences may be flagged but remain available to support analysis. Clustering methods have been widely applied to biological sequences for database deduplication. Since an exhaustive all-by-all pairwise comparison of sequences cannot scale for a high volume of data, heuristic approaches have been recruited, such as the use of simple similarity thresholds. In this article, we present a comparison between CD-HIT and UCLUST, the two best-known clustering tools for sequence database deduplication. Our contributions include a detailed assessment of the redundancy remaining after deduplication, application of standard clustering evaluation metrics to quantify the cohesion and separation of the clusters generated by each method, and a biological case study that assesses intracluster function annotation consistency to demonstrate the impact of these factors on a practical application of the sequence clustering methods. Our results show that the trade-off between efficiency and accuracy becomes acute when low threshold values are used and when cluster sizes are large. This evaluation leads to practical recommendations for users for more effective uses of the sequence clustering tools for deduplication.

Publication

Potentials of Large Language Models in Healthcare: A Delphi Study (Preprint)

Publisher: JMIR Publications Inc.

Date: 02-09-2023

DOI: 10.2196/PREPRINTS.52399

Publication

An Improved Neural Network Model for Joint

Publisher: Association for Computational Linguistics

Date: 2018

DOI: 10.18653/V1/K18-2008

Publication

Exploring Species-Based Strategies for Gene Normalization

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 07-2010

DOI: 10.1109/TCBB.2010.48

Publication

CQADupStack

Publisher: ACM

Date: 08-12-2015

DOI: 10.1145/2838931.2838934

Publication

Designing Health Websites Based on Users’ Web-Based Information-Seeking Behaviors: A Mixed-Method Observational Study

Publisher: JMIR Publications Inc.

Date: 06-06-2016

DOI: 10.2196/JMIR.5661

Publication

Special issue on bio-ontologies and phenotypes

Publisher: Springer Science and Business Media LLC

Date: 12-2015

DOI: 10.1186/S13326-015-0040-2

Publication

Classification performance of administrative coding data for detection of invasive fungal infection in paediatric cancer patients

Publisher: Public Library of Science (PLoS)

Date: 09-09-2020

DOI: 10.1371/JOURNAL.PONE.0238889

Publication

Semantic-Based Policy Composition for Privacy-Demanding Data Linkage

Publisher: IEEE

Date: 08-2018

DOI: 10.1109/TRUSTCOM/BIGDATASE.2018.00060

Publication

PoLoBag: Polynomial Lasso Bagging for signed gene regulatory network inference from expression data

Publisher: Oxford University Press (OUP)

Date: 22-07-2020

DOI: 10.1093/BIOINFORMATICS/BTAA651

Abstract: Inferring gene regulatory networks (GRNs) from expression data is a significant systems biology problem. A useful inference algorithm should not only unveil the global structure of the regulatory mechanisms but also the details of regulatory interactions such as edge direction (from regulator to target) and sign (activation/inhibition). Many popular GRN inference algorithms cannot infer edge signs, and those that can infer signed GRNs cannot simultaneously infer edge directions or network cycles. To address these limitations of existing algorithms, we propose Polynomial Lasso Bagging (PoLoBag) for signed GRN inference with both edge directions and network cycles. PoLoBag is an ensemble regression algorithm in a bagging framework where Lasso weights estimated on bootstrap s les are averaged. These bootstrap s les incorporate polynomial features to capture higher-order interactions. Results demonstrate that PoLoBag is consistently more accurate for signed inference than state-of-the-art algorithms on simulated and real-world expression datasets. Algorithm and data are freely available at ourabghoshroy/PoLoBag. Supplementary data are available at Bioinformatics online.

Publication

Evaluating a variety of text-mined features for automatic protein function prediction with GOstruct

Publisher: Springer Science and Business Media LLC

Date: 2015

DOI: 10.1186/S13326-015-0006-4

Publication

LARGE-SCALE TESTING OF BIBLIOME INFORMATICS USING PFAM PROTEIN FAMILIES

Publisher: WORLD SCIENTIFIC

Date: 12-2005

DOI: 10.1142/9789812701626_0008

Publication

ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents

Publisher: Frontiers Media SA

Date: 25-03-2021

DOI: 10.3389/FRMA.2021.654438

Abstract: Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition , requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions and (2) event extraction , which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in s ling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.

Publication

Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine

Publisher: Oxford University Press (OUP)

Date: 2019

DOI: 10.1093/DATABASE/BAY147

Publication

Overcoming challenges in extracting prescribing habits from veterinary clinics using big data and deep learning

Publisher: Wiley

Date: 25-01-2022

DOI: 10.1111/AVJ.13145

Abstract: Understanding antimicrobial usage patterns and encouraging appropriate antimicrobial usage is a critical component of antimicrobial stewardship. Studies using VetCompass Australia and Natural Language Processing (NLP) have demonstrated antimicrobial usage patterns in companion animal practices across Australia. Doing so has highlighted the many obstacles and barriers to the task of converting raw clinical notes into a format that can be readily queried and analysed. We developed NLP systems using rules‐based algorithms and machine learning to automate the extraction of data describing the key elements to assess appropriate antimicrobial use. These included the clinical indication, antimicrobial agent selection, dose and duration of therapy. Our methods were applied to over 4.4 million companion animal clinical records across Australia on all consultations with antimicrobial use to help us understand what antibiotics are being given and why on a population level. Of these, approximately only 40% recorded the reason why antimicrobials were prescribed, along with the dose and duration of treatment. NLP and deep learning might be able to overcome the difficulties of harvesting free text data from clinical records, but when the essential data are not recorded in the clinical records, then, this becomes an insurmountable obstacle.

Publication

Supervised Learning for Detection of Duplicates in Genomic Sequence Databases

Publisher: Public Library of Science (PLoS)

Date: 04-08-2016

DOI: 10.1371/JOURNAL.PONE.0159644

Publication

Benchmarks for Measurement of Duplicate Detection Methods in Nucleotide Databases

Publisher: Cold Spring Harbor Laboratory

Date: 03-11-2016

DOI: 10.1101/085324

Abstract: Duplication of information in databases is a major data quality challenge. The presence of duplicates, implying either redundancy or inconsistency, can have a range of impacts on the quality of analyses that use the data. To provide a sound basis for research on this issue in databases of nucleotide sequences, we have developed new, large-scale validated collections of duplicates, which can be used to test the effectiveness of duplicate detection methods. Previous collections were either designed primarily to test efficiency, or contained only a limited number of duplicates of limited kinds. To date, duplicate detection methods have been evaluated on separate, inconsistent benchmarks, leading to results that cannot be compared and, due to limitations of the benchmarks, of questionable generality. In this study we present three nucleotide sequence database benchmarks, based on information drawn from a range of resources, including information derived from mapping to Swiss-Prot and TrEMBL. Each benchmark has distinct characteristics. We quantify these characteristics and argue for their complementary value in evaluation. The benchmarks collectively contain a vast number of validated biological duplicates the largest has nearly half a billion duplicate pairs (although this is probably only a tiny fraction of the total that is present). They are also the first benchmarks targeting the primary nucleotide databases. The records include the 21 most heavily studied organisms in molecular biology research. Our quantitative analysis shows that duplicates in the different benchmarks, and in different organisms, have different characteristics. It is thus unreliable to evaluate duplicate detection methods against any single benchmark. For ex le, the benchmark derived from Swiss-Prot mappings identifies more erse types of duplicates, showing the importance of expert curation, but is limited to coding sequences. Overall, these benchmarks form a resource that we believe will be of great value for development and evaluation of the duplicate detection methods that are required to help maintain these essential resources. Availability : The benchmark data sets are available at iodbqual/benchmarks .

Publication

Towards a semantic lexicon for biological language processing

Publisher: Hindawi Limited

Date: 2005

DOI: 10.1002/CFG.451

Abstract: This paper explores the use of the resources in the National Library of Medicine's Unified Medical Language System (UMLS) for the construction of a lexicon useful for processing texts in the field of molecular biology. A lexicon is constructed from overlapping terms in the UMLS SPECIALIST lexicon and the UMLS Metathesaurus to obtain both morphosyntactic and semantic information for terms, and the coverage of a domain corpus is assessed. Over 77% of tokens in the domain corpus are found in the constructed lexicon, validating the lexicon's coverage of the most frequent terms in the domain and indicating that the constructed lexicon is potentially an important resource for biological text processing.

Publication

Biomedical Text Mining: State-of-the-Art, Open Problems and Future Challenges

Publisher: Springer Berlin Heidelberg

Date: 2014

DOI: 10.1007/978-3-662-43968-5_16

Publication

Brief Description of COVID-SEE: The Scientific Evidence Explorer for COVID-19 Related Research

Publisher: Springer International Publishing

Date: 2021

DOI: 10.1007/978-3-030-72240-1_65

Publication

Dynamic document delivery: generating natural language texts on demand

Publisher: IEEE Comput. Soc

Date: 1998

DOI: 10.1109/DEXA.1998.707392

Publication

From POS tagging to dependency parsing for biomedical event extraction

Publisher: Springer Science and Business Media LLC

Date: 12-02-2019

DOI: 10.1186/S12859-019-2604-0

Publication

Stratification of keratoconus progression using unsupervised machine learning analysis of tomographical parameters

Publisher: Elsevier BV

Date: 2023

DOI: 10.1016/J.IBMED.2023.100095

Publication

Gene Ontology synonym generation rules lead to increased performance in biomedical concept recognition.

Publisher: Springer Science and Business Media LLC

Date: 09-09-2016

DOI: 10.1186/S13326-016-0096-7

Publication

Development of a Self-Harm Monitoring System for Victoria

Publisher: MDPI AG

Date: 15-12-2020

DOI: 10.3390/IJERPH17249385

Abstract: The prevention of suicide and suicide-related behaviour are key policy priorities in Australia and internationally. The World Health Organization has recommended that member states develop self-harm surveillance systems as part of their suicide prevention efforts. This is also a priority under Australia’s Fifth National Mental Health and Suicide Prevention Plan. The aim of this paper is to describe the development of a state-based self-harm monitoring system in Victoria, Australia. In this system, data on all self-harm presentations are collected from eight hospital emergency departments in Victoria. A natural language processing classifier that uses machine learning to identify episodes of self-harm is currently being developed. This uses the free-text triage case notes, together with certain structured data fields, contained within the metadata of the incoming records. Post-processing is undertaken to identify primary mechanism of injury, substances consumed (including alcohol, illicit drugs and pharmaceutical preparations) and presence of psychiatric disorders. This system will ultimately leverage routinely collected data in combination with advanced artificial intelligence methods to support robust community-wide monitoring of self-harm. Once fully operational, this system will provide accurate and timely information on all presentations to participating emergency departments for self-harm, thereby providing a useful indicator for Australia’s suicide prevention efforts.

Publication

Associating disease-related genetic variants in intergenic regions to the genes they impact

Publisher: PeerJ

Date: 23-10-2014

DOI: 10.7717/PEERJ.639

Publication

Annotating the biomedical literature for the human variome

Publisher: Oxford University Press (OUP)

Date: 12-04-2013

DOI: 10.1093/DATABASE/BAT019

Publication

The textual characteristics of traditional and Open Access scientific journals are similar

Publisher: Springer Science and Business Media LLC

Date: 15-06-2009

DOI: 10.1186/1471-2105-10-183

Publication

A UIMA wrapper for the NCBO annotator

Publisher: Oxford University Press (OUP)

Date: 26-05-2010

DOI: 10.1093/BIOINFORMATICS/BTQ250

Abstract: Summary: The Unstructured Information Management Architecture (UIMA) framework and web services are emerging as useful tools for integrating biomedical text mining tools. This note describes our work, which wraps the National Center for Biomedical Ontology (NCBO) Annotator—an ontology-based annotation service—to make it available as a component in UIMA workflows. Availability: This wrapper is freely available on the web at bionlp-uima.sourceforge.net/ as part of the UIMA tools distribution from the Center for Computational Pharmacology (CCP) at the University of Colorado School of Medicine. It has been implemented in Java for support on Mac OS X, Linux and MS Windows. Contact: chris.roeder@ucdenver.edu

Publication

BioLemmatizer: a lemmatization tool for morphological processing of biomedical text

Publisher: Springer Science and Business Media LLC

Date: 04-2012

DOI: 10.1186/2041-1480-3-3

Publication

Privacy-Preserving Access Control in Electronic Health Record Linkage

Publisher: IEEE

Date: 08-2018

DOI: 10.1109/TRUSTCOM/BIGDATASE.2018.00151

Publication

A large-scale evaluation of computational protein function prediction

Publisher: Springer Science and Business Media LLC

Date: 27-01-2013

DOI: 10.1038/NMETH.2340

Publication

Characterizing Text Revisions to Better Support Collaborative

Publisher: IEEE

Date: 12-2022

DOI: 10.1109/ICDI57181.2022.10007395

Publication

Coreference resolution improves extraction of Biological Expression Language statements from texts.

Publisher: Oxford University Press (OUP)

Date: 2016

DOI: 10.1093/DATABASE/BAW076

Publication

BioC interoperability track overview

Publisher: Oxford University Press (OUP)

Date: 30-06-2014

DOI: 10.1093/DATABASE/BAU053

Publication

ChEMU 2021: Reaction Reference Resolution and Anaphora Resolution in Chemical Patents

Publisher: Springer International Publishing

Date: 2021

DOI: 10.1007/978-3-030-72240-1_71

Publication

Propagation, detection and correction of errors using the sequence database network

Publisher: Oxford University Press (OUP)

Date: 20-10-2022

DOI: 10.1093/BIB/BBAC416

Abstract: Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors are the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect—or even correct—erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.

Publication

Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT.

Publisher: Springer Science and Business Media LLC

Date: 04-01-2022

DOI: 10.1186/S12859-021-04504-X

Abstract: Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTM). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly performed through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTM from the literature by using distantly supervised training data using deep learning to aid human curation. We use the IntAct PPI database to create a distant supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models—dubbed PPI-BioBERT-x10—to improve confidence calibration. We extend the use of ensemble average confidence approach with confidence variation to counteract the effects of class imbalance to extract high confidence predictions. The PPI-BioBERT-x10 model evaluated on the test set resulted in a modest F1-micro 41.3 (P =5 8.1, R = 32.1). However, by combining high confidence and low variation to identify high quality predictions, tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts and extracted 1.6 million (546507 unique PTM-PPI triplets) PTM-PPI predictions, and filter $$\\approx 5700$$ ≈ 5700 (4584 unique) high confidence predictions. Of the 5700, human evaluation on a small randomly s led subset shows that the precision drops to 33.7% despite confidence calibration and highlights the challenges of generalisability beyond the test set even with confidence calibration. We circumvent the problem by only including predictions associated with multiple papers, improving the precision to 58.8%. In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.

Publication

Coreference annotation and resolution in the Colorado Richly Annotated Full Text (CRAFT) corpus of biomedical journal articles

Publisher: Springer Science and Business Media LLC

Date: 17-08-2017

DOI: 10.1186/S12859-017-1775-9

Publication

Predicting Publication of Clinical Trials Using Structured and Unstructured Data: Model Development and Validation Study

Publisher: JMIR Publications Inc.

Date: 23-12-2022

DOI: 10.2196/38859

Abstract: Publication of registered clinical trials is a critical step in the timely dissemination of trial findings. However, a significant proportion of completed clinical trials are never published, motivating the need to analyze the factors behind success or failure to publish. This could inform study design, help regulatory decision-making, and improve resource allocation. It could also enhance our understanding of bias in the publication of trials and publication trends based on the research direction or strength of the findings. Although the publication of clinical trials has been addressed in several descriptive studies at an aggregate level, there is a lack of research on the predictive analysis of a trial’s publishability given an in idual (planned) clinical trial description. We aimed to conduct a study that combined structured and unstructured features relevant to publication status in a single predictive approach. Established natural language processing techniques as well as recent pretrained language models enabled us to incorporate information from the textual descriptions of clinical trials into a machine learning approach. We were particularly interested in whether and which textual features could improve the classification accuracy for publication outcomes. In this study, we used metadata from ClinicalTrials.gov (a registry of clinical trials) and MEDLINE (a database of academic journal articles) to build a data set of clinical trials (N=76,950) that contained the description of a registered trial and its publication outcome (27,702/76,950, 36% published and 49,248/76,950, 64% unpublished). This is the largest data set of its kind, which we released as part of this work. The publication outcome in the data set was identified from MEDLINE based on clinical trial identifiers. We carried out a descriptive analysis and predicted the publication outcome using 2 approaches: a neural network with a large domain-specific language model and a random forest classifier using a weighted bag-of-words representation of text. First, our analysis of the newly created data set corroborates several findings from the existing literature regarding attributes associated with a higher publication rate. Second, a crucial observation from our predictive modeling was that the addition of textual features (eg, eligibility criteria) offers consistent improvements over using only structured data (F1-score=0.62-0.64 vs F1-score=0.61 without textual features). Both pretrained language models and more basic word-based representations provide high-utility text representations, with no significant empirical difference between the two. Different factors affect the publication of a registered clinical trial. Our approach to predictive modeling combines heterogeneous features, both structured and unstructured. We show that methods from natural language processing can provide effective textual features to enable more accurate prediction of publication success, which has not been explored for this task previously.

Publication

Representing annotation compositionality and provenance for the Semantic Web

Publisher: Springer Science and Business Media LLC

Date: 2013

DOI: 10.1186/2041-1480-4-38

Publication

A Semantics-Enhanced Language Model for Unsupervised Word Sense Disambiguation

Publisher: Springer Berlin Heidelberg

Date: 2008

DOI: 10.1007/978-3-540-78135-6_24

Publication

Using natural language processing and VetCompass to understand antimicrobial usage patterns in Australia

Publisher: Wiley

Date: 17-06-2019

DOI: 10.1111/AVJ.12836

Abstract: Currently there is an incomplete understanding of antimicrobial usage patterns in veterinary clinics in Australia, but such knowledge is critical for the successful implementation and monitoring of antimicrobial stewardship programs. VetCompass Australia collects medical records from 181 clinics in Australia (as of May 2018). These records contain detailed information from in idual consultations regarding the medications dispensed. One unique aspect of VetCompass Australia is its focus on applying natural language processing (NLP) and machine learning techniques to analyse the records, similar to efforts conducted in other medical studies. The free text fields of 4,394,493 veterinary consultation records of dogs and cats between 2013 and 2018 were collated by VetCompass Australia and NLP techniques applied to enable the querying of the antimicrobial usage within these consultations. The NLP algorithms developed matched antimicrobial in clinical records with 96.7% accuracy and an F1 Score of 0.85, as evaluated relative to expert annotations. This dataset can be readily queried to demonstrate the antimicrobial usage patterns of companion animal practices throughout Australia.

Publication

Establishing a baseline for literature mining human genetic variants and their relationships to disease cohorts

Publisher: Springer Science and Business Media LLC

Date: 07-2016

DOI: 10.1186/S12911-016-0294-3

Publication

BioHackathon series in 2013 and 2014: improvements of semantic interoperability in life science data and services

Publisher: F1000 Research Ltd

Date: 23-09-2019

DOI: 10.12688/F1000RESEARCH.18238.1

Abstract: Publishing databases in the Resource Description Framework (RDF) model is becoming widely accepted to maximize the syntactic and semantic interoperability of open data in life sciences. Here we report advancements made in the 6th and 7th annual BioHackathons which were held in Tokyo and Miyagi respectively. This review consists of two major sections covering: 1) improvement and utilization of RDF data in various domains of the life sciences and 2) meta-data about these RDF data, the resources that store them, and the service quality of SPARQL Protocol and RDF Query Language (SPARQL) endpoints. The first section describes how we developed RDF data, ontologies and tools in genomics, proteomics, metabolomics, glycomics and by literature text mining. The second section describes how we defined descriptions of datasets, the provenance of data, and quality assessment of services and service discovery. By enhancing the harmonization of these two layers of machine-readable data and knowledge, we improve the way community wide resources are developed and published. Moreover, we outline best practices for the future, and prepare ourselves for an exciting and unanticipatable variety of real world applications in coming years.

Publication

Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

Publisher: Elsevier BV

Date: 04-2020

DOI: 10.1016/J.GPB.2018.11.006

Publication

Early prediction of diagnostic-related groups and estimation of hospital cost by processing clinical notes

Publisher: Springer Science and Business Media LLC

Date: 07-2021

DOI: 10.1038/S41746-021-00474-9

Abstract: As healthcare providers receive fixed amounts of reimbursement for given services under DRG (Diagnosis-Related Groups) payment, DRG codes are valuable for cost monitoring and resource allocation. However, coding is typically performed retrospectively post-discharge. We seek to predict DRGs and DRG-based case mix index (CMI) at early inpatient admission using routine clinical text to estimate hospital cost in an acute setting. We examined a deep learning-based natural language processing (NLP) model to automatically predict per-episode DRGs and corresponding cost-reflecting weights on two cohorts (paid under Medicare Severity (MS) DRG or All Patient Refined (APR) DRG), without human coding efforts. It achieved macro-averaged area under the receiver operating characteristic curve (AUC) scores of 0·871 (SD 0·011) on MS-DRG and 0·884 (0·003) on APR-DRG in fivefold cross-validation experiments on the first day of ICU admission. When extended to simulated patient populations to estimate average cost-reflecting weights, the model increased its accuracy over time and obtained absolute CMI error of 2·40 (1·07%) and 12·79% (2·31%), respectively on the first day. As the model could adapt to variations in admission time, cohort size, and requires no extra manual coding efforts, it shows potential to help estimating costs for active patients to support better operational decision-making in hospitals.

Publication

Literature mining of protein-residue associations with graph rules learned through distant supervision.

Publisher: Springer Science and Business Media LLC

Date: 10-2012

DOI: 10.1186/2041-1480-3-S3-S2

Publication

Overview of ChEMU 2021: Reaction Reference Resolution and Anaphora Resolution in Chemical Patents

Publisher: Springer International Publishing

Date: 2021

DOI: 10.1007/978-3-030-85251-1_20

Publication

Integrating UIMA with Alveo, a human communication science virtual laboratory

Publisher: Association for Computational Linguistics and Dublin City University

Date: 2014

DOI: 10.3115/V1/W14-5202

Karin Verspoor

Researcher

Research Topics

Top 5 Research Topics

ANZSRC Field of Research (FoR)

ANZSRC Socio-Economic Objective (SEO)

Related Links

Publications

BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics.

Leveraging gene ontology annotations to improve a memory-based language understanding system

The Evolution of Clinical Knowledge During COVID-19: Towards a Global Learning Health System.

Use of a Victorian statewide surveillance programme to evaluate the burden of healthcare‐associated Staphylococcus aureus bacteraemia and Clostridioides difficile infection in patients with cancer

A physarum-inspired prize-collecting steiner tree approach to identify subnetworks for drug repositioning.

Conceptualising health information seeking behaviours and exploratory search: result of a qualitative study

Large-scale biomedical concept recognition: an evaluation of current automatic annotators and their parameters

Classifying literature mentions of biological pathogens as experimentally studied using natural language processing

Chemtables: A Dataset for Semantic Classification on Tables in Chemical Patents

Exploring effective approaches for haplotype block phasing

Performance of ICD-10-AM codes for quality improvement monitoring of hospital-acquired pneumonia in a haematology-oncology casemix in Victoria, Australia

Literature Consistency of Bioinformatics Sequence Databases is Effective for Assessing Record Quality

Text Mining Improves Prediction of Protein Functional Sites

Burden and clinical outcomes of hospital-coded infections in patients with cancer: an 11-year longitudinal cohort study at an Australian cancer centre

Automatic English-Chinese name transliteration for development of multilingual resources

“Note Bloat” impacts deep learning-based NLP models for clinical prediction tasks

Learning from Unlabelled Data for Clinical Semantic Textual Similarity

Appraising the Quality of Systematic Reviews for Age-Related Macular Degeneration Interventions

High-precision biological event extraction: Effects of system and of data

Early prediction of incident liver disease using conventional risk factors and gut-microbiome-augmented gradient boosting

Evaluation of consensus strategies for haplotype phasing

Interoperability of text corpus annotations with the semantic web.

Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases

The Use of Web-Based Technologies in Health Research Participation: Qualitative Study of Consumer and Researcher Experiences

Evaluation of a Machine Learning Duplicate Detection Method for Bioinformatics Databases

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools

Chemtables: A Dataset for Semantic Classification on Tables in Chemical Patents

Use and validation of text mining and cluster algorithms to derive insights from Corona Virus Disease-2019 (COVID-19) medical literature

ChemTables: A Dataset for Semantic Classification of Tables in Chemical Patents

Literature consistency of bioinformatics sequence databases is effective for assessing record quality.

Extracting structured information from free-text medication prescriptions using dependencies

Better health information exploration

Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

BioC: a minimalist approach to interoperability for biomedical text processing

MPVNN: Mutated Pathway Visible Neural Network architecture for interpretable prediction of cancer-specific survival risk.

What Can We Get From 1000 Tokens? A Case Study of Multilingual POS Tagging For Resource-Poor Languages

Concept annotation in the CRAFT corpus

Protein annotation as term categorization in the gene ontology using word proximity networks

Quantifying Semantic Similarity of Clinical Evidence in the Biomedical Literature to Facilitate Related Evidence Synthesis

Automatic Consistency Assurance for Literature-based Gene Ontology Annotation

Roles for text mining in protein function prediction.

Applied Text Mining

Why Bother Enabling Biomedical Literature Analysis with Semantics?

Use of cefovecin in dogs and cats attending first‐opinion veterinary practices in Australia

Automated Detection of Records in Biological Sequence Databases that are Inconsistent with the Literature

Describing the antimicrobial usage patterns of companion animal veterinary practices; free text analysis of more than 4.4 million consultation records.

The gut microbiome is a significant risk factor for future chronic lung disease

Automatic consistency assurance for literature-based gene ontology annotation

Assessing the impact of case sensitivity and term information gain on biomedical concept recognition.

U-Compare bio-event meta-service: compatible BioNLP event extraction services

Exploring automatic inconsistency detection for literature-based gene ontology annotation

Evaluating the Performance of Various Machine Learning Algorithms to Detect Subclinical Keratoconus

Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study.

Rev at SemEval-2016 Task 2: Aligning Chunks by Lexical, Part of Speech and Semantic Equivalence

Approaches to verb subcategorization for biomedicine

A Bayesian reanalysis of the Standard versus Accelerated Initiation of Renal-Replacement Therapy in Acute Kidney Injury (STARRT-AKI) trial

Automating Quality Assessment of Medical Evidence in Systematic Reviews: Model Development and Validation Study (Preprint)

Benchmarks for Measurement of Duplicate Detection Methods in Nucleotide Databases

PHENOstruct: Prediction of human phenotype ontology terms using heterogeneous data sources

Mutation extraction tools can be combined for robust recognition of genetic variants in the literature

Testing Contextualized Word Embeddings to Improve NER in Spanish Clinical Case Narratives

Thematic issue of the Second combined Bio-ontologies and Phenotypes Workshop

Better Health Explorer

Integration of polygenic and gut metagenomic risk prediction for common diseases

Evaluation of consensus strategies for haplotype phasing

COVID-19 Drug Repurposing: A Network-Based Framework for Exploring Biomedical Literature and Clinical Trials for Possible Treatments

Machine learning with a reduced dimensionality representation of comprehensive Pentacam tomography parameters to identify subclinical keratoconus

Summary of the BioLINK SIG 2013 meeting at ISMB/ECCB 2013.

Acquisition and evaluation of verb subcategorization resources for biomedicine

Positive-Unlabeled Learning for inferring drug interactions based on heterogeneous attributes

Duplicates, redundancies, and inconsistencies in the primary nucleotide databases: a descriptive study

Big Data in Medicine Is Driving Big Changes