ARDC Research Link Australia

Publication

Compressed String Dictionaries

Publisher: Springer Berlin Heidelberg

Date: 2011

DOI: 10.1007/978-3-642-20662-7_12

Publication

Inherited Thrombophilias Are Associated With a Higher Risk of COVID-19–Associated Venous Thromboembolism: A Prospective Population-Based Cohort Study

Publisher: Ovid Technologies (Wolters Kluwer Health)

Date: 22-03-2022

DOI: 10.1161/CIRCULATIONAHA.121.057394

Publication

Full Compressed Affix Tree Representations

Publisher: IEEE

Date: 04-2017

DOI: 10.1109/DCC.2017.39

Publication

Shortest DNA Cyclic Cover in Compressed Space

Publisher: IEEE

Date: 03-2016

DOI: 10.1109/DCC.2016.79

Publication

Lossy compression of quality scores in genomic data

Publisher: Oxford University Press (OUP)

Date: 02-05-2014

DOI: 10.1093/BIOINFORMATICS/BTU183

Abstract: Motivation: Next-generation sequencing technologies are revolutionizing medicine. Data from sequencing technologies are typically represented as a string of bases, an associated sequence of per-base quality scores and other metadata, and in aggregate can require a large amount of space. The quality scores show how accurate the bases are with respect to the sequencing process, that is, how confident the sequencer is of having called them correctly, and are the largest component in datasets in which they are retained. Previous research has examined how to store sequences of bases effectively; here we add to that knowledge by examining methods for compressing quality scores. The quality values originate in a continuous domain, and so if a fidelity criterion is introduced, it is possible to introduce flexibility in the way these values are represented, allowing lossy compression over the quality score data. Results: We present existing compression options for quality score data, and then introduce two new lossy techniques. Experiments measuring the trade-off between compression ratio and information loss are reported, including quantifying the effect of lossy representations on a downstream application that carries out single nucleotide polymorphism and insert/deletion detection. The new methods are demonstrably superior to other techniques when assessed against the spectrum of possible trade-offs between storage required and fidelity of representation. Availability and implementation: An implementation of the methods described here is available at canovas/libCSAM. Contact: rcanovas@student.unimelb.edu.au Supplementary information: Supplementary data are available at Bioinformatics online.

Publication

LZ78 Compression in Low Main Memory Space

Publisher: Springer International Publishing

Date: 2017

DOI: 10.1007/978-3-319-67428-5_4

Publication

Practical Compressed Suffix Trees

Publisher: MDPI AG

Date: 21-05-2013

DOI: 10.3390/A6020319

Publication

Querying RDF dictionaries in compressed space

Publisher: Association for Computing Machinery (ACM)

Date: 06-2012

DOI: 10.1145/2340416.2340422

Abstract: The use of dictionaries is a common practice among applications operating on huge RDF datasets. It allows long terms occurring in the RDF triples to be replaced by short IDs which reference them. This decision greatly compacts the dataset and mitigates the scalability issues underlying its management. However, the dictionary size is not negligible and the techniques used for its representation also suffer from scalability limitations. This paper focuses on this scenario by adapting compression techniques for string dictionaries to the case of RDF. We propose a novel technique, Dcomp, which can be tuned to represent the dictionary in compressed space (22–64%) and to perform basic lookup operations in a few microseconds (1–50 μs). In addition, we propose Dcomp as a basis for specific SPARQL query optimizations leveraging its ability for early FILTER resolution.

Publication

Genomic risk scores for juvenile idiopathic arthritis and its subtypes

Publisher: BMJ

Date: 04-09-2020

DOI: 10.1136/ANNRHEUMDIS-2020-217421

Abstract: Juvenile idiopathic arthritis (JIA) is an autoimmune disease and a common cause of chronic disability in children. Diagnosis of JIA is based purely on clinical symptoms, which can be variable, leading to diagnosis and treatment delays. Despite JIA having substantial heritability, the construction of genomic risk scores (GRSs) to aid or expedite diagnosis has not been assessed. Here, we generate GRSs for JIA and its subtypes and evaluate their performance. We examined three case/control cohorts (UK, US-based and Australian) with genome-wide single nucleotide polymorphism (SNP) genotypes. We trained GRSs for JIA and its subtypes using lasso-penalised linear models in cross-validation on the UK cohort, and externally tested them in the other cohorts. The JIA GRS alone achieved cross-validated area under the receiver operating characteristic curve (AUC)=0.670 in the UK cohort and externally-validated AUCs of 0.657 and 0.671 in the US-based and Australian cohorts, respectively. In logistic regression of case/control status, the corresponding odds ratios (ORs) per standard deviation (SD) of GRS were 1.831 (1.685 to 1.991) and 2.008 (1.731 to 2.345), and were unattenuated by adjustment for sex or the top 10 genetic principal components. Extending our analysis to JIA subtypes revealed that enthesitis-related JIA had both the longest time-to-referral and the subtype GRS with the strongest predictive capacity overall across data sets: AUCs of 0.82 in the UK, 0.84 in the Australian and 0.70 in the US-based cohorts. The particularly common oligoarthritis JIA also had a GRS that outperformed those for JIA overall, with AUCs of 0.72, 0.74 and 0.77, respectively. A GRS for JIA has potential to augment clinical JIA diagnosis protocols, prioritising higher-risk individuals for follow-up and treatment. Consistent with JIA heterogeneity, subtype-specific GRSs showed particularly high performance for enthesitis-related and oligoarthritis JIA.

Publication

The impact of coffee subtypes on incident cardiovascular disease, arrhythmias and mortality: long term outcomes from the UK Biobank

Publisher: Oxford University Press (OUP)

Date: 27-09-2022

DOI: 10.1093/EURJPC/ZWAC189

Abstract: Epidemiological studies report the beneficial effects of habitual coffee consumption on incident arrhythmia, cardiovascular disease (CVD), and mortality. However, the impact of different coffee preparations on cardiovascular outcomes and survival is largely unknown. The aim of this study was to evaluate associations between coffee subtypes and incident outcomes, utilizing the UK Biobank. Coffee subtypes were defined as decaffeinated, ground, and instant, then divided into 0, <1, 1, 2–3, 4–5, and >5 cups/day, and compared with non-drinkers. Cardiovascular disease included coronary heart disease, cardiac failure, and ischaemic stroke. Cox regression modelling with hazard ratios (HRs) assessed associations with incident arrhythmia, CVD, and mortality. Outcomes were determined through ICD codes and death records. A total of 449 563 participants (median 58 years, 55.3% females) were followed over 12.5 ± 0.7 years. Ground and instant coffee consumption was associated with a significant reduction in arrhythmia at 1–5 cups/day but not for decaffeinated coffee. The lowest risk was 4–5 cups/day for ground coffee [HR 0.83, confidence interval (CI) 0.76–0.91, P < 0.0001] and 2–3 cups/day for instant coffee (HR 0.88, CI 0.85–0.92, P < 0.0001). All coffee subtypes were associated with a reduction in incident CVD (the lowest risk was 2–3 cups/day for decaffeinated, P = 0.0093; ground, P < 0.0001; and instant coffee, P < 0.0001) vs. non-drinkers. All-cause mortality was significantly reduced for all coffee subtypes, with the greatest risk reduction seen with 2–3 cups/day for decaffeinated (HR 0.86, CI 0.81–0.91, P < 0.0001), ground (HR 0.73, CI 0.69–0.78, P < 0.0001), and instant coffee (HR 0.89, CI 0.86–0.93, P < 0.0001). Decaffeinated, ground, and instant coffee, particularly at 2–3 cups/day, were associated with significant reductions in incident CVD and mortality. Ground and instant but not decaffeinated coffee was associated with reduced arrhythmia.

Publication

Practical compressed string dictionaries

Publisher: Elsevier BV

Date: 03-2016

DOI: 10.1016/J.IS.2015.08.008

Publication

FT-GPI, a highly sensitive and accurate predictor of GPI-anchored proteins, reveals the composition and evolution of the GPI proteome in Plasmodium species

Publisher: Springer Science and Business Media LLC

Date: 25-01-2023

DOI: 10.1186/S12936-022-04430-0

Abstract: Protozoan parasites are known to attach a specific and diverse group of proteins to their plasma membrane via a GPI anchor. In malaria parasites, GPI-anchored proteins (GPI-APs) have been shown to play an important role in host–pathogen interactions and a key function in host cell invasion and immune evasion. Because of their immunogenic properties, some of these proteins have been considered as malaria vaccine candidates. However, identification of all possible GPI-APs encoded by these parasites remains challenging due to their sequence diversity and limitations of the tools used for their characterization. The FT-GPI software was developed to detect GPI-APs based on the presence of a hydrophobic helix at both ends of the premature peptide. FT-GPI was implemented in C++ and applied to study the GPI-proteome of 46 isolates of the order Haemosporida. Using the GPI proteome of Plasmodium falciparum strain 3D7 and Plasmodium vivax strain Sal-1, a heuristic method was defined to select the most sensitive and specific FT-GPI software parameters. FT-GPI enabled revision of the GPI-proteome of P. falciparum and P. vivax, including the identification of novel GPI-APs. Orthology- and synteny-based analyses showed that 19 of the 37 GPI-APs found in the order Haemosporida are conserved among Plasmodium species. Our analyses suggest that gene duplication and deletion events may have contributed significantly to the evolution of the GPI proteome, and its composition correlates with speciation. FT-GPI-based prediction is a useful tool for mining GPI-APs and gaining further insights into their evolution and sequence diversity. This resource may also help identify new protein candidates for the development of vaccines for malaria and other parasitic diseases.

Publication

Compression of RDF dictionaries

Publisher: ACM

Date: 26-03-2012

DOI: 10.1145/2245276.2245343

Publication

Succinct Trees in Practice

Publisher: Society for Industrial and Applied Mathematics

Date: 16-01-2013

DOI: 10.1137/1.9781611972900.9

Publication

Engineering Practical Lempel-Ziv Tries

Publisher: Association for Computing Machinery (ACM)

Date: 30-10-2021

DOI: 10.1145/3481638

Abstract: The Lempel-Ziv 78 (LZ78) and Lempel-Ziv-Welch (LZW) text factorizations are popular, not only for bare compression but also for building compressed data structures on top of them. Their regular factor structure makes them computable within space bounded by the compressed output size. In this article, we carry out the first thorough study of low-memory LZ78 and LZW text factorization algorithms, introducing more efficient alternatives to the classical methods, as well as new techniques that can run within less memory space than that necessary to hold the compressed file. Our results build on hash-based representations of tries that may have independent interest.

Publication

CSAM: Compressed SAM format

Publisher: Oxford University Press (OUP)

Date: 18-08-2016

DOI: 10.1093/BIOINFORMATICS/BTW543

Abstract: Motivation: Next generation sequencing machines produce vast amounts of genomic data. For the data to be useful, it is essential that it can be stored and manipulated efficiently. This work responds to the combined challenge of compressing genomic data, while providing fast access to regions of interest, without necessitating decompression of whole files. Results: We describe CSAM (Compressed SAM format), a compression approach offering lossless and lossy compression for SAM files. The structures and techniques proposed are suitable for representing SAM files, as well as supporting fast access to the compressed information. They generate more compact lossless representations than BAM, which is currently the preferred lossless compressed SAM-equivalent format and are self-contained, that is, they do not depend on any external resources to compress or decompress SAM files. Availability and Implementation: An implementation is available at canovas/libCSAM. Contact: canovas-ba@lirmm.fr Supplementary Information: Supplementary data is available at Bioinformatics online.

Publication

Practical Compressed Suffix Trees

Publisher: Springer Berlin Heidelberg

Date: 2010

DOI: 10.1007/978-3-642-13193-6_9

Rodrigo Canovas

Researcher

Related Organisations

Baker IDI Heart And Diabetes Institute

CSIRO

Related Funding Activities
