ARDC Research Link Australia

Publication

Bootstrapping phylogenies inferred from rearrangement data

Publisher: Springer Science and Business Media LLC

Date: 29-08-2012

Abstract: Large-scale sequencing of genomes has enabled the inference of phylogenies based on the evolution of genomic architecture, under such events as rearrangements, duplications, and losses. Many evolutionary models and associated algorithms have been designed over the last few years and have found use in comparative genomics and phylogenetic inference. However, the assessment of phylogenies built from such data has not been properly addressed to date. The standard method used in sequence-based phylogenetic inference is the bootstrap, but it relies on a large number of homologous characters that can be resampled; yet in the case of rearrangements, the entire genome is a single character. Alternatives such as the jackknife suffer from the same problem, while likelihood tests cannot be applied in the absence of well established probabilistic models. We present a new approach to the assessment of distance-based phylogenetic inference from whole-genome data; our approach combines features of the jackknife and the bootstrap and remains nonparametric. For each feature of our method, we give an equivalent feature in the sequence-based framework; we also present the results of extensive experimental testing, in both sequence-based and genome-based frameworks. Through the feature-by-feature comparison and the experimental results, we show that our bootstrapping approach is on par with the classic phylogenetic bootstrap used in sequence-based reconstruction, and we establish the clear superiority of the classic bootstrap for sequence data and of our corresponding new approach for rearrangement data over proposed variants. Finally, we test our approach on a small dataset of mammalian genomes, verifying that the support values match current thinking about the respective branches. Our method is the first to provide a standard of assessment to match that of the classic phylogenetic bootstrap for aligned sequences. Its support values follow a similar scale and its receiver-operating characteristics are nearly identical, indicating that it provides similar levels of sensitivity and specificity. Thus our assessment method makes it possible to conduct phylogenetic analyses on whole genomes with the same degree of confidence as for analyses on aligned sequences. Extensions to search-based inference methods such as maximum parsimony and maximum likelihood are possible, but remain to be thoroughly tested.

Publication

An Algorithm to Mine Therapeutic Motifs for Cancer From Networks of Genetic Interactions

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 06-2022

DOI: 10.1109/JBHI.2022.3141076

Publication

MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs

Publisher: Springer International Publishing

Date: 2022

DOI: 10.1007/978-3-031-04749-7_5

Publication

VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction from Assembly Graphs

Publisher: Springer Nature Switzerland

Date: 2023

DOI: 10.1007/978-3-031-29119-7_1

Abstract: With the high mutation rate in viruses, a mixture of closely related viral strains (called viral quasispecies) often co-infect an individual host. Reconstructing individual strains from viral quasispecies is a key step to characterizing the viral population, revealing strain-level genetic variability, and providing insights into biomedical and clinical studies. Reference-based approaches of reconstructing viral strains suffer from the lack of high-quality references due to high mutation rates and biased variant calling introduced by a selected reference. De novo methods require no references but face challenges due to errors in reads, the high similarity of quasispecies, and uneven abundance of strains. In this paper, we propose VStrains, a de novo approach for reconstructing strains from viral quasispecies. VStrains incorporates contigs, paired-end reads, and coverage information to iteratively extract the strain-specific paths from assembly graphs. We benchmark VStrains against multiple state-of-the-art de novo and reference-based approaches on both simulated and real datasets. Experimental results demonstrate that VStrains achieves the best overall performance on both simulated and real datasets under a comprehensive set of metrics such as genome fraction, duplication ratio, NGA50, error rate, etc. Availability: VStrains is freely available at github.com/MetaGenTools/VStrains.

Publication

Phylogenetic Reconstruction for Copy-Number Evolution Problems

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 03-2019

DOI: 10.1109/TCBB.2018.2829698

Publication

Sorting Signed Permutations by Inversions in O(nlogn) Time

Publisher: Springer Berlin Heidelberg

Date: 2009

DOI: 10.1007/978-3-642-02008-7_28

Publication

Heuristics for the inversion median problem

Publisher: Springer Science and Business Media LLC

Date: 2010

DOI: 10.1186/1471-2105-11-S1-S30

Publication

Hurdles and Sorting by Inversions: Combinatorial, Statistical, and Experimental Results

Publisher: Mary Ann Liebert Inc

Date: 10-2009

DOI: 10.1089/CMB.2009.0156

Abstract: As data about genomic architecture accumulates, genomic rearrangements have attracted increasing attention. One of the main rearrangement mechanisms, inversions (also called reversals), was characterized by Hannenhalli and Pevzner and this characterization in turn extended by various authors. The characterization relies on the concepts of breakpoints, cycles, and obstructions colorfully named hurdles and fortresses. In this paper, we study the probability of generating a hurdle in the process of sorting a permutation if one does not take special precautions to avoid them (as in a randomized algorithm, for instance). To do this we revisit and extend the work of Caprara and of Bergeron by providing simple and exact characterizations of the probability of encountering a hurdle in a random permutation. Using similar methods we provide the first asymptotically tight analysis of the probability that a fortress exists in a random permutation. Finally, we study other aspects of hurdles, both analytically and through experiments: when are they created in a sequence of sorting inversions, how much later are they detected, and how much work may need to be undone to return to a sorting sequence.

Publication

A Fragmentation Event Model for Peptide Identification by Mass Spectrometry

Publisher: Springer Berlin Heidelberg

Date: 2008

DOI: 10.1007/978-3-540-78839-3_14

Publication

Morphological stasis masks ecologically divergent coral species on tropical reefs

Publisher: Cold Spring Harbor Laboratory

Date: 05-09-2020

DOI: 10.1101/2020.09.04.260208

Abstract: Coral reefs are the epitome of species diversity, yet the number of described scleractinian coral species, the framework-builders of coral reefs, remains moderate by comparison. DNA sequencing studies are rapidly challenging this notion by exposing a wealth of undescribed diversity, but the evolutionary and ecological significance of this diversity remains largely unclear. Here, we present an annotated genome for one of the most ubiquitous corals in the Indo-Pacific (Pachyseris speciosa), and uncover through a comprehensive genomic and phenotypic assessment that it comprises morphologically indistinguishable, but ecologically divergent cryptic lineages. Demographic modelling based on whole-genome resequencing disproved that morphological crypsis was due to recent divergence, and instead indicated ancient morphological stasis. Although the lineages occur sympatrically across shallow and mesophotic habitats, extensive genotyping using a rapid diagnostic assay revealed differentiation of their ecological distributions. Leveraging “common garden” conditions facilitated by the overlapping distributions, we assessed physiological and quantitative skeletal traits and demonstrated concurrent phenotypic differentiation. Lastly, spawning observations of genotyped colonies highlighted the potential role of temporal reproductive isolation in the limited admixture, with consistent genomic signatures in genes related to morphogenesis and reproduction. Overall, our findings demonstrate how ecologically and phenotypically divergent coral species can evolve despite morphological stasis, and provide new leads into the potential mechanisms facilitating such divergence in sympatry. More broadly, they indicate that our current taxonomic framework for reef-building corals may be scratching the surface of the ecologically relevant diversity on coral reefs, consequently limiting our ability to protect or restore this diversity effectively.

Publication

An Exact Algorithm to Compute the Double-Cut-and-Join Distance for Genomes with Duplicate Genes

Publisher: Mary Ann Liebert Inc

Date: 05-2015

DOI: 10.1089/CMB.2014.0096

Abstract: Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this article, we propose an integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse, and rat genomes, where once again our method outperforms MSOAR.

Publication

Reconstructing Yeasts Phylogenies and Ancestors from Whole Genome Data

Publisher: Springer Science and Business Media LLC

Date: 09-11-2017

DOI: 10.1038/S41598-017-15484-5

Abstract: Phylogenetic studies aim to discover evolutionary relationships and histories. These studies are based on similarities of morphological characters and molecular sequences. Currently, widely accepted phylogenetic approaches are based on multiple sequence alignments, which analyze shared gene datasets and concatenate/coalesce these results to a final phylogeny with maximum support. However, these approaches still have limitations, and often have conflicting results with each other. Reconstructing ancestral genomes helps us understand mechanisms and corresponding consequences of evolution. Most existing genome level phylogeny and ancestor reconstruction methods can only process simplified real genome datasets or simulated datasets with identical genome content, unique genome markers, and limited types of evolutionary events. Here, we provide an alternative way to resolve phylogenetic problems based on analyses of real genome data. We use phylogenetic signals from all types of genome level evolutionary events, and overcome the conflicting issues existing in traditional phylogenetic approaches. Further, we build an automated computational pipeline to reconstruct phylogenies and ancestral genomes for two high-resolution real yeast genome datasets. Comparison results with recent studies and publications show that we reconstruct very accurate and robust phylogenies and ancestors. Finally, we identify and analyze the conserved syntenic blocks among reconstructed ancestral genomes and present yeast species.

Publication

A Median Solver and Phylogenetic Inference Based on Double-Cut-and-Join Sorting

Publisher: Mary Ann Liebert Inc

Date: 03-2018

DOI: 10.1089/CMB.2017.0157

Abstract: Genome rearrangement is known as one of the main evolutionary mechanisms on the genomic level. Phylogenetic analysis based on rearrangement played a crucial role in biological research in the past decades, especially with the increasing availability of fully sequenced genomes. In general, phylogenetic analysis aims to solve two problems: small parsimony problem (SPP) and big parsimony problem (BPP). Maximum parsimony is a popular approach for SPP and BPP, which relies on iteratively solving an NP-hard problem, the median problem. As a result, current median solvers and phylogenetic inference methods based on the median problem all face serious problems on scalability and cannot be applied to data sets with large and distant genomes. In this article, we propose a new median solver for gene order data that combines double-cut-and-join sorting with the simulated annealing algorithm. Based on this median solver, we built a new phylogenetic inference method to solve both SPP and BPP problems. Our experimental results show that the new median solver achieves an excellent performance on simulated data sets, and the phylogenetic inference tool built based on the new median solver has a better performance than other existing methods.

Publication

Rearrangements in Phylogenetic Inference: Compare, Model, or Encode?

Publisher: Springer London

Date: 2013

DOI: 10.1007/978-1-4471-5298-9_7

Publication

GraphPlas: Refined Classification of Plasmid Sequences Using Assembly Graphs

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 2022

DOI: 10.1109/TCBB.2021.3082915

Publication

Phylogeny analysis from gene-order data with massive duplications

Publisher: Springer Science and Business Media LLC

Date: 10-2017

DOI: 10.1186/S12864-017-4129-0

Publication

MLGO: phylogeny reconstruction and ancestral inference from gene-order data

Publisher: Springer Science and Business Media LLC

Date: 08-11-2014

DOI: 10.1186/S12859-014-0354-6

Publication

Fast and Accurate Phylogenetic Reconstruction from High-Resolution Whole-Genome Data and a Novel Robustness Estimator

Publisher: Mary Ann Liebert Inc

Date: 09-2011

DOI: 10.1089/CMB.2011.0114

Abstract: The rapid accumulation of whole-genome data has renewed interest in the study of genomic rearrangements. Comparative genomics, evolutionary biology, and cancer research all require models and algorithms to elucidate the mechanisms, history, and consequences of these rearrangements. However, even simple models lead to NP-hard problems, particularly in the area of phylogenetic analysis. Current approaches are limited to small collections of genomes and low-resolution data (typically a few hundred syntenic blocks). Moreover, whereas phylogenetic analyses from sequence data are deemed incomplete unless bootstrapping scores (a measure of confidence) are given for each tree edge, no equivalent to bootstrapping exists for rearrangement-based phylogenetic analysis. We describe a fast and accurate algorithm for rearrangement analysis that scales up, in both time and accuracy, to modern high-resolution genomic data. We also describe a novel approach to estimate the robustness of results, an equivalent to the bootstrapping analysis used in sequence-based phylogenetic reconstruction. We present the results of extensive testing on both simulated and real data showing that our algorithm returns very accurate results, while scaling linearly with the size of the genomes and cubically with their number. We also present extensive experimental results showing that our approach to robustness testing provides excellent estimates of confidence, which, moreover, can be tuned to trade off thresholds between false positives and false negatives. Together, these two novel approaches enable us to attack heretofore intractable problems, such as phylogenetic inference for high-resolution vertebrate genomes, as we demonstrate on a set of six vertebrate genomes with 8,380 syntenic blocks. A copy of the software is available on demand.

Publication

Bacterial Whole Cell Typing by Mass Spectra Pattern Matching with Bootstrapping Assessment

Publisher: American Chemical Society (ACS)

Date: 10-11-2017

DOI: 10.1021/ACS.ANALCHEM.7B03820

Abstract: Bacterial typing is of great importance in clinical diagnosis, environmental monitoring, food safety analysis, and biological research. Matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) is now widely used to analyze bacterial samples. Identification of bacteria at the species level can be realized by matching the mass spectra of samples against a library of mass spectra of known bacteria. Nevertheless, in order to reasonably type bacteria, identification accuracy should be further improved. Herein, we propose a new framework to the identification and assessment for MALDI-MS based bacterial analysis. Our approach combines new measures for spectra similarity and a novel bootstrapping assessment. We tested our approach on a general data set containing the mass spectra of 1741 strains of bacteria and another challenging data set containing 250 strains, including 40 strains in the Bacillus cereus group that were previously claimed to be impossible to resolve by MALDI-MS. With the bootstrapping assessment, we achieved much more reliable predictions at both the genus and species level, and enabled to resolve the Bacillus cereus group. To the best of the authors' knowledge, our method is the first to provide a statistical assessment to MALDI-MS based bacterial typing that could lead to more reliable bacterial typing.

Publication

An Iterative Approach for Phylogenetic Analysis of Tumor Progression Using FISH Copy Number

Publisher: Springer International Publishing

Date: 2015

DOI: 10.1007/978-3-319-19048-8_34

Publication

Publishing Differentially Private Datasets via Stable Microaggregation

Publisher: No publisher found

Date: 2019

DOI: 10.5441/002/EDBT.2019.81

Publication

A Highly Scalable Labelling Approach for Exact Distance Queries in Complex Networks

Publisher: No publisher found

Date: 2019

DOI: 10.5441/002/EDBT.2019.03

Publication

An Exact Algorithm to Compute the DCJ Distance for Genomes with Duplicate Genes

Publisher: Springer International Publishing

Date: 2014

DOI: 10.1007/978-3-319-05269-4_22

Publication

Approximating the edit distance for genomes with duplicate genes under DCJ, insertion and deletion

Publisher: Springer Science and Business Media LLC

Date: 12-2012

DOI: 10.1186/1471-2105-13-S19-S13

Publication

TIBA: a tool for phylogeny inference from rearrangement data with bootstrap analysis

Publisher: Oxford University Press (OUP)

Date: 11-10-2012

DOI: 10.1093/BIOINFORMATICS/BTS603

Abstract: Summary: TIBA is a tool to reconstruct phylogenetic trees from rearrangement data that consist of ordered lists of synteny blocks (or genes), where each synteny block is shared with all of its homologues in the input genomes. The evolution of these synteny blocks, through rearrangement operations, is modelled by the uniform Double-Cut-and-Join model. Using a true distance estimate under this model and simple distance-based methods, TIBA reconstructs a phylogeny of the input genomes. Unlike any previous tool for inferring phylogenies from rearrangement data, TIBA uses novel methods of robustness estimation to provide support values for the edges in the inferred tree. Availability: lcbb.epfl.ch/softwares/tiba.html. Contact: vaibhav.rajan@epfl.ch

Publication

Bootstrapping Phylogenies Inferred from Rearrangement Data

Publisher: Springer Berlin Heidelberg

Date: 2011

DOI: 10.1007/978-3-642-23038-7_16

Publication

HaploJuice: accurate haplotype assembly from a pool of sequences with known relative concentrations

Publisher: Springer Science and Business Media LLC

Date: 22-10-2018

DOI: 10.1186/S12859-018-2424-7

Publication

Study of cell differentiation by phylogenetic analysis using histone modification data

Publisher: Springer Science and Business Media LLC

Date: 08-08-2014

DOI: 10.1186/1471-2105-15-269

Publication

Detection and analysis of ancient segmental duplications in mammalian genomes

Publisher: Cold Spring Harbor Laboratory

Date: 07-05-2018

DOI: 10.1101/GR.228718.117

Abstract: Although segmental duplications (SDs) represent hotbeds for genomic rearrangements and emergence of new genes, there are still no easy-to-use tools for identifying SDs. Moreover, while most previous studies focused on recently emerged SDs, detection of ancient SDs remains an open problem. We developed an SDquest algorithm for SD finding and applied it to analyzing SDs in human, gorilla, and mouse genomes. Our results demonstrate that previous studies missed many SDs in these genomes and show that SDs account for at least 6.05% of the human genome (version hg19), a 17% increase as compared to the previous estimate. Moreover, SDquest classified 6.42% of the latest GRCh38 version of the human genome as SDs, a large increase as compared to previous studies. We thus propose to re-evaluate evolution of SDs based on their accurate representation across multiple genomes. Toward this goal, we analyzed the complex mosaic structure of SDs and decomposed mosaic SDs into elementary SDs, a prerequisite for follow-up evolutionary analysis. We also introduced the concept of the breakpoint graph of mosaic SDs that revealed SD hotspots and suggested that some SDs may have originated from circular extrachromosomal DNA (ecDNA), not unlike ecDNA that contributes to accelerated evolution in cancer.

Publication

A Metric for Phylogenetic Trees Based on Matching

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 07-2012

DOI: 10.1109/TCBB.2011.157

Publication

Fast and Accurate Phylogenetic Reconstruction from High-Resolution Whole-Genome Data and a Novel Robustness Estimator

Publisher: Springer Berlin Heidelberg

Date: 2017

DOI: 10.1007/978-3-642-16181-0_12

Publication

Modeling methane emission from wetlands in North-Eastern New South Wales, Australia using Landsat ETM+

Publisher: MDPI AG

Date: 17-05-2010

DOI: 10.3390/RS2051378

Publication

A New Genomic Evolutionary Model for Rearrangements, Duplications, and Losses That Applies across Eukaryotes and Prokaryotes

Publisher: Springer Berlin Heidelberg

Date: 2010

DOI: 10.1007/978-3-642-16181-0_19

Publication

What is the difference between the breakpoint graph and the de Bruijn graph?

Publisher: Springer Science and Business Media LLC

Date: 10-2014

DOI: 10.1186/1471-2164-15-S6-S6

Publication

Assembly of long, error-prone reads using repeat graphs

Publisher: Springer Science and Business Media LLC

Date: 04-2019

DOI: 10.1038/S41587-019-0072-8

Abstract: Accurate genome assembly is hampered by repetitive regions. Although long single molecule sequencing reads are better able to resolve genomic repeats than short-read data, most long-read assembly algorithms do not provide the repeat characterization necessary for producing optimal assemblies. Here, we present Flye, a long-read assembly algorithm that generates arbitrary paths in an unknown repeat graph, called disjointigs, and constructs an accurate repeat graph from these error-riddled disjointigs. We benchmark Flye against five state-of-the-art assemblers and show that it generates better or comparable assemblies, while being an order of magnitude faster. Flye nearly doubled the contiguity of the human genome assembly (as measured by the NGA50 assembly quality metric) compared with existing assemblers.

Publication

Assembly of long error-prone reads using de Bruijn graphs

Publisher: Proceedings of the National Academy of Sciences

Date: 12-12-2016

DOI: 10.1073/PNAS.1604560113

Abstract: When the long reads generated using single-molecule sequencing (SMS) technology were made available, most researchers were skeptical about the ability of existing algorithms to generate high-quality assemblies from long error-prone reads. Nevertheless, recent algorithmic breakthroughs resulted in many successful SMS sequencing projects. However, as the recent assemblies of important plant pathogens illustrate, the problem of assembling long error-prone reads is far from being resolved even in the case of relatively short bacterial genomes. We propose an algorithmic approach for assembling long error-prone reads and describe the ABruijn assembler, which results in accurate genome reconstructions.

Publication

Phylogenetic Analysis of Cell Types Using Histone Modifications

Publisher: Springer Berlin Heidelberg

Date: 2013

DOI: 10.1007/978-3-642-40453-5_25

Publication

In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics

Publisher: Springer Science and Business Media LLC

Date: 09-01-2020

DOI: 10.1038/S41467-019-13866-Z

Abstract: Data-independent acquisition (DIA) is an emerging technology for quantitative proteomic analysis of large cohorts of samples. However, sample-specific spectral libraries built by data-dependent acquisition (DDA) experiments are required prior to DIA analysis, which is time-consuming and limits the identification/quantification by DIA to the peptides identified by DDA. Herein, we propose DeepDIA, a deep learning-based approach to generate in silico spectral libraries for DIA analysis. We demonstrate that the quality of in silico libraries predicted by instrument-specific models using DeepDIA is comparable to that of experimental libraries, and outperforms libraries generated by global models. With peptide detectability prediction, in silico libraries can be built directly from protein sequence databases. We further illustrate that DeepDIA can break through the limitation of DDA on peptide/protein detection, and enhance DIA analysis on human serum samples compared to the state-of-the-art protocol using a DDA library. We expect this work expanding the toolbox for DIA proteomics.

Publication

Large-scale 3D chromatin reconstruction from chromosomal contacts

Publisher: Springer Science and Business Media LLC

Date: 04-2019

DOI: 10.1186/S12864-019-5470-2

Publication

Examining the potential impacts of sea level rise on coastal wetlands in north-eastern NSW, Australia

Publisher: Springer Science and Business Media LLC

Date: 30-07-2011

DOI: 10.1007/S11852-010-0114-3

Publication

Estimating true evolutionary distances under rearrangements, duplications, and losses

Publisher: Springer Science and Business Media LLC

Date: 2010

DOI: 10.1186/1471-2105-11-S1-S54

Publication

Can a breakpoint graph be decomposed into none other than 2-cycles?

Publisher: Elsevier BV

Date: 07-2018

DOI: 10.1016/J.TCS.2017.09.019

Publication

AN ITERATIVE ALGORITHM TO QUANTIFY FACTORS INFLUENCING PEPTIDE FRAGMENTATION DURING TANDEM MASS SPECTROMETRY

Publisher: World Scientific Pub Co Pte Lt

Date: 04-2007

DOI: 10.1142/S0219720007002643

Abstract: In protein identification by tandem mass spectrometry, it is critical to accurately predict the theoretical spectrum for a peptide sequence. To date, the widely-used database searching methods adopted simple statistical models for predicting. For some peptides, these models usually yield a theoretical spectrum with a significant deviation from the experimental one. In this paper, in order to derive an improved predicting model, we utilized a non-linear programming model to quantify the factors impacting peptide fragmentation. Then, an iterative algorithm was proposed to solve this optimization problem. Upon a training set of 1803 spectra, the experimental result showed a good agreement with some known principles about peptide fragmentation, such as the tendency to cleave at the middle of peptide, and Pro's preference of the N-terminal cleavage. Moreover, upon a testing set of 941 spectra, comparison of the predicted spectra against the experimental ones showed that this method can generate reasonable predictions. The results in this paper can offer help to both database searching and de novo methods.

Publication

Maximum Parsimony Analysis of Gene Copy Number Changes

Publisher: Springer Berlin Heidelberg

Date: 2015

DOI: 10.1007/978-3-662-48221-6_8

Publication

Deriving the Probabilities of Water Loss and Ammonia Loss for Amino Acids from Tandem Mass Spectra

Publisher: American Chemical Society (ACS)

Date: 20-12-2007

DOI: 10.1021/PR070479V

Abstract: In protein identification through tandem mass spectrometry, it is critical to accurately predict the theoretical spectrum for a peptide sequence. The widely used prediction models, such as SEQUEST and MASCOT, ignore the intensity of the ions with important neutral losses, including water loss and ammonia loss. However, ignoring these neutral losses results in a significant deviation between the predicted theoretical spectrum and its experimental counterpart. Here, based on the "one peak, multiple explanations" observation, we proposed an expectation-maximization (EM) method to automatically learn the probabilities of water loss and ammonia loss for each amino acid. Then we employed these probabilities to design an improved statistical model for theoretical spectrum prediction. We implemented these methods and tested them on practical data. On a training set containing 1803 spectra, the experimental results show a good agreement with some known knowledge about neutral losses, such as the tendency of water loss from Asp, Glu, Ser, and Thr. Furthermore, on a testing set containing 941 spectra, the improved similarity between the experimental and predicted spectra demonstrates that this method can generate more reasonable predictions relative to the model that ignores neutral losses. As an application of the derived probabilities, we implemented a database searching method adopting the improved theoretical spectrum model with neutral loss ions estimated. Experimental results on Keller's data set demonstrate that this method can identify peptides more accurately than SEQUEST. In another application to validate SEQUEST's results, the reported peptide-spectrum pairs are reranked with respect to the similarity between experimental and predicted spectra. Experimental results on both LTQ and QSTAR data sets suggest that this reranking strategy can effectively distinguish the false negative predictions reported by SEQUEST.

Publication

Metagenomics Binning of Long Reads Using Read-Overlap Graphs

Publisher: Springer International Publishing

Date: 2022

DOI: 10.1007/978-3-031-06220-9_15

Publication

VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction From Assembly Graphs

Publisher: Cold Spring Harbor Laboratory

Date: 21-10-2022

DOI: 10.1101/2022.10.21.513181

Abstract: With the high mutation rate in viruses, a mixture of closely related viral strains (called viral quasispecies) often co-infect an individual host. Reconstructing individual strains from viral quasispecies is a key step to characterizing the viral population, revealing strain-level genetic variability, and providing insights into biomedical and clinical studies. Reference-based approaches of reconstructing viral strains suffer from the lack of high-quality references due to high mutation rates and biased variant calling introduced by a selected reference. De novo methods require no references but face challenges due to errors in reads, the high similarity of quasispecies, and uneven abundance of strains. In this paper, we propose VStrains, a de novo approach for reconstructing strains from viral quasispecies. VStrains incorporates contigs, paired-end reads, and coverage information to iteratively extract the strain-specific paths from assembly graphs. We benchmark VStrains against multiple state-of-the-art de novo and reference-based approaches on both simulated and real datasets. Experimental results demonstrate that VStrains achieves the best overall performance on both simulated and real datasets under a comprehensive set of metrics such as genome fraction, duplication ratio, NGA50, error rate, etc. VStrains is freely available at github.com/MetaGenTools/VStrains.

Publication

Monitoring coastal wetland communities in north-eastern NSW using ASTER and Landsat satellite data

Publisher: Springer Science and Business Media LLC

Date: 11-02-2010

DOI: 10.1007/S11273-010-9176-0

Publication

Approximation Algorithms for Bi-clustering Problems

Publisher: Springer Berlin Heidelberg

Date: 2006

DOI: 10.1007/11851561_29

Publication

GraphBin2: Refined and Overlapped Binning of Metagenomic Contigs Using Assembly Graphs

Publisher: Schloss Dagstuhl - Leibniz-Zentrum für Informatik

Date: 2020

DOI: 10.4230/LIPICS.WABI.2020.8

Publication

A maximum-likelihood approach for building cell-type trees by lifting

Publisher: Springer Science and Business Media LLC

Date: 2016

DOI: 10.1186/S12864-015-2297-3

Publication

MetaBCC-LR: metagenomics binning by coverage and composition for long reads

Publisher: Oxford University Press (OUP)

Date: 07-2020

DOI: 10.1093/BIOINFORMATICS/BTAA441

Abstract: Metagenomics studies have provided key insights into the composition and structure of microbial communities found in different environments. Among the techniques used to analyse metagenomic data, binning is considered a crucial step to characterize the different species of micro-organisms present. The use of short-read data in most binning tools poses several limitations, such as insufficient species-specific signal, and the emergence of long-read sequencing technologies offers us opportunities to surmount them. However, most current metagenomic binning tools have been developed for short reads. The few tools that can process long reads either do not scale with increasing input size or require a database with reference genomes that are often unknown. In this article, we present MetaBCC-LR, a scalable reference-free binning method which clusters long reads directly based on their k-mer coverage histograms and oligonucleotide composition. We evaluate MetaBCC-LR on multiple simulated and real metagenomic long-read datasets with varying coverages and error rates. Our experiments demonstrate that MetaBCC-LR substantially outperforms state-of-the-art reference-free binning tools, achieving ∼13% improvement in F1-score and ∼30% improvement in ARI compared to the best previous tools. Moreover, we show that using MetaBCC-LR before long-read assembly helps to enhance the assembly quality while significantly reducing the assembly cost in terms of time and memory usage. The efficiency and accuracy of MetaBCC-LR pave the way for more effective long-read-based metagenomics analyses to support a wide range of applications. The source code is freely available at: nuradhawick/MetaBCC-LR. Supplementary data are available at Bioinformatics online.

Publication

A New Genomic Evolutionary Model for Rearrangements, Duplications, and Losses that Applies across Eukaryotes and Prokaryotes

Publisher: Mary Ann Liebert Inc

Date: 09-2011

DOI: 10.1089/CMB.2011.0098

Abstract: Genomic rearrangements have been studied since the beginnings of modern genetics and models for such rearrangements have been the subject of many papers over the last 10 years. However, none of the extant models can predict the evolution of genomic organization into circular unichromosomal genomes (as in most prokaryotes) and linear multichromosomal genomes (as in most eukaryotes). Very few of these models support gene duplications and losses--yet these events may be more common in evolutionary history than rearrangements and themselves cause apparent rearrangements. We propose a new evolutionary model that integrates gene duplications and losses with genome rearrangements and that leads to genomes with either one (or a very few) circular chromosome or a collection of linear chromosomes. Our model is based on existing rearrangement models and inherits their linear-time algorithms for pairwise distance computation (for rearrangement only). Moreover, our model predictions fit observations about the evolution of gene family sizes and agree with the existing predictions about the growth in the number of chromosomes in eukaryotic genomes.

Publication

Estimating true evolutionary distances under the DCJ model

Publisher: Oxford University Press (OUP)

Date: 07-2008

DOI: 10.1093/BIOINFORMATICS/BTN148

Abstract: Motivation: Modern techniques can yield the ordering and strandedness of genes on each chromosome of a genome; such data already exists for hundreds of organisms. The evolutionary mechanisms through which the set of the genes of an organism is altered and reordered are of great interest to systematists, evolutionary biologists, comparative genomicists and biomedical researchers. Perhaps the most basic concept in this area is that of evolutionary distance between two genomes: under a given model of genomic evolution, how many events most likely took place to account for the difference between the two genomes? Results: We present a method to estimate the true evolutionary distance between two genomes under the ‘double-cut-and-join’ (DCJ) model of genome rearrangement, a model under which a single multichromosomal operation accounts for all genomic rearrangement events: inversion, transposition, translocation, block interchange and chromosomal fusion and fission. Our method relies on a simple structural characterization of a genome pair and is both analytically and computationally tractable. We provide analytical results to describe the asymptotic behavior of genomes under the DCJ model, as well as experimental results on a wide variety of genome structures to exemplify the very high accuracy (and low variance) of our estimator. Our results provide a tool for accurate phylogenetic reconstruction from multichromosomal gene rearrangement data as well as a theoretical basis for refinements of the DCJ model to account for biological constraints. Availability: All of our software is available in source form under GPL at lcbb.epfl.ch. Contact: bernard.moret@epfl.ch

Publication

Skyblocking for entity resolution

Publisher: Elsevier BV

Date: 11-2019

DOI: 10.1016/J.IS.2019.06.003

Publication

Analysis of gene copy number changes in tumor phylogenetics

Publisher: Springer Science and Business Media LLC

Date: 22-09-2016

DOI: 10.1186/S13015-016-0088-2

Publication

Approximation Algorithms for Biclustering Problems

Publisher: Society for Industrial & Applied Mathematics (SIAM)

Date: 2008

DOI: 10.1137/060664112

Publication

dK-Microaggregation: Anonymizing Graphs with Differential Privacy Guarantees

Publisher: Springer International Publishing

Date: 2020

DOI: 10.1007/978-3-030-47436-2_15

Publication

Kmer2SNP: reference-free SNP calling from raw reads based on matching

Publisher: Cold Spring Harbor Laboratory

Date: 17-05-2020

DOI: 10.1101/2020.05.17.100305

Abstract: The development of DNA sequencing technologies provides the opportunity to call heterozygous SNPs for each individual. SNP calling is a fundamental problem of genetic analysis and has many applications, such as gene-disease diagnosis, drug design, and ancestry inference. Reference-based SNP calling approaches generate highly accurate results, but they face serious limitations especially when high-quality reference genomes are not available for many species. Although reference-free approaches have the potential to call SNPs without using the reference genome, they have not been widely applied on large and complex genomes because existing approaches suffer from low recall/precision or high runtime. We develop a reference-free algorithm Kmer2SNP to call SNP directly from raw reads. Kmer2SNP first computes the k-mer frequency distribution from reads and identifies potential heterozygous k-mers which only appear in one haplotype. Kmer2SNP then constructs a graph by choosing these heterozygous k-mers as vertices and connecting edges between pairs of heterozygous k-mers that might correspond to SNPs. Kmer2SNP further assigns a weight to each edge using overlapping information between heterozygous k-mers, computes a maximum weight matching and finally outputs SNPs as edges between k-mer pairs in the matching. We benchmark Kmer2SNP against reference-free methods including hybrid (assembly-based) and assembly-free methods on both simulated and real datasets. Experimental results show that Kmer2SNP achieves better SNP calling quality while being an order of magnitude faster than the state-of-the-art methods. Kmer2SNP shows the potential of calling SNPs only using k-mers from raw reads without assembly. The source code is freely available at anboANU/Kmer2SNP.

Publication

Morphological stasis masks ecologically divergent coral species on tropical reefs

Publisher: Elsevier BV

Date: 06-2021

DOI: 10.1016/J.CUB.2021.03.028

Abstract: Coral reefs are the epitome of species diversity, yet the number of described scleractinian coral species, the framework-builders of coral reefs, remains moderate by comparison. DNA sequencing studies are rapidly challenging this notion by exposing a wealth of undescribed diversity, but the evolutionary and ecological significance of this diversity remains largely unclear. Here, we present an annotated genome for one of the most ubiquitous corals in the Indo-Pacific (Pachyseris speciosa) and uncover, through a comprehensive genomic and phenotypic assessment, that it comprises morphologically indistinguishable but ecologically divergent lineages. Demographic modeling based on whole-genome resequencing indicated that morphological crypsis (across micro- and macromorphological traits) was due to ancient morphological stasis rather than recent divergence. Although the lineages occur sympatrically across shallow and mesophotic habitats, extensive genotyping using a rapid molecular assay revealed differentiation of their ecological distributions. Leveraging "common garden" conditions facilitated by the overlapping distributions, we assessed physiological and quantitative skeletal traits and demonstrated concurrent phenotypic differentiation. Lastly, spawning observations of genotyped colonies highlighted the potential role of temporal reproductive isolation in the limited admixture, with consistent genomic signatures in genes related to morphogenesis and reproduction. Overall, our findings demonstrate the presence of ecologically and phenotypically divergent coral species without substantial morphological differentiation and provide new leads into the potential mechanisms facilitating such divergence. More broadly, they indicate that our current taxonomic framework for reef-building corals may be scratching the surface of the ecologically relevant diversity on coral reefs, consequently limiting our ability to protect or restore this diversity effectively.

Publication

Sorting Signed Permutations by Inversions in O(nlogn) Time

Publisher: Mary Ann Liebert Inc

Date: 03-2010

DOI: 10.1089/CMB.2009.0184

Abstract: The study of genomic inversions (or reversals) has been a mainstay of computational genomics for nearly 20 years. After the initial breakthrough of Hannenhalli and Pevzner, who gave the first polynomial-time algorithm for sorting signed permutations by inversions, improved algorithms have been designed, culminating with an optimal linear-time algorithm for computing the inversion distance and a subquadratic algorithm for providing a shortest sequence of inversions--also known as sorting by inversions. Remaining open was the question of whether sorting by inversions could be done in O(nlogn) time. In this article, we present a qualified answer to this question, by providing two new sorting algorithms, a simple and fast randomized algorithm and a deterministic refinement. The deterministic algorithm runs in time O(nlogn + kn), where k is a data-dependent parameter. We provide the results of extensive experiments showing that both the average and the standard deviation for k are small constants, independent of the size of the permutation. We conclude (but do not prove) that almost all signed permutations can be sorted by inversions in O(nlogn) time.

Publication

Sorting genomes with rearrangements and segmental duplications through trajectory graphs

Publisher: Springer Science and Business Media LLC

Date: 10-2013

DOI: 10.1186/1471-2105-14-S15-S9

Publication

Hurdles Hardly Have to Be Heeded

Publisher: Springer Berlin Heidelberg

Date: 2008

DOI: 10.1007/978-3-540-87989-3_18

Publication

DCHap: A divide-and-conquer haplotype phasing algorithm for third-generation sequences

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 2020

DOI: 10.1109/TCBB.2020.3005673

Publication

AN ITERATIVE ALGORITHM TO QUANTIFY THE FACTORS INFLUENCING PEPTIDE FRAGMENTATION FOR MS/MS SPECTRUM

Publisher: PUBLISHED BY IMPERIAL COLLEGE PRESS AND DISTRIBUTED BY WORLD SCIENTIFIC PUBLISHING CO.

Date: 07-2006

DOI: 10.1142/9781860947575_0042

Publication

G-quadruplex structures mark human regulatory chromatin

Publisher: Springer Science and Business Media LLC

Date: 12-09-2016

DOI: 10.1038/NG.3662

Abstract: G-quadruplex (G4) structural motifs have been linked to transcription, replication and genome instability and are implicated in cancer and other diseases. However, it is crucial to demonstrate the bona fide formation of G4 structures within an endogenous chromatin context. Herein we address this through the development of G4 ChIP-seq, an antibody-based G4 chromatin immunoprecipitation and high-throughput sequencing approach. We find ∼10,000 G4 structures in human chromatin, predominantly in regulatory, nucleosome-depleted regions. G4 structures are enriched in the promoters and 5' UTRs of highly transcribed genes, particularly in genes related to cancer and in somatic copy number amplifications, such as MYC. Strikingly, de novo and enhanced G4 formation are associated with increased transcriptional activity, as shown by HDAC inhibitor-induced chromatin relaxation and observed in immortalized as compared to normal cellular states. Our findings show that regulatory, nucleosome-depleted chromatin and elevated transcription shape the endogenous human G4 DNA landscape.

Publication

Direct MALDI-TOF MS Identification of Bacterial Mixtures

Publisher: American Chemical Society (ACS)

Date: 09-08-2018

DOI: 10.1021/ACS.ANALCHEM.8B02258

Abstract: Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is now widely used to characterize bacterial samples for clinical diagnosis, food safety control, environmental monitoring, and so on. However, existing standard approaches are only applied to analyze single colonies purified by plate culture, which limits the approaches to cultivable bacteria and makes the whole approach time-consuming. In this work, we propose a new framework to analyze MALDI-TOF spectra of bacterial mixtures and to directly characterize each component without purification procedures. The framework is a combination of a synthetic mixture model based on a non-negative linear combination of candidate reference spectra and a statistical assessment by in silico generated spectra via a jackknife resampling. Ninety-seven model bacterial mixture samples and 8 cocultured blind-coded bacterial mixture samples, containing up to 6 strains in varied ratios in each sample, together with a reference database containing the mass spectra of 1081 strains, were used to validate the framework. High sensitivity (>80%, with error rate 60% for balanced quaternary and pentabasic mixtures, and 48%-71% for asymmetric situation, with error rate <5%. The work can facilitate rapid and reliable characterization of bacterial mixtures without purification procedures, which is of practical value in clinical diagnosis, food safety control, environmental monitoring, and so on. The framework can be further applied to many other spectroscopy-based analytics to interpret spectra from mixed samples.

Publication

A Metric for Phylogenetic Trees Based on Matching

Publisher: Springer Berlin Heidelberg

Date: 2011

DOI: 10.1007/978-3-642-21260-4_21

Publication

Manifold de Bruijn Graphs

Publisher: Springer Berlin Heidelberg

Date: 2014

DOI: 10.1007/978-3-662-44753-6_22

Yu Lin

Researcher

Related Links

Publications

Bootstrapping phylogenies inferred from rearrangement data

An Algorithm to Mine Therapeutic Motifs for Cancer From Networks of Genetic Interactions

MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs

VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction from Assembly Graphs

Phylogenetic Reconstruction for Copy-Number Evolution Problems

Sorting Signed Permutations by Inversions in O(nlogn) Time

Heuristics for the inversion median problem

Hurdles and Sorting by Inversions: Combinatorial, Statistical, and Experimental Results

A Fragmentation Event Model for Peptide Identification by Mass Spectrometry

Morphological stasis masks ecologically divergent coral species on tropical reefs

An Exact Algorithm to Compute the Double-Cut-and-Join Distance for Genomes with Duplicate Genes

Reconstructing Yeasts Phylogenies and Ancestors from Whole Genome Data

A Median Solver and Phylogenetic Inference Based on Double-Cut-and-Join Sorting

Rearrangements in Phylogenetic Inference: Compare, Model, or Encode?

GraphPlas: Refined Classification of Plasmid Sequences Using Assembly Graphs

Phylogeny analysis from gene-order data with massive duplications

MLGO: phylogeny reconstruction and ancestral inference from gene-order data

Fast and Accurate Phylogenetic Reconstruction from High-Resolution Whole-Genome Data and a Novel Robustness Estimator

Bacterial Whole Cell Typing by Mass Spectra Pattern Matching with Bootstrapping Assessment

An Iterative Approach for Phylogenetic Analysis of Tumor Progression Using FISH Copy Number

Publishing Differentially Private Datasets via Stable Microaggregation

A Highly Scalable Labelling Approach for Exact Distance Queries in Complex Networks

An Exact Algorithm to Compute the DCJ Distance for Genomes with Duplicate Genes

Approximating the edit distance for genomes with duplicate genes under DCJ, insertion and deletion

TIBA: a tool for phylogeny inference from rearrangement data with bootstrap analysis

Bootstrapping Phylogenies Inferred from Rearrangement Data

HaploJuice: accurate haplotype assembly from a pool of sequences with known relative concentrations

Study of cell differentiation by phylogenetic analysis using histone modification data

Detection and analysis of ancient segmental duplications in mammalian genomes

A Metric for Phylogenetic Trees Based on Matching

Fast and Accurate Phylogenetic Reconstruction from High-Resolution Whole-Genome Data and a Novel Robustness Estimator

Modeling methane emission from wetlands in North-Eastern New South Wales, Australia using Landsat ETM+

A New Genomic Evolutionary Model for Rearrangements, Duplications, and Losses That Applies across Eukaryotes and Prokaryotes

What is the difference between the breakpoint graph and the de Bruijn graph?

Assembly of long, error-prone reads using repeat graphs

Assembly of long error-prone reads using de Bruijn graphs

Phylogenetic Analysis of Cell Types Using Histone Modifications

In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics

Large-scale 3D chromatin reconstruction from chromosomal contacts

Examining the potential impacts of sea level rise on coastal wetlands in north-eastern NSW, Australia

Estimating true evolutionary distances under rearrangements, duplications, and losses

Can a breakpoint graph be decomposed into none other than 2-cycles?

AN ITERATIVE ALGORITHM TO QUANTIFY FACTORS INFLUENCING PEPTIDE FRAGMENTATION DURING TANDEM MASS SPECTROMETRY

Maximum Parsimony Analysis of Gene Copy Number Changes

Deriving the Probabilities of Water Loss and Ammonia Loss for Amino Acids from Tandem Mass Spectra

Metagenomics Binning of Long Reads Using Read-Overlap Graphs

VStrains: De Novo Reconstruction of Viral Strains via Iterative Path Extraction From Assembly Graphs

Monitoring coastal wetland communities in north-eastern NSW using ASTER and Landsat satellite data

Approximation Algorithms for Bi-clustering Problems

GraphBin2: Refined and Overlapped Binning of Metagenomic Contigs Using Assembly Graphs

A maximum-likelihood approach for building cell-type trees by lifting

MetaBCC-LR: metagenomics binning by coverage and composition for long reads

A New Genomic Evolutionary Model for Rearrangements, Duplications, and Losses that Applies across Eukaryotes and Prokaryotes

Estimating true evolutionary distances under the DCJ model

Skyblocking for entity resolution

Analysis of gene copy number changes in tumor phylogenetics

Approximation Algorithms for Biclustering Problems

dK-Microaggregation: Anonymizing Graphs with Differential Privacy Guarantees

Kmer2SNP: reference-free SNP calling from raw reads based on matching

Morphological stasis masks ecologically divergent coral species on tropical reefs

Sorting Signed Permutations by Inversions in O(nlogn) Time

Sorting genomes with rearrangements and segmental duplications through trajectory graphs

Hurdles Hardly Have to Be Heeded

DCHap: A divide-and-conquer haplotype phasing algorithm for third-generation sequences

AN ITERATIVE ALGORITHM TO QUANTIFY THE FACTORS INFLUENCING PEPTIDE FRAGMENTATION FOR MS/MS SPECTRUM

G-quadruplex structures mark human regulatory chromatin

Direct MALDI-TOF MS Identification of Bacterial Mixtures

A Metric for Phylogenetic Trees Based on Matching

Manifold de Bruijn Graphs

Related Organisations

Australian National University

École Polytechnique Fédérale De Lausanne