ARDC Research Link Australia

Publication

The Influence of Model Violation on Phylogenetic Inference: A Simulation Study

Publisher: Cold Spring Harbor Laboratory

Date: 24-09-2021

DOI: 10.1101/2021.09.22.461455

Abstract: Phylogenetic inference typically assumes that the data has evolved under Stationary, Reversible and Homogeneous (SRH) conditions. Many empirical and simulation studies have shown that assuming SRH conditions can lead to significant errors in phylogenetic inference when the data violates these assumptions. Yet, many simulation studies focused on extreme non-SRH conditions that represent worst-case scenarios and not the average empirical dataset. In this study, we simulate datasets under various degrees of non-SRH conditions using empirically derived parameters to mimic real data and examine the effects of incorrectly assuming SRH conditions on inferring phylogenies. Our results show that maximum likelihood inference is generally quite robust to a wide range of SRH model violations but is inaccurate under extreme convergent evolution.

Publication

Whole genome analysis of a Vietnamese trio

Publisher: Springer Science and Business Media LLC

Date: 03-2015

DOI: 10.1007/S12038-015-9501-0

Abstract: We here present the first whole genome analysis of an anonymous Kinh Vietnamese (KHV) trio whose genomes were deeply sequenced to 30-fold average coverage. The resulting short reads covered 99.91 percent of the human reference genome (GRCh37d5). We identified 4,719,412 SNPs and 827,385 short indels that satisfied the Mendelian inheritance law. Among them, 109,914 (2.3 percent) SNPs and 59,119 (7.1 percent) short indels were novel. We also detected 30,171 structural variants of which 27,604 (91.5 percent) were large indels. There were 6,681 large indels in the range 0.1-100 kbp occurring in the child genome that were also confirmed in either the father or mother genome. We compared these large indels against the DGV database and found that 1,499 (22.44 percent) were KHV specific. De novo assembly of high-quality unmapped reads yielded 789 contigs with the length greater than or equal to 300 bp. There were 235 contigs from the child genome of which 199 (84.7 percent) were significantly matched with at least one contig from the father or mother genome. Blasting these 199 contigs against other alternative human genomes revealed 4 novel contigs. The novel variants identified from our study demonstrated the necessity of conducting more genome-wide studies not only for Kinh but also for other ethnic groups in Vietnam.
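The abstract does not say how variants were checked against the Mendelian inheritance law. As a minimal sketch only, assuming diploid autosomal sites, error-free genotype calls, and no de novo mutation, a per-site trio check could look like the following. The function name and the genotype representation are hypothetical and do not come from the study.

```python
from typing import Tuple

# Hypothetical representation: the two alleles called at one autosomal site.
Genotype = Tuple[str, str]


def is_mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """Return True if the child could carry one allele from each parent."""
    c1, c2 = child
    # Either allele order is allowed: c1 from the father and c2 from the
    # mother, or the other way round.
    return (c1 in father and c2 in mother) or (c2 in father and c1 in mother)


if __name__ == "__main__":
    # Heterozygous child of two opposite homozygotes: consistent.
    print(is_mendelian_consistent(("A", "G"), ("A", "A"), ("G", "G")))
    # Child G/G with an A/A father: not consistent under these assumptions.
    print(is_mendelian_consistent(("G", "G"), ("A", "A"), ("A", "G")))
```

A real pipeline would also need to handle missing calls, sex chromosomes, and multi-allelic indels, which this sketch ignores.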

Publication

Corrigendum to: IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era

Publisher: Oxford University Press (OUP)

Date: 18-06-2020

DOI: 10.1093/MOLBEV/MSAA131

Publication

nQMaker: estimating time non-reversible amino acid substitution models

Publisher: Cold Spring Harbor Laboratory

Date: 19-10-2021

DOI: 10.1101/2021.10.18.464754

Abstract: Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this paper, we introduce a maximum likelihood approach nQMaker, an extension of the recently published QMaker method, that allows the estimation of time non-reversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the non-reversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of datasets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the dataset. Notably, for the recently published plant and bird trees, these non-reversible models correctly recovered the commonly known root placements with very high statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software ( www.iqtree.org ), allowing users to estimate non-reversible models and rooted phylogenies from their own protein datasets.

Publication

A test statistic to quantify treelikeness in phylogenetics

Publisher: Cold Spring Harbor Laboratory

Date: 17-02-2021

DOI: 10.1101/2021.02.16.431544

Abstract: Most phylogenetic analyses assume that the evolutionary history of an alignment (either that of a single locus, or of multiple concatenated loci) can be described by a single bifurcating tree, the so-called treelikeness assumption. Treelikeness can be violated by biological events such as recombination, introgression, or incomplete lineage sorting, and by systematic errors in phylogenetic analyses. The incorrect assumption of treelikeness may then mislead phylogenetic inferences. To quantify and test for treelikeness in alignments, we develop a test statistic which we call the tree proportion. This statistic quantifies the proportion of the edge weights in a phylogenetic network that are represented in a bifurcating phylogenetic tree of the same alignment. We extend this statistic to a statistical test of treelikeness using a parametric bootstrap. We use extensive simulations to compare tree proportion to a range of related approaches. We show that tree proportion successfully identifies non-treelikeness in a wide range of simulation scenarios, and discuss its strengths and weaknesses compared to other approaches. The power of the tree-proportion test to reject non-treelike alignments can be lower than some other approaches, but these approaches tend to be limited in their scope and/or the ease with which they can be interpreted. Our recommendation is to test treelikeness of sequence alignments with both tree proportion and mosaic methods such as 3Seq. The scripts necessary to replicate this study are available at aitlinch/treelikeness
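The tree proportion described above is a ratio of edge weights, which can be illustrated with a small sketch. The code assumes the network has already been reduced to a mapping from splits to non-negative weights and that the splits of the tree are already known. How the network and tree are inferred is not covered, and the names below are hypothetical rather than taken from the authors' scripts.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical representation: a split is identified by the taxa on one side,
# using the same side consistently in the network and in the tree.
Split = FrozenSet[str]


def tree_proportion(network_weights: Dict[Split, float], tree_splits: Set[Split]) -> float:
    """Share of the total network edge weight carried by splits that are in the tree."""
    total = sum(network_weights.values())
    if total <= 0:
        raise ValueError("network must have positive total edge weight")
    in_tree = sum(w for split, w in network_weights.items() if split in tree_splits)
    return in_tree / total


if __name__ == "__main__":
    ab = frozenset({"a", "b"})
    ac = frozenset({"a", "c"})
    weights = {ab: 3.0, ac: 1.0}  # two conflicting splits in the network
    print(tree_proportion(weights, {ab}))  # the tree keeps only the a,b split
```

A value of 1 means every weighted split in the network also appears in the tree. Lower values indicate signal that the tree does not represent.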

Publication

Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression

Publisher: Cold Spring Harbor Laboratory

Date: 16-04-2020

DOI: 10.1101/2020.04.15.043786

Abstract: Our understanding of the evolutionary history of primates is undergoing continual revision due to ongoing genome sequencing efforts. Bolstered by growing fossil evidence, these data have led to increased acceptance of once controversial hypotheses regarding phylogenetic relationships, hybridization and introgression, and the biogeographical history of primate groups. Among these findings is a pattern of recent introgression between species within all major primate groups examined to date, though little is known about introgression deeper in time. To address this and other phylogenetic questions, here we present new reference genome assemblies for three Old World Monkey species: Colobus angolensis ssp. palliatus (the black and white colobus), Macaca nemestrina (southern pig-tailed macaque), and Mandrillus leucophaeus (the drill). We combine these data with 23 additional primate genomes to estimate both the species tree and individual gene trees using thousands of loci. While our species tree is largely consistent with previous phylogenetic hypotheses, the gene trees reveal high levels of genealogical discordance associated with multiple primate radiations. We use strongly asymmetric patterns of gene tree discordance around specific branches to identify multiple instances of introgression between ancestral primate lineages. In addition, we exploit recent fossil evidence to perform fossil-calibrated molecular dating analyses across the tree. Taken together, our genome-wide data help to resolve multiple contentious sets of relationships among primates, while also providing insight into the biological processes and technical artifacts that led to the disagreements in the first place.

Publication

Combined transcriptome and proteome profiling reveals specific molecular brain signatures for sex, maturation and circalunar clock phase

Publisher: eLife Sciences Publications, Ltd

Date: 15-02-2019

DOI: 10.7554/ELIFE.41556

Abstract: Many marine animals, ranging from corals to fishes, synchronise reproduction to lunar cycles. In the annelid Platynereis dumerilii, this timing is orchestrated by an endogenous monthly (circalunar) clock entrained by moonlight. Whereas daily (circadian) clocks cause extensive transcriptomic and proteomic changes, the quality and quantity of regulations by circalunar clocks have remained largely elusive. By establishing a combined transcriptomic and proteomic profiling approach, we provide first systematic insight into the molecular changes in Platynereis heads between circalunar phases, and across sexual differentiation and maturation. Whereas maturation elicits large transcriptomic and proteomic changes, the circalunar clock exhibits only minor transcriptomic, but strong proteomic regulation. Our study provides a versatile extraction technique and comprehensive resources. It corroborates that circadian and circalunar clock effects are likely distinct and identifies key molecular brain signatures for reproduction, sex and circalunar clock phase. Examples include prepro-whitnin/proctolin and ependymin-related proteins as circalunar clock targets.

Publication

A new phylogenetic tree sampling method for maximum parsimony bootstrapping and proof-of-concept implementation

Publisher: IEEE

Date: 10-2016

DOI: 10.1109/KSE.2016.7758020

Publication

Polymorphism-aware species trees with advanced mutation models, bootstrap and rate heterogeneity

Publisher: Cold Spring Harbor Laboratory

Date: 30-11-2018

DOI: 10.1101/483479

Abstract: Molecular phylogenetics has neglected polymorphisms within present and ancestral populations for a long time. Recently, multispecies coalescent based methods have increased in popularity; however, their application is limited to a small number of species and individuals. We introduced a polymorphism-aware phylogenetic model (PoMo), which overcomes this limitation and scales well with the increasing amount of sequence data while accounting for present and ancestral polymorphisms. PoMo circumvents handling of gene trees and directly infers species trees from allele frequency data. Here, we extend the PoMo implementation in IQ-TREE and integrate search for the statistically best-fit mutation model, the ability to infer mutation rate variation across sites, and assessment of branch support values. We exemplify an analysis of a hundred species with ten haploid individuals each, showing that PoMo can perform inference on large data sets. While PoMo is more accurate than standard substitution models applied to concatenated alignments, it is almost as fast. We also provide bmm-simulate, a software package that allows simulation of sequences evolving under PoMo. The new options consolidate the value of PoMo for phylogenetic analyses with population data.

Publication

IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era

Publisher: Oxford University Press (OUP)

Date: 03-02-2020

DOI: 10.1093/MOLBEV/MSAA015

Abstract: IQ-TREE (www.iqtree.org, last accessed February 6, 2020) is a user-friendly and widely used software package for phylogenetic inference using maximum likelihood. Since the release of version 1 in 2014, we have continuously expanded IQ-TREE to integrate a plethora of new models of sequence evolution and efficient computational approaches of phylogenetic inference to deal with genomic data. Here, we describe notable features of IQ-TREE version 2 and highlight the key advantages over other software.

Publication

DecentTree: Scalable Neighbour-Joining for the Genomic Era

Publisher: Oxford University Press (OUP)

Date: 08-07-2023

DOI: 10.1093/BIOINFORMATICS/BTAD536

Abstract: Neighbour-Joining is one of the most widely used distance-based phylogenetic inference methods. However, current implementations do not scale well for datasets with more than 10,000 sequences. Given the increasing pace of generating new sequence data, particularly in outbreaks of emerging diseases, and the already enormous existing databases of sequence data for which NJ is a useful approach, new implementations of existing methods are warranted. Here we present DecentTree, which provides highly optimised and parallel implementations of Neighbour-Joining and several of its variants. DecentTree is designed as a stand-alone application and a header-only library easily integrated with other phylogenetic software (e.g., it is integral in the popular IQ-TREE software). We show that DecentTree shows similar or improved performance over existing software (BIONJ, Quicktree, FastME, and RapidNJ), especially for handling very large alignments. For example, DecentTree is up to 6-fold faster than the fastest existing Neighbour-Joining software (e.g., RapidNJ) when generating a tree of 64,000 SARS-CoV-2 genomes. DecentTree is open source and freely available at qtree/decenttree. All code and data used in this analysis are available on Github (sdcid/Comparison-of-neighbour-joining-software). Supplementary data are available at Bioinformatics online.

Publication

Decisive Data Sets in Phylogenomics: Lessons from Studies on the Phylogenetic Relationships of Primarily Wingless Insects

Publisher: Oxford University Press (OUP)

Date: 18-10-2013

DOI: 10.1093/MOLBEV/MST196

Publication

HIV-1 Full-Genome Phylogenetics of Generalized Epidemics in Sub-Saharan Africa: Impact of Missing Nucleotide Characters in Next-Generation Sequences

Publisher: Mary Ann Liebert Inc

Date: 11-2017

DOI: 10.1089/AID.2017.0061

Publication

Author response: Combined transcriptome and proteome profiling reveals specific molecular brain signatures for sex, maturation and circalunar clock phase

Publisher: eLife Sciences Publications, Ltd

Date: 15-01-2019

DOI: 10.7554/ELIFE.41556.114

Publication

GHOST: Recovering historical signal from heterotachously-evolved sequence alignments

Publisher: Oxford University Press (OUP)

Date: 31-07-2019

DOI: 10.1093/SYSBIO/SYZ051

Abstract: Molecular sequence data that have evolved under the influence of heterotachous evolutionary processes are known to mislead phylogenetic inference. We introduce the General Heterogeneous evolution On a Single Topology (GHOST) model of sequence evolution, implemented under a maximum-likelihood framework in the phylogenetic program IQ-TREE (www.iqtree.org). Simulations show that using the GHOST model, IQ-TREE can accurately recover the tree topology, branch lengths, and substitution model parameters from heterotachously evolved sequences. We investigate the performance of the GHOST model on empirical data by sampling phylogenomic alignments of varying lengths from a plastome alignment. We then carry out inference under the GHOST model on a phylogenomic data set composed of 248 genes from 16 taxa, where we find the GHOST model concurs with the currently accepted view, placing turtles as a sister lineage of archosaurs, in contrast to results obtained using traditional variable rates-across-sites models. Finally, we apply the model to a data set composed of a sodium channel gene of 11 fish taxa, finding that the GHOST model is able to elucidate a subtle component of the historical signal, linked to the previously established convergent evolution of the electric organ in two geographically distinct lineages of electric fish. We compare inference under the GHOST model to partitioning by codon position and show that, owing to the minimization of model constraints, the GHOST model offers unique biological insights when applied to empirical data.

Publication

Undinarchaeota illuminate the evolution of DPANN archaea

Publisher: Cold Spring Harbor Laboratory

Date: 05-03-2020

DOI: 10.1101/2020.03.05.976373

Abstract: The evolution and diversification of Archaea is central to the history of life on Earth. Cultivation-independent approaches have revealed the existence of at least ten archaeal lineages whose members have small cell and genome sizes and limited metabolic capabilities and together comprise the tentative DPANN archaea. However, the phylogenetic diversity of DPANN and the placement of the various lineages of this group in the archaeal tree remain debated. Here, we reconstructed additional metagenome assembled genomes (MAGs) of a thus far uncharacterized archaeal phylum-level lineage UAP2 (Candidatus Undinarchaeota) affiliating with DPANN archaea. Comparative genome analyses revealed that members of the Undinarchaeota have small estimated genome sizes and, while potentially being able to conserve energy through fermentation, likely depend on partner organisms for the acquisition of vitamins, amino acids and other metabolites. Phylogenomic analyses robustly recovered Undinarchaeota as a major independent lineage between two highly supported clans of DPANN: one clan comprising Micrarchaeota, Altiarchaeota and Diapherotrites, and another encompassing all other DPANN. Our analyses also suggest that DPANN archaea may have exchanged core genes with their hosts by horizontal gene transfer (HGT), adding to the difficulty of placing DPANN in the archaeal tree. Together, our findings provide crucial insights into the origins and evolution of DPANN archaea and their hosts.

Publication

Ultrafast Approximation for Phylogenetic Bootstrap

Publisher: Oxford University Press (OUP)

Date: 15-02-2013

DOI: 10.1093/MOLBEV/MST024

Publication

pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies

Publisher: Oxford University Press (OUP)

Date: 26-07-2005

DOI: 10.1093/BIOINFORMATICS/BTI594

Abstract: IQPNNI is a program to infer maximum-likelihood phylogenetic trees from DNA or protein data with a large number of sequences. We present an improved and MPI-parallel implementation showing very good scaling and speed-up behavior.

Publication

Maximum likelihood pandemic-scale phylogenetics

Publisher: Springer Science and Business Media LLC

Date: 10-04-2023

DOI: 10.1038/S41588-023-01368-0

Abstract: Phylogenetics has a crucial role in genomic epidemiology. Enabled by unparalleled volumes of genome sequence data generated to study and help contain the COVID-19 pandemic, phylogenetic analyses of SARS-CoV-2 genomes have shed light on the virus’s origins, spread, and the emergence and reproductive success of new variants. However, most phylogenetic approaches, including maximum likelihood and Bayesian methods, cannot scale to the size of the datasets from the current pandemic. We present ‘MAximum Parsimonious Likelihood Estimation’ (MAPLE), an approach for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster, and requiring at least 100 times less memory on large datasets. This extends the reach of genomic epidemiology, allowing the continued use of accurate phylogenetic, phylogeographic and phylodynamic analyses on datasets of millions of genomes.

Publication

The Phylogenetic Likelihood Library

Publisher: Oxford University Press (OUP)

Date: 30-10-2014

DOI: 10.1093/SYSBIO/SYU084

Publication

Does This Schlank Make Me Look Fat?

Publisher: Elsevier BV

Date: 09-2018

DOI: 10.1016/J.TEM.2018.04.003

Publication

W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysis

Publisher: Oxford University Press (OUP)

Date: 15-04-2016

DOI: 10.1093/NAR/GKW256

Publication

Building Population-Specific Reference Genomes: A Case Study of Vietnamese Reference Genome

Publisher: IEEE

Date: 10-2015

DOI: 10.1109/KSE.2015.49

Publication

DecentTree: scalable Neighbour-Joining for the genomic era

Publisher: Cold Spring Harbor Laboratory

Date: 10-04-2023

DOI: 10.1101/2022.04.10.487712

Abstract: Neighbour-Joining is one of the most widely used distance-based phylogenetic inference methods. However, current implementations do not scale well for datasets with more than 10,000 sequences. Given the increasing pace of generating new sequence data, particularly in outbreaks of emerging diseases, and the already enormous existing databases of sequence data for which NJ is a useful approach, new implementations of existing methods are warranted. Here we present DecentTree, which provides highly optimised and parallel implementations of Neighbour-Joining and several of its variants. DecentTree is designed as a stand-alone application and a header-only library easily integrated with other phylogenetic software (e.g. it is integral in the popular IQ-TREE software). We show that DecentTree shows similar or improved performance over existing software (BIONJ, Quicktree, FastME, and RapidNJ), especially for handling very large alignments. For example, DecentTree is up to 6-fold faster than the fastest existing Neighbour-Joining software (e.g. RapidNJ) when generating a tree of 64,000 SARS-CoV-2 genomes. DecentTree is open source and freely available at qtree/decenttree. Contact: Minh Bui (m.bui@anu.edu.au), Robert Lanfear (rob.lanfear@anu.edu.au). Supplementary data are available at Bioinformatics online.

Publication

Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution

Publisher: Springer Science and Business Media LLC

Date: 07-08-2020

DOI: 10.1038/S41467-020-17408-W

Abstract: The recently discovered DPANN archaea are a potentially deep-branching, monophyletic radiation of organisms with small cells and genomes. However, the monophyly and early emergence of the various DPANN clades and their role in life’s evolution are debated. Here, we reconstructed and analysed genomes of an uncharacterized archaeal phylum ( Candidatus Undinarchaeota), revealing that its members have small genomes and, while potentially being able to conserve energy through fermentation, likely depend on partner organisms for the acquisition of certain metabolites. Our phylogenomic analyses robustly place Undinarchaeota as an independent lineage between two highly supported ‘DPANN’ clans. Further, our analyses suggest that DPANN have exchanged core genes with their hosts, adding to the difficulty of placing DPANN in the tree of life. This pattern can be sufficiently dominant to allow identifying known symbiont-host clades based on routes of gene transfer. Together, our work provides insights into the origins and evolution of DPANN and their hosts.

Publication

Budgeted Phylogenetic Diversity on Circular Split Systems

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 2009

DOI: 10.1109/TCBB.2008.54

Publication

Polymorphism-Aware Species Trees with Advanced Mutation Models, Bootstrap, and Rate Heterogeneity

Publisher: Oxford University Press (OUP)

Date: 02-03-2019

DOI: 10.1093/MOLBEV/MSZ043

Publication

Discovery of the first light-dependent protochlorophyllide oxidoreductase in anoxygenic phototrophic bacteria

Publisher: Wiley

Date: 05-08-2014

DOI: 10.1111/MMI.12719

Abstract: In all photosynthetic organisms, chlorophylls function as light-absorbing photopigments allowing the efficient harvesting of light energy. Chlorophyll biosynthesis recurs in similar ways in anoxygenic phototrophic proteobacteria as well as oxygenic phototrophic cyanobacteria and plants. Here, the biocatalytic conversion of protochlorophyllide to chlorophyllide is catalysed by evolutionary and structurally distinct protochlorophyllide reductases (PORs) in anoxygenic and oxygenic phototrophs. It is commonly assumed that anoxygenic phototrophs only contain oxygen-sensitive dark-operative PORs (DPORs), which catalyse protochlorophyllide reduction independent of the presence of light. In contrast, oxygenic phototrophs additionally (or exclusively) possess oxygen-insensitive but light-dependent PORs (LPORs). Based on this observation it was suggested that light-dependent protochlorophyllide reduction first emerged as a consequence of increased atmospheric oxygen levels caused by oxygenic photosynthesis in cyanobacteria. Here, we provide experimental evidence for the presence of an LPOR in the anoxygenic phototrophic α-proteobacterium Dinoroseobacter shibae DFL12(T). In vitro and in vivo functional assays unequivocally prove light-dependent protochlorophyllide reduction by this enzyme and reveal that LPORs are not restricted to cyanobacteria and plants. Sequence-based phylogenetic analyses reconcile our findings with current hypotheses about the evolution of LPORs by suggesting that the light-dependent enzyme of D. shibae DFL12(T) might have been obtained from cyanobacteria by horizontal gene transfer.

Publication

Consequences of Common Topological Rearrangements for Partition Trees in Phylogenomic Inference

Publisher: Mary Ann Liebert Inc

Date: 12-2015

DOI: 10.1089/CMB.2015.0146

Publication

IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era

Publisher: Cold Spring Harbor Laboratory

Date: 21-11-2019

DOI: 10.1101/849372

Abstract: IQ-TREE ( www.iqtree.org ) is a user-friendly and widely used software package for phylogenetic inference using maximum likelihood. Since the release of version 1 in 2014, we have continuously expanded IQ-TREE to integrate a plethora of new models of sequence evolution and efficient computational approaches of phylogenetic inference to deal with genomic data. Here, we describe notable features of IQ-TREE version 2 and highlight the key advantages over other software.

Publication

Maximum likelihood pandemic-scale phylogenetics

Publisher: Cold Spring Harbor Laboratory

Date: 22-03-2022

DOI: 10.1101/2022.03.22.485312

Abstract: Phylogenetics plays a crucial role in the interpretation of genomic data 1 . Phylogenetic analyses of SARS-CoV-2 genomes have allowed the detailed study of the virus’s origins 2 , of its international 3,4 and local 4–9 spread, and of the emergence 10 and reproductive success 11 of new variants, among many applications. These analyses have been enabled by the unparalleled volumes of genome sequence data generated and employed to study and help contain the pandemic 12 . However, preferred model-based phylogenetic approaches including maximum likelihood and Bayesian methods, mostly based on Felsenstein’s ‘pruning’ algorithm 13,14 , cannot scale to the size of the datasets from the current pandemic 4,15 , hampering our understanding of the virus’s evolution and transmission 16 . We present new approaches, based on reworking Felsenstein’s algorithm, for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. We exploit near-certainty regarding ancestral genomes, and the similarities between closely related and densely sampled genomes, to greatly reduce computational demands for memory and time. Combined with new methods for searching amongst candidate evolutionary trees, this results in our MAPLE (‘MAximum Parsimonious Likelihood Estimation’) software giving better results than popular approaches such as FastTree 2 17 , IQ-TREE 2 18 , RAxML-NG 19 and UShER 15 . Our approach therefore allows complex and accurate probabilistic phylogenetic analyses of millions of microbial genomes, extending the reach of genomic epidemiology. Future epidemiological datasets are likely to be even larger than those currently associated with COVID-19, and other disciplines such as metagenomics and biodiversity science are also generating huge numbers of genome sequences 20–22 . Our methods will permit continued use of preferred likelihood-based phylogenetic analyses.

Publication

Modeling Site Heterogeneity with Posterior Mean Site Frequency Profiles Accelerates Accurate Phylogenomic Estimation

Publisher: Oxford University Press (OUP)

Date: 07-08-2017

DOI: 10.1093/SYSBIO/SYX068

Abstract: Proteins have distinct structural and functional constraints at different sites that lead to site-specific preferences for particular amino acid residues as the sequences evolve. Heterogeneity in the amino acid substitution process between sites is not modeled by commonly used empirical amino acid exchange matrices. Such model misspecification can lead to artefacts in phylogenetic estimation such as long-branch attraction. Although sophisticated site-heterogeneous mixture models have been developed to address this problem in both Bayesian and maximum likelihood (ML) frameworks, their formidable computational time and memory usage severely limits their use in large phylogenomic analyses. Here we propose a posterior mean site frequency (PMSF) method as a rapid and efficient approximation to full empirical profile mixture models for ML analysis. The PMSF approach assigns a conditional mean amino acid frequency profile to each site calculated based on a mixture model fitted to the data using a preliminary guide tree. These PMSF profiles can then be used for in-depth tree-searching in place of the full mixture model. Compared with widely used empirical mixture models with $k$ classes, our implementation of PMSF in IQ-TREE (www.iqtree.org) speeds up the computation by approximately $k$/1.5-fold and requires a small fraction of the RAM. Furthermore, this speedup allows, for the first time, full nonparametric bootstrap analyses to be conducted under complex site-heterogeneous models on large concatenated data matrices. Our simulations and empirical data analyses demonstrate that PMSF can effectively ameliorate long-branch attraction artefacts. In some empirical and simulation settings PMSF provided more accurate estimates of phylogenies than the mixture models from which they derive.

Publication

New Methods to Calculate Concordance Factors for Phylogenomic Datasets

Publisher: Oxford University Press (OUP)

Date: 04-05-2020

DOI: 10.1093/MOLBEV/MSAA106

Abstract: We implement two measures for quantifying genealogical concordance in phylogenomic data sets: the gene concordance factor (gCF) and the novel site concordance factor (sCF). For every branch of a reference tree, gCF is defined as the percentage of “decisive” gene trees containing that branch. This measure is already in wide usage, but here we introduce a package that calculates it while accounting for variable taxon coverage among gene trees. sCF is a new measure defined as the percentage of decisive sites supporting a branch in the reference tree. gCF and sCF complement classical measures of branch support in phylogenetics by providing a full description of underlying disagreement among loci and sites. An easy to use implementation and tutorial is freely available in the IQ-TREE software package (oc/Concordance-Factor, last accessed May 13, 2020).
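The gCF definition above is a percentage over decisive gene trees, which can be sketched in a few lines. The abstract does not spell out what makes a gene tree "decisive". The sketch assumes a gene tree is decisive for a branch when it contains at least one taxon from each of the four groups surrounding that branch. The data layout and function name are hypothetical and do not reflect the IQ-TREE implementation.

```python
import math
from typing import FrozenSet, List, Set, Tuple

Taxa = FrozenSet[str]
# Hypothetical gene-tree representation: its taxon set, plus its internal
# branches stored as the set of taxa found on one side of each branch.
GeneTree = Tuple[Taxa, Set[Taxa]]


def gene_concordance_factor(clades: Tuple[Taxa, Taxa, Taxa, Taxa],
                            gene_trees: List[GeneTree]) -> float:
    """Percentage of decisive gene trees that contain the branch AB|CD."""
    a, b, c, d = clades
    decisive = 0
    concordant = 0
    for taxa, splits in gene_trees:
        # Assumed decisiveness rule: all four surrounding groups are sampled.
        if not all(group & taxa for group in clades):
            continue
        decisive += 1
        # Restrict the reference branch to the taxa this gene tree contains.
        left = (a | b) & taxa
        right = (c | d) & taxa
        if left in splits or right in splits:
            concordant += 1
    return 100.0 * concordant / decisive if decisive else math.nan


if __name__ == "__main__":
    groups = tuple(frozenset({t}) for t in "abcd")
    trees = [
        (frozenset("abcd"), {frozenset("ab")}),  # supports ab|cd
        (frozenset("abcd"), {frozenset("ac")}),  # conflicts with ab|cd
        (frozenset("abc"), set()),               # lacks group d: not decisive
    ]
    print(gene_concordance_factor(groups, trees))
```

Skipping non-decisive trees, rather than counting them as conflicts, is how this sketch accounts for variable taxon coverage among gene trees.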

Publication

2022 Zuckerkandl Prize

Publisher: Springer Science and Business Media LLC

Date: 06-01-2023

DOI: 10.1007/S00239-022-10089-7

Publication

SDA*: A Simple and Unifying Solution to Recent Bioinformatic Challenges for Conservation Genetics

Publisher: IEEE

Date: 10-2010

DOI: 10.1109/KSE.2010.24

Publication

Quantitative detection and typing of hepatitis D virus in human serum by real-time polymerase chain reaction and melting curve analysis

Publisher: Elsevier BV

Date: 06-2010

DOI: 10.1016/J.DIAGMICROBIO.2010.02.003

Abstract: Hepatitis D virus (HDV) infection is an important etiologic agent of fulminant hepatitis and may aggravate the clinical course of chronic hepatitis B infection resulting in cirrhosis and liver failure. This report describes the establishment of a real-time reverse transcriptase polymerase chain reaction method that allows the quantitative detection of HDV-1 and HDV-3 with a sensitivity in a linear range of 2 × 10^3 to 10^8 copies/mL. Additionally, the new assay provides the opportunity to distinguish HDV-1 from HDV-3 by a subsequent melting curve analysis, an important option because these HDV types are highly associated with severe clinical outcome. The results of the melting curve analysis of 42 HDV sequences obtained in this study and the phylogenetic analysis based on 139 full-length sequences from GenBank were consistent and showed that all sequences described here cluster within the HDV-1 clade. Therefore, this assay is useful for monitoring of antiviral treatment and molecular epidemiologic studies of HDV distribution.

Publication

Untangling the early diversification of eukaryotes: a phylogenomic study of the evolutionary origins of Centrohelida, Haptophyta and Cryptista

Publisher: The Royal Society

Date: 27-01-2016

DOI: 10.1098/RSPB.2015.2802

Abstract: Assembling the global eukaryotic tree of life has long been a major effort of biology. In recent years, pushed by the new availability of genome-scale data for microbial eukaryotes, it has become possible to revisit many evolutionary enigmas. However, some of the most ancient nodes, which are essential for inferring a stable tree, have remained highly controversial. Among other reasons, the lack of adequate genomic datasets for key taxa has prevented the robust reconstruction of early diversification events. In this context, the centrohelid heliozoans are particularly relevant for reconstructing the tree of eukaryotes because they represent one of the last substantial groups that was missing large and diverse genomic data. Here, we filled this gap by sequencing high-quality transcriptomes for four centrohelid lineages, each corresponding to a different family. Combining these new data with a broad eukaryotic sampling, we produced a gene-rich taxon-rich phylogenomic dataset that enabled us to refine the structure of the tree. Specifically, we show that (i) centrohelids relate to haptophytes, confirming Haptista; (ii) Haptista relates to SAR; (iii) Cryptista share strong affinity with Archaeplastida; and (iv) Haptista + SAR is sister to Cryptista + Archaeplastida. The implications of this topology are discussed in the broader context of plastid evolution.

Publication

Taxon Selection under Split Diversity

Publisher: Oxford University Press (OUP)

Date: 21-09-2009

DOI: 10.1093/SYSBIO/SYP058

Abstract: The "phylogenetic diversity" (PD) measure of biodiversity is evaluated using a phylogenetic tree, usually inferred from morphological or molecular data. Consequently, it is vulnerable to errors in that tree, including those resulting from sampling error, model misspecification, or conflicting signals. To improve the robustness of PD, we can evaluate the measure using either a collection (or distribution) of trees or a phylogenetic network. Recently, it has been shown that these 2 approaches are equivalent but that the problem of maximizing PD in the general concept is NP-hard. In this study, we provide an efficient dynamic programming algorithm for maximizing PD when splits in the trees or network form a circular split system. We illustrate our method using a case study of game birds ("Galliformes") and discuss the different choices of taxa based on our approach and PD.

Publication

Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression

Publisher: Public Library of Science (PLoS)

Date: 03-12-2020

DOI: 10.1371/JOURNAL.PBIO.3000954

Abstract: Our understanding of the evolutionary history of primates is undergoing continual revision due to ongoing genome sequencing efforts. Bolstered by growing fossil evidence, these data have led to increased acceptance of once controversial hypotheses regarding phylogenetic relationships, hybridization and introgression, and the biogeographical history of primate groups. Among these findings is a pattern of recent introgression between species within all major primate groups examined to date, though little is known about introgression deeper in time. To address this and other phylogenetic questions, here, we present new reference genome assemblies for 3 Old World monkey (OWM) species: Colobus angolensis ssp. palliatus (the black and white colobus), Macaca nemestrina (southern pig-tailed macaque), and Mandrillus leucophaeus (the drill). We combine these data with 23 additional primate genomes to estimate both the species tree and individual gene trees using thousands of loci. While our species tree is largely consistent with previous phylogenetic hypotheses, the gene trees reveal high levels of genealogical discordance associated with multiple primate radiations. We use strongly asymmetric patterns of gene tree discordance around specific branches to identify multiple instances of introgression between ancestral primate lineages. In addition, we exploit recent fossil evidence to perform fossil-calibrated molecular dating analyses across the tree. Taken together, our genome-wide data help to resolve multiple contentious sets of relationships among primates, while also providing insight into the biological processes and technical artifacts that led to the disagreements in the first place.

Publication

Split diversity in constrained conservation prioritization using integer linear programming

Publisher: Wiley

Date: 06-12-2014

DOI: 10.1111/2041-210X.12299

Publication

A Comprehensive Phylogenetic Analysis of the Serpin Superfamily

Publisher: Oxford University Press (OUP)

Date: 21-03-2021

DOI: 10.1093/MOLBEV/MSAB081

Abstract: Serine protease inhibitors (serpins) are found in all kingdoms of life and play essential roles in multiple physiological processes. Owing to the diversity of the superfamily, phylogenetic analysis is challenging and prokaryotic serpins have been speculated to have been acquired from Metazoa through horizontal gene transfer due to their unexpectedly high homology. Here, we have leveraged a structural alignment of diverse serpins to generate a comprehensive 6,000-sequence phylogeny that encompasses serpins from all kingdoms of life. We show that in addition to a central “hub” of highly conserved serpins, there has been extensive diversification of the superfamily into many novel functional clades. Our analysis indicates that the hub proteins are ancient and are similar because of convergent evolution, rather than the alternative hypothesis of horizontal gene transfer. This work clarifies longstanding questions in the evolution of serpins and provides new directions for research in the field of serpin biology.

Publication

AliSim: A Fast and Versatile Phylogenetic Sequence Simulator for the Genomic Era

Publisher: Oxford University Press (OUP)

Date: 05-2022

DOI: 10.1093/MOLBEV/MSAC092

Abstract: Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programmes exist, but the most feature-rich programmes tend to be rather slow, and the fastest programmes tend to be feature-poor. Here, we introduce AliSim, a new tool that can efficiently simulate biologically realistic alignments under a large range of complex evolutionary models. To achieve high performance across a wide range of simulation conditions, AliSim implements an adaptive approach that combines the commonly used rate matrix and probability matrix approaches. AliSim takes 1.4 h and 1.3 GB RAM to simulate alignments with one million sequences or sites, whereas popular software Seq-Gen, Dawg, and INDELible require 2–5 h and 50–500 GB of RAM. We provide AliSim as an extension of the IQ-TREE software version 2.2, freely available at www.iqtree.org, and a comprehensive user tutorial at oc/AliSim.

Publication

Unifying the global phylogeny and environmental distribution of ammonia-oxidising archaea based on amoA genes

Publisher: Springer Science and Business Media LLC

Date: 17-04-2018

DOI: 10.1038/S41467-018-03861-1

Abstract: Ammonia-oxidising archaea (AOA) are ubiquitous and abundant in nature and play a major role in nitrogen cycling. AOA have been studied intensively based on the amoA gene (encoding ammonia monooxygenase subunit A), making it the most sequenced functional marker gene. Here, based on extensive phylogenetic and meta-data analyses of 33,378 curated archaeal amoA sequences, we define a highly resolved taxonomy and uncover global environmental patterns that challenge many earlier generalisations. Particularly, we show: (i) the global frequency of AOA is extremely uneven, with few clades dominating AOA diversity in most ecosystems; (ii) characterised AOA do not represent most predominant clades in nature, including soils and oceans; (iii) the functional role of the most prevalent environmental AOA clade remains unclear; and (iv) AOA harbour molecular signatures that possibly reflect phenotypic traits. Our work synthesises information from a decade of research and provides the first integrative framework to study AOA in a global context.

Publication

Complex Models of Sequence Evolution Require Accurate Estimators as Exemplified with the Invariable Site Plus Gamma Model

Publisher: Oxford University Press (OUP)

Date: 27-11-2017

DOI: 10.1093/SYSBIO/SYX092

Publication

Newly Emerged Serotype 1c of Shigella flexneri: Multiple Origins and Changing Drug Resistance Landscape

Publisher: MDPI AG

Date: 03-09-2020

DOI: 10.3390/GENES11091042

Abstract: Bacillary dysentery caused by Shigella flexneri is a major cause of under-five mortality in developing countries, where a novel S. flexneri serotype 1c has become very common since the 1980s. However, the origin and diversification of serotype 1c remain poorly understood. To understand the evolution of serotype 1c and their antimicrobial resistance, we sequenced and analyzed the whole-genome of 85 clinical isolates from the United Kingdom, Egypt, Bangladesh, Vietnam, and Japan belonging to serotype 1c and related serotypes of 1a, 1b and Y/Yv. We identified up to three distinct O-antigen modifying genes in S. flexneri 1c strains, which were acquired from three different bacteriophages. Our analysis shows that S. flexneri 1c strains have originated from serotype 1a and serotype 1b strains after the acquisition of bacteriophage-encoding gtrIc operon. The maximum-likelihood phylogenetic analysis using core genes suggests two distinct S. flexneri 1c lineages, one specific to Bangladesh, which originated from ancestral serotype 1a strains and the other from the United Kingdom, Egypt, and Vietnam originated from ancestral serotype 1b strains. We also identified 63 isolates containing multiple drug-resistant genes in them conferring resistance against streptomycin, sulfonamide, quinolone, trimethoprim, tetracycline, chloramphenicol, and beta-lactamase. Furthermore, antibiotic susceptibility assays showed 83 (97.6%) isolates as either complete or intermediate resistance to the WHO-recommended first- and second-line drugs. This changing drug resistance pattern demonstrates the urgent need for drug resistance surveillance and renewed treatment guidelines.

Publication

Assessing Confidence in Root Placement on Phylogenies: An Empirical Study Using Non-Reversible Models for Mammals

Publisher: Cold Spring Harbor Laboratory

Date: 02-08-2020

DOI: 10.1101/2020.07.31.230144

Abstract: Using time-reversible Markov models is a very common practice in phylogenetic analysis, because although we expect many of their assumptions to be violated by empirical data, they provide high computational efficiency. However, these models lack the ability to infer the root placement of the estimated phylogeny. In order to compensate for the inability of these models to root the tree, many researchers use external information such as using outgroup taxa or additional assumptions such as molecular-clocks. In this study, we investigate the utility of non-reversible models to root empirical phylogenies and introduce a new bootstrap measure, the rootstrap, which provides information on the statistical support for any given root position. Rootstrap support is implemented in IQ-TREE 2 and a tutorial is available at the iqtree webpage oc/Rootstrap. In addition, a Python script is available at uhanaser/Rootstrap. [phylogenetic inference, root estimation, bootstrap, non-reversible models]
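A bootstrap-style support value for a root position can be sketched as follows, assuming the support for a root placement is the percentage of rooted bootstrap trees whose root induces the same bipartition of the taxa. The function names and the input representation (one side of the root split per bootstrap tree) are hypothetical and are not taken from IQ-TREE or the script mentioned in the abstract.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable


def canonical_side(side: Iterable[str], taxa: FrozenSet[str]) -> FrozenSet[str]:
    """Represent a root split by the side containing the alphabetically first taxon."""
    side = frozenset(side)
    return side if min(taxa) in side else taxa - side


def rootstrap_support(root_sides: Iterable[Iterable[str]],
                      taxa: Iterable[str]) -> Dict[FrozenSet[str], float]:
    """Percentage of bootstrap trees rooted on each observed branch.

    root_sides holds, for every rooted bootstrap tree, the taxa on one side
    of its root (an assumed input format).
    """
    all_taxa = frozenset(taxa)
    canon = [canonical_side(side, all_taxa) for side in root_sides]
    if not canon:
        raise ValueError("at least one bootstrap tree is required")
    counts = Counter(canon)
    return {side: 100.0 * n / len(canon) for side, n in counts.items()}
```

Normalising each split to a canonical side means that a root described as {C, D} and one described as {A, B} on four taxa are counted as the same placement.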

Publication

ACOPHY: A Simple and General Ant Colony Optimization Approach for Phylogenetic Tree Reconstruction

Publisher: Springer Berlin Heidelberg

Date: 2010

DOI: 10.1007/978-3-642-15461-4_32

Publication

Updated site concordance factors minimize effects of homoplasy and taxon sampling

Publisher: Oxford University Press (OUP)

Date: 16-11-2022

DOI: 10.1093/BIOINFORMATICS/BTAC741

Abstract: Site concordance factors (sCFs) have become a widely used way to summarize discordance in phylogenomic datasets. However, the original version of sCFs was calculated by sampling a quartet of tip taxa and then applying parsimony-based criteria for discordance. This approach has the potential to be strongly affected by multiple hits at a site (homoplasy), especially when substitution rates are high or taxa are not closely related. Here, we introduce a new method for calculating sCFs. The updated version uses likelihood to generate probability distributions of ancestral states at internal nodes of the phylogeny. By sampling from the states at internal nodes adjacent to a given branch, this approach substantially reduces, but does not abolish, the effects of homoplasy and taxon sampling. Updated sCFs are implemented in IQ-TREE 2.2.2. The software is freely available at qtree/iqtree2/releases. Supplementary information is available at Bioinformatics online.

Publication

MPBoot: fast phylogenetic maximum parsimony tree inference and bootstrap approximation

Publisher: Springer Science and Business Media LLC

Date: 02-02-2018

DOI: 10.1186/S12862-018-1131-3

Publication

AliSim: A Fast and Versatile Phylogenetic Sequence Simulator For the Genomic Era

Publisher: Cold Spring Harbor Laboratory

Date: 17-12-2021

DOI: 10.1101/2021.12.16.472905

Abstract: Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programs exist, but the most feature-rich programs tend to be rather slow, and the fastest programs tend to be feature-poor. Here, we introduce AliSim, a new tool that can efficiently simulate biologically realistic alignments under a large range of complex evolutionary models. To achieve high performance across a wide range of simulation conditions, AliSim implements an adaptive approach that combines the commonly used rate matrix and probability matrix approaches. AliSim takes 1.3 hours and 1.3 GB RAM to simulate alignments with one million sequences or sites, while popular software Seq-Gen, Dawg, and INDELible require two to five hours and 50 to 500 GB of RAM. We provide AliSim as an extension of the IQ-TREE software version 2.2, freely available at www.iqtree.org, and a comprehensive user tutorial at oc/AliSim.

Publication

nQMaker: Estimating Time Nonreversible Amino Acid Substitution Models

Publisher: Oxford University Press (OUP)

Date: 09-02-2022

DOI: 10.1093/SYSBIO/SYAC007

Abstract: Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All commonly used amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this article, we introduce a maximum likelihood approach, nQMaker, an extension of the recently published QMaker method, that allows the estimation of time nonreversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the nonreversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of data sets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the data set. Notably, for the recently published plant and bird trees, these nonreversible models correctly recovered the commonly estimated root placements with very high statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software (www.iqtree.org), allowing users to estimate nonreversible models and rooted phylogenies from their own protein data sets. The data sets and scripts used in this article are available at 0.5061/dryad.3tx95x6hx. [amino acid sequence analyses; amino acid substitution models; maximum likelihood model estimation; nonreversible models; phylogenetic inference; reversible models.]

Publication

ModelFinder: fast model selection for accurate phylogenetic estimates

Publisher: Springer Science and Business Media LLC

Date: 08-05-2017

DOI: 10.1038/NMETH.4285

Publication

A comprehensive phylogenetic analysis of the serpin superfamily

Publisher: Cold Spring Harbor Laboratory

Date: 09-09-2020

DOI: 10.1101/2020.09.09.289108

Abstract: Serine protease inhibitors (serpins) are found in all kingdoms of life and play essential roles in multiple physiological processes. Owing to the diversity of the superfamily, phylogenetic analysis is challenging and prokaryotic serpins have been speculated to have been acquired from Metazoa through horizontal gene transfer (HGT) due to their unexpectedly high homology. Here we have leveraged a structural alignment of diverse serpins to generate a comprehensive 6000-sequence phylogeny that encompasses serpins from all kingdoms of life. We show that in addition to a central “hub” of highly conserved serpins, there has been extensive diversification of the superfamily into many novel functional clades. Our analysis indicates that the hub proteins are ancient and are similar because of convergent evolution, rather than the alternative hypothesis of HGT. This work clarifies longstanding questions in the evolution of serpins and provides new directions for research in the field of serpin biology.

Publication

Terrace Aware Data Structure for Phylogenomic Inference from Supermatrices

Publisher: Oxford University Press (OUP)

Date: 26-04-2016

DOI: 10.1093/SYSBIO/SYW037

Publication

Reversible polymorphism-aware phylogenetic models and their application to tree inference

Publisher: Elsevier BV

Date: 10-2016

DOI: 10.1016/J.JTBI.2016.07.042

Abstract: We present a reversible Polymorphism-Aware Phylogenetic Model (revPoMo) for species tree estimation from genome-wide data. revPoMo enables the reconstruction of large scale species trees for many within-species samples. It expands the alphabet of DNA substitution models to include polymorphic states, thereby naturally accounting for incomplete lineage sorting. We implemented revPoMo in the maximum likelihood software IQ-TREE. A simulation study and an application to great apes data show that the runtimes of our approach and standard substitution models are comparable but that revPoMo has much better accuracy in estimating trees, divergence times and mutation rates. The advantage of revPoMo is that an increase of sample size per species improves estimations but does not increase runtime. Therefore, revPoMo is a valuable tool with several applications, from speciation dating to species tree reconstruction.

Publication

QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution

Publisher: Oxford University Press (OUP)

Date: 22-02-2021

DOI: 10.1093/SYSBIO/SYAB010

Abstract: Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this article, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein data set consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies. [Amino acid replacement matrices; amino acid substitution models; maximum likelihood estimation; phylogenetic inferences.]

Publication

Want to track pandemic variants faster? Fix the bioinformatics bottleneck

Publisher: Springer Science and Business Media LLC

Date: 03-2021

DOI: 10.1038/D41586-021-00525-X

Publication

Split Diversity: Measuring and Optimizing Biodiversity Using Phylogenetic Split Networks

Publisher: Springer International Publishing

Date: 2016

DOI: 10.1007/978-3-319-22461-9_9

Publication

UFBoot2: Improving the Ultrafast Bootstrap Approximation

Publisher: Oxford University Press (OUP)

Date: 25-10-2017

DOI: 10.1093/MOLBEV/MSX281

Publication

Distribution and Phylogeny of Light-Oxygen-Voltage-Blue-Light-Signaling Proteins in the Three Kingdoms of Life

Publisher: American Society for Microbiology

Date: 12-2009

DOI: 10.1128/JB.00923-09

Abstract: Plants and fungi respond to environmental light stimuli via the action of different photoreceptor modules. One such class, responding to the blue region of light, is constituted by photoreceptors containing so-called light-oxygen-voltage (LOV) domains as sensor modules. Four major LOV families are currently identified in eukaryotes: (i) the plant phototropins, regulating various physiological effects such as phototropism, chloroplast relocation, and stomatal opening; (ii) the aureochromes, mediating photomorphogenesis in photosynthetic stramenopile algae; (iii) the plant circadian photoreceptors of the zeitlupe (ZTL)/adagio (ADO)/flavin-binding Kelch repeat F-box protein 1 (FKF1) family; and (iv) the fungal circadian photoreceptors white-collar 1 (WC-1). Blue-light-sensitive LOV signaling modules are also widespread throughout the prokaryotic world, and physiological responses mediated by bacterial LOV photoreceptors were recently reported. Thus, the question arises as to the evolutionary relationship between the pro- and eukaryotic LOV photoreceptor systems. We used Bayesian and maximum-likelihood tree reconstruction methods to infer evolutionary scenarios that might have led to the widespread appearance of LOV domains among the pro- and eukaryotes. The phylogenetic study presented here suggests a bacterial origin for the LOV domains of the four major eukaryotic LOV photoreceptor families, whereas the LOV sensor domains were most likely recruited from the bacteria in the course of plastid and mitochondrial endosymbiosis.

Publication

Phylogenetic Diversity within Seconds

Publisher: Oxford University Press (OUP)

Date: 10-2006

DOI: 10.1080/10635150600981604

Abstract: We consider a (phylogenetic) tree with n labeled leaves, the taxa, and a length for each branch in the tree. For any subset of k taxa, the phylogenetic diversity is defined as the sum of the branch-lengths of the minimal subtree connecting the taxa in the subset. We introduce two time-efficient algorithms (greedy and pruning) to compute a subset of size k with maximal phylogenetic diversity in O(n log k) and O[n + (n-k) log (n-k)] time, respectively. The greedy algorithm is an efficient implementation of the so-called greedy strategy (Steel, 2005; Pardi and Goldman, 2005), whereas the pruning algorithm provides an alternative description of the same problem. Both algorithms compute within seconds a subtree with maximal phylogenetic diversity for trees with 100,000 taxa or more.
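The definition of phylogenetic diversity given above, and the greedy strategy it mentions, can be sketched in a few lines. This is a naive illustration, not the O(n log k) algorithm of the paper: the `parent` map representation (node to parent and branch length) and the function names are assumptions, and the greedy loop simply re-evaluates PD for every candidate taxon.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

# Hypothetical tree encoding: node -> (parent node, length of the branch to the parent).
ParentMap = Dict[str, Tuple[str, float]]


def pd(parent: ParentMap, selected: Iterable[str]) -> float:
    """Sum of branch lengths of the minimal subtree connecting the selected leaves."""
    chosen = set(selected)
    k = len(chosen)
    below: Dict[str, int] = {}  # number of selected leaves at or below each node
    for leaf in chosen:
        node = leaf
        while node in parent:  # walk up until the root, which has no parent entry
            below[node] = below.get(node, 0) + 1
            node = parent[node][0]
    # A branch belongs to the minimal subtree iff selected leaves lie on both of its sides.
    return sum(parent[node][1] for node, count in below.items() if 0 < count < k)


def greedy_pd(parent: ParentMap, leaves: Iterable[str], k: int) -> Set[str]:
    """Naive greedy selection: start from the best pair, then add the best taxon each round."""
    taxa = sorted(leaves)
    if k < 2 or k > len(taxa):
        raise ValueError("k must satisfy 2 <= k <= number of leaves")
    chosen = set(max(combinations(taxa, 2), key=lambda pair: pd(parent, pair)))
    while len(chosen) < k:
        candidates = [t for t in taxa if t not in chosen]
        chosen.add(max(candidates, key=lambda t: pd(parent, chosen | {t})))
    return chosen
```

For a tree with leaves A and B joined at an internal node X (branch lengths 2 and 3), X attached to the root by a branch of length 1, and leaf C attached to the root by a branch of length 5, the subset {B, C} has PD 3 + 1 + 5 = 9, which is the pair the greedy start selects.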

Publication

AliSim-HPC: parallel sequence simulator for phylogenetics

Publisher: Oxford University Press (OUP)

Date: 09-2023

DOI: 10.1093/BIOINFORMATICS/BTAD540

Publication

MAST: Phylogenetic Inference with Mixtures Across Sites and Trees

Publisher: Cold Spring Harbor Laboratory

Date: 08-10-2022

DOI: 10.1101/2022.10.06.511210

Abstract: Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting, introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce the mixture across sites and trees (MAST) model, which uses a mixture of bifurcating trees to represent multiple histories in a single concatenated alignment. The MAST model allows each tree to have its own topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights (i.e. frequencies) for a given set of tree topologies. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of incomplete lineage sorting in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of four Platyrrhine species for which standard concatenated maximum likelihood and gene tree approaches disagree, we find that MAST gives the highest weight to the tree favored by gene tree approaches. These results suggest that the MAST model is able to analyse a concatenated alignment using maximum likelihood, while avoiding some of the biases that come with assuming there is only a single tree.
The MAST model can therefore offer unique biological insights when applied to datasets with multiple evolutionary histories. We discuss how it can be extended in the future.
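The mixture idea behind the model can be sketched under the standard mixture-model assumption that the likelihood of each site is the weighted sum of its likelihoods under the individual trees. The function name and the input layout (precomputed per-site, per-tree likelihoods) are hypothetical; computing those likelihoods from an alignment is outside this sketch and is not how IQ-TREE exposes the model.

```python
import math
from typing import Sequence


def mixture_log_likelihood(site_likelihoods: Sequence[Sequence[float]],
                           weights: Sequence[float]) -> float:
    """Log-likelihood of an alignment under a mixture of trees.

    site_likelihoods[i][j] is the (assumed precomputed) likelihood of site i
    under tree j; weights[j] is the weight of tree j, and the weights sum to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("tree weights must sum to 1")
    total = 0.0
    for per_tree in site_likelihoods:
        if len(per_tree) != len(weights):
            raise ValueError("each site needs one likelihood per tree")
        # Site likelihood: weighted sum over trees, then accumulate on the log scale.
        total += math.log(sum(w * lik for w, lik in zip(weights, per_tree)))
    return total
```

With a single tree of weight 1 the expression reduces to the usual single-tree log-likelihood, which is what makes a comparison between single-tree and multi-tree fits possible.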

Publication

A novel Fibroblast Growth Factor Receptor family member promotes neuronal outgrowth and synaptic plasticity in Aplysia

Publisher: Springer Science and Business Media LLC

Date: 25-07-2014

DOI: 10.1007/S00726-014-1803-2

Publication

AliSim-HPC: parallel sequence simulator for phylogenetics

Publisher: Cold Spring Harbor Laboratory

Date: 18-01-2023

DOI: 10.1101/2023.01.15.524158

Abstract: Sequence simulation plays a vital role in phylogenetics with many applications, such as evaluating phylogenetic methods, testing hypotheses, and generating training data for machine-learning applications. We recently introduced a new simulator for multiple sequence alignments called AliSim, which outperformed existing tools. However, with the increasing demands of simulating large data sets, AliSim is still slow due to its sequential implementation; for example, to simulate millions of sequence alignments, AliSim took several days or weeks. Parallelization has been used for many phylogenetic inference methods but not yet for sequence simulation. This paper introduces AliSim-HPC, which, for the first time, employs high-performance computing for phylogenetic simulations. AliSim-HPC parallelizes the simulation process at both multi-core and multi-CPU levels using the OpenMP and MPI libraries, respectively. AliSim-HPC is highly efficient and scalable, which reduces the runtime to simulate 100 large alignments from one day to 9 minutes using 256 CPU cores from a cluster with 6 computing nodes, a 162-fold speedup. AliSim-HPC is open source and available as part of the new IQ-TREE version v2.2.2.2 at qtree/iqtree2/releases with a user manual at oc/AliSim. Contact: m.bui@anu.edu.au
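The reported figures can be related with a short calculation. The sketch below uses only the numbers quoted in the abstract (162-fold speedup, 256 cores, 9 minutes); the function names are illustrative, and parallel efficiency is the usual ratio of speedup to core count rather than a quantity reported by the authors.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear scaling achieved: speedup divided by core count."""
    return speedup / cores


def implied_serial_minutes(speedup: float, parallel_minutes: float) -> float:
    """Sequential runtime implied by a speedup and the parallel runtime."""
    return speedup * parallel_minutes


if __name__ == "__main__":
    print(parallel_efficiency(162, 256))   # about 0.63 of ideal scaling
    print(implied_serial_minutes(162, 9))  # 1458 minutes, i.e. roughly one day
```

The implied sequential runtime of 1458 minutes (about 24.3 hours) is consistent with the "one day" quoted in the abstract.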

Publication

GHOST: Recovering Historical Signal from Heterotachously-evolved Sequence Alignments

Publisher: Cold Spring Harbor Laboratory

Date: 10-08-2017

DOI: 10.1101/174789

Abstract: Molecular sequence data that have evolved under the influence of heterotachous evolutionary processes are known to mislead phylogenetic inference. We introduce the General Heterogeneous evolution On a Single Topology (GHOST) model of sequence evolution, implemented under a maximum-likelihood framework in the phylogenetic program IQ-TREE (www.iqtree.org). Simulations show that using the GHOST model, IQ-TREE can accurately recover the tree topology, branch lengths and substitution model parameters from heterotachously-evolved sequences. We develop a model selection algorithm based on simulation results, and investigate the performance of the GHOST model on empirical data by sampling phylogenomic alignments of varying lengths from a plastome alignment. We then carry out inference under the GHOST model on a phylogenomic dataset composed of 248 genes from 16 taxa, where we find the GHOST model concurs with the currently accepted view, placing turtles as a sister lineage of archosaurs, in contrast to results obtained using traditional variable rates-across-sites models. Finally, we apply the model to a dataset composed of a sodium channel gene of 11 fish taxa, finding that the GHOST model is able to infer a subtle component of the historical signal, linked to the previously established convergent evolution of the electric organ in two geographically distinct lineages of electric fish. We compare inference under the GHOST model to partitioning by codon position and show that, owing to the minimization of model constraints, the GHOST model is able to offer unique biological insights when applied to empirical data.

Bui Quang Minh

Researcher

Publications

The Influence of Model Violation on Phylogenetic Inference: A Simulation Study

Whole genome analysis of a Vietnamese trio

Corrigendum to: IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era

nQMaker: estimating time non-reversible amino acid substitution models

A test statistic to quantify treelikeness in phylogenetics

Primate phylogenomics uncovers multiple rapid radiations and ancient interspecific introgression

Combined transcriptome and proteome profiling reveals specific molecular brain signatures for sex, maturation and circalunar clock phase

A new phylogenetic tree sampling method for maximum parsimony bootstrapping and proof-of-concept implementation

Polymorphism-aware species trees with advanced mutation models, bootstrap and rate heterogeneity

IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era

DecentTree: Scalable Neighbour-Joining for the Genomic Era

Decisive Data Sets in Phylogenomics: Lessons from Studies on the Phylogenetic Relationships of Primarily Wingless Insects

HIV-1 Full-Genome Phylogenetics of Generalized Epidemics in Sub-Saharan Africa: Impact of Missing Nucleotide Characters in Next-Generation Sequences

Author response: Combined transcriptome and proteome profiling reveals specific molecular brain signatures for sex, maturation and circalunar clock phase

GHOST: Recovering historical signal from heterotachously-evolved sequence alignments

Undinarchaeota illuminate the evolution of DPANN archaea

Ultrafast Approximation for Phylogenetic Bootstrap

pIQPNNI: parallel reconstruction of large maximum likelihood phylogenies

Maximum likelihood pandemic-scale phylogenetics

The Phylogenetic Likelihood Library

Does This Schlank Make Me Look Fat?

W-IQ-TREE: a fast online phylogenetic tool for maximum likelihood analysis

Building Population-Specific Reference Genomes: A Case Study of Vietnamese Reference Genome

DecentTree: scalable Neighbour-Joining for the genomic era

Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution

Budgeted Phylogenetic Diversity on Circular Split Systems

Polymorphism-Aware Species Trees with Advanced Mutation Models, Bootstrap, and Rate Heterogeneity

Discovery of the first light-dependent protochlorophyllide oxidoreductase in anoxygenic phototrophic bacteria

Consequences of Common Topological Rearrangements for Partition Trees in Phylogenomic Inference

IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era

Modeling Site Heterogeneity with Posterior Mean Site Frequency Profiles Accelerates Accurate Phylogenomic Estimation

New Methods to Calculate Concordance Factors for Phylogenomic Datasets

2022 Zuckerkandl Prize

SDA*: A Simple and Unifying Solution to Recent Bioinformatic Challenges for Conservation Genetics

Quantitative detection and typing of hepatitis D virus in human serum by real-time polymerase chain reaction and melting curve analysis

Untangling the early diversification of eukaryotes: a phylogenomic study of the evolutionary origins of Centrohelida, Haptophyta and Cryptista

Taxon Selection under Split Diversity

Split diversity in constrained conservation prioritization using integer linear programming

A Comprehensive Phylogenetic Analysis of the Serpin Superfamily

AliSim: A Fast and Versatile Phylogenetic Sequence Simulator for the Genomic Era

Unifying the global phylogeny and environmental distribution of ammonia-oxidising archaea based on amoA genes

Complex Models of Sequence Evolution Require Accurate Estimators as Exemplified with the Invariable Site Plus Gamma Model

Newly Emerged Serotype 1c of Shigella flexneri: Multiple Origins and Changing Drug Resistance Landscape

Assessing Confidence in Root Placement on Phylogenies: An Empirical Study Using Non-Reversible Models for Mammals

ACOPHY: A Simple and General Ant Colony Optimization Approach for Phylogenetic Tree Reconstruction

Updated site concordance factors minimize effects of homoplasy and taxon sampling

MPBoot: fast phylogenetic maximum parsimony tree inference and bootstrap approximation

AliSim: A Fast and Versatile Phylogenetic Sequence Simulator For the Genomic Era

nQMaker: Estimating Time Nonreversible Amino Acid Substitution Models

ModelFinder: fast model selection for accurate phylogenetic estimates

A comprehensive phylogenetic analysis of the serpin superfamily

Terrace Aware Data Structure for Phylogenomic Inference from Supermatrices

Reversible polymorphism-aware phylogenetic models and their application to tree inference

QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution

Want to track pandemic variants faster? Fix the bioinformatics bottleneck

Split Diversity: Measuring and Optimizing Biodiversity Using Phylogenetic Split Networks

UFBoot2: Improving the Ultrafast Bootstrap Approximation

Distribution and Phylogeny of Light-Oxygen-Voltage-Blue-Light-Signaling Proteins in the Three Kingdoms of Life

Phylogenetic Diversity within Seconds

AliSim-HPC: parallel sequence simulator for phylogenetics

MAST: Phylogenetic Inference with Mixtures Across Sites and Trees

A novel Fibroblast Growth Factor Receptor family member promotes neuronal outgrowth and synaptic plasticity in Aplysia

GHOST: Recovering Historical Signal from Heterotachously-evolved Sequence Alignments

Related Organisations

Gregor Mendel Institute Of Molecular Plant Biology GmbH

University Of Vienna

University Of Freiburg

Max F. Perutz Laboratories

Australian National University