ORCID Profile
0000-0002-7752-1942
Current Organisations
University of Helsinki, University of Oslo, Wellcome Trust Sanger Institute
In Research Link Australia (RLA), "Research Topics" refer to ANZSRC FOR and SEO codes. These topics are either sourced from ANZSRC FOR and SEO codes listed in researchers' related grants or generated by a large language model (LLM) based on their publications.
Biostatistics | Genetics | Quantitative Genetics (incl. Disease and Trait Mapping Genetics)
Animal Production and Animal Primary Products not elsewhere classified | Plant Production and Plant Primary Products not elsewhere classified | Health not elsewhere classified
Publisher: Cold Spring Harbor Laboratory
Date: 15-02-2021
DOI: 10.1101/2021.02.15.431222
Abstract: The pan-genome is defined as the combined set of all genes in the gene pool of a species. Pan-genome analyses have been very useful in helping to understand different evolutionary dynamics of bacterial species: an open pan-genome often indicates a free-living lifestyle with metabolic versatility, while closed pan-genomes are linked to host-restricted, ecologically specialised bacteria. A detailed understanding of the species pan-genome has also been instrumental in tracking the phylodynamics of emerging drug resistance mechanisms and drug resistant pathogens. However, current approaches to analyse a species’ pan-genome do not take the species population structure into account, nor do they account for the uneven sampling of different lineages, as is commonplace due to over-sampling of clinically relevant representatives. Here we present the application of a population structure-aware approach for classifying genes in a pan-genome based on within-species distribution. We demonstrate our approach on a collection of 7,500 E. coli genomes, one of the most-studied bacterial species used as a model for an open pan-genome. We reveal clearly distinct groups of genes, clustered by different underlying evolutionary dynamics, and provide a more biologically informed and accurate description of the species’ pan-genome.
Publisher: Microbiology Society
Date: 02-2021
Abstract: Escherichia coli is a highly diverse organism that includes a range of commensal and pathogenic variants found across a range of niches and worldwide. In addition to causing severe intestinal and extraintestinal disease, E. coli is considered a priority pathogen due to high levels of observed drug resistance. The diversity in the E. coli population is driven by high genome plasticity and a very large gene pool. All these have made E. coli one of the most well-studied organisms, as well as a commonly used laboratory strain. Today, there are thousands of sequenced E. coli genomes stored in public databases. While data is widely available, accessing the information in order to perform analyses can still be a challenge. Collecting relevant available data requires accessing different sources, where data may be stored in a range of formats, and often requires further manipulation and processing to apply various analyses and extract useful information. In this study, we collated and intensely curated a collection of over 10 000 E. coli and Shigella genomes to provide a single, uniform, high-quality dataset. Shigella were included as they are considered specialized pathovars of E. coli. We provide these data in a number of easily accessible formats that can be used as the foundation for future studies addressing the biological differences between E. coli lineages and the distribution and flow of genes in the E. coli population at a high resolution. The analysis we present emphasizes our lack of understanding of the true diversity of the E. coli species, and the biased nature of our current understanding of the genetic diversity of such a key pathogen.
Publisher: Cold Spring Harbor Laboratory
Date: 06-08-2023
DOI: 10.1101/2023.08.04.551407
Abstract: Population genomics has revolutionised our ability to study bacterial evolution by enabling data-driven discovery of the genetic architecture of trait variation. Genome-wide association studies (GWAS) have more recently become accompanied by genome-wide epistasis and co-selection (GWES) analysis, which offers a phenotype-free approach to generating hypotheses about selective processes that simultaneously impact multiple loci across the genome. However, existing GWES methods only consider associations between distant pairs of loci within the genome due to the strong impact of linkage-disequilibrium (LD) over short distances. Based on the general functional organisation of genomes it is nevertheless expected that the majority of co-selection and epistasis will act within relatively short genomic proximity, on co-variation occurring within genes and their promoter regions, and within operons. Here we introduce LDWeaver, which enables an exhaustive GWES across both short- and long-range LD, to disentangle likely neutral co-variation from selection. We demonstrate the ability of LDWeaver to efficiently generate hypotheses about co-selection using large genomic surveys of multiple major human bacterial pathogen species and validate several findings using functional annotation and phenotypic measurements. Our approach will facilitate the study of bacterial evolution in the light of rapidly expanding population genomic data.
Publisher: Cold Spring Harbor Laboratory
Date: 26-10-2018
DOI: 10.1101/454355
Abstract: We present fastbaps, a fast solution to the genetic clustering problem. Fastbaps rapidly identifies an approximate fit to a Dirichlet Process Mixture model (DPM) for clustering multilocus genotype data. Our efficient model-based clustering approach is able to cluster datasets 10-100 times larger than the existing model-based methods, which we demonstrate by analysing an alignment of over 110,000 sequences of HIV-1 pol genes. We also provide a method for rapidly partitioning an existing hierarchy in order to maximise the DPM model marginal likelihood, allowing us to split phylogenetic trees into clades and subclades using a population genomic model. Extensive tests on simulated data as well as a diverse set of real bacterial and viral datasets show that fastbaps provides comparable or improved solutions to previous model-based methods, while generally being significantly faster. The method is made freely available under an open source MIT licence as an easy to use R package at tonkinhill/fastbaps.
Publisher: American Chemical Society (ACS)
Date: 16-02-2022
Abstract: Ion exchange membranes with strong ionic separation performance have strategic importance for resource recovery and water purification, but the current state-of-the-art membranes suffer from inadequate ion selective transport for the target ions. This work proposes a new class of zeolitic imidazolate framework (ZIF)-based anion exchange membranes (named as S@ZIF-AMX) with suppressed multivalent anion mobility and enhanced target ion transport via an ionic control strategy under alternating current driven assembly. In electrodialysis with an initial concentration of 50 mM of NaBr, NaCl, Na
Publisher: Microbiology Society
Date: 16-11-2021
Abstract: Genomic epidemiology is a tool for tracing transmission of pathogens based on whole-genome sequencing. We introduce the mGEMS pipeline for genomic epidemiology with plate sweeps representing mixed samples of a target pathogen, opening the possibility to sequence all colonies on selective plates with a single DNA extraction and sequencing step. The pipeline includes the novel mGEMS read binner for probabilistic assignments of sequencing reads, and the scalable pseudoaligner Themisto. We demonstrate the effectiveness of our approach using closely related samples in a nosocomial setting, obtaining results that are comparable to those based on single-colony picks. Our results lend firm support to more widespread consideration of genomic epidemiology with mixed infection samples.
Publisher: Microbiology Society
Date: 24-09-2021
Abstract: The pan-genome is defined as the combined set of all genes in the gene pool of a species. Pan-genome analyses have been very useful in helping to understand different evolutionary dynamics of bacterial species: an open pan-genome often indicates a free-living lifestyle with metabolic versatility, while closed pan-genomes are linked to host-restricted, ecologically specialized bacteria. A detailed understanding of the species pan-genome has also been instrumental in tracking the phylodynamics of emerging drug resistance mechanisms and drug-resistant pathogens. However, current approaches to analyse a species’ pan-genome do not take the species population structure into account, nor do they account for the uneven sampling of different lineages, as is commonplace due to over-sampling of clinically relevant representatives. Here we present the application of a population structure-aware approach for classifying genes in a pan-genome based on within-species distribution. We demonstrate our approach on a collection of 7500 Escherichia coli genomes, one of the most-studied bacterial species and used as a model for an open pan-genome. We reveal clearly distinct groups of genes, clustered by different underlying evolutionary dynamics, and provide a more biologically informed and accurate description of the species’ pan-genome.
Publisher: Springer Science and Business Media LLC
Date: 10-10-2022
DOI: 10.1038/S41564-022-01238-1
Abstract: Characterizing the genetic diversity of pathogens within the host promises to greatly improve surveillance and reconstruction of transmission chains. For bacteria, it also informs our understanding of inter-strain competition and how this shapes the distribution of resistant and sensitive bacteria. Here we study the genetic diversity of Streptococcus pneumoniae within 468 infants and 145 of their mothers by deep sequencing whole pneumococcal populations from 3,761 longitudinal nasopharyngeal samples. We demonstrate that deep sequencing has unsurpassed sensitivity for detecting multiple colonization, doubling the rate at which highly invasive serotype 1 bacteria were detected in carriage compared with gold-standard methods. The greater resolution identified an elevated rate of transmission from mothers to their children in the first year of the child’s life. Comprehensive treatment data demonstrated that infants were at an elevated risk of both the acquisition and persistent colonization of a multidrug-resistant bacterium following antimicrobial treatment. Some alleles were enriched after antimicrobial treatment, suggesting that they aided persistence, but generally purifying selection dominated within-host evolution. Rates of co-colonization imply that in the absence of treatment, susceptible lineages outcompeted resistant lineages within the host. These results demonstrate the many benefits of deep sequencing for the genomic surveillance of bacterial pathogens.
Publisher: Cold Spring Harbor Laboratory
Date: 04-04-2020
DOI: 10.1101/2020.04.03.021501
Abstract: Genomic epidemiology is a tool for tracing transmission of pathogens based on whole-genome sequencing. We introduce the mGEMS pipeline for genomic epidemiology with plate sweeps representing mixed samples of a target pathogen, skipping the colony pick step. The pipeline includes the novel mGEMS read binner for probabilistic assignments of sequencing reads, and the scalable pseudoaligner Themisto. We demonstrate the effectiveness of our approach using closely related samples in a nosocomial setting, obtaining results that are comparable to those based on colony picks. Our results lend firm support to more widespread consideration of genomic epidemiology with mixed infection samples.
Publisher: Cold Spring Harbor Laboratory
Date: 05-06-2018
DOI: 10.1101/332544
Abstract: Determining the composition of bacterial communities beyond the level of a genus or species is challenging because of the considerable overlap between genomes representing close relatives. Here, we present the mSWEEP method for identifying and estimating the relative abundances of bacterial lineages from plate sweeps of enrichment cultures. mSWEEP leverages biologically grouped sequence assembly databases, applying probabilistic modelling, and provides controls for false positive results. Using sequencing data from major pathogens, we demonstrate significant improvements in lineage quantification and detection accuracy. Our method facilitates investigating cultures comprising mixtures of bacteria, and opens up a new field of plate sweep metagenomics.
Publisher: Cold Spring Harbor Laboratory
Date: 19-12-2022
DOI: 10.1101/2022.12.16.520696
Abstract: Extrachromosomal elements of bacterial cells such as plasmids are notorious for their importance in evolution and adaptation to changing ecology. However, high-resolution population-wide analysis of plasmids has only become accessible recently with the advent of scalable long-read sequencing technology. Current typing methods for the classification of plasmids remain limited in their scope, which motivated us to develop a computationally efficient approach to simultaneously recognize novel types and classify plasmids into previously identified groups. Our method can easily handle thousands of input sequences which are compressed using a unitig representation in a de Bruijn graph. We provide an intuitive visualization, classification and clustering scheme that users can explore interactively. This provides a framework that can be easily distributed and replicated, enabling a consistent labelling of plasmids across past, present, and future sequence collections. We illustrate the attractive features of our approach by the analysis of population-wide plasmid data from the opportunistic pathogen Escherichia coli and the distribution of the colistin resistance gene mcr-1.1 in the plasmid population.
Publisher: F1000 Research Ltd
Date: 30-01-2020
DOI: 10.12688/WELLCOMEOPENRES.15639.1
Abstract: Determining the composition of bacterial communities beyond the level of a genus or species is challenging because of the considerable overlap between genomes representing close relatives. Here, we present the mSWEEP pipeline for identifying and estimating the relative sequence abundances of bacterial lineages from plate sweeps of enrichment cultures. mSWEEP leverages biologically grouped sequence assembly databases, applying probabilistic modelling, and provides controls for false positive results. Using sequencing data from major pathogens, we demonstrate significant improvements in lineage quantification and detection accuracy. Our pipeline facilitates investigating cultures comprising mixtures of bacteria, and opens up a new field of plate sweep metagenomics.
Publisher: F1000 Research Ltd
Date: 08-10-2021
DOI: 10.12688/WELLCOMEOPENRES.15639.2
Abstract: Determining the composition of bacterial communities beyond the level of a genus or species is challenging because of the considerable overlap between genomes representing close relatives. Here, we present the mSWEEP pipeline for identifying and estimating the relative sequence abundances of bacterial lineages from plate sweeps of enrichment cultures. mSWEEP leverages biologically grouped sequence assembly databases, applying probabilistic modelling, and provides controls for false positive results. Using sequencing data from major pathogens, we demonstrate significant improvements in lineage quantification and detection accuracy. Our pipeline facilitates investigating cultures comprising mixtures of bacteria, and opens up a new field of plate sweep metagenomics.
Publisher: Cold Spring Harbor Laboratory
Date: 21-02-2022
DOI: 10.1101/2022.02.20.480002
Abstract: Characterising the genetic diversity of pathogens within the host promises to greatly improve surveillance and reconstruction of transmission chains. For bacteria, it also informs our understanding of inter-strain competition, and how this shapes the distribution of resistant and sensitive bacteria. Here we study the genetic diversity of Streptococcus pneumoniae within individual infants and their mothers by deep sequencing whole pneumococcal populations from longitudinal nasopharyngeal samples. We demonstrate deep sequencing has unsurpassed sensitivity for detecting multiple colonisation, doubling the rate at which highly invasive serotype 1 bacteria were detected in carriage compared to gold-standard methods. The greater resolution identified an elevated rate of transmission from mothers to their children in the first year of the child’s life. Comprehensive treatment data demonstrated infants were at an elevated risk of both the acquisition, and persistent colonisation, of a multidrug resistant bacterium following antimicrobial treatment. Some alleles were enriched after antimicrobial treatment, suggesting they aided persistence, but generally purifying selection dominated within-host evolution. Rates of co-colonisation imply that in the absence of treatment, susceptible lineages outcompeted resistant lineages within the host. These results demonstrate the many benefits of deep sequencing for the genomic surveillance of bacterial pathogens.
Publisher: Springer Science and Business Media LLC
Date: 09-03-2021
DOI: 10.1038/S41467-021-21749-5
Abstract: Enterococcus faecalis is a commensal and nosocomial pathogen, which is also ubiquitous in animals and insects, representing a classical generalist microorganism. Here, we study E. faecalis isolates ranging from the pre-antibiotic era in 1936 up to 2018, covering a large set of host species including wild birds, mammals, healthy humans, and hospitalised patients. We sequence the bacterial genomes using short- and long-read techniques, and identify multiple extant hospital-associated lineages, with last common ancestors dating back as far as the 19th century. We find a population cohesively connected through homologous recombination, a metabolic flexibility despite a small genome size, and a stable large core genome. Our findings indicate that the apparent hospital adaptations found in hospital-associated E. faecalis lineages likely predate the “modern hospital” era, suggesting selection in another niche, and underlining the generalist nature of this nosocomial pathogen.
Publisher: Springer Science and Business Media LLC
Date: 28-11-2018
DOI: 10.1038/S41467-018-07368-7
Abstract: Some of the most common infectious diseases are caused by bacteria that naturally colonise humans asymptomatically. Combating these opportunistic pathogens requires an understanding of the traits that differentiate infecting strains from harmless relatives. Staphylococcus epidermidis is carried asymptomatically on the skin and mucous membranes of virtually all humans but is a major cause of nosocomial infection associated with invasive procedures. Here we address the underlying evolutionary mechanisms of opportunistic pathogenicity by combining pangenome-wide association studies and laboratory microbiology to compare S. epidermidis from bloodstream and wound infections and asymptomatic carriage. We identify 61 genes containing infection-associated genetic elements (k-mers) that correlate with in vitro variation in known pathogenicity traits (biofilm formation, cell toxicity, interleukin-8 production, methicillin resistance). Horizontal gene transfer spreads these elements, allowing divergent clones to cause infection. Finally, Random Forest model prediction of disease status (carriage vs. infection) identifies pathogenicity elements in 415 S. epidermidis isolates with 80% accuracy, demonstrating the potential for identifying risk genotypes pre-operatively.
Publisher: F1000 Research Ltd
Date: 30-07-2018
DOI: 10.12688/WELLCOMEOPENRES.14694.1
Abstract: Identifying structure in collections of sequence data sets remains a common problem in genomics. hierBAPS, a popular algorithm for identifying population structure in haploid genomes, has previously only been available as a MATLAB binary. We provide an R implementation which is both easier to install and use, automating the entire pipeline. Additionally, we allow for the use of multiple processors, improve on the default settings of the algorithm, and provide an interface with the ggtree library to enable informative illustration of the clustering results. Our aim is that this package aids in the understanding and dissemination of the method, as well as enhancing the reproducibility of population structure analyses.
Publisher: Springer Science and Business Media LLC
Date: 22-03-2021
DOI: 10.1038/S41467-021-22238-5
Abstract: A Correction to this paper has been published: 10.1038/s41467-021-22238-5
Publisher: Cold Spring Harbor Laboratory
Date: 24-01-2019
Abstract: The routine use of genomics for disease surveillance provides the opportunity for high-resolution bacterial epidemiology. Current whole-genome clustering and multilocus typing approaches do not fully exploit core and accessory genomic variation, and they cannot both automatically identify, and subsequently expand, clusters of significantly similar isolates in large data sets spanning entire species. Here, we describe PopPUNK (Population Partitioning Using Nucleotide K-mers), a software implementing scalable and expandable annotation- and alignment-free methods for population analysis and clustering. Variable-length k-mer comparisons are used to distinguish isolates’ divergence in shared sequence and gene content, which we demonstrate to be accurate over multiple orders of magnitude using data from both simulations and genomic collections representing 10 taxonomically widespread species. Connections between closely related isolates of the same strain are robustly identified, despite interspecies variation in the pairwise distance distributions that reflects species’ diverse evolutionary patterns. PopPUNK can process 10^3–10^4 genomes in a single batch, with minimal memory use and runtimes up to 200-fold faster than existing model-based methods. Clusters of strains remain consistent as new batches of genomes are added, which is achieved without needing to reanalyze all genomes de novo. This facilitates real-time surveillance with consistent cluster naming between studies and allows for outbreak detection using hundreds of genomes in minutes. Interactive visualization and online publication is streamlined through the automatic output of results to multiple platforms. PopPUNK has been designed as a flexible platform that addresses important issues with currently used whole-genome clustering and typing methods, and has potential uses across bacterial genetics and public health research.
Publisher: Cold Spring Harbor Laboratory
Date: 04-10-2021
DOI: 10.1101/2021.10.04.462983
Abstract: Advances in whole-genome genotyping and sequencing have allowed genome-wide analyses of association, prediction and heritability in many organisms. However, the application of such analyses to bacteria is still in its infancy, being limited by difficulties including the plasticity of bacterial genomes and their strong population structure. Here we propose, and validate using simulations, a suite of genome-wide analyses for bacteria. We combine methods from human genetics and previous bacterial studies, including linear mixed models, elastic net and LD-score regression, and introduce innovations such as frequency-based allele coding, testing for both insertion/deletion and nucleotide effects and partitioning heritability by genome region. We then analyse three phenotypes of a major human pathogen Streptococcus pneumoniae, including the first analyses of minimum inhibitory concentrations (MIC) for each of two antibiotics, penicillin and ceftriaxone. We show that these are highly heritable leading to high prediction accuracy, which is explained by many genetic associations identified under good control of population structure effects. In the case of ceftriaxone MIC, these results are surprising because none of the isolates was resistant according to the inhibition zone diameter threshold. We estimate that just over half of the heritability of penicillin MIC is explained by a known drug-resistance region, which also contributes around a quarter of the heritability of ceftriaxone MIC. For the within-host survival phenotype carriage duration, no reliable associations were found but we observed moderate heritability and prediction accuracy, indicating a polygenic trait. While generating important new results for S. pneumoniae, we have critically assessed existing methods and introduced innovations that will be useful for future large-scale population genomics studies to help decipher the genetic architecture of bacterial traits.
Genome-wide association, prediction and heritability analyses in bacteria are beginning to help unravel the genetic underpinnings of traits such as antimicrobial resistance, virulence, within-host survival and transmissibility. Progress to date is limited by challenges including the effects of strong population structure and variable recombination, and the many gaps in sequence alignments including the absence of entire genes in many isolates. More work is required to critically assess and develop methods for bacterial genomics. We address this task here, using a range of existing methods from bacterial and human genetics, such as linear mixed models, elastic net and LD-score regression. Using simulations, we first validate and then adapt these methods to introduce new analyses, including separate assessment of gap and nucleotide effects, a new allele coding for association analyses and a method to partition heritability into genome regions. We analyse within-host survival and two antimicrobial response traits of Streptococcus pneumoniae, identifying many novel associations while demonstrating good control of population structure and accurate prediction. We present both new results for an important pathogen and methodological advances that will be useful in guiding future studies in bacterial population genomics.
Publisher: Cold Spring Harbor Laboratory
Date: 08-09-2023
Publisher: Cold Spring Harbor Laboratory
Date: 25-04-2022
DOI: 10.1101/2022.04.23.489244
Abstract: Horizontal gene transfer (HGT) plays a critical role in the evolution and diversification of many microbial species. The resulting dynamics of gene gain and loss can have important implications for the development of antibiotic resistance and the design of vaccine and drug interventions. Methods for the analysis of gene presence/absence patterns typically do not account for errors introduced in the automated annotation and clustering of gene sequences. In particular, methods adapted from ecological studies, including the pangenome gene accumulation curve, can be misleading as they may reflect the underlying diversity in the temporal sampling of genomes rather than a difference in the dynamics of HGT. Here, we introduce Panstripe, a method based on Generalised Linear Regression that is robust to population structure, sampling bias and errors in the predicted presence/absence of genes. We demonstrate using simulations that Panstripe can effectively identify differences in the rate and number of genes involved in HGT events, and illustrate its capability by analysing several diverse bacterial genome datasets representing major human pathogens. Panstripe is freely available as an R package at tonkinhill/panstripe.
Publisher: Springer Science and Business Media LLC
Date: 03-02-2021
DOI: 10.1038/S41467-021-20988-W
Abstract: Chickens are the most common birds on Earth and colibacillosis is among the most common diseases affecting them. This major threat to animal welfare and safe sustainable food production is difficult to combat because the etiological agent, avian pathogenic Escherichia coli (APEC), emerges from ubiquitous commensal gut bacteria, with no single virulence gene present in all disease-causing isolates. Here, we address the underlying evolutionary mechanisms of extraintestinal spread and systemic infection in poultry. Combining population scale comparative genomics and pangenome-wide association studies, we compare E. coli from commensal carriage and systemic infections. We identify phylogroup-specific and species-wide genetic elements that are enriched in APEC, including pathogenicity-associated variation in 143 genes that have diverse functions, including genes involved in metabolism, lipopolysaccharide synthesis, heat shock response, antimicrobial resistance and toxicity. We find that horizontal gene transfer spreads pathogenicity elements, allowing divergent clones to cause infection. Finally, a Random Forest model prediction of disease status (carriage vs. disease) identifies pathogenic strains in the emergent ST-117 poultry-associated lineage with 73% accuracy, demonstrating the potential for early identification of emergent APEC in healthy flocks.
Publisher: Cold Spring Harbor Laboratory
Date: 28-01-2020
DOI: 10.1101/2020.01.28.922989
Abstract: Population-level comparisons of prokaryotic genomes must take into account the substantial differences in gene content, resulting from frequent horizontal gene transfer, gene duplication and gene loss. However, the automated annotation of prokaryotic genomes is imperfect, and errors due to fragmented assemblies, contamination, diverse gene families and mis-assemblies accumulate over the population, leading to profound consequences when analysing the set of all genes found in a species. Here we introduce Panaroo, a graph based pangenome clustering tool that is able to account for many of the sources of error introduced during the annotation of prokaryotic genome assemblies. We verified our approach through extensive simulations of de novo assemblies using the infinitely many genes model and by analysing a number of publicly available large bacterial genome datasets. Using a highly clonal Mycobacterium tuberculosis dataset as a negative control case, we show that failing to account for annotation errors can lead to pangenome estimates that are dominated by error. We additionally demonstrate the utility of the improved graphical output provided by Panaroo by performing a pan-genome wide association study in Neisseria gonorrhoeae and by analysing gene gain and loss rates across 51 of the major global pneumococcal sequence clusters. Panaroo is freely available under an open source MIT licence at tonkinhill/panaroo.
Publisher: Cold Spring Harbor Laboratory
Date: 06-07-2018
DOI: 10.1101/360917
Abstract: The routine use of genomics for disease surveillance provides the opportunity for high-resolution bacterial epidemiology. However, current whole-genome clustering and multi-locus typing approaches do not fully exploit core and accessory genomic variation, and cannot both automatically identify, and subsequently expand, clusters of significantly-similar isolates in large datasets and across species. Here we describe PopPUNK (Population Partitioning Using Nucleotide K-mers; poppunk.readthedocs.io/en/latest/), software implementing scalable and expandable annotation- and alignment-free methods for population analysis and clustering. Variable-length k-mer comparisons are used to distinguish isolates’ divergence in shared sequence and gene content, which we demonstrate to be accurate over multiple orders of magnitude using both simulated data and real datasets from ten taxonomically-widespread species. Connections between closely-related isolates of the same strain are robustly identified, despite variation in the discontinuous pairwise distance distributions that reflects species’ diverse evolutionary patterns. PopPUNK can process 10^3-10^4 genomes as a single batch, with minimal memory use and runtimes up to 200-fold faster than existing methods. Clusters of strains remain consistent as new batches of genomes are added, which is achieved without needing to re-analyse all genomes de novo. This facilitates real-time surveillance with stable cluster naming and allows for outbreak detection using hundreds of genomes in minutes. Interactive visualisation and online publication is streamlined through automatic output of results to multiple platforms. PopPUNK has been designed as a flexible platform that addresses important issues with currently used whole-genome clustering and typing methods, and has potential uses across bacterial genetics and public health research.
Location: United Kingdom of Great Britain and Northern Ireland
Start Date: 06-2019
End Date: 11-2022
Amount: $410,000.00
Funder: Australian Research Council