ARDC Research Link Australia

Publication

Inferring the ancestry of everyone

Publisher: Cold Spring Harbor Laboratory

Date: 11-2018

Abstract: A central problem in evolutionary biology is to infer the full genealogical history of a set of DNA sequences. This history contains rich information about the forces that have influenced a sexually reproducing species. However, existing methods are limited: the most accurate is unable to cope with more than a few dozen samples. With modern genetic data sets rapidly approaching millions of genomes, there is an urgent need for efficient inference methods to exploit such rich resources. We introduce an algorithm to infer whole-genome history which has comparable accuracy to the state-of-the-art but can process around four orders of magnitude more sequences. Additionally, our method results in an “evolutionary encoding” of the original sequence data, enabling efficient access to genealogies and calculation of genetic statistics over the data. We apply this technique to human data from the 1000 Genomes Project, Simons Genome Diversity Project and UK Biobank, showing that the genealogies we estimate are both rich in biological signal and efficient to process.

Publication

Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations

Publisher: eLife Sciences Publications, Ltd

Date: 21-06-2023

DOI: 10.7554/ELIFE.84874

Abstract: Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. 
These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.

Publication

htsget: a protocol for securely streaming genomic data

Publisher: Oxford University Press (OUP)

Date: 19-06-2018

DOI: 10.1093/BIOINFORMATICS/BTY492

Abstract: Standardized interfaces for efficiently accessing high-throughput sequencing data are a fundamental requirement for large-scale genomic data sharing. We have developed htsget, a protocol for secure, efficient and reliable access to sequencing read and variation data. We demonstrate four independent client and server implementations, and the results of a comprehensive interoperability demonstration. The specification is available at samtools.github.io/hts-specs/htsget.html. Supplementary data are available at Bioinformatics online.

Publication

Author response: A community-maintained standard library of population genetic models

Publisher: eLife Sciences Publications, Ltd

Date: 26-05-2020

DOI: 10.7554/ELIFE.54967.SA2

Publication

Bioconda: sustainable and comprehensive software distribution for the life sciences

Publisher: Springer Science and Business Media LLC

Date: 07-2018

DOI: 10.1038/S41592-018-0046-7

Publication

Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations

Publisher: eLife Sciences Publications, Ltd

Date: 23-05-2023

DOI: 10.7554/ELIFE.84874.2

Abstract: Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic data sets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and to the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than three-fold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed best practices for setting up genome-scale simulations.
We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.

Publication

Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations

Publisher: eLife Sciences Publications, Ltd

Date: 03-03-2023

DOI: 10.7554/ELIFE.84874.1

Abstract: Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic data sets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and to the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than three-fold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed best practices for setting up genome-scale simulations.
We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.

Publication

Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations

Publisher: eLife Sciences Publications, Ltd

Date: 21-06-2023

DOI: 10.7554/ELIFE.84874.3

Abstract: Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic datasets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than threefold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed the best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. 
These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.

Publication

Efficient ancestry and mutation simulation with msprime 1.0

Publisher: Cold Spring Harbor Laboratory

Date: 09-2021

DOI: 10.1101/2021.08.31.457499

Abstract: Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this necessity, a large number of specialised simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and tskit library. We summarise msprime’s many features, and show that its performance is excellent, often many times faster and more memory efficient than specialised alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.

Publication

Publisher Correction: Inferring whole-genome histories in large population datasets

Publisher: Springer Science and Business Media LLC

Date: 07-10-2019

DOI: 10.1038/S41588-019-0523-7

Abstract: An amendment to this paper has been published and can be accessed via a link at the top of the paper.

Publication

Bayesian inference of ancestral recombination graphs

Publisher: Public Library of Science (PLoS)

Date: 09-03-2022

DOI: 10.1371/JOURNAL.PCBI.1009960

Abstract: We present a novel algorithm, implemented in the software ARGinfer, for probabilistic inference of the Ancestral Recombination Graph under the Coalescent with Recombination. Our Markov Chain Monte Carlo algorithm takes advantage of the Succinct Tree Sequence data structure that has allowed great advances in simulation and point estimation, but not yet probabilistic inference. Unlike previous methods, which employ the Sequentially Markov Coalescent approximation, ARGinfer uses the Coalescent with Recombination, allowing more accurate inference of key evolutionary parameters. We show using simulations that ARGinfer can accurately estimate many properties of the evolutionary history of the sample, including the topology and branch lengths of the genealogical tree at each sequence site, and the times and locations of mutation and recombination events. ARGinfer approximates posterior probability distributions for these and other quantities, providing interpretable assessments of uncertainty that we show to be well calibrated. ARGinfer is currently limited to tens of DNA sequences of several hundreds of kilobases, but has scope for further computational improvements to increase its applicability.

Publication

A community-maintained standard library of population genetic models

Publisher: eLife Sciences Publications, Ltd

Date: 23-06-2020

DOI: 10.7554/ELIFE.54967

Abstract: The explosion in population genomic data demands ever more complex modes of analysis, and increasingly, these analyses depend on sophisticated simulations. Recent advances in population genetic simulation have made it possible to simulate large and complex models, but specifying such models for a particular simulation engine remains a difficult and error-prone task. Computational genetics researchers currently re-implement simulation models independently, leading to inconsistency and duplication of effort. This situation presents a major barrier to empirical researchers seeking to use simulations for power analyses of upcoming studies or sanity checks on existing genomic data. Population genetics, as a field, also lacks standard benchmarks by which new tools for inference might be measured. Here, we describe a new resource, stdpopsim, that attempts to rectify this situation. Stdpopsim is a community-driven open source project, which provides easy access to a growing catalog of published simulation models from a range of organisms and supports multiple simulation engine backends. This resource is available as a well-documented python library with a simple command-line interface. We share some examples demonstrating how stdpopsim can be used to systematically compare demographic inference methods, and we encourage a broader community of developers to contribute to this growing resource.

Publication

Inferring whole-genome histories in large population datasets

Publisher: Springer Science and Business Media LLC

Date: 09-2019

DOI: 10.1038/S41588-019-0483-Y

Publication

Expanding the stdpopsim species catalog, and lessons learned for realistic genome simulations

Publisher: Cold Spring Harbor Laboratory

Date: 31-10-2022

DOI: 10.1101/2022.10.29.514266

Abstract: Simulation is a key tool in population genetics for both methods development and empirical research, but producing simulations that recapitulate the main features of genomic data sets remains a major obstacle. Today, more realistic simulations are possible thanks to large increases in the quantity and quality of available genetic data, and to the sophistication of inference and simulation software. However, implementing these simulations still requires substantial time and specialized knowledge. These challenges are especially pronounced for simulating genomes for species that are not well-studied, since it is not always clear what information is required to produce simulations with a level of realism sufficient to confidently answer a given question. The community-developed framework stdpopsim seeks to lower this barrier by facilitating the simulation of complex population genetic models using up-to-date information. The initial version of stdpopsim focused on establishing this framework using six well-characterized model species (Adrion et al., 2020). Here, we report on major improvements made in the new release of stdpopsim (version 0.2), which includes a significant expansion of the species catalog and substantial additions to simulation capabilities. Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations. Through community-driven efforts, we expanded the number of species in the catalog more than three-fold and broadened coverage across the tree of life. During the process of expanding the catalog, we have identified common sticking points and developed best practices for setting up genome-scale simulations. We describe the input data required for generating a realistic simulation, suggest good practices for obtaining the relevant information from the literature, and discuss common pitfalls and major considerations. 
These improvements to stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations, especially in non-model organisms, making them available, transparent, and accessible to everyone.

Publication

GA4GH: International policies and standards for data sharing across genomic research and healthcare

Publisher: Elsevier BV

Date: 11-2021

DOI: 10.1016/J.XGEN.2021.100029

Publication

Efficient ancestry and mutation simulation with msprime 1.0

Publisher: Oxford University Press (OUP)

Date: 13-12-2021

DOI: 10.1093/GENETICS/IYAB229

Abstract: Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime’s many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.

Jerome Kelleher

Researcher

Related Organisations

The University Of Edinburgh

University Of Oxford

University College Cork

MPSTOR

Related Funding Activities

Discovery Projects - Grant ID: DP210102168
