ARDC Research Link Australia

Publication

Sustainable data analysis with Snakemake

Publisher: F1000 Research Ltd

Date: 19-04-2021

Abstract: Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.

Publication

Sustainable data analysis with Snakemake

Publisher: F1000 Research Ltd

Date: 18-01-2021

DOI: 10.12688/F1000RESEARCH.29032.1

Abstract: Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid. Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.

Publication

Minos: variant adjudication and joint genotyping of cohorts of bacterial genomes

Publisher: Cold Spring Harbor Laboratory

Date: 15-09-2021

DOI: 10.1101/2021.09.15.460475

Abstract: Short-read variant calling for bacterial genomics is a mature field, and there are many widely-used software tools. Different underlying approaches (e.g. pileup, local or global assembly, paired-read use, haplotype use) lend each tool different strengths, especially when considering non-SNP (single nucleotide polymorphism) variation or potentially distant reference genomes. It would therefore be valuable to be able to integrate the results from multiple variant callers, using a robust statistical approach to “adjudicate” at loci where there is disagreement between callers. To this end, we present a tool, Minos, for variant adjudication by mapping reads to a genome graph of variant calls. Minos allows users to combine output from multiple variant callers without loss of precision. Minos also addresses a second problem of joint genotyping SNPs and indels in bacterial cohorts, which can also be framed as an adjudication problem. We benchmark on 62 samples from 3 species (Mycobacterium tuberculosis, Staphylococcus aureus, Klebsiella pneumoniae) and an outbreak of 385 M. tuberculosis samples. Finally, we joint genotype a large M. tuberculosis cohort (N ≈ 15k) for which the rifampicin phenotype is known. We build a map of non-synonymous variants in the RRDR (rifampicin resistance determining region) of the rpoB gene and extend current knowledge relating RRDR SNPs to heterogeneity in rifampicin resistance levels. We replicate this finding in a second M. tuberculosis cohort (N ≈ 13k). Minos is released under the MIT license, available at qbal-lab-org/minos.

Publication

Risk assessment, eradication, and biological control: global efforts to limit Australian acacia invasions

Publisher: Wiley

Date: 08-08-2011

DOI: 10.1111/J.1472-4642.2011.00815.X

Publication

Nucleotide-resolution bacterial pan-genomics with reference graphs

Publisher: Cold Spring Harbor Laboratory

Date: 12-11-2020

DOI: 10.1101/2020.11.12.380378

Abstract: Bacterial genomes follow a U-shaped frequency distribution whereby most genomic loci are either rare (accessory) or common (core); the union of these is the pan-genome. The alignable fraction of two genomes from a single species can be low (e.g. 50-70%), such that no single reference genome can access all single nucleotide polymorphisms (SNPs). The pragmatic solution is to choose a close reference, and analyse SNPs only in the core genome. Given much bacterial adaptability hinges on the accessory genome, this is an unsatisfactory limitation. We present a novel pan-genome graph structure and algorithms implemented in the software pandora, which approximates a sequenced genome as a recombinant of reference genomes, detects novel variation and then pan-genotypes multiple samples. The method takes fastq as input and outputs a multi-sample VCF with respect to an inferred data-dependent reference genome, and is available at mcolq/pandora. Constructing a reference graph from 578 E. coli genomes, we analyse a diverse set of 20 E. coli isolates. We show pandora recovers at least 13k more rare SNPs than single-reference based tools, achieves equal or better error rates with Nanopore as with Illumina data, 6-24x lower Nanopore error rates than other tools, and provides a stable framework for analysing diverse samples without reference bias. We also show that our inferred recombinant VCF reference genome is significantly better than simply picking the closest RefSeq reference. This is a step towards comprehensive cohort analysis of bacterial pan-genomic variation, with potential impacts on genotype/phenotype and epidemiological studies.

Publication

Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs

Publisher: Springer Science and Business Media LLC

Date: 14-09-2021

DOI: 10.1186/S13059-021-02473-1

Abstract: We present pandora, a novel pan-genome graph structure and algorithms for identifying variants across the full bacterial pan-genome. As much bacterial adaptability hinges on the accessory genome, methods which analyze SNPs in just the core genome have unsatisfactory limitations. Pandora approximates a sequenced genome as a recombinant of references, detects novel variation and pan-genotypes multiple samples. Using a reference graph of 578 Escherichia coli genomes, we compare 20 diverse isolates. Pandora recovers more rare SNPs than single-reference-based tools, is significantly better than picking the closest RefSeq reference, and provides a stable framework for analyzing diverse samples without reference bias.

Brice Letcher

Researcher

Publications

Sustainable data analysis with Snakemake

Sustainable data analysis with Snakemake

Minos: variant adjudication and joint genotyping of cohorts of bacterial genomes

Risk assessment, eradication, and biological control: global efforts to limit Australian acacia invasions

Nucleotide-resolution bacterial pan-genomics with reference graphs

Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs

Related Organisations

University Of Cambridge

University Of Cambridge

EMBL-EBI

Ecole Normale Supérieure Lyon

Related Funding Activities
