ORCID Profile
0000-0002-3048-5518
Current Organisations
University of Zurich, Swiss Institute of Bioinformatics
Publisher: Frontiers Media SA
Date: 13-06-2018
Publisher: Cold Spring Harbor Laboratory
Date: 15-11-2017
Abstract: Accurate annotation of all protein-coding sequences (CDSs) is an essential prerequisite to fully exploit the rapidly growing repertoire of completely sequenced prokaryotic genomes. However, large discrepancies among the number of CDSs annotated by different resources, missed functional short open reading frames (sORFs), and overprediction of spurious ORFs represent serious limitations. Our strategy toward accurate and complete genome annotation consolidates CDSs from multiple reference annotation resources, ab initio gene prediction algorithms and in silico ORFs (a modified six-frame translation considering alternative start codons) in an integrated proteogenomics database (iPtgxDB) that covers the entire protein-coding potential of a prokaryotic genome. By extending the PeptideClassifier concept of unambiguous peptides for prokaryotes, close to 95% of the identifiable peptides imply one distinct protein, largely simplifying downstream analysis. Searching a comprehensive Bartonella henselae proteomics data set against such an iPtgxDB allowed us to unambiguously identify novel ORFs uniquely predicted by each resource, including lipoproteins, differentially expressed and membrane-localized proteins, novel start sites and wrongly annotated pseudogenes. Most novelties were confirmed by targeted, parallel reaction monitoring mass spectrometry, including unique ORFs and single amino acid variations (SAAVs) identified in a re-sequenced laboratory strain that are not present in its reference genome. We demonstrate the general applicability of our strategy for genomes with varying GC content and distinct taxonomic origin. We release iPtgxDBs for B. henselae, Bradyrhizobium diazoefficiens and Escherichia coli and the software to generate both proteogenomics search databases and integrated annotation files that can be viewed in a genome browser for any prokaryote.
Publisher: F1000 Research Ltd
Date: 24-05-2019
DOI: 10.12688/F1000RESEARCH.11622.3
Abstract: High-dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high-throughput interrogation and characterization of cell populations. Here, we present an updated R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models or linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g., multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g., plots of aggregated signals).
Publisher: F1000 Research Ltd
Date: 17-12-2019
DOI: 10.12688/F1000RESEARCH.11622.4
Abstract: High-dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high-throughput interrogation and characterization of cell populations. Here, we present an updated R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models or linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g., multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g., plots of aggregated signals).
Publisher: Cold Spring Harbor Laboratory
Date: 11-03-2019
DOI: 10.1101/574525
Abstract: A platform for highly parallel direct sequencing of native RNA strands was recently described by Oxford Nanopore Technologies (ONT); in order to assess overall performance in transcript-level investigations, the technology was applied for sequencing sets of synthetic transcripts as well as a yeast transcriptome. However, despite initial efforts it remains crucial to further investigate characteristics of ONT native RNA sequencing when applied to much more complex transcriptomes. Here we thus undertook extensive native RNA sequencing of polyA+ RNA from two human cell lines, and thereby analysed ~5.2 million aligned native RNA reads which consisted of a total of ~4.6 billion bases. To enable informative comparisons, we also performed relevant ONT direct cDNA- and Illumina-sequencing. We find that while native RNA sequencing does enable some of the anticipated advantages, key unexpected aspects hamper its performance, most notably the quite frequent inability to obtain full-length transcripts from single reads, as well as difficulties in unambiguously inferring their true transcript of origin. While characterising issues that need to be addressed when investigating more complex transcriptomes, our study highlights that with some defined improvements, native RNA sequencing could be an important addition to the mammalian transcriptomics toolbox.
Publisher: F1000 Research Ltd
Date: 26-05-2017
DOI: 10.12688/F1000RESEARCH.11622.1
Abstract: High dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high throughput interrogation and characterization of cell populations. Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g. multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g. plots of aggregated signals).
Publisher: F1000 Research Ltd
Date: 14-11-2017
DOI: 10.12688/F1000RESEARCH.11622.2
Abstract: High dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high throughput interrogation and characterization of cell populations. Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g. multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g. plots of aggregated signals).
Publisher: Springer Science and Business Media LLC
Date: 10-02-2023
DOI: 10.1186/S13059-023-02859-3
Abstract: Quality control (QC) is a critical component of single-cell RNA-seq (scRNA-seq) processing pipelines. Current approaches to QC implicitly assume that datasets are comprised of one cell type, potentially resulting in biased exclusion of rare cell types. We introduce , which robustly fits a Gaussian mixture model across multiple samples, improves sensitivity, and reduces bias compared to current approaches. We show via simulations that is less susceptible to exclusion of rarer cell types. We also demonstrate on a complex real dataset (867k cells over 172 samples). is general, is implemented in R, and could be applied to other data types.
Publisher: Cold Spring Harbor Laboratory
Date: 03-06-2022
DOI: 10.1101/2022.06.01.493823
Abstract: Single-cell RNA-sequencing is advancing our understanding of synovial pathobiology in inflammatory arthritis. Here, we optimized the protocol for the dissociation of fresh synovial biopsies and created a reference single-cell map of fresh human synovium in inflammatory arthritis. We utilized the published method for dissociating cryopreserved synovium and optimized it for dissociating small fresh synovial biopsies. The optimized protocol enabled the isolation of a good yield of consistently highly viable cells, minimizing the dropout rate of prospectively collected biopsies. Our reference synovium map comprised over 100’000 unsorted single-cell profiles from 25 synovial tissues of patients with inflammatory arthritis. Synovial cells formed 11 lymphoid, 15 myeloid and 16 stromal cell clusters, including IFITM2+ synovial neutrophils. Using this reference map, we successfully annotated published synovial scRNA-seq datasets. Our dataset uncovered endothelial cell diversity and identified SOD2 high SAA1+SAA2+ and SERPINE1+COL5A3+ fibroblast clusters, expressing genes linked to cartilage breakdown (SDC4) and extracellular matrix remodelling (LOXL2, TGFBI, TGFB1), respectively. We broadened the characterization of tissue resident FOLR2+COLEC12 high and LYVE1+SLC40A1+ macrophages, inferring their extracellular matrix sensing and iron recycling activities. Our research brings an efficient synovium dissociation protocol and a reference annotation resource of fresh human synovium, while expanding the knowledge about synovial cell diversity in inflammatory arthritis.
Publisher: Springer Science and Business Media LLC
Date: 09-10-2019
Publisher: Research Square Platform LLC
Date: 16-03-2022
DOI: 10.21203/RS.3.RS-1367459/V1
Abstract: Neurons live for the lifespan of the individual and underlie our ability for lifelong learning and memory. However, aging alters neuron morphology and function resulting in age-related cognitive decline. It is well established that epigenetic alterations are essential for learning and memory, yet few neuron-specific genome-wide epigenetic maps exist into old age. Comprehensive mapping of H3K4me3 and H3K27ac in mouse neurons across lifespan revealed plastic H3K4me3 marking that differentiates neuronal age linked to known characteristics of cellular and neuronal aging. We determined that neurons in old age recapitulate the H3K27ac enrichment at promoters, enhancers and super enhancers from young adult neurons, likely representing a re-activation of pathways to maintain neuronal output. Finally, this study identified new characteristics of neuronal aging, including altered rDNA regulation and epigenetic regulatory mechanisms. Collectively, these findings indicate a key role for epigenetic regulation in neurons, that is inextricably linked with aging.
Publisher: Mary Ann Liebert Inc
Date: 06-2020
Publisher: Cold Spring Harbor Laboratory
Date: 18-04-2017
DOI: 10.1101/127506
Abstract: The Mediterranean fruitfly Ceratitis capitata (medfly) is an invasive agricultural pest of high economical impact and has become an emerging model for developing new genetic control strategies as alternative to insecticides. Here, we report the successful adaptation of CRISPR-Cas9-based gene disruption in the medfly by injecting in vitro pre-assembled, solubilized Cas9 ribonucleoprotein complexes (RNPs) loaded with gene-specific sgRNAs into early embryos. When targeting the eye pigmentation gene white eye (we), we observed a high rate of somatic mosaicism in surviving G0 adults. Germline transmission of mutated we alleles by G0 animals was on average above 70%, with individual cases achieving a transmission rate of nearly 100%. We further recovered large deletions in the we gene when two sites were simultaneously targeted by two sgRNAs. CRISPR-Cas9 targeting of the Ceratitis ortholog of the Drosophila segmentation paired gene (Ccprd) caused segmental malformations in late embryos and in hatched larvae. Mutant phenotypes correlate with repair by non-homologous end joining (NHEJ) lesions in the two targeted genes. This simple and highly effective Cas9 RNP-based gene editing to introduce mutations in Ceratitis capitata will significantly advance the design and development of new effective strategies for pest control management.
Publisher: Cold Spring Harbor Laboratory
Date: 16-02-2017
DOI: 10.1101/109082
Abstract: Reductions in sequencing cost and innovations in expression quantification have prompted an emergence of RNA-seq studies with complex designs and data analysis at transcript resolution. These applications involve multiple hypotheses per gene, leading to challenging multiple testing problems. Conventional approaches provide separate top-lists for every contrast and false discovery rate (FDR) control at individual hypothesis level. Hence, they fail to establish proper gene-level error control, which compromises downstream validation experiments. Tests that aggregate individual hypotheses are more powerful and provide gene-level FDR control, but in the RNA-seq literature no methods are available for post-hoc analysis of individual hypotheses. We introduce a two-stage procedure that leverages the increased power of aggregated hypothesis tests while maintaining high biological resolution by post-hoc analysis of genes passing the screening hypothesis. Our method is evaluated on simulated and real RNA-seq experiments. It provides gene-level FDR control in studies with complex designs while boosting power for interaction effects without compromising the discovery of main effects. In a differential transcript usage/expression context, stage-wise testing gains power by aggregating hypotheses at the gene level, while providing transcript-level assessment of genes passing the screening stage. Finally, a prostate cancer case study highlights the relevance of combining gene with transcript level results. Stage-wise testing is a general paradigm that can be adopted whenever individual hypotheses can be aggregated. In our context, it achieves an optimal middle ground between biological resolution and statistical power while providing gene-level FDR control, which is beneficial for downstream biological interpretation and validation.
Publisher: Springer Science and Business Media LLC
Date: 29-01-2014
Abstract: Chromothripsis is a recently discovered phenomenon of genomic rearrangement, possibly arising during a single genome-shattering event. This could provide an alternative paradigm in cancer development, replacing the gradual accumulation of genomic changes with a “one-off” catastrophic event. However, the term has been used with varying operational definitions, with the minimal consensus being a large number of locally clustered copy number aberrations. The mechanisms underlying these chromothripsis-like patterns (CTLP) and their specific impact on tumorigenesis are still poorly understood. Here, we identified CTLP in 918 cancer samples, from a dataset of more than 22,000 oncogenomic arrays covering 132 cancer types. Fragmentation hotspots were found to be located on chromosomes 8, 11, 12 and 17. Among the various cancer types, soft-tissue tumors exhibited particularly high CTLP frequencies. Genomic context analysis revealed that CTLP rearrangements frequently occurred in genomes that additionally harbored multiple copy number aberrations (CNAs). An investigation into the affected chromosomal regions showed a large proportion of arm-level pulverization and telomere related events, which would be compatible with a number of underlying mechanisms. We also report evidence that these genomic events may be correlated with patient age, stage and survival rate. Through a large-scale analysis of oncogenomic array data sets, this study characterized features associated with genomic aberration patterns, compatible with the spectrum of “chromothripsis”-definitions as previously used. While quantifying clustered genomic copy number aberrations in cancer samples, our data indicates an underlying biological heterogeneity behind these chromothripsis-like patterns, beyond a well defined “chromothripsis” phenomenon.
Publisher: F1000 Research Ltd
Date: 10-09-2018
DOI: 10.12688/F1000RESEARCH.15666.2
Abstract: Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using nine publicly available scRNA-seq data sets as well as three simulations with varying degree of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the investigation of the performance of the clustering algorithms themselves. We evaluated the ability of recovering known subpopulations, the stability and the run time and scalability of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. All the code used for the evaluation is available on GitHub ( arkrobinsonuzh/scRNAseq_clustering_comparison ). In addition, an R package providing access to data and clustering results, thereby facilitating inclusion of new methods and data sets, is available from Bioconductor ( ackages/DuoClustering2018 ).
Publisher: F1000 Research Ltd
Date: 26-07-2018
DOI: 10.12688/F1000RESEARCH.15666.1
Abstract: Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 12 clustering algorithms, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using 9 publicly available scRNA-seq data sets as well as three simulations with varying degree of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the investigation of the performance of the clustering algorithms themselves. We evaluated the ability of recovering known subpopulations, the stability and the run time of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. The R scripts providing an extensible framework for the evaluation of new methods and data sets are available on GitHub ( arkrobinsonuzh/scRNAseq_clustering_comparison ).
Publisher: Cold Spring Harbor Laboratory
Date: 25-11-2020
DOI: 10.1101/2020.11.24.394213
Abstract: We present distinct, a general method for differential analysis of full distributions that is well suited to applications on single-cell data, such as single-cell RNA sequencing and high-dimensional flow or mass cytometry data. High-throughput single-cell data reveal an unprecedented view of cell identity and allow complex variations between conditions to be discovered; nonetheless, most methods for differential expression target differences in the mean and struggle to identify changes where the mean is only marginally affected. distinct is based on a hierarchical non-parametric permutation approach and, by comparing empirical cumulative distribution functions, identifies both differential patterns involving changes in the mean, as well as more subtle variations that do not involve the mean. We performed extensive benchmarks across both simulated and experimental datasets from single-cell RNA sequencing and mass cytometry data, where distinct shows favourable performance, identifies more differential patterns than competitors, and displays good control of false positive and false discovery rates. distinct is available as a Bioconductor R package.
Publisher: Cold Spring Harbor Laboratory
Date: 10-02-2022
DOI: 10.1101/2022.02.08.479579
Abstract: Long-read RNA sequencing (lrRNA-seq) produces detailed information about full-length transcripts, including novel and sample-specific isoforms. Furthermore, there is opportunity to call variants directly from lrRNA-seq data. However, most state-of-the-art variant callers have been developed for genomic DNA. Here, there are two objectives: first, we perform a mini-benchmark on GATK, DeepVariant, Clair3, and NanoCaller primarily on PacBio Iso-Seq data, but also on Nanopore and Illumina RNA-seq data; second, we propose a pipeline to process spliced-alignment files, making them suitable for variant calling with DNA-based callers. With such manipulations, high calling performance can be achieved using DeepVariant on Iso-Seq data.
Publisher: Springer Science and Business Media LLC
Date: 03-2006
DOI: 10.1038/NATURE04670
Abstract: Identification of protein-protein interactions often provides insight into protein function, and many cellular processes are performed by stable protein complexes. We used tandem affinity purification to process 4,562 different tagged proteins of the yeast Saccharomyces cerevisiae. Each preparation was analysed by both matrix-assisted laser desorption/ionization-time of flight mass spectrometry and liquid chromatography tandem mass spectrometry to increase coverage and accuracy. Machine learning was used to integrate the mass spectrometry scores and assign probabilities to the protein-protein interactions. Among 4,087 different proteins identified with high confidence by mass spectrometry from 2,357 successful purifications, our core data set (median precision of 0.69) comprises 7,123 protein-protein interactions involving 2,708 proteins. A Markov clustering algorithm organized these interactions into 547 protein complexes averaging 4.9 subunits per complex, about half of them absent from the MIPS database, as well as 429 additional interactions between pairs of complexes. The data (all of which are available online) will help future studies on individual proteins as well as functional genomics and systems biology.
Publisher: The American Association of Immunologists
Date: 15-03-2022
Abstract: The oncotherapeutic promise of IL-15, a potent immunostimulant, is limited by a short serum t1/2. The fusion protein N-803 is a chimeric IL-15 superagonist that has a & -fold longer in vivo t1/2 versus IL-15. This phase 1 study characterized the pharmacokinetic (PK) profile and safety of N-803 after s.c. administration to healthy human volunteers. Volunteers received two doses of N-803, and after each dose, PK and safety were assessed for 9 d. The primary endpoint was the N-803 PK profile, the secondary endpoint was safety, and immune cell levels and immunogenicity were measures of interest. Serum N-803 concentrations peaked 4 h after administration and declined with a t1/2 of ∼20 h. N-803 did not cause treatment-emergent serious adverse events (AEs) or grade ≥3 AEs. Injection site reactions, chills, and pyrexia were the most common AEs. Administration of N-803 was well tolerated and accompanied by proliferation of NK cells and CD8+ T cells and sustained increases in the number of NK cells. Our results suggest that N-803 administration can potentiate antitumor immunity.
Publisher: Cold Spring Harbor Laboratory
Date: 02-02-2020
DOI: 10.1101/2020.02.02.930578
Abstract: The massive growth of single-cell RNA-sequencing (scRNAseq) and the methods for its analysis still lack sufficient and up-to-date benchmarks that could guide analytical choices. Numerous benchmark studies already exist and cover most of scRNAseq processing and analytical methods but only a few give advice on a comprehensive pipeline. Moreover, current studies often focused on isolated steps of the process and do not address the impact of a tool on both the intermediate and the final steps of the analysis. Here, we present a flexible R framework for pipeline comparison with multi-level evaluation metrics. We apply it to the benchmark of scRNAseq analysis pipelines using simulated and real datasets with known cell identities, covering common methods of filtering, doublet detection, normalization, feature selection, denoising, dimensionality reduction and clustering. We evaluate the choice of these tools with multi-purpose metrics to assess their ability to reveal cell population structure and lead to efficient clustering. On the basis of our systematic evaluations of analysis pipelines, we make a number of practical recommendations about current analysis choices and for a comprehensive pipeline. The evaluation framework that we developed, pipeComp ( lger ipeComp ), has been implemented so as to easily integrate any other step, tool, or evaluation metric allowing extensible benchmarks and easy applications to other fields of research in Bioinformatics, as we demonstrate through a study of the impact of removal of unwanted variation on differential expression analysis.
Publisher: Oxford University Press (OUP)
Date: 11-11-2009
DOI: 10.1093/BIOINFORMATICS/BTP616
Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (bioconductor.org). Contact: mrobinson@wehi.edu.au
Publisher: Frontiers Media SA
Date: 05-03-2015
Publisher: Oxford University Press (OUP)
Date: 20-04-2014
DOI: 10.1093/NAR/GKU310
Publisher: Annual Reviews
Date: 20-07-2019
DOI: 10.1146/ANNUREV-BIODATASCI-072018-021255
Abstract: Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq data sets, as well as the performance of the myriad of methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
Publisher: Elsevier BV
Date: 06-2003
DOI: 10.1016/S0092-8674(03)00466-5
Abstract: Predictive analysis using publicly available yeast functional genomics and proteomics data suggests that many more proteins may be involved in biogenesis of ribonucleoproteins than are currently known. Using a microarray that monitors abundance and processing of noncoding RNAs, we analyzed 468 yeast strains carrying mutations in protein-coding genes, most of which have not previously been associated with RNA or RNP synthesis. Many strains mutated in uncharacterized genes displayed aberrant noncoding RNA profiles. Ten factors involved in noncoding RNA biogenesis were verified by further experimentation, including a protein required for 20S pre-rRNA processing (Tsr2p), a protein associated with the nuclear exosome (Lrp1p), and a factor required for box C/D snoRNA accumulation (Bcd1p). These data present a global view of yeast noncoding RNA processing and confirm that many currently uncharacterized yeast proteins are involved in biogenesis of noncoding RNA.
Publisher: Life Science Alliance, LLC
Date: 04-05-2023
Abstract: Continuity, correctness, and completeness of genome assemblies are important for many biological projects. Long reads represent a major driver towards delivering high-quality genomes, but not everybody can achieve the necessary coverage for good long read-only assemblies. Therefore, improving existing assemblies with low-coverage long reads is a promising alternative. The improvements include correction, scaffolding, and gap filling. However, most tools perform only one of these tasks and the useful information of reads that supported the scaffolding is lost when running separate programs successively. Therefore, we propose a new tool for combined execution of all three tasks using PacBio or Oxford Nanopore reads. gapless is available at: chmeing/gapless .
Publisher: Springer Science and Business Media LLC
Date: 10-2016
DOI: 10.1038/NATURE19801
Publisher: Springer Science and Business Media LLC
Date: 08-12-2014
DOI: 10.1038/NG.3165
Abstract: Prostate cancer is driven by a combination of genetic and/or epigenetic alterations. Epigenetic alterations are frequently observed in all human cancers, yet how aberrant epigenetic signatures are established is poorly understood. Here we show that the gene encoding BAZ2A (TIP5), a factor previously implicated in epigenetic rRNA gene silencing, is overexpressed in prostate cancer and is paradoxically involved in maintaining prostate cancer cell growth, a feature specific to cancer cells. BAZ2A regulates numerous protein-coding genes and directly interacts with EZH2 to maintain epigenetic silencing at genes repressed in metastasis. BAZ2A overexpression is tightly associated with a molecular subtype displaying a CpG island methylator phenotype (CIMP). Finally, high BAZ2A levels serve as an independent predictor of biochemical recurrence in a cohort of 7,682 individuals with prostate cancer. This work identifies a new aberrant role for the epigenetic regulator BAZ2A, which can also serve as a useful marker for metastatic potential in prostate cancer.
Publisher: Elsevier BV
Date: 07-2015
Publisher: Cold Spring Harbor Laboratory
Date: 08-11-2019
DOI: 10.1101/834242
Abstract: There is a growing appreciation of the role of non-coding RNAs in the regulation of gene and protein expression. Long non-coding RNAs can modulate splicing by hybridizing with precursor messenger RNAs (pre-mRNAs) and influence RNA editing, mRNA stability, translation activation and microRNA-mRNA interactions by binding to mature mRNAs. LncRNAs are highly abundant in the brain and have been implicated in neurodevelopmental disorders. Long intergenic non-coding RNAs are the largest subclass of lncRNAs and play a crucial role in gene regulation. We used RNA sequencing and bioinformatic analyses to identify lincRNAs and their predicted mRNA targets associated with fear extinction that was induced by intra-hippocampally administered D-cycloserine in an animal model investigating the core phenotypes of PTSD. We identified 43 differentially expressed fear extinction related lincRNAs and 190 differentially expressed fear extinction related mRNAs. Eight of these lincRNAs were predicted to interact with and regulate 108 of these mRNAs and seven lincRNAs were predicted to interact with 22 of their pre-mRNA transcripts. On the basis of the functions of their target RNAs, we inferred that these lincRNAs bind to nucleotides, ribonucleotides and proteins and subsequently influence nervous system development, and morphology, immune system functioning, and are associated with nervous system and mental health disorders. Quantitative trait loci that overlapped with fear extinction related lincRNAs, included serum corticosterone level, neuroinflammation, anxiety, stress and despair related responses. This is the first study to identify lincRNAs and their RNA targets with a putative role in transcriptional regulation during fear extinction.
Publisher: Public Library of Science (PLoS)
Date: 10-11-2016
Publisher: Springer Science and Business Media LLC
Date: 24-04-2023
DOI: 10.1186/S13059-023-02923-Y
Abstract: Long-read RNA sequencing (lrRNA-seq) produces detailed information about full-length transcripts, including novel and sample-specific isoforms. Furthermore, there is an opportunity to call variants directly from lrRNA-seq data. However, most state-of-the-art variant callers have been developed for genomic DNA. Here, there are two objectives: first, we perform a mini-benchmark on GATK, DeepVariant, Clair3, and NanoCaller primarily on PacBio Iso-Seq data, but also on Nanopore and Illumina RNA-seq data; second, we propose a pipeline to process spliced-alignment files, making them suitable for variant calling with DNA-based callers. With such manipulations, high calling performance can be achieved using DeepVariant on Iso-Seq data.
Publisher: Springer Science and Business Media LLC
Date: 20-06-2019
Publisher: Cold Spring Harbor Laboratory
Date: 30-06-2017
DOI: 10.1101/157982
Abstract: Dropout in single cell RNA-seq (scRNA-seq) applications causes many transcripts to go undetected. It induces excess zero counts, which leads to power issues in differential expression (DE) analysis and has triggered the development of bespoke scRNA-seq DE tools that cope with zero-inflation. Recent evaluations, however, have shown that dedicated scRNA-seq tools provide no advantage compared to traditional bulk RNA-seq tools. We introduce zingeR, a zero-inflated negative binomial model that identifies excess zero counts and generates observation weights to unlock bulk RNA-seq pipelines for zero-inflation, boosting performance in scRNA-seq differential expression analysis.
Publisher: Life Science Alliance, LLC
Date: 17-01-2019
Abstract: Most methods for statistical analysis of RNA-seq data take a matrix of abundance estimates for some type of genomic features as their input, and consequently the quality of any obtained results is directly dependent on the quality of these abundances. Here, we present the junction coverage compatibility score, which provides a way to evaluate the reliability of transcript-level abundance estimates and the accuracy of transcript annotation catalogs. It works by comparing the observed number of reads spanning each annotated splice junction in a genomic region to the predicted number of junction-spanning reads, inferred from the estimated transcript abundances and the genomic coordinates of the corresponding annotated transcripts. We show that although most genes show good agreement between the observed and predicted junction coverages, there is a small set of genes that do not. Genes with poor agreement are found regardless of the method used to estimate transcript abundances, and the corresponding transcript abundances should be treated with care in any downstream analyses.
Publisher: Springer Science and Business Media LLC
Date: 23-10-2015
Publisher: Springer Science and Business Media LLC
Date: 2014
Publisher: American Association for the Advancement of Science (AAAS)
Date: 27-09-2019
Abstract: The Mediterranean fruit fly or Medfly (Ceratitis capitata) is a global and highly destructive fruit pest. Meccariello et al. identified the master gene for male sex determination on the Y chromosome of Medfly and named it Maleness-on-the-Y (MoY) (see the Perspective by Makki and Meller). Flies of each sex were transformed into the other sex by genetic manipulation, and crosses of transformed flies generated male and female progeny. MoY is functionally conserved in the olive fruit fly and in the invasive oriental fruit fly. This discovery has potential for insect genetic control based on mass release of sterile males and future strategies based on gene drive. Science, this issue p. 1457; see also p. 1380
Publisher: Cold Spring Harbor Laboratory
Date: 27-03-2021
DOI: 10.1101/2021.03.26.436976
Abstract: Epithelial-mesenchymal transition (EMT) equips breast cancer cells for metastasis and treatment resistance. Inhibition and elimination of EMT-undergoing cells are therefore promising therapy approaches. However, detecting EMT-undergoing cells is challenging due to the intrinsic heterogeneity of cancer cells and the phenotypic diversity of EMT programs. Here, we profiled EMT transition phenotypes in four non-cancerous human mammary epithelial cell lines using a FACS surface marker screen, RNA sequencing, and mass cytometry. EMT was induced in the HMLE and MCF10A cell lines and in the HMLE-Twist-ER and HMLE-Snail-ER cell lines by chronic exposure to TGFβ1 or 4-hydroxytamoxifen, respectively. We observed a spectrum of EMT transition phenotypes in each cell line and the spectrum varied across the time course. Our data provide multiparametric insights at single-cell level into the phenotypic diversity of EMT at different time points and in four human cellular models. These insights are valuable to better understand the complexity of EMT, to compare EMT transitions between the cellular models used herein, and for the design of EMT time course experiments. Mendeley Data: DOI: 10.17632 t3gmyk5r2.1 ArrayExpress Data: Accession number E-MTAB-9365
Publisher: Cold Spring Harbor Laboratory
Date: 28-08-2021
DOI: 10.1101/2021.08.28.458012
Abstract: Quality control (QC) is a critical component of single-cell RNA-seq (scRNA-seq) processing pipelines. Current approaches to QC implicitly assume that datasets are comprised of one celltype, potentially resulting in biased exclusion of rare celltypes. We introduce SampleQC, which robustly fits a Gaussian mixture model across multiple samples, and improves sensitivity and reduces bias compared to current approaches. We show via simulations that SampleQC is less susceptible to exclusion of rarer celltypes. We also demonstrate SampleQC on a complex real dataset (867k cells over 172 samples). SampleQC is general, is implemented in R, and could be applied to other data types.
Publisher: Cold Spring Harbor Laboratory
Date: 17-08-2023
DOI: 10.1101/2023.08.17.553679
Abstract: Although transcriptomics data is typically used to analyse mature spliced mRNA, recent attention has focused on jointly investigating spliced and unspliced (or precursor-) mRNA, which can be used to study gene regulation and changes in gene expression production. Nonetheless, most methods for spliced/unspliced inference (such as RNA velocity tools) focus on individual samples, and rarely allow comparisons between groups of samples (e.g., healthy vs. diseased). Furthermore, this kind of inference is challenging, because spliced and unspliced mRNA abundance is characterized by a high degree of quantification uncertainty, due to the prevalence of multi-mapping reads, i.e., reads compatible with multiple transcripts (or genes), and/or with both their spliced and unspliced versions. Here, we present DifferentialRegulation, a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA). We model the quantification uncertainty via a latent variable approach, where reads are allocated to their gene/transcript of origin, and to the respective splice version. We designed several benchmarks where our approach shows good performance, in terms of sensitivity and error control, versus state-of-the-art competitors. Importantly, our tool is flexible, and works with both bulk and single-cell RNA-sequencing data. DifferentialRegulation is distributed as a Bioconductor R package.
Publisher: Springer Science and Business Media LLC
Date: 2010
Publisher: Cold Spring Harbor Laboratory
Date: 02-11-2010
Abstract: DNA methylation is an essential epigenetic modification that plays a key role associated with the regulation of gene expression during differentiation, but in disease states such as cancer, the DNA methylation landscape is often deregulated. There are now numerous technologies available to interrogate the DNA methylation status of CpG sites in a targeted or genome-wide fashion, but each method, due to intrinsic biases, potentially interrogates different fractions of the genome. In this study, we compare the affinity-purification of methylated DNA between two popular genome-wide techniques, methylated DNA immunoprecipitation (MeDIP) and methyl-CpG binding domain-based capture (MBDCap), and show that each technique operates in a different domain of the CpG density landscape. We explored the effect of whole-genome amplification and illustrate that it can reduce sensitivity for detecting DNA methylation in GC-rich regions of the genome. By using MBDCap, we compare and contrast microarray- and sequencing-based readouts and highlight the impact that copy number variation (CNV) can make in differential comparisons of methylomes. These studies reveal that the analysis of DNA methylation data and genome coverage is highly dependent on the method employed, and consideration must be made in light of the GC content, the extent of DNA amplification, and the copy number.
Publisher: F1000 Research Ltd
Date: 28-09-2021
DOI: 10.12688/F1000RESEARCH.73600.1
Abstract: Doublets are prevalent in single-cell sequencing data and can lead to artifactual findings. A number of strategies have therefore been proposed to detect them. Building on the strengths of existing approaches, we developed scDblFinder, a fast, flexible and accurate Bioconductor-based doublet detection method. Here we present the method, justify its design choices, demonstrate its performance on both single-cell RNA and accessibility sequencing data, and provide some observations on doublet formation, detection, and enrichment analysis. Even in complex datasets, scDblFinder can accurately identify most heterotypic doublets, and was already found by an independent benchmark to outcompete alternatives.
Publisher: Springer Science and Business Media LLC
Date: 26-02-2018
Publisher: F1000 Research Ltd
Date: 16-05-2022
DOI: 10.12688/F1000RESEARCH.73600.2
Abstract: Doublets are prevalent in single-cell sequencing data and can lead to artifactual findings. A number of strategies have therefore been proposed to detect them. Building on the strengths of existing approaches, we developed scDblFinder, a fast, flexible and accurate Bioconductor-based doublet detection method. Here we present the method, justify its design choices, demonstrate its performance on both single-cell RNA and accessibility (ATAC) sequencing data, and provide some observations on doublet formation, detection, and enrichment analysis. Even in complex datasets, scDblFinder can accurately identify most heterotypic doublets, and was already found by an independent benchmark to outcompete alternatives.
Publisher: Cold Spring Harbor Laboratory
Date: 15-11-2021
DOI: 10.1101/2021.11.15.468676
Abstract: With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyse aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant, on their own as well as in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task, and often use simulated data that provide a ground truth for evaluations. Thus, demanding a high quality standard for synthetically generated data is critical to make simulation study results credible and transferable to real data. Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch- and cluster-level. Secondly, we investigate the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which and to what extent quality control summaries can capture reference-simulation similarity. Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects; they yield over-optimistic performance of integration, and potentially unreliable ranking of clustering methods; and it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.
Publisher: The Company of Biologists
Date: 2017
DOI: 10.1242/DEV.144535
Abstract: Morphogenesis requires the dynamic regulation of gene expression, including transcription, mRNA maturation and translation. Dysfunction of the general mRNA splicing machinery can cause surprisingly specific cellular phenotypes, but the basis for these effects is not clear. Here we show that the Drosophila faint sausage (fas) locus, implicated in epithelial morphogenesis and previously reported to encode a secreted immunoglobulin domain protein, in fact encodes a subunit of the spliceosome-activating Prp19 complex, which is essential for efficient pre-mRNA splicing. Loss of zygotic fas function globally impairs the efficiency of splicing, and is associated with widespread retention of introns in mRNAs and dramatic changes in gene expression. Surprisingly, despite these general effects, zygotic fas mutants show specific defects in tracheal cell migration during mid-embryogenesis when maternally supplied splicing factors have declined. We propose that tracheal branching, which relies on dynamic changes in gene expression, is particularly sensitive for efficient spliceosome function. Our results reveal an entry point to study requirements of the splicing machinery during organogenesis and provide a better understanding of disease phenotypes associated with mutations in general splicing factors.
Publisher: Wiley
Date: 22-04-2019
DOI: 10.1111/NPH.15815
Publisher: Springer Science and Business Media LLC
Date: 02-07-2018
DOI: 10.1038/S41591-018-0094-7
Abstract: In the version of this article initially published, Figs. 5a,c and 6a were incorrect because of an error in a metadata spreadsheet that led to the healthy donor patient 2 (HD2) samples being used twice in the analysis of baseline samples and in the analysis at 12 weeks of anti-PD-1 therapy, while HD3 samples had not been used.
Publisher: Cold Spring Harbor Laboratory
Date: 18-01-2018
DOI: 10.1101/250126
Abstract: Dropout events in single-cell transcriptome sequencing (scRNA-seq) cause many transcripts to go undetected and induce an excess of zero read counts, leading to power issues in differential expression (DE) analysis. This has triggered the development of bespoke scRNA-seq DE methods to cope with zero inflation. Recent evaluations, however, have shown that dedicated scRNA-seq tools provide no advantage compared to traditional bulk RNA-seq tools. We introduce a weighting strategy, based on a zero-inflated negative binomial (ZINB) model, that identifies excess zero counts and generates gene and cell-specific weights to unlock bulk RNA-seq DE pipelines for zero-inflated data, boosting performance for scRNA-seq.
Publisher: Cold Spring Harbor Laboratory
Date: 07-09-2017
DOI: 10.1101/185744
Abstract: Mass cytometry enables simultaneous analysis of over 40 proteins and their modifications in single cells through use of metal-tagged antibodies. Compared to fluorescent dyes, the use of pure metal isotopes strongly reduces spectral overlap among measurement channels. Crosstalk still exists, however, caused by isotopic impurity, oxide formation, and mass cytometer properties. Spillover effects can be minimized, but not avoided, by following a set of constraining rules when designing an antibody panel. Generation of such low crosstalk panels requires considerable expert knowledge, knowledge of the abundance of each marker and substantial experimental effort. Here we describe a novel bead-based compensation workflow that includes R-based software and a web tool, which enables correction for interference between channels. We demonstrate utility in suspension mass cytometry and show how this approach can be applied to imaging mass cytometry. Our approach greatly simplifies the development of new antibody panels, increases flexibility for antibody-metal pairing, improves overall data quality, thereby reducing the risk of reporting cell phenotype and function artifacts, and greatly facilitates analysis of complex samples for which antigen abundances are unknown.
Publisher: American Society of Hematology
Date: 17-03-2016
DOI: 10.1182/BLOOD-2015-08-662635
Abstract: The sphingosine-1-phosphate receptor 2 (S1PR2) is a novel tumor suppressor and survival prognosticator in the ABC subtype of DLBCL. S1PR2 is a direct, repressed FOXP1 target; ectopic S1PR2 expression induces apoptosis in DLBCL cells in vitro and prevents tumor growth.
Publisher: Elsevier BV
Date: 05-2018
Publisher: Cold Spring Harbor Laboratory
Date: 25-05-2023
DOI: 10.1101/2023.05.24.542159
Abstract: We previously identified 16,772 colorectal cancer-associated hypermethylated DNA regions that were also detectable in precancerous colorectal lesions (preCRCs) and unrelated to normal mucosal aging. We have now conducted a study to validate 990 of these differently methylated DNA regions in a new series of preCRCs. We used targeted bisulfite sequencing to validate these 990 potential biomarkers in 59 preCRC tissue samples (41 conventional adenomas, 18 sessile serrated lesions), each with a patient-matched normal mucosal sample. Differential DNA methylation tests for each CpG dinucleotide were conducted, with results aggregated at region level, to choose panels of candidate biomarkers that were (cross-)validated with respect to their stratifying potential between preCRCs and normal mucosas as well as on an independent cohort. Strong differences in methylation level were observed across the full set of 990 investigated DMRs. Among the 100 randomly selected panels of 30 DMRs analyzed with our bioinformatic approach, the best performing panel correctly classified 58/59 tumors (area under the receiver operating curve: 0.998). These validated DNA hypermethylation markers can be exploited to develop more accurate noninvasive colorectal tumor screening assays.
Publisher: Cold Spring Harbor Laboratory
Date: 23-09-2022
DOI: 10.1101/2022.09.22.508982
Abstract: Computational methods represent the lifeblood of modern molecular biology. Benchmarking is important for all methods, but with a focus here on computational methods, benchmarking is critical to dissect important steps of analysis pipelines, formally assess performance across common situations as well as edge cases, and ultimately guide users on what tools to use. Benchmarking can also be important for community building and advancing methods in a principled way. We conducted a meta-analysis of recent single-cell benchmarks to summarize the scope, extensibility, neutrality, as well as technical features and whether best practices in open data and reproducible research were followed. The results highlight that while benchmarks often make code available and are in principle reproducible, they remain difficult to extend, for example, as new methods and new ways to assess methods emerge. In addition, embracing containerization and workflow systems would enhance reusability of intermediate benchmarking results, thus also driving wider adoption.
Publisher: Cold Spring Harbor Laboratory
Date: 08-04-2016
DOI: 10.1101/047613
Abstract: Recent technological developments in high-dimensional flow cytometry and mass cytometry (CyTOF) have made it possible to detect expression levels of dozens of protein markers in thousands of cells per second, allowing cell populations to be characterized in unprecedented detail. Traditional data analysis by “manual gating” can be inefficient and unreliable in these high-dimensional settings, which has led to the development of a large number of automated analysis methods. Methods designed for unsupervised analysis use specialized clustering algorithms to detect and define cell populations for further downstream analysis. Here, we have performed an up-to-date, extensible performance comparison of clustering methods for high-dimensional flow and mass cytometry data. We evaluated methods using several publicly available data sets from experiments in immunology, containing both major and rare cell populations, with cell population identities from expert manual gating as the reference standard. Several methods performed well, including FlowSOM, X-shift, PhenoGraph, Rclusterpp, and flowMeans. Among these, FlowSOM had extremely fast runtimes, making this method well-suited for interactive, exploratory analysis of large, high-dimensional data sets on a standard laptop or desktop computer. These results extend previously published comparisons by focusing on high-dimensional data and including new methods developed for CyTOF data. R scripts to reproduce all analyses are available from GitHub (mweber/cytometry-clustering-comparison), and pre-processed data files are available from FlowRepository (FR-FCM-ZZPH), allowing our comparisons to be extended to include new clustering methods and reference data sets.
Publisher: American Association for the Advancement of Science (AAAS)
Date: 12-05-2017
Abstract: Sex comes in many forms, even when considered at the molecular level. In different animals, the chromosomes and specific genes that function in sex determination vary widely. As a case in point, the familiar housefly displays a highly variable sex determination system. In this animal, the male determiner (M-factor) instructs male development when it is active, but female development results when it is inactive. Sharma et al. now identify the housefly M-factor, which arose via the co-option of existing genes, gene duplication, and neofunctionalization. The findings elucidate the remarkable diversity in sex-determining pathways and the forces that drive this diversity. Science, this issue p. 642
Publisher: Springer Science and Business Media LLC
Date: 22-05-2009
Publisher: Cold Spring Harbor Laboratory
Date: 07-04-2015
DOI: 10.1101/017673
Abstract: A correspondence with respect to: Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data Rapaport F, Khanin R, Liang Y, Pirun M, Krek A, Zumbo P, Mason CE, Socci ND and Betel D, Genome Biol 2013, 14:R95
Publisher: Elsevier BV
Date: 2004
DOI: 10.1016/S1097-2765(04)00003-6
Abstract: A remarkably large collection of evolutionarily conserved proteins has been implicated in processing of noncoding RNAs and biogenesis of ribonucleoproteins. To better define the physical and functional relationships among these proteins and their cognate RNAs, we performed 165 highly stringent affinity purifications of known or predicted RNA-related proteins from Saccharomyces cerevisiae. We systematically identified and estimated the relative abundance of stably associated polypeptides and RNA species using a combination of gel densitometry, protein mass spectrometry, and oligonucleotide microarray hybridization. Ninety-two discrete proteins or protein complexes were identified comprising 489 different polypeptides, many associated with one or more specific RNA molecules. Some of the pre-rRNA-processing complexes that were obtained are discrete sub-complexes of those previously described. Among these, we identified the IPI complex required for proper processing of the ITS2 region of the ribosomal RNA primary transcript. This study provides a high-resolution overview of the modular topology of noncoding RNA-processing machinery.
Publisher: Cold Spring Harbor Laboratory
Date: 12-04-2023
DOI: 10.1101/2023.04.12.536513
Abstract: Defects in blood development frequently occur among syndromic congenital anomalies. Thrombocytopenia-Absent Radius (TAR) Syndrome is a rare congenital condition with reduced platelets (hypomegakaryocytic thrombocytopenia) and forelimb anomalies, concurrent with more variable heart and kidney defects. TAR syndrome associates with hypomorphic gene function for RBM8A/Y14 that encodes a component of the exon junction complex involved in mRNA splicing, transport, and nonsense-mediated decay. How perturbing a general mRNA-processing factor causes the selective TAR Syndrome phenotypes remains unknown. Here, we connect zebrafish rbm8a perturbation to early hematopoietic defects via attenuated non-canonical Wnt/Planar Cell Polarity (PCP) signaling that controls developmental cell arrangements. In hypomorphic rbm8a zebrafish, we observe a significant reduction of cd41-positive thrombocytes. rbm8a-mutant zebrafish embryos accumulate mRNAs with individual retained introns, a hallmark of defective nonsense-mediated decay; affected mRNAs include transcripts for non-canonical Wnt/PCP pathway components. We establish that rbm8a-mutant embryos show convergent extension defects and that reduced rbm8a function interacts with perturbations in non-canonical Wnt/PCP pathway genes wnt5b, wnt11f2, fzd7a, and vangl2. Using live-imaging, we found reduced rbm8a function impairs the architecture of the lateral plate mesoderm (LPM) that forms hematopoietic, cardiovascular, kidney, and forelimb skeleton progenitors as affected in TAR Syndrome. Both mutants for rbm8a and for the PCP gene vangl2 feature impaired expression of early hematopoietic/endothelial genes including runx1 and the megakaryocyte regulator gfi1aa. Together, our data propose aberrant LPM patterning and hematopoietic defects as possible consequence of attenuated non-canonical Wnt/PCP signaling upon reduced rbm8a function. These results link TAR Syndrome to a potential LPM origin and developmental mechanism.
Publisher: Oxford University Press (OUP)
Date: 07-2019
Abstract: The extensive generation of RNA sequencing (RNA-seq) data in the last decade has resulted in a myriad of specialized software for its analysis. Each software module typically targets a specific step within the analysis pipeline, making it necessary to join several of them to get a single cohesive workflow. Multiple software programs automating this procedure have been proposed, but often lack modularity, transparency or flexibility. We present ARMOR, which performs an end-to-end RNA-seq data analysis, from raw read files, via quality checks, alignment and quantification, to differential expression testing, geneset analysis and browser-based exploration of the data. ARMOR is implemented using the Snakemake workflow management system and leverages conda environments; Bioconductor objects are generated to facilitate downstream analysis, ensuring seamless integration with many R packages. The workflow is easily implemented by cloning the GitHub repository, replacing the supplied input and reference files and editing a configuration file. Although we have selected the tools currently included in ARMOR, the setup is modular and alternative tools can be easily integrated.
Publisher: Cold Spring Harbor Laboratory
Date: 10-11-2020
DOI: 10.1101/2020.11.09.374447
Abstract: Innovations in single cell technologies have led to a flurry of datasets and computational tools to process and interpret them, including analyses of cell composition changes and transition in cell states. The diffcyt workflow for differential discovery in cytometry data consists of several steps, including preprocessing, cell population identification and differential testing for an association with a binary or continuous covariate. However, the commonly measured quantity of survival time in clinical studies often results in a censored covariate where classical differential testing is inapplicable. To overcome this limitation, multiple methods to directly include censored covariates in differential abundance analysis were examined with the use of simulation studies and a case study. Results show high error control and decent sensitivity for a subset of the methods. The tested methods are implemented in the R package censcyt as an extension of diffcyt and are available at etogerber/censcyt . Methods for the direct inclusion of a censored variable as a predictor in GLMMs are a valid alternative to classical survival analysis methods, such as the Cox proportional hazard model, while allowing for more flexibility in the differential analysis.
Publisher: PeerJ
Date: 17-10-2018
DOI: 10.7287/PEERJ.PREPRINTS.27283V1
Abstract: Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
Publisher: PeerJ
Date: 24-11-2018
DOI: 10.7287/PEERJ.PREPRINTS.27283V2
Abstract: Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies, but also agricultural or clinical situations. In the past 10 years or so, much has been learned about the characteristics of the RNA-seq datasets as well as the performance of the myriad of methods developed. In this review, we give an overall view of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies.
Publisher: American Association for the Advancement of Science (AAAS)
Date: 03-08-2018
DOI: 10.1126/SCIIMMUNOL.AAR4539
Abstract: MYD88 signaling in fibroblastic reticular cells drives the initiation of immune responses in fat-associated lymphoid clusters.
Publisher: Springer Science and Business Media LLC
Date: 31-07-2019
DOI: 10.1038/S41467-019-11272-Z
Abstract: A platform for highly parallel direct sequencing of native RNA strands was recently described by Oxford Nanopore Technologies, but despite initial efforts it remains crucial to further investigate the technology for quantification of complex transcriptomes. Here we undertake native RNA sequencing of polyA+ RNA from two human cell lines, analysing ~5.2 million aligned native RNA reads. To enable informative comparisons, we also perform relevant ONT direct cDNA- and Illumina-sequencing. We find that while native RNA sequencing does enable some of the anticipated advantages, key unexpected aspects currently hamper its performance, most notably the quite frequent inability to obtain full-length transcripts from single reads, as well as difficulties to unambiguously infer their true transcript of origin. While characterising issues that need to be addressed when investigating more complex transcriptomes, our study highlights that with some defined improvements, native RNA sequencing could be an important addition to the mammalian transcriptomics toolbox.
Publisher: Oxford University Press (OUP)
Date: 2019
Abstract: Next-generation sequencing technologies and the availability of an increasing number of mammalian and other genomes allow gene expression studies, particularly RNA sequencing, in many non-model organisms. However, incomplete genome annotation and assignments of genes to functional annotation databases can lead to a substantial loss of information in downstream data analysis. To overcome this, we developed the Mammalian Annotation Database tool (MAdb, madb.ethz.ch) to conveniently provide homologous gene information for selected mammalian species. The assignment between species is performed in three steps: (i) matching official gene symbols, (ii) using ortholog information contained in Ensembl Compara and (iii) pairwise BLAST comparisons of all transcripts. In addition, we developed a new tool (AnnOverlappeR) for the reliable assignment of the National Center for Biotechnology Information (NCBI) and Ensembl gene IDs. The gene lists translated to gene IDs of well-annotated species such as human can be used for improved functional annotation with relevant tools based on Gene Ontology and molecular pathway information. We tested the MAdb on a published RNA-seq data set for the pig and showed clearly improved overrepresentation analysis results based on the assigned human homologous gene identifiers. Using the MAdb revealed a similar list of human homologous genes and functional annotation results regardless of whether starting with gene IDs from NCBI or Ensembl. The MAdb database is accessible via a web interface and a Galaxy application.
Publisher: Springer Science and Business Media LLC
Date: 30-03-2016
DOI: 10.1038/NMETH.3805
Publisher: Springer Science and Business Media LLC
Date: 07-02-2020
DOI: 10.1186/S13059-020-1926-6
Abstract: The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands—or even millions—of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
Publisher: Cold Spring Harbor Laboratory
Date: 09-03-2022
DOI: 10.1101/2022.03.08.483466
Abstract: Continuity, correctness and completeness of genome assemblies are important for many biological projects. Long reads represent a major driver towards delivering high-quality genomes, but not everybody can achieve the necessary coverage for good long-read-only assemblies. Therefore, improving existing assemblies with low-coverage long reads is a promising alternative. The improvements include correction, scaffolding and gap filling. However, most tools perform only one of these tasks and the useful information of reads that supported the scaffolding is lost when running separate programs successively. Therefore, we propose a new tool for combined execution of all three tasks using PacBio or Oxford Nanopore reads. gapless is available at: chmeing/gapless.
Publisher: Rockefeller University Press
Date: 15-12-2014
DOI: 10.1084/JEM.20121192
Abstract: Aberrant Notch activity is oncogenic in several malignancies, but it is unclear how expression or function of downstream elements in the Notch pathway affects tumor growth. Transcriptional regulation by Notch is dependent on interaction with the DNA-binding transcriptional repressor, RBPJ, and consequent derepression or activation of associated gene promoters. We show here that RBPJ is frequently depleted in human tumors. Depletion of RBPJ in human cancer cell lines xenografted into immunodeficient mice resulted in activation of canonical Notch target genes, and accelerated tumor growth secondary to reduced cell death. Global analysis of activated regions of the genome, as defined by differential acetylation of histone H4 (H4ac), revealed that the cell death pathway was significantly dysregulated in RBPJ-depleted tumors. Analysis of transcription factor binding data identified several transcriptional activators that bind promoters with differential H4ac in RBPJ-depleted cells. Functional studies demonstrated that NF-κB and MYC were essential for survival of RBPJ-depleted cells. Thus, loss of RBPJ derepresses target gene promoters, allowing Notch-independent activation by alternate transcription factors that promote tumorigenesis.
Publisher: Society for Neuroscience
Date: 08-04-2019
Publisher: Cold Spring Harbor Laboratory
Date: 10-06-2020
DOI: 10.1101/2020.06.10.136010
Abstract: FUS is a primarily nuclear RNA-binding protein with important roles in RNA processing and transport. FUS mutations disrupting its nuclear localization characterize a subset of amyotrophic lateral sclerosis (ALS-FUS) patients, through an unidentified pathological mechanism. FUS regulates nuclear RNAs, but its role at the synapse is poorly understood. Here, we used super-resolution imaging to determine the physiological localization of extranuclear, neuronal FUS and found it predominantly near the vesicle reserve pool of presynaptic sites. Using CLIP-seq on synaptoneurosome preparations, we identified synaptic RNA targets of FUS that are associated with synapse organization and plasticity. Synaptic FUS was significantly increased in a knock-in mouse model of ALS-FUS, at presymptomatic stages, accompanied by alterations in density and size of GABAergic synapses. RNA-seq of synaptoneurosomes highlighted age-dependent dysregulation of glutamatergic and GABAergic synapses. Our study indicates that FUS accumulation at the synapse in early stages of ALS-FUS results in synaptic impairment, potentially representing an initial trigger of neurodegeneration.
Publisher: Cold Spring Harbor Laboratory
Date: 18-06-2018
DOI: 10.1101/349738
Abstract: High-dimensional flow and mass cytometry allow cell types and states to be characterized in great detail by measuring expression levels of more than 40 targeted protein markers per cell at the single-cell level. However, data analysis can be difficult, due to the large size and dimensionality of datasets as well as limitations of existing computational methods. Here, we present diffcyt, a new computational framework for differential discovery analyses in high-dimensional cytometry data, based on a combination of high-resolution clustering and empirical Bayes moderated tests adapted from transcriptomics. Our approach provides improved statistical performance, including for rare cell populations, along with flexible experimental designs and fast runtimes in an open-source framework.
Publisher: Cold Spring Harbor Laboratory
Date: 09-06-2020
DOI: 10.1101/2020.06.08.140608
Abstract: The arrangement of hypotheses in a hierarchical structure (e.g., phylogenies, cell types) appears in many research fields and indicates different resolutions at which data can be interpreted. A common goal is to find a representative resolution that gives high sensitivity to identify relevant entities (e.g., microbial taxa or cell subpopulations) that are related to a phenotypic outcome (e.g., disease status) while controlling false detections, therefore providing a more compact view of detected entities and summarizing characteristics shared among them. Current methods, either performing hypothesis tests at an arbitrary resolution or testing hypotheses at all possible resolutions leading to nested results, are suboptimal. Moreover, they are not flexible enough to work in situations where each entity has multiple features to consider and different resolutions might be required for different features. For example, in single cell RNA-seq data, an increasing focus is to find differential state genes that change expression within a cell subpopulation in response to an external stimulus. Such differential expression might occur at different resolutions (e.g., all cells or a small set of cells) for different genes. Our new algorithm treeclimbR is designed to fill this gap by exploiting a hierarchical tree of entities, proposing multiple candidates that capture the latent signal and pinpointing branches or leaves that contain features of interest, in a data-driven way. It outperforms currently available methods on synthetic data, and we highlight the approach on various applications, including microbiome and microRNA surveys as well as single cell cytometry and RNA-seq datasets. With the emergence of various multi-resolution genomic datasets, treeclimbR provides a thorough inspection on entities across resolutions and gives additional flexibility to uncover biological associations.
Publisher: Informa UK Limited
Date: 2011
Abstract: DNA methylation primarily occurs at CpG dinucleotides in mammals and is a common epigenetic mark that plays a critical role in the regulation of gene expression. Profiling DNA methylation patterns across the genome is vital to understand DNA methylation changes that occur during development and in disease phenotype. In this study, we compared two commonly used approaches to enrich for methylated DNA regions of the genome, namely methyl-DNA immunoprecipitation (MeDIP) that is based on enrichment with antibodies specific for 5'-methylcytosine (5MeC), and capture of methylated DNA using a methyl-CpG binding domain-based (MBD) protein to discover differentially methylated regions (DMRs) in cancer. The enriched methylated DNA fractions were interrogated on Affymetrix promoter tiling arrays and differentially methylated regions were identified. A detailed validation study of 42 regions was performed using Sequenom MassCLEAVE technique. This detailed analysis revealed that both enrichment techniques are sensitive for detecting DMRs and preferentially identified different CpG rich regions of the prostate cancer genome, with MeDIP commonly enriching for methylated regions with a low CpG density, while MBD capture favors regions of higher CpG density and identifies the greatest proportion of CpG islands. This is the first detailed validation report comparing different methylated DNA enrichment techniques for identifying regions of differential DNA methylation. Our study highlights the importance of understanding the nuances of the methods used for DNA genome-wide methylation analyses so that accurate interpretation of the biology is not overlooked.
Publisher: Informa UK Limited
Date: 09-08-2021
Publisher: Springer Science and Business Media LLC
Date: 2002
Abstract: For effective exposition of biological information, especially with regard to analysis of large-scale data types, researchers need immediate access to multiple categorical knowledge bases and need summary information presented to them on collections of genes, as opposed to the typical one gene at a time. We present here a web-based tool (FunSpec) for statistical evaluation of groups of genes and proteins (e.g. co-regulated genes, protein complexes, genetic interactors) with respect to existing annotations (e.g. functional roles, biochemical properties, localization). FunSpec is available online at funspec.med.utoronto.ca. FunSpec is helpful for interpretation of any data type that generates groups of related genes and proteins, such as gene expression clustering and protein complexes, and is useful for predictive methods employing "guilt-by-association."
Publisher: EMBO
Date: 13-11-2018
Publisher: Springer Science and Business Media LLC
Date: 2004
DOI: 10.1186/JBIOL16
Publisher: Springer Science and Business Media LLC
Date: 10-2018
DOI: 10.1038/S41591-018-0209-1
Abstract: CRISPR-Cas-based genome editing holds great promise for targeting genetic disorders, including inborn errors of hepatocyte metabolism. Precise correction of disease-causing mutations in adult tissues in vivo, however, is challenging. It requires repair of Cas9-induced double-stranded DNA (dsDNA) breaks by homology-directed mechanisms, which are highly inefficient in non-dividing cells. Here we corrected the disease phenotype of adult phenylalanine hydroxylase (Pah)
Publisher: Cold Spring Harbor Laboratory
Date: 12-03-2019
DOI: 10.1101/575951
Abstract: The extensive generation of RNA sequencing (RNA-seq) data in the last decade has resulted in a myriad of specialized software for its analysis. Each software module typically targets a specific step within the analysis pipeline, making it necessary to join several of them to get a single cohesive workflow. Multiple software programs automating this procedure have been proposed, but often lack modularity, transparency or flexibility. We present ARMOR, which performs an end-to-end RNA-seq data analysis, from raw read files, via quality checks, alignment and quantification, to differential expression testing, geneset analysis and browser-based exploration of the data. ARMOR is implemented using the Snakemake workflow management system and leverages conda environments; Bioconductor objects are generated to facilitate downstream analysis, ensuring seamless integration with many R packages. The workflow is easily implemented by cloning the GitHub repository, replacing the supplied input and reference files and editing a configuration file. Although we have selected the tools currently included in ARMOR, the setup is modular and alternative tools can be easily integrated.
Publisher: Springer Science and Business Media LLC
Date: 27-02-2017
Publisher: F1000 Research Ltd
Date: 06-12-2016
DOI: 10.12688/F1000RESEARCH.8900.2
Abstract: There are many instances in genomics data analyses where measurements are made on a multivariate response. For example, alternative splicing can lead to multiple expressed isoforms from the same primary transcript. There are situations where differences (e.g. between normal and disease state) in the relative ratio of expressed isoforms may have significant phenotypic consequences or lead to prognostic capabilities. Similarly, knowledge of single nucleotide polymorphisms (SNPs) that affect splicing, so-called splicing quantitative trait loci (sQTL), will help to characterize the effects of genetic variation on gene expression. RNA sequencing (RNA-seq) has provided an attractive toolbox to carefully unravel alternative splicing outcomes and recently, fast and accurate methods for transcript quantification have become available. We propose a statistical framework based on the Dirichlet-multinomial distribution that can discover changes in isoform usage between conditions and SNPs that affect relative expression of transcripts using these quantifications. The Dirichlet-multinomial model naturally accounts for the differential gene expression without losing information about overall gene abundance and by joint modeling of isoform expression, it has the capability to account for their correlated nature. The main challenge in this approach is to get robust estimates of model parameters with limited numbers of replicates. We approach this by sharing information and show that our method improves on existing approaches in terms of standard statistical performance metrics. The framework is applicable to other multivariate scenarios, such as Poly-A-seq or where beta-binomial models have been applied (e.g., differential DNA methylation). Our method is available as a Bioconductor R package called DRIMSeq.
Publisher: Cold Spring Harbor Laboratory
Date: 30-03-2012
Abstract: The complex relationship between DNA methylation, chromatin modification, and underlying DNA sequence is often difficult to unravel with existing technologies. Here, we describe a novel technique based on high-throughput sequencing of bisulfite-treated chromatin immunoprecipitated DNA (BisChIP-seq), which can directly interrogate genetic and epigenetic processes that occur in normal and diseased cells. Unlike most previous reports based on correlative techniques, we found using direct bisulfite sequencing of Polycomb H3K27me3-enriched DNA from normal and prostate cancer cells that DNA methylation and H3K27me3-marked histones are not always mutually exclusive, but can co-occur in a genomic region-dependent manner. Notably, in cancer, the co-dependency of marks is largely redistributed with an increase of the dual repressive marks at CpG islands and transcription start sites of silent genes. In contrast, there is a loss of DNA methylation in intergenic H3K27me3-marked regions. Allele-specific methylation status derived from the BisChIP-seq data clearly showed that both methylated and unmethylated alleles can simultaneously be associated with H3K27me3 histones, highlighting that DNA methylation status in these regions is not dependent on Polycomb chromatin status. BisChIP-seq is a novel approach that can be widely applied to directly interrogate the genomic relationship between allele-specific DNA methylation, histone modification, or other important epigenetic regulators.
Publisher: Cold Spring Harbor Laboratory
Date: 10-10-2019
DOI: 10.1101/800383
Abstract: DNA methylation is a highly studied epigenetic signature that is associated with regulation of gene expression, whereby genes with high levels of promoter methylation are generally repressed. Genomic imprinting occurs when one of the parental alleles is methylated, i.e., when there is inherited allele-specific methylation (ASM). A special case of imprinting occurs during X chromosome inactivation in females, where one of the two X chromosomes is silenced, in order to achieve dosage compensation between the sexes. Another more widespread form of ASM is sequence dependent (SD-ASM), where ASM is linked to a nearby heterozygous single nucleotide polymorphism (SNP). We developed a method to screen for genomic regions that exhibit loss or gain of ASM in samples from two conditions (treatments, diseases, etc.). The method relies on the availability of bisulfite sequencing data from multiple samples of the two conditions. We leverage other established computational methods to screen for these regions within a new R package called DAMEfinder. It calculates an ASM score for all CpG sites or pairs in the genome of each sample, and then quantifies the change in ASM between conditions. It then clusters nearby CpG sites with consistent change into regions. In the absence of SNP information, our method relies only on reads to quantify ASM. This novel ASM score compares favourably to current methods that also screen for ASM. Not only does it easily discern between imprinted and non-imprinted regions, but also females from males based on X chromosome inactivation. We also applied DAMEfinder to a colorectal cancer dataset and observed that colorectal cancer subtypes are distinguishable according to their ASM signature. We also re-discover known cases of loss of imprinting.
We have designed DAMEfinder to detect regions of differential ASM (DAMEs), which is a more refined definition of differential methylation, and can therefore help in breaking down the complexity of DNA methylation and its influence in development and disease.
Publisher: Springer Science and Business Media LLC
Date: 30-08-2017
DOI: 10.1038/S41598-017-10347-5
Abstract: The Mediterranean fruitfly Ceratitis capitata (medfly) is an invasive agricultural pest of high economic impact and has become an emerging model for developing new genetic control strategies as an alternative to insecticides. Here, we report the successful adaptation of CRISPR-Cas9-based gene disruption in the medfly by injecting in vitro pre-assembled, solubilized Cas9 ribonucleoprotein complexes (RNPs) loaded with gene-specific single guide RNAs (sgRNA) into early embryos. When targeting the eye pigmentation gene white eye (we), a high rate of somatic mosaicism in surviving G0 adults was observed. Germline transmission rate of mutated we alleles by G0 animals was on average above 52%, with individual cases achieving nearly 100%. We further recovered large deletions in the we gene when two sites were simultaneously targeted by two sgRNAs. CRISPR-Cas9 targeting of the Ceratitis ortholog of the Drosophila segmentation paired gene (Ccprd) caused segmental malformations in late embryos and in hatched larvae. Mutant phenotypes correlate with repair by non-homologous end-joining (NHEJ) lesions in the two targeted genes. This simple and highly effective Cas9 RNP-based gene editing to introduce mutations in C. capitata will significantly advance the design and development of new effective strategies for pest control management.
Publisher: F1000 Research Ltd
Date: 13-06-2016
DOI: 10.12688/F1000RESEARCH.8900.1
Abstract: There are many instances in genomics data analyses where measurements are made on a multivariate response. For example, alternative splicing can lead to multiple expressed isoforms from the same primary transcript. There are situations where the total abundance of gene expression does not change (e.g. between normal and disease state), but differences in the relative ratio of expressed isoforms may have significant phenotypic consequences or lead to prognostic capabilities. Similarly, knowledge of single nucleotide polymorphisms (SNPs) that affect splicing, so-called splicing quantitative trait loci (sQTL), will help to characterize the effects of genetic variation on gene expression. RNA sequencing (RNA-seq) has provided an attractive toolbox to carefully unravel alternative splicing outcomes and recently, fast and accurate methods for transcript quantification have become available. We propose a statistical framework based on the Dirichlet-multinomial distribution that can discover changes in isoform usage between conditions and SNPs that affect splicing outcome using these quantifications. The Dirichlet-multinomial model naturally accounts for the differential gene expression without losing information about overall gene abundance and by joint modeling of isoform expression, it has the capability to account for their correlated nature. The main challenge in this approach is to get robust estimates of model parameters with limited numbers of replicates. We approach this by sharing information and show that our method improves on existing approaches in terms of standard statistical performance metrics. The framework is applicable to other multivariate scenarios, such as Poly-A-seq or where beta-binomial models have been applied (e.g., differential DNA methylation). Our method is available as a Bioconductor R package called DRIMSeq.
Publisher: Cold Spring Harbor Laboratory
Date: 25-07-2011
Abstract: Histone H2A.Z (H2A.Z) is an evolutionarily conserved H2A variant implicated in the regulation of gene expression; however, its role in transcriptional deregulation in cancer remains poorly understood. Using genome-wide studies, we investigated the role of promoter-associated H2A.Z and acetylated H2A.Z (acH2A.Z) in gene deregulation and its relationship with DNA methylation and H3K27me3 in prostate cancer. Our results reconcile the conflicting reports of positive and negative roles for histone H2A.Z and gene expression states. We find that H2A.Z is enriched in a bimodal distribution at nucleosomes, surrounding the transcription start sites (TSSs) of both active and poised gene promoters. In addition, H2A.Z spreads across the entire promoter of inactive genes in a deacetylated state. In contrast, acH2A.Z is only localized at the TSSs of active genes. Gene deregulation in cancer is also associated with a reorganization of acH2A.Z and H2A.Z nucleosome occupancy across the promoter region and TSS of genes. Notably, in cancer cells we find that a gain of acH2A.Z at the TSS occurs with an overall decrease of H2A.Z levels, in concert with oncogene activation. Furthermore, deacetylation of H2A.Z at TSSs is increased with silencing of tumor suppressor genes. We also demonstrate that acH2A.Z anti-correlates with promoter H3K27me3 and DNA methylation. We show for the first time that acetylation of H2A.Z is a key modification associated with gene activity in normal cells and epigenetic gene deregulation in tumorigenesis.
Publisher: Informa UK Limited
Date: 02-11-2018
Publisher: Wiley
Date: 12-2016
DOI: 10.1002/CYTO.A.23030
Abstract: Recent technological developments in high-dimensional flow cytometry and mass cytometry (CyTOF) have made it possible to detect expression levels of dozens of protein markers in thousands of cells per second, allowing cell populations to be characterized in unprecedented detail. Traditional data analysis by "manual gating" can be inefficient and unreliable in these high-dimensional settings, which has led to the development of a large number of automated analysis methods. Methods designed for unsupervised analysis use specialized clustering algorithms to detect and define cell populations for further downstream analysis. Here, we have performed an up-to-date, extensible performance comparison of clustering methods for high-dimensional flow and mass cytometry data. We evaluated methods using several publicly available data sets from experiments in immunology, containing both major and rare cell populations, with cell population identities from expert manual gating as the reference standard. Several methods performed well, including FlowSOM, X-shift, PhenoGraph, Rclusterpp, and flowMeans. Among these, FlowSOM had extremely fast runtimes, making this method well-suited for interactive, exploratory analysis of large, high-dimensional data sets on a standard laptop or desktop computer. These results extend previously published comparisons by focusing on high-dimensional data and including new methods developed for CyTOF data. R scripts to reproduce all analyses are available from GitHub (mweber/cytometry-clustering-comparison), and pre-processed data files are available from FlowRepository (FR-FCM-ZZPH), allowing our comparisons to be extended to include new clustering methods and reference data sets. © 2016 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of ISAC.
Publisher: Oxford University Press (OUP)
Date: 19-09-2007
DOI: 10.1093/BIOINFORMATICS/BTM453
Abstract: Motivation: Digital gene expression (DGE) technologies measure gene expression by counting sequence tags. They are sensitive technologies for measuring gene expression on a genomic scale, without the need for prior knowledge of the genome sequence. As the cost of sequencing DNA decreases, the number of DGE datasets is expected to grow dramatically. Various tests of differential expression have been proposed for replicated DGE data using binomial, Poisson, negative binomial or pseudo-likelihood (PL) models for the counts, but none of these are usable when the number of replicates is very small. Results: We develop tests using the negative binomial distribution to model overdispersion relative to the Poisson, and use conditional weighted likelihood to moderate the level of overdispersion across genes. Not only is our strategy applicable even with the smallest number of libraries, but it also proves to be more powerful than previous strategies when more libraries are available. The methodology is equally applicable to other counting technologies, such as proteomic spectral counts. Availability: An R package can be accessed from bioinf.wehi.edu.au/resources/ Contact: smyth@wehi.edu.au Supplementary information: bioinf.wehi.edu.au/resources/
Publisher: F1000 Research Ltd
Date: 22-10-2020
DOI: 10.12688/F1000RESEARCH.26073.1
Abstract: Mass cytometry (CyTOF) has become a method of choice for in-depth characterization of tissue heterogeneity in health and disease, and is currently implemented in multiple clinical trials, where higher quality standards must be met. Currently, preprocessing of raw files is commonly performed in independent standalone tools, which makes it difficult to reproduce. Here, we present an R pipeline based on an updated version of CATALYST that covers all preprocessing steps required for downstream mass cytometry analysis in a fully reproducible way. This new version of CATALYST is based on Bioconductor’s SingleCellExperiment class and fully unit tested. The R-based pipeline includes file concatenation, bead-based normalization, single-cell deconvolution, spillover compensation and live cell gating after debris and doublet removal. Importantly, this pipeline also includes different quality checks to assess machine sensitivity and staining performance while also allowing for batch correction. This pipeline is based on open source R packages and can easily be adapted to different study designs. It therefore has the potential to significantly facilitate the work of CyTOF users while increasing the quality and reproducibility of data generated with this technology.
Publisher: F1000 Research Ltd
Date: 08-08-2022
DOI: 10.12688/F1000RESEARCH.26073.2
Abstract: Mass cytometry (CyTOF) has become a method of choice for in-depth characterization of tissue heterogeneity in health and disease, and is currently implemented in multiple clinical trials, where higher quality standards must be met. Currently, preprocessing of raw files is commonly performed in independent standalone tools, which makes it difficult to reproduce. Here, we present an R pipeline based on an updated version of CATALYST that covers all preprocessing steps required for downstream mass cytometry analysis in a fully reproducible way. This new version of CATALYST is based on Bioconductor’s SingleCellExperiment class and fully unit tested. The R-based pipeline includes file concatenation, bead-based normalization, single-cell deconvolution, spillover compensation and live cell gating after debris and doublet removal. Importantly, this pipeline also includes different quality checks to assess machine sensitivity and staining performance while also allowing for batch correction. This pipeline is based on open source R packages and can easily be adapted to different study designs. It therefore has the potential to significantly facilitate the work of CyTOF users while increasing the quality and reproducibility of data generated with this technology.
Publisher: Springer Science and Business Media LLC
Date: 05-2015
Publisher: F1000 Research Ltd
Date: 16-11-2020
DOI: 10.12688/F1000RESEARCH.15666.3
Abstract: Subpopulation identification, usually via some form of unsupervised clustering, is a fundamental step in the analysis of many single-cell RNA-seq data sets. This has motivated the development and application of a broad range of clustering methods, based on various underlying algorithms. Here, we provide a systematic and extensible performance evaluation of 14 clustering algorithms implemented in R, including both methods developed explicitly for scRNA-seq data and more general-purpose methods. The methods were evaluated using nine publicly available scRNA-seq data sets as well as three simulations with varying degree of cluster separability. The same feature selection approaches were used for all methods, allowing us to focus on the investigation of the performance of the clustering algorithms themselves. We evaluated the ability of recovering known subpopulations, the stability and the run time and scalability of the methods. Additionally, we investigated whether the performance could be improved by generating consensus partitions from multiple individual clustering methods. We found substantial differences in the performance, run time and stability between the methods, with SC3 and Seurat showing the most favorable results. Additionally, we found that consensus clustering typically did not improve the performance compared to the best of the combined methods, but that several of the top-performing methods already perform some type of consensus clustering. All the code used for the evaluation is available on GitHub ( arkrobinsonuzh/scRNAseq_clustering_comparison ). In addition, an R package providing access to data and clustering results, thereby facilitating inclusion of new methods and data sets, is available from Bioconductor ( ackages/DuoClustering2018 ).
Publisher: Rockefeller University Press
Date: 07-11-2016
DOI: 10.1084/JEM.20160897
Abstract: Narcolepsy type 1 is a devastating neurological sleep disorder resulting from the destruction of orexin-producing neurons in the central nervous system (CNS). Despite its striking association with the HLA-DQB1*06:02 allele, the autoimmune etiology of narcolepsy has remained largely hypothetical. Here, we compared peripheral mononucleated cells from narcolepsy patients with HLA-DQB1*06:02-matched healthy controls using high-dimensional mass cytometry in combination with algorithm-guided data analysis. Narcolepsy patients displayed multifaceted immune activation in CD4+ and CD8+ T cells dominated by elevated levels of B cell–supporting cytokines. Additionally, T cells from narcolepsy patients showed increased production of the proinflammatory cytokines IL-2 and TNF. Although it remains to be established whether these changes are primary to an autoimmune process in narcolepsy or secondary to orexin deficiency, these findings are indicative of inflammatory processes in the pathogenesis of this enigmatic disease.
Publisher: Springer Science and Business Media LLC
Date: 17-05-2021
DOI: 10.1186/S13059-021-02368-1
Abstract: treeclimbR is for analyzing hierarchical trees of entities, such as phylogenies or cell types, at different resolutions. It proposes multiple candidates that capture the latent signal and pinpoints branches or leaves that contain features of interest, in a data-driven way. It outperforms currently available methods on synthetic data, and we highlight the approach on various applications, including microbiome and microRNA surveys as well as single-cell cytometry and RNA-seq datasets. With the emergence of various multi-resolution genomic datasets, treeclimbR provides a thorough inspection on entities across resolutions and gives additional flexibility to uncover biological associations.
Publisher: Research Square Platform LLC
Date: 22-06-2022
DOI: 10.21203/RS.3.RS-1702574/V1
Abstract: Single-cell RNA-sequencing is advancing our understanding of synovial pathobiology in inflammatory arthritis. Here, we optimized the protocol for dissociation of synovial biopsies and created a comprehensive reference single-cell atlas of fresh human synovium in inflammatory arthritis. We derived our protocol from the published dissociation method for cryopreserved synovium (Donlin L. et al. Arthritis Res. Ther. 2019) with modifications to enrich synovial cells and minimize cell loss. These modifications enabled consistently high cell yield and viability, thereby minimizing the rate of synovial tissue sample dropout. Our single-cell atlas of the human synovium comprised more than 100’000 unsorted single-cell profiles from 27 synovia of patients with inflammatory arthritis. Synovial cells formed ten lymphoid, 14 myeloid and 17 stromal cell clusters, including IFITM2+ synovial neutrophils. We identified lining SOD2 high SAA1+SAA2+ and transitional SERPINE1+COL5A3+ synovial fibroblasts, exhibiting gene signatures linked to cartilage breakdown (SDC4) and extracellular matrix remodelling (LOXL2, TGFBI, TGFB1), respectively. We uncovered synovial endothelial cell diversity and broadened the transcriptional characterization of tissue-resident FOLR2+ COLEC12+ and SLC40A1+ synovial macrophages, inferring their extracellular matrix sensing and iron recycling activities. Our research brings an efficient synovium dissociation protocol for prospectively collected fresh synovial biopsies and expands the knowledge about human synovium composition in inflammatory arthritis.
Publisher: Frontiers Media SA
Date: 30-01-2018
Publisher: Springer Science and Business Media LLC
Date: 15-07-2013
Publisher: Cold Spring Harbor Laboratory
Date: 09-04-2022
DOI: 10.1101/2022.04.06.487263
Abstract: The lack of understanding as to the cellular and molecular basis of clinical and genetic heterogeneity in progressive multiple sclerosis (MS) has hindered the search for new effective therapies and biomarkers. Here, to address this gap, we analysed 740,000 single nuclei RNAseq profiles of 165 samples of white matter (WM) lesions, normal appearing WM, grey matter (GM) lesions and normal appearing GM from 55 MS patients and 28 controls. We find that gene expression changes in response to MS are highly cell-type specific in WM and GM lesions but are largely shared within an individual cell-type across lesions, following a continuum rather than discrete lesion-specific molecular programs. The major biological determinants of variability in gene expression in MS samples relate to individual patient effects, rather than to lesion types or other metadata. Using multi-omics factor analysis (MOFA+), we identify three subgroups of MS patients with distinct oligodendrocyte composition and WM glial gene expression signatures, suggestive of engagement of different pathological/regenerative processes. The discovery of these three patterns significantly advances our mechanistic understanding of progressive MS, provides a framework to use molecular biomarkers to stratify patients for best therapeutic approaches for progressive MS, and highlights the need for precision-medicine approaches to address heterogeneity among MS patients.
Publisher: Springer Science and Business Media LLC
Date: 15-11-2007
Publisher: Oxford University Press (OUP)
Date: 10-05-2010
DOI: 10.1093/BIOINFORMATICS/BTQ247
Abstract: Summary: Epigenetics, the study of heritable somatic phenotypic changes not related to DNA sequence, has emerged as a critical component of the landscape of gene regulation. The epigenetic layers, such as DNA methylation, histone modifications and nuclear architecture are now being extensively studied in many cell types and disease settings. Few software tools exist to summarize and interpret these datasets. We have created a toolbox of procedures to interrogate and visualize epigenomic data (both array- and sequencing-based) and make available a software package for the cross-platform R language. Availability: The package is freely available under LGPL from the R-Forge web site (repitools.r-forge.r-project.org/) Contact: mrobinson@wehi.edu.au
Publisher: Springer Science and Business Media LLC
Date: 21-02-2010
DOI: 10.1038/NCB2023
Publisher: Cold Spring Harbor Laboratory
Date: 19-08-2015
Abstract: Tandem repeats (TRs) are stretches of DNA that are highly variable in length and mutate rapidly. They are thus an important source of genetic variation. This variation is highly informative for population and conservation genetics. It has also been associated with several pathological conditions and with gene expression regulation. However, genome-wide surveys of TR variation in humans and closely related species have been scarce due to technical difficulties derived from short-read technology. Here we explored the genome-wide diversity of TRs in a panel of 83 human and nonhuman great ape genomes, in a total of six different species, and studied their impact on gene expression evolution. We found that population diversity patterns can be efficiently captured with short TRs (repeat unit length, 1–5 bp). We examined the potential evolutionary role of TRs in gene expression differences between humans and primates by using 30,275 larger TRs (repeat unit length, 2–50 bp). Genes that contained TRs in the promoters, in their 3′ untranslated region, in introns, and in exons had higher expression divergence than genes without repeats in the regions. Polymorphic small repeats (1–5 bp) also had higher expression divergence compared with genes with fixed or no TRs in the gene promoters. Our findings highlight the potential contribution of TRs to human evolution through gene regulation.
Publisher: Proceedings of the National Academy of Sciences
Date: 08-08-2006
Abstract: Mapping transcriptional regulatory networks is difficult because many transcription factors (TFs) are activated only under specific conditions. We describe a generic strategy for identifying genes and pathways induced by individual TFs that does not require knowledge of their normal activation cues. Microarray analysis of 55 yeast TFs that caused a growth phenotype when overexpressed showed that the majority caused increased transcript levels of genes in specific physiological categories, suggesting a mechanism for growth inhibition. Induced genes typically included established targets and genes with consensus promoter motifs, if known, indicating that these data are useful for identifying potential new target genes and binding sites. We identified the sequence 5′-TCACGCAA as a binding sequence for Hms1p, a TF that positively regulates pseudohyphal growth and previously had no known motif. The general strategy outlined here presents a straightforward approach to discovery of TF activities and mapping targets that could be adapted to any organism with transgenic technology.
Publisher: Springer Science and Business Media LLC
Date: 16-03-2020
DOI: 10.1186/S13059-020-01967-8
Abstract: Alternative splicing is a biological process during gene expression that allows a single gene to code for multiple proteins. However, splicing patterns can be altered in some conditions or diseases. Here, we present BANDITS, an R/Bioconductor package to perform differential splicing, at both gene and transcript level, based on RNA-seq data. BANDITS uses a Bayesian hierarchical structure to explicitly model the variability between samples and treats the transcript allocation of reads as latent variables. We perform an extensive benchmark across both simulated and experimental RNA-seq datasets, where BANDITS has extremely favourable performance with respect to the competitors considered.
Publisher: Elsevier BV
Date: 2013
DOI: 10.1016/J.CCR.2012.11.006
Abstract: Epigenetic gene deregulation in cancer commonly occurs through chromatin repression and promoter hypermethylation of tumor-associated genes. However, the mechanism underpinning epigenetic-based gene activation in carcinogenesis is still poorly understood. Here, we identify a mechanism of domain gene deregulation through coordinated long-range epigenetic activation (LREA) of regions that typically span 1 Mb and harbor key oncogenes, microRNAs, and cancer biomarker genes. Gene promoters within LREA domains are characterized by a gain of active chromatin marks and a loss of repressive marks. Notably, although promoter hypomethylation is uncommon, we show that extensive DNA hypermethylation of CpG islands or "CpG-island borders" is strongly related to cancer-specific gene activation or differential promoter usage. These findings have wide ramifications for cancer diagnosis, progression, and epigenetic-based gene therapies.
Publisher: Oxford University Press (OUP)
Date: 09-12-2020
DOI: 10.1093/NAR/GKAA1117
Abstract: Many microRNAs regulate gene expression via atypical mechanisms, which are difficult to discern using native cross-linking methods. To ascertain the scope of non-canonical miRNA targeting, methods are needed that identify all targets of a given miRNA. We designed a new class of miR-CLIP probe, whereby psoralen is conjugated to the 3p arm of a pre-microRNA to capture targetomes of miR-124 and miR-132 in HEK293T cells. Processing of pre-miR-124 yields miR-124 and a 5′-extended isoform, iso-miR-124. Using miR-CLIP, we identified overlapping targetomes from both isoforms. From a set of 16 targets, 13 were differently inhibited at mRNA/protein levels by the isoforms. Moreover, delivery of pre-miR-124 into cells repressed these targets more strongly than individual treatments with miR-124 and iso-miR-124, suggesting that isomirs from one pre-miRNA may function synergistically. By mining the miR-CLIP targetome, we identified nine G-bulged target-sites that are regulated at the protein level by miR-124 but not isomiR-124. Using structural data, we propose a model involving AGO2 helix-7 that suggests why only miR-124 can engage these sites. In summary, access to the miR-124 targetome via miR-CLIP revealed for the first time how heterogeneous processing of miRNAs combined with non-canonical targeting mechanisms expand the regulatory range of a miRNA.
Publisher: Springer Science and Business Media LLC
Date: 24-06-2002
DOI: 10.1038/NG906
Publisher: Research Square Platform LLC
Date: 23-09-2022
DOI: 10.21203/RS.3.RS-2017343/V1
Abstract: The conserved SR-like protein Npl3 promotes splicing of diverse pre-mRNAs. However, the RNA sequence(s) recognized by the RNA Recognition Motifs (RRM1 & RRM2) of Npl3 during the splicing reaction remain elusive. Here, we developed a split-iCRAC approach in yeast to uncover the consensus sequence bound to each RRM. High-resolution NMR structures show that RRM2 recognizes a 5´-GNGG-3´ motif leading to an unusual mille-feuille topology. These structures also reveal how RRM1 preferentially interacts with a CC-dinucleotide upstream of this motif, and how the inter-RRM linker and the region C-terminal to RRM2 contribute to cooperative RNA-binding. Structure-guided functional studies show that Npl3 genetically interacts with U2 snRNP specific factors and we provide evidence that Npl3 melts U2 snRNA stem-loop I, a prerequisite for U2/U6 duplex formation within the catalytic center of the B act spliceosomal complex. Thus, our findings suggest an unanticipated RNA chaperoning role for Npl3 during spliceosome active site formation.
Publisher: Springer Science and Business Media LLC
Date: 10-05-2021
DOI: 10.1186/S12859-021-04125-4
Abstract: Innovations in single cell technologies have led to a flurry of datasets and computational tools to process and interpret them, including analyses of cell composition changes and transition in cell states. The diffcyt workflow for differential discovery in cytometry data consists of several steps, including preprocessing, cell population identification and differential testing for an association with a binary or continuous covariate. However, the commonly measured quantity of survival time in clinical studies often results in a censored covariate where classical differential testing is inapplicable. To overcome this limitation, multiple methods to directly include censored covariates in differential abundance analysis were examined with the use of simulation studies and a case study. Results show that multiple imputation based methods offer on-par performance with the Cox proportional hazards model in terms of sensitivity and error control, while offering flexibility to account for covariates. The tested methods are implemented in the package censcyt as an extension of diffcyt and are available at ackages/censcyt . Methods for the direct inclusion of a censored variable as a predictor in GLMMs are a valid alternative to classical survival analysis methods, such as the Cox proportional hazard model, while allowing for more flexibility in the differential analysis.
Publisher: Oxford University Press (OUP)
Date: 23-06-2008
DOI: 10.1093/BIOINFORMATICS/BTN284
Abstract: Motivation: Analyses of EST data show that alternative splicing is much more widespread than once thought. The advent of exon and tiling microarrays means that researchers now have the capacity to experimentally measure alternative splicing on a genome wide level. New methods are needed to analyze the data from these arrays. Results: We present a method, finding isoforms using robust multichip analysis (FIRMA), for detecting differential alternative splicing in exon array data. FIRMA has been developed for Affymetrix exon arrays, but could in principle be extended to other exon arrays, tiling arrays or splice junction arrays. We have evaluated the method using simulated data, and have also applied it to two datasets: a panel of 11 human tissues and a set of 10 pairs of matched normal and tumor colon tissue. FIRMA is able to detect exons in several genes confirmed by reverse transcriptase PCR. Availability: R code implementing our methods is contributed to the package aroma.affymetrix. Contact: epurdom@stat.berkeley.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Publisher: F1000 Research Ltd
Date: 29-02-2016
DOI: 10.12688/F1000RESEARCH.7563.2
Abstract: High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Various quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that the presence of differential isoform usage can lead to inflated false discovery rates in differential gene expression analyses on simple count matrices but that this can be addressed by incorporating offsets derived from transcript-level abundance estimates. We also show that the problem is relatively minor in several real data sets. Finally, we provide an R package ( tximport ) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
Publisher: Springer Science and Business Media LLC
Date: 29-10-2007
Abstract: Gas chromatography-mass spectrometry (GC-MS) is a robust platform for the profiling of certain classes of small molecules in biological samples. When multiple samples are profiled, including replicates of the same sample and/or different sample states, one needs to account for retention time drifts between experiments. This can be achieved either by the alignment of chromatographic profiles prior to peak detection, or by matching signal peaks after they have been extracted from chromatogram data matrices. Automated retention time correction is particularly important in non-targeted profiling studies. A new approach for matching signal peaks based on dynamic programming is presented. The proposed approach relies on both peak retention times and mass spectra. The alignment of more than two peak lists involves three steps: (1) all possible pairs of peak lists are aligned, and the similarity of each pair of peak lists is estimated; (2) the guide tree is built based on the similarity between the peak lists; (3) peak lists are progressively aligned starting with the two most similar peak lists, following the guide tree until all peak lists are exhausted. When two or more experiments are performed on different sample states, each consisting of multiple replicates, peak lists within each set of replicate experiments are aligned first (within-state alignment), and subsequently the resulting alignments are aligned themselves (between-state alignment). When more than two sets of replicate experiments are present, the between-state alignment also employs the guide tree. We demonstrate the usefulness of this approach on GC-MS metabolic profiling experiments acquired on wild-type and mutant Leishmania mexicana parasites. We propose a progressive method to match signal peaks across multiple GC-MS experiments based on dynamic programming. A sensitive peak similarity function is proposed to balance peak retention time and peak mass spectra similarities. This approach can produce the optimal alignment between an arbitrary number of peak lists, and explicitly models within-state and between-state peak alignment. The accuracy of the proposed method was close to the accuracy of manually-curated peak matching, which required tens of man-hours for the analyzed data sets. The proposed approach may offer significant advantages for processing of high-throughput metabolomics data, especially when large numbers of experimental replicates and multiple sample states are analyzed.
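Illustrative note: the pairwise step described in the abstract above (matching two peak lists by dynamic programming, using both retention time and mass spectra) can be sketched as follows. This is only a minimal sketch under stated assumptions, not the published method. The similarity function (a Gaussian retention-time term multiplied by the cosine similarity of the spectra), the tolerance `rt_tol`, the gap penalty `gap` and all function names are hypothetical choices made for illustration. The guide tree and the progressive alignment of more than two lists are not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two intensity vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def peak_similarity(p, q, rt_tol=2.0):
    """Assumed similarity in [0, 1]: retention-time closeness times
    spectral similarity. Each peak is (retention_time, spectrum)."""
    rt_term = math.exp(-((p[0] - q[0]) ** 2) / (2 * rt_tol ** 2))
    return rt_term * cosine(p[1], q[1])


def align_peaks(list_a, list_b, gap=0.3, rt_tol=2.0):
    """Global alignment of two peak lists by dynamic programming.

    Matching two peaks costs 1 - similarity; leaving a peak unmatched
    costs `gap`. Returns matched index pairs (i in list_a, j in list_b).
    """
    n, m = len(list_a), len(list_b)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], move[i][0] = i * gap, "up"
    for j in range(1, m + 1):
        cost[0][j], move[0][j] = j * gap, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - peak_similarity(list_a[i - 1], list_b[j - 1], rt_tol)
            options = [
                (cost[i - 1][j - 1] + d, "diag"),  # match the two peaks
                (cost[i - 1][j] + gap, "up"),      # peak of list_a unmatched
                (cost[i][j - 1] + gap, "left"),    # peak of list_b unmatched
            ]
            cost[i][j], move[i][j] = min(options, key=lambda o: o[0])
    # Trace back from the bottom-right corner to recover the matches.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif step == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Hypothetical toy data: (retention time, three-channel spectrum).
run_a = [(10.0, [1, 0, 0]), (20.0, [0, 1, 0]), (30.0, [0, 0, 1])]
run_b = [(10.5, [1, 0, 0]), (30.4, [0, 0, 1])]
matches = align_peaks(run_a, run_b)
```

In this toy case the middle peak of `run_a` has no counterpart in `run_b` and is left unmatched, while the first and last peaks are paired despite small retention time drifts.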
Publisher: F1000 Research Ltd
Date: 30-12-2015
DOI: 10.12688/F1000RESEARCH.7563.1
Abstract: High-throughput sequencing of cDNA (RNA-seq) is used extensively to characterize the transcriptome of cells. Many transcriptomic studies aim at comparing either abundance levels or the transcriptome composition between given conditions, and as a first step, the sequencing reads must be used as the basis for abundance quantification of transcriptomic features of interest, such as genes or transcripts. Several different quantification approaches have been proposed, ranging from simple counting of reads that overlap given genomic regions to more complex estimation of underlying transcript abundances. In this paper, we show that gene-level abundance estimates and statistical inference offer advantages over transcript-level analyses, in terms of performance and interpretability. We also illustrate that while the presence of differential isoform usage can lead to inflated false discovery rates in differential expression analyses on simple count matrices and transcript-level abundance estimates improve the performance in simulated data, the difference is relatively minor in several real data sets. Finally, we provide an R package ( tximport ) to help users integrate transcript-level abundance estimates from common quantification pipelines into count-based statistical inference engines.
Publisher: Elsevier BV
Date: 05-2018
Publisher: Frontiers Media SA
Date: 28-04-2022
DOI: 10.3389/FCELL.2022.872688
Abstract: We present an optimized dissociation protocol for preparing high-quality skin cell suspensions for in-depth single-cell RNA-sequencing (scRNA-seq) analysis of fresh and cultured human skin. Our protocol enabled the isolation of a consistently high number of highly viable skin cells from small freshly dissociated punch skin biopsies, which we use for scRNA-seq studies. We recapitulated not only the main cell populations of existing single-cell skin atlases, but also identified rare cell populations, such as mast cells. Furthermore, we effectively isolated highly viable single cells from ex vivo cultured skin biopsy fragments and generated a global single-cell map of the explanted human skin. The quality metrics of the generated scRNA-seq datasets were comparable between freshly dissociated and cultured skin. Overall, by enabling efficient cell isolation and comprehensive cell mapping, our skin dissociation-scRNA-seq workflow can greatly facilitate scRNA-seq discoveries across diverse human skin pathologies and ex vivo skin explant experimentations.
Publisher: Springer Science and Business Media LLC
Date: 21-05-2021
DOI: 10.1038/S41467-021-23188-8
Abstract: Mutations disrupting the nuclear localization of the RNA-binding protein FUS characterize a subset of amyotrophic lateral sclerosis patients (ALS-FUS). FUS regulates nuclear RNAs, but its role at the synapse is poorly understood. Using super-resolution imaging we determined that the localization of FUS within synapses occurs predominantly near the vesicle reserve pool of presynaptic sites. Using CLIP-seq on synaptoneurosomes, we identified synaptic FUS RNA targets, encoding proteins associated with synapse organization and plasticity. Significant increase of synaptic FUS during early disease in a mouse model of ALS was accompanied by alterations in density and size of GABAergic synapses. mRNAs abnormally accumulated at the synapses of 6-month-old ALS-FUS mice were enriched for FUS targets and correlated with those depicting increased short-term mRNA stability via binding primarily on multiple exonic sites. Our study indicates that synaptic FUS accumulation in early disease leads to synaptic impairment, potentially representing an initial trigger of neurodegeneration.
Publisher: Cold Spring Harbor Laboratory
Date: 12-11-2021
DOI: 10.1101/2021.11.11.467877
Abstract: Neurons live for the lifespan of the individual and underlie our ability for lifelong learning and memory. However, aging alters neuron morphology and function resulting in age-related cognitive decline. It is well established that epigenetic alterations are essential for learning and memory, yet few neuron-specific genome-wide epigenetic maps exist into old age. Comprehensive mapping of H3K4me3 and H3K27ac in mouse neurons across lifespan revealed plastic H3K4me3 marking that differentiates neuronal age linked to known characteristics of cellular and neuronal aging. We determined that neurons in old age recapitulate the H3K27ac enrichment at promoters, enhancers and super enhancers from young adult neurons, likely representing a re-activation of pathways to maintain neuronal output. Finally, this study identified new characteristics of neuronal aging, including altered rDNA regulation and epigenetic regulatory mechanisms. Collectively, these findings indicate a key role for epigenetic regulation in neurons, that is inextricably linked with aging.
Publisher: Public Library of Science (PLoS)
Date: 30-05-2019
Publisher: eLife Sciences Publications, Ltd
Date: 29-05-2018
DOI: 10.7554/ELIFE.33761
Abstract: The CRISPR-Cas9 targeted nuclease technology allows the insertion of genetic modifications with single base-pair precision. The preference of mammalian cells to repair Cas9-induced DNA double-strand breaks via error-prone end-joining pathways rather than via homology-directed repair mechanisms, however, leads to relatively low rates of precise editing from donor DNA. Here we show that spatial and temporal co-localization of the donor template and Cas9 via covalent linkage increases the correction rates up to 24-fold, and demonstrate that the effect is mainly caused by an increase of donor template concentration in the nucleus. Enhanced correction rates were observed in multiple cell types and on different genomic loci, suggesting that covalently linking the donor template to the Cas9 complex provides advantages for clinical applications where high-fidelity repair is desired.
Publisher: Cold Spring Harbor Laboratory
Date: 17-07-2020
DOI: 10.1101/2020.07.17.209072
Abstract: In high-throughput sequencing data, performance comparisons between computational tools are essential for making informed decisions in the data processing from raw data to the scientific result. Simulations are a critical part of method comparisons, but for standard Illumina sequencing of genomic DNA, they are often oversimplified, which leads to optimistic results for most tools. ReSeq improves the authenticity of synthetic data by extracting and reproducing key components from real data. Major advancements are the inclusion of systematic errors, a fragment-based coverage model and sampling-matrix estimates based on two-dimensional margins. These improvements lead to a better representation of the original k-mer spectrum and more faithful performance evaluations. ReSeq and all of its code are available at: chmeing/ReSeq
Publisher: Springer Science and Business Media LLC
Date: 26-01-2016
Publisher: Springer Science and Business Media LLC
Date: 08-01-2018
DOI: 10.1038/NM.4466
Abstract: Immune-checkpoint blockade has revolutionized cancer therapy. In particular, inhibition of programmed cell death protein 1 (PD-1) has been found to be effective for the treatment of metastatic melanoma and other cancers. Despite a dramatic increase in progression-free survival, a large proportion of patients do not show durable responses. Therefore, predictive biomarkers of a clinical response are urgently needed. Here we used high-dimensional single-cell mass cytometry and a bioinformatics pipeline for the in-depth characterization of the immune cell subsets in the peripheral blood of patients with stage IV melanoma before and after 12 weeks of anti-PD-1 immunotherapy. During therapy, we observed a clear response to immunotherapy in the T cell compartment. However, before commencing therapy, a strong predictor of progression-free and overall survival in response to anti-PD-1 immunotherapy was the frequency of CD14
Publisher: Oxford University Press (OUP)
Date: 11-07-2007
DOI: 10.1093/BIOSTATISTICS/KXM030
Abstract: We derive a quantile-adjusted conditional maximum likelihood estimator for the dispersion parameter of the negative binomial distribution and compare its performance, in terms of bias, to various other methods. Our estimation scheme outperforms all other methods in very small s les, typical of those from serial analysis of gene expression studies, the motivating data for this study. The impact of dispersion estimation on hypothesis testing is studied. We derive an "exact" test that outperforms the standard approximate asymptotic tests.
Publisher: Cold Spring Harbor Laboratory
Date: 16-07-2020
DOI: 10.1101/2020.07.16.206193
Abstract: Whole genome duplication (WGD) events are common in the evolutionary history of many living organisms. For decades, researchers have been trying to understand the genetic and epigenetic impact of WGD and its underlying molecular mechanisms. Particular attention was given to allopolyploid study systems, species resulting from a hybridization event accompanied by WGD. Investigating the mechanisms behind the survival of a newly formed allopolyploid highlighted the key role of DNA methylation. With the improvement of high-throughput methods, such as whole genome bisulfite sequencing (WGBS), an opportunity opened to further understand the role of DNA methylation at a larger scale and higher resolution. However, only a few studies have applied WGBS to allopolyploids, which might be due to lack of genomic resources combined with a burdensome data analysis process. To overcome these problems, we developed the Automated Reproducible Polyploid EpiGenetic GuIdance workflOw (ARPEGGIO): the first workflow for the analysis of epigenetic data in polyploids. This workflow analyzes WGBS data from allopolyploid species via the genome assemblies of the allopolyploid’s parent species. ARPEGGIO utilizes an updated read classification algorithm (EAGLE-RC) to tackle the challenge of sequence similarity amongst parental genomes. ARPEGGIO offers automation, but more importantly, a complete set of analyses including spot checks starting from raw WGBS data: quality checks, trimming, alignment, methylation extraction, statistical analyses and downstream analyses. A full run of ARPEGGIO outputs a list of genes showing differential methylation. ARPEGGIO’s design focuses on ease of use and reproducibility. ARPEGGIO was made simple to set up, run and interpret, and its implementation includes both package management and containerization. Here we discuss all the steps, challenges and implementation strategies; example datasets are provided to show how to use ARPEGGIO. 
In addition, we also test EAGLE-RC with publicly available datasets given a ground truth, and we show that EAGLE-RC decreases the error rate by 3 to 4 times compared to standard approaches. The goal of ARPEGGIO is to promote, support and improve polyploid research with a reproducible and automated set of analyses in a convenient implementation.
Publisher: American Association for Cancer Research (AACR)
Date: 07-2019
DOI: 10.1158/1538-7445.AM2019-4225
Abstract: Checkpoint inhibitors have significantly accelerated cancer treatment but still a majority of patients do not respond. Biomarker driven patient stratification early to the right immunotherapeutic might enhance response and patient survival. Here we used high-dimensional mass cytometry (CyTOF) combined with machine-learning bioinformatics for the in-depth characterization of immune responses before and during anti-PD-1 immunotherapy. CyTOF allows us to monitor protein expression of 34 markers on a single cell while running 20 samples simultaneously. The analysis is data driven, can be adapted to high throughput approaches and can model arbitrary trial designs such as batch effects and paired designs and is quantitative over millions of events. Using CyTOF as a precision medicine tool we could predict response to anti-PD-1 using liquid blood biopsies. Biobanked peripheral blood mononuclear cells (PBMCs) from 51 patients with stage IV melanoma before and after 12 weeks of anti-PD-1 therapy were analyzed. We observed a clear T cell response on therapy. The most evident difference in responders before therapy was an enhanced frequency of CD14+ CD16+HLA-DRhi classical monocytes. We validated our results using conventional flow and found a clear correlation of enhanced monocyte frequencies before therapy initiation with clinical response such as lower hazard and extended progression-free and overall survival. In a second study we used CyTOF to monitor immune response in 21 non-small cell lung cancer (NSCLC) patients that initially responded and then progressed under anti-PD-1 to a novel combination immunotherapy of anti-PD-1 plus an IL-15 super-agonist (ALT-803). In this phase Ib clinical study a response in the CD8+ T cell compartment was observed. Unexpectedly, our high dimensional unbiased analysis was able to detect and characterize a strong expansion of innate tumor-reactive effector NK cells starting around day 4 of therapy. 
Taken together, our unbiased artificial intelligence driven immune workflow might support patient selection prior to therapy, and serve as a novel tool for precision medicine to select the right drug combination and identify new druggable cell populations. Citation Format: Carsten Krieg, Luis Cardenas, Silvia Guglietta, John Wrangle, Mark Rubinstein, Mark Robinson. Is biomarker-driven precision medicine possible by using high dimensional augmented intelligence assisted analysis of cancer immune responses [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019 (13 Suppl):Abstract nr 4225.
Publisher: Springer Science and Business Media LLC
Date: 07-2016
DOI: 10.1038/NBT.3628
Publisher: Cold Spring Harbor Laboratory
Date: 11-11-2020
DOI: 10.1101/2020.11.11.355693
Abstract: The mesothelium forms epithelial membranes that line the body's cavities and surround the internal organs. Mesothelia widely contribute to organ homeostasis and regeneration, and their dysregulation can result in congenital anomalies of the viscera, ventral wall defects, and mesothelioma tumors. Nonetheless, the embryonic ontogeny and developmental regulation of mesothelium formation has remained uncharted. Here, we combine genetic lineage tracing, in toto live imaging, and single-cell transcriptomics in zebrafish to track mesothelial progenitor origins from the lateral plate mesoderm (LPM). Our single-cell analysis uncovers a post-gastrulation gene expression signature centered on hand2 that delineates distinct progenitor populations within the forming LPM. Combining gene expression analysis and imaging of transgenic reporter zebrafish embryos, we chart the origin of mesothelial progenitors to the lateral-most, hand2-expressing LPM and confirm evolutionary conservation in mouse. Our time-lapse imaging of transgenic hand2 reporter embryos captures zebrafish mesothelium formation, documenting the coordinated cell movements that form pericardium and visceral and parietal peritoneum. We establish that the primordial germ cells migrate associated with the forming mesothelium as ventral migration boundary. Functionally, hand2 mutants fail to close the ventral mesothelium due to perturbed migration of mesothelium progenitors. Analyzing mouse and human mesothelioma tumors hypothesized to emerge from transformed mesothelium, we find de novo expression of LPM-associated transcription factors, and in particular of Hand2, indicating the re-initiation of a developmental transcriptional program in mesothelioma. Taken together, our work outlines a genetic and developmental signature of mesothelial origins centered around Hand2, contributing to our understanding of mesothelial pathologies and mesothelioma.
Publisher: Springer Science and Business Media LLC
Date: 2010
Publisher: Springer Science and Business Media LLC
Date: 11-09-2005
DOI: 10.1038/NG1640
Abstract: The nature of synthetic genetic interactions involving essential genes (those required for viability) has not been previously examined in a broad and unbiased manner. We crossed yeast strains carrying promoter-replacement alleles for more than half of all essential yeast genes to a panel of 30 different mutants with defects in diverse cellular processes. The resulting genetic network is biased toward interactions between functionally related genes, enabling identification of a previously uncharacterized essential gene (PGA1) required for specific functions of the endoplasmic reticulum. But there are also many interactions between genes with dissimilar functions, suggesting that individual essential genes are required for buffering many cellular processes. The most notable feature of the essential synthetic genetic network is that it has an interaction density five times that of nonessential synthetic genetic networks, indicating that most yeast genetic interactions involve at least one essential gene.
Publisher: IBM
Date: 2004
Publisher: Public Library of Science (PLoS)
Date: 17-07-2017
Publisher: Public Library of Science (PLoS)
Date: 29-07-2011
Publisher: American Association for the Advancement of Science (AAAS)
Date: 11-03-2005
Abstract: Signaling pathways transmit information through protein interaction networks that are dynamically regulated by complex extracellular cues. We developed LUMIER (for luminescence-based mammalian interactome mapping), an automated high-throughput technology, to map protein-protein interaction networks systematically in mammalian cells and applied it to the transforming growth factor-β (TGFβ) pathway. Analysis using self-organizing maps and k-means clustering identified links of the TGFβ pathway to the p21-activated kinase (PAK) network, to the polarity complex, and to Occludin, a structural component of tight junctions. We show that Occludin regulates TGFβ type I receptor localization for efficient TGFβ-dependent dissolution of tight junctions during epithelial-to-mesenchymal transitions.
Publisher: Springer Science and Business Media LLC
Date: 2014
DOI: 10.1186/GB4171
Publisher: Informa UK Limited
Date: 06-2004
Publisher: Springer Science and Business Media LLC
Date: 06-2020
DOI: 10.1186/S13072-020-00346-8
Abstract: DNA methylation is a highly studied epigenetic signature that is associated with regulation of gene expression, whereby genes with high levels of promoter methylation are generally repressed. Genomic imprinting occurs when one of the parental alleles is methylated, i.e., when there is inherited allele-specific methylation (ASM). A special case of imprinting occurs during X chromosome inactivation in females, where one of the two X chromosomes is silenced, to achieve dosage compensation between the sexes. Another more widespread form of ASM is sequence dependent (SD-ASM), where ASM is linked to a nearby heterozygous single nucleotide polymorphism (SNP). We developed a method to screen for genomic regions that exhibit loss or gain of ASM in samples from two conditions (treatments, diseases, etc.). The method relies on the availability of bisulfite sequencing data from multiple samples of the two conditions. We leverage other established computational methods to screen for these regions within a new R package called DAMEfinder. It calculates an ASM score for all CpG sites or pairs in the genome of each sample, and then quantifies the change in ASM between conditions. It then clusters nearby CpG sites with consistent change into regions. In the absence of SNP information, our method relies only on reads to quantify ASM. This novel ASM score compares favorably to current methods that also screen for ASM. Not only does it easily discern between imprinted and non-imprinted regions, but also females from males based on X chromosome inactivation. We also applied DAMEfinder to a colorectal cancer dataset and observed that colorectal cancer subtypes are distinguishable according to their ASM signature. We also re-discover known cases of loss of imprinting. 
We have designed DAMEfinder to detect regions of differential ASM (DAMEs), which is a more refined definition of differential methylation, and can therefore help in breaking down the complexity of DNA methylation and its influence in development and disease.
Publisher: Cold Spring Harbor Laboratory
Date: 10-12-2015
DOI: 10.1101/034140
Abstract: CRISPR-Cas9 and related technologies efficiently alter genomic DNA at targeted positions and have far-reaching implications for functional screening and therapeutic gene editing. Understanding and unlocking this potential requires accurate evaluation of editing efficiency. We show that methodological decisions for analyzing sequencing data can significantly affect mutagenesis efficiency estimates and we provide a comprehensive R-based toolkit, CrispRVariants and accompanying web tool CrispRVariantsLite, that resolves and localizes individual mutant alleles with respect to the endonuclease cut site. CrispRVariants-enabled analyses of newly generated and existing genome editing datasets underscore how careful consideration of the full variant spectrum gives insight toward effective guide and amplicon design as well as the mutagenic process.
Publisher: Springer Science and Business Media LLC
Date: 29-03-2023
DOI: 10.1186/S13059-023-02904-1
Abstract: With the emergence of hundreds of single-cell RNA-sequencing (scRNA-seq) datasets, the number of computational tools to analyze aspects of the generated data has grown rapidly. As a result, there is a recurring need to demonstrate whether newly developed methods are truly performant, on their own as well as in comparison to existing tools. Benchmark studies aim to consolidate the space of available methods for a given task and often use simulated data that provide a ground truth for evaluations, thus demanding a high quality standard for results to be credible and transferable to real data. Here, we evaluated methods for synthetic scRNA-seq data generation in their ability to mimic experimental data. Besides comparing gene- and cell-level quality control summaries in both one- and two-dimensional settings, we further quantified these at the batch- and cluster-level. Secondly, we investigate the effect of simulators on clustering and batch correction method comparisons, and, thirdly, which and to what extent quality control summaries can capture reference-simulation similarity. Our results suggest that most simulators are unable to accommodate complex designs without introducing artificial effects, they yield over-optimistic performance of integration and potentially unreliable ranking of clustering methods, and it is generally unknown which summaries are important to ensure effective simulation-based method comparisons.
Publisher: F1000 Research Ltd
Date: 02-03-2021
DOI: 10.12688/F1000RESEARCH.26669.2
Abstract: Data organized into hierarchical structures (e.g., phylogenies or cell types) arises in several biological fields. It is therefore of interest to have data containers that store the hierarchical structure together with the biological profile data, and provide functions to easily access or manipulate data at different resolutions. Here, we present TreeSummarizedExperiment, an R/S4 class that extends the commonly used SingleCellExperiment class by incorporating tree representations of rows and/or columns (represented by objects of the phylo class). It follows the convention of the SummarizedExperiment class, while providing links between the assays and the nodes of a tree to allow data manipulation at arbitrary levels of the tree. The package is designed to be extensible, allowing new functions on the tree (phylo) to be contributed. As the work is based on the SingleCellExperiment class and the phylo class, both of which are popular classes used in many R packages, it is expected to be able to interact seamlessly with many other tools.
Publisher: F1000 Research Ltd
Date: 15-10-2020
DOI: 10.12688/F1000RESEARCH.26669.1
Abstract: Data organized into hierarchical structures (e.g., phylogenies or cell types) arises in several biological fields. It is therefore of interest to have data containers that store the hierarchical structure together with the biological profile data, and provide functions to easily access or manipulate data at different resolutions. Here, we present TreeSummarizedExperiment, an R/S4 class that extends the commonly used SingleCellExperiment class by incorporating tree representations of rows and/or columns (represented by objects of the phylo class). It follows the convention of the SummarizedExperiment class, while providing links between the assays and the nodes of a tree to allow data manipulation at arbitrary levels of the tree. The package is designed to be extensible, allowing new functions on the tree (phylo) to be contributed. As the work is based on the SingleCellExperiment class and the phylo class, both of which are popular classes used in many R packages, it is expected to be able to interact seamlessly with many other tools.
Publisher: Springer Science and Business Media LLC
Date: 30-11-2020
DOI: 10.1038/S41467-020-19894-4
Abstract: Single-cell RNA sequencing (scRNA-seq) has become an empowering technology to profile the transcriptomes of individual cells on a large scale. Early analyses of differential expression have aimed at identifying differences between subpopulations to identify subpopulation markers. More generally, such methods compare expression levels across sets of cells, thus leading to cross-condition analyses. Given the emergence of replicated multi-condition scRNA-seq datasets, an area of increasing focus is making sample-level inferences, termed here as differential state analysis; however, it is not clear which statistical framework best handles this situation. Here, we surveyed methods to perform cross-condition differential state analyses, including cell-level mixed models and methods based on aggregated pseudobulk data. To evaluate method performance, we developed a flexible simulation that mimics multi-sample scRNA-seq data. We analyzed scRNA-seq data from mouse cortex cells to uncover subpopulation-specific responses to lipopolysaccharide treatment, and provide robust tools for multi-condition analysis within the muscat R package.
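The "aggregated pseudobulk data" mentioned above can be illustrated with a small sketch: counts from all cells that share a sample and a subpopulation are summed into one profile per (sample, cluster) pair. This is only an illustration of the general idea under assumed inputs, not the muscat implementation; the function name `pseudobulk` and the toy data are hypothetical.

```python
def pseudobulk(counts, sample_ids, cluster_ids):
    """Sum per-cell gene counts within each (sample, cluster) group.

    counts      -- list of per-cell lists, one count per gene
    sample_ids  -- sample label of each cell
    cluster_ids -- subpopulation label of each cell
    Returns a dict mapping (sample, cluster) -> summed gene counts.
    """
    sums = {}
    for cell, sample, cluster in zip(counts, sample_ids, cluster_ids):
        key = (sample, cluster)
        if key not in sums:
            sums[key] = [0] * len(cell)
        sums[key] = [a + b for a, b in zip(sums[key], cell)]
    return sums


# Hypothetical toy data: 4 cells, 2 genes.
counts = [[1, 0], [2, 3], [0, 5], [4, 1]]
sample_ids = ["s1", "s1", "s1", "s2"]
cluster_ids = ["T", "T", "B", "T"]
bulk = pseudobulk(counts, sample_ids, cluster_ids)
```

After aggregation, each subpopulation has one count vector per sample, so replicate-aware bulk RNA-seq methods can be applied within each subpopulation.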
Publisher: Cold Spring Harbor Laboratory
Date: 15-07-2014
DOI: 10.1101/007120
Abstract: DNA methylation, and specifically the reversible addition of methyl groups at CpG dinucleotides genome-wide, represents an important layer that is associated with the regulation of gene expression. In particular, aberrations in the methylation status have been noted across a diverse set of pathological states, including cancer. With the rapid development and uptake of large scale sequencing of short DNA fragments, there has been an explosion of data analytic methods for processing and discovering changes in DNA methylation across diverse data types. In this mini-review, we aim to condense many of the salient challenges, such as experimental design, statistical methods for differential methylation detection and critical considerations such as cell type composition and the potential confounding that can arise from batch effects, into a compact and accessible format. Our main interests, from a statistical perspective, include the practical use of empirical Bayes or hierarchical models, which have been shown to be immensely powerful and flexible in genomics and the procedures by which control of false discoveries are made. Of course, there are many critical platform-specific data preprocessing aspects that we do not discuss here. In addition, we do not make formal performance comparisons of the methods, but rather describe the commonly used statistical models and many of the pertinent issues; we make some recommendations for further study.
Publisher: Elsevier BV
Date: 12-2004
DOI: 10.1016/J.MIB.2004.10.009
Abstract: A major objective in post-genome research is to fully understand the transcriptional control of each gene and the targets of each transcription factor. In yeast, large-scale experimental and computational approaches have been applied to identify co-regulated genes, cis regulatory elements, and transcription factor DNA binding sites in vivo. Methods for modeling and predicting system behavior, and for reconciling discrepancies among data types, are being explored. The results indicate that a complete and comprehensive yeast transcriptional network will ultimately be achieved.
Publisher: Springer Science and Business Media LLC
Date: 28-08-2005
DOI: 10.1038/NG1630
Abstract: Recent mammalian microarray experiments detected widespread transcription and indicated that there may be many undiscovered multiple-exon protein-coding genes. To explore this possibility, we labeled cDNA from unamplified, polyadenylation-selected RNA samples from 37 mouse tissues to microarrays encompassing 1.14 million exon probes. We analyzed these data using GenRate, a Bayesian algorithm that uses a genome-wide scoring function in a factor graph to infer genes. At a stringent exon false detection rate of 2.7%, GenRate detected 12,145 gene-length transcripts and confirmed 81% of the 10,000 most highly expressed known genes. Notably, our analysis showed that most of the 155,839 exons detected by GenRate were associated with known genes, providing microarray-based evidence that most multiple-exon genes have already been identified. GenRate also detected tens of thousands of potential new exons and reconciled discrepancies in current cDNA databases by 'stitching' new transcribed regions into previously annotated genes.
Publisher: Cold Spring Harbor Laboratory
Date: 07-08-2023
DOI: 10.1101/2023.08.04.552046
Abstract: Single-cell chromatin accessibility assays, such as scATAC-seq, are increasingly employed in individual and joint multi-omic profiling of single cells. As the accumulation of scATAC-seq and multi-omics datasets continues, challenges in analyzing such sparse, noisy, and high-dimensional data become pressing. Specifically, one challenge relates to optimizing the processing of chromatin-level measurements and efficiently extracting information to discern cellular heterogeneity. This is of critical importance, since the identification of cell types is a fundamental step in current single-cell data analysis practices. We benchmarked 8 feature engineering pipelines derived from 5 recent methods to assess their ability to discover and discriminate cell types. By using 10 metrics calculated at the cell embedding, shared nearest neighbor graph, or partition levels, we evaluated the performance of each method at different data processing stages. This comprehensive approach allowed us to thoroughly understand the strengths and weaknesses of each method and the influence of parameter selection. Our analysis provides guidelines for choosing analysis methods for different datasets. Overall, feature aggregation, SnapATAC, and SnapATAC2 outperform latent semantic indexing-based methods. For datasets with complex cell-type structures, SnapATAC and SnapATAC2 are preferred. With large datasets, SnapATAC2 and ArchR are most scalable.
Publisher: Oxford University Press (OUP)
Date: 25-05-2012
DOI: 10.1093/NAR/GKS427
Publisher: Cold Spring Harbor Laboratory
Date: 18-04-2023
DOI: 10.1101/2023.04.17.537189
Abstract: Spatially resolved transcriptomics (SRT) enables scientists to investigate spatial context of mRNA abundance. Here, we introduce DESpace, a novel approach to discover spatially variable genes (SVGs), i.e., genes whose expression varies across the tissue. Our framework inputs all types of SRT data, summarizes spatial information via spatial clusters, and identifies spatially variable genes by performing differential gene expression testing between clusters. Although several methods have been proposed to identify SVGs, our approach adds some unique features; in particular, it allows identifying (and testing) the specific areas of the tissue affected by spatial variability, and it enables joint modelling of multiple samples (i.e., biological replicates). Furthermore, in our benchmarks, DESpace displays a higher true positive rate than competitors, controls for false positive and false discovery rates, and is among the most computationally efficient SVG tools. DESpace is distributed as a Bioconductor R package.
Publisher: Future Medicine Ltd
Date: 08-2010
DOI: 10.2217/EPI.10.36
Abstract: The field of epigenetics is now capitalizing on the vast number of emerging technologies, largely based on second-generation sequencing, which interrogate DNA methylation status and histone modifications genome-wide. However, getting an exhaustive and unbiased view of a methylome at a reasonable cost is proving to be a significant challenge. In this article, we take a closer look at the impact of the DNA sequence and bias effects introduced to datasets by genome-wide DNA methylation technologies and, where possible, explore the bioinformatics tools that deconvolve them. There remains much to be learned about the performance of genome-wide technologies, the data we mine from these assays and how it reflects the actual biology. While there are several methods to interrogate the DNA methylation status genome-wide, our opinion is that no single technique suitably covers the minimum criteria of high coverage and high resolution at a reasonable cost. In fact, the fraction of the methylome that is studied currently depends entirely on the inherent biases of the protocol employed. There is promise for this to change, as the third generation of sequencing technologies is expected to again ‘revolutionize’ the way that we study genomes and epigenomes.
Publisher: Oxford University Press (OUP)
Date: 04-10-2018
DOI: 10.1093/BIOINFORMATICS/BTX631
Abstract: Statistical tools for biological data analysis are often evaluated using synthetic data, designed to mimic the features of a specific type of experimental data. The generalizability of such evaluations depends on how well the synthetic data reproduce the main characteristics of the experimental data, and we argue that an assessment of this similarity should accompany any synthetic dataset used for method evaluation. We describe countsimQC, which provides a straightforward way to generate a stand-alone report that shows the main characteristics of (e.g. RNA-seq) count data and can be provided alongside a publication as verification of the appropriateness of any utilized synthetic data. countsimQC is implemented as an R package (for R versions ≥ 3.4) and is available from soneson/countsimQC under a GPL (≥2) license.
Publisher: Rockefeller University Press
Date: 06-04-2015
DOI: 10.1084/JEM.20141957
Abstract: The epigenetic dysregulation of tumor suppressor genes is an important driver of human carcinogenesis. We have combined genome-wide DNA methylation analyses and gene expression profiling after pharmacological DNA demethylation with functional screening to identify novel tumor suppressors in diffuse large B cell lymphoma (DLBCL). We find that a CpG island in the promoter of the dual-specificity phosphatase DUSP4 is aberrantly methylated in nodal and extranodal DLBCL, irrespective of ABC or GCB subtype, resulting in loss of DUSP4 expression in 75% of examined cases. The DUSP4 genomic locus is further deleted in up to 13% of aggressive B cell lymphomas, and the lack of DUSP4 is a negative prognostic factor in three independent cohorts of DLBCL patients. Ectopic expression of wild-type DUSP4, but not of a phosphatase-deficient mutant, dephosphorylates c-JUN N-terminal kinase (JNK) and induces apoptosis in DLBCL cells. Pharmacological or dominant-negative JNK inhibition restricts DLBCL survival in vitro and in vivo and synergizes strongly with the Bruton’s tyrosine kinase inhibitor ibrutinib. Our results indicate that DLBCL cells depend on JNK signaling for survival. This finding provides a mechanistic basis for the clinical development of JNK inhibitors in DLBCL, ideally in synthetic lethal combinations with inhibitors of chronic active B cell receptor signaling.
Publisher: EMBO
Date: 2007
DOI: 10.1038/MSB4100134
Publisher: Elsevier BV
Date: 11-2015
DOI: 10.1016/J.EJMECH.2015.10.020
Abstract: Aggressive behavior and diffuse infiltrative growth are the main features of Glioblastoma multiforme (GBM), together with the high degree of resistance and recurrence. Evidence indicates that GBM-derived stem cells (GSCs), endowed with unlimited proliferative potential, play a critical role in tumor development and maintenance. Among the many signaling pathways involved in maintaining GSC stemness, tumorigenic potential, and anti-apoptotic properties, the PDK1/Akt pathway is a challenging target to develop new potential agents able to affect GBM resistance to chemotherapy. In an effort to find new PDK1/Akt inhibitors, we rationally designed and synthesized a small family of 2-oxindole derivatives. Among them, compound 3 inhibited PDK1 kinase and downstream effectors such as CHK1, GSK3α and GSK3β, which contribute to GSC survival. Compound 3 appeared to be a good tool for studying the role of the PDK1/Akt pathway in GSC self-renewal and tumorigenicity, and might represent the starting point for the development of more potent and focused multi-target therapies for GBM.
Publisher: Life Science Alliance, LLC
Date: 23-03-2021
Abstract: A key challenge in single-cell RNA-sequencing (scRNA-seq) data analysis is batch effects that can obscure the biological signal of interest. Although there are various tools and methods to correct for batch effects, their performance can vary. Therefore, it is important to understand how batch effects manifest to adjust for them. Here, we systematically explore batch effects across various scRNA-seq datasets according to magnitude, cell type specificity, and complexity. We developed a cell-specific mixing score (cms) that quantifies mixing of cells from multiple batches. By considering distance distributions, the score is able to detect local batch bias as well as differentiate between unbalanced batches and systematic differences between cells of the same cell type. We compare metrics in scRNA-seq data using real and synthetic datasets and whereas these metrics target the same question and are used interchangeably, we find differences in scalability, sensitivity, and ability to handle differentially abundant cell types. We find that cell-specific metrics outperform cell type–specific and global metrics and recommend them for both method benchmarks and batch exploration.
Publisher: Elsevier BV
Date: 04-2019
Publisher: Springer Science and Business Media LLC
Date: 21-01-2011
Abstract: Cancer is commonly associated with widespread disruption of DNA methylation, chromatin modification and miRNA expression. In this study, we established a robust discovery pipeline to identify epigenetically deregulated miRNAs in cancer. Using an integrative approach that combines primary transcription, genome-wide DNA methylation and H3K9Ac marks with microRNA (miRNA) expression, we identified miRNA genes that were epigenetically modified in cancer. We find miR-205, miR-21, and miR-196b to be epigenetically repressed, and miR-615 epigenetically activated in prostate cancer cells. We show that detecting changes in primary miRNA transcription levels is a valuable method for detection of local epigenetic modifications that are associated with changes in mature miRNA expression.
Publisher: Cold Spring Harbor Laboratory
Date: 09-12-2021
DOI: 10.1101/2021.12.08.471089
Abstract: Human cellular models of neurodegeneration require reproducibility and longevity, which is necessary for simulating these age-dependent diseases. Such systems are particularly needed for TDP-43 proteinopathies 1,2 , which involve human-specific mechanisms 3–6 that cannot be directly studied in animal models. To explore the emergence and consequences of TDP-43 pathologies, we generated iPSC-derived, colony morphology neural stem cells (iCoMoNSCs) via manual selection of neural precursors 7 . Single-cell transcriptomics (scRNA-seq) and comparison to independent NSCs 8 , showed that iCoMoNSCs are uniquely homogenous and self-renewing. Differentiated iCoMoNSCs formed a self-organized multicellular system consisting of synaptically connected and electrophysiologically active neurons, which matured into long-lived functional networks. Neuronal and glial maturation in iCoMoNSC-derived cultures was similar to that of cortical organoids 9 . Overexpression of wild-type TDP-43 in a minority of iCoMoNSC-derived neurons led to progressive fragmentation and aggregation, resulting in loss of function and neurotoxicity. scRNA-seq revealed a novel set of misregulated RNA targets coinciding in both TDP-43 overexpressing neurons and patient brains exhibiting loss of nuclear TDP-43. The strongest misregulated target encoded for the synaptic protein NPTX2, which was consistently misaccumulated in ALS and FTLD patient neurons with TDP-43 pathology. Our work directly links TDP-43 misregulation and NPTX2 accumulation, thereby highlighting a new pathway of neurotoxicity.
Publisher: Springer International Publishing
Date: 2014
Publisher: American Association for the Advancement of Science (AAAS)
Date: 14-12-2001
Abstract: In Saccharomyces cerevisiae, more than 80% of the ∼6200 predicted genes are nonessential, implying that the genome is buffered from the phenotypic consequences of genetic perturbation. To evaluate function, we developed a method for systematic construction of double mutants, termed synthetic genetic array (SGA) analysis, in which a query mutation is crossed to an array of ∼4700 deletion mutants. Inviable double-mutant meiotic progeny identify functional relationships between genes. SGA analysis of genes with roles in cytoskeletal organization (BNI1, ARP2, ARC40, BIM1), DNA synthesis and repair (SGS1, RAD27), or uncharacterized functions (BBC1, NBP2) generated a network of 291 interactions among 204 genes. Systematic application of this approach should produce a global map of gene function.
Publisher: Springer Science and Business Media LLC
Date: 08-10-2015
Publisher: Springer Science and Business Media LLC
Date: 17-05-2023
DOI: 10.1186/S13059-023-02962-5
Abstract: Computational methods represent the lifeblood of modern molecular biology. Benchmarking is important for all methods, but with a focus here on computational methods, benchmarking is critical to dissect important steps of analysis pipelines, formally assess performance across common situations as well as edge cases, and ultimately guide users on what tools to use. Benchmarking can also be important for community building and advancing methods in a principled way. We conducted a meta-analysis of recent single-cell benchmarks to summarize the scope, extensibility, and neutrality, as well as technical features and whether best practices in open data and reproducible research were followed. The results highlight that while benchmarks often make code available and are in principle reproducible, they remain difficult to extend, for example, as new methods and new ways to assess methods emerge. In addition, embracing containerization and workflow systems would enhance reusability of intermediate benchmarking results, thus also driving wider adoption.
Publisher: Public Library of Science (PLoS)
Date: 31-01-2006
Publisher: Springer Science and Business Media LLC
Date: 14-05-2019
DOI: 10.1038/S42003-019-0415-5
Abstract: High-dimensional flow and mass cytometry allow cell types and states to be characterized in great detail by measuring expression levels of more than 40 targeted protein markers per cell at the single-cell level. However, data analysis can be difficult, due to the large size and dimensionality of datasets as well as limitations of existing computational methods. Here, we present diffcyt, a new computational framework for differential discovery analyses in high-dimensional cytometry data, based on a combination of high-resolution clustering and empirical Bayes moderated tests adapted from transcriptomics. Our approach provides improved statistical performance, including for rare cell populations, along with flexible experimental designs and fast runtimes in an open-source framework.
Publisher: Elsevier BV
Date: 06-2019
Publisher: Wiley
Date: 19-01-2020
Publisher: Springer New York
Date: 22-09-2012
Publisher: Springer Science and Business Media LLC
Date: 26-02-2018
DOI: 10.1038/NMETH.4612
Abstract: Many methods have been used to determine differential gene expression from single-cell RNA (scRNA)-seq data. We evaluated 36 approaches using experimental and synthetic data and found considerable differences in the number and characteristics of the genes that are called differentially expressed. Prefiltering of lowly expressed genes has important effects, particularly for some of the methods developed for bulk RNA-seq data analysis. However, we found that bulk RNA-seq analysis methods do not generally perform worse than those developed specifically for scRNA-seq. We also present conquer, a repository of consistently processed, analysis-ready public scRNA-seq data sets that is aimed at simplifying method evaluation and reanalysis of published results. Each data set provides abundance estimates for both genes and transcripts, as well as quality control and exploratory analysis reports.
Publisher: Springer New York
Date: 2014
DOI: 10.1007/978-1-4939-0512-6_3
Abstract: The edgeR package, an R-based tool within the Bioconductor project, offers a flexible statistical framework for detection of changes in abundance based on counts. In this chapter, we illustrate the use of edgeR on a human embryonic stem cell dataset, in particular for RNA-seq and ChIP-seq data. We focus on a step-by-step statistical analysis of differential expression, going from raw data to a list of putative differentially expressed genes and give examples of integrative analysis using the ChIP-seq data. We emphasize data quality spot checks and the use of positive controls throughout the process and give practical recommendations for reproducible research.
Publisher: Cold Spring Harbor Laboratory
Date: 29-08-2019
DOI: 10.1101/750018
Abstract: Alternative splicing is a biological process during gene expression that allows a single gene to code for multiple proteins. However, splicing patterns can be altered in some conditions or diseases. Here, we present BANDITS, an R/Bioconductor package to perform differential splicing, at both gene and transcript-level, based on RNA-seq data. BANDITS uses a Bayesian hierarchical structure to explicitly model the variability between samples, and treats the transcript allocation of reads as latent variables. We perform an extensive benchmark across both simulated and experimental RNA-seq datasets, where BANDITS has extremely favorable performance with respect to the competitors considered.
Publisher: Elsevier BV
Date: 10-2004
Publisher: Cold Spring Harbor Laboratory
Date: 22-07-2013
Abstract: Prokaryotes, due to their moderate complexity, are particularly amenable to the comprehensive identification of the protein repertoire expressed under different conditions. We applied a generic strategy to identify a complete expressed prokaryotic proteome, which is based on the analysis of RNA and proteins extracted from matched samples. Saturated transcriptome profiling by RNA-seq provided an endpoint estimate of the protein-coding genes expressed under two conditions which mimic the interaction of Bartonella henselae with its mammalian host. Directed shotgun proteomics experiments were carried out on four subcellular fractions. By specifically targeting proteins which are short, basic, low abundant, and membrane localized, we could eliminate their initial underrepresentation compared to the estimated endpoint. A total of 1250 proteins were identified with an estimated false discovery rate below 1%. This represents 85% of all distinct annotated proteins and ∼90% of the expressed protein-coding genes. Genes that were detected at the transcript but not protein level were found to be highly enriched in several genomic islands. Furthermore, genes that lacked an ortholog and a functional annotation were not detected at the protein level; these may represent examples of overprediction in genome annotations. A dramatic membrane proteome reorganization was observed, including differential regulation of autotransporters, adhesins, and hemin binding proteins. Particularly noteworthy was the complete membrane proteome coverage, which included expression of all members of the VirB/D4 type IV secretion system, a key virulence factor.
Publisher: Springer Science and Business Media LLC
Date: 06-04-2020
DOI: 10.1186/S12885-020-06777-6
Abstract: Identifying molecular differences between primary and metastatic colorectal cancers, now possible with the aid of omics technologies, can improve our understanding of the biological mechanisms of cancer progression and facilitate the discovery of novel treatments for late-stage cancer. We compared the DNA methylomes of primary colorectal cancers (CRCs) and CRC metastases to the liver. Laser microdissection was used to obtain epithelial tissue (10 to 25 × 10⁶ μm²) from sections of fresh-frozen samples of primary CRCs (n = 6), CRC liver metastases (n = 12), and normal colon mucosa (n = 3). DNA extracted from tissues was enriched for methylated sequences with a methylCpG binding domain (MBD) polypeptide-based protocol and subjected to deep sequencing. The performance of this protocol was compared with that of targeted enrichment for bisulfite sequencing used in a previous study of ours. MBD enrichment captured a total of 322,551 genomic regions (249.5 Mb or ~7.8% of the human genome), which included over seven million CpG sites. A few of these regions were differentially methylated at an expected false discovery rate (FDR) of 5% in neoplastic tissues (primaries: 0.67%, i.e., 2155 regions containing 279,441 CpG sites; liver metastases: 1%, i.e., 3223 regions containing 312,723 CpG sites) as compared with normal mucosa samples. Most of the differentially methylated regions (DMRs; 94% in primaries, 70% in metastases) were hypermethylated, and almost 80% of these (1882 of 2396) were present in both lesion types. At 5% FDR, no DMRs were detected in liver metastases vs. primary CRC. However, short regions of low-magnitude hypomethylation were frequent in metastases but rare in primaries.
Hypermethylated DMRs were far more abundant in sequences classified as intragenic, gene-regulatory, or CpG shelves-shores-island segments, whereas hypomethylated DMRs were equally represented in extragenic (mainly, open-sea) and intragenic (mainly, gene bodies) sequences of the genome. Compared with targeted enrichment, MBD capture provided a better picture of the extension of CRC-associated DNA hypermethylation but was less powerful for identifying hypomethylation. Our findings demonstrate that the hypermethylation phenotype in CRC liver metastases remains similar to that of the primary tumor, whereas CRC-associated DNA hypomethylation probably undergoes further progression after the cancer cells have migrated to the liver.
Publisher: The Company of Biologists
Date: 2016
DOI: 10.1242/DEV.134809
Abstract: CRISPR-Cas9 enables efficient sequence-specific mutagenesis for creating somatic or germline mutants of model organisms. Key constraints in vivo remain the expression and delivery of active Cas9-guideRNA ribonucleoprotein complexes (RNPs) with minimal toxicity, variable mutagenesis efficiencies depending on targeting sequence, and high mutation mosaicism. Here, we apply in vitro-assembled, fluorescent Cas9-sgRNA RNPs in solubilizing salt solution to achieve maximal mutagenesis efficiency in zebrafish embryos. MiSeq-based sequence analysis of targeted loci in individual embryos using CrispRVariants, a customized software tool for mutagenesis quantification and visualization, reveals efficient bi-allelic mutagenesis that reaches saturation at several tested gene loci. Such virtually complete mutagenesis exposes loss-of-function phenotypes for candidate genes in somatic mutant embryos for subsequent generation of stable germline mutants. We further show that targeting of non-coding elements in gene-regulatory regions using saturating mutagenesis uncovers functional control elements in transgenic reporters and endogenous genes in injected embryos. Our results establish that optimally solubilized, in vitro assembled fluorescent Cas9-sgRNA RNPs provide a reproducible reagent for direct and scalable loss-of-function studies and applications beyond zebrafish experiments that require maximal DNA cutting efficiency in vivo.
Publisher: Springer Science and Business Media LLC
Date: 17-07-2021
DOI: 10.1186/S12864-021-07845-2
Abstract: Whole genome duplication (WGD) events are common in the evolutionary history of many living organisms. For decades, researchers have been trying to understand the genetic and epigenetic impact of WGD and its underlying molecular mechanisms. Particular attention was given to allopolyploid study systems, species resulting from a hybridization event accompanied by WGD. Investigating the mechanisms behind the survival of a newly formed allopolyploid highlighted the key role of DNA methylation. With the improvement of high-throughput methods, such as whole genome bisulfite sequencing (WGBS), an opportunity opened to further understand the role of DNA methylation at a larger scale and higher resolution. However, only a few studies have applied WGBS to allopolyploids, which might be due to lack of genomic resources combined with a burdensome data analysis process. To overcome these problems, we developed the Automated Reproducible Polyploid EpiGenetic GuIdance workflOw (ARPEGGIO): the first workflow for the analysis of epigenetic data in polyploids. This workflow analyzes WGBS data from allopolyploid species via the genome assemblies of the allopolyploid’s parent species. ARPEGGIO utilizes an updated read classification algorithm (EAGLE-RC), to tackle the challenge of sequence similarity amongst parental genomes. ARPEGGIO offers automation, but more importantly, a complete set of analyses including spot checks starting from raw WGBS data: quality checks, trimming, alignment, methylation extraction, statistical analyses and downstream analyses. A full run of ARPEGGIO outputs a list of genes showing differential methylation. ARPEGGIO was made simple to set up, run and interpret, and its implementation ensures reproducibility by including both package management and containerization. We evaluated ARPEGGIO in two ways.
First, we tested EAGLE-RC’s performance with publicly available datasets given a ground truth, and we show that EAGLE-RC decreases the error rate by 3 to 4 times compared to standard approaches. Second, using the same initial dataset, we show agreement between ARPEGGIO’s output and published results. Compared to other similar workflows, ARPEGGIO is the only one supporting polyploid data. The goal of ARPEGGIO is to promote, support and improve polyploid research with a reproducible and automated set of analyses in a convenient implementation. ARPEGGIO is available at upermaxiste/ARPEGGIO .
Publisher: PeerJ
Date: 23-08-2019
DOI: 10.7287/PEERJ.PREPRINTS.27885V3
Abstract: The recent upswing of microfluidics and combinatorial indexing strategies, further enhanced by very low sequencing costs, have turned single cell sequencing into an empowering technology; analyzing thousands, or even millions, of cells per experimental run is becoming a routine assignment in laboratories worldwide. As a consequence, we are witnessing a data revolution in single cell biology. Although some issues are similar in spirit to those experienced in bulk sequencing, many of the emerging data science problems are unique to single cell analysis; together, they give rise to the new realm of 'Single-Cell Data Science'. Here, we outline twelve challenges that will be central in bringing this new field forward. For each challenge, the current state of the art in terms of prior work is reviewed, and open problems are formulated, with an emphasis on the research goals that motivate them. This compendium is meant to serve as a guideline for established researchers, newcomers and students alike, highlighting interesting and rewarding problems in 'Single-Cell Data Science' for the coming years.
Publisher: Springer Science and Business Media LLC
Date: 30-03-2022
DOI: 10.1038/S41467-022-29311-7
Abstract: The mesothelium lines body cavities and surrounds internal organs, widely contributing to homeostasis and regeneration. Mesothelium disruptions cause visceral anomalies and mesothelioma tumors. Nonetheless, the embryonic emergence of mesothelia remains incompletely understood. Here, we track mesothelial origins in the lateral plate mesoderm (LPM) using zebrafish. Single-cell transcriptomics uncovers a post-gastrulation gene expression signature centered on hand2 in distinct LPM progenitor cells. We map mesothelial progenitors to lateral-most, hand2 -expressing LPM and confirm conservation in mouse. Time-lapse imaging of zebrafish hand2 reporter embryos captures mesothelium formation including pericardium, visceral, and parietal peritoneum. We find primordial germ cells migrate with the forming mesothelium as ventral migration boundary. Functionally, hand2 loss disrupts mesothelium formation with reduced progenitor cells and perturbed migration. In mouse and human mesothelioma, we document expression of LPM-associated transcription factors including Hand2, suggesting re-initiation of a developmental program. Our data connects mesothelium development to Hand2, expanding our understanding of mesothelial pathologies.
Publisher: PeerJ
Date: 06-08-2019
DOI: 10.7287/PEERJ.PREPRINTS.27885V1
Abstract: The recent upswing of microfluidics and combinatorial indexing strategies, further enhanced by very low sequencing costs, have turned single cell sequencing into an empowering technology; analyzing thousands, or even millions, of cells per experimental run is becoming a routine assignment in laboratories worldwide. As a consequence, we are witnessing a data revolution in single cell biology. Although some issues are similar in spirit to those experienced in bulk sequencing, many of the emerging data science problems are unique to single cell analysis; together, they give rise to the new realm of 'Single Cell Data Science'. Here, we outline twelve challenges that will be central in bringing this new field forward. For each challenge, the current state of the art in terms of prior work is reviewed, and open problems are formulated, with an emphasis on the research goals that motivate them. This compendium is meant to serve as a guideline for established researchers, newcomers and students alike, highlighting interesting and rewarding problems in 'Single Cell Data Science' for the coming years.
Publisher: PeerJ
Date: 07-08-2019
DOI: 10.7287/PEERJ.PREPRINTS.27885V2
Abstract: The recent upswing of microfluidics and combinatorial indexing strategies, further enhanced by very low sequencing costs, have turned single cell sequencing into an empowering technology; analyzing thousands, or even millions, of cells per experimental run is becoming a routine assignment in laboratories worldwide. As a consequence, we are witnessing a data revolution in single cell biology. Although some issues are similar in spirit to those experienced in bulk sequencing, many of the emerging data science problems are unique to single cell analysis; together, they give rise to the new realm of 'Single Cell Data Science'. Here, we outline twelve challenges that will be central in bringing this new field forward. For each challenge, the current state of the art in terms of prior work is reviewed, and open problems are formulated, with an emphasis on the research goals that motivate them. This compendium is meant to serve as a guideline for established researchers, newcomers and students alike, highlighting interesting and rewarding problems in 'Single Cell Data Science' for the coming years.
Publisher: Springer Science and Business Media LLC
Date: 18-06-2014
Publisher: Cold Spring Harbor Laboratory
Date: 28-07-2018
DOI: 10.1101/378539
Abstract: Most methods for statistical analysis of RNA-seq data take a matrix of abundance estimates for some type of genomic features as their input, and consequently the quality of any obtained results is directly dependent on the quality of these abundances. Here, we present the junction coverage compatibility (JCC) score, which provides a way to evaluate the reliability of transcript-level abundance estimates as well as the accuracy of transcript annotation catalogs. It works by comparing the observed number of reads spanning each annotated splice junction in a genomic region to the predicted number of junction-spanning reads, inferred from the estimated transcript abundances and the genomic coordinates of the corresponding annotated transcripts. We show that while most genes show good agreement between the observed and predicted junction coverages, there is a small set of genes that do not. Genes with poor agreement are found regardless of the method used to estimate transcript abundances, and the corresponding transcript abundances should be treated with care in any downstream analyses.
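The comparison of observed and predicted junction coverage described in this abstract can be illustrated with a small sketch. The abstract does not state the scoring formula, so the discrepancy measure below (predictions rescaled to the observed total, then the summed absolute difference divided by the observed total) is an assumption chosen for illustration, and the function names and input layout are hypothetical rather than taken from the published software.

```python
def predicted_junction_coverage(transcripts):
    """Sum, per junction, the junction-spanning reads predicted for each transcript.

    `transcripts` maps a transcript ID to {junction_id: predicted_reads}
    (hypothetical input layout, assumed for this sketch).
    """
    predicted = {}
    for junction_reads in transcripts.values():
        for junction, reads in junction_reads.items():
            predicted[junction] = predicted.get(junction, 0.0) + reads
    return predicted


def junction_discrepancy(observed, predicted):
    """Relative disagreement between observed and predicted junction coverage.

    Assumed measure, not necessarily the published JCC formula:
    predictions are rescaled so their total matches the observed total,
    then the absolute differences are summed and divided by that total.
    A value of 0 means the two coverage profiles agree exactly.
    """
    junctions = set(observed) | set(predicted)
    obs_total = sum(observed.get(j, 0.0) for j in junctions)
    if obs_total == 0:
        raise ValueError("no observed junction-spanning reads in this region")
    pred_total = sum(predicted.get(j, 0.0) for j in junctions)
    scale = obs_total / pred_total if pred_total > 0 else 0.0
    diff = sum(
        abs(scale * predicted.get(j, 0.0) - observed.get(j, 0.0))
        for j in junctions
    )
    return diff / obs_total


# Two hypothetical transcripts of one gene sharing junction "j2".
transcripts = {"t1": {"j1": 5.0, "j2": 5.0}, "t2": {"j2": 10.0}}
predicted = predicted_junction_coverage(transcripts)
score = junction_discrepancy({"j1": 10.0, "j2": 30.0}, predicted)
```

Rescaling before comparing means the measure reflects the shape of the coverage profile across junctions rather than overall sequencing depth, which is one reasonable way to read "agreement" in the abstract.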
Publisher: Springer Science and Business Media LLC
Date: 07-08-2017
Publisher: Frontiers Media SA
Date: 16-09-2014
Publisher: No publisher found
Date: 2019
DOI: 10.1242/DMM.039545
Publisher: Springer Science and Business Media LLC
Date: 22-08-2013
Abstract: RNA sequencing (RNA-seq) has been rapidly adopted for the profiling of transcriptomes in many areas of biology, including studies into gene regulation, development and disease. Of particular interest is the discovery of differentially expressed genes across different conditions (e.g., tissues, perturbations) while optionally adjusting for other systematic factors that affect the data-collection process. There are a number of subtle yet crucial aspects of these analyses, such as read counting, appropriate treatment of biological variability, quality control checks and appropriate setup of statistical modeling. Several variations have been presented in the literature, and there is a need for guidance on current best practices. This protocol presents a state-of-the-art computational and statistical RNA-seq differential expression analysis workflow largely based on the free open-source R language and Bioconductor software and, in particular, on two widely used tools, DESeq and edgeR. Hands-on time for typical small experiments (e.g., 4-10 samples) can be <1 h, with computation time <1 d using a standard desktop PC.
Publisher: Springer Science and Business Media LLC
Date: 15-04-2019
DOI: 10.1038/S41467-019-09728-3
Abstract: The splenic white pulp is underpinned by poorly characterized stromal cells that demarcate distinct immune cell microenvironments. Here we establish fibroblastic reticular cell (FRC)-specific fate-mapping in mice to define their embryonic origin and differentiation trajectories. Our data show that all reticular cell subsets descend from multipotent progenitors emerging at embryonic day 19.5 from periarterial progenitors. Commitment of FRC progenitors is concluded during the first week of postnatal life through occupation of niches along developing central arterioles. Single cell transcriptomic analysis facilitated deconvolution of FRC differentiation trajectories and indicated that perivascular reticular cells function both as adult lymphoid organizer cells and mural cell progenitors. The lymphotoxin-β receptor-independent sustenance of postnatal progenitor stemness unveils that systemic immune surveillance in the splenic white pulp is governed through subset specification of reticular cells from a multipotent periarterial progenitor cell. In sum, the finding that discrete signaling events in perivascular niches determine the differentiation trajectories of reticular cell networks explains the development of distinct microenvironmental niches in secondary and tertiary lymphoid tissues that are crucial for the induction and regulation of innate and adaptive immune processes.
Publisher: Springer Science and Business Media LLC
Date: 18-05-2023
DOI: 10.1038/S41590-023-01503-3
Abstract: B cell zone reticular cells (BRCs) form stable microenvironments that direct efficient humoral immunity with B cell priming and memory maintenance being orchestrated across lymphoid organs. However, a comprehensive understanding of systemic humoral immunity is hampered by the lack of knowledge of global BRC sustenance, function and major pathways controlling BRC–immune cell interactions. Here we dissected the BRC landscape and immune cell interactome in human and murine lymphoid organs. In addition to the major BRC subsets underpinning the follicle, including follicular dendritic cells, PI16+ RCs were present across organs and species. As well as BRC-produced niche factors, immune cell-driven BRC differentiation and activation programs governed the convergence of shared BRC subsets, overwriting tissue-specific gene signatures. Our data reveal that a canonical set of immune cell-provided cues enforce bidirectional signaling programs that sustain functional BRC niches across lymphoid organs and species, thereby securing efficient humoral immunity.
Publisher: F1000 Research Ltd
Date: 30-09-2021
DOI: 10.12688/F1000RESEARCH.73493.1
Abstract: Online accounts to keep track of scientific publications, such as Open Researcher and Contributor ID (ORCID) or Google Scholar, can be time consuming to maintain and synchronize. Furthermore, the open access status of publications is often not easily accessible, hindering potential opening of closed publications. To lessen the burden of managing personal profiles, we developed an R Shiny app that allows publication lists from multiple platforms to be retrieved and consolidated, as well as interactive exploration and comparison of publication profiles. A live version can be found at pubassistant.ch.
Publisher: F1000 Research Ltd
Date: 20-12-2021
DOI: 10.12688/F1000RESEARCH.73493.2
Abstract: Online accounts to keep track of scientific publications, such as Open Researcher and Contributor ID (ORCID) or Google Scholar, can be time consuming to maintain and synchronize. Furthermore, the open access status of publications is often not easily accessible, hindering potential opening of closed publications. To lessen the burden of managing personal profiles, we developed an R Shiny app that allows publication lists from multiple platforms to be retrieved and consolidated, as well as interactive exploration and comparison of publication profiles. A live version can be found at pubassistant.ch.
Publisher: Springer Science and Business Media LLC
Date: 09-02-2022
DOI: 10.1038/S41597-022-01137-4
Abstract: Epithelial-mesenchymal transition (EMT) equips breast cancer cells for metastasis and treatment resistance. However, detection, inhibition, and elimination of EMT-undergoing cells is challenging due to the intrinsic heterogeneity of cancer cells and the phenotypic diversity of EMT programs. We comprehensively profiled EMT transition phenotypes in four non-cancerous human mammary epithelial cell lines using a flow cytometry surface marker screen, RNA sequencing, and mass cytometry. EMT was induced in the HMLE and MCF10A cell lines and in the HMLE-Twist-ER and HMLE-Snail-ER cell lines by prolonged exposure to TGFβ1 or 4-hydroxytamoxifen, respectively. Each cell line exhibited a spectrum of EMT transition phenotypes, which we compared to the steady-state phenotypes of fifteen luminal, HER2-positive, and basal breast cancer cell lines. Our data provide multiparametric insights at single-cell level into the phenotypic diversity of EMT at different time points and in four human cellular models. These insights are valuable to better understand the complexity of EMT, to compare EMT transitions between the cellular models used here, and for the design of EMT time course experiments.
Publisher: Cold Spring Harbor Laboratory
Date: 11-12-2020
DOI: 10.1101/2020.12.11.420885
Abstract: A key challenge in single cell RNA-sequencing (scRNA-seq) data analysis are dataset- and batch-specific differences that can obscure the biological signal of interest. While there are various tools and methods to perform data integration and correct for batch effects, their performance can vary between datasets and according to the nature of the bias. Therefore, it is important to understand how batch effects manifest in order to adjust for them in a reliable way. Here, we systematically explore batch effects in a variety of scRNA-seq datasets according to magnitude, cell type specificity and complexity. We developed a cell-specific mixing score (cms) that quantifies how well cells from multiple batches are mixed. By considering distance distributions (in a lower dimensional space), the score is able to detect local batch bias and differentiate between unbalanced batches (i.e., when one cell type is more abundant in a batch) and systematic differences between cells of the same cell type. We implemented cms and related metrics to detect batch effects or measure structure preservation in the CellMixS R/Bioconductor package. We systematically compare different metrics that have been proposed to quantify batch effects or bias in scRNA-seq data using real datasets with known batch effects and synthetic data that mimic various real data scenarios. While these metrics target the same question and are used interchangeably, we find differences in inter- and intra-dataset scalability, sensitivity and in a metric’s ability to handle batch effects with differentially abundant cell types. We find that cell-specific metrics outperform cell type-specific and global metrics and recommend them for both method benchmarks and batch exploration.
Publisher: Springer Science and Business Media LLC
Date: 06-08-2018
Publisher: Cold Spring Harbor Laboratory
Date: 09-08-2012
Abstract: Developments in microarray and high-throughput sequencing (HTS) technologies have resulted in a rapid expansion of research into epigenomic changes that occur in normal development and in the progression of disease, such as cancer. Not surprisingly, copy number variation (CNV) has a direct effect on HTS read densities and can therefore bias differential detection results. We have developed a flexible approach called ABCD-DNA (affinity-based copy-number-aware differential quantitative DNA sequencing analyses) that integrates CNV and other systematic factors directly into the differential enrichment engine.
Publisher: Cold Spring Harbor Laboratory
Date: 24-04-2013
Publisher: F1000 Research Ltd
Date: 13-04-2022
DOI: 10.12688/F1000RESEARCH.73493.3
Abstract: Online accounts to keep track of scientific publications, such as Open Researcher and Contributor ID (ORCID) or Google Scholar, can be time consuming to maintain and synchronize. Furthermore, the open access status of publications is often not easily accessible, hindering potential opening of closed publications. To lessen the burden of managing personal profiles, we developed an R Shiny app that allows publication lists from multiple platforms to be retrieved and consolidated, as well as interactive exploration and comparison of publication profiles. A live version can be found at pubassistant.ch.
Publisher: Cold Spring Harbor Laboratory
Date: 26-07-2019
DOI: 10.1101/713412
Abstract: Single-cell RNA sequencing (scRNA-seq) has quickly become an empowering technology to profile the transcriptomes of individual cells on a large scale. Many early analyses of differential expression have aimed at identifying differences between subpopulations, and thus are focused on finding subpopulation markers either in a single sample or across multiple samples. More generally, such methods can compare expression levels in multiple sets of cells, thus leading to cross-condition analyses. However, given the emergence of replicated multi-condition scRNA-seq datasets, an area of increasing focus is making sample-level inferences, termed here as differential state analysis. For example, one could investigate the condition-specific responses of cell subpopulations measured from patients from each condition; however, it is not clear which statistical framework best handles this situation. In this work, we surveyed the methods available to perform cross-condition differential state analyses, including cell-level mixed models and methods based on aggregated “pseudobulk” data. We developed a flexible simulation platform that mimics both single and multi-sample scRNA-seq data and provide robust tools for multi-condition analysis within the muscat R package.
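The aggregated "pseudobulk" data mentioned in this abstract can be pictured with a minimal sketch: per-cell count vectors are summed within each sample and subpopulation, giving one count vector per group that sample-level methods can then analyse. Summation is only one possible aggregation, and the function name and input layout here are hypothetical; this is not the muscat API.

```python
def pseudobulk(counts, sample_ids, cluster_ids):
    """Sum per-cell count vectors within each (sample, cluster) group.

    counts      -- list of per-cell gene count vectors (equal length)
    sample_ids  -- sample label for each cell
    cluster_ids -- subpopulation label for each cell
    Returns {(sample, cluster): summed count vector}.
    """
    if not (len(counts) == len(sample_ids) == len(cluster_ids)):
        raise ValueError("counts, sample_ids and cluster_ids must align")
    sums = {}
    for vec, sample, cluster in zip(counts, sample_ids, cluster_ids):
        key = (sample, cluster)
        if key not in sums:
            sums[key] = [0] * len(vec)
        acc = sums[key]
        for gene_index, value in enumerate(vec):
            acc[gene_index] += value
    return sums


# Four cells, two genes, two samples, two subpopulations (toy data).
cells = [[1, 0], [2, 3], [0, 1], [4, 4]]
samples = ["s1", "s1", "s2", "s2"]
clusters = ["A", "A", "A", "B"]
bulk = pseudobulk(cells, samples, clusters)
```

After aggregation each sample contributes one observation per subpopulation, so replicate samples, rather than individual cells, become the unit of comparison between conditions.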
Publisher: American Chemical Society (ACS)
Date: 20-10-2004
DOI: 10.1021/PR049909X
Abstract: Although HPLC-ESI-MS/MS is rapidly becoming an indispensable tool for the analysis of peptides in complex mixtures, the sequence coverage it affords is often quite poor. Low protein expression resulting in peptide signal intensities that fall below the limit of detection of the MS system in combination with differences in peptide ionization efficiency plays a significant role in this. A second important factor stems from differences in physicochemical properties of each peptide and how these properties relate to chromatographic retention and ultimate detection. To identify and understand those properties, we compared data from experimentally identified peptides with data from peptides predicted by in silico digest of all corresponding proteins in the experimental set. Three different complex protein mixtures extracted were used to define a training set to evaluate the amino acid retention coefficients based on linear regression analysis. The retention coefficients were also compared with other previous hydrophobic and retention scale. From this, we have constructed an empirical model that can be readily used to predict peptides that are likely to be observed on our HPLC-ESI-MS/MS system based on their physicochemical properties. Finally, we demonstrated that in silico prediction of peptides and their retention coefficients can be used to generate an inclusion list for a targeted mass spectrometric identification of low abundance proteins in complex protein samples. This approach is based on experimentally derived data to calibrate the method and therefore may theoretically be applied to any HPLC-MS/MS system on which data are being generated.
Publisher: Walter de Gruyter GmbH
Date: 12-2018
Abstract: This paper examines issues relating to the perceptions and adoption of open access (OA) and institutional repositories. Using a survey research design, we collected data from academics and other researchers in the humanities, arts and social sciences (HASS) at a university in Australia. We looked at factors influencing choice of publishers and journal outlets, as well as the use of social media and nontraditional channels for scholarly communication. We used an online questionnaire to collect data and used descriptive statistics to analyse the data. Our findings suggest that researchers are highly influenced by traditional measures of quality, such as journal impact factor, and are less concerned with making their work more findable and promoting it through social media. This highlights a disconnect between researchers’ desired outcomes and the efforts that they put in toward the same. Our findings also suggest that institutional policies have the potential to increase OA awareness and adoption. This study contributes to the growing literature on scholarly communication by offering evidence from the HASS field, where limited studies have been conducted. Based on the findings, we recommend that academic librarians engage with faculty through outreach and workshops to change perceptions of OA and the institutional repository.
Publisher: Springer Science and Business Media LLC
Date: 02-02-2015
DOI: 10.1038/NCOMMS6899
Abstract: Epigenetic alterations in the cancer methylome are common in breast cancer and provide novel options for tumour stratification. Here, we perform whole-genome methylation capture sequencing on small amounts of DNA isolated from formalin-fixed, paraffin-embedded tissue from triple-negative breast cancer (TNBC) and matched normal samples. We identify differentially methylated regions (DMRs) enriched with promoters associated with transcription factor binding sites and DNA hypersensitive sites. Importantly, we stratify TNBCs into three distinct methylation clusters associated with better or worse prognosis and identify 17 DMRs that show a strong association with overall survival, including DMRs located in the Wilms tumour 1 (WT1) gene, bi-directional-promoter and antisense WT1-AS. Our data reveal that coordinated hypermethylation can occur in oestrogen receptor-negative disease, and that characterizing the epigenetic framework provides a potential signature to stratify TNBCs. Together, our findings demonstrate the feasibility of profiling the cancer methylome with limited archival tissue to identify regulatory regions associated with cancer.
Publisher: Cold Spring Harbor Laboratory
Date: 17-04-2015
DOI: 10.1101/018200
Abstract: benchmarkR is an R package designed to assess and visualize the performance of statistical methods for datasets that have an independent truth (e.g., simulations or datasets with large-scale validation), in particular for methods that claim to control false discovery rates (FDR). We augment some of the standard performance plots (e.g., receiver operating characteristic, or ROC, curves) with information about how well the methods are calibrated (i.e., whether they achieve their expected FDR control). For example, performance plots are extended with a point to highlight the power or FDR at a user-set threshold (e.g., at a method's estimated 5% FDR). The package contains general containers to store simulation results (SimResults) and methods to create graphical summaries, such as receiver operating characteristic curves (rocX), false discovery plots (fdX) and power-to-achieved FDR plots (powerFDR); each plot is augmented with some form of calibration information. We find these plots to be an improved way to interpret relative performance of statistical methods for genomic datasets where many hypothesis tests are performed. The strategies, however, are general and will find applications in other domains.
Publisher: Springer Science and Business Media LLC
Date: 09-2020
DOI: 10.1186/S13059-020-02136-7
Abstract: We present pipeComp (https://github.com/plger/pipeComp), a flexible R framework for pipeline comparison handling interactions between analysis steps and relying on multi-level evaluation metrics. We apply it to the benchmark of single-cell RNA-sequencing analysis pipelines using simulated and real datasets with known cell identities, covering common methods of filtering, doublet detection, normalization, feature selection, denoising, dimensionality reduction, and clustering. pipeComp can easily integrate any other step, tool, or evaluation metric, allowing extensible benchmarks and easy applications to other fields, as we demonstrate through a study of the impact of removal of unwanted variation on differential expression analysis.
Publisher: Springer Science and Business Media LLC
Date: 28-10-2021
DOI: 10.1186/S12864-021-08057-4
Abstract: Temperature change affects the myriad of concurrent cellular processes in a non-uniform, disruptive manner. While endothermic organisms minimize the challenge of ambient temperature variation by keeping the core body temperature constant, cells of many ectothermic species maintain homeostatic function within a considerable temperature range. The cellular mechanisms enabling temperature acclimation in ectotherms are still poorly understood. At the transcriptional level, the heat shock response has been analyzed extensively. The opposite, the response to sub-optimal temperature, has received lesser attention in particular in animal species. The tissue specificity of transcriptional responses to cool temperature has not been addressed and it is not clear whether a prominent general response occurs. Cis-regulatory elements (CREs), which mediate increased transcription at cool temperature, and responsible transcription factors are largely unknown. The ectotherm Drosophila melanogaster with a presumed temperature optimum around 25 °C was used for transcriptomic analyses of effects of temperatures at the lower end of the readily tolerated range (14–29 °C). Comparative analyses with adult flies and cell culture lines indicated a striking degree of cell-type specificity in the transcriptional response to cool. To identify potential cis-regulatory elements (CREs) for transcriptional upregulation at cool temperature, we analyzed temperature effects on DNA accessibility in chromatin of S2R+ cells. Candidate cis-regulatory elements (CREs) were evaluated with a novel reporter assay for accurate assessment of their temperature-dependency. Robust transcriptional upregulation at low temperature could be demonstrated for a fragment from the pastrel gene, which expresses more transcript and protein at reduced temperatures. This CRE is controlled by the JAK/STAT signaling pathway and antagonizing activities of the transcription factors Pointed and Ets97D.
Beyond a rich data resource for future analyses of transcriptional control within the readily tolerated range of an ectothermic animal, a novel reporter assay permitting quantitative characterization of CRE temperature dependence was developed. Our identification and functional dissection of the pst_E1 enhancer demonstrate the utility of resources and assay. The functional characterization of this CoolUp enhancer provides initial mechanistic insights into transcriptional upregulation induced by a shift to temperatures at the lower end of the readily tolerated range.
Publisher: Cold Spring Harbor Laboratory
Date: 15-04-2022
DOI: 10.1101/2022.04.14.488419
Abstract: Spatially-resolved transcriptomics uncovers patterns of gene expression at supercellular, cellular, or subcellular resolution, providing insights into spatially variable cellular functions, diffusible morphogens, and cell-cell interactions. However, for practical reasons, multiplexed single cell RNA-sequencing remains the most widely used technology for profiling transcriptomes of single cells, especially in the context of large-scale anatomical atlassing. Devising techniques to accurately predict the latent physical positions as well as the latent cell-cell proximities of such dissociated cells represents an exciting and new challenge. Most of the current approaches rely on an ‘autocorrelation’ assumption, i.e., cells with similar transcriptomic profiles are located close to each other in physical space and vice versa. However, this is not always the case in native biological contexts due to complex morphological and functional patterning. To address this challenge, we developed SageNet, a graph neural network approach that spatially reconstructs dissociated single cell data using one or more spatial references. SageNet first estimates a gene-gene interaction network from a reference spatial dataset. This informs the structure of the graph on which the graph neural network is trained to predict the region of dissociated cells. Finally, SageNet produces a low-dimensional embedding of the query dataset, corresponding to the reconstructed spatial coordinates of the dissociated tissue. Furthermore, SageNet reveals spatially informative genes by extracting the most important features from the neural network model. We demonstrate the utility and robust performance of SageNet using molecule-resolved seqFISH and spot-based Spatial Transcriptomics reference datasets as well as dissociated single-cell data, across multiple biological contexts. SageNet is provided as an open-source Python software package at github.com/MarioniLab/SageNet.
Publisher: Elsevier BV
Date: 04-2015
DOI: 10.1016/J.BONE.2014.12.063
Abstract: Wnt pathway targeting is of high clinical interest for treating bone loss disorders such as osteoporosis. These therapies inhibit the action of negative regulators of osteoblastic Wnt signaling. The report that Wnt inhibitory factor 1 (WIF1) was epigenetically silenced via promoter DNA methylation in osteosarcoma (OS) raised potential concerns for such treatment approaches. Here we confirm that Wif1 expression is frequently reduced in OS. However, we demonstrate that silencing is not driven by DNA methylation. Treatment of mouse and human OS cells showed that Wif1 expression was robustly induced by HDAC inhibition but not by methylation inhibition. Consistent with HDAC dependent silencing, the Wif1 locus in OS was characterized by low acetylation levels and a bivalent H3K4/H3K27-trimethylation state. Wif1 expression marked late stages of normal osteoblast maturation and stratified OS tumors based on differentiation stage across species. Culture of OS cells under differentiation inductive conditions increased expression of Wif1. Together these results demonstrate that Wif1 is not targeted for silencing by DNA methylation in OS. Instead, the reduced expression of Wif1 in OS cells is in context with their stage in differentiation.
Publisher: Springer Science and Business Media LLC
Date: 23-01-2020
DOI: 10.1038/S41419-020-2261-2
Abstract: The original version of this article contained an error in the name of one of the co-authors (Erika Owsley). This has been corrected in the PDF and HTML versions.
No related grants have been discovered for Mark Robinson.