ARDC Research Link Australia

Publication

scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning

Publisher: Cold Spring Harbor Laboratory

Date: 12-03-2023

DOI: 10.1101/2023.03.09.531861

Abstract: Annotation of cell-types is a critical step in the analysis of single-cell RNA sequencing (scRNA-seq) data that allows the study of heterogeneity across multiple cell populations. Currently this is most commonly done using unsupervised clustering algorithms, which project single-cell expression data into a lower dimensional space and then cluster cells based on their distances from each other. However, as these methods do not use reference datasets, they can only achieve a rough classification of cell-types, and it is difficult to improve the recognition accuracy further. To effectively solve this issue we propose a novel supervised annotation method, scDeepInsight. The scDeepInsight method is capable of performing manifold assignments. It is competent in executing data integration through batch normalization, performing supervised training on the reference dataset, doing outlier detection and annotating cell-types on query datasets. Moreover, it can help identify active genes or marker genes related to cell-types. The training of the scDeepInsight model is performed in a unique way. Tabular scRNA-seq data are first converted to corresponding images through the DeepInsight methodology. DeepInsight can create a trainable image transformer to convert non-image RNA data to images by comprehensively comparing interrelationships among multiple genes. Subsequently, the converted images are fed into convolutional neural networks (CNNs) such as EfficientNet-b3. This enables automatic feature extraction to identify the cell-types of scRNA-seq samples. We benchmarked scDeepInsight with six other mainstream cell annotation methods. The average accuracy rate of scDeepInsight reached 87.5%, which is more than 7% higher compared with the state-of-the-art methods.
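The tabular-to-image step described in this abstract can be illustrated with a minimal sketch. It is not the DeepInsight or scDeepInsight implementation: it assumes a plain PCA of the gene vectors to place each gene on a pixel grid (an illustrative substitute chosen for brevity), and the function names `fit_pixel_map` and `to_image` are hypothetical.

```python
import numpy as np

def fit_pixel_map(X, size=8):
    """Assign every feature (column of X) to a pixel on a size x size grid.

    Assumption: a 2-D PCA of the feature vectors stands in for the
    feature-placement step; features with similar profiles across
    samples end up on nearby pixels.
    """
    X = np.asarray(X, dtype=float)
    F = X.T                              # one row per feature (gene)
    F = F - F.mean(axis=0)               # centre across features
    U, S, _ = np.linalg.svd(F, full_matrices=False)
    coords = U[:, :2] * S[:2]            # 2-D coordinates per feature
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    # Scale to [0, size - 1] and round to the nearest pixel index.
    pix = np.floor((coords - lo) / span * (size - 1) + 0.5).astype(int)
    return pix                           # shape (n_features, 2)

def to_image(x, pix, size=8):
    """Render one sample as an image; features sharing a pixel are averaged."""
    total = np.zeros((size, size))
    count = np.zeros((size, size))
    np.add.at(total, (pix[:, 0], pix[:, 1]), np.asarray(x, dtype=float))
    np.add.at(count, (pix[:, 0], pix[:, 1]), 1)
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)
```

The pixel map is fitted once on the reference data and then reused for every sample, so the same gene always lands on the same pixel; the resulting images are what a CNN would consume.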

Publication

Protein fold recognition using HMM–HMM alignment and dynamic programming

Publisher: Elsevier BV

Date: 03-2016

DOI: 10.1016/J.JTBI.2015.12.018

Abstract: Detecting three-dimensional structures of protein sequences is a challenging task in biological sciences. For this purpose, protein fold recognition has been utilized as an intermediate step which helps in classifying a novel protein sequence into one of its folds. The process of protein fold recognition encompasses feature extraction of protein sequences and feature identification through suitable classifiers. Several feature extractors have been developed to retrieve useful information from protein sequences. These features are generally extracted from a protein's sequential, physicochemical and evolutionary properties. The performance in terms of recognition accuracy has also gradually improved over the last decade. However, it is yet to reach a reasonable and well-accepted level. In this work, we first applied HMM-HMM alignment of protein sequences from HHblits to extract the profile HMM (PHMM) matrix. Then we computed the distance between respective PHMM matrices using kernelized dynamic programming. We have recorded significant improvement in fold recognition over the state-of-the-art feature extractors. The improvement of recognition accuracy is in the range of 2.7-11.6% when tested on three benchmark datasets from Structural Classification of Proteins.
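To make the idea of a dynamic-programming distance between two profile matrices concrete, here is a minimal sketch. It assumes a plain dynamic-time-warping recurrence with a Euclidean cost between profile rows; the kernelized variant used in the paper is not reproduced, and the function name `dp_distance` is hypothetical.

```python
import numpy as np

def dp_distance(A, B):
    """Alignment cost between two profile matrices of shape (length, n_states).

    Assumption: classic dynamic time warping with Euclidean row-to-row cost.
    The two profiles may differ in length but must share the column count.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n, m = len(A), len(B)
    # cost[i, j] = distance between position i of A and position j of B
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three admissible predecessor paths.
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

Pairwise distances computed this way between a query profile and a set of labelled profiles could then feed a distance-based classifier.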

Publication

OPAL+: Length‐Specific MoRF Prediction in Intrinsically Disordered Protein Sequences

Publisher: Wiley

Date: 02-11-2018

DOI: 10.1002/PMIC.201800058

Abstract: Intrinsically disordered proteins (IDPs) contain long unstructured regions, which play an important role in their function. These intrinsically disordered regions (IDRs) participate in binding events through regions called molecular recognition features (MoRFs). Computational prediction of MoRFs helps identify the potentially functional regions in IDRs. In this study, OPAL+, a novel MoRF predictor, is presented. OPAL+ uses separate models to predict MoRFs of varying lengths along with incorporating the hidden Markov model (HMM) profiles and physicochemical properties of MoRFs and their flanking regions. Together, these features help OPAL+ achieve a marginal performance improvement of 0.4-0.7% over its predecessor for diverse MoRF test sets. This performance improvement comes at the expense of increased run time as a result of the requirement of HMM profiles. OPAL+ is available for download at oneshsharma/OPAL-plus/wiki/OPAL-plus-Download.

Publication

The International HapMap Project

Publisher: Springer Science and Business Media LLC

Date: 12-2003

DOI: 10.1038/NATURE02168

Publication

Cancer LncRNA Census reveals evidence for deep functional conservation of long noncoding RNAs in tumorigenesis

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S42003-019-0741-7

Abstract: Long non-coding RNAs (lncRNAs) are a growing focus of cancer genomics studies, creating the need for a resource of lncRNAs with validated cancer roles. Furthermore, it remains debated whether mutated lncRNAs can drive tumorigenesis, and whether such functions could be conserved during evolution. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, we introduce the Cancer LncRNA Census (CLC), a compilation of 122 GENCODE lncRNAs with causal roles in cancer phenotypes. In contrast to existing databases, CLC requires strong functional or genetic evidence. CLC genes are enriched amongst driver genes predicted from somatic mutations, and display characteristic genomic features. Strikingly, CLC genes are enriched for driver mutations from unbiased, genome-wide transposon-mutagenesis screens in mice. We identified 10 tumour-causing mutations in orthologues of 8 lncRNAs, including LINC-PINT and NEAT1, but not MALAT1. Thus CLC represents a dataset of high-confidence cancer lncRNAs. Mutagenesis maps are a novel means for identifying deeply-conserved roles of lncRNAs in tumorigenesis.

Publication

Subject-Specific-Frequency-Band for Motor Imagery EEG Signal Recognition Based on Common Spatial Spectral Pattern

Publisher: Springer International Publishing

Date: 2019

DOI: 10.1007/978-3-030-29911-8_55

Publication

Comparison of gene expression profiles between Opisthorchis viverrini and Non-opisthorchis viverrini associated human intrahepatic cholangiocarcinoma

Publisher: Ovid Technologies (Wolters Kluwer Health)

Date: 10-2006

DOI: 10.1002/HEP.21330

Abstract: Intrahepatic cholangiocarcinoma (ICC) is the second most common primary cancer in the liver, and its incidence is highest in the northeastern part of Thailand. ICCs in this region are known to be associated with infection with liver flukes, particularly Opisthorchis viverrini (OV), as well as nitrosamines from food. To clarify molecular mechanisms of ICC associated with or without liver flukes, we analyzed gene expression profiles of OV-associated ICCs from 20 Thai patients and compared their profiles with those of 20 Japanese ICCs that were not associated with OV, by means of laser microbeam microdissection and a cDNA microarray containing 27,648 genes. We identified 77 commonly upregulated genes and 325 commonly downregulated genes in the two ICC groups. Unsupervised hierarchical cluster analysis separated the 40 ICCs into two major branches almost completely according to the fluke status. The putative signature of OV-associated ICC exhibited elevated expression of genes involved in xenobiotic metabolism (UGT2B11, UGT1A10, CHST4, SULT1C1), whereas that of non-OV-associated ICC represented enhanced expression of genes related to growth factor signaling (TGFBI, PGF, IGFBP1, IGFBP3). Additional random permutation tests identified a total of 49 genes whose expression levels were significantly different between the two groups. We also identified genes associated with macroscopic type of ICCs. In conclusion, these data may not only contribute to clarification of common and OV-specific mechanisms underlying ICC, but also may serve as a starting point for the identification of novel diagnostic markers or therapeutic targets for the disease.

Publication

Author Correction: The repertoire of mutational signatures in human cancer

Publisher: Springer Science and Business Media LLC

Date: 25-01-2023

DOI: 10.1038/S41586-022-05600-5

Publication

Stepwise iterative maximum likelihood clustering approach

Publisher: Springer Science and Business Media LLC

Date: 24-08-2016

DOI: 10.1186/S12859-016-1184-5

Publication

EvolStruct-Phogly: incorporating structural properties and evolutionary information from profile bigrams for the phosphoglycerylation prediction

Publisher: Springer Science and Business Media LLC

Date: 04-2019

DOI: 10.1186/S12864-018-5383-5

Publication

Integrative pathway enrichment analysis of multivariate omics data

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41467-019-13983-9

Abstract: Multi-omics datasets represent distinct aspects of the central dogma of molecular biology. Such high-dimensional molecular profiles pose challenges to data interpretation and hypothesis generation. ActivePathways is an integrative method that discovers significantly enriched pathways across multiple datasets using statistical data fusion, rationalizes contributing evidence and highlights associated genes. As part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we integrated genes with coding and non-coding mutations and revealed frequently mutated pathways and additional cancer genes with infrequent mutations. We also analyzed prognostic molecular pathways by integrating genomic and transcriptomic features of 1780 breast cancers and highlighted associations with immune response and anti-apoptotic signaling. Integration of ChIP-seq and RNA-seq data for master regulators of the Hippo pathway across normal human tissues identified processes of tissue regeneration and stem cell regulation. ActivePathways is a versatile method that improves systems-level understanding of cellular organization in health and disease through integration of multiple molecular datasets and pathway annotations.

Publication

MoRFPred-plus: Computational Identification of MoRFs in Protein Sequences using Physicochemical Properties and HMM profiles

Publisher: Elsevier BV

Date: 2018

DOI: 10.1016/J.JTBI.2017.10.015

Abstract: Intrinsically Disordered Proteins (IDPs) lack stable tertiary structure and they actively participate in performing various biological functions. These IDPs expose short binding regions called Molecular Recognition Features (MoRFs) that permit interaction with structured protein regions. Upon interaction they undergo a disorder-to-order transition as a result of which their functionality arises. Predicting these MoRFs in disordered protein sequences is a challenging task. In this study, we present MoRFpred-plus, an improved predictor over our previously proposed predictor to identify MoRFs in disordered protein sequences. Two independent propensity scores are computed by incorporating physicochemical properties and HMM profiles; these scores are combined to predict the final MoRF propensity score for a given residue. The first score reflects the characteristics of a query residue to be part of a MoRF region based on the composition and similarity of assumed MoRF and flank regions. The second score reflects the characteristics of a query residue to be part of a MoRF region based on the properties of flanks associated around the given residue in the query protein sequence. The propensity scores are processed and common averaging is applied to generate the final prediction score of MoRFpred-plus. Performance of the proposed predictor is compared with available MoRF predictors, MoRFchibi, MoRFpred, and ANCHOR. Using previously collected training and test sets used to evaluate the mentioned predictors, the proposed predictor outperforms these predictors and generates a lower false positive rate. In addition, MoRFpred-plus is a downloadable predictor, which makes it useful as it can be used as input to other computational tools. oneshsharma/MoRFpred-plus/wiki/MoRFpred-plus:-Download.

Publication

Genome-wide association study identifies three novel loci for type 2 diabetes

Publisher: Oxford University Press (OUP)

Date: 14-08-2013

DOI: 10.1093/HMG/DDT399

Abstract: Although over 60 loci for type 2 diabetes (T2D) have been identified, there still remains a large genetic component to be clarified. To explore unidentified loci for T2D, we performed a genome-wide association study (GWAS) of 6 209 637 single-nucleotide polymorphisms (SNPs), which were directly genotyped or imputed using East Asian references from the 1000 Genomes Project (June 2011 release) in 5976 Japanese patients with T2D and 20 829 nondiabetic individuals. Nineteen unreported loci were selected and taken forward to follow-up analyses. Combined discovery and follow-up analyses (30 392 cases and 34 814 controls) identified three new loci with genome-wide significance, which were MIR129-LEP [rs791595, risk allele = A, risk allele frequency (RAF) = 0.080, P = 2.55 × 10^-13, odds ratio (OR) = 1.17], GPSM1 [rs11787792, risk allele = A, RAF = 0.874, P = 1.74 × 10^-10, OR = 1.15] and SLC16A13 [rs312457, risk allele = G, RAF = 0.078, P = 7.69 × 10^-13, OR = 1.20]. This study demonstrates that GWASs based on the imputation of genotypes using modern reference haplotypes such as that from the 1000 Genomes Project data can assist in identification of new loci for common diseases.

Publication

CNN-Pred: Prediction of single-stranded and double-stranded DNA-binding protein using convolutional neural networks

Publisher: Elsevier BV

Date: 02-2023

DOI: 10.1016/J.GENE.2022.147045

Publication

Patterns of somatic structural variation in human cancer genomes

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41586-019-1913-9

Abstract: A key mutational process in cancer is structural variation, in which rearrangements delete, amplify or reorder genomic segments that range in size from kilobases to whole chromosomes [1–7]. Here we develop methods to group, classify and describe somatic structural variants, using data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), which aggregated whole-genome sequencing data from 2,658 cancers across 38 tumour types [8]. Sixteen signatures of structural variation emerged. Deletions have a multimodal size distribution, assort unevenly across tumour types and patients, are enriched in late-replicating regions and correlate with inversions. Tandem duplications also have a multimodal size distribution, but are enriched in early-replicating regions, as are unbalanced translocations. Replication-based mechanisms of rearrangement generate varied chromosomal structures with low-level copy-number gains and frequent inverted rearrangements. One prominent structure consists of 2–7 templates copied from distinct regions of the genome strung together within one locus. Such cycles of templated insertions correlate with tandem duplications and, in liver cancer, frequently activate the telomerase gene TERT. A wide variety of rearrangement processes are active in cancer, which generate complex configurations of the genome upon which selection can act.

Publication

Bigram-PGK: Phosphoglycerylation prediction using the technique of bigram probabilities of position specific scoring matrix

Publisher: Springer Science and Business Media LLC

Date: 12-2019

DOI: 10.1186/S12860-019-0240-1

Abstract: The biological process known as post-translational modification (PTM) is a condition whereby proteomes are modified that affects normal cell biology, and hence the pathogenesis. A number of PTMs have been discovered in recent years and lysine phosphoglycerylation is one of the fairly recent developments. Even with a large number of proteins being sequenced in the post-genomic era, the identification of phosphoglycerylation remains a big challenge due to factors such as cost, time consumption and inefficiency involved in the experimental efforts. To overcome this issue, computational techniques have emerged to accurately identify phosphoglycerylated lysine residues. However, the computational techniques proposed so far hold limitations to correctly predict this covalent modification. We propose a new predictor in this paper called Bigram-PGK which uses evolutionary information of amino acids to try and predict phosphoglycerylated sites. The benchmark dataset which contains experimentally labelled sites is employed for this purpose and profile bigram occurrences are calculated from position specific scoring matrices of amino acids in the protein sequences. The statistical measures of this work, such as sensitivity, specificity, precision, accuracy, Matthews correlation coefficient and area under ROC curve have been reported to be 0.9642, 0.8973, 0.8253, 0.9193, 0.8330, 0.9306, respectively. The proposed predictor, based on the feature of evolutionary information and support vector machine classifier, has shown great potential to effectively predict phosphoglycerylated and non-phosphoglycerylated lysine residues when compared against the existing predictors. The data and software of this work can be acquired from belavit/Bigram-PGK.
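A profile bigram feature of the kind mentioned in this abstract can be sketched as follows. This assumes one common formulation, in which entry (m, n) sums the product of the score for amino acid m at position i and amino acid n at position i + 1 over the sequence; any normalisation or windowing used by Bigram-PGK is not reproduced, and the function name `profile_bigram` is hypothetical.

```python
import numpy as np

def profile_bigram(pssm):
    """Return the K x K bigram matrix of a PSSM with shape (length, K).

    Assumption: B[m, n] = sum over i of pssm[i, m] * pssm[i + 1, n],
    with K = 20 for a standard amino-acid PSSM.
    """
    pssm = np.asarray(pssm, dtype=float)
    if pssm.ndim != 2 or pssm.shape[0] < 2:
        raise ValueError("PSSM must be 2-D with at least two positions")
    # Rows 0..L-2 paired with rows 1..L-1, summed over positions.
    return pssm[:-1].T @ pssm[1:]
```

Flattening the resulting matrix gives a fixed-length feature vector (400 values for K = 20) regardless of sequence length, which is what makes it convenient as input to a classifier such as a support vector machine.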

Publication

Combined Genetic and Genealogic Studies Uncover a Large BAP1 Cancer Syndrome Kindred Tracing Back Nine Generations to a Common Ancestor from the 1700s

Publisher: Public Library of Science (PLoS)

Date: 18-12-2015

DOI: 10.1371/JOURNAL.PGEN.1005633

Publication

Application of cepstrum analysis and linear predictive coding for motor imaginary task classification

Publisher: IEEE

Date: 12-2015

DOI: 10.1109/APWCCSE.2015.7476214

Publication

Success: evolutionary and structural properties of amino acids prove effective for succinylation site prediction

Publisher: Springer Science and Business Media LLC

Date: 2018

DOI: 10.1186/S12864-017-4336-8

Publication

Importance of dimensionality reduction in protein fold recognition

Publisher: IEEE

Date: 12-2015

DOI: 10.1109/APWCCSE.2015.7476132

Publication

OPAL: prediction of MoRF regions in intrinsically disordered protein sequences

Publisher: Oxford University Press (OUP)

Date: 18-01-2018

DOI: 10.1093/BIOINFORMATICS/BTY032

Abstract: Intrinsically disordered proteins lack stable 3-dimensional structure and play a crucial role in performing various biological functions. Key to their biological function are the molecular recognition features (MoRFs) located within long disordered regions. Computationally identifying these MoRFs from disordered protein sequences is a challenging task. In this study, we present a new MoRF predictor, OPAL, to identify MoRFs in disordered protein sequences. OPAL utilizes two independent sources of information computed using different component predictors. The scores are processed and combined using common averaging method. The first score is computed using a component MoRF predictor which utilizes composition and sequence similarity of MoRF and non-MoRF regions to detect MoRFs. The second score is calculated using half-sphere exposure (HSE), solvent accessible surface area (ASA) and backbone angle information of the disordered protein sequence, using information from the amino acid properties of flanks surrounding the MoRFs to distinguish MoRF and non-MoRF residues. OPAL is evaluated using test sets that were previously used to evaluate MoRF predictors, MoRFpred, MoRFchibi and MoRFchibi-web. The results demonstrate that OPAL outperforms all the available MoRF predictors and is the most accurate predictor available for MoRF prediction. It is available at ools/opal/. Supplementary data are available at Bioinformatics online.
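The score-combination step described in this abstract can be illustrated with a minimal sketch. It assumes that "common averaging" means the arithmetic mean of two per-residue scores that are already on a comparable scale; the processing OPAL applies beforehand is not reproduced, and the function name `combine_scores` is hypothetical.

```python
def combine_scores(score_a, score_b):
    """Average two per-residue propensity score lists of equal length.

    Assumption: both component predictors emit one score per residue,
    already scaled to a comparable range.
    """
    if len(score_a) != len(score_b):
        raise ValueError("score lists must cover the same residues")
    return [(a + b) / 2 for a, b in zip(score_a, score_b)]
```

For example, residues scored 0.25 and 0.75 by the two components would receive a combined score of 0.5.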

Publication

A Deep Learning Approach for Motor Imagery EEG Signal Classification

Publisher: IEEE

Date: 12-2016

DOI: 10.1109/APWC-ON-CSE.2016.017

Publication

DeepInsight-FS: Selecting features for non-image data using convolutional neural network

Publisher: Cold Spring Harbor Laboratory

Date: 19-09-2020

DOI: 10.1101/2020.09.17.301515

Abstract: Identifying smaller element or gene subsets from biological or other data types is an essential step in discovering underlying mechanisms. Statistical machine learning methods have played a key role in revealing gene subsets. However, growing data complexity is pushing the limits of these techniques. A review of the recent literature shows that arranging elements by similarity in image-form for a convolutional neural network (CNN) improves classification performance over treating them individually. Expanding on this, here we show a pipeline, DeepInsight-FS, to uncover gene subsets of clinical relevance. DeepInsight-FS converts non-image samples into image-form and performs element selection via CNN. To our knowledge, this is the first approach to employ CNN for element or gene selection on non-image data. A real world application of DeepInsight-FS to publicly available cancer data identified gene sets with significant overlap to several cancer-associated pathways suggesting the potential of this method to discover biomedically meaningful connections.

Publication

A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41467-019-13825-8

Abstract: In cancer, the primary tumour’s organ of origin and histopathology are the strongest determinants of its clinical behaviour, but in 3% of cases a patient presents with a metastatic tumour and no obvious primary. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, we train a deep learning classifier to predict cancer type based on patterns of somatic passenger mutations detected in whole genome sequencing (WGS) of 2606 tumours representing 24 common cancer types produced by the PCAWG Consortium. Our classifier achieves an accuracy of 91% on held-out tumor samples and 88% and 83% respectively on independent primary and metastatic samples, roughly double the accuracy of trained pathologists when presented with a metastatic tumour without knowledge of the primary. Surprisingly, adding information on driver mutations reduced accuracy. Our results have clinical applicability, underscore how patterns of somatic passenger mutations encode the state of the cell of origin, and can inform future strategies to detect the source of circulating tumour DNA.

Publication

Hierarchical Maximum Likelihood Clustering Approach

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 2017

DOI: 10.1109/TBME.2016.2542212

Publication

A comparison of machine learning classifiers for dementia with Lewy bodies using miRNA expression data

Publisher: Springer Science and Business Media LLC

Date: 30-10-2019

DOI: 10.1186/S12920-019-0607-3

Abstract: Dementia with Lewy bodies (DLB) is the second most common subtype of neurodegenerative dementia in humans following Alzheimer’s disease (AD). Present clinical diagnosis of DLB has high specificity and low sensitivity and finding potential biomarkers of prodromal DLB is still challenging. MicroRNAs (miRNAs) have recently received a lot of attention as a source of novel biomarkers. In this study, using serum miRNA expression of 478 Japanese individuals, we investigated potential miRNA biomarkers and constructed an optimal risk prediction model based on several machine learning methods: penalized regression, random forest, support vector machine, and gradient boosting decision tree. The final risk prediction model, constructed via a gradient boosting decision tree using 180 miRNAs and two clinical features, achieved an accuracy of 0.829 on an independent test set. We further predicted candidate target genes from the miRNAs. Gene set enrichment analysis of the miRNA target genes revealed 6 functional genes included in the DHA signaling pathway associated with DLB pathology. Two of them were further supported by gene-based association studies using a large number of single nucleotide polymorphism markers (BCL2L1: P = 0.012, PIK3R2: P = 0.021). Our proposed prediction model provides an effective tool for DLB classification. Also, a gene-based association test of rare variants revealed that BCL2L1 and PIK3R2 were statistically significantly associated with DLB.

Publication

Divisive hierarchical maximum likelihood clustering

Publisher: Springer Science and Business Media LLC

Date: 12-2017

DOI: 10.1186/S12859-017-1965-5

Publication

Comprehensive analysis of chromothripsis in 2,658 human cancers using whole-genome sequencing

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41588-019-0576-7

Abstract: Chromothripsis is a mutational phenomenon characterized by massive, clustered genomic rearrangements that occurs in cancer and other diseases. Recent studies in selected cancer types have suggested that chromothripsis may be more common than initially inferred from low-resolution copy-number data. Here, as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), we analyze patterns of chromothripsis across 2,658 tumors from 38 cancer types using whole-genome sequencing data. We find that chromothripsis events are pervasive across cancers, with a frequency of more than 50% in several cancer types. Whereas canonical chromothripsis profiles display oscillations between two copy-number states, a considerable fraction of events involve multiple chromosomes and additional structural alterations. In addition to non-homologous end joining, we detect signatures of replication-associated processes and templated insertions. Chromothripsis contributes to oncogene amplification and to inactivation of genes such as mismatch-repair-related genes. These findings show that chromothripsis is a major process that drives genome evolution in human cancer.

Publication

Genome-wide detection and characterization of positive selection in human populations

Publisher: Springer Science and Business Media LLC

Date: 10-2007

DOI: 10.1038/NATURE06250

Publication

Combined burden and functional impact tests for cancer driver discovery using DriverPower

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41467-019-13929-1

Abstract: The discovery of driver mutations is one of the key motivations for cancer genome sequencing. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumour types, we describe DriverPower, a software package that uses mutational burden and functional impact evidence to identify driver mutations in coding and non-coding sites within cancer whole genomes. Using a total of 1373 genomic features derived from public sources, DriverPower’s background mutation model explains up to 93% of the regional variance in the mutation rate across multiple tumour types. By incorporating functional impact scores, we are able to further increase the accuracy of driver discovery. Testing across a collection of 2583 cancer genomes from the PCAWG project, DriverPower identifies 217 coding and 95 non-coding driver candidates. Comparing to six published methods used by the PCAWG Drivers and Functional Interpretation Working Group, DriverPower has the highest F1 score for both coding and non-coding driver discovery. This demonstrates that DriverPower is an effective framework for computational driver discovery.

Publication

The evolutionary history of 2,658 cancers

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41586-019-1907-7

Abstract: Cancer develops through a process of somatic evolution [1,2]. Sequencing data from a single biopsy represent a snapshot of this process that can reveal the timing of specific genomic aberrations and the changing influence of mutational processes [3]. Here, by whole-genome sequencing analysis of 2,658 cancers as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) [4], we reconstruct the life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer. Early oncogenesis is characterized by mutations in a constrained set of driver genes, and specific copy number gains, such as trisomy 7 in glioblastoma and isochromosome 17q in medulloblastoma. The mutational spectrum changes significantly throughout tumour evolution in 40% of samples. A nearly fourfold diversification of driver genes and increased genomic instability are features of later stages. Copy number alterations often occur in mitotic crises, and lead to simultaneous gains of chromosomal segments. Timing analyses suggest that driver mutations often precede diagnosis by many years, if not decades. Together, these results determine the evolutionary trajectories of cancer, and highlight opportunities for early cancer detection.

Publication

Computational Pipelines and Workflows in Bioinformatics

Publisher: Elsevier

Date: 2019

DOI: 10.1016/B978-0-12-809633-8.20089-7

Publication

A meta-analysis identifies adolescent idiopathic scoliosis association with LBX1 locus in multiple ethnic groups

Publisher: BMJ

Date: 10-04-2014

DOI: 10.1136/JMEDGENET-2013-102067

Abstract: Adolescent idiopathic scoliosis (AIS) is a common rotational deformity of the spine that presents in children worldwide, yet its etiology is poorly understood. Recent genome-wide association studies (GWAS) have identified a few candidate risk loci. One locus near the chromosome 10q24.31 LBX1 gene (OMIM #604255) was originally identified by a GWAS of Japanese subjects and replicated in additional Asian populations. To extend this result, and to create larger AIS cohorts for the purpose of large-scale meta-analyses in multiple ethnicities, we formed a collaborative group called the International Consortium for Scoliosis Genetics (ICSG). Here, we report the first ICSG study, a meta-analysis of the LBX1 locus in six Asian and three non-Asian cohorts. We find significant evidence for association of this locus with AIS susceptibility in all nine cohorts. Results for seven cohorts containing both genders yielded P=1.22×10^-43 for rs11190870, and P=2.94×10^-48 for females in all nine cohorts. Comparing the regional haplotype structures for three populations, we refined the boundaries of association to a ∼25 kb block encompassing the LBX1 gene. The LBX1 protein, a homeobox transcription factor that is orthologous to the Drosophila ladybird late gene, is involved in proper migration of muscle precursor cells, specification of cardiac neural crest cells, and neuronal determination in developing neural tubes. Our results firmly establish the LBX1 region as the first major susceptibility locus for AIS in Asian and non-Hispanic white groups, and provide a platform for larger studies in additional ancestral groups.

Publication

Sex differences in oncogenic mutational processes

Publisher: Springer Science and Business Media LLC

Date: 28-08-2020

DOI: 10.1038/S41467-020-17359-2

Abstract: Sex differences have been observed in multiple facets of cancer epidemiology, treatment and biology, and in most cancers outside the sex organs. Efforts to link these clinical differences to specific molecular features have focused on somatic mutations within the coding regions of the genome. Here we report a pan-cancer analysis of sex differences in whole genomes of 1983 tumours of 28 subtypes as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium. We both confirm the results of exome studies, and also uncover previously undescribed sex differences. These include sex-biases in coding and non-coding cancer drivers, mutation prevalence and strikingly, in mutational signatures related to underlying mutational processes. These results underline the pervasiveness of molecular sex differences and strengthen the call for increased consideration of sex in molecular cancer research.

Publication

An integrative machine learning approach for prediction of toxicity-related drug safety

Publisher: Cold Spring Harbor Laboratory

Date: 29-10-2018

DOI: 10.1101/455667

Abstract: Recent trends in drug development have been marked by diminishing returns, escalating costs and a falling rate of new drug approval. Unacceptable drug toxicity is a substantial cause of drug failure during clinical trials as well as the leading cause of drug withdrawals after release to market. Computational methods capable of predicting these failures can reduce waste of resources and time devoted to the investigation of compounds that ultimately fail. We propose an original machine learning method that leverages identity of drug targets and off-targets, functional impact score computed from Gene Ontology annotations, and biological network data to predict drug toxicity. We demonstrate that our method (TargeTox) can distinguish potentially idiosyncratically toxic drugs from safe drugs and is also suitable for speculative evaluation of different target sets to support the design of optimal low-toxicity combinations. Prediction of toxicity-related drug clinical trial failures, withdrawals from market and idiosyncratic toxicity risk by combining biological network analysis with machine learning.

Publication

Inferring structural variant cancer cell fraction

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41467-020-14351-8

Abstract: We present SVclone, a computational method for inferring the cancer cell fraction of structural variant (SV) breakpoints from whole-genome sequencing data. SVclone accurately determines the variant allele frequencies of both SV breakends, then simultaneously estimates the cancer cell fraction and SV copy number. We assess performance using in silico mixtures of real samples, at known proportions, created from two clonal metastases from the same patient. We find that SVclone’s performance is comparable to single-nucleotide variant-based methods, despite having an order of magnitude fewer data points. As part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium, which aggregated whole-genome sequencing data from 2658 cancers across 38 tumour types, we use SVclone to reveal a subset of liver, ovarian and pancreatic cancers with subclonally enriched copy-number neutral rearrangements that show decreased overall survival. SVclone enables improved characterisation of SV intra-tumour heterogeneity.

Publication

DeepFeature: feature selection in nonimage data using convolutional neural network

Publisher: Oxford University Press (OUP)

Date: 06-08-2021

DOI: 10.1093/BIB/BBAB297

Abstract: Artificial intelligence methods offer exciting new capabilities for the discovery of biological mechanisms from raw data because they are able to detect vastly more complex patterns of association that cannot be captured by classical statistical tests. Among these methods, deep neural networks are currently among the most advanced approaches and, in particular, convolutional neural networks (CNNs) have been shown to perform excellently for a variety of difficult tasks. Despite that, the application of this type of network to high-dimensional omics data and, most importantly, meaningful interpretation of the results returned from such models in a biomedical context remain an open problem. Here we present an approach applying a CNN to nonimage data for feature selection. Our pipeline, DeepFeature, can both successfully transform omics data into a form that is optimal for fitting a CNN model and can also return sets of the most important genes used internally for computing predictions. Within the framework, the Snowfall compression algorithm is introduced to enable more elements in the fixed pixel framework, and a region accumulation and element decoder is developed to find elements or genes from the class activation maps. In comparative tests for a cancer type prediction task, DeepFeature simultaneously achieved superior predictive performance and better ability to discover key pathways and biological processes meaningful for this context. Capabilities offered by the proposed framework can enable the effective use of powerful deep learning methods to facilitate the discovery of causal mechanisms in high-dimensional biomedical data.

Publication

SucStruct: Prediction of succinylated lysine residues by using structural properties of amino acids

Publisher: Elsevier BV

Date: 06-2017

DOI: 10.1016/J.AB.2017.03.021

Abstract: Post-Translational Modification (PTM) is a biological reaction which contributes to diversify the proteome. Despite many modifications with important roles in cellular activity, lysine succinylation has recently emerged as an important PTM mark. It alters the chemical structure of lysines, leading to remarkable changes in the structure and function of proteins. In contrast to the huge amount of proteins being sequenced in the post-genome era, the experimental detection of succinylated residues remains expensive, inefficient and time-consuming. Therefore, the development of computational tools for accurately predicting succinylated lysines is an urgent necessity. To date, several approaches have been proposed but their sensitivity has been reportedly poor. In this paper, we propose an approach that utilizes structural features of amino acids to improve lysine succinylation prediction. Succinylated and non-succinylated lysines were first retrieved from 670 proteins and characteristics such as accessible surface area, backbone torsion angles and local structure conformations were incorporated. We used the k-nearest neighbors cleaning treatment for dealing with class imbalance and designed a pruned decision tree for classification. Our predictor, referred to as SucStruct (Succinylation using Structural features), proved to significantly improve performance when compared to previous predictors, with sensitivity, accuracy and Matthews correlation coefficient equal to 0.7334-0.7946, 0.7444-0.7608 and 0.4884-0.5240, respectively.

Publication

Brain wave classification using long short-term memory network based OPTICAL predictor

Publisher: Springer Science and Business Media LLC

Date: 24-06-2019

DOI: 10.1038/S41598-019-45605-1

Abstract: Brain-computer interface (BCI) systems having the ability to classify brain waves with greater accuracy are highly desirable. To this end, a number of techniques have been proposed aiming to be able to classify brain waves with high accuracy. However, the ability to classify brain waves and its implementation in real-time is still limited. In this study, we introduce a novel scheme for classifying motor imagery (MI) tasks using electroencephalography (EEG) signal that can be implemented in real-time having high classification accuracy between different MI tasks. We propose a new predictor, OPTICAL, that uses a combination of common spatial pattern (CSP) and long short-term memory (LSTM) network for obtaining improved MI EEG signal classification. A sliding window approach is proposed to obtain the time-series input from the spatially filtered data, which becomes input to the LSTM network. Moreover, instead of using LSTM directly for classification, we use regression based output of the LSTM network as one of the features for classification. On the other hand, linear discriminant analysis (LDA) is used to reduce the dimensionality of the CSP variance based features. The features in the reduced dimensional plane after performing LDA are used as input to the support vector machine (SVM) classifier together with the regression based feature obtained from the LSTM network. The regression based feature further boosts the performance of the proposed OPTICAL predictor. OPTICAL showed significant improvement in the ability to accurately classify left and right-hand MI tasks on two publicly available datasets. The improvements in the average misclassification rates are 3.09% and 2.07% for BCI Competition IV Dataset I and GigaDB dataset, respectively. The Matlab code is available at github.com/ShiuKumar/OPTICAL.
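
The sliding window step mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' Matlab implementation: the function name `sliding_windows` and the `width` and `step` parameters are hypothetical, and the abstract does not state the window length or stride used by OPTICAL.

```python
def sliding_windows(signal, width, step=1):
    """Split a 1-D sequence into overlapping windows.

    `width` and `step` are illustrative parameters (assumptions);
    a trailing partial window is dropped.
    """
    if width <= 0 or step <= 0:
        raise ValueError("width and step must be positive")
    last_start = len(signal) - width
    # Each window starts `step` samples after the previous one.
    return [signal[s:s + width] for s in range(0, last_start + 1, step)]


# Example: five samples, windows of three, stride of one.
windows = sliding_windows([1, 2, 3, 4, 5], width=3, step=1)
```

Each window would then serve as one time-series input to a sequence model; how the windows are fed to the LSTM in OPTICAL is described in the paper, not here.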

Publication

2D–EM clustering approach for high-dimensional data through folding feature vectors

Publisher: Springer Science and Business Media LLC

Date: 12-2017

DOI: 10.1186/S12859-017-1970-8

Publication

Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41588-019-0562-0

Abstract: About half of all cancers have somatic integrations of retrotransposons. Here, to characterize their role in oncogenesis, we analyzed the patterns and mechanisms of somatic retrotransposition in 2,954 cancer genomes from 38 histological cancer subtypes within the framework of the Pan-Cancer Analysis of Whole Genomes (PCAWG) project. We identified 19,166 somatically acquired retrotransposition events, which affected 35% of samples and spanned a range of event types. Long interspersed nuclear element (LINE-1; L1 hereafter) insertions emerged as the first most frequent type of somatic structural variation in esophageal adenocarcinoma, and the second most frequent in head-and-neck and colorectal cancers. Aberrant L1 integrations can delete megabase-scale regions of a chromosome, which sometimes leads to the removal of tumor-suppressor genes, and can induce complex translocations and large-scale duplications. Somatic retrotranspositions can also initiate breakage–fusion–bridge cycles, leading to high-level amplification of oncogenes. These observations illuminate a relevant role of L1 retrotransposition in remodeling the cancer genome, with potential implications for the development of human tumors.

Publication

Pan-cancer analysis of whole genomes

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41586-020-1969-6

Abstract: Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale 1–3. Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds. On average, cancer genomes contained 4–5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously. Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition. A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter 4; identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation 5,6; analyses timings and patterns of tumour evolution 7; describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity 8,9; and evaluates a range of more-specialized features of cancer genomes 8,10–18.

Publication

An integrative machine learning approach for prediction of toxicity-related drug safety

Publisher: Life Science Alliance, LLC

Date: 28-11-2018

DOI: 10.26508/LSA.201800098

Abstract: Recent trends in drug development have been marked by diminishing returns caused by the escalating costs and falling rates of new drug approval. Unacceptable drug toxicity is a substantial cause of drug failure during clinical trials and the leading cause of drug withdrawals after release to the market. Computational methods capable of predicting these failures can reduce the waste of resources and time devoted to the investigation of compounds that ultimately fail. We propose an original machine learning method that leverages identity of drug targets and off-targets, functional impact score computed from Gene Ontology annotations, and biological network data to predict drug toxicity. We demonstrate that our method (TargeTox) can distinguish potentially idiosyncratically toxic drugs from safe drugs and is also suitable for speculative evaluation of different target sets to support the design of optimal low-toxicity combinations.

Publication

Discovering MoRFs by trisecting intrinsically disordered protein sequence into terminals and middle regions

Publisher: Springer Science and Business Media LLC

Date: 02-2019

DOI: 10.1186/S12859-018-2396-7

Publication

Assessment of network module identification across complex diseases

Publisher: Springer Science and Business Media LLC

Date: 30-08-2019

DOI: 10.1038/S41592-019-0509-5

Publication

Multiancestry association study identifies new asthma risk loci that colocalize with immune-cell enhancer marks

Publisher: Springer Science and Business Media LLC

Date: 22-12-2017

DOI: 10.1038/S41588-017-0014-7

Publication

Risk prediction models for dementia constructed by supervised principal component analysis using miRNA expression data

Publisher: Springer Science and Business Media LLC

Date: 25-02-2019

DOI: 10.1038/S42003-019-0324-7

Abstract: Alzheimer’s disease (AD) is the most common subtype of dementia, followed by Vascular Dementia (VaD), and Dementia with Lewy Bodies (DLB). Recently, microRNAs (miRNAs) have received a lot of attention as the novel biomarkers for dementia. Here, using serum miRNA expression of 1,601 Japanese individuals, we investigated potential miRNA biomarkers and constructed risk prediction models, based on a supervised principal component analysis (PCA) logistic regression method, according to the subtype of dementia. The final risk prediction model achieved a high accuracy of 0.873 on a validation cohort in AD, when using 78 miRNAs; Accuracy = 0.836 with 86 miRNAs in VaD; Accuracy = 0.825 with 110 miRNAs in DLB. To our knowledge, this is the first report applying miRNA-based risk prediction models to a dementia prospective cohort. Our study demonstrates our models to be effective in prospective disease risk prediction, and with further improvement may contribute to practical clinical use in dementia.

Publication

PSSM-Suc: Accurately predicting succinylation using position specific scoring matrix into bigram for feature extraction

Publisher: Elsevier BV

Date: 07-2017

DOI: 10.1016/J.JTBI.2017.05.005

Abstract: Post-translational modification (PTM) is a covalent and enzymatic modification of proteins, which contributes to diversify the proteome. Despite many reported PTMs with essential roles in cellular functioning, lysine succinylation has emerged as a subject of particular interest. Because its experimental identification remains a costly and time-consuming process, computational predictors have been recently proposed for tackling this important issue. However, the performance of current predictors is still very limited. In this paper, we propose a new predictor called PSSM-Suc which employs evolutionary information of amino acids for predicting succinylated lysine residues. Here we described each lysine residue in terms of profile bigrams extracted from position specific scoring matrices. We compared the performance of PSSM-Suc to that of existing predictors using a widely used benchmark dataset. PSSM-Suc showed a significant improvement in performance over state-of-the-art predictors. Its sensitivity, accuracy and Matthews correlation coefficient were 0.8159, 0.8199 and 0.6396, respectively.
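
The abstract above does not give a formula for the profile bigram. As a hedged illustration, the Python sketch below uses a definition commonly found in the literature, B[i][j] = Σₖ P[k][i] · P[k+1][j], where P is a position specific scoring matrix with one row per sequence position. The function name `profile_bigram` is hypothetical, and any normalization or windowing around the lysine residue that PSSM-Suc applies is not modeled here.

```python
def profile_bigram(pssm):
    """Return the bigram matrix B for a PSSM given as a list of rows.

    B[i][j] sums, over consecutive positions k and k+1, the product
    pssm[k][i] * pssm[k+1][j]. This definition is an assumption made
    for illustration, not taken from the abstract.
    """
    if not pssm:
        return []
    n_cols = len(pssm[0])
    bigram = [[0.0] * n_cols for _ in range(n_cols)]
    # Walk over each pair of adjacent sequence positions.
    for row, next_row in zip(pssm, pssm[1:]):
        for i, a in enumerate(row):
            for j, b in enumerate(next_row):
                bigram[i][j] += a * b
    return bigram


# Toy matrix with 3 positions and 2 columns (a real PSSM has 20 columns).
toy = profile_bigram([[1, 0], [0, 1], [1, 1]])
```

For a 20-column PSSM this yields a fixed 20 × 20 matrix (400 features) regardless of sequence length, which is what makes the representation convenient as classifier input.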

Publication

Multi-representation DeepInsight: an improvement on tabular data analysis

Publisher: Cold Spring Harbor Laboratory

Date: 05-08-2023

DOI: 10.1101/2023.08.02.551620

Abstract: Tabular data analysis is a critical task in various domains, enabling us to uncover valuable insights from structured datasets. While traditional machine learning methods have been employed for feature engineering and dimensionality reduction, they often struggle to capture the intricate relationships and dependencies within real-world datasets. In this paper, we present Multi-representation DeepInsight (abbreviated as MRep-DeepInsight), an innovative extension of the DeepInsight method, specifically designed to enhance the analysis of tabular data. By generating multiple representations of samples using diverse feature extraction techniques, our approach aims to capture a broader range of features and reveal deeper insights. We demonstrate the effectiveness of MRep-DeepInsight on single-cell datasets, Alzheimer’s data, and artificial data, showcasing an improved accuracy over the original DeepInsight approach and machine learning methods like random forest and L2-regularized logistic regression. Our results highlight the value of incorporating multiple representations for robust and accurate tabular data analysis. By embracing the power of diverse representations, MRep-DeepInsight offers a promising avenue for advancing decision-making and scientific discovery across a wide range of fields.

Publication

Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

Publisher: Springer Science and Business Media LLC

Date: 21-09-2020

DOI: 10.1038/S41467-020-18151-Y

Abstract: The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) curated consensus somatic mutation calls using whole exome sequencing (WES) and whole genome sequencing (WGS), respectively. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2,658 cancers across 38 tumour types, we compare WES and WGS side-by-side from 746 TCGA samples, finding that ~80% of mutations overlap in covered exonic regions. We estimate that low variant allele fraction (VAF < 15%) and clonal heterogeneity contribute up to 68% of private WGS mutations and 71% of private WES mutations. We observe that ~30% of private WGS mutations trace to mutations identified by a single variant caller in WES consensus efforts. WGS captures both ~50% more variation in exonic regions and un-observed mutations in loci with variable GC-content. Together, our analysis highlights technological divergences between two reproducible somatic variant detection efforts.

Publication

DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture

Publisher: Springer Science and Business Media LLC

Date: 06-08-2019

DOI: 10.1038/S41598-019-47765-6

Abstract: It is critical, but difficult, to catch the small variation in genomic or other kinds of data that differentiates phenotypes or categories. A plethora of data is available, but the information from its genes or elements is spread over arbitrarily, making it challenging to extract relevant details for identification. However, an arrangement of similar genes into clusters makes these differences more accessible and allows for more robust identification of hidden mechanisms (e.g. pathways) than dealing with elements individually. Here we propose DeepInsight, which converts non-image samples into a well-organized image-form. Thereby, the power of convolution neural network (CNN), including GPU utilization, can be realized for non-image samples. Furthermore, DeepInsight enables feature extraction through the application of CNN for non-image samples to seize imperative information and has shown promising results. To our knowledge, this is the first work to apply CNN simultaneously on different kinds of non-image datasets: RNA-seq, vowels, text, and artificial.

Publication

SumSec: Accurate Prediction of Sumoylation Sites Using Predicted Secondary Structure

Publisher: MDPI AG

Date: 10-12-2018

DOI: 10.3390/MOLECULES23123260

Abstract: Post Translational Modification (PTM) is defined as the modification of amino acids along the protein sequences after the translation process. These modifications significantly impact on the functioning of proteins. Therefore, having a comprehensive understanding of the underlying mechanism of PTMs turns out to be critical in studying the biological roles of proteins. Among a wide range of PTMs, sumoylation is one of the most important modifications due to its known cellular functions which include transcriptional regulation, protein stability, and protein subcellular localization. Despite its importance, determining sumoylation sites via experimental methods is time-consuming and costly. This has led to a great demand for the development of fast computational methods able to accurately determine sumoylation sites in proteins. In this study, we present a new machine learning-based method for predicting sumoylation sites called SumSec. To do this, we employed the predicted secondary structure of amino acids to extract two types of structural features from neighboring amino acids along the protein sequence which has never been used for this task. As a result, our proposed method is able to enhance the sumoylation site prediction task, outperforming previously proposed methods in the literature. SumSec demonstrated high sensitivity (0.91), accuracy (0.94) and MCC (0.88). The prediction accuracy achieved in this study is 21% better than those reported in previous studies. The script and extracted features are publicly available at: github.com/YosvanyLopez/SumSec.

Publication

Butler enables rapid cloud-based analysis of thousands of human genomes

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41587-019-0360-3

Abstract: We present Butler, a computational tool that facilitates large-scale genomic analyses on public and academic clouds. Butler includes innovative anomaly detection and self-healing functions that improve the efficiency of data processing and analysis by 43% compared with current approaches. Butler enabled processing of a 725-terabyte cancer genome dataset from the Pan-Cancer Analysis of Whole Genomes (PCAWG) project in a time-efficient and uniform manner.

Publication

Comprehensive molecular characterization of mitochondrial genomes in human cancers

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41588-019-0557-X

Abstract: Mitochondria are essential cellular organelles that play critical roles in cancer. Here, as part of the International Cancer Genome Consortium/The Cancer Genome Atlas Pan-Cancer Analysis of Whole Genomes Consortium, which aggregated whole-genome sequencing data from 2,658 cancers across 38 tumor types, we performed a multidimensional, integrated characterization of mitochondrial genomes and related RNA sequencing data. Our analysis presents the most definitive mutational landscape of mitochondrial genomes and identifies several hypermutated cases. Truncating mutations are markedly enriched in kidney, colorectal and thyroid cancers, suggesting oncogenic effects with the activation of signaling pathways. We find frequent somatic nuclear transfers of mitochondrial DNA, some of which disrupt therapeutic target genes. Mitochondrial copy number varies greatly within and across cancers and correlates with clinical variables. Co-expression analysis highlights the function of mitochondrial genes in oxidative phosphorylation, DNA repair and the cell cycle, and shows their connections with clinically actionable genes. Our study lays a foundation for translating mitochondrial biology into clinical applications.

Publication

PhoglyStruct: Prediction of phosphoglycerylated lysine residues using structural properties of amino acids

Publisher: Springer Science and Business Media LLC

Date: 18-12-2018

DOI: 10.1038/S41598-018-36203-8

Abstract: The biological process known as post-translational modification (PTM) contributes to diversifying the proteome hence affecting many aspects of normal cell biology and pathogenesis. There have been many recently reported PTMs, but lysine phosphoglycerylation has emerged as the most recent subject of interest. Despite a large number of proteins being sequenced, the experimental method for detection of phosphoglycerylated residues remains an expensive, time-consuming and inefficient endeavor in the post-genomic era. Instead, the computational methods are being proposed for accurately predicting phosphoglycerylated lysines. Though a number of predictors are available, performance in detecting phosphoglycerylated lysine residues is still limited. In this paper, we propose a new predictor called PhoglyStruct that utilizes structural information of amino acids alongside a multilayer perceptron classifier for predicting phosphoglycerylated and non-phosphoglycerylated lysine residues. For the experiment, we located phosphoglycerylated and non-phosphoglycerylated lysines in our employed benchmark. We then derived and integrated properties such as accessible surface area, backbone torsion angles, and local structure conformations. PhoglyStruct showed significant improvement in the ability to detect phosphoglycerylated residues from non-phosphoglycerylated ones when compared to previous predictors. The sensitivity, specificity, accuracy, Matthews correlation coefficient and AUC were 0.8542, 0.7597, 0.7834, 0.5468 and 0.8077, respectively. The data and Matlab/Octave software packages are available at belavit/PhoglyStruct.

Publication

Predicting MoRFs in protein sequences using HMM profiles

Publisher: Springer Science and Business Media LLC

Date: 12-2016

DOI: 10.1186/S12859-016-1375-0

Publication

Predicting protein-peptide binding sites with a deep convolutional neural network

Publisher: Elsevier BV

Date: 07-2020

DOI: 10.1016/J.JTBI.2020.110278

Publication

Genomic basis for RNA alterations in cancer

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41586-020-1970-0

Abstract: Transcript alterations often result from somatic changes in cancer genomes 1. Various forms of RNA alterations have been described in cancer, including overexpression 2, altered splicing 3 and gene fusions 4; however, it is difficult to attribute these to underlying genomic changes owing to heterogeneity among patients and tumour types, and the relatively small cohorts of patients for whom samples have been analysed by both transcriptome and whole-genome sequencing. Here we present, to our knowledge, the most comprehensive catalogue of cancer-associated gene alterations to date, obtained by characterizing tumour transcriptomes from 1,188 donors of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) 5. Using matched whole-genome sequencing data, we associated several categories of RNA alterations with germline and somatic DNA alterations, and identified probable genetic mechanisms. Somatic copy-number alterations were the major drivers of variations in total gene and allele-specific expression. We identified 649 associations of somatic single-nucleotide variants with gene expression in cis, of which 68.4% involved associations with flanking non-coding regions of the gene. We found 1,900 splicing alterations associated with somatic mutations, including the formation of exons within introns in proximity to Alu elements. In addition, 82% of gene fusions were associated with structural variants, including 75 of a new class, termed ‘bridged’ fusions, in which a third genomic location bridges two genes. We observed transcriptomic alteration signatures that differ between cancer types and have associations with variations in DNA mutational signatures. This compendium of RNA alterations in the genomic context provides a rich resource for identifying genes and mechanisms that are functionally implicated in cancer.

Publication

Gene masking - a technique to improve accuracy for cancer classification with high dimensionality in microarray data

Publisher: Springer Science and Business Media LLC

Date: 12-2016

DOI: 10.1186/S12920-016-0233-2

Publication

Analyses of non-coding somatic drivers in 2,658 cancer whole genomes

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41586-020-1965-X

Abstract: The discovery of drivers of cancer has traditionally focused on protein-coding genes 1–4. Here we present analyses of driver point mutations and structural variants in non-coding regions across 2,658 genomes from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium 5 of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). For point mutations, we developed a statistically rigorous strategy for combining significance levels from multiple methods of driver discovery that overcomes the limitations of individual methods. For structural variants, we present two methods of driver discovery, and identify regions that are significantly affected by recurrent breakpoints and recurrent somatic juxtapositions. Our analyses confirm previously reported drivers 6,7, raise doubts about others and identify novel candidates, including point mutations in the 5′ region of TP53, in the 3′ untranslated regions of NFKBIZ and TOB1, focal deletions in BRD4 and rearrangements in the loci of AKR1C genes. We show that although point mutations and structural variants that drive cancer are less frequent in non-coding genes and regulatory sequences than in protein-coding genes, additional examples of these drivers will be found as more cancer genomes become available.

Publication

Decimation filter with Common Spatial Pattern and Fishers Discriminant Analysis for motor imagery classification

Publisher: IEEE

Date: 07-2016

DOI: 10.1109/IJCNN.2016.7727457

Publication

Prognosis prediction model for conversion from mild cognitive impairment to Alzheimer’s disease created by integrative analysis of multi-omics data

Publisher: Springer Science and Business Media LLC

Date: 10-11-2020

DOI: 10.1186/S13195-020-00716-0

Abstract: Mild cognitive impairment (MCI) is a precursor to Alzheimer’s disease (AD), but not all MCI patients develop AD. Biomarkers for early detection of individuals at high risk for MCI-to-AD conversion are urgently required. We used blood-based microRNA expression profiles and genomic data of 197 Japanese MCI patients to construct a prognosis prediction model based on a Cox proportional hazard model. We examined the biological significance of our findings with single nucleotide polymorphism-microRNA pairs (miR-eQTLs) by focusing on the target genes of the miRNAs. We investigated functional modules from the target genes with the occurrence of hub genes through a large-scale protein-protein interaction network analysis. We further examined the expression of the genes in 610 blood samples (271 ADs, 248 MCIs, and 91 cognitively normal elderly subjects [CNs]). The final prediction model, composed of 24 miR-eQTLs and three clinical factors (age, sex, and APOE4 alleles), successfully classified MCI patients into low and high risk of MCI-to-AD conversion (log-rank test P = 3.44 × 10⁻⁴) and achieved a concordance index of 0.702 on an independent test set. Four important hub genes associated with AD pathogenesis (SHC1, FOXO1, GSK3B, and PTEN) were identified in a network-based meta-analysis of miR-eQTL target genes. RNA-seq data from 610 blood samples showed statistically significant differences in PTEN expression between MCI and AD and in SHC1 expression between CN and AD (PTEN, P = 0.023; SHC1, P = 0.049). Our proposed model was demonstrated to be effective in MCI-to-AD conversion prediction. A network-based meta-analysis of miR-eQTL target genes identified important hub genes associated with AD pathogenesis. Accurate prediction of MCI-to-AD conversion would enable earlier intervention for MCI patients at high risk, potentially reducing conversion to AD.

Publication

SPECTRA: a tool for enhanced brain wave signal recognition

Publisher: Springer Science and Business Media LLC

Date: 02-06-2021

DOI: 10.1186/S12859-021-04091-X

Abstract: Brain wave signal recognition has gained increased attention in neuro-rehabilitation applications. This has driven the development of brain–computer interface (BCI) systems. Brain wave signals are acquired using electroencephalography (EEG) sensors, processed and decoded to identify the category to which the signal belongs. Once the signal category is determined, it can be used to control external devices. However, the success of such a system essentially relies on significant feature extraction and classification algorithms. One of the commonly used feature extraction techniques for BCI systems is common spatial pattern (CSP). The performance of the proposed spatial-frequency-temporal feature extraction (SPECTRA) predictor is analysed using three public benchmark datasets. Our proposed predictor outperformed other competing methods, achieving the lowest average error rates of 8.55%, 17.90% and 20.26%, and the highest average kappa coefficient values of 0.829, 0.643 and 0.595 for BCI Competition III dataset IVa, BCI Competition IV dataset I and BCI Competition IV dataset IIb, respectively. Our proposed SPECTRA predictor effectively finds features that are more separable and shows improvement in brain wave signal recognition that can be instrumental in developing improved real-time BCI systems that are computationally efficient.
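
The kappa coefficient quoted alongside the error rates corrects classification accuracy for agreement expected by chance. The following is a minimal sketch of Cohen's kappa computed from a confusion matrix. The function name and the matrix layout are assumptions for illustration, and the code is not taken from the SPECTRA software.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix.

    confusion[i][j] is the number of trials of true class i predicted as class j.
    Assumes chance agreement is below 1 (more than one class actually occurs).
    """
    total = sum(sum(row) for row in confusion)
    n_classes = len(confusion)
    # Observed agreement: fraction of trials on the diagonal.
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    # Chance agreement: product of true and predicted class frequencies, summed.
    expected = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(n_classes)
    )
    return (observed - expected) / (1 - expected)
```

When a two-class test set is balanced, chance agreement is 0.5 and kappa reduces to 2 × accuracy − 1.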

Publication

Pathway and network analysis of more than 2500 whole cancer genomes

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41467-020-14367-0

Abstract: The catalog of cancer driver mutations in protein-coding genes has greatly expanded in the past decade. However, non-coding cancer driver mutations are less well-characterized and only a handful of recurrent non-coding mutations, most notably TERT promoter mutations, have been reported. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumor types, we perform multi-faceted pathway and network analyses of non-coding mutations across 2583 whole cancer genomes from 27 tumor types compiled by the ICGC/TCGA PCAWG project that was motivated by the success of pathway and network analyses in prioritizing rare mutations in protein-coding genes. While few non-coding genomic elements are recurrently mutated in this cohort, we identify 93 genes harboring non-coding mutations that cluster into several modules of interacting proteins. Among these are promoter mutations associated with reduced mRNA expression in TP53, TLE4, and TCF4. We find that biological processes had variable proportions of coding and non-coding mutations, with chromatin remodeling and proliferation pathways altered primarily by coding mutations, while developmental pathways, including Wnt and Notch, altered by both coding and non-coding mutations. RNA splicing is primarily altered by non-coding mutations in this cohort, and samples containing non-coding mutations in well-known RNA splicing factors exhibit similar gene expression signatures as samples with coding mutations in these genes. These analyses contribute a new repertoire of possible cancer genes and mechanisms that are altered by non-coding mutations and offer insights into additional cancer vulnerabilities that can be investigated for potential therapeutic treatments.

Publication

DeepInsight-3D for precision oncology: an improved anti-cancer drug response prediction from high-dimensional multi-omics data with convolutional neural networks

Publisher: Cold Spring Harbor Laboratory

Date: 16-07-2022

DOI: 10.1101/2022.07.14.500140

Abstract: Modern oncology offers a wide range of treatments and therefore choosing the best option for a particular patient is very important for optimal outcomes. Multi-omics profiling in combination with AI-based predictive models has great potential for streamlining these treatment decisions. However, these encouraging developments continue to be hampered by very high dimensionality of the datasets in combination with insufficiently large numbers of annotated samples. In this study, we propose a novel deep learning-based method to predict patient-specific anticancer drug response from three types of multiomics data. The proposed DeepInsight-3D approach relies on structured data-to-image conversion that then allows use of convolutional neural networks, which are particularly robust to high dimensionality of the inputs while retaining capabilities to model highly complex relationships between variables. Of particular note, we demonstrate that in this formalism additional channels of an image can be effectively used to accommodate data from different ‘omics layers while explicitly encoding the connection between them. DeepInsight-3D was able to outperform two other state-of-the-art methods proposed for this task. These advances can facilitate the development of better personalized treatment strategies for different cancers in the future.

Publication

RAM-PGK: Prediction of Lysine Phosphoglycerylation Based on Residue Adjacency Matrix

Publisher: MDPI AG

Date: 20-12-2020

DOI: 10.3390/GENES11121524

Abstract: Background: Post-translational modification (PTM) is a biological process that is associated with the modification of proteome, which results in the alteration of normal cell biology and pathogenesis. There have been numerous PTM reports in recent years, out of which, lysine phosphoglycerylation has emerged as one of the recent developments. The traditional methods of identifying phosphoglycerylated residues, which are experimental procedures such as mass spectrometry, have shown to be time-consuming and cost-inefficient, despite the abundance of proteins being sequenced in this post-genomic era. Due to these drawbacks, computational techniques are being sought to establish an effective identification system of phosphoglycerylated lysine residues. The development of a predictor for phosphoglycerylation prediction is not a first, but it is necessary as the latest predictor falls short in adequately detecting phosphoglycerylated and non-phosphoglycerylated lysine residues. Results: In this work, we introduce a new predictor named RAM-PGK, which uses sequence-based information relating to amino acid residues to predict phosphoglycerylated and non-phosphoglycerylated sites. A benchmark dataset was employed for this purpose, which contained experimentally identified phosphoglycerylated and non-phosphoglycerylated lysine residues. From the dataset, we extracted the residue adjacency matrix pertaining to each lysine residue in the protein sequences and converted them into feature vectors, which are used to build the phosphoglycerylation predictor. Conclusion: RAM-PGK, which is based on sequential features and support vector machine classifiers, has shown a noteworthy improvement in terms of performance in comparison to some of the recent prediction methods. The performance metrics of the RAM-PGK predictor are: 0.5741 sensitivity, 0.6436 specificity, 0.0531 precision, 0.6414 accuracy, and 0.0824 Matthews correlation coefficient.
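
The five performance metrics listed in this abstract are all derived from the four cells of a binary confusion matrix. The following is a minimal sketch of the standard definitions. The function name and the returned dictionary keys are assumptions for illustration, and it assumes both classes are present and at least one positive prediction is made, so that no ratio divides by zero.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn -- positives predicted correctly / missed
    tn/fp -- negatives predicted correctly / wrongly flagged as positive
    """
    # Denominator of the Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # By convention, MCC is set to 0 when the denominator is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }
```

Precision depends on how rare the positive class is, so it can be low on an imbalanced dataset even when sensitivity and specificity are moderate.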

Publication

Genomic footprints of activated telomere maintenance mechanisms in cancer

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41467-019-13824-9

Abstract: Cancers require telomere maintenance mechanisms for unlimited replicative potential. They achieve this through TERT activation or alternative telomere lengthening associated with ATRX or DAXX loss. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, we dissect whole-genome sequencing data of over 2500 matched tumor-control samples from 36 different tumor types aggregated within the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium to characterize the genomic footprints of these mechanisms. While the telomere content of tumors with ATRX or DAXX mutations (ATRX/DAXXtrunc) is increased, tumors with TERT modifications show a moderate decrease of telomere content. One quarter of all tumor samples contain somatic integrations of telomeric sequences into non-telomeric DNA. This fraction is increased to 80% prevalence in ATRX/DAXXtrunc tumors, which carry an aberrant telomere variant repeat (TVR) distribution as another genomic marker. The latter feature includes enrichment or depletion of the previously undescribed singleton TVRs TTCGGG and TTTGGG, respectively. Our systematic analysis provides new insight into the recurrent genomic alterations associated with telomere maintenance mechanisms in cancer.

Publication

scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning

Publisher: Oxford University Press (OUP)

Date: 31-07-2023

DOI: 10.1093/BIB/BBAD266

Abstract: Annotation of cell-types is a critical step in the analysis of single-cell RNA sequencing (scRNA-seq) data that allows the study of heterogeneity across multiple cell populations. Currently, this is most commonly done using unsupervised clustering algorithms, which project single-cell expression data into a lower dimensional space and then cluster cells based on their distances from each other. However, as these methods do not use reference datasets, they can only achieve a rough classification of cell-types, and it is difficult to improve the recognition accuracy further. To effectively solve this issue, we propose a novel supervised annotation method, scDeepInsight. The scDeepInsight method is capable of performing manifold assignments. It is competent in executing data integration through batch normalization, performing supervised training on the reference dataset, doing outlier detection and annotating cell-types on query datasets. Moreover, it can help identify active genes or marker genes related to cell-types. The training of the scDeepInsight model is performed in a unique way. Tabular scRNA-seq data are first converted to corresponding images through the DeepInsight methodology. DeepInsight can create a trainable image transformer to convert non-image RNA data to images by comprehensively comparing interrelationships among multiple genes. Subsequently, the converted images are fed into convolutional neural networks such as EfficientNet-b3. This enables automatic feature extraction to identify the cell-types of scRNA-seq samples. We benchmarked scDeepInsight with six other mainstream cell annotation methods. The average accuracy rate of scDeepInsight reached 87.5%, which is more than 7% higher compared with the state-of-the-art methods.

Publication

An improved discriminative filter bank selection approach for motor imagery EEG signal classification using mutual information

Publisher: Springer Science and Business Media LLC

Date: 12-2017

DOI: 10.1186/S12859-017-1964-6

Publication

GlyStruct: Glycation prediction using structural properties of amino acid residues

Publisher: Springer Science and Business Media LLC

Date: 02-2019

DOI: 10.1186/S12859-018-2547-X

Publication

Predict Gram-Positive and Gram-Negative Subcellular Localization via Incorporating Evolutionary Information and Physicochemical Features Into Chou's General PseAAC

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Date: 12-2015

DOI: 10.1109/TNB.2015.2500186

Publication

Improving succinylation prediction accuracy by incorporating the secondary structure via helix, strand and coil, and evolutionary information from profile bigrams

Publisher: Public Library of Science (PLoS)

Date: 12-02-2018

DOI: 10.1371/JOURNAL.PONE.0191900

Publication

HseSUMO: Sumoylation site prediction using half-sphere exposures of amino acids residues

Publisher: Springer Science and Business Media LLC

Date: 04-2019

DOI: 10.1186/S12864-018-5206-8

Publication

Divergent mutational processes distinguish hypoxic and normoxic tumours

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41467-019-14052-X

Abstract: Many primary tumours have low levels of molecular oxygen (hypoxia), and hypoxic tumours respond poorly to therapy. Pan-cancer molecular hallmarks of tumour hypoxia remain poorly understood, with limited comprehension of its associations with specific mutational processes, non-coding driver genes and evolutionary features. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole genome sequencing data from 2658 cancers across 38 tumour types, we quantify hypoxia in 1188 tumours spanning 27 cancer types. Elevated hypoxia associates with increased mutational load across cancer types, irrespective of underlying mutational class. The proportion of mutations attributed to several mutational signatures of unknown aetiology directly associates with the level of hypoxia, suggesting underlying mutational processes for these signatures. At the gene level, driver mutations in TP53, MYC and PTEN are enriched in hypoxic tumours, and mutations in PTEN interact with hypoxia to direct tumour evolutionary trajectories. Overall, hypoxia plays a critical role in shaping the genomic and evolutionary landscapes of cancer.

Publication

Disruption of chromatin folding domains by somatic genomic rearrangements in human cancer

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41588-019-0564-Y

Abstract: Chromatin is folded into successive layers to organize linear DNA. Genes within the same topologically associating domains (TADs) demonstrate similar expression and histone-modification profiles, and boundaries separating different domains have important roles in reinforcing the stability of these features. Indeed, domain disruptions in human cancers can lead to misregulation of gene expression. However, the frequency of domain disruptions in human cancers remains unclear. Here, as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), which aggregated whole-genome sequencing data from 2,658 cancers across 38 tumor types, we analyzed 288,457 somatic structural variations (SVs) to understand the distributions and effects of SVs across TADs. Notably, SVs can lead to the fusion of discrete TADs, and complex rearrangements markedly change chromatin folding maps in the cancer genomes. Notably, only 14% of the boundary deletions resulted in a change in expression in nearby genes of more than twofold.

Publication

Single-stranded and double-stranded DNA-binding protein prediction using HMM profiles

Publisher: Elsevier BV

Date: 2021

DOI: 10.1016/J.AB.2020.113954

Publication

Computational Prediction of Lysine Pupylation Sites in Prokaryotic Proteins Using Position Specific Scoring Matrix into Bigram for Feature Extraction

Publisher: Springer International Publishing

Date: 2019

DOI: 10.1007/978-3-030-29894-4_39

Publication

The landscape of viral associations in human cancers

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41588-019-0558-9

Abstract: Here, as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, for which whole-genome and, for a subset, whole-transcriptome sequencing data from 2,658 cancers across 38 tumor types was aggregated, we systematically investigated potential viral pathogens using a consensus approach that integrated three independent pipelines. Viruses were detected in 382 genome and 68 transcriptome datasets. We found a high prevalence of known tumor-associated viruses such as Epstein–Barr virus (EBV), hepatitis B virus (HBV) and human papilloma virus (HPV; for example, HPV16 or HPV18). The study revealed significant exclusivity of HPV and driver mutations in head-and-neck cancer and the association of HPV with APOBEC mutational signatures, which suggests that impaired antiviral defense is a driving force in cervical, bladder and head-and-neck carcinoma. For HBV, HPV16, HPV18 and adeno-associated virus-2 (AAV2), viral integration was associated with local variations in genomic copy numbers. Integrations at the TERT promoter were associated with high telomerase expression evidently activating this tumor-driving process. High levels of endogenous retrovirus (ERV1) expression were linked to a worse survival outcome in patients with kidney cancer.

Publication

Clustering of Small-Sample Single-Cell RNA-Seq Data via Feature Clustering and Selection

Publisher: Springer International Publishing

Date: 2019

DOI: 10.1007/978-3-030-29894-4_36

Publication

High-coverage whole-genome analysis of 1220 cancers reveals hundreds of genes deregulated by rearrangement-mediated cis-regulatory alterations

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41467-019-13885-W

Abstract: The impact of somatic structural variants (SVs) on gene expression in cancer is largely unknown. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole-genome sequencing data and RNA sequencing from a common set of 1220 cancer cases, we report hundreds of genes for which the presence within 100 kb of an SV breakpoint associates with altered expression. For the majority of these genes, expression increases rather than decreases with corresponding breakpoint events. Up-regulated cancer-associated genes impacted by this phenomenon include TERT, MDM2, CDK4, ERBB2, CD274, PDCD1LG2, and IGF2. TERT-associated breakpoints involve ~3% of cases, most frequently in liver biliary, melanoma, sarcoma, stomach, and kidney cancers. SVs associated with up-regulation of PD1 and PDL1 genes involve ~1% of non-amplified cases. For many genes, SVs are significantly associated with increased numbers or greater proximity of enhancer regulatory elements near the gene. DNA methylation near the promoter is often increased with nearby SV breakpoint, which may involve inactivation of repressor elements.

Publication

Genetic Variation in the SLC8A1 Calcium Signaling Pathway Is Associated With Susceptibility to Kawasaki Disease and Coronary Artery Abnormalities

Publisher: Ovid Technologies (Wolters Kluwer Health)

Date: 12-2016

DOI: 10.1161/CIRCGENETICS.116.001533

Abstract: Kawasaki disease (KD) is an acute pediatric vasculitis in which host genetics influence both susceptibility to KD and the formation of coronary artery aneurysms. Variants discovered by genome-wide association studies and linkage studies only partially explain the influence of genetics on KD susceptibility. To search for additional functional genetic variation, we performed pathway and gene stability analysis on a genome-wide association study data set. Pathway analysis using European genome-wide association study data identified 100 significantly associated pathways ( P ×10 − 4 ). Gene stability selection identified 116 single nucleotide polymorphisms in 26 genes that were responsible for driving the pathway associations, and gene ontology analysis demonstrated enrichment for calcium transport (P = 1.05 × 10⁻⁴). Three single nucleotide polymorphisms in solute carrier family 8, member 1 (SLC8A1), a sodium/calcium exchanger encoding NCX1, were validated in an independent Japanese genome-wide association study data set (meta-analysis P = 0.0001). Patients homozygous for the A (risk) allele of rs13017968 had higher rates of coronary artery abnormalities (P = 0.029). NCX1, the protein encoded by SLC8A1, was expressed in spindle-shaped and inflammatory cells in the aneurysm wall. Increased intracellular calcium mobilization was observed in B cell lines from healthy controls carrying the risk allele. Pathway-based association analysis followed by gene stability selection proved to be a valuable tool for identifying risk alleles in a rare disease with complex genetics. The role of SLC8A1 polymorphisms in altering calcium flux in cells that mediate coronary artery damage in KD suggests that this pathway may be a therapeutic target and supports the study of calcineurin inhibitors in acute KD.

Publication

Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen

Publisher: Springer Science and Business Media LLC

Date: 17-06-2019

DOI: 10.1038/S41467-019-09799-2

Abstract: The effectiveness of most cancer targeted therapies is short-lived. Tumors often develop resistance that might be overcome with drug combinations. However, the number of possible combinations is vast, necessitating data-driven approaches to find optimal patient-specific treatments. Here we report AstraZeneca’s large drug combination dataset, consisting of 11,576 experiments from 910 combinations across 85 molecularly characterized cancer cell lines, and results of a DREAM Challenge to evaluate computational strategies for predicting synergistic drug pairs and biomarkers. 160 teams participated to provide a comprehensive methodological development and benchmarking. Winning methods incorporate prior knowledge of drug-target interactions. Synergy is predicted with an accuracy matching biological replicates for % of combinations. However, 20% of drug combinations are poorly predicted by all methods. Genomic rationale for synergy predictions are identified, including ADAM17 inhibitor antagonism when combined with PIK3CB/D inhibition contrasting to synergy when combined with other PI3K-pathway inhibitors in PIK3CA mutant cells.

Publication

Forecasting the spread of COVID-19 using LSTM network

Publisher: Springer Science and Business Media LLC

Date: 10-06-2021

DOI: 10.1186/S12859-021-04224-2

Abstract: The novel coronavirus (COVID-19) is caused by severe acute respiratory syndrome coronavirus 2, and within a few months, it has become a global pandemic. This forced many affected countries to take stringent measures such as complete lockdown, shutting down businesses and trade, as well as travel restrictions, which has had a tremendous economic impact. Therefore, having knowledge and foresight about how a country might be able to contain the spread of COVID-19 will be of paramount importance to the government, policy makers, business partners and entrepreneurs. To help social and administrative decision making, a model that will be able to forecast when a country might be able to contain the spread of COVID-19 is needed. The results obtained using our long short-term memory (LSTM) network-based model are promising as we validate our prediction model using New Zealand’s data since they have been able to contain the spread of COVID-19 and bring the daily new cases tally to zero. Our proposed forecasting model was able to correctly predict the dates within which New Zealand was able to contain the spread of COVID-19. Similarly, the proposed model has been used to forecast the dates when other countries would be able to contain the spread of COVID-19. The forecasted dates are only a prediction based on the existing situation. However, these forecasted dates can be used to guide actions and make informed decisions that will be practically beneficial in influencing the real future. The current forecasting trend shows that more stringent actions/restrictions need to be implemented for most of the countries as the forecasting model shows they will take over three months before they can possibly contain the spread of COVID-19.

Publication

Reconstructing evolutionary trajectories of mutation signature activities in cancer using TrackSig

Publisher: Springer Science and Business Media LLC

Date: 05-02-2020

DOI: 10.1038/S41467-020-14352-7

Abstract: The type and genomic context of cancer mutations depend on their causes. These causes have been characterized using signatures that represent mutation types that co-occur in the same tumours. However, it remains unclear how mutation processes change during cancer evolution due to the lack of reliable methods to reconstruct evolutionary trajectories of mutational signature activity. Here, as part of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium, which aggregated whole-genome sequencing data from 2658 cancers across 38 tumour types, we present TrackSig, a new method that reconstructs these trajectories using optimal, joint segmentation and deconvolution of mutation type and allele frequencies from a single tumour sample. In simulations, we find TrackSig has a 3–5% activity reconstruction error, and 12% false detection rate. It outperforms an aggressive baseline in situations with branching evolution, CNA gain, and neutral mutations. Applied to data from 2658 tumours and 38 cancer types, TrackSig permits pan-cancer insight into evolutionary changes in mutational processes.

Tatsuhiko Tsunoda

Researcher

Related Links

Publications

scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning

Protein fold recognition using HMM–HMM alignment and dynamic programming

OPAL+: Length‐Specific MoRF Prediction in Intrinsically Disordered Protein Sequences

The International HapMap Project

Cancer LncRNA Census reveals evidence for deep functional conservation of long noncoding RNAs in tumorigenesis

Subject-Specific-Frequency-Band for Motor Imagery EEG Signal Recognition Based on Common Spatial Spectral Pattern

Comparison of gene expression profiles between Opisthorchis viverrini and Non-opisthorchis viverrini associated human intrahepatic cholangiocarcinoma

Author Correction: The repertoire of mutational signatures in human cancer

Stepwise iterative maximum likelihood clustering approach

EvolStruct-Phogly: incorporating structural properties and evolutionary information from profile bigrams for the phosphoglycerylation prediction

Integrative pathway enrichment analysis of multivariate omics data

MoRFPred-plus: Computational Identification of MoRFs in Protein Sequences using Physicochemical Properties and HMM profiles

Genome-wide association study identifies three novel loci for type 2 diabetes

CNN-Pred: Prediction of single-stranded and double-stranded DNA-binding protein using convolutional neural networks

Patterns of somatic structural variation in human cancer genomes

Bigram-PGK: Phosphoglycerylation prediction using the technique of bigram probabilities of position specific scoring matrix

Combined Genetic and Genealogic Studies Uncover a Large BAP1 Cancer Syndrome Kindred Tracing Back Nine Generations to a Common Ancestor from the 1700s

Application of cepstrum analysis and linear predictive coding for motor imaginary task classification

Success: evolutionary and structural properties of amino acids prove effective for succinylation site prediction

Importance of dimensionality reduction in protein fold recognition

OPAL: prediction of MoRF regions in intrinsically disordered protein sequences

A Deep Learning Approach for Motor Imagery EEG Signal Classification

DeepInsight-FS: Selecting features for non-image data using convolutional neural network

A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns

Hierarchical Maximum Likelihood Clustering Approach

A comparison of machine learning classifiers for dementia with Lewy bodies using miRNA expression data

Divisive hierarchical maximum likelihood clustering

Comprehensive analysis of chromothripsis in 2,658 human cancers using whole-genome sequencing

Genome-wide detection and characterization of positive selection in human populations

Combined burden and functional impact tests for cancer driver discovery using DriverPower

The evolutionary history of 2,658 cancers

Computational Pipelines and Workflows in Bioinformatics

A meta-analysis identifies adolescent idiopathic scoliosis association with LBX1 locus in multiple ethnic groups

Sex differences in oncogenic mutational processes

An integrative machine learning approach for prediction of toxicity-related drug safety

Inferring structural variant cancer cell fraction

DeepFeature: feature selection in nonimage data using convolutional neural network

SucStruct: Prediction of succinylated lysine residues by using structural properties of amino acids

Brain wave classification using long short-term memory network based OPTICAL predictor

2D–EM clustering approach for high-dimensional data through folding feature vectors

Pan-cancer analysis of whole genomes identifies driver rearrangements promoted by LINE-1 retrotransposition

Pan-cancer analysis of whole genomes

An integrative machine learning approach for prediction of toxicity-related drug safety

Discovering MoRFs by trisecting intrinsically disordered protein sequence into terminals and middle regions

Assessment of network module identification across complex diseases

Multiancestry association study identifies new asthma risk loci that colocalize with immune-cell enhancer marks

Risk prediction models for dementia constructed by supervised principal component analysis using miRNA expression data

PSSM-Suc: Accurately predicting succinylation using position specific scoring matrix into bigram for feature extraction

Multi-representation DeepInsight: an improvement on tabular data analysis

Retrospective evaluation of whole exome and genome mutation calls in 746 cancer samples

DeepInsight: A methodology to transform a non-image data to an image for convolution neural network architecture

SumSec: Accurate Prediction of Sumoylation Sites Using Predicted Secondary Structure

Butler enables rapid cloud-based analysis of thousands of human genomes

Comprehensive molecular characterization of mitochondrial genomes in human cancers

PhoglyStruct: Prediction of phosphoglycerylated lysine residues using structural properties of amino acids

Predicting MoRFs in protein sequences using HMM profiles

Predicting protein-peptide binding sites with a deep convolutional neural network

Genomic basis for RNA alterations in cancer

Gene masking - a technique to improve accuracy for cancer classification with high dimensionality in microarray data

Analyses of non-coding somatic drivers in 2,658 cancer whole genomes

Decimation filter with Common Spatial Pattern and Fishers Discriminant Analysis for motor imagery classification

Prognosis prediction model for conversion from mild cognitive impairment to Alzheimer’s disease created by integrative analysis of multi-omics data

SPECTRA: a tool for enhanced brain wave signal recognition

Pathway and network analysis of more than 2500 whole cancer genomes

DeepInsight-3D for precision oncology: an improved anti-cancer drug response prediction from high-dimensional multi-omics data with convolutional neural networks

RAM-PGK: Prediction of Lysine Phosphoglycerylation Based on Residue Adjacency Matrix

Genomic footprints of activated telomere maintenance mechanisms in cancer

scDeepInsight: a supervised cell-type identification method for scRNA-seq data with deep learning

An improved discriminative filter bank selection approach for motor imagery EEG signal classification using mutual information

GlyStruct: Glycation prediction using structural properties of amino acid residues

Predict Gram-Positive and Gram-Negative Subcellular Localization via Incorporating Evolutionary Information and Physicochemical Features Into Chou's General PseAAC

Improving succinylation prediction accuracy by incorporating the secondary structure via helix, strand and coil, and evolutionary information from profile bigrams

HseSUMO: Sumoylation site prediction using half-sphere exposures of amino acids residues

Divergent mutational processes distinguish hypoxic and normoxic tumours

Disruption of chromatin folding domains by somatic genomic rearrangements in human cancer

Single-stranded and double-stranded DNA-binding protein prediction using HMM profiles