
pysster: classification of biological sequences by learning sequence and structure motifs with convolutional neural networks.

Stefan Budach, Annalisa Marsico

Abstract

Summary: Convolutional neural networks (CNNs) have been shown to perform exceptionally well in a variety of tasks, including biological sequence classification. Available implementations, however, are usually optimized for a particular task and difficult to reuse. To enable researchers to utilize these networks more easily, we implemented pysster, a Python package for training CNNs on biological sequence data. Sequences are classified by learning sequence and structure motifs, and the package offers an automated hyper-parameter optimization procedure and options to visualize learned motifs along with information about their positional and class enrichment. The package runs seamlessly on CPU and GPU and provides a simple interface to train and evaluate a network with a handful of lines of code. Using an RNA A-to-I editing dataset and cross-linking immunoprecipitation (CLIP)-seq binding site sequences, we demonstrate that pysster classifies sequences with higher accuracy than previous methods, such as GraphProt or ssHMM, and is able to recover known sequence and structure motifs.

Availability and implementation: pysster is freely available at https://github.com/budach/pysster.

Supplementary information: Supplementary data are available at Bioinformatics online.

Year:  2018        PMID: 29659719      PMCID: PMC6129303          DOI: 10.1093/bioinformatics/bty222

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

In recent years, deep convolutional neural networks (CNNs) have been shown to be an accurate method for biological sequence classification and sequence motif detection (Alipanahi et al.; Angermueller et al.; Kelley et al.). The increasing amount of sequence data and the rise of general-purpose computing on graphics processing units (GPUs) have enabled CNNs to outperform other machine learning methods, such as random forests and support vector machines, in terms of both classification performance and runtime performance (Angermueller et al.; Kelley et al.). While a number of publications have made use of CNNs on biological data, these implementations are usually hard to reuse (Alipanahi et al.) or tailored to a specific problem, such as prediction of DNA CpG methylation from single-cell data (Angermueller et al.). Basset (Kelley et al.) and iDeep (Pan and Shen, 2017) represent more general frameworks for training CNNs on DNA and RNA data, respectively. However, they do not provide detailed motif interpretations, such as motif locations and class enrichment of motifs, and they are not able to learn structure motifs in the RNA case. For the specific task of classifying RNA by learning sequence and structure motifs, other tools not based on neural networks are available, e.g. RNAcontext (Kazan et al.) and GraphProt (Maticzka et al.). These tools usually learn only a single motif and, like CNN-based methods, provide no additional information such as motif locations.

To address these issues and to enable researchers to easily utilize CNNs, we implemented pysster, a Python package for training CNN classifiers on biological sequences. Supervised classification is enabled by the automatic detection of sequence motifs. Our package focuses on interpretability and extends previous implementations by providing information about the positional and class enrichment of learned motifs. Moreover, by incorporating structure information, e.g. in the form of secondary structure predictions for RNA sequences, it is possible to learn structure motifs corresponding to the sequence motifs. We demonstrate that our tool is able to learn well-known motifs for an RNA A-to-I editing dataset and multiple CLIP-seq datasets and that it outperforms GraphProt, a state-of-the-art classifier for RNA sequences and structures, in terms of both classification and runtime performance. By providing a simple programming interface, we hope that our package enables more researchers to make effective use of CNNs in classifying and interpreting large sets of biological sequences.
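To make the notions of positional and class enrichment more concrete, the following is a minimal sketch in plain Python. It is not pysster's implementation: the function names (`one_hot`, `scan`, `enrichment`), the hand-written "UGU" kernel, the toy sequences and the activation threshold are all assumptions chosen for illustration. A single convolutional kernel is slid over each one-hot encoded sequence, and the position of its maximum activation is recorded per class whenever the activation reaches the threshold.

```python
def one_hot(seq, alphabet="ACGU"):
    """One-hot encode a string: one row per position, one column per symbol."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for symbol in seq:
        row = [0] * len(alphabet)
        row[index[symbol]] = 1
        rows.append(row)
    return rows


def scan(encoded, kernel):
    """Slide a kernel (length k x alphabet size) over the encoded sequence.

    Returns one activation per window start (no padding), i.e. the output
    of a single 1D convolutional kernel without bias or non-linearity.
    """
    k = len(kernel)
    activations = []
    for start in range(len(encoded) - k + 1):
        score = sum(
            encoded[start + i][j] * kernel[i][j]
            for i in range(k)
            for j in range(len(kernel[i]))
        )
        activations.append(score)
    return activations


def enrichment(sequences, labels, kernel, threshold):
    """Collect, per class, the position of the maximum activation of the
    kernel in every sequence whose maximum reaches the (assumed) threshold."""
    positions = {}
    for seq, label in zip(sequences, labels):
        activations = scan(one_hot(seq), kernel)
        best = max(range(len(activations)), key=activations.__getitem__)
        if activations[best] >= threshold:
            positions.setdefault(label, []).append(best)
    return positions


# Hypothetical toy data: a kernel that responds to "UGU" and four sequences.
kernel = one_hot("UGU")
sequences = ["AAUGUAA", "CUGUCCC", "AAAAAAA", "ACACACA"]
labels = ["bound", "bound", "unbound", "unbound"]
hits = enrichment(sequences, labels, kernel, threshold=3)
for label, where in hits.items():
    print(label, "motif hits at positions", where)
```

In this toy setting, the number of hits per class plays the role of class enrichment and the distribution of hit positions plays the role of positional enrichment. A trained network would supply learned real-valued kernels instead of the hand-written one used here.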

2 Implementation and features

We implemented an established network architecture and multiple interpretation options as an easy-to-use Python package. The basic architecture of the network consists of a variable number of convolutional and max-pooling layers followed by a variable number of dense layers (Fig. 1A). Dropout layers are placed after the input layer and after every max-pooling and dense layer. Using an automated grid search, the network can be tuned via a number of hyper-parameters, such as the number of convolutional layers, the number of kernels, the length of kernels and the dropout ratios. The main features of the package are:
- multi-class and single-label or multi-label classifications
- sensible default parameters and an optional hyper-parameter tuning
- interpretation of learned motifs in terms of positional and class enrichment (Supplementary Figs S1 and S3) and motif co-occurrence (Supplementary Fig. S2)
- support of input strings over user-defined alphabets (i.e. applicable to DNA, RNA and protein data)
- optional use of structure information, handcrafted features and recurrent layers
- visualization of all network layers using visualization by optimization (Olah et al.)
- seamless CPU or GPU computation by building on top of TensorFlow (Abadi et al.)

Fig. 1. (A) The basic network architecture consists of a variable number of convolutional/max-pooling stacks followed by a variable number of dense layers interspersed by dropout layers. The network can be tuned via an extensive list of hyper-parameters. The network inputs are one-hot encoded sequences and the network outputs predicted probabilities, indicating class membership. (B) RNA sequence and structure input strings are encoded into a single string by combining the sequence alphabet and the secondary structure alphabet into an extended alphabet consisting of arbitrary characters. Subsequently, this string is one-hot encoded and used as the network input. (C) For the motif interpretation the string over the arbitrary alphabet can be decoded into the two original strings to construct sequence logos for the original alphabets. The shown example motif corresponds to an ALU repeat motif found in the classification task of RNA A-to-I editing sites (see the tutorial workflow on GitHub for detailed information).

Structure information (e.g. for RNA in the form of dot-bracket strings or annotated dot-bracket strings) is incorporated into the network by encoding the given sequence string over an alphabet of size N and the corresponding structure string over an alphabet of size M into a single new string using an extended alphabet of size N*M (Fig. 1B). The network is then trained on these new strings, which can be decoded back into the original strings after the training to enable the visualization of two motifs as position-weight matrices (Fig. 1C). Detailed descriptions of the model and visualization options can be found in the tutorials and documentation.
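The encoding of a sequence string and a structure string into one string over an extended alphabet of size N*M can be sketched as follows. This is an illustrative sketch only, not pysster's code: the helper names (`build_codebook`, `combine`, `split`, `one_hot`), the use of plain dot-bracket characters as the structure alphabet, and the choice of ASCII letters as the "arbitrary characters" are assumptions.

```python
import string
from itertools import product


def build_codebook(seq_alphabet, struct_alphabet):
    """Map every (sequence symbol, structure symbol) pair to one arbitrary
    character, giving an extended alphabet of size N * M."""
    pairs = list(product(seq_alphabet, struct_alphabet))
    if len(pairs) > len(string.ascii_letters):
        raise ValueError("extended alphabet too large for this sketch")
    symbols = string.ascii_letters[:len(pairs)]
    encode = dict(zip(pairs, symbols))
    decode = {symbol: pair for pair, symbol in encode.items()}
    return encode, decode


def combine(seq, struct, encode):
    """Merge a sequence and its structure string into a single string."""
    if len(seq) != len(struct):
        raise ValueError("sequence and structure must have equal length")
    return "".join(encode[(base, state)] for base, state in zip(seq, struct))


def split(combined, decode):
    """Recover the two original strings from the combined string."""
    pairs = [decode[symbol] for symbol in combined]
    return "".join(p[0] for p in pairs), "".join(p[1] for p in pairs)


def one_hot(combined, symbols):
    """One-hot encode the combined string over the extended alphabet."""
    index = {symbol: i for i, symbol in enumerate(symbols)}
    matrix = []
    for symbol in combined:
        row = [0] * len(symbols)
        row[index[symbol]] = 1
        matrix.append(row)
    return matrix


# N = 4 (RNA bases), M = 3 (dot-bracket symbols), extended alphabet size 12.
encode, decode = build_codebook("ACGU", "().")
combined = combine("GGCAUCC", "((...))", encode)
network_input = one_hot(combined, sorted(decode))
print(combined, len(network_input), "x", len(network_input[0]))
print(split(combined, decode))
```

Because the mapping between pairs and extended symbols is one-to-one, a motif learned over the extended alphabet can be split back into a sequence motif and a structure motif, which is what allows two separate logos to be drawn.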

3 Case studies

pysster is freely available at https://github.com/budach/pysster. Its documentation includes a workflow tutorial that showcases the main functionality on an RNA A-to-I editing dataset (Picardi et al.). Editing sites are known to be enriched in repetitive ALU sequences, and we show that pysster classifies the editing location with high accuracy and learns known ALU motifs (Fig. 1C). Finally, we have trained our tool on sequences derived from CLIP-seq data for a number of RNA-binding proteins with known binding site motifs (Supplementary Table S1). Like GraphProt and ssHMM (Heller et al.), an unsupervised hidden Markov model-based approach for learning sequence/structure motifs, pysster can recover the known motifs, and it outperforms GraphProt in terms of both classification and runtime performance.

Conflict of Interest: none declared.
References

1.  Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.

Authors:  Babak Alipanahi; Andrew Delong; Matthew T Weirauch; Brendan J Frey
Journal:  Nat Biotechnol       Date:  2015-07-27       Impact factor: 54.908

2.  REDIportal: a comprehensive database of A-to-I RNA editing events in humans.

Authors:  Ernesto Picardi; Anna Maria D'Erchia; Claudio Lo Giudice; Graziano Pesole
Journal:  Nucleic Acids Res       Date:  2016-09-01       Impact factor: 16.971

3.  RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins.

Authors:  Hilal Kazan; Debashish Ray; Esther T Chan; Timothy R Hughes; Quaid Morris
Journal:  PLoS Comput Biol       Date:  2010-07-01       Impact factor: 4.475

4.  ssHMM: extracting intuitive sequence-structure motifs from high-throughput RNA-binding protein data.

Authors:  David Heller; Ralf Krestel; Uwe Ohler; Martin Vingron; Annalisa Marsico
Journal:  Nucleic Acids Res       Date:  2017-11-02       Impact factor: 16.971

5.  GraphProt: modeling binding preferences of RNA-binding proteins.

Authors:  Daniel Maticzka; Sita J Lange; Fabrizio Costa; Rolf Backofen
Journal:  Genome Biol       Date:  2014-01-22       Impact factor: 13.583

6.  Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks.

Authors:  David R Kelley; Jasper Snoek; John L Rinn
Journal:  Genome Res       Date:  2016-05-03       Impact factor: 9.043

7.  RNA-protein binding motifs mining with a new hybrid deep learning based cross-domain knowledge integration approach.

Authors:  Xiaoyong Pan; Hong-Bin Shen
Journal:  BMC Bioinformatics       Date:  2017-02-28       Impact factor: 3.169

8.  DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning.

Authors:  Christof Angermueller; Heather J Lee; Wolf Reik; Oliver Stegle
Journal:  Genome Biol       Date:  2017-04-11       Impact factor: 13.583
