
A Data Adaptive Biological Sequence Representation for Supervised Learning.

Hande Cakin, Berk Gorgulu, Mustafa Gokce Baydogan, Na Zou, Jing Li.

Abstract

Proper gene expression plays a vital role in the function of an organism. Recent advancements in DNA microarray technology allow the expression levels of thousands of genes to be monitored. One important task in this context is to understand the underlying mechanisms of gene regulation. Recently, researchers have focused on identifying local DNA elements, or motifs, to infer the relation between a gene's expression and its nucleotide sequence. This study proposes a novel data-adaptive representation approach for supervised learning that predicts the response associated with biological sequences. Biological sequences such as DNA and protein are a class of categorical sequences. In machine learning, categorical sequences are generally mapped to a lower-dimensional representation for learning tasks to avoid the problems of high dimensionality. The proposed method, SW-RF (sliding window-random forest), is a feature-based approach that learns a representation for categorical sequences in two main steps. In the first step, each sequence is represented by overlapping subsequences of constant length. In the second step, a tree-based learner is trained on this representation to obtain a bag-of-words-like representation: for each sequence, the frequency of its subsequences at the terminal nodes of the tree. After representation learning, any classifier can be trained on the learned representation. Here, a lasso logistic regression is trained on it to facilitate the identification of patterns that are important for the classification task. Our experiments show that the proposed approach provides significantly better accuracy on both synthetic data and DNA promoter sequence data. Moreover, missing values, a common problem in microarray datasets, are handled efficiently by the tree learners in SW-RF. Although the focus of this paper is on biological sequences, SW-RF is flexible enough to handle categorical sequence data from other applications. © Springer Nature Switzerland AG 2018.
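The two representation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the window width, the use of relative rather than raw frequencies, and in particular `toy_leaf` (a hand-written stand-in for a trained tree that maps each window to a terminal-node id) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List


def sliding_windows(sequence: str, width: int) -> List[str]:
    """Step 1: all overlapping subsequences of constant length `width`."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def toy_leaf(window: str) -> int:
    """Hypothetical stand-in for a trained tree: window -> terminal-node id.

    In SW-RF this mapping would come from tree learners fitted on the
    windows; here it is hard-coded so the sketch stays self-contained.
    """
    if window[0] == "A":
        return 0
    if window[1] == "C":
        return 1
    return 2


def leaf_histogram(
    sequence: str,
    width: int,
    assign_leaf: Callable[[str], int],
    n_leaves: int,
) -> List[float]:
    """Step 2: bag-of-words-like vector of relative leaf frequencies."""
    windows = sliding_windows(sequence, width)
    counts = Counter(assign_leaf(w) for w in windows)
    total = len(windows)
    return [counts[leaf] / total if total else 0.0 for leaf in range(n_leaves)]


# One fixed-length feature vector per sequence, regardless of sequence length.
features = leaf_histogram("ACGTCCGA", width=3, assign_leaf=toy_leaf, n_leaves=3)
print(features)
```

Each sequence ends up as one fixed-length vector, so any standard classifier (the abstract uses a lasso logistic regression) could be fitted on a matrix of such vectors.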


Keywords:  Biological sequences; Categorical; Classification; Gene expression; Representation learning; Time series

Year:  2018        PMID: 35415416      PMCID: PMC8982718          DOI: 10.1007/s41666-018-0038-5

Source DB:  PubMed          Journal:  J Healthc Inform Res        ISSN: 2509-498X


