
Fast model-based protein homology detection without alignment.

Sepp Hochreiter, Martin Heusel, Klaus Obermayer.

Abstract

MOTIVATION: As more genomes are sequenced, the demand for fast gene classification techniques is increasing. To analyze a newly sequenced genome, the genes are first identified and translated into amino acid sequences, which are then classified into structural or functional classes. The best-performing protein classification methods are based on protein homology detection using sequence alignment methods. Alignment methods have recently been enhanced by discriminative methods like support vector machines (SVMs) as well as by position-specific scoring matrices (PSSMs) obtained from PSI-BLAST. However, alignment methods are time consuming if a new sequence must be compared to many known sequences; the same holds for SVMs. Constructing a PSSM for the new sequence is even more time consuming. The best-performing methods would take about 25 days on present-day computers to classify the sequences of a new genome (20,000 genes) as belonging to just one specific class; however, there are hundreds of classes. Another shortcoming of alignment algorithms is that they do not build a model of the positive class but measure the mutual distance between sequences or profiles. Multiple alignments and hidden Markov models are the only popular classification methods that build a model of the positive class, but they show low classification performance. The advantage of a model is that it can be analyzed for chemical properties common to the class members to obtain new insights into protein function and structure. We propose a fast model-based recurrent neural network for protein homology detection, the 'Long Short-Term Memory' (LSTM). LSTM automatically extracts indicative patterns for the positive class, but in contrast to profile methods it also extracts negative patterns and uses correlations between all detected patterns for classification. LSTM can automatically extract useful local and global sequence statistics such as hydrophobicity, polarity, volume and polarizability, and combine them with a pattern. These properties make LSTM complementary to alignment-based approaches, as it does not use predefined similarity measures like BLOSUM or PAM matrices.
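To make "local and global sequence statistics" concrete, the following is a minimal sketch of how one such statistic, hydrophobicity, could be computed by hand for an amino acid sequence. It is not the authors' method: the abstract states that LSTM extracts such statistics automatically and names no particular scale. The Kyte-Doolittle hydropathy values, the function names and the window size of 5 are all assumptions chosen for illustration.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter codes).
# Assumption: the abstract names no scale; this one is used only for illustration.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def global_hydrophobicity(seq):
    """Global statistic: mean hydropathy over the whole sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def local_hydrophobicity(seq, window=5):
    """Local statistic: mean hydropathy in each sliding window of the given size."""
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # hypothetical sequence fragment
    print(global_hydrophobicity(example))
    print(local_hydrophobicity(example))
```

A hand-written statistic like this fixes the scale and the window in advance. According to the abstract, LSTM learns which statistics and patterns are indicative, without predefined similarity measures.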
RESULTS: We have applied LSTM to a well-known benchmark for remote protein homology detection, where a protein must be classified as belonging to a SCOP superfamily. LSTM reaches state-of-the-art classification performance but is considerably faster for classification than other approaches with comparable classification performance. LSTM is five orders of magnitude faster than methods which perform slightly better in classification and two orders of magnitude faster than the fastest SVM-based approaches (which, however, have lower classification performance than LSTM). Only PSI-BLAST and HMM-based methods show time complexity comparable to that of LSTM, but they cannot compete with LSTM in classification performance. To test the modeling capabilities of LSTM, we applied LSTM to PROSITE classes and interpreted the extracted patterns. In 8 out of 15 classes, LSTM automatically extracted the PROSITE motif. In the remaining 7 cases, alternative motifs are generated which give better classification results on average than the PROSITE motifs.
AVAILABILITY: The LSTM algorithm is available from http://www.bioinf.jku.at/software/LSTM_protein/.
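The timing figures quoted in the abstract can be turned into per-sequence costs with some simple arithmetic. The sketch below uses only the numbers stated above (25 days, 20,000 genes, five orders of magnitude). Two points are assumptions rather than statements from the abstract: that the 25-day figure refers to the same methods that are five orders of magnitude slower than LSTM, and that 100 classes is a reasonable lower bound for "hundreds of classes".

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_sequence(total_days, n_sequences):
    """Average classification time per sequence, in seconds."""
    return total_days * SECONDS_PER_DAY / n_sequences


# Figures from the abstract: about 25 days for 20,000 genes and one class.
slow_per_seq = seconds_per_sequence(25, 20_000)  # 108.0 seconds per sequence

# Assumption: the 25-day methods are the ones five orders of magnitude slower than LSTM.
SPEEDUP = 10 ** 5
fast_per_seq = slow_per_seq / SPEEDUP            # roughly a millisecond per sequence
fast_genome_seconds = fast_per_seq * 20_000      # roughly 20 seconds per class

# Assumption: 100 classes as a lower bound for "hundreds of classes".
N_CLASSES = 100
slow_all_classes_days = 25 * N_CLASSES           # 2500 days

if __name__ == "__main__":
    print(f"slow method: {slow_per_seq:.1f} s per sequence")
    print(f"fast method: {fast_per_seq * 1000:.2f} ms per sequence")
    print(f"fast method, whole genome, one class: {fast_genome_seconds:.1f} s")
    print(f"slow method, {N_CLASSES} classes: {slow_all_classes_days} days")
```

Under these assumptions, a per-class genome scan drops from weeks to well under a minute, which is the scale of difference the abstract describes.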

Year:  2007        PMID: 17488755     DOI: 10.1093/bioinformatics/btm247

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

