Kazunori D Yamada1,2, Kengo Kinoshita3,4,5.
Abstract
BACKGROUND: Long short-term memory (LSTM) is one of the most attractive deep learning methods for learning time series or the context of input data. A growing number of studies, including biological sequence analyses in bioinformatics, use this architecture. Amino acid sequence profiles are widely used in bioinformatics studies, such as sequence similarity searches, multiple alignments, and evolutionary analyses. Many biological sequences are now becoming available, and the rapidly increasing amount of sequence data emphasizes the importance of scalable generators of amino acid sequence profiles.
Keywords: Deep learning; Long short-term memory; Neural networks; Protein sequence profile; Sequence context; Similarity search
Year: 2018 PMID: 30021530 PMCID: PMC6052547 DOI: 10.1186/s12859-018-2284-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Network of learning. a Overview of the network designed in this study. The symbols in the panel denote an input vector, an output vector, and the position t in an amino acid sequence. In the squares, “Embed,” “Full connect,” and “Softmax” stand for a word embedding operation, a fully connected network, and a softmax function layer, respectively. The solid and broken arrows represent a matrix operation and an array operation, respectively. The numbers at the bottom of panel (a) give the dimension of the vectors in each layer. b Description of the LSTM layer. The symbols in the panel denote an input vector to an LSTM unit, an output vector from an LSTM unit, a previous input vector, and a unit for constant error; ×, +, dot, τ, and σ stand for multiplication of matrices, summation of matrices, a Hadamard product calculation, a hyperbolic tangent, and a sigmoid function, respectively; the remaining symbols denote a weight matrix to be learned, another weight matrix, and a bias vector
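For orientation, the LSTM layer described in panel b can be written out as the standard forget-gate LSTM update. This is a sketch under assumptions, not the authors' exact formulation: the symbol names x_t (input), h_t (output), c_t (unit for constant error), W and R (the two weight matrices), and b (bias) are chosen here for illustration, and the record does not state whether the network uses variants such as peephole connections. The caption's τ and σ are kept for the hyperbolic tangent and the sigmoid, and ⊙ denotes the Hadamard product.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + R_i h_{t-1} + b_i\right) && \text{(input gate)}\\
f_t &= \sigma\left(W_f x_t + R_f h_{t-1} + b_f\right) && \text{(forget gate)}\\
o_t &= \sigma\left(W_o x_t + R_o h_{t-1} + b_o\right) && \text{(output gate)}\\
z_t &= \tau\left(W_z x_t + R_z h_{t-1} + b_z\right) && \text{(candidate update)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot z_t && \text{(unit for constant error)}\\
h_t &= o_t \odot \tau\left(c_t\right) && \text{(output vector)}
\end{aligned}
```

In this form the additive update of c_t is what lets information from earlier sequence positions persist, which is the "memory" examined in Fig. 3.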
Fig. 2 Performance comparisons of (a, b) similarity searches and (c) calculation time. a ROC curves of SPBuild and other methods. The performance of blastpgp is included for reference. b The pAUC values of SPBuild, CSBuild, RPS-BLAST, and blastpgp. c Scatterplot of the profile generation time for each method on the SCOP20 test dataset
Comparison of profile generation times
| Method | Mean (s) | SD (s) |
|---|---|---|
| SPBuild | 5.99 | 3.83 |
| CSBuild | 0.390 | 0.161 |
| RPS-BLAST | 0.208 | 0.102 |
| HHBlits | 120 | 105 |
Means and standard deviations (SDs) of profile generation times (s) for the 5819 sequences in the SCOP20 test dataset
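To make the relative speeds in the table easier to read, the mean times can be turned into ratios. The following is a minimal sketch: the numbers are copied from the table, while the dictionary and function names are hypothetical and chosen only for this illustration.

```python
# Mean profile generation times in seconds, copied from the table above.
MEAN_TIME_S = {
    "SPBuild": 5.99,
    "CSBuild": 0.390,
    "RPS-BLAST": 0.208,
    "HHBlits": 120.0,
}


def time_ratio(method, baseline, times=MEAN_TIME_S):
    """Return how many times longer `method` takes than `baseline` on average."""
    return times[method] / times[baseline]


if __name__ == "__main__":
    for other in ("CSBuild", "RPS-BLAST"):
        print(f"SPBuild / {other}: {time_ratio('SPBuild', other):.1f}x")
    print(f"HHBlits / SPBuild: {time_ratio('HHBlits', 'SPBuild'):.1f}x")
```

On these means, SPBuild takes roughly 15 and 29 times longer than CSBuild and RPS-BLAST, respectively, and HHBlits takes about 20 times longer than SPBuild. The SDs are large relative to the means, so ratios for individual sequences can differ considerably.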
Fig. 3 Effects of the memory power of LSTM on predictors. a Comparison of profile generators with various reset lengths of memory on LSTM. The benchmark dataset was the SCOP20 test dataset. The reset time of SPBuild corresponded to the input sequence length. b Mean cosine similarity between the output vectors of SPBuild and the target vectors as a function of the position of residues in the input sequences of the SCOP20 test dataset
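The quantity plotted in panel b, a mean cosine similarity per residue position, can be computed along the following lines. This is a sketch under assumptions: the function names and the nested-list data layout (one list of per-position vectors per sequence) are hypothetical, and the authors' actual code may differ.

```python
import math
from collections import defaultdict


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_cosine_by_position(predicted, target):
    """Average the cosine similarity over sequences, separately for each position.

    `predicted` and `target` are lists of sequences; each sequence is a list
    of vectors, one per residue position. Sequences may differ in length, so
    later positions are averaged over fewer sequences.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pred_seq, tgt_seq in zip(predicted, target):
        for pos, (p, t) in enumerate(zip(pred_seq, tgt_seq)):
            sums[pos] += cosine_similarity(p, t)
            counts[pos] += 1
    return [sums[pos] / counts[pos] for pos in sorted(counts)]
```

For example, `mean_cosine_by_position([[[1, 0], [0, 1]], [[1, 0]]], [[[1, 0], [1, 0]], [[0, 1]]])` averages two sequences at position 0 and one at position 1.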
Fig. 4 Relative sensitivity of SPBuild against that of existing methods on the test dataset. a The relative sensitivity of SPBuild against existing methods was calculated by dividing the pAUC of SPBuild by that of each method. Here, the label “others” includes SCOP classes e, f, and g. b The relative sensitivity of the profile generator with various memory powers of LSTM against CSBuild. c The relative sensitivity of the profile generator with various memory powers of LSTM against RPS-BLAST
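The relative sensitivity defined in panel a is a simple ratio of pAUC values, which can be sketched as follows. The function name is hypothetical, and the pAUC values in the usage note are placeholders for illustration only, not results from the study.

```python
def relative_sensitivity(pauc_spbuild, pauc_by_method):
    """Divide the pAUC of SPBuild by the pAUC of each comparison method.

    A value above 1 means SPBuild is more sensitive than that method;
    a value below 1 means it is less sensitive.
    """
    ratios = {}
    for method, pauc in pauc_by_method.items():
        if pauc <= 0:
            raise ValueError(f"pAUC for {method!r} must be positive, got {pauc}")
        ratios[method] = pauc_spbuild / pauc
    return ratios
```

With placeholder inputs, `relative_sensitivity(0.5, {"method_a": 0.25, "method_b": 1.0})` gives a ratio of 2.0 for `method_a` and 0.5 for `method_b`. Applying the same function per SCOP class yields the kind of per-class comparison shown in the figure.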
Fig. 5 Performance comparisons of similarity searches on the SCOP20 strict-test dataset. ROC curves of SPBuild and other methods. The performances of HHBlits (three iterations) and blastpgp are included for reference