Jiale Liu, Xinqi Gong.
Abstract
BACKGROUND: Recurrent neural networks (RNNs) are well suited to processing sequential data, but they handle long sequences inefficiently. Long short-term memory (LSTM), a variant of the RNN, solves this problem to some extent. Here we improve LSTM for big-data applications in predicting protein-protein interaction interface residue pairs, for two reasons. First, LSTM has deficiencies such as shallow layers and exploding or vanishing gradients; as data volumes grow dramatically, the imbalance between algorithm innovation and big-data processing has become more serious and urgent. Second, predicting protein-protein interaction interface residue pairs is an important problem in biology, but low prediction accuracy compels us to propose new computational methods.
Keywords: Attention; LSTM; Monte Carlo; Protein-protein interaction prediction; Residual architecture
Year: 2019 PMID: 31775612 PMCID: PMC6882172 DOI: 10.1186/s12859-019-3199-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A standard RNN model with three layers (input, recurrent, and output), whose outputs are activated by linear or nonlinear functions acting on previous or subsequent inputs. The arrows show the flow in detail
Fig. 2 The memory block with one cell of an LSTM neural network
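To make the memory block in Fig. 2 concrete, the following is a minimal sketch of one time step of a single-unit LSTM cell in its common textbook form (input, forget, and output gates plus a candidate value). It is an illustration only: the function names, the scalar weights, and the `params` layout are assumptions made here, and the variant used in the paper may differ.

```python
import math


def sigmoid(x):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One time step of a single-unit LSTM memory block (scalar input and state).

    params maps each gate name to a hypothetical (w, u, b) triple:
    input weight, recurrent weight, and bias.
    """
    def pre_activation(gate):
        w, u, b = params[gate]
        return w * x_t + u * h_prev + b

    i_t = sigmoid(pre_activation("input"))        # how much new content to write
    f_t = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    o_t = sigmoid(pre_activation("output"))       # how much cell state to expose
    g_t = math.tanh(pre_activation("candidate"))  # candidate cell content

    c_t = f_t * c_prev + i_t * g_t                # updated cell state
    h_t = o_t * math.tanh(c_t)                    # updated hidden output
    return h_t, c_t


# Illustrative weights only; they are not taken from the paper.
example_params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}

h, c = 0.0, 0.0
for x in [0.5, -1.0, 0.25]:
    h, c = lstm_cell_step(x, h, c, example_params)
```

Because the cell state is updated additively (`f_t * c_prev + i_t * g_t`), information can persist over many steps, which is how LSTM eases the long-sequence problem of the plain RNN.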
Fig. 3 The flow from methods to application in this paper
The accuracy order of dimers in the test set
| Accuracy order | layer_10 | layer_20 | layer_30 | layer_40 | layer_50 | layer_70 | layer_56 | layer_58 | layer_59 | layer_60 | layer_61 | layer_62 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1H9D | 0.002534 | 0.003481 | 0.000013 | 0.000040 | 0.000067 | 0.000747 | 0.003801 | 0.001147 | 0.000854 | 0.000053 | 0.017938 | 0.001227 |
| 1GL1 | 0.018904 | 0.006083 | 0.012480 | 0.000708 | 0.003592 | 0.008416 | 0.011222 | 0.000105 | 0.001363 | 0.005086 | 0.005034 | 0.000026 |
| 2G77 | 0.009398 | 0.006355 | 0.002103 | 0.000076 | 0.001325 | 0.000098 | 0.002614 | 0.001325 | 0.000443 | 0.000636 | 0.000210 | 0.002914 |
| 2VDB | 0.000991 | 0.000991 | 0.002419 | 0.000091 | 0.001487 | 0.000417 | 0.002680 | 0.001213 | 0.000972 | 0.000202 | 0.000913 | 0.004108 |
| 1KTZ | 0.011788 | 0.006598 | 0.004096 | 0.007914 | 0.014994 | 0.022055 | 0.060532 | 0.005134 | 0.003077 | 0.002094 | 0.034992 | 0.004874 |
| 1S1Q | 0.003033 | 0.002597 | 0.000437 | 0.002757 | 0.000827 | 0.001126 | 0.001815 | 0.003699 | 0.006112 | 0.000758 | 0.000184 | 0.009720 |
| 1BUH | 0.000137 | 0.002547 | 0.001425 | 0.010694 | 0.007806 | 0.000742 | 0.004908 | 0.003434 | 0.001229 | 0.009923 | 0.016499 | 0.000185 |
| 1BKD | 0.003846 | 0.000317 | 0.002938 | 0.002416 | 0.000311 | 0.000386 | 0.000053 | 0.000945 | 0.002301 | 0.000227 | 0.000724 | 0.001468 |
| 1GPW | 0.000556 | 0.000281 | 0.004957 | 0.001203 | 0.001449 | 0.000311 | 0.002241 | 0.000160 | 0.000226 | 0.000386 | 0.000647 | 0.000496 |
| 1SYX | 0.000989 | 0.006525 | 0.000537 | 0.000141 | 0.001271 | 0.001864 | 0.001328 | 0.000141 | 0.009181 | 0.000876 | 0.001977 | 0.002740 |
| 1Z5Y | 0.029783 | 0.001220 | 0.001341 | 0.000157 | 0.006787 | 0.003635 | 0.001981 | 0.004903 | 0.008816 | 0.000254 | 0.000157 | 0.002778 |
| mean | 0.007451 | 0.003363 | 0.002977 | 0.002382 | 0.003629 | 0.003618 | 0.008470 | 0.002019 | 0.003143 | **0.001863** | 0.007207 | 0.002776 |
Note: mean is the average of each column, bold marks the minimal mean value, and layer_m means that the number of layers is m
The accuracy order of dimers in the test set with layer_60
| Accuracy order | unit_5 | unit_6 | unit_7 | unit_8 | unit_9 |
|---|---|---|---|---|---|
| 1H9D | 0.002574 | 0.000293 | 0.000373 | 0.000053 | 0.006642 |
| 1GL1 | 0.006397 | 0.000419 | 0.000052 | 0.005086 | 0.000629 |
| 2G77 | 0.000336 | 0.004471 | 0.003813 | 0.000636 | 0.006704 |
| 2VDB | 0.000848 | 0.000339 | 0.008646 | 0.000202 | 0.000711 |
| 1KTZ | 0.014790 | 0.001890 | 0.015494 | 0.002094 | 0.004689 |
| 1S1Q | 0.024311 | 0.001287 | 0.006916 | 0.000758 | 0.001677 |
| 1BUH | 0.000751 | 0.000332 | 0.000703 | 0.009923 | 0.003493 |
| 1BKD | 0.003591 | 0.001284 | 0.007017 | 0.000227 | 0.000078 |
| 1GPW | 0.002180 | 0.000311 | 0.000401 | 0.000386 | 0.000571 |
| 1SYX | 0.005085 | 0.004633 | 0.035678 | 0.000876 | 0.001215 |
| 1Z5Y | 0.004928 | 0.001135 | 0.000556 | 0.000254 | 0.007379 |
| mean | 0.005981 | 0.001490 | 0.007241 | 0.001863 | 0.003072 |
The prediction results of layer_60_unit_8 in the test set
| PDB Code | 1H9D | 1GL1 | 2G77 | 2VDB | 1KTZ | 1S1Q | 1BUH | 1BKD | 1GPW | 1SYX | 1Z5Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Protein function | OX | EI | OG | OX | OR | OX | EI | OG | OX | OX | ES |
| RFPP | 4 | 194 | 142 | 31 | 113 | 33 | 1017 | 73 | 77 | 31 | 21 |
| Number of surface residue pairs | 74980 | 38141 | 223440 | 153360 | 53955 | 43520 | 102490 | 321630 | 199500 | 35400 | 82800 |
| Accuracy order (‰) | 0.053 | 5.086 | 0.636 | 0.202 | 2.094 | 0.758 | 9.923 | 0.227 | 0.386 | 0.876 | 0.254 |
| NCPD (threshold m) | 1‰ | 3 | 8 | | | | | | | | |
| NCPD (count n) | 8 | 4 | 7 | | | | | | | | |
| Number of interface residue pairs | 501 | 300 | 425 | 382 | 188 | 245 | 301 | 687 | 434 | 210 | 264 |
| Random experiment | 141 | 124 | 442 | 364 | 274 | 173 | 317 | 413 | 401 | 165 | 296 |
Note: NCPD(m‰) = n means that n dimers satisfy the inequality accuracy order ≤ m‰; the result in the last row is explained in the next section
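The tabulated values are consistent with the accuracy order being the RFPP divided by the number of surface residue pairs, expressed per mille (for 1H9D, 4 / 74980 ≈ 0.053‰). The sketch below recomputes that row and the NCPD count from the table's own numbers. The relation is inferred from the table rather than quoted from the paper, and the function names are hypothetical.

```python
# RFPP and number of surface residue pairs, copied from the table above.
rfpp = {
    "1H9D": 4, "1GL1": 194, "2G77": 142, "2VDB": 31, "1KTZ": 113, "1S1Q": 33,
    "1BUH": 1017, "1BKD": 73, "1GPW": 77, "1SYX": 31, "1Z5Y": 21,
}
surface_pairs = {
    "1H9D": 74980, "1GL1": 38141, "2G77": 223440, "2VDB": 153360,
    "1KTZ": 53955, "1S1Q": 43520, "1BUH": 102490, "1BKD": 321630,
    "1GPW": 199500, "1SYX": 35400, "1Z5Y": 82800,
}


def accuracy_order_permille(rank, n_pairs):
    """Assumed relation: rank of the first positive prediction over all pairs, in per mille."""
    return 1000.0 * rank / n_pairs


def ncpd(orders, threshold_permille):
    """Number of dimers whose accuracy order is at most the given threshold."""
    return sum(1 for value in orders.values() if value <= threshold_permille)


orders = {
    code: accuracy_order_permille(rfpp[code], surface_pairs[code])
    for code in rfpp
}

for code, value in orders.items():
    print(f"{code}: {value:.3f} per mille")
print("NCPD(1 per mille) =", ncpd(orders, 1.0))
```

Under this assumed relation, a smaller accuracy order means the first true interface pair appears earlier in the ranked list relative to the size of the search space, which makes dimers of different sizes comparable.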
Comparison with PAIRpred, PPiPP and multi-layered LSTM
| Data set | Method | Variant | RFTP(10%) | RFTP(25%) | RFTP(50%) | RFTP(75%) | RFTP(90%) |
|---|---|---|---|---|---|---|---|
| DBD 3.0 | PPiPP | | 9 | 19 | 78 | 297 | 760 |
| DBD 3.0 | PAIRpred_1 | No post-processing | 2 | 13 | 68 | 257 | 804 |
| DBD 3.0 | PAIRpred_2 | No post-processing | 1 | 5 | 22 | 89 | 282 |
| DBD 3.0 | PAIRpred_2 | With post-processing | 1 | 3 | 16 | 103 | 272 |
| DBD 4.0 | PAIRpred_2 | No post-processing | 2 | 6 | 19 | 75 | 340 |
| DBD 4.0 | PAIRpred_2 | With post-processing | 1 | 3 | 18 | 101 | 282 |
| DBD 5.0 | Multi-layered LSTM Network | lstm_1_ | 12 | 53 | 139 | 175 | 331 |
| DBD 5.0 | Multi-layered LSTM Network | lstm_5_ | 13 | 17 | 46 | 146 | 271 |
| DBD 5.0 | Multi-layered LSTM Network | lstm_6_ | 1 | 2 | 7 | 639 | 1384 |
| DBD 5.0 | Multi-layered LSTM Network | lstm_5_ | 4 | 13 | 36 | 94 | 847 |
| DBD 5.0 | our model | layer_60_ | 4 | 31 | 33 | 113 | 194 |
Note: lstm_m_nodes_n means the model has m LSTM layers, and each layer has n units
Comparison by choosing the top 1‰ residue pairs
| Methods | Precision |
|---|---|
| multi-layer LSTM | 30.8% |
| different machine learning | 42.4% |
| our model | 72.7% |
Fig. 4 Predictions for different model parameters, where code_m_n means that the number of LSTM layers is n and the number of units in each LSTM layer is m. The vertical axis represents accuracy order and the horizontal axis the PDB code
Fig. 5 Model architecture, where the big block LSTM is defined as mentioned above
Fig. 6 Some predicted protein-protein interaction interface residue pairs, highlighted on the surface and shown in different colors with amino acid name and site in the corresponding chains. a 1H9D b 2VDB c 1GL1 d 1BUH
The data partition structure and homology (≥30%)
| Train(32) | Validation(11) | Test(11) | Homology(%) |
|---|---|---|---|
| 1UDI,1EWY,2SIC,2I25,7CEI,2I9B,1FFW,1ACB, 2J0T,1OC0,1Y64,2O3B,1MAH,1DFJ, 1R0R,1BVN, 2OUL,2ABZ,2A5T,2HLE,1GLA,1WQ1,1ATN,1GHQ, 2B42,1R6Q,1CLV,1KXQ,1IBR,1KAC, 1US7,1AK4 | 1OYV,2PCC,1CGI, 2AJF,1B6C,1MQ8, 1FC2,1AY7,1ZM4, 4CPA,1KXP | 1H9D,1GL1,2G77, 2VDB,1KTZ,1S1Q, 1BUH,1BKD,1GPW,1SYX,1Z5Y | 1KXQ,1BVN(tr,tr,98.59); 2I25,1H9D(tr,te,40); 2ABZ,4CPA(tr,va,97.72); 4CPA,1H9D(va,te,33.33); 2SIC,1OYV(tr,va,68.5); 1GPW,1H9D(te,te,33.33); 2SIC,1R0R(tr,tr,68.25); 1BUH,1H9D(te,te,33.33) |
Note: A,B(C,D,E) in the homology column means that the homology between dimers A and B is E%, where C and D are the corresponding data partitions of A and B (tr: train, va: validation, te: test).
The 9 features and their computation
| Features | Abbreviation | Software or Researchers |
|---|---|---|
| Interior Contact area | IC | Qcontacts |
| Exterior Contact area with other residues | EC | Qcontacts |
| Exterior Void area | EV | NACCESS, Qcontacts |
| Absolute Exterior Solvent Accessible area | AESA | NACCESS |
| Relative Exterior Solvent Accessible area | RESA | NACCESS |
| Hydropathy index, version 1 | H1 | Jack Kyte et al. |
| Hydropathy index, version 2 | H2 | David Eisenberg |
| pKa1: computation | pKa1 | PROPKA3.1 |
| pKa2: standard | pKa2 | PROPKA3.1 |
Fig. 7 Big block LSTM with no connection within the same layer and full connection between two adjacent layer networks. To simplify the network, we consider only an input with one unit in layer l and an output with one unit in layer l+2