Cristiano Lacerda Nunes Pinto1, Cristiane Neri Nobre2, Luis Enrique Zárate2.
Abstract
BACKGROUND: Correctly identifying protein coding regions is an important and persistent problem in molecular biology. It remains challenging because biological systems are incompletely understood and the conserved characteristics of messenger RNA (mRNA) are poorly known. It is therefore essential to investigate computational methods that help discover patterns for identifying Translation Initiation Sites (TIS). In Bioinformatics, machine learning methods based on inductive inference, such as the Inductive Support Vector Machine (ISVM), have been widely applied. By contrast, little attention has been given to machine learning methods based on transductive inference, such as the Transductive Support Vector Machine (TSVM). Transductive inference performs well on problems in which unlabeled sequences considerably outnumber labeled ones. TIS prediction may likewise benefit from transductive methods, because the number of new sequences grows rapidly as genome projects enable the study of new organisms. Consequently, this work investigates transductive learning for TIS identification and compares its results with those obtained by the inductive method.
Keywords: Machine learning; SVM; TSVM; Transductive learning; Translation initiation site; mRNA
Year: 2017 PMID: 28152994 PMCID: PMC5290616 DOI: 10.1186/s12859-017-1502-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Representation of an mRNA sequence according to the scanning model in eukaryotes
Fig. 2 Schematic representation of the ISVM and TSVM evaluation methodology for the TIS prediction problem
Fig. 3 Box plot of the CDS region size per organism
Fig. 4 Frequency histogram of the intervals in the size of the CDS region from Mus musculus
Frequency histogram of the intervals in the size of the CDS region from Mus musculus
| Interval | [94,376) | [376,659) | [659,941) | [941,1220) | [1220,1510) | [1510,1790) | [1790,2070) |
|---|---|---|---|---|---|---|---|
| Relative frequency (%) | 7.59 | 18.21 | 15.46 | 10.98 | 10.34 | 7.59 | 5.76 |
| Median | 235 | 518 | 800 | 1081 | 1365 | 1650 | 1930 |
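The values in the Median row appear to coincide with the midpoint of each interval, rounded half up. A minimal Python check under that assumption (the name `interval_midpoint` is illustrative and does not come from the paper):

```python
# Interval bounds [low, high) copied from the table above.
INTERVALS = [
    (94, 376), (376, 659), (659, 941), (941, 1220),
    (1220, 1510), (1510, 1790), (1790, 2070),
]


def interval_midpoint(low: int, high: int) -> int:
    """Midpoint of [low, high), rounded half up with integer arithmetic."""
    return (low + high + 1) // 2


midpoints = [interval_midpoint(low, high) for low, high in INTERVALS]
print(midpoints)  # [235, 518, 800, 1081, 1365, 1650, 1930]
```

Integer arithmetic is used instead of the built-in `round`, which rounds halves to the nearest even integer in Python 3 and would turn 1080.5 into 1080 rather than 1081.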
Fig. 5 A sequence of an mRNA with the identification of the regions
Amount of sequences extracted by classification and amount of duplicated sequences eliminated during the preprocessing
| Downstream | TIS, non-duplicated | TIS, duplicated | Upstream out of phase (nTIS), non-duplicated | Upstream out of phase (nTIS), duplicated | Upstream in phase, non-duplicated | Upstream in phase, duplicated | CDS in phase, non-duplicated | CDS in phase, duplicated | CDS out of phase, non-duplicated | CDS out of phase, duplicated | |
|---|---|---|---|---|---|---|---|---|---|---|---|
|
| |||||||||||
| 235 | 113 | 49 | 123 | 61 | 58 | 34 | 11703 | 1373 | 9738 | 1989 | |
| 518 | 100 | 38 | 124 | 47 | 60 | 29 | 8630 | 1120 | 6161 | 1638 | |
| 800 | 81 | 29 | 114 | 41 | 58 | 28 | 2141 | 945 | 2170 | 1451 | |
| 1081 | 66 | 22 | 101 | 39 | 57 | 24 | 546 | 824 | 983 | 1278 | |
| 1365 | 48 | 14 | 86 | 37 | 54 | 15 | 463 | 741 | 822 | 1158 | |
| 1650 | 42 | 11 | 69 | 28 | 40 | 11 | 420 | 675 | 720 | 1056 | |
|
| |||||||||||
| 235 | 678 | 358 | 776 | 471 | 308 | 170 | 5154 | 4853 | 8364 | 8067 | |
| 518 | 581 | 272 | 810 | 384 | 315 | 147 | 4323 | 3779 | 7230 | 6316 | |
| 800 | 466 | 203 | 726 | 331 | 288 | 127 | 3612 | 2927 | 6102 | 5000 | |
| 1081 | 398 | 158 | 632 | 293 | 260 | 113 | 2976 | 2311 | 5318 | 3931 | |
| 1365 | 319 | 113 | 568 | 234 | 242 | 92 | 2506 | 1839 | 4440 | 3124 | |
| 1650 | 277 | 79 | 495 | 187 | 208 | 68 | 2104 | 1463 | 3757 | 2454 | |
|
| |||||||||||
| 235 | 13564 | 7271 | 17729 | 9177 | 6972 | 3386 | 109658 | 109137 | 194726 | 192024 | |
| 518 | 13124 | 5674 | 18760 | 7188 | 7606 | 2492 | 94503 | 83527 | 171463 | 150192 | |
| 800 | 11579 | 4398 | 17917 | 5914 | 7334 | 1986 | 79368 | 63663 | 148148 | 117440 | |
| 1081 | 9716 | 3366 | 16085 | 4902 | 6629 | 1677 | 65717 | 48341 | 126260 | 90786 | |
| 1365 | 7753 | 2469 | 13662 | 3853 | 5649 | 1371 | 54818 | 37422 | 106842 | 71030 | |
| 1650 | 5877 | 1793 | 10918 | 2871 | 4537 | 1098 | 46233 | 29136 | 91066 | 56808 | |
|
| |||||||||||
| 235 | 15225 | 10455 | 26777 | 18065 | 12378 | 8252 | 142022 | 194250 | 200046 | 285816 | |
| 518 | 13723 | 9076 | 27548 | 16202 | 12787 | 7432 | 119185 | 162288 | 171884 | 244359 | |
| 800 | 11942 | 7745 | 26905 | 14748 | 12581 | 6704 | 99615 | 134106 | 146638 | 208787 | |
| 1081 | 10122 | 6594 | 25725 | 13314 | 12086 | 6092 | 82645 | 110225 | 124842 | 178443 | |
| 1365 | 8344 | 5400 | 23695 | 11929 | 11233 | 5474 | 69079 | 91113 | 106093 | 153047 | |
| 1650 | 6657 | 4390 | 21482 | 10472 | 10227 | 4740 | 58253 | 75705 | 90979 | 131754 | |
|
| |||||||||||
| 235 | 20867 | 5157 | 15869 | 3515 | 6542 | 1319 | 196447 | 56135 | 388223 | 116238 | |
| 518 | 18440 | 9200 | 15663 | 2677 | 6555 | 975 | 145585 | 38519 | 299284 | 82624 | |
| 800 | 14948 | 3013 | 14112 | 2122 | 5968 | 750 | 105415 | 25929 | 221892 | 56195 | |
| 1081 | 11082 | 2046 | 11644 | 1592 | 4942 | 562 | 74236 | 17100 | 160512 | 38259 | |
| 1365 | 7683 | 1281 | 8453 | 1112 | 3658 | 399 | 51462 | 11625 | 115329 | 26194 | |
| 1650 | 4967 | 839 | 5952 | 808 | 2582 | 283 | 36505 | 8297 | 83812 | 18706 | |
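The table above groups candidate ATG codons by region (upstream of the TIS or within the CDS) and by reading frame relative to the TIS. The following is a minimal sketch of such a grouping, assuming that "in phase" means the ATG lies a multiple of three nucleotides away from the annotated TIS. The function name, the 0-based indexing and the toy sequence are illustrative and not taken from the paper. The sketch does not locate the stop codon, so every ATG after the TIS is counted under CDS.

```python
def classify_atgs(mrna: str, tis_index: int) -> dict:
    """Group every ATG in `mrna` by region and frame relative to the TIS.

    `tis_index` is the 0-based position of the A of the annotated start codon.
    """
    groups = {
        "TIS": [],
        "upstream_in_phase": [],
        "upstream_out_of_phase": [],
        "cds_in_phase": [],
        "cds_out_of_phase": [],
    }
    seq = mrna.upper().replace("U", "T")  # accept RNA or DNA alphabet
    for i in range(len(seq) - 2):
        if seq[i:i + 3] != "ATG":
            continue
        if i == tis_index:
            groups["TIS"].append(i)
            continue
        region = "upstream" if i < tis_index else "cds"
        # Python's % is non-negative for a positive modulus, so this also
        # works for upstream positions (negative offsets).
        phase = "in_phase" if (i - tis_index) % 3 == 0 else "out_of_phase"
        groups[region + "_" + phase].append(i)
    return groups


# Toy sequence (hypothetical): ATGs at positions 0, 4, 9, 15 and 19.
toy = "ATGAATGCCATGGCCATGAATGTAA"
result = classify_atgs(toy, tis_index=9)
print(result)
```

With the TIS at position 9, the ATG at 0 is upstream and in phase, the one at 4 is upstream and out of phase, the one at 15 is in the CDS and in phase, and the one at 19 is in the CDS and out of phase.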
Amount of sequences after the elimination of duplicated sequences
| Downstream | TIS | nTIS | TIS/nTIS | TIS | nTIS | TIS/nTIS | TIS | nTIS | TIS/nTIS | TIS | nTIS | TIS/nTIS | TIS | nTIS | TIS/nTIS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 235 | 113 | 123 | 0.9187 | 678 | 776 | 0.8737 | 13564 | 17729 | 0.7651 | 15225 | 26777 | 0.5686 | 20867 | 15869 | 1.3150 |
| 518 | 100 | 120 | 0.8334 | 581 | 810 | 0.7173 | 13124 | 18760 | 0.6996 | 13723 | 27548 | 0.4981 | 18440 | 15663 | 1.1773 |
| 800 | 81 | 114 | 0.7105 | 466 | 726 | 0.6419 | 11579 | 17917 | 0.6462 | 11942 | 26905 | 0.4438 | 14948 | 14112 | 1.0592 |
| 1081 | 66 | 101 | 0.6535 | 398 | 632 | 0.6297 | 9716 | 16085 | 0.6040 | 10122 | 25725 | 0.3935 | 11082 | 11644 | 0.9517 |
| 1365 | 48 | 86 | 0.5581 | 319 | 568 | 0.5616 | 7753 | 13662 | 0.5675 | 8344 | 23695 | 0.3521 | 7683 | 8453 | 0.9089 |
| 1650 | 42 | 69 | 0.6087 | 277 | 495 | 0.5596 | 5877 | 10918 | 0.5383 | 6657 | 21482 | 0.3099 | 4967 | 5952 | 0.8345 |
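Each TIS/nTIS column is the number of positive (TIS) sequences divided by the number of negative (nTIS) sequences remaining after duplicate removal. A small Python sketch of that arithmetic, using counts from the first column group of the table (the helper name `tis_ntis_ratio` is illustrative):

```python
def tis_ntis_ratio(n_tis: int, n_ntis: int) -> float:
    """Ratio of TIS to nTIS sequences, rounded to four decimal places."""
    if n_ntis <= 0:
        raise ValueError("nTIS count must be positive")
    return round(n_tis / n_ntis, 4)


# Counts from the first column group at downstream sizes 235 and 800.
ratio_235 = tis_ntis_ratio(113, 123)
ratio_800 = tis_ntis_ratio(81, 114)
print(ratio_235, ratio_800)  # 0.9187 0.7105
```

A ratio below 1 means that negative examples outnumber positive ones, which is the class imbalance that grows with the downstream size in most column groups.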
Validation precision results using the ISVM and TSVM methods for Scenarios 1 and 2
| Downstream | Scenario 1: ISVM (inductive) | Scenario 1: TSVM (transductive) | Scenario 2: ISVM (inductive) | Scenario 2: TSVM (transductive) |
|---|---|---|---|---|
|
| ||||
| 235 | 79.22±2.71 | 81.13±5.08 | 69.81±4.45 | 72.00±2.72 |
| 518 | 89.00±5.51 | 89.33±3.03 | 82.33±3.05 | 79.52±2.72 |
| 800 | 90.69±4.23 | 89.07±3.65 | 94.37±3.01 | 84.71±2.43 |
| 1081 | 89.40±8.60 | 89.66±5.26 | 97.66±4.33 | 79.48±3.99 |
| 1365 | 96.00±4.95 | 85.00±7.59 | 88.00±3.71 | 77.58±3.45 |
| 1650 | 100.00±0.00 | 93.57±6.32 | 100.00±0.00 | 77.31±3.42 |
|
| ||||
| 235 | 88.51±2.36 | 87.29±0.93 | 83.32±1.04 | 84.32±0.55 |
| 518 | 93.41±1.45 | 93.37±1.46 | 93.17±1.10 | 92.45±0.66 |
| 800 | 98.23±0.80 | 97.38±0.80 | 97.86±0.48 | 93.68±0.47 |
| 1081 | 99.20±0.75 | 97.94±1.18 | 98.68±0.42 | 95.45±0.55 |
| 1365 | 99.35±1.19 | 98.70±0.97 | 99.57±0.20 | 96.16±1.20 |
| 1650 | 99.62±0.68 | 99.25±0.91 | 99.69±0.19 | 96.84±1.55 |
|
| ||||
| 235 | 91.99±0.43 | 90.48±0.16 | 90.50±0.30 | 87.41±0.11 |
| 518 | 96.15±0.24 | 94.98±0.11 | 95.72±0.09 | 94.97±0.54 |
| 800 | 97.83±0.17 | 97.69±0.06 | 97.55±0.24 | 96.57±0.05 |
| 1081 | 98.03±0.33 | 97.69±0.11 | 97.85±0.23 | 97.42±0.04 |
| 1365 | 98.81±0.23 | 98.43±0.10 | 98.52±0.21 | 98.08±0.06 |
| 1650 | 99.04±0.31 | 98.76±0.13 | 98.63±0.22 | 98.39±0.06 |
|
| ||||
| 235 | 93.38±0.38 | 93.46±0.20 | 91.97±0.35 | 90.32±0.07 |
| 518 | 95.74±0.34 | 95.75±0.13 | 95.37±0.17 | 94.47±0.06 |
| 800 | 96.73±0.28 | 96.92±0.06 | 96.57±0.30 | 95.53±0.06 |
| 1081 | 96.86±0.26 | 96.74±0.07 | 96.76±0.25 | 96.20±0.08 |
| 1365 | 97.23±0.41 | 97.07±0.14 | 97.33±0.17 | 96.65±0.08 |
| 1650 | 97.71±0.27 | 97.93±0.12 | 97.64±0.27 | 96.57±0.16 |
|
| ||||
| 235 | 93.10±0.22 | 93.73±0.26 | 91.39±0.16 | 92.77±0.06 |
| 518 | 97.05±0.28 | 97.50±0.13 | 96.30±0.10 | 97.26±0.04 |
| 800 | 98.16±0.20 | 98.58±0.13 | 97.84±0.05 | 98.46±0.04 |
| 1081 | 98.76±0.20 | 98.96±0.09 | 98.50±0.04 | 99.06±0.02 |
| 1365 | 99.03±0.17 | 99.31±0.14 | 98.85±0.09 | 99.32±0.02 |
| 1650 | 99.22±0.02 | 99.54±0.14 | 99.18±0.05 | 99.35±0.07 |
Validation sensitivity results using the ISVM and TSVM methods for Scenarios 1 and 2
| Downstream | Scenario 1: ISVM (inductive) | Scenario 1: TSVM (transductive) | Scenario 2: ISVM (inductive) | Scenario 2: TSVM (transductive) |
|---|---|---|---|---|
|
| ||||
| 235 | 81.30±8.20 | 78.77±4.36 | 61.63±4.69 | 72.00±2.73 |
| 518 | 88.00±8.67 | 91.00±3.33 | 59.89±6.90 | 79.67±2.73 |
| 800 | 88.89±4.16 | 88.89±4.16 | 34.66±10.69 | 84.63±2.38 |
| 1081 | 82.50±13.15 | 85.83±6.55 | 18.01±14.06 | 78.42±4.01 |
| 1365 | 79.17±13.71 | 82.50±7.10 | 11.41±13.73 | 75.03±3.37 |
| 1650 | 81.66±10.27 | 95.00±6.19 | 7.93±2.63 | 77.44±3.77 |
|
| ||||
| 235 | 83.88±3.94 | 87.16±0.84 | 76.93±1.51 | 84.21±0.45 |
| 518 | 90.97±1.58 | 92.77±1.50 | 81.75±1.45 | 92.36±0.39 |
| 800 | 95.28±2.43 | 96.97±0.66 | 78.16±2.35 | 93.72±0.49 |
| 1081 | 95.70±1.74 | 97.94±1.18 | 79.54±1.47 | 95.53±0.59 |
| 1365 | 96.58±1.58 | 97.94±1.49 | 67.05±3.67 | 96.16±0.78 |
| 1650 | 97.40±2.06 | 98.35±1.78 | 65.25±3.94 | 96.74±0.65 |
|
| ||||
| 235 | 88.72±0.44 | 82.83±0.33 | 90.52±0.28 | 87.42±0.11 |
| 518 | 95.26±0.25 | 91.92±0.26 | 95.71±0.08 | 94.69±0.17 |
| 800 | 97.17±0.20 | 94.12±0.18 | 97.53±0.26 | 96.57±0.08 |
| 1081 | 97.74±0.27 | 95.89±0.19 | 97.84±0.23 | 97.44±0.04 |
| 1365 | 98.31±0.30 | 96.47±0.13 | 98.52±0.21 | 98.09±0.04 |
| 1650 | 98.33±0.35 | 96.61±0.22 | 98.60±0.24 | 98.41±0.07 |
|
| ||||
| 235 | 90.28±0.46 | 85.98±0.30 | 91.96±0.34 | 90.33±0.07 |
| 518 | 94.98±0.23 | 91.98±0.25 | 95.38±0.17 | 94.48±0.06 |
| 800 | 96.38±0.16 | 93.01±0.17 | 96.57±0.30 | 95.54±0.06 |
| 1081 | 96.80±0.38 | 94.82±0.21 | 96.76±0.25 | 96.21±0.08 |
| 1365 | 97.36±0.45 | 95.38±0.23 | 97.31±0.18 | 96.66±0.07 |
| 1650 | 97.32±0.39 | 94.42±0.28 | 97.70±0.30 | 96.57±0.16 |
|
| ||||
| 235 | 94.74±0.37 | 93.75±0.27 | 94.10±0.14 | 92.76±0.05 |
| 518 | 98.13±0.17 | 97.50±0.13 | 97.73±0.09 | 97.26±0.04 |
| 800 | 99.25±0.10 | 98.57±0.12 | 99.01±0.05 | 98.46±0.04 |
| 1081 | 99.38±0.10 | 98.94±0.10 | 99.24±0.06 | 99.06±0.02 |
| 1365 | 99.48±0.13 | 99.30±0.14 | 99.44±0.08 | 99.32±0.03 |
| 1650 | 99.68±0.18 | 99.48±0.21 | 99.44±0.11 | 99.35±0.08 |
Validation F-measure results using the ISVM and TSVM methods for Scenarios 1 and 2
| Downstream | Scenario 1: ISVM (inductive) | Scenario 1: TSVM (transductive) | Scenario 2: ISVM (inductive) | Scenario 2: TSVM (transductive) |
|---|---|---|---|---|
|
| ||||
| 235 | 79.88±5.07 | 79.89±4.58 | 65.04±3.19 | 71.99±2.72 |
| 518 | 88.00±6.42 | 90.09±2.77 | 69.00±4.78 | 79.53±2.33 |
| 800 | 89.44±2.39 | 88.92±3.71 | 46.96±12.52 | 84.64±2.16 |
| 1081 | 84.58±10.31 | 87.51±5.46 | 25.05±12.10 | 84.64±2.87 |
| 1365 | 84.06±9.84 | 83.57±6.94 | 14.84±13.31 | 76.00±1.91 |
| 1650 | 88.95±6.52 | 94.23±6.12 | 14.43±4.36 | 77.15±2.57 |
|
| ||||
| 235 | 86.04±2.80 | 87.23±0.88 | 79.96±0.81 | 84.26±0.50 |
| 518 | 92.13±0.84 | 93.06±1.42 | 87.05±0.69 | 92.40±0.38 |
| 800 | 96.68±1.26 | 97.17±0.69 | 86.85±1.40 | 93.70±0.41 |
| 1081 | 97.40±1.09 | 97.94±1.18 | 88.06±0.90 | 95.49±0.57 |
| 1365 | 97.93±1.17 | 98.30±1.05 | 79.99±2.56 | 96.14±0.52 |
| 1650 | 98.47±1.17 | 98.78±1.16 | 78.70±2.81 | 96.76±0.68 |
|
| ||||
| 235 | 90.32±0.29 | 86.49±0.13 | 90.51±0.29 | 87.42±0.11 |
| 518 | 95.71±0.11 | 93.43±0.13 | 95.72±0.08 | 94.83±0.32 |
| 800 | 97.50±0.15 | 95.87±0.08 | 97.54±0.25 | 96.57±0.06 |
| 1081 | 97.89±0.20 | 96.78±0.09 | 97.85±0.23 | 97.43±0.04 |
| 1365 | 98.56±0.18 | 97.44±0.07 | 98.52±0.21 | 98.09±0.05 |
| 1650 | 98.69±0.23 | 97.68±0.11 | 98.62±0.23 | 98.40±0.06 |
|
| ||||
| 235 | 91.81±0.38 | 89.56±0.09 | 91.97±0.34 | 90.33±0.07 |
| 518 | 95.36±0.21 | 93.82±0.07 | 95.38±0.17 | 94.48±0.06 |
| 800 | 96.56±0.17 | 94.93±0.07 | 96.57±0.30 | 95.54±0.06 |
| 1081 | 96.83±0.24 | 95.77±0.08 | 96.76±0.25 | 96.21±0.08 |
| 1365 | 97.30±0.31 | 96.22±0.09 | 97.32±0.17 | 96.66±0.07 |
| 1650 | 97.52±0.16 | 96.14±0.13 | 97.67±0.28 | 96.57±0.14 |
|
| ||||
| 235 | 93.91±0.18 | 93.73±0.27 | 92.73±0.04 | 92.76±0.05 |
| 518 | 97.59±0.16 | 97.50±0.13 | 97.01±0.05 | 97.25±0.04 |
| 800 | 98.70±0.11 | 98.57±0.13 | 98.42±0.02 | 98.45±0.04 |
| 1081 | 99.06±0.14 | 98.95±0.09 | 98.86±0.03 | 99.05±0.02 |
| 1365 | 99.25±0.10 | 99.30±0.14 | 99.14±0.04 | 99.31±0.02 |
| 1650 | 99.44±0.16 | 99.50±0.17 | 99.30±0.05 | 99.35±0.04 |
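The three preceding tables report precision, sensitivity and F-measure. The paper's exact formulas are not reproduced in this extract, so the sketch below assumes the standard definitions from a binary confusion matrix, with the F-measure as the harmonic mean of precision and sensitivity. The counts in the example are made up for illustration and are not results from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted TIS that are true TIS."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true TIS that were predicted as TIS (recall)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p: float, s: float) -> float:
    """Harmonic mean of precision `p` and sensitivity `s`."""
    return 2 * p * s / (p + s) if (p + s) else 0.0


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p = precision(tp=90, fp=10)
s = sensitivity(tp=90, fn=30)
f = f_measure(p, s)
print(f"precision={p:.4f} sensitivity={s:.4f} F={f:.4f}")
```

Because the tabulated values are means over validation runs, the F-measure in a given row need not equal the harmonic mean of the precision and sensitivity reported for that row.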
Fig. 6 ROC curve for (a) Rattus norvegicus and (b) Mus musculus
TSVM’s retraining computational cost
| Organism | Number of SV | Time (s) |
|---|---|---|
| | 165 | 2 |
| | 544 | 6 |
| | 4275 | 759 |
| | 4537 | 1175 |
| | 3188 | 219 |
Comparison among methods
| Method | Hit | not Hit | Hit | not Hit | Hit | not Hit | Hit | not Hit | Hit | not Hit |
|---|---|---|---|---|---|---|---|---|---|---|
| TransduTIS-I | 109 | 16 | 22 | 14 | 102 | 11 | 95 | 11 | 15 | 0 |
| TransduTIS-T | 122 | 3 | 36 | 0 | 107 | 6 | 105 | 1 | 15 | 0 |
| TISHunter | 112 | 13 | 35 | 1 | 106 | 7 | 93 | 13 | 14 | 1 |
| TIS Miner | 89 | 36 | 34 | 2 | 91 | 22 | 76 | 30 | 12 | 3 |
| NetStart | 109 | 16 | 31 | 5 | 84 | 29 | 78 | 28 | 5 | 10 |
TransduTIS-I and TransduTIS-T are, respectively, the inductive and transductive approaches developed in this work
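The Hit and not Hit counts can be turned into a hit rate per method, which makes the rows easier to compare. A minimal Python sketch using the first pair of columns of the comparison table; the names `FIRST_GROUP` and `hit_rate` are illustrative, and treating hit rate as hits divided by all tested sequences is an assumption.

```python
# (hit, not hit) pairs from the first pair of columns of the comparison table.
FIRST_GROUP = {
    "TransduTIS-I": (109, 16),
    "TransduTIS-T": (122, 3),
    "TISHunter": (112, 13),
    "TIS Miner": (89, 36),
    "NetStart": (109, 16),
}


def hit_rate(hit: int, not_hit: int) -> float:
    """Share of sequences whose TIS was correctly identified."""
    return hit / (hit + not_hit)


rates = {method: hit_rate(*counts) for method, counts in FIRST_GROUP.items()}
best = max(rates, key=rates.get)
for method, rate in rates.items():
    print(f"{method}: {rate:.1%}")
print("highest hit rate:", best)
```

For this column pair, every method is evaluated on 125 sequences, and TransduTIS-T reaches 122/125 = 97.6%.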