Gokcen Cilingir, Shira L Broschat.
Abstract
Supervised machine learning algorithms are used by life scientists for a variety of objectives. Expert-curated public gene and protein databases are major resources for gathering data to train these algorithms. Although these resources are continuously updated, the updates are generally not incorporated into published machine learning algorithms, which can therefore become outdated soon after their introduction. In this paper, we propose a new model of operation for supervised machine learning algorithms that learn from genomic data. By defining these algorithms in a pipeline in which the training data gathering procedure and the learning process are automated, one can create a system that generates a classifier or predictor using information available from public resources. The proposed model is explained using three case studies on SignalP, MemLoci, and ApicoAP, in which existing machine learning models are utilized in pipelines. Given that the vast majority of the procedures described for gathering training data can easily be automated, it is possible to transform valuable machine learning algorithms into self-evolving learners that benefit from the ever-changing data available for gene products, and to develop new machine learning algorithms that are similarly capable.
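The pipeline idea in the abstract can be illustrated with a minimal sketch, assuming a toy in-memory "database" and a deliberately trivial learner. The names `Pipeline`, `make_gatherer`, and `learn_first_residue_rule` are hypothetical and are not taken from SignalP, MemLoci, or ApicoAP; the point is only that rerunning the same automated gather-then-learn steps after the data source changes yields an updated classifier.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]           # (protein sequence, class label)
Classifier = Callable[[str], int]   # maps a sequence to a predicted label


@dataclass
class Pipeline:
    """Hypothetical pipeline: automated data gathering followed by learning."""
    gather: Callable[[], List[Example]]
    learn: Callable[[List[Example]], Classifier]

    def run(self) -> Classifier:
        # Each run re-reads the data source, so the classifier tracks updates.
        return self.learn(self.gather())


def make_gatherer(database: Dict[str, int]) -> Callable[[], List[Example]]:
    """Stand-in for querying a public resource; here just a dict snapshot."""
    def gather() -> List[Example]:
        return list(database.items())
    return gather


def learn_first_residue_rule(examples: List[Example]) -> Classifier:
    """Toy learner (an assumption): majority label per leading residue."""
    votes: Dict[str, Counter] = defaultdict(Counter)
    for seq, label in examples:
        votes[seq[0]][label] += 1
    table = {res: c.most_common(1)[0][0] for res, c in votes.items()}
    default = Counter(lbl for _, lbl in examples).most_common(1)[0][0]

    def classify(seq: str) -> int:
        return table.get(seq[0], default)
    return classify


# Toy data: sequences and labels are made up for illustration.
database = {"MKT": 1, "ARN": 0, "AQQ": 0}
pipeline = Pipeline(make_gatherer(database), learn_first_residue_rule)

clf_v1 = pipeline.run()
database.update({"GSS": 1, "GTT": 1})   # the "public resource" is updated
clf_v2 = pipeline.run()                 # same pipeline, refreshed classifier
```

In this toy setting, `clf_v1("GAA")` falls back to the majority label of the original data, whereas `clf_v2("GAA")` uses the newly added annotations, without any change to the pipeline definition itself.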
Year: 2015 PMID: 25695053 PMCID: PMC4324891 DOI: 10.1155/2015/234236
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1. The SignalP 4.0 Pipeline.
Figure 2. The MemLoci Pipeline.
Figure 3. The ApicoAP Pipeline.
Cardinalities of the positive interim training sets for the 17 apicomplexan species gathered by ApicoAP-CS.
| Apicomplexan species | OrthoMCL (a) | BLAST (b) | Confirmed (c) | All combined (d) | Conflicts removed (e) | Non-SP filtered (f) |
|---|---|---|---|---|---|---|
| | 46 | 45 | 4 | 61 | 59 | 18 |
| | 51 | 50 | 0 | 61 | 58 | 23 |
| | 17 | 24 | 0 | 28 | 25 | 1 |
| | 19 | 27 | 0 | 32 | 29 | 0 |
| | 17 | 24 | 0 | 29 | 26 | 2 |
| | 82 | 61 | 1 | 89 | 84 | 30 |
| | 78 | 68 | 0 | 82 | 77 | 21 |
| | 72 | 73 | 0 | 77 | 73 | 49 |
| | 72 | 73 | 0 | 77 | 72 | 51 |
| | 70 | 72 | 0 | 77 | 73 | 31 |
| | 45 | 60 | 40 | 89 | 85 | 52 |
| | 72 | 72 | 0 | 77 | 73 | 49 |
| | 69 | 72 | 0 | 75 | 71 | 51 |
| | 70 | 68 | 3 | 77 | 73 | 41 |
| | 45 | 47 | 0 | 56 | 54 | 23 |
| | 49 | 42 | 0 | 59 | 57 | 25 |
| | 53 | 59 | 45 | 102 | 96 | 42 |

(a) Cardinality of the set gathered by ortholog search using OrthoMCL.
(b) Cardinality of the set gathered by ortholog search using the BLAST-based algorithm.
(c) Cardinality of the set containing experimentally confirmed positive/negative proteins.
(d) Cardinality of the set that is the union of the sets presented in columns 2, 3, and 4.
(e) Cardinality of the union set when conflicts with the negative/positive set are removed.
(f) Cardinality of the final training set after proteins without signal peptides have been removed.
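The derived columns described in footnotes d, e, and f can be read as plain set operations. The sketch below is a minimal illustration under assumptions: the protein identifiers are made up, `has_signal_peptide` is a hypothetical predicate standing in for whatever signal-peptide check is used, and a "conflict" is taken to mean a protein that also appears in the set with the opposite label.

```python
from typing import Callable, Set, Tuple


def build_interim_sets(
    orthomcl_hits: Set[str],
    blast_hits: Set[str],
    confirmed: Set[str],
    opposite_label_set: Set[str],
    has_signal_peptide: Callable[[str], bool],
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (all combined, conflicts removed, non-SP filtered) sets."""
    # Footnote d: union of the three gathered sets.
    combined = orthomcl_hits | blast_hits | confirmed
    # Footnote e: drop proteins that also occur in the opposite-label set
    # (this reading of "conflict" is an assumption).
    no_conflicts = combined - opposite_label_set
    # Footnote f: keep only proteins that have a signal peptide.
    final = {p for p in no_conflicts if has_signal_peptide(p)}
    return combined, no_conflicts, final


# Made-up identifiers for illustration only.
with_sp = {"p1", "p3", "p5"}
combined, no_conflicts, final = build_interim_sets(
    orthomcl_hits={"p1", "p2", "p3"},
    blast_hits={"p2", "p3", "p4"},
    confirmed={"p5"},
    opposite_label_set={"p4", "p9"},
    has_signal_peptide=lambda p: p in with_sp,
)
print(len(combined), len(no_conflicts), len(final))
```

Because each step can only remove elements from the previous one, the three cardinalities are non-increasing from left to right, which matches the pattern in the last three columns of the table.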
Cardinalities of the negative interim training sets for the 17 apicomplexan species gathered by ApicoAP-CS.
| Apicomplexan species | OrthoMCL (a) | BLAST (b) | Confirmed (c) | All combined (d) | Conflicts removed (e) | Non-SP filtered (f) |
|---|---|---|---|---|---|---|
| | 144 | 136 | 8 | 161 | 159 | 33 |
| | 142 | 130 | 0 | 159 | 156 | 23 |
| | 135 | 130 | 0 | 157 | 154 | 28 |
| | 143 | 137 | 0 | 163 | 160 | 34 |
| | 130 | 129 | 10 | 164 | 161 | 33 |
| | 400 | 175 | 8 | 443 | 438 | 169 |
| | 254 | 220 | 15 | 288 | 283 | 81 |
| | 222 | 212 | 28 | 260 | 256 | 101 |
| | 238 | 223 | 2 | 258 | 253 | 108 |
| | 259 | 224 | 0 | 273 | 269 | 93 |
| | 284 | 173 | 156 | 443 | 439 | 138 |
| | 236 | 227 | 6 | 258 | 254 | 91 |
| | 261 | 227 | 13 | 281 | 277 | 103 |
| | 242 | 216 | 16 | 270 | 266 | 89 |
| | 151 | 133 | 4 | 169 | 167 | 42 |
| | 186 | 128 | 4 | 204 | 202 | 71 |
| | 194 | 198 | 131 | 333 | 327 | 92 |

(a) Cardinality of the set gathered by ortholog search using OrthoMCL.
(b) Cardinality of the set gathered by ortholog search using the BLAST-based algorithm.
(c) Cardinality of the set containing experimentally confirmed positive/negative proteins.
(d) Cardinality of the set that is the union of the sets presented in columns 2, 3, and 4.
(e) Cardinality of the union set when conflicts with the negative/positive set are removed.
(f) Cardinality of the final training set after proteins without signal peptides have been removed.
Cardinalities of the final training sets for the 17 apicomplexan species.
| Apicomplexan species | Positive training set | Negative training set |
|---|---|---|
| | 18 | 30 |
| | 23 | 22 |
| | 1 | 28 |
| | 0 | 34 |
| | 2 | 33 |
| | 30 | 143 |
| | 21 | 77 |
| | 49 | 94 |
| | 51 | 98 |
| | 31 | 90 |
| | 51 | 132 |
| | 48 | 90 |
| | 51 | 101 |
| | 41 | 87 |
| | 23 | 41 |
| | 25 | 61 |
| | 42 | 86 |
ApicoTP classifier performances with the training sets gathered by ApicoAP-CS.
| Apicomplexan species | True negative rate | True positive rate | Overall accuracy |
|---|---|---|---|
| | 1.000 | 1.000 | 1.000 |
| | 0.909 | 1.000 | 0.956 |
| | 0.951 | 0.800 | 0.925 |
| | 1.000 | 0.857 | 0.969 |
| | 0.936 | 0.959 | 0.944 |
| | 0.959 | 0.902 | 0.940 |
| | 1.000 | 0.839 | 0.959 |
| | 0.924 | 0.843 | 0.902 |
| | 0.922 | 0.958 | 0.935 |
| | 0.901 | 0.980 | 0.928 |
| | 0.954 | 0.854 | 0.922 |
| | 0.854 | 0.913 | 0.875 |
| | 0.820 | 0.960 | 0.860 |
| | 0.977 | 0.905 | 0.953 |
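For readers who want to relate the three performance columns to raw counts, the following sketch uses the standard definitions of true negative rate, true positive rate, and overall accuracy. The confusion counts in the example are an assumption: the source does not report them, and they were chosen only because they are consistent with the row reporting 0.909, 1.000, and 0.956 and with a training set of 23 positives and 22 negatives. The function name `classification_rates` is hypothetical.

```python
from typing import Tuple


def classification_rates(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float, float]:
    """Return (true negative rate, true positive rate, overall accuracy)."""
    tnr = tn / (tn + fp)                      # negatives correctly rejected
    tpr = tp / (tp + fn)                      # positives correctly detected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return tnr, tpr, accuracy


# Assumed counts (not from the source): 23 positives all detected,
# 22 negatives of which 2 are misclassified as positive.
tnr, tpr, acc = classification_rates(tp=23, fn=0, tn=20, fp=2)
print(f"{tnr:.3f} {tpr:.3f} {acc:.3f}")  # prints "0.909 1.000 0.956"
```

Since the negative sets are usually larger than the positive sets here, overall accuracy is weighted toward the true negative rate, which is worth keeping in mind when comparing rows.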