Caroline König, Ilmira Shaim, Alfredo Vellido, Enrique Romero, René Alquézar, Jesús Giraldo.
Abstract
Biocuration in the omics sciences has become paramount as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible, publicly available databases has become a central task in the dissemination of biological knowledge. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools, using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes, are relevant in cell communication, and are major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by label noise, as they were consistently misclassified by machine learning methods. Here, we extend our analysis to recent, substantially modified versions of the database and show that their labeling is now extremely accurate, using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method for identifying problematic labeling cases as a tool for database biocuration.
Year: 2018 PMID: 29977071 PMCID: PMC6033909 DOI: 10.1038/s41598-018-28330-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Number of receptors in each subtype for the class C GPCR datasets from the different database versions, including percentages of sequences preserved from one version to the next.
| Subtype | 2011 | May 2016 | Sept 2016 | 2011 ∩ May 2016 | May 2016 ∩ Sept 2016 |
|---|---|---|---|---|---|
| mG | 351 | 467 | 516 | 93 (26%) | 357 (76%) |
| CS | 48 | 125 | 103 | 10 (21%) | 91 (73%) |
| GB | 208 | 60 | 89 | 10 (5%) | 50 (83%) |
| Ta | 65 | 193 | 228 | 42 (65%) | 129 (67%) |
| VN | 344 | 0 | 0 | | |
| Ph | 392 | 0 | 0 | | |
| Od | 102 | 0 | 0 | | |
| Orphans | 147 | 193 | 18 | 0 | 18 (9%) |
| Total | 1657 | 1038 | 954 | 155 | 645 |
Receptor acronyms as described in the main text. The last two columns reflect the intersection between different database versions.
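The percentages in the last two columns are consistent with dividing the size of the intersection by the number of sequences in the earlier of the two versions (for example, 10 of 208 GB sequences gives 5%). A minimal sketch of that calculation follows, assuming each database version is available as a collection of sequence identifiers; the function name `preserved` and the synthetic identifiers are hypothetical and only sized to mirror two rows of the table.

```python
def preserved(old_ids, new_ids):
    """Return (count, percent) of identifiers in old_ids that also appear in new_ids."""
    old, new = set(old_ids), set(new_ids)
    shared = old & new
    # Percentage is taken relative to the earlier version (an assumption
    # that matches the figures reported in the table).
    percent = round(100 * len(shared) / len(old)) if old else 0
    return len(shared), percent

# Synthetic identifiers sized like the GB row: 208 in 2011, 60 in May 2016, 10 shared.
gb_2011 = {f"gb_{i}" for i in range(208)}
gb_may_2016 = {f"gb_{i}" for i in range(198, 258)}
print(preserved(gb_2011, gb_may_2016))
```

Using sets makes the comparison independent of the order in which sequences are listed in each release.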
Figure 1. Subtype distribution (number of sequences and percentage) for the different databases: Left - March 2011, Middle - May 2016, Right - September 2016. Orphans are not included.
Classification results for the 2011 version dataset. Prot2Vec1 corresponds to the Swiss-Prot-based representation and Prot2Vec2 corresponds to the GPCRdb-based representation.
| Model | Classifier | Accuracy | MCC | F-measure |
|---|---|---|---|---|
| AAC | SVM | 0.8855 | 0.8549 | 0.8842 |
| | RF | 0.8570 | 0.8207 | 0.8542 |
| | NB | 0.7033 | 0.6307 | 0.7046 |
| Digram | SVM | | | |
| | RF | 0.9139 | 0.8929 | 0.9124 |
| | NB | 0.8358 | 0.7949 | 0.8375 |
| ACC | SVM | 0.9252 | 0.9054 | 0.9234 |
| | RF | 0.8894 | 0.8624 | 0.8838 |
| | NB | 0.8430 | 0.8064 | 0.8455 |
| Prot2Vec1 | SVM | 0.8987 | 0.8715 | 0.8981 |
| | RF | 0.8596 | 0.8245 | 0.8587 |
| | NB | 0.6000 | 0.5153 | 0.6070 |
| Prot2Vec2 | SVM | 0.8695 | 0.8353 | 0.8692 |
| | RF | 0.8093 | 0.7625 | 0.8110 |
| | NB | 0.5854 | 0.4931 | 0.5889 |
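
The AAC and Digram models in the table are alignment-free transformations that turn a variable-length sequence into a fixed-length numeric vector. As a rough illustration, the sketch below assumes AAC means the relative frequency of each of the 20 standard amino acids and Digram means the relative frequency of each of the 400 ordered residue pairs; the exact definitions and normalization used in the paper are given in its main text and may differ. ACC and Prot2Vec are not covered here, and the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def aac(seq):
    """Relative frequency of each standard residue (20-dimensional vector)."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # non-standard symbols are ignored
    return [c / total if total else 0.0 for c in counts]

def digram(seq):
    """Relative frequency of each overlapping residue pair (400-dimensional vector)."""
    seq = seq.upper()
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in PAIRS]

# Toy sequence, not taken from the database.
features = aac("ACCA") + digram("ACCA")
print(len(features))
```

Vectors of this kind can be passed to any standard classifier, which is what allows the same sequences to be compared across SVM, RF and NB.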
Classification results for the May and September 2016 version datasets respectively.
| | | May 2016 | | | Sept. 2016 | | |
|---|---|---|---|---|---|---|---|
| Model | Classifier | Accuracy | MCC | F-measure | Accuracy | MCC | F-measure |
| AAC | SVM | 0.9822 | 0.9714 | 0.982 | 0.9893 | 0.9824 | 0.9892 |
| | RF | 0.9716 | 0.9538 | 0.9706 | 0.9850 | 0.9757 | 0.9850 |
| | NB | 0.9550 | 0.9271 | 0.9551 | 0.9594 | 0.9368 | 0.9598 |
| Digram | SVM | 0.9917 | 0.9884 | 0.9916 | 0.9946 | 0.9925 | 0.9946 |
| | RF | 0.9905 | 0.9847 | 0.9905 | 0.9914 | 0.9860 | 0.9914 |
| | NB | 0.9811 | 0.9688 | 0.9808 | 0.9893 | 0.9826 | 0.9893 |
| ACC | SVM | | | | | | |
| | RF | 0.9893 | 0.9830 | 0.9891 | 0.9925 | 0.9878 | 0.9925 |
| | NB | 0.9799 | 0.9673 | 0.9798 | 0.9904 | 0.9845 | 0.9903 |
| Prot2Vec1 | SVM | 0.9822 | 0.9716 | 0.9822 | 0.9893 | 0.9839 | 0.9893 |
| | RF | 0.9763 | 0.9612 | 0.9759 | 0.9861 | 0.9776 | 0.9861 |
| | NB | 0.8118 | 0.7229 | 0.8207 | 0.9904 | 0.9845 | 0.9903 |
| Prot2Vec2 | SVM | 0.9822 | 0.9759 | 0.9823 | 0.9936 | 0.9912 | 0.9936 |
| | RF | 0.9822 | 0.9714 | 0.9821 | 0.9904 | 0.9847 | 0.9903 |
| | NB | 0.8615 | 0.7972 | 0.8688 | 0.9808 | 0.9692 | 0.9809 |
Subtype classification results obtained by SVM from the Digram transformation of the 2011 version dataset.
| Subtype | Precision | Recall | MCC | F-measure |
|---|---|---|---|---|
| mG | 0.9462 | 0.9829 | 0.9639 | 0.9532 |
| CS | 1.0 | 0.9356 | 0.9645 | 0.9652 |
| GB | 0.9905 | 0.9856 | 0.9880 | 0.9861 |
| VN | 0.9185 | 0.9128 | 0.9153 | 0.8907 |
| Ph | 0.8980 | 0.9131 | 0.9050 | 0.8719 |
| Od | 0.8610 | 0.7362 | 0.7896 | 0.7806 |
| Ta | 1.0 | 0.9846 | 0.9920 | 0.9918 |
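
Per-subtype precision, recall, F-measure and MCC are commonly obtained by treating each subtype in turn as the positive class against all others. The sketch below shows those standard one-vs-rest definitions for a single set of predictions. How the paper aggregates its values (for example, across repeated evaluations) is described in its Methods section and is not reproduced here, so this is not expected to regenerate the table; the function name and the toy labels are hypothetical.

```python
from math import sqrt

def one_vs_rest_metrics(y_true, y_pred, label):
    """Precision, recall, F-measure and MCC for `label` against all other classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "mcc": mcc}

# Toy labels for illustration only, not data from the paper.
y_true = ["mG", "mG", "Od", "Od", "Ph", "Ph"]
y_pred = ["mG", "mG", "Od", "Ph", "Ph", "Ph"]
print(one_vs_rest_metrics(y_true, y_pred, "Od"))
```

Unlike precision and recall, MCC also uses the true negatives, which makes it informative when subtypes differ widely in size, as they do in the 2011 dataset.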
Subtype classification results obtained by SVM from the ACC transformation of the May and Sept. 2016 version dataset respectively.
| Subtype | May 2016 | | | | Sept. 2016 | | | |
|---|---|---|---|---|---|---|---|---|
| | Precision | Recall | MCC | F-measure | Precision | Recall | MCC | F-measure |
| mG | 0.9958 | 1.0 | 0.9979 | 0.9953 | 0.9962 | 1.0 | 0.9981 | 0.9957 |
| CS | 0.9923 | 0.9760 | 0.9833 | 0.9811 | 1.0 | 0.9804 | 0.9899 | 0.9889 |
| GB | 1.0 | 0.9833 | 0.9913 | 0.9909 | 1.0 | 0.9889 | 0.9943 | 0.9938 |
| Ta | 0.9903 | 0.9949 | 0.9924 | 0.9902 | 0.9957 | 1.0 | 0.9979 | 0.9972 |
Analysis of misclassification of sequences h2u5u4_takru and t2mdm0_hydvu: For each sequence s the true class (TC), the predicted class (PC), the error rate (ER), the voting ratio (R) and the cumulative decision value (CDV) are reported. For the meaning of these measures, see the Methods section.
| | h2u5u4_takru | | | | | t2mdm0_hydvu | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Model | TC | PC | ER | R | CDV | TC | PC | ER | R | CDV |
| AAC | GB | Ta | 100 | 0.49 | 38.18 | mG | Ta | 100 | 0.34 | −59.58 |
| Digram | GB | Ta | 96 | 0.51 | −9 | mG | Ta | 100 | 0.38 | 28.75 |
| ACC | GB | mG | 100 | 0.46 | 19.16 | mG | mG | 0 | - | - |
| Prot2Vec1 | GB | Ta | 100 | 0.58 | −42.54 | mG | CS | 100 | 0.33 | 55.5 |
| Prot2Vec2 | GB | Ta | 100 | 0.41 | −28.52 | mG | CS | 100 | 0.39 | −10.36 |
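
The idea of flagging sequences that are misclassified too consistently can be illustrated with the error rate alone. The sketch below assumes that ER is the percentage of repeated classification runs in which a sequence receives a label other than its database label; that reading, the `threshold` parameter and all names are assumptions made for illustration. The voting ratio and cumulative decision value are defined in the Methods section and are not computed here.

```python
from collections import Counter

def error_rate(true_label, predictions):
    """Percentage of runs in which the predicted label differs from the true one."""
    if not predictions:
        return 0.0
    wrong = sum(p != true_label for p in predictions)
    return 100 * wrong / len(predictions)

def flag_suspects(true_labels, predictions_by_id, threshold=90.0):
    """Return {sequence_id: (error_rate, most frequent wrong label)} at or above threshold."""
    flagged = {}
    for seq_id, preds in predictions_by_id.items():
        true_label = true_labels[seq_id]
        er = error_rate(true_label, preds)
        if er > 0 and er >= threshold:
            wrong = Counter(p for p in preds if p != true_label)
            flagged[seq_id] = (er, wrong.most_common(1)[0][0])
    return flagged

# Made-up predictions over 25 hypothetical runs; not data from the paper.
true_labels = {"seq_a": "GB", "seq_b": "mG"}
predictions_by_id = {"seq_a": ["Ta"] * 24 + ["GB"], "seq_b": ["mG"] * 25}
print(flag_suspects(true_labels, predictions_by_id))
```

A sequence flagged by several independent transformations, as in the table above, is a stronger candidate for manual review than one flagged by a single model.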