Caroline König, Martha I Cárdenas, Jesús Giraldo, René Alquézar, Alfredo Vellido.
Abstract
BACKGROUND: The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the assignment of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors (GPCRs), which are cell membrane proteins of relevance both to biology in general and pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations of the protein primary sequences.
Year: 2015 PMID: 26415951 PMCID: PMC4587730 DOI: 10.1186/s12859-015-0731-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance measures for binary classifiers
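| Measure | Formula | Meaning |
|---|---|---|
| Accuracy | $\frac{tp + tn}{tp + fp + fn + tn}$ | Measure of correctness |
| Precision | $\frac{tp}{tp + fp}$ | Measure of quality |
| Recall | $\frac{tp}{tp + fn}$ | Measure of completeness |
| MCC | $\frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$ | Correlation coefficient |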
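The four binary measures can be computed directly from the confusion counts tp, tn, fp and fn. The following is a minimal sketch using the standard textbook definitions; the function name `binary_measures` and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration and are not taken from the paper.

```python
import math


def binary_measures(tp, tn, fp, fn):
    """Return accuracy, precision, recall and MCC from binary confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Precision and recall are undefined when their denominator is zero;
    # returning 0.0 in that case is a convention chosen for this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC: correlation between true and predicted binary labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(binary_measures(tp=90, tn=85, fp=15, fn=10))
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the classes are imbalanced.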
Performance measures for multi-class classifiers. $tp_i$, $tn_i$, $fp_i$ and $fn_i$ are, in turn, tp, tn, fp and fn for class $i$ [59]. The multi-class MCC is calculated taking into account all the entries of the confusion matrix $C$ involving all $K$ classes [60]. The $ij$-th entry ($c_{ij}$) is the number of examples of the true class $i$ that have been assigned to the class $j$ by the classifier
| Measure | Formula |
|---|---|
| Accuracy | |
| MCC | |
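As an illustration of how these quantities relate to the confusion matrix, the sketch below derives the one-vs-rest counts for a class $i$ and computes a multi-class MCC from all entries of $C$. It uses the widely documented covariance form of the multi-class MCC, which is assumed here to correspond to the generalization cited as [60]; the paper's exact formula is not reproduced in this record. The function names and the example matrix are hypothetical.

```python
import math


def per_class_counts(C, i):
    """One-vs-rest (tp, tn, fp, fn) for class i.

    C[i][j] is the number of examples of true class i assigned to class j.
    """
    K = len(C)
    total = sum(sum(row) for row in C)
    tp = C[i][i]
    fn = sum(C[i]) - tp                        # true class i, predicted as another class
    fp = sum(C[k][i] for k in range(K)) - tp   # other classes predicted as i
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def multiclass_mcc(C):
    """Multi-class MCC in covariance form, using every entry of C."""
    K = len(C)
    s = sum(sum(row) for row in C)                           # total number of examples
    c = sum(C[k][k] for k in range(K))                       # correctly classified examples
    t = [sum(C[k]) for k in range(K)]                        # true count per class (row sums)
    p = [sum(C[k][j] for k in range(K)) for j in range(K)]   # predicted count per class
    num = c * s - sum(pk * tk for pk, tk in zip(p, t))
    den = math.sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    # Returning 0.0 for a zero denominator is a convention of this sketch.
    return num / den if den else 0.0


if __name__ == "__main__":
    # Hypothetical 3-class confusion matrix, for illustration only.
    C = [[8, 1, 1],
         [2, 7, 1],
         [0, 1, 9]]
    print(per_class_counts(C, 0))
    print(multiclass_mcc(C))
```

For $K = 2$ this form reduces to the binary MCC, which gives a quick consistency check on the implementation.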
SVM classifier results: Global results for the four data transformations; accuracy (Accu), Matthews Correlation Coefficient (MCC)
| Data | Accu | MCC |
|---|---|---|
| AAC | 0.88 | 0.84 |
| Digram | | |
| ACC | | |
| PDBT | 0.92 | 0.90 |
Best results highlighted in bold
SVM classifier results: Class C GPCR results per subtype for the ACC data set only, including MCC, Precision (Prec) and Recall (Rec)
| Class | MCC | Prec | Rec |
|---|---|---|---|
| mG | 0.95 | 0.95 | 0.99 |
| CS | 0.93 | 1.00 | 0.88 |
| GB | 0.98 | 0.99 | 0.99 |
| VN | 0.89 | 0.91 | 0.92 |
| Ph | 0.86 | 0.89 | 0.90 |
| Od | 0.79 | 0.89 | 0.74 |
| Ta | 0.99 | 1.00 | 0.98 |
Fig. 1 Boxplot representation of the Accu of the AAC, Digram, ACC and PDBT datasets
Fig. 2 Boxplot representation of the MCC of the AAC, Digram, ACC and PDBT datasets
Illustrative example of misclassification statistics for the ACC data set. For some sequences s identified by number ♯, the error rate (ER), the true class (TC), and how many times this sequence was misclassified as belonging to each of the other subtypes (from mG to Ta), are displayed. The three last columns list the sum of the votes for the true class (VT), for the most frequently predicted class (VP), and the ratio (R) of one to the other
| ♯ | ER | TC | mG | CS | GB | VN | Ph | Od | Ta | VT | VP | R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 100 | CS | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 91 | 600 | 0.15 |
| 6 | 100 | VN | 0 | 0 | 0 | 0 | 96 | 4 | 0 | 404 | 596 | 0.67 |
| 7 | 100 | VN | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 300 | 600 | 0.5 |
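The relationship between the vote columns and the ratio can be checked with a few lines of code. The sketch below copies the three rows of the table above and recomputes, for each sequence, the most frequently predicted subtype and the ratio of true-class votes to predicted-class votes. The direction of the ratio (true class over predicted class) is inferred from the tabulated numbers rather than stated in the caption, and the function and field names are hypothetical.

```python
SUBTYPES = ["mG", "CS", "GB", "VN", "Ph", "Od", "Ta"]

# Rows copied from the misclassification table above: per-subtype
# misclassification counts, vote totals and the reported ratio.
ROWS = [
    {"seq": 2, "true_class": "CS", "counts": [100, 0, 0, 0, 0, 0, 0],
     "vt": 91, "vp": 600, "reported_r": 0.15},
    {"seq": 6, "true_class": "VN", "counts": [0, 0, 0, 0, 96, 4, 0],
     "vt": 404, "vp": 596, "reported_r": 0.67},
    {"seq": 7, "true_class": "VN", "counts": [100, 0, 0, 0, 0, 0, 0],
     "vt": 300, "vp": 600, "reported_r": 0.5},
]


def most_frequent_prediction(counts):
    """Return the subtype with the largest misclassification count."""
    best = max(range(len(counts)), key=lambda j: counts[j])
    return SUBTYPES[best]


def voting_ratio(votes_true, votes_predicted):
    """Ratio of votes for the true class to votes for the predicted class."""
    return votes_true / votes_predicted


if __name__ == "__main__":
    for row in ROWS:
        predicted = most_frequent_prediction(row["counts"])
        r = voting_ratio(row["vt"], row["vp"])
        print(f"seq {row['seq']}: {row['true_class']} -> {predicted}, R = {r:.3f}")
```

A ratio well below 1 means the classifier votes far more consistently for the predicted subtype than for the database label, which is the pattern of interest when looking for label noise.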
Sequences with large classification errors: For each sequence s numbered ♯, the GPCRDB Identifier (Id), the true class (TC), the predicted class (PC), the voting ratio (R) and the cumulative decision value (CDV) are displayed
| ♯ | Id | TC | PC | R | CDV |
|---|---|---|---|---|---|
| 1 | q5i5c3_9tele | mG | Od | 0.75 | |
| 2 | XP_002123664 | CS | mG | | 50 |
| 3 | q8c0m6_mouse | CS | Ph | | –46 |
| 4 | XP_002740613 | CS | mG | | |
| 5 | XP_002936197 | VN | Ph | 0.83 | |
| 6 | XP_002940476 | VN | Ph | 0.67 | |
| 7 | XP_002941777 | VN | mG | | 45 |
| 8 | B0UYJ3_DANRE | Ph | mG | 0.79 | |
| 9 | XP_001518611 | Od | mG | | 46 |
| 10 | XP_002940324 | Od | VN | 0.49 | |
| 11 | GPC6A_DANRE | Od | Ph | | |
Extreme R and CDV values highlighted in bold
Fig. 3 Radial PT plot showing the main areas of distribution of the seven class C GPCR subtypes. Treevolution radial PT in which the main sections occupied by each of the seven class C GPCR subtypes are explicitly represented by arcs or groups of arcs in the periphery of the tree. Note that branch colors are automatically generated during PT construction and do not correspond to class C subtypes
Fig. 4 Mislabelings predicted to be mG. Five sequences with large classification errors were mislabeled as mG. Sequence ♯7 was labeled as VN in GPCRDB; ♯2 and ♯4 were labeled as CS; ♯8 was labeled as Ph; and ♯9 was labeled as Od
Fig. 5 Mislabelings predicted to be Od. One sequence (♯1, labeled as mG in GPCRDB) with large classification error was mislabeled as Od
Fig. 6 Mislabelings predicted to be Ph. Four sequences with large classification errors were mislabeled as Ph. Sequence ♯3 was labeled as CS in GPCRDB; ♯11 was labeled as Od; and ♯5 and ♯6 were labeled as VN
Fig. 7 Mislabelings predicted to be VN. One sequence (♯10, labeled as Od in GPCRDB) with large classification error was mislabeled as VN