| Literature DB >> 23029121 |
Tian Mi, Sanguthevar Rajasekaran, Jerlin Camilus Merlin, Michael Gryk, Martin R Schiller.
Abstract
The low complexity of minimotif patterns results in a high false-positive prediction rate, hampering protein function prediction. A multi-filter algorithm, trained and tested with linear regression, support vector machine, and neural network models on a large dataset of verified minimotifs, vastly improves minimotif prediction accuracy while generating few false positives. A threshold optimized for accuracy yields overall accuracy above 90%, while a stringent threshold optimized for specificity generates fewer than 1% false positives (in some cases none) and still recovers more than 90% of true positives for the linear regression and neural network models. The minimotif multi-filter, with its excellent accuracy, represents the state of the art in minimotif prediction and is expected to be very useful to biologists investigating protein function and how missense mutations cause disease.
Year: 2012 PMID: 23029121 PMCID: PMC3459956 DOI: 10.1371/journal.pone.0045589
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
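The abstract describes scoring candidate minimotifs with a combined filter and calling a prediction positive when the score passes a threshold. A minimal sketch of that thresholding step and the resulting sensitivity/specificity/accuracy, with invented scores and function names (this is not the authors' code):

```python
# Hypothetical sketch of threshold-based minimotif filtering: each
# candidate gets a combined filter score, predictions come from
# thresholding, and performance follows from confusion-matrix counts.

def classify(scores, threshold):
    """Predict a true minimotif when the score meets the threshold."""
    return [s >= threshold for s in scores]

def performance(predicted, actual):
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    return {
        "sensitivity": tp / (tp + fn),   # fraction of true minimotifs kept
        "specificity": tn / (tn + fp),   # fraction of false hits rejected
        "accuracy": (tp + tn) / len(actual),
    }

# Toy data: four verified minimotifs and four false-positive candidates.
scores = [0.90, 0.80, 0.60, 0.30, 0.40, 0.20, 0.10, 0.05]
actual = [True, True, True, True, False, False, False, False]
stats = performance(classify(scores, 0.48), actual)
```

The 0.48 threshold here mirrors the linear-regression threshold reported below, but the toy scores are fabricated for illustration only.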
Figure 1. ROC plots comparing linear regression, support vector machine, and neural network multi-filters with individual CF, MF, PPI, FS, and SP filters.
ROCs are colored orange for linear regression, cyan for support vector machine, dark green for neural network, red for the PPI filter, blue for the CF filter, green for the MF filter, purple for the FS filter, and yellow for the SP filter.
ROC statistics for individual motif filters.
| Method | AUC | P-value |
| --- | --- | --- |
| CF | 0.72 | 0.12 |
| MF | 0.83 | 0.03 |
| FS | 0.72 | 0.08 |
| PPI | 0.88 | 1.4×10⁻³ |
| SP | 0.38 | 1.00 |
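The AUC values in the tables can be understood through the rank interpretation of ROC area: AUC is the Mann-Whitney probability that a randomly chosen verified minimotif outscores a randomly chosen false positive, with 0.5 being chance level (which is why the SP filter's 0.38 is worse than random guessing). A sketch of that computation, not the authors' implementation:

```python
# Rank-based (Mann-Whitney) AUC: probability that a positive example's
# score exceeds a negative example's score, counting ties as half.

def auc(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count half
    return wins / (len(pos_scores) * len(neg_scores))
```

This O(n·m) double loop is fine for illustration; rank-sum formulations do the same in O(n log n).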
ROC statistics for three minimotif multi-filter models.
| Model | Metric | Fold #1 | Fold #2 | Fold #3 | Fold #4 | Fold #5 | Average | Standard deviation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear regression | AUC | 96.4% | 96.7% | 96.6% | 96.2% | 95.9% | 96.4% | 0.3% |
| | P-value | 2.3×10⁻¹¹⁴ | 5.3×10⁻¹¹⁶ | 1.3×10⁻¹¹⁵ | 1.9×10⁻¹¹³ | 4.2×10⁻¹¹² | – | – |
| Support vector machine | AUC | 93.6% | 92.9% | 93.8% | 92.9% | 93.8% | 93.4% | 0.5% |
| | P-value | 9.8×10⁻²⁵⁹ | 1.2×10⁻¹⁷⁸ | 3.3×10⁻²⁶⁶ | <10⁻³²⁵ | 1.7×10⁻²⁶⁶ | – | – |
| Neural network | AUC | 99.6% | 99.0% | 99.8% | 99.3% | 96.7% | 98.9% | 1.3% |
| | P-value | <10⁻³²⁵ | <10⁻³²⁵ | <10⁻³²⁵ | <10⁻³²⁵ | 1.7×10⁻²¹¹ | – | – |
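The five fold columns above come from 5-fold cross-validation: the verified-minimotif dataset is split into five parts, and each model is trained on four parts and evaluated on the held-out fifth. The partitioning can be sketched as follows (the striding scheme is illustrative; the paper does not specify how folds were drawn):

```python
# Illustrative k-fold splitter: partition items into k folds, then yield
# (train, test) pairs where each fold is held out exactly once.

def k_fold_splits(items, k=5):
    folds = [items[i::k] for i in range(k)]  # simple strided partition
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Real pipelines usually shuffle before splitting so folds are not biased by dataset ordering.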
Figure 2. Dependence of minimotif multi-filter performance on threshold values for the linear regression, support vector machine, and neural network models.
Sensitivity, specificity, and accuracy for the linear regression (A), support vector machine (B), and neural network (C) models. Thresholds were selected by picking the best model in the 5-fold cross-validation (model 2 for linear regression and model 3 for the neural network), evaluated on the test dataset.
Summary of filtering statistics for three models.
| Model | Threshold | Sensitivity | Specificity | Accuracy | MCC |
| --- | --- | --- | --- | --- | --- |
| Linear regression | To = 0.48 | 84.3% | 100.0% | 92.1% | 0.85 |
| | Ts = 0.48 | 84.3% | 100.0% | 92.1% | 0.85 |
| Support vector machine | To = −0.99 | 85.3% | 99.3% | 92.3% | 0.85 |
| | Ts = 3.00 | 73.5% | 99.8% | 86.6% | 0.76 |
| Neural network | To = 0.50 | 89.8% | 94.8% | 92.3% | 0.85 |
| | Ts = 0.74 | 83.0% | 99.8% | 91.4% | 0.84 |
To: the optimal threshold with maximum accuracy; Ts: the stringent threshold that minimizes the number of false positives while retaining high sensitivity.
MCC: Matthews correlation coefficient.
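The MCC column is the Matthews correlation coefficient, a single score over all four confusion-matrix counts: +1 for perfect prediction, 0 for chance-level prediction, −1 for total disagreement. It is more informative than accuracy when positives and negatives are imbalanced, as with rare true minimotifs among many candidates. A minimal implementation:

```python
import math

# Matthews correlation coefficient from confusion-matrix counts.
def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal count is zero
    return (tp * tn - fp * fn) / denom
```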
Variations of multi-filter combinations for each model.
| Model | # filters | Combination | AVG(AUC) | STD | Accuracy (To) | MCC (To) | Accuracy (Ts) | MCC (Ts) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linear regression | 5 | CF+MF+FS+PPI+SP | 95.5% | 0.00 | 92.7% | 0.76 | 90.1% | 0.50 |
| | 4 | MF+FS+PPI+SP | 95.4% | 0.01 | 92.7% | 0.75 | 90.2% | 0.51 |
| | 3 | MF+FS+PPI | 95.6% | 0.00 | 92.9% | 0.74 | 91.0% | 0.55 |
| | 2 | FS+PPI | 97.7% | 0.01 | 95.5% | 0.81 | 90.0% | 0.50 |
| Support vector machine | 5 | CF+MF+FS+PPI+SP | 90.2% | 0.08 | 96.6% | 0.85 | 92.6% | 0.62 |
| | 4 | MF+FS+PPI+SP | 92.5% | 0.04 | 97.2% | 0.88 | 90.0% | 0.50 |
| | 3 | FS+PPI+SP | 92.4% | 0.04 | 95.8% | 0.82 | 92.5% | 0.62 |
| | 2 | FS+SP | 97.1% | 0.02 | 94.6% | 0.77 | 93.6% | 0.67 |
| Neural network | 5 | CF+MF+FS+PPI+SP | 97.6% | 0.04 | 97.2% | 0.90 | 95.6% | 0.79 |
| | 4 | MF+FS+PPI+SP | 99.5% | 0.01 | 96.4% | 0.86 | 95.0% | 0.76 |
| | 3 | FS+PPI+SP | 99.1% | 0.01 | 95.8% | 0.82 | 91.7% | 0.58 |
| | 2 | FS+PPI | 97.8% | 0.00 | 95.8% | 0.79 | 90.4% | 0.52 |
Alternative filter combinations that were not significantly different (P<0.05) from the combination tested in the same row were also found: CF+MF+FS+SP for the 4-filter combination in linear regression; CF+MF+FS+SP or CF+MF+PPI+SP for the 4-filter combination and MF+FS+PPI or MF+FS+SP for the 3-filter combination in the support vector machine; and MF+FS+PPI for the 3-filter combination in the neural network.
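The comparisons above cover subsets of the five filters (CF, MF, FS, PPI, SP) of sizes 2 through 5. Enumerating those candidate subsets is straightforward; the expensive part, retraining each model and recomputing AUC per subset, is elided in this sketch:

```python
from itertools import combinations

# The five individual filters named in the paper.
FILTERS = ("CF", "MF", "FS", "PPI", "SP")

# Yield every filter subset of size min_size..5, as evaluated in the
# combination table (evaluation of each subset is not shown here).
def filter_subsets(min_size=2):
    for k in range(min_size, len(FILTERS) + 1):
        yield from combinations(FILTERS, k)
```

With five filters there are C(5,2)+C(5,3)+C(5,4)+C(5,5) = 10+10+5+1 = 26 subsets, small enough for exhaustive comparison.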