Wen Zhang, Yanqing Niu, Hua Zou, Longqiang Luo, Qianchao Liu, Weijian Wu.
Abstract
BACKGROUND: T-cell epitopes play an important role in the T-cell immune response and are critical components of epitope-based vaccine design. Immunogenicity is the ability to trigger an immune response. Accurate prediction of immunogenic T-cell epitopes is important for designing useful vaccines and for understanding the immune system.
Year: 2015 PMID: 26020952 PMCID: PMC4447411 DOI: 10.1371/journal.pone.0128194
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Details about sequence-derived features.
| Index | Feature | Dimension | Parameters | Annotation |
|---|---|---|---|---|
| F1 | Physicochemical propensities | 99 | No parameter | used in [ |
| F2 | Amino acid composition (AAC) | 20 | No parameter | used in [ |
| F3 | Amino acid pair profile | 400 | No parameter | used in [ |
| F4 | Sparse profile | 20 | No parameter | used in [ |
| F5 | Pairwise similarity profile | n | No parameter | used in [ |
| F6 | AAPPs | 360 | No parameter | used in [ |
| F7 | QTMS | 189 | No parameter | used in [ |
| F8 | Amino acid composition (CTDC) | 21 | No parameter | New feature |
| F9 | Amino acid Transition (CTDT) | 21 | No parameter | New feature |
| F10 | Amino acid Distribution (CTDD) | 105 | No parameter | New feature |
| F11 | Moran autocorrelation | 8×λ | λ, the lag of the autocorrelation | New feature |
| F12 | Geary autocorrelation | 8×λ | λ, the lag of the autocorrelation | New feature |
| F13 | MoreauBroto autocorrelation | 8×λ | λ, the lag of the autocorrelation | New feature |
| F14 | Quasi-sequence-order (QSO) | 40+2×λ | λ | New feature |
| F15 | Pseudo Amino Acid Composition (PseAA) | 20+λ | λ | New feature |
| F16 | Amphiphilic Pseudo Amino Acid Composition (AmPseAA) | 20+2×λ | λ | New feature |
| F17 | Predicted relative accessible surface areas (RASA) | 9 | No parameter | New feature |
| F18 | Predicted secondary structure (SS) | 27 | No parameter | New feature |
* n is the number of sequences in the dataset; 0 < λ < L, where L is the sequence length. "New feature" means that the feature had not previously been used in immunogenic epitope prediction.
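To make the dimensions in the table concrete, the following is a minimal sketch of two of the simplest encodings: a 20-dimensional amino acid composition (as in F2) and a 400-dimensional adjacent-pair frequency vector (one plausible reading of F3). The function names, the residue ordering, and the example peptide are illustrative assumptions, not taken from the source, and the authors' exact definitions may differ.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (ordering is an assumption)


def amino_acid_composition(peptide):
    """Return the 20-dimensional vector of residue frequencies."""
    n = len(peptide)
    return [peptide.count(aa) / n for aa in AMINO_ACIDS]


def amino_acid_pair_composition(peptide):
    """Return the 400-dimensional vector of adjacent-pair frequencies.

    Assumes the peptide has at least two residues.
    """
    pairs = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
    total = len(peptide) - 1  # number of adjacent pairs
    counts = {p: 0 for p in pairs}
    for i in range(total):
        pair = peptide[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    return [counts[p] / total for p in pairs]


peptide = "GILGFVFTL"  # hypothetical 9-mer used only as sample input
aac = amino_acid_composition(peptide)
pair_vec = amino_acid_pair_composition(peptide)
print(len(aac), len(pair_vec))
```

Both vectors have a fixed length regardless of peptide length, which is what allows peptides of different lengths to be fed to the same classifier.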
Fig 1. The flowchart of the GA-based ensemble method.
The average AUC scores of individual feature-based models for different values of λ, evaluated on IMMA2 over 20 independent runs of 10-fold cross-validation (10-CV).
| Parameter λ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Optimal value |
|---|---|---|---|---|---|---|---|---|---|
| Moran autocorrelation | 0.567 | 0.585 | 0.592 | 0.605 | 0.604 | 0.617 | | 0.631 | 7 |
| Geary autocorrelation | 0.571 | 0.580 | 0.588 | 0.608 | 0.609 | 0.616 | | 0.636 | 7 |
| MoreauBroto autocorrelation | 0.655 | 0.663 | 0.659 | 0.674 | 0.679 | 0.680 | | 0.684 | 7 |
| Quasi-sequence-order | 0.704 | 0.708 | 0.708 | 0.712 | 0.716 | 0.721 | 0.718 | | 8 |
| Pseudo Amino Acid Composition | 0.704 | 0.699 | 0.705 | 0.701 | 0.708 | 0.709 | 0.705 | | 8 |
| Amphiphilic Pseudo Amino Acid Composition | 0.691 | | 0.712 | 0.715 | 0.708 | 0.704 | 0.707 | 0.707 | 2 |
The average performances of different individual feature-based models, evaluated on IMMA2 by 20 independent runs of the 10-CV.
| # | Feature | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| F1 | Physicochemical propensities | 0.507 | 0.847 | 0.672 | 0.377 | 0.738 |
| F2 | Amino acid composition (AAC) | 0.574 | 0.701 | 0.636 | 0.289 | 0.693 |
| F3 | Amino acid pair profile | 0.541 | 0.793 | 0.664 | 0.348 | 0.718 |
| F4 | Sparse profile | 0.523 | 0.811 | 0.663 | 0.352 | 0.725 |
| F5 | Pairwise similarity profile | 0.692 | 0.680 | 0.687 | 0.375 | 0.741 |
| F6 | AAPPs | 0.550 | 0.813 | 0.678 | 0.382 | 0.747 |
| F7 | QTMS | 0.507 | 0.825 | 0.662 | 0.354 | 0.732 |
| F8 | Amino acid composition (CTDC) | 0.730 | 0.523 | 0.629 | 0.262 | 0.667 |
| F9 | Amino acid Transition (CTDT) | 0.512 | 0.742 | 0.624 | 0.266 | 0.671 |
| F10 | Amino acid Distribution (CTDD) | 0.592 | 0.743 | 0.666 | 0.340 | 0.720 |
| F11 | Moran autocorrelation | 0.337 | 0.868 | 0.595 | 0.246 | 0.633 |
| F12 | Geary autocorrelation | 0.333 | 0.861 | 0.589 | 0.238 | 0.640 |
| F13 | MoreauBroto autocorrelation | 0.411 | 0.847 | 0.623 | 0.293 | 0.684 |
| F14 | Quasi-sequence-order (QSO) | 0.626 | 0.724 | 0.674 | 0.352 | 0.723 |
| F15 | Pseudo Amino Acid Composition (PseAA) | 0.661 | 0.657 | 0.659 | 0.325 | 0.713 |
| F16 | Amphiphilic Pseudo Amino Acid Composition (AmPseAA) | 0.664 | 0.646 | 0.655 | 0.325 | 0.719 |
| F17 | Predicted relative accessible surface areas (RASA) | 0.643 | 0.781 | 0.710 | 0.430 | 0.783 |
| F18 | Predicted secondary structure (SS) | 0.917 | 0.295 | 0.615 | 0.273 | 0.585 |
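The five columns above are standard binary-classification measures: sensitivity (SN), specificity (SP), accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). As a reminder of how they relate to a confusion matrix, here is a small self-contained sketch. The function names, the toy labels and scores, and the 0.5 decision threshold are assumptions made for illustration only.

```python
import math


def classification_metrics(y_true, y_pred):
    """Compute SN, SP, ACC and MCC from binary labels (1 = immunogenic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "ACC": (tp + tn) / len(y_true),
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def auc_score(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not from the paper.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(classification_metrics(y_true, y_pred))
print(auc_score(y_true, scores))
```

Unlike the other four measures, AUC does not depend on the decision threshold. That is why a model such as F18, with very high SN and very low SP in the table, can still be compared with the others on AUC.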
The absolute values of correlation coefficients of AUC scores yielded by individual feature-based models.
| | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11 | F12 | F13 | F14 | F15 | F16 | F17 | F18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F1 | 1.00 | 0.57 | 0.00 | 0.50 | 0.18 | 0.51 | 0.67 | 0.07 | 0.51 | 0.41 | 0.05 | 0.19 | 0.05 | 0.38 | 0.49 | 0.39 | 0.19 | 0.34 |
| F2 | 0.57 | 1.00 | 0.18 | 0.69 | 0.52 | 0.42 | 0.56 | 0.11 | 0.29 | 0.40 | 0.07 | 0.27 | 0.02 | 0.60 | 0.56 | 0.38 | 0.27 | 0.02 |
| F3 | 0.00 | 0.18 | 1.00 | 0.08 | 0.02 | 0.02 | 0.30 | 0.22 | 0.15 | 0.09 | 0.20 | 0.17 | 0.19 | 0.05 | 0.02 | 0.26 | 0.13 | 0.04 |
| F4 | 0.50 | 0.69 | 0.08 | 1.00 | 0.63 | 0.47 | 0.82 | 0.03 | 0.53 | 0.49 | 0.17 | 0.01 | 0.02 | 0.41 | 0.45 | 0.15 | 0.27 | 0.13 |
| F5 | 0.18 | 0.52 | 0.02 | 0.63 | 1.00 | 0.51 | 0.50 | 0.01 | 0.42 | 0.03 | 0.28 | 0.19 | 0.37 | 0.33 | 0.13 | 0.22 | 0.01 | 0.12 |
| F6 | 0.51 | 0.42 | 0.02 | 0.47 | 0.51 | 1.00 | 0.68 | 0.10 | 0.45 | 0.19 | 0.04 | 0.08 | 0.28 | 0.11 | 0.09 | 0.29 | 0.14 | 0.10 |
| F7 | 0.67 | 0.56 | 0.30 | 0.82 | 0.50 | 0.68 | 1.00 | 0.05 | 0.53 | 0.44 | 0.26 | 0.12 | 0.01 | 0.39 | 0.40 | 0.12 | 0.16 | 0.24 |
| F8 | 0.07 | 0.11 | 0.22 | 0.03 | 0.01 | 0.10 | 0.05 | 1.00 | 0.07 | 0.07 | 0.06 | 0.67 | 0.10 | 0.14 | 0.38 | 0.10 | 0.02 | 0.21 |
| F9 | 0.51 | 0.29 | 0.15 | 0.53 | 0.42 | 0.45 | 0.53 | 0.07 | 1.00 | 0.39 | 0.03 | 0.12 | 0.41 | 0.21 | 0.33 | 0.20 | 0.08 | 0.19 |
| F10 | 0.41 | 0.40 | 0.09 | 0.49 | 0.03 | 0.19 | 0.44 | 0.07 | 0.39 | 1.00 | 0.12 | 0.05 | 0.03 | 0.56 | 0.65 | 0.17 | 0.33 | 0.02 |
| F11 | 0.05 | 0.07 | 0.20 | 0.17 | 0.28 | 0.04 | 0.26 | 0.06 | 0.03 | 0.12 | 1.00 | 0.17 | 0.45 | 0.19 | 0.10 | 0.30 | 0.00 | 0.40 |
| F12 | 0.19 | 0.27 | 0.17 | 0.01 | 0.19 | 0.08 | 0.12 | 0.67 | 0.12 | 0.05 | 0.17 | 1.00 | 0.29 | 0.26 | 0.23 | 0.25 | 0.18 | 0.23 |
| F13 | 0.05 | 0.02 | 0.19 | 0.02 | 0.37 | 0.28 | 0.01 | 0.10 | 0.41 | 0.03 | 0.45 | 0.29 | 1.00 | 0.18 | 0.02 | 0.28 | 0.31 | 0.22 |
| F14 | 0.38 | 0.60 | 0.05 | 0.41 | 0.33 | 0.11 | 0.39 | 0.14 | 0.21 | 0.56 | 0.19 | 0.26 | 0.18 | 1.00 | 0.80 | 0.48 | 0.13 | 0.08 |
| F15 | 0.49 | 0.56 | 0.02 | 0.45 | 0.13 | 0.09 | 0.40 | 0.38 | 0.33 | 0.65 | 0.10 | 0.23 | 0.02 | 0.80 | 1.00 | 0.50 | 0.15 | 0.02 |
| F16 | 0.39 | 0.38 | 0.26 | 0.15 | 0.22 | 0.29 | 0.12 | 0.10 | 0.20 | 0.17 | 0.30 | 0.25 | 0.28 | 0.48 | 0.50 | 1.00 | 0.25 | 0.05 |
| F17 | 0.19 | 0.27 | 0.13 | 0.27 | 0.01 | 0.14 | 0.16 | 0.02 | 0.08 | 0.33 | 0.00 | 0.18 | 0.31 | 0.13 | 0.15 | 0.25 | 1.00 | 0.35 |
| F18 | 0.34 | 0.02 | 0.04 | 0.13 | 0.12 | 0.10 | 0.24 | 0.21 | 0.19 | 0.02 | 0.40 | 0.23 | 0.22 | 0.08 | 0.02 | 0.05 | 0.35 | 1.00 |
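A matrix of this kind can be produced by correlating, for each pair of models, the AUC scores they obtain across repeated evaluations. The source does not state which correlation coefficient was used or over which set of scores, so the sketch below assumes Pearson correlation over per-run AUC values. The model names and AUC values are hypothetical placeholders.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def abs_correlation_matrix(auc_by_model):
    """Map each pair of model names to |Pearson r| of their AUC series."""
    names = list(auc_by_model)
    return {
        a: {b: abs(pearson(auc_by_model[a], auc_by_model[b])) for b in names}
        for a in names
    }


# Hypothetical per-run AUC scores for three models (not the paper's data).
auc_runs = {
    "F17": [0.78, 0.79, 0.77, 0.80],
    "F6": [0.74, 0.75, 0.75, 0.76],
    "F18": [0.59, 0.58, 0.60, 0.57],
}

matrix = abs_correlation_matrix(auc_runs)
for name, row in matrix.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Low off-diagonal values indicate models whose performance varies independently, which is the usual motivation for combining them in an ensemble.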
The average performances of models merging different feature vectors, evaluated by 20 independent runs of the 10-CV.
| # | Feature | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| Combination 1 | F17+F6 | 0.692 | 0.758 | 0.724 | 0.455 | 0.799 |
| Combination 2 | F17+F6+F5 | 0.658 | 0.767 | 0.711 | 0.430 | 0.783 |
| Combination 3 | F17+F6+F5+F1 | 0.663 | 0.763 | 0.712 | 0.430 | 0.782 |
| Combination 4 | F17+F6+F5+F1+F7 | 0.652 | 0.774 | 0.711 | 0.431 | 0.783 |
| Combination 5 | F17+F6+F5+F1+F7+F4 | 0.639 | 0.782 | 0.708 | 0.429 | 0.782 |
| Combination 6 | F17+F6+F5+F1+F7+F4+F14 | 0.631 | 0.793 | 0.710 | 0.432 | 0.782 |
| Combination 7 | F17+F6+F5+F1+F7+F4+F14+F10 | 0.653 | 0.770 | 0.710 | 0.428 | 0.781 |
The average performances of GA-based ensemble method on benchmark datasets, evaluated by 20 runs of 10-CV.
| Dataset | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|
| IMMA2 | 0.715 | 0.812 | 0.762 | 0.534 | 0.846 |
| PAAQD | 0.919 | 0.534 | 0.817 | 0.509 | 0.829 |
The frequencies of features in the optimal feature subsets.
| Index | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11 | F12 | F13 | F14 | F15 | F16 | F17 | F18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frequencies | 4 | 0 | 17 | 3 | 145 | 77 | 97 | 0 | 1 | 30 | 0 | 7 | 0 | 2 | 1 | 24 | 200 | 11 |
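The frequencies above count how often each feature appears in an optimal subset found by the genetic algorithm (GA). The source gives only a flowchart caption for the method, so the following is a generic GA for binary feature-subset selection, not the authors' procedure. The fitness function is a toy stand-in for a cross-validated ensemble score, and the population size, mutation rate, tournament size, and seed are all assumed values.

```python
import random

N_FEATURES = 18  # one bit per feature F1..F18


def toy_fitness(mask):
    """Hypothetical stand-in for the cross-validated AUC of an ensemble
    built from the selected features. Rewards four chosen bits and
    penalizes every other selected bit."""
    useful = {4, 5, 6, 16}  # indices of F5, F6, F7, F17; gives the toy a known optimum
    hits = sum(1 for i in useful if mask[i])
    extras = sum(mask) - hits
    return hits - 0.25 * extras


def run_ga(fitness, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Evolve binary masks with elitism, tournament selection,
    one-point crossover and bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        next_pop = [ranked[0][:]]  # elitism: carry the best mask over unchanged
        while len(next_pop) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)  # tournament of size 3
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, N_FEATURES)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)


best = run_ga(toy_fitness)
selected = [f"F{i + 1}" for i, g in enumerate(best) if g]
print(selected, toy_fitness(best))
```

In a real pipeline, `toy_fitness` would be replaced by a function that trains and evaluates the ensemble on the selected features. Running the search once per cross-validation fold and tallying the selected bits would yield a frequency table like the one above.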
The average performances of different models evaluated by 20 independent runs of 10-CV.
| Dataset | Method | SN | SP | ACC | MCC | AUC |
|---|---|---|---|---|---|---|
| IMMA2 | POPI | N.A. | N.A. | 0.60 | 0.19 | 0.64 |
| IMMA2 | POPISK | N.A. | N.A. | 0.68 | 0.37 | 0.74 |
| IMMA2 | PAAQD | 0.523 | 0.832 | 0.673 | 0.379 | 0.747 |
| IMMA2 | Our previous method | 0.573 | 0.818 | 0.692 | 0.406 | 0.766 |
| IMMA2 | GA-based ensemble method | 0.715 | 0.812 | 0.762 | 0.534 | 0.846 |
| PAAQD | PAAQD | 0.508 | 0.898 | 0.612 | 0.373 | 0.749 |
| PAAQD | Our previous method | 0.548 | 0.902 | 0.642 | 0.403 | 0.773 |
| PAAQD | GA-based ensemble method | 0.919 | 0.534 | 0.817 | 0.509 | 0.829 |
*N.A. means data not available.
The statistics of improvements over benchmark methods (significance level 0.05).
| Dataset | Method | POPI | POPISK | PAAQD | Our previous method |
|---|---|---|---|---|---|
| IMMA2 | GA-based ensemble method | 1.9E-16 | 3.0E-11 | 4.0E-22 | 1.3E-20 |
| PAAQD | GA-based ensemble method | N.A. | N.A. | 3.3E-14 | 3.5E-12 |
*N.A. means data not available.