Jieru Zhang, Ying Ju, Huijuan Lu, Ping Xuan, Quan Zou.
Abstract
Cancerlectins are cancer-related proteins that function as lectins. They have been identified computationally, but existing techniques sometimes miss proteins because of the sequence diversity among cancerlectins. Machine learning methods, such as support vector machines trained on basic sequence features (n-grams), have also been used to identify cancerlectins. In this study, various protein fingerprint features and advanced classifiers, including ensemble learning techniques, were used to identify this group of proteins. We improved the prediction accuracy of the original feature extraction methods and classification algorithms by more than 10% on average. Our work provides a basis for the computational identification of cancerlectins and shows the power of hybrid machine learning techniques in computational proteomics.
Year: 2016 PMID: 27478823 PMCID: PMC4961832 DOI: 10.1155/2016/7604641
Source DB: PubMed Journal: Int J Genomics ISSN: 2314-436X Impact factor: 2.326
Figure 1. The main flow chart of the cancerlectin identification method.
The number of samples used for each ProtrWeb descriptor.
| Descriptor | Train set: Cancerlectin | Train set: Noncancerlectin | Test set: Cancerlectin | Test set: Noncancerlectin |
|---|---|---|---|---|
| Amino Acid Composition | 178 | 226 | 20 | 20 |
| Dipeptide Composition | 178 | 226 | 20 | 20 |
| Normalized Moreau-Broto Autocorrelation | 178 | 225 | 20 | 20 |
| Moran Autocorrelation | 178 | 226 | 20 | 20 |
| Geary Autocorrelation | 178 | 226 | 20 | 20 |
| Conjoint Triad | 178 | 226 | 20 | 20 |
| Sequence-Order-Coupling Number | 178 | 225 | 20 | 20 |
| Quasi-Sequence-Order Descriptors | 178 | 225 | 20 | 20 |
| Pseudo-Amino Acid Composition | 178 | 225 | 20 | 20 |
| Amphiphilic Pseudo-Amino Acid Composition | 178 | 225 | 20 | 20 |
Figure 5. The five most significant conserved motifs of the first group.
The 5 most significant conserved motifs of the first group.
| Motif | Width | | Best possible match |
|---|---|---|---|
| 1 | 50 | 1.3 | FA[ED][RK]L[YH][KQ][AS]MKG[AL]GT[RD]D[KN][TV]LIRI[ML] |
| 2 | 50 | 3.3 | [YW]F[EQ][EY][LI]G[KL]YD[EMP]G[ML][ED][IV]WGGEN[FL]E |
| 3 | 50 | 6.2 | MKG[ALV]GTDED[CAV]LIE[IV]L[AC][ST]R[TS][NP][EK][EQ][IL] |
| 4 | 50 | 1.6 | VD[EP][AD]L[AV][DQ]QDA[QR]DLY[EAD]AGEK[RK][WK]GTD |
| 5 | 41 | 2.1 | PTTS[VI][IV]I[TV]FHNE[AG][WR]STLLRT[VI]HSVL[KN]R[ST]P |
Figure 6. The five most significant conserved motifs of the second group.
The 5 most significant conserved motifs of the second group.
| Motif | Width | | Best possible match |
|---|---|---|---|
| 1 | 15 | 8.2 | CPENWIX[FY][GQ]N[KS]CY[YL]F |
| 2 | 29 | 2.1 | [WF]XD[AS][QEK]XXCXXXG[AG]HL[VA][VS][IV]D[SN]XEEQ |
| 3 | 15 | 2.6 | WNDXXC[ND]XK[LN][YL][FS][IV]C[EK] |
| 4 | 40 | 7.7 | YD[AST]GM[DE][IV]WGGENLE[IF]SFRIW[QM]CGG[KTV]L |
| 5 | 15 | 1.6 | WIG[LV]S[DR]XXSEGXWQW |
The numbers of positive and negative samples in the training set before and after balancing.
| Descriptor | Before balancing: Cancerlectin | Before balancing: Noncancerlectin | Before balancing: Total | After balancing: Cancerlectin | After balancing: Noncancerlectin | After balancing: Total |
|---|---|---|---|---|---|---|
| Conjoint Triad | 178 | 226 | 404 | 356 | 226 | 582 |
| Pseudo-Amino Acid Composition | 178 | 225 | 403 | 356 | 225 | 581 |
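The counts above show the positive class exactly doubling (178 to 356) while the negative class is unchanged. The source does not state how the balancing was performed. The following is a minimal sketch under the assumption that each positive sample is simply duplicated once. A synthetic oversampling method could produce the same counts. The function name `double_positives` and the placeholder samples are hypothetical.

```python
def double_positives(positives, negatives):
    """Oversample the positive class by duplicating every sample once.

    Assumption: plain duplication. The source reports only the resulting
    counts, not the balancing procedure itself.
    """
    return list(positives) + list(positives), list(negatives)


# Placeholder samples matching the Conjoint Triad row (178 vs. 226).
pos = [("cancerlectin", i) for i in range(178)]
neg = [("noncancerlectin", i) for i in range(226)]

bal_pos, bal_neg = double_positives(pos, neg)
counts = (len(bal_pos), len(bal_neg), len(bal_pos) + len(bal_neg))
print(counts)
```

With these placeholder inputs the totals match the Conjoint Triad row of the table. Using 225 negatives instead gives the Pseudo-Amino Acid Composition row.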
Prediction accuracy before and after balancing the training set.
| Descriptor | Before balancing: Cross-validation | Before balancing: Supplied test set | After balancing: Cross-validation | After balancing: Supplied test set |
|---|---|---|---|---|
| Conjoint Triad | 54.9505% | 70% | 71.134% | 67.5% |
| Pseudo-Amino Acid Composition | 57.8164% | 70% | 72.4613% | 67.5% |
Dimensions of feature extraction algorithms in Part I.
| Mode | Dimension |
|---|---|
| Pse-in-one | 22 |
| 188 dimensions | 188 |
| 473 dimensions | 473 |
| 1-skip | 400 |
| 2-skip | 400 |
| 188 dimensions + Pse-in-one | 210 |
| 473 dimensions + Pse-in-one | 495 |
| 473 dimensions + 188 dimensions | 661 |
| 473 dimensions + 188 dimensions + Pse-in-one | 683 |
| 473 dimensions + 188 dimensions + Pse-in-one + 1-skip | 1083 |
| 473 dimensions + 188 dimensions + Pse-in-one + 2-skip | 1083 |
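The combined dimensions in the table are the sums of their parts (for example, 473 + 188 + 22 + 400 = 1083), which suggests that feature vectors are concatenated. The sketch below illustrates this together with one possible reading of the 400-dimensional "k-skip" features: normalized counts of residue pairs separated by at most k intervening positions (20 × 20 = 400 pairs). This definition is an assumption, since the source does not spell it out, and the 473-, 188-, and 22-dimensional vectors are zero-filled placeholders rather than real descriptors.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def skip_bigram_features(seq, k):
    """Normalized counts of residue pairs with 0..k residues skipped between them.

    Assumed definition of the "k-skip" features; pairs containing
    non-standard residues are ignored.
    """
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for gap in range(k + 1):
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
    return [counts[p] / total if total else 0.0 for p in PAIRS]


one_skip = skip_bigram_features("ACDA", k=1)

# Zero-filled placeholders standing in for the other descriptor vectors.
combined = [0.0] * 473 + [0.0] * 188 + [0.0] * 22 + one_skip
print(len(one_skip), len(combined))
```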
Figure 2. The accuracy rate of prediction in Part I.
Figure 3. The accuracy rate of prediction in ProtrWeb.
Dimensions of feature extraction algorithms in ProtrWeb.
| Mode | Dimension | Dimension after reduction |
|---|---|---|
| Amino Acid Composition | 20 | 19 |
| Dipeptide Composition | 400 | 49 |
| Normalized Moreau-Broto Autocorrelation | 240 | 47 |
| Moran Autocorrelation | 240 | 43 |
| Geary Autocorrelation | 240 | 220 |
| Conjoint Triad | 343 | 81 |
| Sequence-Order-Coupling Number | 60 | 17 |
| Quasi-Sequence-Order Descriptors | 100 | 42 |
| Pseudo-Amino Acid Composition | 50 | 23 |
| Amphiphilic Pseudo-Amino Acid Composition | 80 | 15 |
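Several of the original dimensions in the table follow directly from how the descriptors are defined: 20 amino acids give 20 composition values, 20 × 20 = 400 dipeptides, and 7 × 7 × 7 = 343 conjoint triads. The sketch below shows where those numbers come from. The seven residue classes are the grouping commonly used for the conjoint triad descriptor and are an assumption here, as are the function names. The code returns plain frequencies or raw counts and is not ProtrWeb's implementation, which may normalize differently.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed seven-class grouping for the conjoint triad descriptor.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
CLASS_OF = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}


def composition(seq, n):
    """Frequencies of all length-n words over the 20 standard amino acids."""
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - n + 1):
        word = seq[i:i + n]
        if word in counts:  # skip words with non-standard residues
            counts[word] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in keys]


def conjoint_triad(seq):
    """Raw counts of the 343 possible triads of residue classes."""
    keys = list(product(range(len(CLASSES)), repeat=3))
    counts = dict.fromkeys(keys, 0)
    classes = [CLASS_OF.get(aa) for aa in seq]
    for i in range(len(classes) - 2):
        triad = tuple(classes[i:i + 3])
        if None not in triad:  # skip triads with non-standard residues
            counts[triad] += 1
    return [counts[k] for k in keys]


toy_seq = "ACDEFGHIKLMNPQRSTVWY"  # toy sequence, not from the data set
aac = composition(toy_seq, 1)
dpc = composition(toy_seq, 2)
ctd = conjoint_triad(toy_seq)
print(len(aac), len(dpc), len(ctd))
```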
Figure 4. The accuracy rate of prediction before and after dimension reduction.
The prediction results of libSVM.
| Mode | libSVM (%) | libSVM + Grid (%) |
|---|---|---|
| Conjoint Triad | 55.9406 | 81.1881 |
| Pseudo-Amino Acid Composition | 86.1042 | 70.9677 |
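The "libSVM + Grid" column refers to tuning the SVM parameters by grid search. The sketch below shows the general idea for an RBF kernel: enumerate candidate (C, gamma) pairs on a log2 grid and keep the pair with the best score. The exponent ranges are assumed to mirror the defaults of libSVM's `grid.py` helper, and `score_fn` is a hypothetical stand-in for a cross-validated accuracy estimate. The source does not report which ranges or scoring were actually used.

```python
import math
from itertools import product


def log2_grid(log2c=(-5, 15, 2), log2g=(3, -15, -2)):
    """Build (C, gamma) candidates from (start, stop, step) exponent ranges.

    Assumption: the default ranges follow libSVM's grid.py helper.
    """
    c_start, c_stop, c_step = log2c
    g_start, g_stop, g_step = log2g
    cs = [2.0 ** e for e in range(c_start, c_stop + 1, c_step)]
    gs = [2.0 ** e for e in range(g_start, g_stop - 1, g_step)]
    return list(product(cs, gs))


def grid_search(score_fn, grid):
    """Return the (C, gamma) pair with the highest score."""
    return max(grid, key=lambda cg: score_fn(*cg))


def toy_score(c, gamma):
    # Hypothetical score surface peaking at C = 2**3, gamma = 2**-5.
    # In practice this would train an SVM and return cross-validated accuracy.
    return -abs(math.log2(c) - 3) - abs(math.log2(gamma) + 5)


grid = log2_grid()
best_c, best_gamma = grid_search(toy_score, grid)
print(len(grid), best_c, best_gamma)
```

A grid search picks the parameters that maximize the score it is given, so results on a separate evaluation can still move in either direction, as the two rows of the table show.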