Yinyin Cai, Zhijun Liao, Ying Ju, Juan Liu, Yong Mao, Xiangrong Liu.
Abstract
Research on resistance genes (R-genes) plays a vital role in bioinformatics, because R-genes allow an organism to cope with adverse changes in its external environment by forming the corresponding resistance proteins through transcription and translation. Identifying and predicting the R-genes of Larimichthys crocea (L.Crocea) is therefore meaningful, and it also benefits breeding and the marine environment. Many of L.Crocea's immune mechanisms have been explored by biological methods, but much about them remains unclear. To move beyond this limited understanding and to detect new R-genes and R-gene-like genes, this paper proposes a combined prediction method that extracts features from the available genomic data and classifies them by machine learning. The effectiveness of the feature extraction and classification methods in identifying potential novel R-genes was evaluated, and several statistical analyses were used to assess the reliability of the prediction method, which can help us further understand the immune mechanisms of L.Crocea against pathogens. A web server called LCRG-Pred is available at http://server.malab.cn/rg_lc/.
Year: 2016 PMID: 27922074 PMCID: PMC5138596 DOI: 10.1038/srep38367
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The main flowchart of the identification process.
Results based on three different sampling methods using random forest.
| Sampling method | Resistance genes (training set) | Non-resistance genes (training set) | SN | SP | Accuracy (%) | ROC area |
|---|---|---|---|---|---|---|
| Original instance | 6720 | 10028 | 0.821 | 0.696 | 77.0898 | 0.855 |
| Random-under-sampling | 6720 | 6720 | 0.831 | 0.687 | 75.878 | 0.850 |
| Weighted random-sampling | 6720 | 10028 | 0.767 | 0.761 | 76.3974 | 0.854 |
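
The random-under-sampling row balances the two classes by discarding majority-class instances until both classes hold 6720 members. A minimal Python sketch of that idea follows; the function name `random_under_sample`, the fixed seed, and the integer placeholders standing in for gene sequences are illustrative assumptions, not the authors' implementation.

```python
import random


def random_under_sample(positives, negatives, seed=0):
    """Return (positives, negatives) trimmed to the size of the smaller class.

    Hypothetical helper: the larger class is sampled without replacement
    so that both classes end up with the same number of instances.
    """
    rng = random.Random(seed)  # fixed seed only to keep the sketch reproducible
    n = min(len(positives), len(negatives))
    pos = rng.sample(list(positives), n) if len(positives) > n else list(positives)
    neg = rng.sample(list(negatives), n) if len(negatives) > n else list(negatives)
    return pos, neg


# Placeholder IDs standing in for the 6720 resistance and 10028 non-resistance genes.
resistance = list(range(6720))
non_resistance = list(range(10028))
balanced_pos, balanced_neg = random_under_sample(resistance, non_resistance)
print(len(balanced_pos), len(balanced_neg))  # 6720 6720
```

Under-sampling trades training data for class balance, which may explain why the accuracy in that row is slightly lower while sensitivity is slightly higher.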
Performance comparison of different classifiers.
| Classifier | Attributes | SN | SP | MCC | Accuracy (%) | ROC area |
|---|---|---|---|---|---|---|
| Random forest | 13440 | 0.831 | 0.687 | 0.523 | 75.878 | 0.850 |
| LibD3C | 13440 | 0.820 | 0.700 | 0.524 | 76.0045 | 0.846 |
| J48 | 13440 | 0.688 | 0.683 | 0.371 | 68.5491 | 0.678 |
| Bayes Network | 13440 | 0.810 | 0.597 | 0.417 | 70.3646 | 0.761 |
| Naive Bayes | 13440 | 0.882 | 0.264 | 0.185 | 57.2768 | 0.690 |
| KNN-IB1 | 13440 | 0.639 | 0.765 | 0.408 | 70.2158 | 0.706 |
| AdaBoostM1 | 13440 | 0.782 | 0.605 | 0.393 | 69.3601 | 0.763 |
| Bagging | 13440 | 0.786 | 0.696 | 0.483 | 74.0699 | 0.822 |
| GBDT | 13440 | 0.718 | 0.705 | 0.456 | 72.7902 | 0.818 |
| Random tree | 13440 | 0.673 | 0.672 | 0.346 | 67.2842 | 0.673 |
| RandomSubSpace | 13440 | 0.819 | 0.662 | 0.486 | 74.0179 | 0.826 |
| SMO | 13440 | 0.677 | 0.749 | 0.427 | 71.2798 | 0.713 |
| LibSVM | 13440 | 0.947 | 0.307 | 0.331 | 62.7232 | 0.627 |
Figure 2. Performance of test sets on different classifiers.
Performance comparison of 188-D features and 473-D features.
| Feature extraction method | Dimension | Resistance genes (training set) | Non-resistance genes (training set) | SN | SP | MCC | Accuracy (%) |
|---|---|---|---|---|---|---|---|
| 188-D | 188 | 6720 | 6720 | 0.831 | 0.687 | 0.523 | 75.878 |
| Pse-AAC | 30 | 6720 | 6720 | 0.761 | 0.627 | 0.392 | 69.4345 |
| 473-D | 473 | 178 | 226 | 0.371 | 0.752 | 0.133 | 58.4158 |
| 188-D | 188 | 0 | 3308 | | | | 69.347 |
| Pse-AAC | 30 | 0 | 3308 | | | | 60.9129 |
| 473-D | 473 | 20 | 20 | | | | 55.0 |
Figure 3. Prediction results of L.Crocea on different classification models.
Prediction results of ΩLC under different data balancing models.
| Prediction model | TP rate | TN rate | Accuracy (%) |
|---|---|---|---|
| Ω0riglR−g model | 0.453 | 0.547 | 45.3047 |
| Ωtr model | 0.646 | 0.354 | 64.6409 |
| Ωwtr model | 0.546 | 0.454 | 54.3956 |
Comparison of SVMProt-RF and NBSPred predictions for R-genes of L.Crocea.
| Dataset | Number of sequences | SVMProt-RF prediction | NBSPred prediction |
|---|---|---|---|
| L.Crocea Dataset | 18018 | 9801 | 457 of 17964 (total number after NBSPred) |
| Accuracy (%) | | 54.3956 | 2.5440 |
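
The accuracy row appears to be each tool's predicted R-gene count divided by the number of sequences it scored, expressed as a percentage. The short check below recomputes both values from the counts in the table; the helper name `percent` is an assumption made for illustration.

```python
def percent(count, total):
    """Return count / total as a percentage rounded to four decimals."""
    return round(100.0 * count / total, 4)


# Counts taken from the table above.
svmprot_rf = percent(9801, 18018)  # predicted R-genes / all sequences
nbspred = percent(457, 17964)      # predicted R-genes / sequences remaining after NBSPred
print(svmprot_rf, nbspred)
```

Both results agree with the reported 54.3956 and 2.5440 to four decimal places.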
Features of the PSBA1 R-gene in Acaryochloris marina.
| Property | Value of feature vector | ||||||
|---|---|---|---|---|---|---|---|
| amino acid composition | 9.3664 | 0.2755 | 1.6529 | 3.5813 | 6.0606 | 8.5399 | 3.8567 |
| | 7.1625 | 0.8264 | 12.1212 | 4.6832 | 3.5813 | 4.9587 | 2.7548 |
| | 3.3058 | 9.3664 | 6.8871 | 5.5096 | 2.7548 | 2.7548 | |
| Hydrophobic | 15.7025 | 45.7300 | 38.5675 | 12.9834 | 12.4309 | 37.5690 | 1.6529 |
| | 29.2011 | 62.8099 | 82.6446 | 97.5207 | 0.5510 | 24.7934 | 49.5868 |
| | 73.0027 | 100.0 | 1.6529 | 25.3443 | 52.066 | 75.7576 | 99.1735 |
| Van der Waals volume | 0.2755 | 28.9256 | 50.9642 | 74.1047 | 99.4490 | 41.3223 | 39.1185 |
| | 19.5592 | 33.9779 | 17.6796 | 12.7072 | 0.2755 | 23.4160 | 45.1791 |
| | 72.1763 | 99.4490 | 0.5510 | 23.1405 | 48.4848 | 73.8292 | 100.0 |
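
The 20 amino acid composition values in the table sum to roughly 100, which is consistent with composition expressed as the percentage of each residue type in the sequence. A minimal sketch of that calculation follows; the alphabetical one-letter ordering and the toy sequence are assumptions for illustration, and the remaining property-based features are not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering: alphabetical one-letter codes


def amino_acid_composition(sequence):
    """Return the percentage of each of the 20 standard residues in `sequence`."""
    sequence = sequence.upper()
    # Count only standard residues so that the percentages sum to 100.
    total = sum(sequence.count(aa) for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * sequence.count(aa) / total for aa in AMINO_ACIDS]


toy_sequence = "MTAILERRESA"  # toy example, not the PSBA1 sequence
composition = amino_acid_composition(toy_sequence)
print([round(v, 4) for v in composition])
```

The result is a fixed-length vector regardless of sequence length, which is what lets proteins of different sizes be fed to the same classifier.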
Confusion matrix of binary classification performance of R-gene.
| Actual class | Predicted positive | Predicted negative |
|---|---|---|
| Positive instance | TP (true positive) | FN (false negative) |
| Negative instance | FP (false positive) | TN (true negative) |
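
The SN, SP, MCC, and accuracy columns in the tables above are conventionally derived from the four confusion-matrix counts: true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). The sketch below uses the standard textbook definitions; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute sensitivity, specificity, accuracy, and MCC from confusion counts."""
    sn = tp / (tp + fn)                    # sensitivity (true positive rate)
    sp = tn / (tn + fp)                    # specificity (true negative rate)
    acc = (tp + tn) / (tp + fn + fp + tn)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical counts chosen only to exercise the formulas.
metrics = binary_metrics(tp=80, fn=20, fp=30, tn=70)
print({name: round(value, 3) for name, value in metrics.items()})
```

MCC is the most informative of the four when classes are imbalanced, because it uses all four cells of the matrix rather than one row or one diagonal.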