Prabina K Meher, Tanmaya K Sahu, Jyotilipsa Mohanty, Shachi Gahoi, Supriya Purru, Monendra Grover, Atmakuri R Rao.
Abstract
Because inorganic nitrogen compounds are essential for the basic building blocks of life (e.g., nucleotides and amino acids), biological nitrogen fixation (BNF) is indispensable. All nitrogen-fixing microbes rely on the same nitrogenase enzyme for nitrogen reduction, which is in fact an enzyme complex encoded by as many as 20 genes. However, six genes, viz., nifB, nifD, nifE, nifH, nifK, and nifN, have been proposed to be essential for a functional nitrogenase enzyme. Identifying these genes is therefore important both for understanding the mechanism of BNF and for exploring ways to improve BNF for agricultural sustainability. Although computational tools are available for the annotation and phylogenetic analysis of nifH gene sequences alone, to the best of our knowledge no tool is available for the computational prediction of all six categories of nitrogen-fixation (nif) genes or proteins. We therefore propose an approach, the first of its kind, for the computational identification of nif proteins encoded by the six categories of nif genes. Sequence-derived features were used to map the input sequences onto numeric vectors, which were then supplied as input to a support vector machine (SVM). Two types of classifier were constructed: (i) a binary classifier to discriminate nif from non-nitrogen-fixation (non-nif) proteins, and (ii) a multi-class classifier to assign nif proteins to one of the six categories. The combination of the composition-transition-distribution (CTD) feature set and the radial kernel gave higher accuracies than the other feature-kernel combinations. Overall accuracies were >90% in both binary and multi-class classification. The approach further achieved >92% accuracy when evaluated with blind (independent) test datasets.
The approach also identified nif proteins with high accuracy when evaluated on proteome-wide datasets of several species. Furthermore, we established a prediction server, nifPred (http://webapp.cabgrid.res.in/nifPred), to assist the scientific community in the proteome-wide identification of the six categories of nif proteins. The source code of nifPred is also available at https://github.com/PrabinaMeher/nifPred. The web server is expected to supplement transcriptional profiling and comparative genomics studies for the identification and functional annotation of genes related to BNF.
Keywords: Fe protein; Fe-Mo protein; biological nitrogen fixation; di-nitrogenase; diazotroph; nitrogenase
Year: 2018 PMID: 29896173 PMCID: PMC5986947 DOI: 10.3389/fmicb.2018.01100
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
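The abstract describes mapping each protein sequence onto a numeric vector with composition-transition-distribution (CTD) features before classification. The following is a minimal sketch of CTD descriptors for a single physicochemical property, not the nifPred implementation. The three-class hydrophobicity grouping, the function name `ctd_features`, and the percentile indexing rule are assumptions made for illustration; the record above does not state which properties or conventions nifPred uses.

```python
import math
from typing import Dict

# Assumed three-class hydrophobicity grouping often used with CTD
# descriptors; the properties used by nifPred are not given in this record.
GROUPS = {"1": "RKEDQN", "2": "GASTPHY", "3": "CLVIMFW"}
AA_TO_CLASS = {aa: g for g, aas in GROUPS.items() for aa in aas}


def ctd_features(seq: str) -> Dict[str, float]:
    """Return 21 CTD descriptors (3 C + 3 T + 15 D) for one property."""
    # Encode residues as class labels, skipping non-standard letters.
    enc = [AA_TO_CLASS[a] for a in seq.upper() if a in AA_TO_CLASS]
    n = len(enc)
    if n < 2:
        raise ValueError("need at least two standard residues")
    feats: Dict[str, float] = {}

    # Composition: fraction of residues falling in each class.
    for g in GROUPS:
        feats[f"C{g}"] = enc.count(g) / n

    # Transition: fraction of adjacent pairs that switch between two classes.
    pairs = list(zip(enc, enc[1:]))
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        switches = sum(1 for p in pairs if p in ((a, b), (b, a)))
        feats[f"T{a}{b}"] = switches / (n - 1)

    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # occurrence of each class (indexing conventions vary by implementation).
    marks = (("first", 0.0), ("25", 0.25), ("50", 0.5), ("75", 0.75), ("last", 1.0))
    for g in GROUPS:
        pos = [i + 1 for i, c in enumerate(enc) if c == g]
        for label, frac in marks:
            if not pos:
                feats[f"D{g}_{label}"] = 0.0
            else:
                k = max(1, math.ceil(frac * len(pos)))
                feats[f"D{g}_{label}"] = pos[k - 1] / n
    return feats


# Toy sequence, for illustration only.
example = ctd_features("RKGACL")
```

Full CTD feature sets are usually built by repeating this calculation over several properties and concatenating the results; the resulting vector is what a classifier such as an SVM would receive.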
Summary of the collected datasets at different levels of pair-wise sequence identity.
| Sequence identity (%) | | | | | | | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 60 | 8 | 13 | 24 | 20 | 41 | 25 | 116 |
| 70 | 13 | 24 | 37 | 39 | 57 | 38 | 193 |
| 90 | 59 | 72 | 86 | 80 | 80 | 74 | 438 |
Three datasets were prepared by using the CD-HIT program to exclude sequences whose pair-wise sequence identity exceeded the chosen threshold (first column). The numbers of sequences retained at each identity level show that nifH sequences are the most conserved and nifN sequences the least conserved among the six categories of protein sequences. The last column gives the total number of sequences in the dataset at each level of pair-wise sequence identity.
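The redundancy filtering described above could be driven from a script along the following lines. This is a sketch under stated assumptions: it only builds `cd-hit` command lines (it does not run them), the FASTA file names are hypothetical, and the word sizes follow the ranges recommended in the CD-HIT user guide rather than any setting reported in this record.

```python
from typing import List


def cdhit_word_size(identity_pct: int) -> int:
    """Word size (-n) recommended by the CD-HIT user guide for a threshold."""
    if identity_pct >= 70:
        return 5
    if identity_pct >= 60:
        return 4
    if identity_pct >= 50:
        return 3
    if identity_pct >= 40:
        return 2
    raise ValueError("cd-hit does not support identity thresholds below 40%")


def cdhit_command(in_fasta: str, out_fasta: str, identity_pct: int) -> List[str]:
    """Build a cd-hit call that keeps one representative per cluster."""
    if identity_pct > 100:
        raise ValueError("identity threshold cannot exceed 100%")
    return [
        "cd-hit",
        "-i", in_fasta,                 # input FASTA file
        "-o", out_fasta,                # output FASTA of representatives
        "-c", str(identity_pct / 100),  # sequence identity threshold
        "-n", str(cdhit_word_size(identity_pct)),
    ]


# Hypothetical file names; one command per identity level listed above.
commands = [
    cdhit_command("nif_all.fasta", f"nif_{pct}.fasta", pct)
    for pct in (60, 70, 90)
]
```

Each list can be passed to `subprocess.run` on a machine where CD-HIT is installed.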
Figure 1. Flow diagram of the prediction process. The diagram shows the steps involved in constructing the binary and multi-class classifiers and in predicting a test instance in two stages.
Figure 2. Graphical representation of performance metrics for different feature-kernel combinations. (A) ROC curves of four different SVM kernels for classification of nitrogen-fixation (nif) and non-nitrogen-fixation (non-nif) proteins with seven different feature sets. (B) Bar plots of AUC-ROC for the radial kernel of SVM with different combinations of features. (C) ROC curves for different feature sets for classification of nif and non-nif proteins using four different SVM kernels. (D) Bar plots of performance metrics for four different kernels with the CTD feature set. The figures show that the combination of the radial kernel and the CTD feature set performs better than the other feature-kernel combinations for classification of nif and non-nif proteins.
Figure 3. Bar diagrams of the estimates of performance metrics for different supervised learning techniques. The performance of SVM was compared with that of six other machine learning approaches for classification of nif and non-nif proteins with the CTD feature set. Classification accuracies increased with the pair-wise sequence identity level of the positive dataset. The kNN and NB classifiers have the lowest accuracies, whereas SVM has the highest, followed by the RF classifier. The performance metrics of SVM are also more stable (smaller standard error) than those of the other classifiers.
Figure 4. Graphical representation of the prediction accuracy of the developed approach under jackknife validation. (A) Confusion matrix and (B) bar plots of performance metrics for classification of the six categories of nitrogen-fixation proteins using the jackknife validation technique. The confusion matrix (color from light to dark represents lower to higher numbers) shows that misclassified protein sequences fall mostly in the nifN category, whereas no sequence is misclassified in the nifH category. The performance metrics show that nifH, nifD, and nifK are discriminated from the rest of the sequences with higher accuracy than the other three categories.
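Jackknife (leave-one-out) validation, as named in the Figure 4 caption, holds out each sequence once, trains on the remainder, and tallies true versus predicted categories. The sketch below illustrates that procedure only. The classifier is a 1-nearest-neighbour stand-in rather than the SVM used in the paper, and the feature vectors and labels are made-up toy data.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Classifier = Callable[[List[Vector], List[str], Vector], str]


def nearest_neighbour(train_x: List[Vector], train_y: List[str], query: Vector) -> str:
    """Stand-in classifier (1-NN, squared Euclidean distance); NOT the SVM."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )
    return train_y[best]


def jackknife_confusion(
    x: List[Vector], y: List[str], classify: Classifier = nearest_neighbour
) -> Dict[Tuple[str, str], int]:
    """Leave-one-out validation: count (true label, predicted label) pairs."""
    counts: Dict[Tuple[str, str], int] = {}
    for i in range(len(x)):
        # Train on every sample except the i-th, then predict the i-th.
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        predicted = classify(train_x, train_y, x[i])
        counts[(y[i], predicted)] = counts.get((y[i], predicted), 0) + 1
    return counts


# Toy two-dimensional features and labels, for illustration only.
toy_x: List[Vector] = [[0.0, 0.1], [0.1, 0.0], [1.0, 1.1], [1.1, 1.0], [0.45, 0.5]]
toy_y = ["nifH", "nifH", "nifN", "nifN", "nifN"]
confusion = jackknife_confusion(toy_x, toy_y)
```

In the toy data the last point lies closer to the nifH cluster, so it is the one sample counted off the diagonal of the confusion matrix.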
Performance metrics of the proposed approach and blast algorithms.
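| Method | Sensitivity | Specificity | Accuracy | Precision | MCC |
| --- | --- | --- | --- | --- | --- |
| Proposed | 0.887 | 0.993 | 0.940 | 0.992 | 0.885 |
| BlastP | 0.995 | 0.538 | 0.767 | 0.683 | 0.600 |
| PSI-Blast | 0.995 | 0.545 | 0.770 | 0.686 | 0.605 |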
The performance of the developed method was compared with that of BlastP and PSI-Blast for the classification of nitrogen-fixation (nif) and non-nitrogen-fixation (non-nif) proteins, with performance measured over the five folds of cross-validation. The BLAST algorithms are highly biased toward the nif category. Although the sensitivity of the proposed approach is lower than that of BlastP and PSI-Blast, its specificity is much higher. Overall, the accuracy, precision, and Matthews correlation coefficient (MCC) of the proposed approach are much higher than those of the BLAST algorithms.
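The five metrics compared above (sensitivity, specificity, accuracy, precision, MCC) can all be computed from the four cells of a binary confusion matrix. The sketch below uses the standard definitions. The counts are hypothetical: 1000 nif and 1000 non-nif sequences, chosen so that sensitivity and specificity equal the 0.887 and 0.993 reported for the proposed approach. With these assumed counts the formulas give accuracy 0.940, precision 0.992, and MCC 0.885, which agrees with the reported values and is consistent with, though it does not establish, equally sized classes.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),            # true positive rate
        "specificity": tn / (tn + fp),            # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "precision": tp / (tp + fp),
        # Matthews correlation coefficient; defined as 0 if a margin is empty.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts (not taken from the paper), see the note above.
metrics = binary_metrics(tp=887, fn=113, tn=993, fp=7)
```

MCC uses all four cells, which is why it penalises the BLAST-style pattern of very high sensitivity combined with low specificity more strongly than sensitivity alone would suggest.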
Performance metrics of the proposed approach and hidden Markov model (HMM).
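| Method | | Sensitivity | Specificity | Accuracy | Precision | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| HMM | 1 | 0.834 | 0.979 | 0.907 | 0.980 | 0.841 |
| HMM | 10 | 0.876 | 0.814 | 0.845 | 0.813 | 0.709 |
| Proposed | NA | 0.887 | 0.993 | 0.940 | 0.992 | 0.885 |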
The performance of the developed approach was also compared with that of HMM for classification of nitrogen-fixation (nif) and non-nitrogen-fixation (non-nif) proteins. On every performance metric, the developed approach achieved higher values than HMM. NA, not applicable.
Figure 5. Confusion matrices of the prediction results for the independent datasets. (A) Confusion matrix showing the prediction result for Test set-I, and (B) confusion matrix showing the prediction result for Test set-II, where the prediction was made in two stages. The color from light to dark represents lower to higher numbers. For Test set-I, >90% of sequences are correctly predicted in every category except nifN. Similarly, for Test set-II, >90% accuracies are observed in every category except nifB and nifH.
Performance of the proposed approach for prediction of nitrogen-fixation (nif) proteins using proteome-wide datasets.
| Non-Diazotroph | 6032 | 19 | 0 | 25 | 3 | 3 | 131 | 8 | 0 | 13 | 0 | 0 | 38 | |
| Non-Diazotroph | 6012 | 24 | 1 | 37 | 1 | 8 | 155 | 8 | 0 | 16 | 0 | 1 | 55 | |
| Non-Diazotroph | 7582 | 19 | 1 | 40 | 1 | 7 | 154 | 4 | 1 | 24 | 0 | 2 | 59 | |
| Non-Diazotroph | 7137 | 19 | 2 | 41 | 1 | 8 | 146 | 4 | 1 | 22 | 0 | 2 | 59 | |
| Non-Diazotroph | 6849 | 21 | 1 | 39 | 2 | 6 | 139 | 5 | 1 | 22 | 0 | 0 | 57 | |
| Non-Diazotroph | 4599 | 17 | 3 | 15 | 1 | 4 | 125 | 3 | 1 | 7 | 0 | 2 | 52 | |
| Non-Diazotroph | 4692 | 14 | 3 | 17 | 0 | 3 | 133 | 2 | 1 | 8 | 0 | 0 | 50 | |
| Non-Diazotroph | 4662 | 12 | 1 | 24 | 1 | 1 | 114 | 3 | 1 | 13 | 1 | 0 | 53 | |
| Non-Diazotroph | 6275 | 19 | 0 | 35 | 1 | 4 | 162 | 4 | 0 | 14 | 0 | 0 | 56 | |
| Non-Diazotroph | 5816 | 13 | 1 | 27 | 0 | 3 | 132 | 4 | 1 | 14 | 0 | 1 | 40 | |
| Diazotroph | 4773 | 21 | 4 | 20 | 1 | 3 | 142 | 7 | 2 | 7 | 1 | 3 | 53 | |
| Diazotroph | 4894 | 15 | 3 | 26 | 1 | 5 | 130 | 4 | 2 | 11 | 1 | 3 | 52 | |
| Diazotroph | 4291 | 21 | 2 | 26 | 1 | 6 | 124 | 7 | 1 | 13 | 1 | 3 | 37 | |
| Diazotroph | 4604 | 18 | 3 | 23 | 2 | 4 | 116 | 4 | 2 | 12 | 1 | 1 | 40 | |
| Diazotroph | 5319 | 20 | 3 | 24 | 1 | 8 | 150 | 4 | 2 | 12 | 1 | 4 | 44 | |
| Diazotroph | 5005 | 23 | 1 | 25 | 2 | 4 | 150 | 7 | 1 | 13 | 2 | 2 | 59 | |
| Diazotroph | 5542 | 17 | 2 | 38 | 1 | 4 | 128 | 5 | 1 | 18 | 1 | 2 | 46 | |
| Diazotroph | 5792 | 19 | 2 | 37 | 2 | 2 | 131 | 7 | 2 | 20 | 1 | 1 | 44 | |
| Diazotroph | 4261 | 18 | 2 | 38 | 2 | 3 | 135 | 5 | 2 | 21 | 2 | 2 | 44 | |
| Diazotroph | 4559 | 23 | 2 | 47 | 5 | 3 | 147 | 9 | 1 | 27 | 5 | 2 | 46 | |
The nif proteins of 10 diazotrophs and 10 non-diazotrophs are predicted in two stages: in the first stage, sequences are predicted as nif or non-nif, and only those predicted as nif are passed to the second stage, where they are assigned to one of the six categories of nif proteins. In the first stage, classification accuracies are >96%. Although the number of false positives in the second stage is somewhat larger at the default threshold, it is reduced by ~60% at a threshold of 0.4. Interestingly, at the threshold of 0.4 no sequence is predicted in the nifH category for the non-diazotrophs, except in one species.
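The two-stage decision described above can be sketched as a small function that takes already-computed probabilities. Several details here are assumptions rather than facts from this record: the stage-1 cutoff of 0.5, treating the "default threshold" as a plain arg-max with no minimum probability, reading the 0.4 threshold as a minimum on the top category probability, and the label `"unassigned"` for sequences that fail it.

```python
from typing import Dict, Optional

NIF_CLASSES = ("nifB", "nifD", "nifE", "nifH", "nifK", "nifN")


def two_stage_predict(
    p_nif: float,
    class_probs: Dict[str, float],
    stage1_cutoff: float = 0.5,                 # assumed value
    stage2_threshold: Optional[float] = None,   # None = arg-max only (assumed)
) -> str:
    """Stage 1: nif vs non-nif. Stage 2: assign one of six nif categories."""
    # Stage 1: sequences not predicted as nif never reach the second stage.
    if p_nif < stage1_cutoff:
        return "non-nif"
    # Stage 2: pick the most probable of the six categories.
    best = max(NIF_CLASSES, key=lambda c: class_probs.get(c, 0.0))
    # Optional rejection step: require a minimum probability for the winner.
    if stage2_threshold is not None and class_probs.get(best, 0.0) < stage2_threshold:
        return "unassigned"
    return best


# Made-up probabilities for one sequence, for illustration only.
probs = {"nifB": 0.10, "nifD": 0.10, "nifE": 0.35, "nifH": 0.05, "nifK": 0.10, "nifN": 0.30}
default_call = two_stage_predict(0.9, probs)
strict_call = two_stage_predict(0.9, probs, stage2_threshold=0.4)
```

Under this reading, raising the stage-2 threshold trades some recall for fewer false positives, since weakly supported category assignments are dropped instead of being reported.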
Figure 6. Heat map of the prediction probabilities of nif protein sequences. It shows the probabilities with which the protein sequences of the six nitrogen-fixation categories are predicted in the second stage for the proteome-wide datasets of 10 diazotrophs. The color from light to dark represents lower to higher probabilities, and a blank cell indicates that no sequence was predicted in the corresponding category. Except for one nifE and one nifN, all the nif sequences are correctly predicted. Further, with the threshold 0.4, all nif sequences except two nifK sequences are correctly predicted in their respective categories.
Performance of nifPred in identifying nif proteins in 49 diazotroph species.
To assess the performance of nifPred at a threshold of 0.4, nif proteins were predicted using proteome-wide datasets of diazotrophs. nifH, nifD, and nifE are correctly predicted in all 49 species. In addition, 42 nifK, 34 nifB, and 19 nifN proteins are correctly identified across the 49 species.
Table symbols denote, respectively: wrongly predicted; predicted with highest probability; predicted with second highest probability.