| Literature DB >> 35360826 |
Jefferson Daniel Suquilanda-Pesántez1, Evelyn Dayana Aguiar Salazar1, Diego Almeida-Galárraga1, Graciela Salum1, Fernando Villalba-Meneses1, Marco Esteban Gudiño Gomezjurado1.
Abstract
Atmospheric nitrogen fixation carried out by microorganisms has environmental and industrial importance, related to the increase of soil fertility and productivity. The present work proposes the development of a new high-precision system that recognizes amino acid sequences of the nitrogenase enzyme (NifH) as a promising way to improve the identification of diazotrophic bacteria. For this purpose, a processed dataset was built from a UniProt database, comprising 4911 NifH and 4782 non-NifH amino acid sequences, respectively. Subsequently, feature extraction was carried out with two methodologies: (i) k-mers counting and (ii) embedding layers, to obtain numerical vectors from the amino acid chains. Afterward, for the embedding layers, the data were passed through a trainable convolutional layer, which received a uniform matrix and applied convolution with filters to obtain the feature maps of the model. Finally, a deep neural network was used as the primary model to classify the amino acid sequences as NifH protein or not. Performance evaluation experiments were carried out, and the results revealed an accuracy of 96.4%, a sensitivity of 95.2%, and a specificity of 96.7%. Therefore, an amino acid sequence-based feature extraction method that uses a neural network to detect N-fixing organisms is proposed and implemented. NIFtHool is available from: https://nifthool.anvil.app/.
Keywords: Deep Neural Network; Embedding Layers; NifH protein; Software; k-mers
Year: 2022 PMID: 35360826 PMCID: PMC8956849 DOI: 10.12688/f1000research.107925.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. Nitrogenase complex.
The nifHDK operon (genes encoding the Fe/Mo-Fe nitrogenase protein complex) encodes the subunits of the nitrogenase enzyme, which catalyzes the reduction of N₂ to NH₄⁺ in an ATP-dependent manner through electron flux from the dinitrogenase reductase to the molybdenum-iron (Mo-Fe) protein subunit. Modified from Ref. 10.
Figure 2. Description of the methodology applied in this work.
i) Data acquisition, ii) feature extraction, iii) deep learning modeling, iv) k-fold cross-validation, and v) prediction and reporting. Dinitrogenase reductase = NifH.
Figure 3. Correlation of the number of sequences with the number of amino acids (aa).
The curves show the correlation between the number of sequences and the number of amino acids of dinitrogenase reductase (NifH) and non-NifH proteins.
Figure 4. Description of feature extraction.
This process comprised sequential steps: 1) input of the amino acid (aa) sequences, 2) feature extraction, and 3) development of the convolutional layer, where M is the length of the input sequence and L is the length of the embedding vector.
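The two feature-extraction routes can be sketched in a few lines of Python. This is an illustrative, stdlib-only sketch: the k value, padding length, and the zero-index-for-padding convention are our assumptions, not settings stated in the paper.

```python
from collections import Counter

AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AA_ALPHABET)}  # 0 reserved for padding

def kmer_counts(seq: str, k: int) -> Counter:
    """(i) k-mers counting: tally every overlapping window of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def integer_encode(seq: str, maxlen: int) -> list:
    """(ii) Embedding input: map residues to integer indices and pad/truncate
    to maxlen, so every sequence becomes a uniform-length vector (length M)
    that an embedding layer can turn into an M x L matrix."""
    ids = [AA_INDEX.get(aa, 0) for aa in seq[:maxlen]]
    return ids + [0] * (maxlen - len(ids))
```

For example, `kmer_counts("MAMAM", 2)` yields `{"MA": 2, "AM": 2}`, and `integer_encode("MA", 4)` yields `[11, 1, 0, 0]`.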
Performance metrics (precision, recall, F1-score, accuracy, and loss) calculated for each feature-extraction methodology.
| Methodology | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) | Loss (%) |
|---|---|---|---|---|---|
| Embedding Vectors (EV) | 93 | 93 | 93 | 92.82 | 20.96 |
| 3-mers (100 features) + EV | 95 | 96 | 95 | 95.46 | 17.34 |
| 3-mers (200 features) + EV | 96 | 96 | 96 | 95.58 | 21.42 |
| 3-mers (300 features) + EV | 96 | 96 | 96 | 95.71 | 19.32 |
| 5-mers (100 features) + EV | 95 | 95 | 95 | 95.38 | 13.54 |
| 7-mers (100 features) + EV | 96 | 96 | 96 | 95.5 | 13.51 |
| 7-mers (200 features) + EV | 95 | 95 | 95 | 95.38 | 16.14 |
| 7-mers (300 features) + EV | 96 | 96 | 96 | 95.38 | 15.43 |
| 15-mers (100 features) + EV | 95 | 95 | 95 | 94.97 | 15.45 |
| 15-mers (200 features) + EV | 94 | 94 | 94 | 94.39 | 16.71 |
| 20-mers (100 features) + EV | 94 | 94 | 94 | 93.69 | 18.96 |
| 20-mers (200 features) + EV | 93 | 93 | 93 | 93.48 | 19.12 |
| 3-mers (100 f) + 5-mers (100 f) + EV | 96 | 96 | 96 | 95.71 | 19.17 |
| 3-mers (100 f) + 7-mers (300 f) + EV | 96 | 96 | 96 | 96.04 | 16.78 |
| 3-mers (300 f) + 7-mers (300 f) + EV | 96 | 96 | 96 | 96.29 | 20.31 |
| 5-mers (100 f) + 7-mers (100 f) + EV | 96 | 96 | 96 | 96.37 | 14.79 |
| 5-mers (100 f) + 7-mers (300 f) + EV | 96 | 96 | 96 | 95.63 | 13.9 |
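A plausible reading of "3-mers (100 features)" in the table above is that only the 100 most frequent 3-mers in the training corpus are kept as count features. A stdlib-only sketch of that selection follows; the frequency-ranking rule is our assumption, since the table does not state how features were chosen.

```python
from collections import Counter

def top_kmer_vectorizer(train_seqs, k, n_features):
    """Build a vocabulary of the n_features most frequent k-mers in the
    training corpus, and return it with a function that maps any sequence
    to its count vector over that fixed vocabulary."""
    totals = Counter()
    for seq in train_seqs:
        totals.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocab = [km for km, _ in totals.most_common(n_features)]

    def vectorize(seq):
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        return [counts[km] for km in vocab]

    return vocab, vectorize
```

Concatenating such count vectors for several k values with the embedding vectors (EV) would reproduce the combined rows of the table, e.g. "3-mers (100 f) + 7-mers (300 f) + EV".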
Figure 5. Visualisation of the deep neural network architecture.
It was composed of four blocks, and the number of neurons for each block was 128, 64, 32, and 2, from the first to last one, respectively.
Layers of the deep neural network implemented in this model.
| Layer | Type | Output shape | Param # |
|---|---|---|---|
| dense_108 | Dense | (None, 128) | 42112 |
| dropout_96 | Dropout | (None, 128) | 0 |
| dense_109 | Dense | (None, 64) | 8256 |
| batch_normalization_56 | Batch | (None, 64) | 256 |
| activation_82 | Activation | (None, 64) | 0 |
| dropout_83 | Dropout | (None, 64) | 0 |
| dense_110 | Dense | (None, 32) | 2080 |
| batch_normalization_57 | Batch | (None, 32) | 128 |
| activation_83 | Activation | (None, 32) | 0 |
| dropout_84 | Dropout | (None, 32) | 0 |
| dense_111 | Dense | (None, 2) | 66 |
| activation_84 | Activation | (None, 2) | 0 |
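The parameter counts in the layer table follow directly from the layer sizes, and checking them also reveals the network's input dimensionality: the first Dense layer's 42112 parameters imply 328 input features. That inference of 328 is ours, not stated in the table.

```python
def dense_params(n_in, n_out):
    """A Dense layer stores an n_in x n_out weight matrix plus n_out biases."""
    return n_in * n_out + n_out

def batchnorm_params(n_units):
    """BatchNormalization stores gamma, beta, moving mean, and moving variance
    per unit; Keras summaries count all four."""
    return 4 * n_units

# First Dense layer: n_in * 128 + 128 = 42112  =>  n_in = 328 input features
assert dense_params(328, 128) == 42112  # dense_108
assert dense_params(128, 64) == 8256    # dense_109
assert batchnorm_params(64) == 256      # batch_normalization_56
assert dense_params(64, 32) == 2080     # dense_110
assert batchnorm_params(32) == 128      # batch_normalization_57
assert dense_params(32, 2) == 66        # dense_111
```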
Figure 6. Four-fold cross-validation diagram.
Data were divided into four folds; in each of the four iterations, three folds were used to train the classifier and one was held out for testing.
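The rotation in Figure 6 can be expressed with a stdlib-only splitter. The interleaved assignment of samples to folds below is an illustrative choice; the paper does not specify its exact splitting scheme.

```python
def k_fold_splits(indices, n_folds=4):
    """Yield (train, test) index lists; each fold serves as the test set
    exactly once across the n_folds iterations."""
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test
```

Each (train, test) pair is disjoint and together covers the whole dataset, matching the diagram's three-folds-train / one-fold-test layout.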
Figure 7. TensorBoard (RRID:SCR_016345) visualization of the distributed training metrics for the classifier after 30 epochs.
The x-axis represents the number of epochs and the y-axis represents accuracy and loss values on a unit scale (1 = 100%). a) Loss evaluation. b) Accuracy evaluation.
Figure 8. Assessment of the efficiency of the Deep Neural Network by a Confusion matrix.
a) Panel a shows the confusion matrix for the number of evaluated sequences, and panel b corresponds to the number of evaluated sequences normalized to one. TP: true positive =1188, TN: true negative = 1132, FP: false positive = 25, and FN: false negative = 48.
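Applying the standard definitions to the counts in panel a gives the values below; these single-split figures differ slightly from the metrics reported in the comparison table, presumably because the latter average over cross-validation folds.

```python
TP, TN, FP, FN = 1188, 1132, 25, 48  # counts from Figure 8a

sensitivity = TP / (TP + FN)                # ≈ 0.961
specificity = TN / (TN + FP)                # ≈ 0.978
accuracy = (TP + TN) / (TP + TN + FP + FN)  # ≈ 0.969
```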
Performance of our model in comparison with other machine learning methods.
| Method | Learning technology | Purpose | Database | Sensitivity (%) | Specificity (%) | Accuracy (%) | Reference |
|---|---|---|---|---|---|---|---|
| Embedding vectors and k-mers counting | NN | Identification of NifH proteins | 9793 | 95.17 | 96.67 | 96.37 | This work |
| Image processing and … | NN | Identification of NifH proteins | 42767 | 98.26 | 88.79 | 99.00 | |
| Feature Generation and … | | Identification of Nif proteins: NifH, NifD, NifK, NifE, NifN, NifB | 747 | 88.70 | 99.30 | 94.00 | |
| Embedding vectors and … | NN | Identification of … | 3776 | 100 | 99.33 | 99.50 | |
| Embedding vectors and … | NN | Identification of DNA-binding proteins | 1261 | 98.00 | 97.00 | 99.02 | |
| | ML | Classification of NifH Protein Sequences | 32954 | N/D | N/D | 95-99 | |
| CART and decision trees | ML | Classification of NifH Protein Sequences | 290 | N/D | N/D | 96-97 | |
ANN: Artificial neural network.
CNN: Convolutional neural network.
SVM: Support vector machine.
MBD-LSTM: Multilayer bi-directional long short term memory.
DeepDBP-ANN: Deep neural networks for identification of DNA binding proteins.
CART: Classification and regression trees statistical models.
NN: Neural networks.
ML: Machine learning.
NifH: Nitrogenase Iron Protein.
NifD: Nitrogenase molybdenum-iron protein alpha chain.
NifK: Nitrogenase molybdenum-iron protein beta chain.
NifE: Nitrogenase iron-molybdenum cofactor biosynthesis protein NifE.
NifN: Nitrogenase iron-molybdenum cofactor biosynthesis protein NifN.
NifB: Nitrogenase iron-molybdenum cofactor biosynthesis protein NifB.
N/D: No data.