Shriprabha R Upadhyaya, Philipp E Bayer, Cassandria G Tay Fernandez, Jakob Petereit, Jacqueline Batley, Mohammed Bennamoun, Farid Boussaid, David Edwards.
Abstract
Gene models are regions of the genome that can be transcribed into RNA and translated to proteins, or belong to a class of non-coding RNA genes. The prediction of gene models is a complex process that can be unreliable, leading to false positive annotations. To help support the calling of confident conserved gene models and minimize false positives arising during gene model prediction we have developed Truegene, a machine learning approach to classify potential low confidence gene models using 14 gene and 41 protein-based characteristics. Amino acid and nucleotide sequence-based features were calculated for conserved (high confidence) and non-conserved (low confidence) annotated genes from the published Pisum sativum Cameor genome. These features were used to train eXtreme Gradient Boost (XGBoost) classifier models to predict whether a gene model is likely to be real. The optimized models demonstrated a prediction accuracy ranging from 87% to 90% and an F-1 score of 0.91-0.94. We used SHapley Additive exPlanations (SHAP) and feature importance plots to identify the features that contribute to the model predictions, and we show that protein and gene-based features can be used to build accurate models for gene prediction that have applications in supporting future gene annotation processes.
Keywords: SHAP; XGBoost; gene models; machine learning; pea
Year: 2022 PMID: 35736770 PMCID: PMC9230120 DOI: 10.3390/plants11121619
Source DB: PubMed Journal: Plants (Basel) ISSN: 2223-7747
Pearson correlation coefficient for protein features with p-value < 0.05.
| Feature 1 | Feature 2 | Correlation Coefficient (r) |
|---|---|---|
| Length | Flexibility | 0.99 |
| Length | Molecular weight | 0.99 |
| Length | Molar extinction coefficient reduced | 0.81 |
| Length | Molar extinction coefficient oxidised | 0.81 |
| Aliphaticity | Aliphatic index | 0.93 |
| Gravy Value | Non-polar amino acids | 0.86 |
| Tiny amino acids | Amino acid percentage G | 0.57 |
| Iso-electric point | Acidic amino acids | −0.58 |
| Tiny amino acids | Amino acid percentage A | 0.48 |
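
The coefficients in the table above are ordinary Pearson correlations between pairs of feature columns. A minimal sketch of how such a table could be produced is shown below. The function names, the 0.8 threshold, and the toy feature values are all illustrative assumptions and are not taken from the paper, and the sketch does not compute the p-value used for the significance filter.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(features, threshold=0.8):
    """List feature pairs whose |r| meets the (hypothetical) threshold."""
    names = sorted(features)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson_r(features[first], features[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 2)))
    return pairs


# Toy values invented for illustration only (not data from the study).
toy_features = {
    "length": [100, 200, 300, 400],
    "molecular_weight": [11000, 22100, 32900, 44000],
    "isoelectric_point": [5.0, 9.0, 6.0, 8.0],
}

for first, second, r in correlated_pairs(toy_features):
    print(first, second, r)
```

In this toy data only length and molecular weight are strongly correlated, which mirrors the near-perfect correlation reported for that pair in the table.
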
Pearson correlation test for nucleotide features with p-value < 0.05.
| Feature 1 | Feature 2 | Correlation Coefficient (r) |
|---|---|---|
| Length | Molecular weight | 0.99 |
| Length | Entropy | 0.83 |
| Length | Melting temperature | 0.17 |
| Length | Zlib compression ratio | −0.68 |
| GC content | Melting temperature | 0.92 |
| GC content | GC at position 3 | 0.72 |
| GC content | GC at position 2 | 0.71 |
| GC content | GC at position 1 | 0.58 |
| Molecular weight | Entropy | 0.83 |
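
Several of the nucleotide features named in the table above can be computed directly from a coding sequence. The sketch below shows one plausible way to do so with the Python standard library. The exact definitions used in the study are not given here, so the choice of Shannon entropy over single bases and of compressed size divided by original size for the zlib ratio are assumptions, as are all function names.

```python
import math
import zlib
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in the sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_at_codon_position(seq, position):
    """GC fraction at codon position 1, 2, or 3, assuming the sequence is in frame."""
    if position not in (1, 2, 3):
        raise ValueError("position must be 1, 2, or 3")
    return gc_content(seq[position - 1::3])


def shannon_entropy(seq):
    """Shannon entropy (bits) of the single-base composition (assumed definition)."""
    seq = seq.upper()
    counts = Counter(seq)
    total = len(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def zlib_compression_ratio(seq):
    """Compressed size divided by original size (assumed definition)."""
    raw = seq.upper().encode("ascii")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


example = "ATGGCCGCTAAGGGCTGA"  # made-up in-frame sequence for illustration
print(gc_content(example))
print([gc_at_codon_position(example, p) for p in (1, 2, 3)])
print(shannon_entropy(example))
print(zlib_compression_ratio(example))
```

Under this definition, long repetitive sequences compress to a smaller fraction of their original size, which is consistent with the negative correlation between length and compression ratio in the table.
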
Evaluation metrics for XGBoost classifier models.
| Evaluation Metric | Protein Model | Nucleotide Model |
|---|---|---|
| Prediction Accuracy | 89.92% | 86.72% |
| 10-fold cross validation | 88.66% (±0.65%) | 85.38% (±0.40%) |
| F1 score | 0.94 | 0.91 |
| Average precision score | 0.93 | 0.90 |
| MCC | 0.93 | 0.93 |
| AUC value | 0.94 | 0.92 |
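
Accuracy, F1 score, and the Matthews correlation coefficient (MCC) in the table above are all functions of the four confusion-matrix counts. The sketch below shows the standard formulas. The function name and the example counts are hypothetical and are not the counts from the study. Average precision and AUC depend on the predicted scores rather than on a single confusion matrix, so they are not covered here.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC uses all four cells, so it stays informative for imbalanced classes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```
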
Figure 1. AUROC curves for the (A) protein model and (B) nucleotide model. The true positive rate is plotted against the false positive rate at different classification thresholds.
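
To make the threshold sweep behind such a curve concrete, the following sketch builds the (false positive rate, true positive rate) points from predicted scores and integrates them with the trapezoid rule. It is a generic illustration, not the plotting code used for the figure. The function names and the toy labels and scores are assumptions.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, lowering the threshold one distinct score at a time."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples tied at the same score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy labels (1 = high confidence) and scores, invented for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_points(toy_labels, toy_scores))
print(auc(roc_points(toy_labels, toy_scores)))
```
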
Figure 2. PR curves for the (A) protein model and (B) nucleotide model. Precision is plotted against recall at different probability thresholds.
Figure 3. Confusion matrix for the (A) protein model and (B) nucleotide model. The matrix is coloured based on the number of sequences in each class.
Figure 4. Feature importance (gain) plot for the XGBoost protein classifier model, showing the top 20 features contributing to the model.
Figure 5. Feature importance (gain) plot for the nucleotide model, showing the 14 features contributing to the model. CAI: Codon Adaptation Index; GC_Stdev: standard deviation of the GC skew value.
Figure 6. Beeswarm plot for the top 20 features that contribute to the protein model. Each dot indicates one value, and the dots pile up in each row to show density. Red dots represent higher feature values, while blue dots represent lower feature values. The positive side indicates high confidence genes, while the negative side indicates low confidence genes.
Figure 7. Beeswarm plot for the 14 features that contribute to the nucleotide model. Each dot indicates one value, and the dots pile up in each row to show density. Red dots represent higher feature values, while blue dots represent lower feature values. The positive side indicates high confidence genes, while the negative side indicates low confidence genes.