Johny Ijaq1,2, Girik Malik3,2,4, Anuj Kumar2,5, Partha Sarathi Das2,6, Narendra Meena7, Neeraja Bethi1, Vijayaraghava Seshadri Sundararajan8, Prashanth Suravajhala9,10.
Abstract
BACKGROUND: Hypothetical proteins (HPs) are proteins that are predicted to be expressed in an organism but for which no experimental evidence of existence is available. In the recent past, annotation and curation efforts have helped address the challenge of understanding their diverse functions. Researchers have developed techniques to decipher sequence-structure-function relationships, especially for functional modelling of HPs, but using the resulting features as classifiers for HPs has not been attempted. Alongside the growing number of annotation strategies, next-generation sequencing methods have furthered the understanding of HP functions.
Keywords: Classification features; Functional genomics; Hypothetical proteins; Machine learning
Year: 2019 PMID: 30621574 PMCID: PMC6325861 DOI: 10.1186/s12859-018-2554-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Description of annotation for the three newly introduced features
| Feature | Principle | Scoring criteria | Result |
|---|---|---|---|
| Pseudogenes | It is generally believed that the majority of HPs are the products of pseudogenes. Follow-up of BLAST: if the hits do not have the start codon ATG across six reading frames, the sequence may be assumed to be a pseudogene. | Predicted and synthetic sequences, and sequences with end-to-end alignment, are ignored. Sequences from | Sequences starting without methionine and meeting all the above criteria were given 1, otherwise 0. |
| Homology modelling | As sequence-structure implies function, it is possible to assign a function to an HP if the protein can be modelled to find any interacting domains. | Based on % identity between query and PDB template | If there is more than 30% similarity, score = 1, otherwise 0. |
| Non-coding RNAs | Most of the HPs from GenBank lack protein-coding capacity, and some of them may themselves be noncoding RNAs. | The top three hits are considered for sequences from | If the above criterion is met, score 1, otherwise 0. |
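
The binary scoring rules in the table above can be sketched in Python. This is a minimal illustration, not the authors' pipeline: the function and argument names are hypothetical, "starting without methionine" is read here as "the first residue is not M", and the BLAST filtering steps from the scoring criteria are assumed to have been applied upstream and passed in as a flag. Only the 30% threshold comes from the table.

```python
def pseudogene_score(protein_seq: str, passed_blast_filters: bool) -> int:
    """Return 1 if the sequence looks pseudogene-like, otherwise 0.

    Assumption: `passed_blast_filters` is True only when the upstream
    BLAST criteria from the table are already satisfied.
    """
    if not protein_seq or not passed_blast_filters:
        return 0
    starts_with_met = protein_seq.upper().startswith("M")
    return 0 if starts_with_met else 1


def homology_modelling_score(percent_identity: float,
                             threshold: float = 30.0) -> int:
    """Return 1 if query/template identity exceeds the threshold, otherwise 0."""
    return 1 if percent_identity > threshold else 0


def ncrna_score(criterion_met: bool) -> int:
    """Return 1 if the ncRNA criterion (evaluated elsewhere) is met, otherwise 0."""
    return 1 if criterion_met else 0


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(pseudogene_score("AKVLT", passed_blast_filters=True))
    print(homology_modelling_score(42.5))
    print(ncrna_score(False))
```

Each function returns 0 or 1, so the three scores can be appended directly to the six existing binary features to form a nine-element feature vector per protein.
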
Comparison of accuracies obtained with multiple learning algorithms in WEKA (ver 3.8); adding the three new features increases the accuracy of the model
| Learning algorithms | Accuracy with all 9 features | Average accuracy (per algorithm group) | Accuracy with all 6 features |
|---|---|---|---|
| trees_j48 | 97.00 | 95.85 | 67.57 |
| trees_DecisionStump | 86.33 | | 45.95 |
| trees_RandomForest | 98.00 | | 70.27 |
| trees_REPTree | 98.00 | | 43.24 |
| HoeffdingTree | 96.67 | | Not reported |
| trees_LMT | 98.33 | | 70.27 |
| trees_RandomTree | 96.67 | | 67.57 |
| functions_smo_PolyK | 98.33 | 96.33 | 78.38 |
| functions_smo_RBFK | 93.00 | | 24.32 |
| functions_smo_npolyk | 96.67 | | 59.46 |
| functions_smo_Puk | 97.33 | | Not reported |
| functions_RBFNetwork | 96.67 | 97.11 | 48.65 |
| functions_mlp | 97.67 | | 81.08 |
| functions_VotedPerceptron | 97.00 | | Not reported |
| bayes_nbay | 96.67 | 94.83 | 54.05 |
| bayes_NaiveBayesUpdateable | 96.67 | | 55.21 |
| bayes_NaiveBayesMultinomial | 93.00 | | Not reported |
| bayes_NaiveBayesMultinomialUpdateable | 93.00 | | Not reported |
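
The average accuracy values in the table above appear to be means over each family of algorithms rather than per-row values. The short check below recomputes them from the nine-feature accuracies. The grouping of rows into families is inferred from the row order and is an assumption, as are the group names.

```python
from statistics import mean

# Nine-feature accuracies copied from the table, grouped by algorithm family.
# The family boundaries are an assumption inferred from the row order.
nine_feature_accuracy = {
    "trees": [97.00, 86.33, 98.00, 98.00, 96.67, 98.33, 96.67],
    "functions_smo": [98.33, 93.00, 96.67, 97.33],
    "functions_other": [96.67, 97.67, 97.00],
    "bayes": [96.67, 96.67, 93.00, 93.00],
}

# Average values as printed in the table.
reported_average = {
    "trees": 95.85,
    "functions_smo": 96.33,
    "functions_other": 97.11,
    "bayes": 94.83,
}

group_average = {
    group: mean(values) for group, values in nine_feature_accuracy.items()
}

for group, average in group_average.items():
    difference = abs(average - reported_average[group])
    print(f"{group}: computed {average:.3f}, "
          f"reported {reported_average[group]:.2f}, "
          f"difference {difference:.3f}")
```

Under this grouping, each computed mean agrees with the reported value to within 0.01.
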
Ranking to show the impact of each feature (Rank 1: high impact, Rank 9: low impact)
| Features | functions_smo_npolyk | trees_j48 | bayes_nbay | functions_mlp | rules_NNge |
|---|---|---|---|---|---|
| Pfam | 5 | 5 | 5 | 5 | 5 |
| Orthology | 4 | 4 | 4 | 4 | 4 |
| Prot_interactions | 6 | 6 | 6 | 6 | 9 |
| Bidirectional_best_blast_hits | 7 | 7 | 7 | 7 | 8 |
| Subcellular_location | 7 | 7 | 7 | 9 | 7 |
| Functional_linkages | 2 | 2 | 2 | 2 | 3 |
| Pseudogenes | 3 | 3 | 3 | 3 | 1 |
| Homology modelling | 7 | 7 | 7 | 7 | 6 |
| Non-coding RNAs | 1 | 1 | 1 | 1 | 2 |
Derived accuracies by learning algorithms with default parameters set by WEKA are listed above. Column 1 lists different algorithms
| Algorithms | ALL | | Cfs | | PCA | |
|---|---|---|---|---|---|---|
| | Earlier study [ | Current study | Earlier study [ | Current study | Earlier study [ | Current study |
| Selected features | 1,2,3,4,5,6 | 1,2,3,4,5,6,7,8,9 | 1,2,5,6 | 1,2,3,6,7,9 | 1,2,3,4,5,6 | 1,2,3,4,5,6,7,8 |
| bayes_NaiveBayesUpdateable | 55.21 | 96.67 | 54.05 | 96.67 | 72.97 | 93.00 |
| functions_smo_npolyk | 59.46 | 96.67 | 54.05 | 96.00 | 51.35 | 97.00 |
| rules_DecisionTable | 48.65 | 96.00 | 54.05 | 96.00 | 70.27 | 92.33 |
| functions_mlp | 81.08 | 97.67 | 59.46 | 96.67 | 81.08 | 96.00 |
| bayes_nbay | 54.05 | 96.67 | 54.05 | 96.67 | 72.97 | 93.00 |
| trees_j48 | 67.57 | 97.00 | 51.35 | 96.00 | 72.97 | 97.00 |
| Average (per feature set) | 97.39 | | 96.26 | | 94.53 | |
Column 2 shows accuracies on the entire data set through ten-fold cross-validation. Columns 3 and 4 show accuracies of the different algorithms after applying the feature selection method named in the column header (Cfs: Correlation Feature Selection; PCA: Principal Component Analysis). Cfs uses the BestFirst search method and PCA uses the Ranker method, as set by WEKA.
Subset evaluation. Accuracies by learning algorithms with default parameters set by WEKA and best data subset by combination (Column 3) and Feature selection method (column 5) are listed above
| Algorithms | Best combination Subsets (from complete dataset) | Accuracy | Feature selection subsets | Accuracy |
|---|---|---|---|---|
| bayes_NaiveBayesUpdateable | 1,6,7,9 | 96.67 | Cfs 1,2,3,6,7,9 | 96.67 |
| functions_smo_npolyk | 1,2,4,6,7,9 | 98.00 | PCA 1,2,3,4,5,6,7,8 | 97.00 |
| rules_DecisionTable | 6,7,9 | 96.00 | Cfs 1,2,3,6,7,9 | 96.00 |
| functions_mlp | 1,2,4,6,7,9 | 98.33 | Cfs 1,2,3,6,7,9 | 96.67 |
| bayes_nbay | 1,6,7,9 | 96.67 | Cfs 1,2,3,6,7,9 | 96.67 |
| trees_j48 | 1,2,4,6,9 | 97.67 | PCA 1,2,3,4,5,6,7,8 | 97.00 |
Column 1 lists the different algorithms. Columns 2 and 4 list the best data subsets, and Columns 3 and 5 list the corresponding accuracies. (1: Pfam; 2: Orthology; 3: Prot_interactions; 4: Best Blast hits; 5: Subcellular localization; 6: Functional linkages; 7: HPs linked to Pseudogenes; 8: Homology modelling; 9: HPs linked to ncRNAs). The accuracies of the two kinds of subsets are almost the same, with subset combinations from the complete dataset showing slightly higher accuracy.
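
How the best combination subsets were searched is not described here. One simple possibility, sketched below under that assumption, is an exhaustive search over all non-empty subsets of the nine features (2^9 - 1 = 511 candidates), scoring each with some evaluation function such as cross-validated accuracy. The names `best_subset` and `toy_evaluate` are hypothetical, and the toy scoring function is a stand-in for a real classifier evaluation, not real data.

```python
from itertools import combinations
from typing import Callable, Iterable, Tuple

# Feature numbering taken from the table footnote.
FEATURES = {
    1: "Pfam", 2: "Orthology", 3: "Prot_interactions", 4: "Best Blast hits",
    5: "Subcellular localization", 6: "Functional linkages",
    7: "HPs linked to Pseudogenes", 8: "Homology modelling",
    9: "HPs linked to ncRNAs",
}


def best_subset(feature_ids: Iterable[int],
                evaluate: Callable[[Tuple[int, ...]], float]
                ) -> Tuple[Tuple[int, ...], float]:
    """Score every non-empty feature subset and return the best one."""
    ids = sorted(feature_ids)
    best: Tuple[int, ...] = ()
    best_score = float("-inf")
    for size in range(1, len(ids) + 1):
        for subset in combinations(ids, size):
            score = evaluate(subset)
            if score > best_score:  # strict: ties keep the earlier subset
                best, best_score = subset, score
    return best, best_score


def toy_evaluate(subset: Tuple[int, ...]) -> float:
    """Toy stand-in for cross-validated accuracy (illustration only).

    Rewards two arbitrarily chosen features and penalises subset size.
    """
    return len(set(subset) & {2, 5}) - 0.1 * len(subset)


if __name__ == "__main__":
    subset, score = best_subset(FEATURES, toy_evaluate)
    print(subset, [FEATURES[i] for i in subset])
```

In practice `evaluate` would train and cross-validate a classifier on the selected columns, which makes the exhaustive search feasible only because the number of features is small.
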
Data for the nine-point schema were run through the learning algorithms, and the scoring metrics were derived, averaged and tabulated. Values are compared with the six-point performance metrics
| Algorithm | Sensitivity/Recall (%) | | Specificity (%) | | Precision (%) | | F1 Score (%) | | MCC (%) | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Six point | Nine point | Six point | Nine point | Six point | Nine point | Six point | Nine point | Six point | Nine point |
| Decision Tree (j48) | 37 | 38 | 90 | 93 | 17 | 85 | 23 | 41 | 16 | 54 |
| SVM (functions_smo_npolyk) | 36 | 37 | 89 | 93 | 16 | 57 | 22 | 41 | 15 | 36 |
| Neural networks (functions_mlp) | 36 | 38 | 89 | 92 | 16 | 80 | 22 | 43 | 15 | 53 |
| Naïve Bayes (bayes_NaiveBayesUpdateable) | 37 | 37 | 89 | 93 | 16 | 81 | 22 | 40 | 17 | 53 |
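
For reference, the five metrics in the table above have standard definitions in terms of confusion-matrix counts. The sketch below computes them for a single binary confusion matrix and reports percentages. The function name and the example counts are hypothetical, and the way the values in the table were averaged is not described here, so this does not reproduce the table.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return sensitivity, specificity, precision, F1 and MCC in percent.

    Any ratio with a zero denominator is reported as 0.0 by convention.
    """
    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall, true positive rate
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    mcc = ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {
        "sensitivity": 100 * sensitivity,
        "specificity": 100 * specificity,
        "precision": 100 * precision,
        "f1": 100 * f1,
        "mcc": 100 * mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.2f}")
```

Unlike the other four metrics, MCC ranges from -100% to 100% on this scale, with 0 indicating performance no better than chance.
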
Fig. 1 Methodology adopted to generate the classification model
Fig. 2 Workflow to annotate HPs across each classifier (Details in Additional file 2: Figure S1)