M Stanley Fujimoto, Anton Suvorov, Nicholas O Jensen, Mark J Clement, Seth M Bybee.
Abstract
BACKGROUND: Accurate detection of homologous relationships among biological sequences (DNA or amino acid) from different organisms is an important and often difficult task that is essential to various evolutionary studies, ranging from building phylogenies to predicting functional gene annotations. Many heuristic tools exist, most commonly based on bidirectional BLAST searches, that identify homologous genes and sort them into two fundamentally distinct classes: orthologs and paralogs. Because these methods rely only on heuristic filtering with significance score cutoffs and have no cluster post-processing tools available, they often produce multiple clusters that consist of unrelated (non-homologous) sequences. Sequence data extracted from incomplete genome/transcriptome assemblies, whether originating from low-coverage sequencing or produced by de novo assembly without a reference genome, are therefore susceptible to high false-positive rates of homology detection.
Year: 2016 PMID: 26911862 PMCID: PMC4765110 DOI: 10.1186/s12859-016-0955-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
All features used to train the machine learning algorithms. Each feature was calculated for every cluster.
| Feature | Description |
|---|---|
| Aliscore | The number of positions identified by Aliscore as randomly aligned |
| Length | The length of the alignment |
| # of Sequences | The number of sequences in the alignment |
| # of Gaps | Number of base positions marked with a gap |
| # of Amino Acids | Number of amino acids in the alignment |
| Range | Longest non-aligned sequence length minus shortest non-aligned sequence length |
| Amino Acid Charged | Standard deviation for the proportions of amino acids in the charged class for each sequence |
| Amino Acid Uncharged | Standard deviation for the proportions of amino acids in the uncharged class for each sequence |
| Amino Acid Special | Standard deviation for the proportions of amino acids in the non-charged and non-hydrophobic class for each sequence |
| Amino Acid Hydrophobic | Standard deviation for the proportions of amino acids in the hydrophobic class for each sequence |
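Several of the features in the table above can be computed directly from a cluster's alignment. The following is a minimal Python sketch, not the authors' implementation. It makes several assumptions: the residue sets `CHARGED` and `HYDROPHOBIC` are illustrative (the source names the classes but does not list their members), gaps are counted as the total number of `-` characters, and the standard deviation is the population form. The Aliscore feature is omitted because it comes from an external tool.

```python
from statistics import pstdev

# ASSUMPTION: class membership is illustrative; the source names the classes
# but does not list which residues belong to each.
CHARGED = set("DEKRH")
HYDROPHOBIC = set("AVILMFWY")
GAP = "-"


def class_proportion_sd(seqs, residues):
    """Standard deviation, across sequences, of the share of residues in a class."""
    props = []
    for s in seqs:
        ungapped = s.replace(GAP, "")
        share = sum(c in residues for c in ungapped) / len(ungapped) if ungapped else 0.0
        props.append(share)
    return pstdev(props)  # ASSUMPTION: population SD; the source does not specify


def alignment_features(seqs):
    """Compute per-cluster features from a list of aligned, equal-length sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must share one length")
    ungapped_lengths = [len(s) - s.count(GAP) for s in seqs]
    return {
        "length": len(seqs[0]),                      # alignment length
        "n_sequences": len(seqs),
        "n_gaps": sum(s.count(GAP) for s in seqs),   # ASSUMPTION: total gap characters
        "n_amino_acids": sum(ungapped_lengths),
        "range": max(ungapped_lengths) - min(ungapped_lengths),
        "aa_charged_sd": class_proportion_sd(seqs, CHARGED),
        "aa_hydrophobic_sd": class_proportion_sd(seqs, HYDROPHOBIC),
    }


if __name__ == "__main__":
    # Toy three-sequence alignment, invented for illustration only.
    print(alignment_features(["MKV-LD", "MK--LD", "MRVALD"]))
```

Each resulting dictionary would correspond to one row of the training matrix, with the cluster's class label supplied separately.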
The machine learning parameters used for each of the different algorithms in WEKA
| Algorithm | Parameters |
|---|---|
| Neural Network | weka.classifiers.functions.MultilayerPerceptron -L 0.1 -M 0.05 -N 3000 -V 0 -S 0 -E 40 -H a |
| Support Vector Machine (SVM) | weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K “weka.classifiers.functions.supportVector.PolyKernel -C
| Random Forest | weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1 |
| Naive Bayes | weka.classifiers.bayes.NaiveBayes |
| Logistic Regression | weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 |
| Meta-Classifier w/o Logistic Regression | weka.classifiers.meta.Stacking -X 10 -M “weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a” -S 1 -B “weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1” -B “weka.classifiers.bayes.NaiveBayes” -B “weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K “weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0”” |
| Meta-Classifier w/ Logistic Regression | weka.classifiers.meta.Stacking -X 10 -M “weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a” -S 1 -B “weka.classifiers.functions.Logistic -R 1.0E-8 -M -1” -B “weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a” -B “weka.classifiers.trees.RandomForest -I 10 -K 0 -S 1” -B “weka.classifiers.bayes.NaiveBayes” -B “weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K “weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0”” |
Fig. 1 A diagram of the workflow. This figure shows the different steps that were used in developing our machine learning model. Arthropod phylogeny was generated in previous studies and deposited in OrthoDB. These sequences were then gathered from OrthoDB and used as our orthology and paralogy clusters. They were combined with generated non-homology clusters. The combination represents our training data set used to train the machine learning algorithms. The experimental data were assembled with proteins inferred from the assemblies. InParanoid was then used to identify putative homologs. Once putative homologs were identified, they were input into the trained machine learning algorithms for classification and subsequent cluster trimming.
Summary of arthropod machine learning model performance
| Algorithm | OrthoDB Arthropod EQUAL: Validation | OrthoDB Arthropod EQUAL: Testing | OrthoDB Arthropod PROP: Validation | OrthoDB Arthropod PROP: Testing |
|---|---|---|---|---|
| Neural Network | 97.1815 % | 96.8153 % | 97.5452 % | 96.5423 % |
| Support Vector Machine (SVM) | 89.1351 % | 88.0801 % | 88.0668 % | 88.2621 % |
| Random Forest | 98.1362 % | 95.9054 % | 97.8748 % | 95.5414 % |
| Naive Bayes | 53.0628 % | 52.5023 % | 61.2229 % | 60.3276 % |
| Logistic Regression | 96.5905 % | 97.2702 % | 96.3064 % | 96.3603 % |
| Meta-Classifier w/o Logistic Regression | 98.5112 % | 98.3621 % | 98.5907 % | 96.8153 % |
| Meta-Classifier w/ Logistic Regression | 98.6362 % | 97.7252 % | 98.5680 % | 97.5432 % |
This table shows the performance of each of the different learning algorithms that were trained, validated, and tested with the OrthoDB arthropod gene clusters
Fig. 2 Bootstrapping results for the machine learning models. Bootstrapping was conducted using 100 replicates for each classifier. Error envelopes are shown for each classifier. For every classifier except Naive Bayes, accuracy increases and the error envelope narrows as the percentage of total training instances used during learning increases.
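The bootstrapping procedure behind such a curve can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the one-feature threshold classifier is a toy stand-in for the WEKA models, the function names are hypothetical, and resampling with replacement at each training fraction is an assumption about the scheme. Only the 100 replicates per setting come from the caption.

```python
import random
from statistics import mean, pstdev


def fit_threshold(train):
    """Toy stand-in classifier: cut halfway between the two class means.

    Assumes label 1 tends to have larger feature values than label 0.
    """
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg:  # degenerate resample with one class only
        return 0.5
    return (mean(pos) + mean(neg)) / 2


def accuracy(threshold, data):
    """Fraction of (feature, label) pairs classified correctly."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)


def learning_curve(train, test, fractions, n_replicates=100, seed=0):
    """Mean accuracy and spread (error envelope) for each training fraction."""
    rng = random.Random(seed)
    curve = {}
    for frac in fractions:
        k = max(2, round(frac * len(train)))
        scores = []
        for _ in range(n_replicates):
            # ASSUMPTION: each replicate resamples k training instances with replacement
            sample = [rng.choice(train) for _ in range(k)]
            scores.append(accuracy(fit_threshold(sample), test))
        curve[frac] = (mean(scores), pstdev(scores))
    return curve
```

Plotting the mean against the training fraction, with the standard deviation as a band around it, would give a curve of the kind the caption describes.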
Fig. 3 Accuracy curves for individual features (EQUAL training data set) using the meta-classifier w/ logistic regression. The number of gaps, amino acid composition, and number of amino acids features exhibit better predictive accuracy.
Summary of InParanoid and HaMStR cluster filtering
| Data set | Tool | Kept | Removed |
|---|---|---|---|
| Odonata | InParanoid | 10500 | 3497 |
| Odonata | HaMStR | 1231 | 896 |
The number of clusters that were kept and removed for the OD_S clusters from InParanoid and HaMStR. Filtering was accomplished using the meta-classifier w/ logistic regression model trained on the EQUAL data set
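To put the counts above in proportion, the share of clusters removed by filtering can be worked out directly from the table. The short Python sketch below uses only the reported kept and removed counts; the function name `removed_fraction` is introduced here for illustration. It comes out at roughly a quarter of the InParanoid clusters and about 42 % of the HaMStR clusters.

```python
# Kept and removed cluster counts as reported in the table above.
counts = {
    "InParanoid": (10500, 3497),
    "HaMStR": (1231, 896),
}


def removed_fraction(kept, removed):
    """Share of all clusters that the classifier filtered out."""
    return removed / (kept + removed)


if __name__ == "__main__":
    for tool, (kept, removed) in counts.items():
        total = kept + removed
        print(f"{tool}: {removed_fraction(kept, removed):.1%} of {total} clusters removed")
```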
Fig. 4 Examples of a high-quality homology cluster (a) and a false-positive homology cluster (b) from the OD_S data set, as classified by the meta-classifier w/ logistic regression. All sequences within the homology cluster (a) belong to one protein family (FAM81A1-like protein). In the false-positive homology cluster (b), the sequence indicated by the arrow represents Aprataxin and PNK-like factor, whereas the other sequences represent tyrosyl-DNA phosphodiesterase.