Deborah Galpert, Sara Del Río, Francisco Herrera, Evys Ancede-Gallardo, Agostinho Antunes, Guillermin Agüero-Chapin.
Abstract
Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity measures (alignment scores, sequence length, gene membership to conserved regions, and physicochemical profiles) is combined in a supervised pairwise ortholog detection approach to improve effectiveness, considering the low ratio of orthologs to the possible pairwise comparisons between two genomes. In this scenario, big data supervised classifiers that manage the imbalance between the ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other genome pairs. The supervised approach was compared with the RBH, RSD, and OMA algorithms using the following yeast genome pairs as benchmark datasets: Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe. Because of the large amount of imbalanced data, building and testing the supervised model were only possible by using big data supervised classifiers that manage imbalance. Evaluation metrics that take low ortholog ratios into account were applied. In terms of effectiveness, MapReduce Random Oversampling combined with Spark SVM outperformed RBH, RSD, and OMA, probably because it considers gene pair features beyond alignment similarities and benefits from advances in big data supervised classification.
Year: 2015 PMID: 26605337 PMCID: PMC4641943 DOI: 10.1155/2015/748681
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Gene pair features.
| Measure | Definition | Parameters |
|---|---|---|
| Local and global alignment | | |
| Length | | |
| Membership to locally collinear blocks | | Mauve software parameters |
| Physicochemical profile | | |
Big data framework, applications, and algorithms.
| Big data framework | Application | Algorithms |
|---|---|---|
| Hadoop 2.0.0 (Cloudera CDH4.7.1) with the head node configured as name-node and job-tracker, and the rest as data-nodes and task-trackers | (i) MapReduce ROS implementation | RF-BDCS |
| Apache Spark 1.0.0 with the head node configured as master and name-node, and the rest as workers and data-nodes | Apache Spark Support Vector Machines (MLlib) | ROS (100%) + SVM-BD |
Figure 1. Workflow of the evaluation of supervised versus unsupervised pairwise ortholog detection (POD) algorithms.
S. cerevisiae-K. lactis datasets.
| Datasets | #Ex. | #Atts. | Class | #Class | %Class | IR |
|---|---|---|---|---|---|---|
| Blosum50 | 22,649,328 | 6 | (0; 1) | (22,646,914; 2,414) | (99.989; 0.011) | 9381.489 |
| Blosum621 | 22,649,328 | 6 | (0; 1) | (22,646,914; 2,414) | (99.989; 0.011) | 9381.489 |
| Blosum622 | 22,649,328 | 6 | (0; 1) | (22,646,914; 2,414) | (99.989; 0.011) | 9381.489 |
| Pam250 | 22,649,328 | 6 | (0; 1) | (22,646,914; 2,414) | (99.989; 0.011) | 9381.489 |
S. cerevisiae-C. glabrata datasets.
| Datasets | #Ex. | #Atts. | Class | #Class | %Class | IR |
|---|---|---|---|---|---|---|
| Blosum50 | 29,887,416 | 6 | (0; 1) | (29,884,575; 2,841) | (99.99; 0.01) | 10519.034 |
| Blosum621 | 29,887,416 | 6 | (0; 1) | (29,884,575; 2,841) | (99.99; 0.01) | 10519.034 |
| Blosum622 | 29,887,416 | 6 | (0; 1) | (29,884,575; 2,841) | (99.99; 0.01) | 10519.034 |
| Pam250 | 29,887,416 | 6 | (0; 1) | (29,884,575; 2,841) | (99.99; 0.01) | 10519.034 |
S. cerevisiae-S. pombe datasets.
| Datasets | #Ex. | #Atts. | Class | #Class | %Class | IR |
|---|---|---|---|---|---|---|
| Blosum50 | 8,095,907 | 6 | (0; 1) | (8,090,950; 4,957) | (99.939; 0.061) | 1632.227 |
| Blosum621 | 8,095,907 | 6 | (0; 1) | (8,090,950; 4,957) | (99.939; 0.061) | 1632.227 |
| Blosum622 | 8,095,907 | 6 | (0; 1) | (8,090,950; 4,957) | (99.939; 0.061) | 1632.227 |
| Pam250 | 8,095,907 | 6 | (0; 1) | (8,090,950; 4,957) | (99.939; 0.061) | 1632.227 |
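The IR column in the three dataset tables can be checked directly from the class counts. The sketch below assumes IR is the usual imbalance ratio (majority class size divided by minority class size), which is consistent with the tabulated values; the function and variable names are illustrative and do not come from the paper.

```python
def imbalance_ratio(n_majority: int, n_minority: int) -> float:
    """Majority-to-minority class ratio (assumed definition of IR)."""
    if n_minority <= 0:
        raise ValueError("minority class must be non-empty")
    return n_majority / n_minority


# (nonortholog pairs, ortholog pairs) copied from the #Class column.
class_counts = {
    "S. cerevisiae-K. lactis": (22_646_914, 2_414),
    "S. cerevisiae-C. glabrata": (29_884_575, 2_841),
    "S. cerevisiae-S. pombe": (8_090_950, 4_957),
}


def summarize(counts):
    """Recompute #Ex., the ortholog percentage, and IR for each genome pair."""
    rows = {}
    for pair, (neg, pos) in counts.items():
        total = neg + pos
        rows[pair] = {
            "examples": total,
            "pct_ortholog": round(100 * pos / total, 3),
            "ir": round(imbalance_ratio(neg, pos), 3),
        }
    return rows


summary = summarize(class_counts)
for pair, row in summary.items():
    print(pair, row)
```

Fewer than 0.1% of the gene pairs in any of the three genome pairs are orthologs, which is why plain accuracy is uninformative for these datasets.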
Combination of alignment parameter settings on the datasets.
| Dataset | Substitution matrix | Gap open | Gap extended |
|---|---|---|---|
| Blosum50 | Blosum50 | 15 | 8 |
| Blosum621 | Blosum62 | 8 | 7 |
| Blosum622 | Blosum62 | 12 | 6 |
| Pam250 | Pam250 | 10 | 8 |
Supervised algorithms and parameter values in the experiments.
| Algorithm | Parameter values |
|---|---|
| RF-BD¹ | Number of trees: 100 |
| RF-BDCS | Number of trees: 100 |
| ROS (100%) + RF-BD | RS³ = 100% |
| ROS (130%) + RF-BD | RS = 130% |
| SVM-BD | Regularization parameter: |
| ROS (100%) + SVM-BD | RS = 100% |
| ROS (130%) + SVM-BD | RS = 130% |

¹ BD: big data.
² int(log₂ N + 1), where N is the number of attributes of the dataset.
³ RS: resampling size.
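To make the resampling size (RS) parameter concrete, here is a minimal in-memory sketch of random oversampling. It assumes that RS is the target size of the minority class expressed as a percentage of the majority class, so RS = 100% balances the two classes; this reading is an assumption, and the sketch is not the MapReduce ROS implementation used in the experiments. The function name and the toy data are hypothetical.

```python
import random


def random_oversample(majority, minority, rs_percent=100, seed=0):
    """Replicate minority examples at random (with replacement) until the
    minority class holds rs_percent % as many examples as the majority class.

    Assumed interpretation of RS; a single-machine illustration only.
    """
    if not minority:
        raise ValueError("minority class must be non-empty")
    rng = random.Random(seed)
    target = len(majority) * rs_percent // 100
    n_extra = max(0, target - len(minority))
    # Original minority examples are kept; only random copies are added.
    resampled = list(minority) + [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), resampled


# Toy data: 1000 nonortholog pairs and 10 ortholog pairs (hypothetical IDs).
nonorthologs = [("neg", i) for i in range(1000)]
orthologs = [("pos", i) for i in range(10)]

_, balanced = random_oversample(nonorthologs, orthologs, rs_percent=100)
_, boosted = random_oversample(nonorthologs, orthologs, rs_percent=130)
print(len(balanced), len(boosted))
```

Under this reading, RS = 130% leaves the ortholog class larger than the nonortholog class after resampling.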
Unsupervised algorithms and parameter values in the experiments.
| Algorithm | Parameter values | Implementation |
|---|---|---|
| RBH | Soft filter and Smith-Waterman alignment | BLASTp program¹ |
| RSD | | BLASTp program¹ |
| OMA | Default parameter values | OMA stand-alone³ |

¹ Available at http://www.ncbi.nlm.nih.gov/BLAST/.
² Available at https://pypi.python.org/pypi/reciprocal_smallest_distance/1.1.4/.
³ Available at http://omabrowser.org/standalone/OMA.0.99z.3.tgz.
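For readers unfamiliar with the RBH baseline, the reciprocal best hit criterion can be sketched as follows. The sketch assumes that pairwise alignment scores have already been computed in both directions (for example, parsed from BLASTp output); it does not run BLASTp or reproduce the soft filter and Smith-Waterman settings listed above. All gene identifiers and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    scores: dict mapping (query, subject) -> alignment score.
    On ties, the first pair encountered is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores for genome A searched against genome B and vice versa.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 180, ("a3", "b2"): 200}
b_vs_a = {("b1", "a1"): 245, ("b2", "a3"): 198, ("b2", "a2"): 175}

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)
```

Here a2 is left unpaired because its best hit, b2, prefers a3. Because the criterion depends on alignment scores alone, it cannot use the additional gene pair features that the supervised approach combines.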
Geometric mean results of the best supervised classifiers in each dataset.
| Dataset | ROS (RS: 100%) + RF-BD (Scer-Klac) | ROS (RS: 130%) + RF-BD (Scer-Klac) | RF-BDCS (Scer-Klac) | ROS (RS: 100%) + RF-BD (Scer-Cgla) | ROS (RS: 130%) + RF-BD (Scer-Cgla) | RF-BDCS (Scer-Cgla) | ROS (RS: 100%) + SVM-BD (regParam: 1.0) | ROS (RS: 100%) + SVM-BD (regParam: 0.5) |
|---|---|---|---|---|---|---|---|---|
| Blosum50 | 0.9818 | 0.9818 | | 0.9889 | 0.9885 | | 0.8393 | |
| Blosum621 | 0.9801 | 0.9818 | | 0.9891 | 0.9903 | | 0.8707 | |
| Blosum622 | 0.9793 | 0.9793 | | 0.9910 | 0.9910 | | 0.8536 | |
| Pam250 | 0.9818 | 0.9818 | | 0.9912 | 0.9905 | | 0.8495 | |
AUC and G-Mean results of supervised classifiers in Experiments 1 and 2.
| Algorithm | AUC | G-Mean | AUC | G-Mean | AUC | G-Mean |
|---|---|---|---|---|---|---|
| RF-BD | 0.6979 | 0.6291 | 0.7455 | 0.7005 | 0.5172 | 0.1851 |
| ROS (RS: 100%) + RF-BD | 0.9809 | 0.9807 | 0.9901 | 0.9900 | 0.6096 | 0.4527 |
| ROS (RS: 130%) + RF-BD | 0.9813 | 0.9812 | 0.9901 | 0.9901 | 0.6121 | 0.4581 |
| RF-BDCS | | | | | 0.7294 | 0.6745 |
| ROS (RS: 100%) + SVM-BD (regParam: 1.0) | 0.9477 | 0.9477 | 0.9542 | 0.9542 | 0.8632 | 0.8533 |
| ROS (RS: 100%) + SVM-BD (regParam: 0.5) | 0.8845 | 0.8791 | 0.9540 | 0.9539 | | |
| ROS (RS: 100%) + SVM-BD (regParam: 0.0) | 0.6135 | 0.4961 | 0.9432 | 0.9431 | 0.6135 | 0.4961 |
| ROS (RS: 130%) + SVM-BD (regParam: 1.0) | 0.8164 | 0.7956 | 0.9523 | 0.9522 | 0.8164 | 0.7956 |
| ROS (RS: 130%) + SVM-BD (regParam: 0.5) | 0.8629 | 0.8528 | 0.9539 | 0.9539 | 0.8629 | 0.8528 |
| ROS (RS: 130%) + SVM-BD (regParam: 0.0) | 0.6248 | 0.5147 | 0.9429 | 0.9428 | 0.6248 | 0.5147 |
Figure 2. Average true positive and true negative rate values of supervised classifiers obtained in Experiments 1 and 2.
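Both reported metrics can be derived from the true positive rate (TPR) and true negative rate (TNR). The sketch below assumes the definitions commonly used for crisp classifiers on imbalanced data, G-Mean = sqrt(TPR × TNR) and AUC = (1 + TPR − FPR) / 2; the page does not state the formulas, so treat these as assumptions. The confusion-matrix counts are hypothetical.

```python
import math


def rates(tp, fn, tn, fp):
    """True positive rate and true negative rate from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)


def g_mean(tpr, tnr):
    """Geometric mean of the per-class accuracies."""
    return math.sqrt(tpr * tnr)


def auc_single_point(tpr, tnr):
    """AUC of a crisp classifier, i.e. a single ROC point: (1 + TPR - FPR) / 2."""
    fpr = 1.0 - tnr
    return (1.0 + tpr - fpr) / 2.0


# A classifier that labels every pair as nonortholog (hypothetical counts).
tpr, tnr = rates(tp=0, fn=100, tn=10_000, fp=0)
accuracy = 10_000 / 10_100
trivial = {"accuracy": accuracy, "g_mean": g_mean(tpr, tnr), "auc": auc_single_point(tpr, tnr)}
print(trivial)
```

The trivial classifier reaches about 99% accuracy yet scores a G-Mean of 0 and an AUC of 0.5, which illustrates why metrics that account for low ortholog ratios are preferred here.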
Run time results in seconds of the big data solutions in Experiments 1 and 2.
| Datasets | | | |
|---|---|---|---|
| RF-BD | 1201.59 | 2174.90 | 2060.99 |
| ROS (RS: 100%) + RF-BD | 2983.75 | 4562.38 | 4440.03 |
| ROS (RS: 130%) + RF-BD | 3345.04 | 4805.50 | 4681.51 |
| RF-BDCS | 1302.41 | 2362.04 | 2025.15 |
| SVM-BD | | | |
| ROS (RS: 100%) + SVM-BD (regParam: 1.0) | | | |
| ROS (RS: 100%) + SVM-BD (regParam: 0.5) | | | |
| ROS (RS: 100%) + SVM-BD (regParam: 0.0) | | | |
| ROS (RS: 130%) + SVM-BD (regParam: 1.0) | 927.14 | 1079.19 | 1079.58 |
| ROS (RS: 130%) + SVM-BD (regParam: 0.5) | 929.17 | 1084.19 | 1076.33 |
| ROS (RS: 130%) + SVM-BD (regParam: 0.0) | 924.42 | 1076.37 | 1077.21 |
AUC and G-Mean results of the unsupervised and the best supervised classifiers in Experiments 1 and 2.
| Algorithm | AUC | G-Mean | AUC | G-Mean | AUC | G-Mean |
|---|---|---|---|---|---|---|
| RBH | 0.1497 | 0.0062 | 0.8196 | 0.7995 | 0.4697 | 0.4525 |
| RSD 0.2 1 | 0.5862 | 0.4862 | 0.9238 | 0.9206 | 0.4874 | 0.4438 |
| RSD 0.5 1 | 0.5926 | 0.4643 | 0.9340 | 0.9316 | 0.4980 | 0.4063 |
| RSD 0.8 1 | 0.5886 | 0.4518 | 0.9382 | 0.9362 | 0.5009 | 0.3899 |
| OMA | 0.5765 | 0.4904 | 0.9287 | 0.9259 | 0.5151 | 0.4644 |
| RF-BDCS | | | | | 0.7294 | 0.6745 |
| ROS (RS: 100%) + SVM-BD (regParam: 1.0) | 0.9477 | 0.9477 | 0.9542 | 0.9542 | 0.8632 | 0.8533 |
| ROS (RS: 100%) + SVM-BD (regParam: 0.5) | 0.8845 | 0.8791 | 0.9540 | 0.9539 | | |
Figure 3. Average true positive and true negative rate values of the unsupervised and the best supervised classifiers in Experiments 1 and 2.