Kai Battenberg, Ernest K Lee, Joanna C Chiu, Alison M Berry, Daniel Potter.
Abstract
BACKGROUND: Identifying orthologous genes is an initial step required for phylogenetics, and it is also a common strategy employed in functional genetics to find candidates for functionally equivalent genes across multiple species. At the same time, in silico orthology prediction tools often require large computational resources only available on computing clusters. Here we present OrthoReD, an open-source orthology prediction tool with accuracy comparable to published tools that requires only a desktop computer. The low computational resource requirement of OrthoReD is achieved by repeating orthology searches on one gene of interest at a time, thereby generating a reduced dataset to limit the scope of orthology search for each gene of interest.
Keywords: Gene evolution; Gene orthology; Genome; Phylogenetics; Transcriptome
Year: 2017 PMID: 28633662 PMCID: PMC5479036 DOI: 10.1186/s12859-017-1726-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 OrthoReD overview. To determine the orthology of the gene of interest, the gene of interest is used as a query for a BLASTP search against the dataset (step 1). The BLASTP hits are screened to generate a reduced dataset (step 2). All-v-all BLASTP is conducted on the reduced dataset (step 3) to generate a pairwise similarity matrix used by MCL to separate the reduced dataset into clusters (step 4). The most likely phylogeny is reconstructed for the members within the cluster of interest (step 5), and long branches are subsequently removed from the tree (step 6). Finally, all members of the clade that share the most recent gene duplication event are returned as predicted orthologs (step 7).
Dataset information
| | Fly | Plant | Actino |
|---|---|---|---|
| Number of taxa | 13 | 11 | 100 |
| List of taxa | | | See supplemental Table S1 |
| Range of taxa | Genus | Rosids | Phylum (Actinobacteria) |
| Selected outgroup | | | |
| Number of sequences | 194,469 | 532,305 | 444,382 |
| Number of AA residues | 92,515,839 | 215,684,745 | 146,754,746 |
| Average sequence length | 476 | 405 | 330 |
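The average sequence length row follows from the two rows above it: total residues divided by number of sequences. A small check, using only the figures in the table (the variable names are illustrative):

```python
# Totals copied from the dataset table: (number of sequences, number of AA residues).
datasets = {
    "Fly": (194_469, 92_515_839),
    "Plant": (532_305, 215_684_745),
    "Actino": (444_382, 146_754_746),
}

# Average sequence length = total residues / number of sequences,
# rounded to the nearest whole residue.
average_length = {
    name: round(residues / sequences)
    for name, (sequences, residues) in datasets.items()
}

for name, length in average_length.items():
    print(f"{name}: {length}")
```

The rounded values agree with the table (476, 405, and 330).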
Parameter settings for each OrthoReD execution on each dataset
| Condition | Dataset | Similarity search | n* | MAFFT options |
|---|---|---|---|---|
| ReD_s.aln | FLY | NCBI | 4 | --localpair --retree 2 --maxiterate 1000 |
| ReD_f.aln | FLY | NCBI | 4 | --6merpair --retree 2 --maxiterate 1000 |
| ReD_AB | PLANT | AB | 5 | --6merpair --retree 2 --maxiterate 1000 |
| ReD_SW | PLANT | SWIPE | 5 | --6merpair --retree 2 --maxiterate 1000 |
| ReD_n5 | PLANT | NCBI | 5 | --6merpair --retree 2 --maxiterate 1000 |
| ReD_n6 | PLANT | NCBI | 6 | --6merpair --retree 2 --maxiterate 1000 |
| ReD_n7 | PLANT | NCBI | 7 | --6merpair --retree 2 --maxiterate 1000 |
| ReD_n8 | PLANT | NCBI | 8 | --6merpair --retree 2 --maxiterate 1000 |
| ReD_n9 | PLANT | NCBI | 9 | --6merpair --retree 2 --maxiterate 1000 |
| ReD_n10 | PLANT | NCBI | 10 | --6merpair --retree 2 --maxiterate 1000 |
| ReD_n4 | ACTINO | NCBI | 4 | --6merpair --retree 2 --maxiterate 1000 |
*n: Maximum number of genes per species passed on after the initial similarity search
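The role of n can be illustrated with a minimal sketch of a per-species cap on similarity-search hits. This is not OrthoReD's code: the function name `screen_hits`, the `(gene_id, species, evalue)` tuple layout, and the choice to rank hits by ascending e-value are all assumptions made for illustration.

```python
from collections import defaultdict


def screen_hits(hits, n):
    """Keep at most n hits per species, preferring the lowest e-values.

    hits: iterable of (gene_id, species, evalue) tuples (hypothetical layout).
    """
    by_species = defaultdict(list)
    for gene_id, species, evalue in hits:
        by_species[species].append((evalue, gene_id))

    kept = []
    for species, ranked in by_species.items():
        ranked.sort()  # ascending e-value; ties are broken by gene_id
        kept.extend((gene_id, species, evalue) for evalue, gene_id in ranked[:n])
    return kept


# Toy hits, invented for this example.
hits = [
    ("a1", "sp_A", 1e-50),
    ("a2", "sp_A", 1e-20),
    ("a3", "sp_A", 1e-5),
    ("b1", "sp_B", 1e-30),
]
reduced = screen_hits(hits, n=2)
```

With n=2, the weakest hit from `sp_A` is dropped and the single hit from `sp_B` is kept, so a larger n passes more genes per species to the later steps.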
Total count of predicted orthologs and distribution of predicted orthologs at each e-value rank under different conditions of ortholog prediction
| Dataset | Fly | Fly | Fly | Fly | Plant | Plant | Plant | Plant | Plant | Plant | Plant | Plant | Plant | Actino |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Condition | ReD_s.aln | ReD_f.aln | OID | ODB | ReD_AB | ReD_SW | ReD_n5 | ReD_n6 | ReD_n7 | ReD_n8 | ReD_n9 | ReD_n10 | OID | ReD_n4 |
| Total count | 156,223 | 157,105 | 142,437 | 219,715 | 489,574 | 424,334 | 514,093 | 519,680 | 524,341 | 531,003 | 534,538 | 537,454 | 295,966 | 169,860 |
| 1 | 94.0% | 94.0% | 95.4% | 79.3% | 76.6% | 69.6% | 75.8% | 74.5% | 73.4% | 72.7% | 72.2% | 71.6% | 66.6% | 92.8% |
| 2 | 3.9% | 3.9% | 2.7% | 7.2% | 10.7% | 14.5% | 10.7% | 10.5% | 10.3% | 10.1% | 10.0% | 9.9% | 8.7% | 5.1% |
| 3 | 1.3% | 1.3% | 0.7% | 3.4% | 6.3% | 7.8% | 6.4% | 6.2% | 6.0% | 5.9% | 5.8% | 5.8% | 4.6% | 1.4% |
| 4 | 0.8% | 0.8% | 0.3% | 1.9% | 3.9% | 4.9% | 4.2% | 4.0% | 3.9% | 3.8% | 3.7% | 3.7% | 2.8% | 0.7% |
| 5 | 0.0% | 0.0% | 0.2% | 1.3% | 2.5% | 3.3% | 2.9% | 2.8% | 2.7% | 2.6% | 2.5% | 2.5% | 1.8% | 0.0% |
| 6 | 0.0% | 0.0% | 0.1% | 0.9% | 0.0% | 0.0% | 0.0% | 2.1% | 2.0% | 1.9% | 1.9% | 1.9% | 1.3% | 0.0% |
| 7 | 0.0% | 0.0% | 0.1% | 0.8% | 0.0% | 0.0% | 0.0% | 0.0% | 1.6% | 1.6% | 1.6% | 1.5% | 1.0% | 0.0% |
| 8 | 0.0% | 0.0% | 0.1% | 0.6% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1.3% | 1.3% | 1.2% | 0.8% | 0.0% |
| 9 | 0.0% | 0.0% | 0.1% | 0.5% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 1.0% | 1.0% | 0.7% | 0.0% |
| 10 | 0.0% | 0.0% | 0.1% | 0.4% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.9% | 0.6% | 0.0% |
| 11+ | 0.0% | 0.0% | 0.4% | 3.6% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 11.0% | 0.0% |
Comparisons between outputs generated by different conditions of OrthoReD, OrthologID, and OrthoDB
| Database | FLY | FLY | FLY | FLY | FLY | FLY | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT | PLANT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Conditions Compared | ReD_s.aln | ReD_s.aln | ReD_s.aln | ReD_f.aln | ReD_f.aln | OID | ReD_AB | ReD_SW | ReD_n5 | ReD_n6 | ReD_n7 | ReD_n8 | ReD_n9 | ReD_n10 | ReD_n5 | ReD_n5 | ReD_AB | ReD_n5 | ReD_n6 | ReD_n7 | ReD_n8 | ReD_n9 |
| | ReD_f.aln | OID | ODB | OID | ODB | ODB | OID | OID | OID | OID | OID | OID | OID | OID | ReD_AB | ReD_SW | ReD_SW | ReD_n6 | ReD_n7 | ReD_n8 | ReD_n9 | ReD_n10 |
| average % id | 96.4% | 88.1% | 81.2% | 88.2% | 81.4% | 80.5% | 58.7% | 59.8% | 60.5% | 61.1% | 61.2% | 61.3% | 61.6% | 61.8% | 77.9% | 72.2% | 71.0% | 82.3% | 85.8% | 85.7% | 86.1% | 87.3% |
| % 100% id | 89.5% | 68.1% | 55.1% | 68.3% | 55.1% | 56.2% | 33.9% | 36.1% | 35.4% | 36.4% | 36.8% | 36.9% | 37.4% | 37.7% | 58.1% | 53.2% | 51.3% | 62.0% | 69.0% | 70.0% | 71.0% | 80.6% |
| % ≥90% id | 92.2% | 76.7% | 66.7% | 76.8% | 67.0% | 67.1% | 41.2% | 42.6% | 43.3% | 44.4% | 44.8% | 44.9% | 45.5% | 45.9% | 65.4% | 59.3% | 57.8% | 72.2% | 77.8% | 77.8% | 78.8% | 87.3% |
Fig. 2 Output comparison between OrthoReD, OrthologID, and OrthoDB. The overall similarities of the outputs in two datasets (FLY and PLANT) generated under different conditions are compared based on the fraction of genes of interest with % identity above a threshold. ReD_s.aln and ReD_n10 (red) used OrthoReD, OID (blue) used OrthologID, and ODB (green) used OrthoDB to generate the output.
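The three summary rows of the comparison table (average % id, % 100% id, % ≥90% id) can all be derived from a list of per-gene % identity values. The sketch below takes those values as given, since this excerpt does not state how the % identity between two predicted ortholog sets is computed; the function name `summarize_identity` and the toy numbers are illustrative only.

```python
def summarize_identity(percent_ids):
    """Summarize per-gene % identity values between two sets of predictions."""
    total = len(percent_ids)
    return {
        "average % id": sum(percent_ids) / total,
        # Share of genes of interest whose two predictions are identical.
        "% 100% id": 100.0 * sum(p == 100.0 for p in percent_ids) / total,
        # Share of genes of interest at or above the 90% threshold.
        "% >=90% id": 100.0 * sum(p >= 90.0 for p in percent_ids) / total,
    }


# Made-up values for four genes of interest, not taken from the paper.
toy = [100.0, 100.0, 95.0, 50.0]
summary = summarize_identity(toy)
print(summary)
```

Sweeping the 90% threshold over a range of values would give the kind of fraction-above-threshold curve the Fig. 2 caption describes.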
Fig. 3 Total CPU time for each condition of OrthoReD. Each box indicates the total CPU time incurred under one condition of OrthoReD. The line in the box indicates the median, and the upper and lower ends of the box indicate the upper and lower quartiles. The minimum runtime is indicated by the lowest point on the line extending below the box (lowest quartile). The maximum runtime is not indicated.
The degree of overlap among the predicted orthologous groups under each condition
| | Number of genes | Number of merged groups | % Identical to Orthologous group¹ | % Genes rescued² |
|---|---|---|---|---|
| FLY_ReD_s.aln | 13,972 | 12,923 | 95.1% | 89.0% |
| PLANT_ReD_n10 | 27,412 | 16,531 | 90.4% | 68.4% |
| ACTINO_ReD_n4 | 8,210 | 6,933 | 92.7% | 82.1% |

¹ Fraction of merged groups that are identical to at least one orthologous group predicted from one gene of interest
² Fraction of genes of interest that belonged to the merged groups in C
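Because OrthoReD predicts one orthologous group per gene of interest, groups from different queries can overlap. The sketch below shows one way such groups could be merged and how the first footnote's fraction could then be computed. It assumes that any two groups sharing at least one gene are merged; the excerpt does not state the merging rule, so this rule and both function names are hypothetical.

```python
def merge_overlapping(groups):
    """Merge groups that share at least one gene (assumed merging rule)."""
    merged = []  # invariant: the sets in this list are pairwise disjoint
    for group in groups:
        current = set(group)
        remaining = []
        for existing in merged:
            if existing & current:
                current |= existing  # absorb every overlapping merged group
            else:
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return merged


def fraction_identical(merged, groups):
    """Fraction of merged groups identical to at least one original group."""
    originals = {frozenset(g) for g in groups}
    return sum(frozenset(m) in originals for m in merged) / len(merged)


# Toy per-query predictions, invented for this example.
groups = [{"g1", "g2"}, {"g1", "g2"}, {"g3", "g4"}, {"g4", "g5"}]
merged = merge_overlapping(groups)
print(len(merged), fraction_identical(merged, groups))
```

In the toy input, the two identical predictions collapse into one merged group that matches an original, while the two partially overlapping predictions merge into a larger group that matches neither.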