Marcus Lechner, Maribel Hernandez-Rosales, Daniel Doerr, Nicolas Wieseke, Annelyse Thévenin, Jens Stoye, Roland K Hartmann, Sonja J Prohaska, Peter F Stadler.
Abstract
The elucidation of orthology relationships is an important step both in gene function prediction and towards understanding patterns of sequence evolution. For large datasets, orthology assignments are usually derived directly from sequence similarities because more exact approaches are computationally too expensive. Here we present PoFF, an extension for the standalone tool Proteinortho, which enhances orthology detection by combining clustering, sequence similarity, and synteny. In the course of this work, FFAdj-MCS, a heuristic that assesses pairwise gene order using adjacencies (a similarity measure related to the breakpoint distance), was adapted to support multiple linear chromosomes and extended to detect duplicated regions. PoFF substantially reduces the number of false positives and enables more fine-grained predictions than purely similarity-based approaches. The extension maintains the low memory requirements and the efficient concurrency options of its basis, Proteinortho, making the software applicable to very large datasets.
Year: 2014 PMID: 25137074 PMCID: PMC4138177 DOI: 10.1371/journal.pone.0105015
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Synteny-enhanced orthology prediction.
Four genes (A1, A2, B1, B2) in two species (A and B). a) The gene tree with a duplication (filled double circle) and a speciation event (empty circle). b) Gene order in the genomic context of both genes. Genes A'x and B'x are orthologous to each other. Lines depict candidate partners suggested by sequence similarity; the dashed ones were rejected by the gene order algorithm.
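How gene order can separate two equally similar candidates may be illustrated by counting conserved adjacencies. The following is a minimal sketch, not the FFAdj-MCS heuristic itself: it assumes a single linear chromosome per species, ignores gene orientation and similarity weights, and uses hypothetical flanking gene names (`Au`, `Bu`, and so on) in place of the A'x and B'x neighbours.

```python
def conserved_adjacencies(genome_a, genome_b, matching):
    """Count neighbouring gene pairs in A whose partners are neighbours in B.

    genome_a, genome_b: gene identifiers in chromosomal order (assumed layout).
    matching: one-to-one dict mapping genes of A to their partner in B.
    """
    # Restrict both gene orders to the genes that take part in the matching.
    order_a = [g for g in genome_a if g in matching]
    matched_b = set(matching.values())
    pos_b = {g: i for i, g in enumerate(h for h in genome_b if h in matched_b)}

    count = 0
    for left, right in zip(order_a, order_a[1:]):
        # Unsigned adjacency: the partners sit next to each other in B.
        if abs(pos_b[matching[left]] - pos_b[matching[right]]) == 1:
            count += 1
    return count


# Hypothetical gene orders: each copy sits in its own neighbourhood.
genome_a = ["Au", "A1", "Av", "Aw", "A2", "Ax"]
genome_b = ["Bu", "B1", "Bv", "Bw", "B2", "Bx"]
flanks = {"Au": "Bu", "Av": "Bv", "Aw": "Bw", "Ax": "Bx"}

same_context = {**flanks, "A1": "B1", "A2": "B2"}
swapped = {**flanks, "A1": "B2", "A2": "B1"}

print(conserved_adjacencies(genome_a, genome_b, same_context))
print(conserved_adjacencies(genome_a, genome_b, swapped))
```

Under these assumptions the matching that pairs each gene with the partner in the same genomic context keeps more adjacencies than the swapped matching, which is the kind of signal a gene order criterion can use to reject the dashed candidates.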
Figure 2. Workflow of PoFF.
Similar gene sequences are determined by an all-against-all BLAST search. Top reciprocal matches are ordered by their positions in the respective genomes. The FFAdj-MCS algorithm is applied to determine the maximum matching with respect to sequence similarity and gene order. As a result, the orthology graph contains only the edges that remain from the pairwise comparisons. Finally, orthologous groups are extracted by clustering.
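The reciprocal-match step of this workflow can be sketched as follows. This is a simplified illustration, assuming that only the single best hit per gene is kept and that similarity is given as a plain score per (query, target) pair; the function names and the score values are hypothetical, and the actual tool may retain several near-best matches per gene.

```python
def best_hits(scores):
    """Return the best-scoring target for every query.

    scores: dict mapping (query, target) to a similarity score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores for the four genes of a two-species example.
a_vs_b = {("A1", "B1"): 200, ("A1", "B2"): 180,
          ("A2", "B2"): 190, ("A2", "B1"): 170}
b_vs_a = {("B1", "A1"): 200, ("B1", "A2"): 170,
          ("B2", "A2"): 190, ("B2", "A1"): 180}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

The resulting pairs would then be ordered by genomic position before the gene order step; that ordering is omitted here.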
Composition of simulated datasets.
| Dataset | Families | Proteins | ø Family size | ø Breakpoint distance |
| F50 | 50 | 8,363 | 167 proteins | 13 |
| F80d | 80 | 15,296 | 191 proteins | 19 |
| F100 | 100 | 27,258 | 273 proteins | 14 |
The simulated datasets differ in the number of gene families present in the species as well as in the size of these families. The larger the families, the more diversity among the set of species can be considered. Set F80d additionally comprises whole-genome duplications.
Figure 3. A reconciled tree for gene families.
The gene tree is embedded in the species tree. Internal nodes represent either gene duplication events (filled double circles) or speciation events (empty circles). Gene loss is depicted by ×.
Comparison using simulated data.
| Dataset | Method | Precision | Recall | Accuracy | tn rate | Runtime |
| F50 | | 3.06% | 7.26% | 86.18% | 89.71% | 7 h, 22 min |
| | | 38.64% | 9.62% | 95.49% | 99.32% | 1 day, 14 h |
| | | 98.01% | 5.02% | 95.94% | 99.99% | 2 days, 2 h |
| | | 80.63% | 23.11% | 97.62% | 99.83% | 0 h, 36 min |
| | | 96.15% | 24.18% | 97.53% | 99.96% | 0 h, 36 min |
| F80d | | 0.92% | 0.88% | 87.44% | 93.43% | 15 h, 46 min |
| | | 43.97% | 5.25% | 93.51% | 99.54% | 3 days, 23 h |
| | | 97.67% | 0.89% | 93.65% | 99.99% | 8 days, 23 h |
| | | 79.36% | 16.64% | 97.68% | 99.88% | 1 h, 29 min |
| | | 93.98% | 15.52% | 97.30% | 99.96% | 1 h, 30 min |
| F100 | | - | - | - | - | >31 days |
| | | 23.99% | 20.48% | 99.37% | 99.71% | 6 h, 39 min |
| | | 90.16% | 18.17% | 99.62% | 99.99% | 6 h, 44 min |
Comparison of computational results with orthology relations derived from simulated datasets with different gene family sizes. Statistical values are explained in Materials and Methods. tn rate refers to true negative rate. Running time was measured on a quad-core CPU (Intel Core i7 at 2.9 GHz) with eight threads.
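For orientation, the four statistics can be computed from pairwise confusion counts as sketched below. The formulas are the standard textbook definitions and are assumed, not confirmed, to match those given in Materials and Methods; the function names are hypothetical.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def orthology_metrics(tp, fp, fn, tn):
    """Standard statistics over gene pairs.

    tp: pairs predicted orthologous that are orthologous
    fp: pairs predicted orthologous that are not
    fn: orthologous pairs that were not predicted
    tn: non-orthologous pairs that were not predicted
    """
    return {
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),
        "accuracy": safe_ratio(tp + tn, tp + fp + fn + tn),
        "tn_rate": safe_ratio(tn, tn + fp),
    }


# Arbitrary illustrative counts, not taken from the tables.
print(orthology_metrics(tp=6, fp=2, fn=0, tn=2))
```

Because most gene pairs in a dataset are not orthologous, the true negative count usually dominates, so accuracy and tn rate can stay high even when precision or recall is low.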
Comparison using real data.
| Dataset | Method | Precision | Recall | Accuracy | tn rate |
| | | 99.50% | 23.80% | 29.12% | 98.45% |
| | | 99.52% | 22.50% | 27.93% | 98.47% |
| | | | | | |
| | | 61.36% | 42.82% | 99.64% | 99.89% |
| | | 59.10% | 38.35% | 99.62% | 99.89% |
| | | 59.07% | 36.97% | 99.62% | 99.89% |
| | | | | | |
| | | 100% | 17.68% | 24.71% | 100% |
| | | 100% | 9.72% | 17.44% | 90.27% |
Comparison of tools on the basis of estimated orthology relations from real data sets. Statistical values are explained in Materials and Methods. tn rate refers to true negative rate.
Figure 4. The false negative issue.
The genes {A1, B1, B2, C1, C2} form an orthologous group. {B1, B2} as well as {C1, C2} are not orthologous to each other but co-orthologous with respect to A1 (A1 and {B1, B2, C1, C2} are separated by a speciation event). Pairwise true orthology relationships are marked by black arcs, false ones by grey arcs. Proteinortho is more inclusive: it would report all five genes as one group, yielding six true and two false positives (grey). Assuming that the gene copies 1 and 2 exhibit distinct genomic neighborhoods in all three species A to C, PoFF would report two separate groups, namely {A1, B1, C1} and {B2, C2}. This more fine-grained method avoids false positive orthology assignments. However, it introduces false negative assignments, two in this example, depicted by dashed arcs.
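The counts in this caption can be reproduced with a short sketch. It rests on two assumptions that are inferred from the caption and not stated in it: the duplication predates the split of species B and C, so that B1/C1 and B2/C2 are orthologous while B1/C2 and B2/C1 are not, and only pairs of genes from different species are counted as predictions. All names in the code are hypothetical.

```python
from itertools import combinations

# Assumed true orthologous pairs for the five genes of the example.
TRUE_ORTHOLOGS = {
    frozenset(pair)
    for pair in [("A1", "B1"), ("A1", "B2"), ("A1", "C1"), ("A1", "C2"),
                 ("B1", "C1"), ("B2", "C2")]
}


def species(gene):
    """The first character of a gene name encodes its species (assumption)."""
    return gene[0]


def predicted_pairs(groups):
    """Expand reported groups into cross-species gene pairs."""
    pairs = set()
    for group in groups:
        for g, h in combinations(sorted(group), 2):
            if species(g) != species(h):
                pairs.add(frozenset((g, h)))
    return pairs


def score(groups):
    """Return (true positives, false positives, false negatives)."""
    predicted = predicted_pairs(groups)
    tp = len(predicted & TRUE_ORTHOLOGS)
    fp = len(predicted - TRUE_ORTHOLOGS)
    fn = len(TRUE_ORTHOLOGS - predicted)
    return tp, fp, fn


one_group = [{"A1", "B1", "B2", "C1", "C2"}]      # inclusive grouping
two_groups = [{"A1", "B1", "C1"}, {"B2", "C2"}]   # synteny-aware grouping

print(score(one_group))   # (6, 2, 0)
print(score(two_groups))  # (4, 0, 2)
```

With these assumptions the single inclusive group gives six true and two false positives, while the split into two groups removes both false positives at the cost of two false negatives (A1/B2 and A1/C2).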