Malte Petersen, Karen Meusemann, Alexander Donath, Daniel Dowling, Shanlin Liu, Ralph S Peters, Lars Podsiadlowski, Alexandros Vasilikopoulos, Xin Zhou, Bernhard Misof, Oliver Niehuis.
Abstract
BACKGROUND: Orthology characterizes genes of different organisms that arose from a single ancestral gene via speciation, in contrast to paralogy, which is assigned to genes that arose via gene duplication. Accurate orthology assignment is a crucial step in comparative genomic studies. Orthologous genes in two organisms can be identified by applying a so-called reciprocal search strategy, provided that complete information on the organisms' gene repertoire is available. In many investigations, however, only a fraction of the gene content of the organisms under study is examined (e.g., by RNA sequencing). In such cases, orthologous nucleotide or amino acid sequences can be identified using a graph-based approach that maps nucleotide sequences to genes of known orthology. Existing implementations of this approach, however, suffer from algorithmic issues that may cause problems in downstream analyses.
Keywords: Crabronidae; Orthology; Paralogy; Sphecidae; Splice variants; Transcriptome
Year: 2017 PMID: 28209129 PMCID: PMC5312442 DOI: 10.1186/s12859-017-1529-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Orthograph workflow. From a set of reference proteins (1), the proteins are clustered to form orthologous groups (OGs) (2). These OGs are aligned to construct profile hidden Markov models (pHMMs) (3). The pHMMs are used to search for candidate orthologs in the target library (4). Each of the obtained hit amino acid sequences (5) is used as a query for a BLAST search in a database comprising all reference proteins (including the ones forming OGs) (6). Search results from both forward and reverse searches (7) are collated and sorted by bit score, with the reverse search result order being subordinated to the forward result order (8). This list is evaluated in descending order: if the reverse search hit a protein that is part of the OG used for the forward search, the candidate ortholog is mapped to the OG (9).
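Steps 7 to 9 of the workflow can be illustrated with a minimal Python sketch. It is not Orthograph's implementation: the names `ForwardHit`, `ReverseHit`, and `map_candidates` are hypothetical, the input is assumed to be already parsed search output, and the rule that each candidate and each OG is assigned at most once is an assumption that the caption does not state.

```python
from typing import NamedTuple


class ForwardHit(NamedTuple):
    candidate: str    # sequence found in the target library by a pHMM
    og: str           # orthologous group whose pHMM produced the hit
    bit_score: float


class ReverseHit(NamedTuple):
    candidate: str    # query of the reverse BLAST search
    reference: str    # reference protein hit by that query
    bit_score: float


def map_candidates(forward, reverse, og_members):
    """Return {candidate: og} for candidates that pass the reciprocal check."""
    # Step 7: collate forward and reverse results for the same candidate.
    pairs = [(f, r) for f in forward for r in reverse
             if r.candidate == f.candidate]
    # Step 8: sort by forward bit score; the reverse score only breaks ties.
    pairs.sort(key=lambda p: (p[0].bit_score, p[1].bit_score), reverse=True)

    assigned = {}
    used_ogs = set()
    # Step 9: walk the list in descending order.
    for f, r in pairs:
        # Assumption (not in the caption): one candidate per OG and vice versa.
        if f.candidate in assigned or f.og in used_ogs:
            continue
        # Reciprocal criterion: the reverse hit belongs to the forward OG.
        if r.reference in og_members.get(f.og, set()):
            assigned[f.candidate] = f.og
            used_ogs.add(f.og)
    return assigned


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    og_members = {"OG1": {"refA", "refB"}, "OG2": {"refC"}}
    forward = [ForwardHit("t1", "OG1", 250.0), ForwardHit("t2", "OG1", 180.0),
               ForwardHit("t2", "OG2", 90.0), ForwardHit("t3", "OG2", 60.0)]
    reverse = [ReverseHit("t1", "refA", 300.0), ReverseHit("t2", "refC", 120.0),
               ReverseHit("t3", "refX", 70.0)]
    print(map_candidates(forward, reverse, og_members))
```

In the toy data, `t2` has a stronger forward hit to OG1, but its reverse search hits `refC`, a member of OG2, so it is mapped to OG2. As written, any reverse hit inside the OG is enough, which follows the caption literally. A stricter variant would require the best reverse hit to fall within the OG.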
Results from the tests that compare Orthograph performance to HaMStR [3]
| Software | Test | Genes | Species | OGS | Found | TP | FP | FN | Sens. | Acc. |
|---|---|---|---|---|---|---|---|---|---|---|
| Orthograph | Single-copy | 4625 | | 15,314 | 4582 | 4577 | 5 | 48 | 0.990 | 0.996 |
| Orthograph | Single-copy | 4625 | | 18,564 | 4590 | 4587 | 3 | 38 | 0.992 | 0.997 |
| HaMStR | Single-copy | 4625 | | 15,314 | 4589 | 4588 | 3 | 39 | 0.992 | 0.997 |
| HaMStR | Single-copy | 4625 | | 18,564 | 4573 | 4571 | 2 | 54 | 0.988 | 0.996 |
| Orthograph | Isoforms | 8 | | 17,064 | 7 | 7 | 0 | 1 | 0.875 | 0.999 |
| HaMStR | Isoforms | 8 | | 17,064 | 7 | 7 | 0 | 1 | 0.875 | 0.999 |
| Orthograph | Inparalogs | 647 | | 18,093 | 583 | 583 | 0 | 6 | 0.901 | 0.996 |
Sensitivity is defined as the ratio of true positives (TP) to TP plus false negatives (FN). Accuracy is defined as the ratio of TP plus true negatives (TN) to the total number of genes in the official gene set (OGS). FP, false positives. Note that the results are meant to demonstrate equality in performance despite algorithmic differences
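The two definitions in the table note can be written as a short Python sketch. The function names are hypothetical. The table does not report TN, so the accuracy call below uses invented numbers rather than a row from the table.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def accuracy(tp: int, tn: int, ogs_size: int) -> float:
    """Accuracy = (TP + TN) / total number of genes in the OGS."""
    return (tp + tn) / ogs_size


if __name__ == "__main__":
    # First Orthograph single-copy row of the table: TP = 4577, FN = 48.
    print(f"{sensitivity(4577, 48):.3f}")
    # Invented numbers for illustration, because TN is not listed in the table.
    print(f"{accuracy(tp=90, tn=900, ogs_size=1000):.3f}")
```

For the first table row, the sensitivity comes to about 0.990, which matches the reported value.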