Meng-Ru Ho, Wen-Jung Jang, Chun-houh Chen, Lan-Yang Ch'ang, Wen-chang Lin.
Abstract
Orthology is a widely used concept in comparative and evolutionary genomics. In addition to prokaryotic orthology, delineating eukaryotic orthology has provided insight into the evolution of higher organisms. Indeed, many eukaryotic ortholog databases have been established for this purpose. However, unlike prokaryotes, alternative splicing (AS) has hampered eukaryotic orthology assignments. Therefore, existing databases likely contain ambiguous eukaryotic ortholog relationships and possibly misclassify alternatively spliced protein isoforms as in-paralogs, which are duplicated genes that arise following speciation. Here, we propose a new approach for designating eukaryotic orthology using processed transcription units, and we present an orthology database prototype using the human and mouse genomes. Currently existing programs cover less than 69% of the human reference sequences when assigning human/mouse orthologs. In contrast, our method encompasses up to 80% of the human reference sequences. Moreover, the ortholog database presented herein is more than 92% consistent with the existing databases. In addition to managing AS, this approach is capable of identifying orthologs of embedded genes and fusion genes using syntenic evidence. In summary, this new approach is sensitive, specific and can generate a more comprehensive and accurate compilation of eukaryotic orthologs.
Year: 2008 PMID: 18445630 PMCID: PMC2425467 DOI: 10.1093/nar/gkn227
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Criteria involved in generating the GOOD database prototype of human and mouse orthologs. (A) Flow chart of the overall database construction process. (B) Detail representing the merging of different AS products into a processed transcription unit. The upper panel (a) illustrates how the GOOD regions are obtained from all reference sequences. The lower panel (b) illustrates that the processed transcription unit is derived from isoforms from the same genomic region. (C) Detail representing the inclusion of syntenic information. Potential ortholog pairs (D/d, shown in blue) are analyzed based on the three syntenic possibilities shown in panels a, b and c. Boxes with black lines represent anchor pairs; boxes with solid lines represent anchor pairs with syntenic structure and boxes with dotted lines represent pairs that lack syntenic structure.
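The merging step in panel B can be illustrated with a minimal sketch. It assumes that isoforms belong to the same processed transcription unit when they lie on the same chromosome and strand and their genomic spans overlap; this overlap rule, the function name `merge_isoforms`, the tuple layout and the example coordinates are all hypothetical and are not taken from the GOOD pipeline. A naive overlap rule like this one would also merge same-strand embedded genes, which the abstract says are handled using syntenic evidence.

```python
from collections import defaultdict


def merge_isoforms(transcripts):
    """Group transcripts into merged units by genomic overlap.

    transcripts: iterable of (chrom, strand, start, end, name) tuples,
    with closed integer coordinates (an assumption of this sketch).
    Returns a list of (chrom, strand, start, end, [names]) tuples.
    """
    by_locus = defaultdict(list)
    for chrom, strand, start, end, name in transcripts:
        by_locus[(chrom, strand)].append((start, end, name))

    units = []
    for (chrom, strand), items in sorted(by_locus.items()):
        items.sort()  # order by start coordinate
        cur_start, cur_end, names = None, None, []
        for start, end, name in items:
            if cur_start is not None and start <= cur_end:
                # Overlaps the unit being built: extend it.
                cur_end = max(cur_end, end)
                names.append(name)
            else:
                # No overlap: close the previous unit and start a new one.
                if cur_start is not None:
                    units.append((chrom, strand, cur_start, cur_end, names))
                cur_start, cur_end, names = start, end, [name]
        if cur_start is not None:
            units.append((chrom, strand, cur_start, cur_end, names))
    return units


if __name__ == "__main__":
    # Hypothetical isoforms: two overlapping, one separate.
    example = [
        ("chr1", "+", 100, 500, "iso_a"),
        ("chr1", "+", 300, 900, "iso_b"),
        ("chr1", "+", 2000, 2500, "other"),
    ]
    for unit in merge_isoforms(example):
        print(unit)
```

In this sketch the two overlapping isoforms collapse into a single unit spanning both, so downstream ortholog assignment would compare one region per locus rather than one record per isoform.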
Figure 2. Distribution of the percent identity between aligned orthologous protein and transcript pairs from the human and mouse genomes. The x axis indicates the percent identity of paired orthologous sequences, and the y axis indicates the number of orthologous pairs normalized to the total number of input pairs. The aligned identities of protein sequences were obtained using UniGene and are shown in green. The aligned identities of the processed transcription units were obtained using GOOD and are shown in pink.
Comparison of 17 214 GOOD human/mouse orthologous pairs with the four existing ortholog databases
| | GOOD | HomoloGene | UCSC Known Genes | Ensembl Compara | Inparanoid |
|---|---|---|---|---|---|
| # Reference Sequence | N/A | 16 325 HID | 14 692 kgID | 22 047 | 15 549 |
| # Region-based Orthologous Pairs | 17 214 | 14 843 | 12 111 | 12 362 | 9023 |
| # Region-based Orthologous Pairs also identified by GOOD | N/A | 14 327 | 11 889 | 11 332 | 8825 |
| Human reference gene coverage rate | ∼80% | ∼69% | ∼56% | ∼57% | ∼42% |
HID: HomoloGene group ID.
kgID: human/mouse reciprocal conserved UCSC Known Genes ID pair.
N/A: not applicable.
HomoloGene: build 56
UCSC Known Genes: hg18/mm8
Ensembl Compara: release 46
Inparanoid: version 5.1
RefSeq (NCBI build 36) contains 21 544 human regions. Some entries may be lost when converting IDs between databases.
Compared to current ortholog databases, GOOD has higher consistency and also provides the highest coverage rate of the human genome.
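The coverage rates in the table can be approximately reproduced from its own counts. The sketch below assumes that coverage is the number of region-based orthologous pairs divided by the 21 544 human RefSeq regions, and that consistency is the share of a database's pairs also identified by GOOD. These formulas are inferred from the table and are not stated explicitly in the source; the function names are illustrative.

```python
# Counts copied from the comparison table. The denominator is the number
# of human RefSeq regions (NCBI build 36) given in the table notes.
HUMAN_REGIONS = 21544
GOOD_PAIRS = 17214

# database -> (region-based orthologous pairs, pairs also identified by GOOD)
TABLE = {
    "HomoloGene": (14843, 14327),
    "UCSC Known Genes": (12111, 11889),
    "Ensembl Compara": (12362, 11332),
    "Inparanoid": (9023, 8825),
}


def coverage(pairs, total=HUMAN_REGIONS):
    """Percentage of human reference regions covered (assumed formula)."""
    return 100.0 * pairs / total


def consistency(shared, pairs):
    """Percentage of a database's pairs also found by GOOD (assumed formula)."""
    return 100.0 * shared / pairs


def summarize():
    """Return {database: (coverage %, consistency % or None)}."""
    rows = {"GOOD": (coverage(GOOD_PAIRS), None)}
    for name, (pairs, shared) in TABLE.items():
        rows[name] = (coverage(pairs), consistency(shared, pairs))
    return rows


if __name__ == "__main__":
    for name, (cov, cons) in summarize().items():
        cons_text = "N/A" if cons is None else f"{cons:.1f}%"
        print(f"{name:18s} coverage {cov:5.1f}%  consistency {cons_text}")
```

Under these assumptions the rounded coverage values agree with the approximate percentages in the table's last row.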
Figure 3. Pie chart representation of GOOD ortholog data. The whole pie reflects the total number of human reference genes (21 544 human regions) from the RefSeq (NCBI build 36). The green shading represents the percentage of human reference genes that the program was capable of considering for GOOD. The grey shading represents the percentage of human reference genes that are not represented in GOOD [73% of these reference genes are located in non-NM_ regions (regions not supported by experimental evidence), where ortholog designation is difficult]. The blue shading represents those ortholog pairs identified by both GOOD and HomoloGene. GOOD identified 97% of the ortholog pairs in the HomoloGene database. The orange shading represents those ortholog pairs identified by GOOD that were not represented in HomoloGene; 34% of these ortholog pairs (light orange) were represented in one of the other three existing ortholog databases.