Alison Yao, Rosane Charlab, Peter Li.
Abstract
The identification of pseudogenes is an integral part of genome annotation because of their abundance and their impact on the experimental analysis of functional genes. Most computational annotation systems are not optimized for systematic pseudogene recognition and often annotate pseudogenes as functional genes; users then propagate these errors to subsequent analyses and interpretations. To validate gene annotations and to identify pseudogenes that are potentially mis-annotated, we developed a novel approach based on whole genome profiling of existing transcript and protein sequences. This method has two important features: (i) it detects processed and non-processed pseudogenes equally well and (ii) it can identify transcribed pseudogenes. Applying this method to the human Ensembl gene predictions, we discovered that 2011 (9% of total) Ensembl genes in the known and novel categories might be pseudogenes based on expression evidence. Of these, 1200 genes have no existing evidence of transcription, and 811 genes have transcription evidence but contain significant translation disruption. Approximately 40% of the 2011 identified pseudogenes have a multi-exon structure, representing non-processed pseudogenes. We have demonstrated the power of whole genome profiling of expression sequences to improve the accuracy of gene annotations.
Year: 2006 PMID: 16945953 PMCID: PMC1636364 DOI: 10.1093/nar/gkl591
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The conceptual relationships between parental gene, pseudogene and expression evidence. More mismatches (X) would be seen in the non-best hits of evidence aligned to the assembled genome.
Figure 2. Composite mapping of protein sequence. Mapping of protein sequence to genome by linking segments of TBLASTN results prior to Genewise alignment.
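The linking step named in the Figure 2 caption can be pictured with a minimal Python sketch. It is an illustration only, not the authors' implementation: the `Hsp` record, the `link_segments` function and the `max_gap` distance are hypothetical names and values chosen here, and the rule used (merge TBLASTN segments on the same chromosome and strand when they lie within `max_gap` bases) is an assumption about what "linking segments" might involve.

```python
from typing import List, NamedTuple, Tuple


class Hsp(NamedTuple):
    """One TBLASTN segment (hypothetical record layout)."""
    chrom: str
    strand: str
    g_start: int  # genomic start, g_start <= g_end
    g_end: int    # genomic end


def link_segments(hsps: List[Hsp], max_gap: int = 100_000) -> List[Tuple[str, str, int, int]]:
    """Merge nearby segments into candidate loci.

    Segments on the same chromosome and strand are linked when the gap to
    the current locus is at most max_gap (an assumed, illustrative value).
    Each returned tuple is (chrom, strand, start, end) and marks a region
    that could then be handed to a spliced aligner such as Genewise.
    """
    loci: List[list] = []
    for h in sorted(hsps, key=lambda x: (x.chrom, x.strand, x.g_start)):
        same_locus = (
            bool(loci)
            and loci[-1][0] == h.chrom
            and loci[-1][1] == h.strand
            and h.g_start - loci[-1][3] <= max_gap
        )
        if same_locus:
            # Extend the current locus to cover this segment.
            loci[-1][3] = max(loci[-1][3], h.g_end)
        else:
            loci.append([h.chrom, h.strand, h.g_start, h.g_end])
    return [tuple(locus) for locus in loci]


example = [
    Hsp("chr1", "+", 1000, 1150),
    Hsp("chr1", "+", 5000, 5210),
    Hsp("chr1", "+", 900000, 900360),
]
linked = link_segments(example)
```

In this toy input the first two segments fall within the assumed gap and form one locus, while the distant third segment forms its own locus, which is how a parental gene and a pseudogene copy of the same protein would be kept apart.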
Summary of gene and evidence mapping
| Gene and evidence | Total sequences | Criteria | Mapped sequences | Percentage | Alignments per sequence |
|---|---|---|---|---|---|
| Ensembl functional genes | 22 216 | Location verification | 22 131 | 99.62% | NA |
| Ensembl pseudogenes | 1978 | Location verification | 1976 | 99.90% | NA |
| Vega chromosome 9 and 10 pseudogenes | 1031 | Location verification | 667 | 64.69% | NA |
| Refseq | 22 887 | 70% identity, 50% length | 22 871 | 99.93% | 1.61 |
| GenBank mRNA | 195 073 | 70% identity, 50% length | 185 385 | 95.03% | 19.87 |
| EST | 6 020 341 | 95% identity, 50% length | 5 089 981 | 84.55% | 1.38 |
| Swiss-Prot | 11 777 | TBLASTN <1 × 10^−10 | 11 192 | 95.03% | 5.73 |
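The identity and length criteria in the table above can be read as a simple filter on each alignment. The sketch below is an assumption-laden illustration: the `THRESHOLDS` dictionary, the function names, and the interpretation of "50% length" as aligned length divided by sequence length are choices made here, not details confirmed by the source. The second function only reproduces the "Percentage" column from the counts.

```python
# (minimum identity, minimum aligned fraction) per evidence type,
# taken from the "Criteria" column; the keys are hypothetical labels.
THRESHOLDS = {
    "refseq": (0.70, 0.50),
    "mrna": (0.70, 0.50),
    "est": (0.95, 0.50),
}


def passes_mapping_criteria(kind: str, identity: float,
                            aligned_length: int, sequence_length: int) -> bool:
    """Return True if an alignment meets the identity and length cutoffs.

    Assumes "50% length" means the aligned part covers at least half of
    the evidence sequence.
    """
    min_identity, min_coverage = THRESHOLDS[kind]
    coverage = aligned_length / sequence_length
    return identity >= min_identity and coverage >= min_coverage


def mapped_percentage(mapped: int, total: int) -> float:
    """Recompute the 'Percentage' column from mapped and total counts."""
    return round(100.0 * mapped / total, 2)
```

For example, `mapped_percentage(22871, 22887)` reproduces the Refseq row and `mapped_percentage(667, 1031)` the Vega row.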
Summary of EST profiling of Ensembl genes and pseudogenes
| # of EST mapped | Overlapping Ensembl genes | Overlapping Ensembl pseudogenes | Non-transcribed Ensembl genes | Non-transcribed Ensembl pseudogenes | Total non-transcribed pseudogenes |
|---|---|---|---|---|---|
| 1 million | 18 757 | 739 | 2942 | 912 | 3854 |
| 2 million | 19 780 | 911 | 1874 | 802 | 2676 |
| 3 million | 20 161 | 1007 | 1388 | 725 | 2113 |
| 4 million | 20 387 | 1090 | 1104 | 662 | 1766 |
| 5 million | 20 524 | 1158 | 918 | 613 | 1531 |
| Total | 20 536 | 1159 | | | |
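A short Python check makes the arithmetic in this table explicit: the last column is the sum of the two non-transcribed columns, and each additional million ESTs removes fewer candidates than the previous one. The variable and function names are illustrative; the numbers are copied from the rows above.

```python
# EST sample size (millions) -> (non-transcribed Ensembl genes,
#                                non-transcribed Ensembl pseudogenes,
#                                total non-transcribed pseudogenes)
est_profile = {
    1: (2942, 912, 3854),
    2: (1874, 802, 2676),
    3: (1388, 725, 2113),
    4: (1104, 662, 1766),
    5: (918, 613, 1531),
}


def totals_are_consistent(profile: dict) -> bool:
    """Check that genes + pseudogenes equals the reported total in every row."""
    return all(genes + pseudo == total for genes, pseudo, total in profile.values())


def decrements(profile: dict) -> list:
    """Drop in total candidates for each additional million ESTs."""
    totals = [profile[k][2] for k in sorted(profile)]
    return [a - b for a, b in zip(totals, totals[1:])]


consistent = totals_are_consistent(est_profile)
drops = decrements(est_profile)
```

The shrinking decrements are the numerical counterpart of a curve that flattens as more EST evidence is added.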
Figure 3. Potential pseudogenes as a function of EST evidence used. Total potential pseudogenes without best hits or lack of any EST hits plotted against the amount of EST evidence used.
Summary statistics for the validation of Vega and Ensembl pseudogenes
| Categories | Vega # | Vega % of respective category | Vega % of total | Ensembl # | Ensembl % of respective category | Ensembl % of total |
|---|---|---|---|---|---|---|
| Original | 1031 | | | 1978 | | |
| Aligned with consistent locations | 667 | | | 1974 | | |
| cDNA support | 469 | | | 1611 | | |
| No cDNA best hits (non-transcribed pseudogene) | 266 | 100.00 | 45.16 | 977 | 100.00 | 52.95 |
| With cDNA best hits | 203 | | | 642 | | |
| Swiss-Prot protein support | 150 | | | 380 | | |
| No best hits and presence of frameshifts (transcribed pseudogene) | 92 | 61.33 | 15.62 | 311 | 81.84 | 16.86 |
| No cDNA support | 198 | | | 634 | | |
| Swiss-Prot protein support | 120 | | | 234 | | |
| No best hits and presence of frameshifts (non-transcribed pseudogene) | 93 | 77.50 | 15.79 | 212 | 90.60 | 11.49 |
| No support from cDNA and Swiss-Prot | 78 | | | 129 | | |
| Non-transcribed pseudogenes | 78 | NA | NA | 129 | NA | NA |
| Total genes supported by evidence | 589 | | | 1845 | | |
| Total pseudogenes validated | 451 | NA | 76.57 | 1500 | NA | 81.30 |
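The row labels of this table suggest a decision procedure, sketched below in Python. This is one reading of the labels, not the authors' code: the `GeneEvidence` fields, the `classify` function and the returned strings are hypothetical, and treating "no best hits and presence of frameshifts" as a logical AND is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    cdna_aligned: bool      # some cDNA (mRNA or EST) aligns to the locus
    cdna_best_hit: bool     # the locus is the best genomic hit for some cDNA
    protein_aligned: bool   # some Swiss-Prot protein aligns to the locus
    protein_best_hit: bool  # the locus is the best genomic hit for some protein
    frameshift: bool        # the protein alignment contains frameshifts


def classify(ev: GeneEvidence) -> str:
    """Assign a category following the table rows (assumed reading)."""
    # cDNA aligns, but always matches better elsewhere in the genome.
    if ev.cdna_aligned and not ev.cdna_best_hit:
        return "non-transcribed pseudogene"
    transcribed = ev.cdna_aligned and ev.cdna_best_hit
    if ev.protein_aligned:
        if not ev.protein_best_hit and ev.frameshift:
            return "transcribed pseudogene" if transcribed else "non-transcribed pseudogene"
        return "no pseudogene call"
    if not transcribed:
        # Neither cDNA nor protein evidence supports the locus.
        return "non-transcribed pseudogene (no supporting evidence)"
    return "no pseudogene call"
```

For instance, `classify(GeneEvidence(True, True, True, False, True))` returns `"transcribed pseudogene"`, matching the row for loci with cDNA best hits whose protein alignment is disrupted.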
Comparison of exon structures between Ensembl pseudogenes and their parental genes from the 288 identified pairs
| Categories | Total pairs | One-to-one (pseudogene to parent) | One-to-many (pseudogene to parent) |
|---|---|---|---|
| Group A: pseudogene single-exon, parent multi-exon | 109 (37.85%) | 102 | 7 |
| Group B: both single-exon | 42 (14.58%) | 27 | 15 |
| Group C: both multi-exon | 112 (38.89%) | 96 | 16 |
| Group D: pseudogene 2-exon, parent single-exon | 25 (8.68%) | 14 | 11 |
| Total | 288 | 239 | 49 |
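The four groups can be expressed as a function of the two exon counts, and the percentages follow from the pair counts. The Python sketch below uses hypothetical function names; note that Group D is generalised here to any multi-exon pseudogene with a single-exon parent, whereas the table reports only 2-exon cases.

```python
def exon_group(pseudo_exons: int, parent_exons: int) -> str:
    """Map exon counts of a pseudogene/parent pair to groups A-D."""
    if pseudo_exons < 1 or parent_exons < 1:
        raise ValueError("exon counts must be at least 1")
    pseudo_single = pseudo_exons == 1
    parent_single = parent_exons == 1
    if pseudo_single and not parent_single:
        return "A"  # single-exon pseudogene, multi-exon parent
    if pseudo_single and parent_single:
        return "B"  # both single-exon
    if not parent_single:
        return "C"  # both multi-exon
    return "D"      # multi-exon pseudogene, single-exon parent (assumed generalisation)


group_counts = {"A": 109, "B": 42, "C": 112, "D": 25}


def group_percentages(counts: dict) -> dict:
    """Share of each group among all pairs, rounded to two decimals."""
    total = sum(counts.values())
    return {g: round(100.0 * n / total, 2) for g, n in counts.items()}


percentages = group_percentages(group_counts)
```

The counts sum to the 288 pairs in the table, and the computed shares agree with the percentages given in parentheses.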
Functional classification of pseudogenes
| Name | Number |
|---|---|
| Olfactory receptor | 53 |
| Keratin type I/II | 24 |
| Immunoglobulin | 21 |
| Peptidyl-prolyl | 17 |
| Heterogeneous nuclear ribonucleoprotein | 16 |
| HMGI/II (high mobility group) | 14 |
| Dynein heavy chain | 12 |
| Nucleophosmin | 12 |
| 40S ribosomal S2 | 11 |
| Elongation factor 1 alpha | 10 |