Deyou Zheng, Mark B Gerstein.
Abstract
BACKGROUND: Pseudogenes are inheritable genetic elements showing sequence similarity to functional genes but with deleterious mutations. We describe a computational pipeline for identifying them which, in contrast to previous work, explicitly uses intron-exon structure in parent genes to classify pseudogenes. We require alignments between duplicated pseudogenes and their parents to span intron-exon junctions, and this requirement can be used to distinguish true duplicated pseudogenes from processed pseudogenes (with insertions).
Year: 2006 PMID: 16925835 PMCID: PMC1810550 DOI: 10.1186/gb-2006-7-s1-s13
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. A flow chart of our computational pipeline for identifying pseudogenes. It contains two parallel procedures: the one on the left (routine P) is mainly for processed pseudogenes, and the one on the right (routine D) is for duplicated pseudogenes. The steps common to both are shown at the top and at the bottom. Both procedures searched the ENCODE regions for DNA sequences similar to human genes as annotated by ENSEMBL. The two routines differ in how they perform the search and how they process the search results. The key differences are highlighted in blue in P and in orange in D. At the end, an alignment between a known gene and a pseudogene candidate was constructed by either TFASTY or GeneWise. Information in this alignment and the computational path taken by a pseudogene were used together to separate pseudogenes into three classes: duplicated, processed and fragment.
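The final classification step can be illustrated with a minimal sketch. It is not the authors' implementation: the `Block` type, the `classify` function, the thresholds (`min_coverage`, `min_intron`, `tolerance`) and the decision order are all assumptions made for illustration. It only encodes the idea stated in the abstract, that genomic gaps coinciding with parent intron-exon junctions suggest a duplicated pseudogene, while gaps elsewhere are treated as insertions into a processed pseudogene. Plus-strand coordinates are assumed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """One ungapped alignment block (hypothetical representation)."""
    parent_start: int  # start on the parent coding sequence
    parent_end: int    # end on the parent coding sequence (exclusive)
    genome_start: int  # start on the genome, plus strand assumed
    genome_end: int    # end on the genome (exclusive)


def classify(blocks: List[Block], junctions: List[int], parent_len: int,
             min_coverage: float = 0.7, min_intron: int = 30,
             tolerance: int = 3) -> str:
    """Return 'duplicated', 'processed' or 'fragment' for one candidate.

    All thresholds are illustrative assumptions, not values from the paper.
    """
    if not blocks:
        return "fragment"
    blocks = sorted(blocks, key=lambda b: b.parent_start)
    first, last = blocks[0].parent_start, blocks[-1].parent_end

    # The alignment must span at least one parent intron-exon junction,
    # otherwise the exon structure carries no information.
    if not any(first < j < last for j in junctions):
        return "fragment"

    # Count genomic gaps that sit at a parent junction (intron-like gaps).
    intron_like = 0
    for left, right in zip(blocks, blocks[1:]):
        gap = right.genome_start - left.genome_end
        at_junction = any(abs(left.parent_end - j) <= tolerance
                          for j in junctions)
        if gap >= min_intron and at_junction:
            intron_like += 1

    if intron_like > 0:
        return "duplicated"

    # No intron-like gaps: processed if enough of the parent is covered.
    covered = sum(b.parent_end - b.parent_start for b in blocks)
    return "processed" if covered / parent_len >= min_coverage else "fragment"


junctions = [300, 600]  # hypothetical parent with three exons
print(classify([Block(0, 900, 1000, 1900)], junctions, 900))
print(classify([Block(0, 300, 1000, 1300), Block(300, 600, 2300, 2600),
                Block(600, 900, 4000, 4300)], junctions, 900))
```

In this sketch a gap that does not coincide with a junction, such as a repeat inserted into a retrotransposed copy, does not count as intron-like, so the candidate can still be called processed.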
Separation of 184 pseudogenes in ENCODE regions identified in this study
| Final pseudogene type* | Detected only by routine P | Detected by both routines | Detected only by routine D |
| Processed | 60 | 30 | 3 |
| Non-processed | | | |
| Duplicated | 3 | 13† | 3 |
| Fragment | 60 | 1 | 11 |
*The types are the final classification after information from routines P and D was combined. They may differ from the initial type assigned to a pseudogene in either routine P or D. †In routine P, two of these were annotated as processed, two as fragments, and another four were identified only partially.
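As a quick consistency check, the counts in the table can be summed per type and per detection route. The snippet below only restates the table; the variable names are illustrative.

```python
# Counts copied from the table: (only routine P, both routines, only routine D).
counts = {
    "processed": (60, 30, 3),
    "duplicated": (3, 13, 3),
    "fragment": (60, 1, 11),
}

# Total per final pseudogene type.
row_totals = {kind: sum(row) for kind, row in counts.items()}
# Total per detection route, in the same column order as the table.
column_totals = tuple(sum(col) for col in zip(*counts.values()))
grand_total = sum(row_totals.values())

print(row_totals)     # {'processed': 93, 'duplicated': 19, 'fragment': 72}
print(column_totals)  # (123, 44, 17)
print(grand_total)    # 184
```

The grand total matches the 184 pseudogenes reported, and 44 of them were detected by both routines.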
Figure 2. Distribution of 184 pseudogenes in ENCODE regions. Pseudogenes were first grouped into processed and non-processed (duplicated and fragments). Their numbers in the 44 ENCODE regions are plotted. The inset shows that the number of pseudogenes correlates approximately with the number of genes within individual regions.
Overlapping of our 184 pseudogenes with GENCODE annotations
| | GENCODE annotation | | | |
| Annotation in this study | Processed | Non-processed | Not annotated | Exons |
| Processed | 70 | 7 | 13 | 3 |
| Non-Processed | 15 | 44 | 17 | 17 |
| Not Annotated | 15 | 18 | - | - |
Figure 3. Two pseudogenes inconsistent with GENCODE gene annotation. (a) A pseudogene in ENr122: 359245-366200 (+) and its alignment with an ENSEMBL protein ENSP00000331368 (Serpin B8). This pseudogene overlaps a GENCODE gene whose transcript (ID: 'AC009802.2-001') contained the three pseudo-exons and one additional 5' exon. (b) A pseudogene at ENm005:200473-211501 (-) and its alignment with an ENSEMBL protein ENSP00000283507 (TCP-10 homolog). The first four pseudo-exons were included in a five-exon GENCODE transcript (ID: 'AP000274.7-001'). The frameshift mutations ('!' in the alignment) in both pseudogenes are highlighted.
Examples of known pseudogenes in ENCODE regions
| Name | Region | Pseudogene location | Our annotation | GENCODE annotation |
| β-globin | ENm009 | 488570-490726 (-) | 489920-490351 (-) | 488931-490348 (-) |
| α-globin | ENm008 | 156150-156704 (+) | NA | NA |
| α-globin | ENm008 | 158635-159503 (+) | 158920-159084 (+) | 158678-159333 (+) |
| ζ-globin | ENm008 | 152711-155400 (+) | NA | 153121-155155 (+) |
| OR51H2P | ENm009 | 123369-124314 (+) | 123353-124305 (+) | 123368-124273 (+) |
| OR51B8P | ENm009 | 577399-578156 (-) | 577369-578174 (-) | 577403-578171 (-) |
NA, not annotated.
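One way to compare the coordinate ranges in the table is to compute how many bases two annotations share. The sketch below assumes the coordinates are 1-based and inclusive, which the table does not state, and that both annotations lie on the same strand; `overlap_length` and `coverage` are hypothetical helper names.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end), assumed 1-based and inclusive


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of positions shared by two intervals (0 if disjoint)."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return max(0, end - start + 1)


def coverage(a: Interval, b: Interval) -> float:
    """Fraction of interval a that is covered by interval b."""
    return overlap_length(a, b) / (a[1] - a[0] + 1)


# Beta-globin pseudogene row in ENm009: our annotation vs GENCODE annotation.
ours = (489920, 490351)
gencode = (488931, 490348)
print(overlap_length(ours, gencode))  # 429
print(coverage(ours, gencode))
```

Under these assumptions, nearly all of our beta-globin annotation falls inside the longer GENCODE annotation, while the GENCODE range extends further toward lower coordinates.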