Paul M Harrison, Deyou Zheng, Zhaolei Zhang, Nicholas Carriero, Mark Gerstein.
Abstract
Pseudogenes, in the case of protein-coding genes, are gene copies that have lost the ability to code for a protein; they are typically identified through annotation of disabled, decayed or incomplete protein-coding sequences. Processed pseudogenes (PΨgs) are made through mRNA retrotransposition. There is overwhelming genomic evidence for thousands of human PΨgs and also dozens of human processed genes that comprise complete retrotransposed copies of other genes. Here, we survey for an intermediate entity, the transcribed processed pseudogene (TPΨg), which is disabled but nonetheless transcribed. TPΨgs may affect expression of paralogous genes, as observed in the case of the mouse makorin1-p1 TPΨg. To elucidate their role, we identified human TPΨgs by mapping expressed sequences onto PΨgs and, reciprocally, extracting TPΨgs from known mRNAs. We consider only those PΨgs that are homologous to either non-mammalian eukaryotic proteins or protein domains of known structure, and require detection of identical coding-sequence disablements in both the expressed and genomic sequences. Oligonucleotide microarray data provide further expression verification. Overall, we find 166-233 TPΨgs (approximately 4-6% of PΨgs). Proteins/transcripts with the highest numbers of homologous TPΨgs generally have many homologous PΨgs and are abundantly expressed. TPΨgs are significantly over-represented near both the 5′ and 3′ ends of genes; this suggests that TPΨgs can be formed through gene-promoter co-option, or intrusion into untranslated regions. However, roughly half of the TPΨgs are located away from genes in the intergenic DNA and thus may be co-opting cryptic promoters of undesignated origin. Furthermore, TPΨgs are unlike other PΨgs and processed genes in the following ways: (i) they do not show a significant tendency to either deposit on or originate from the X chromosome; (ii) only 5% of human TPΨgs have potential orthologs in mouse. This latter finding indicates that the vast majority of TPΨgs are lineage specific. This is likely linked to well-documented extensive lineage-specific SINE/LINE activity. The list of TPΨgs is available at: http://www.biology.mcgill.ca/faculty/harrison/tppg/bppg.tov (or) http://pseudogene.org.
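The requirement that the same coding-sequence disablement appear in both the expressed and the genomic sequence can be illustrated with a small sketch. It is a simplification, not the authors' pipeline: it only looks for premature in-frame stop codons (frameshifts are ignored), and it assumes the two sequences have already been aligned to the same coordinates and reading frame. The function names and the toy sequences are hypothetical.

```python
# Illustrative sketch only: premature stop codons as the sole kind of disablement.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(seq, frame=0):
    """Return codon indices of in-frame stop codons, ignoring the final codon."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # The last codon is skipped because a terminal stop is not a disablement.
    return {i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS}


def shared_disablements(genomic, expressed, frame=0):
    """Codon positions disabled identically in both pre-aligned sequences."""
    return premature_stops(genomic, frame) & premature_stops(expressed, frame)


# Hypothetical toy sequences (not real data): both carry TAG at codon index 2.
genomic_seq = "ATGAAATAGCCCGGGTAA"
expressed_seq = "ATGAAATAGCCCGGATAA"
shared = shared_disablements(genomic_seq, expressed_seq)
print(sorted(shared))
```

A candidate would pass this simplified filter only if `shared` is non-empty, that is, if at least one disablement is confirmed in both the genomic and the expressed sequence.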
Year: 2005 PMID: 15860774 PMCID: PMC1087782 DOI: 10.1093/nar/gki531
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Position of TPΨgs, other PΨgs and processed genes relative to annotated genes
| Category of sequence, grouped by position relative to genes | TPΨgs: observed number | TPΨgs: expected number | Other PΨgs: observed number | Other PΨgs: expected number | Processed genes: observed number | Processed genes: expected number |
|---|---|---|---|---|---|---|
| Sequences that overlap gene annotations | 18 (8%) | — | — | — | — | — |
| Sequences mapped to introns of annotated genes | 67 (28%) | 79.7 | 693 (22%) | 1100.0¶¶ | 3 (5%) | 21.2¶¶ |
| Sequences <3000 nt 5′ of start codon of annotated genes | 20 (9%) | 6.8** | 78 (0.7%) | 93.6 | 5 (8%) | 1.9 |
| Sequences <10 000 nt 5′ of start codon of annotated genes | 36 (15%) | 22.3* | 278 (9%) | 307.8 | 7 (11%) | 5.9 |
| Sequences <3000 nt 3′ of translation stop of annotated genes | 22 (9%) | 6.7** | 55 (1.7%) | 92.3¶¶ | 0 (0%) | 1.8 |
| Sequences <10 000 nt 3′ of translation stop of annotated genes | 42 (18%) | 22.2** | 241 (7%) | 306.4¶¶ | 9 (14%) | 5.8 |
| Sequences that are in intergenic DNA | 109 (47%) | 129.8 | 2109 | 1371.7** | 43 | 31.4 |
(a) These categories are not additive, as they are not mutually exclusive, i.e. some TPΨgs may be within 10 000 nt of the 5′ end of one gene, and be in the intron of another gene or within 10 000 nt of the 3′ end of a third gene.
(b) Expected values are calculated assuming random insertion in the whole genome (without the genomic DNA for annotated genes). For significant over-representation, ** indicates P < 0.001, and * indicates P < 0.01 for a chi-squared test (1 degree of freedom) using Yates correction (similarly, ¶¶ is used for significant under-representation for P < 0.01).
(c) Intergenic DNA is defined as all of the genomic DNA that does not comprise exons, introns or the regions of genes within 10 000 nt of the translation stop and start of gene coding sequences.
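The significance markers in the table come from a chi-squared test with 1 degree of freedom and Yates correction. The following is a minimal sketch of such a test, not the authors' exact procedure: it assumes a two-cell goodness-of-fit layout (sequences inside the category versus outside it) and assumes the total of 233 TPΨgs from the summary table as the sample size. The function name is hypothetical.

```python
import math


def yates_chi2(observed, expected, total):
    """Yates-corrected chi-squared (1 d.f.) for 'in category' vs 'not in category'.

    Assumed layout: two cells whose observed and expected counts each sum to `total`.
    """
    cells = [(observed, expected), (total - observed, total - expected)]
    chi2 = sum(max(abs(o - e) - 0.5, 0.0) ** 2 / e for o, e in cells)
    # Survival function of the chi-squared distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


TOTAL_TPPSIGS = 233  # assumption: total TPΨg count taken from the summary table

# TPΨgs <3000 nt 5′ of a start codon: 20 observed, 6.8 expected.
chi2_5p, p_5p = yates_chi2(20, 6.8, TOTAL_TPPSIGS)
# TPΨgs mapped to introns: 67 observed, 79.7 expected.
chi2_intron, p_intron = yates_chi2(67, 79.7, TOTAL_TPPSIGS)

print(f"5' proximal: chi2 = {chi2_5p:.2f}, P = {p_5p:.2e}")
print(f"intronic:    chi2 = {chi2_intron:.2f}, P = {p_intron:.3f}")
```

Under these assumptions the 5′-proximal row falls below P = 0.001 and the intronic row does not reach P = 0.01, which agrees with the markers shown for the TPΨg column.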
Summary of numbers of TPΨgs
| Set or subset of TPΨgs | Total number | Total number (without those mapped to introns) |
|---|---|---|
| Mappings to existing pseudogene annotations | 218 | 154 |
| Pseudogene extraction from Refseq mRNAs | 15 | 12 |
| Total | 233 | 166 |
| Expressed sequence support | ||
| | 18 (8%) | 16 (10%) |
| | 74 (32%) | 50 (30%) |
| | 167 (72%) | 111 (67%) |
| | 38 (16%) | 25 (16%) |
| | 75 (32%) | 53 (32%) |
| Further evidence of decay | ||
| | 106 (45%) | 70 (42%) |
| | 127 (54%) | 88 (53%) |
| C set ( | 177 (76%) | 123 (74%) |
Human proteins with four or more homologous TPΨgs
| Number | Name of human protein |
|---|---|
| 5 [4] | Peptidyl-prolyl |
| 4 [3] | Prohibitin [P35232] |
| 4 [3] | 40S ribosomal protein S12 [P25398] |
| 4 [3] | Actin, cytoplasmic 2 (Gamma-actin) [P63261] |
| | Glyceraldehyde-3-phosphate dehydrogenase [P04406, P00354] |
(a) The totals in square brackets are for when those mapping to introns are removed.
(b) The Swissprot accession numbers are given in square brackets.
Figure 1. Origination and deposition of TPΨgs for different chromosomes. (A) Origination of TPΨgs: this plot shows the number of parent genes of TPΨgs in a chromosome versus the chromosome size (in Mb). (B) Deposition of TPΨgs: this shows the number of TPΨgs per chromosome versus chromosome size (in Mb). Only retrotranspositions from one chromosome to another are considered in each plot. The X chromosome is ringed. Note that for each plot we have corrected for the probability of X and Y chromosome inclusion in gametes [i.e. the size of X is multiplied by 0.75 and Y by 0.25; for comparison see figure 1 in (13)].
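The gamete correction described in the Figure 1 caption amounts to rescaling the sex-chromosome sizes before comparing counts against size. A minimal sketch follows. The weights 0.75 and 0.25 are from the caption; the function names, the chromosome sizes and the assumption that expected counts scale in proportion to effective size are illustrative placeholders, not values or methods taken from the paper.

```python
# Weights from the figure caption: X is scaled by 0.75 and Y by 0.25.
GAMETE_WEIGHT = {"X": 0.75, "Y": 0.25}


def effective_size_mb(chromosome, size_mb):
    """Chromosome size (Mb) corrected for its probability of inclusion in gametes."""
    return size_mb * GAMETE_WEIGHT.get(chromosome, 1.0)


def expected_counts(sizes_mb, total_events):
    """Expected events per chromosome, assuming proportionality to effective size."""
    effective = {c: effective_size_mb(c, s) for c, s in sizes_mb.items()}
    genome_size = sum(effective.values())
    return {c: total_events * e / genome_size for c, e in effective.items()}


# Hypothetical sizes in Mb, for illustration only (not real chromosome lengths).
toy_sizes_mb = {"1": 200.0, "X": 100.0, "Y": 40.0}
toy_expected = expected_counts(toy_sizes_mb, total_events=57)
print(toy_expected)
```

An autosome keeps its full size, so any excess or deficit on the X chromosome is judged against the smaller, corrected size.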
Figure 2. Examples of TPΨgs. (A) This is a TPΨg derived from the human prohibitin gene. The prohibitin gene contains both a protein-coding region and an RNA in its 3′-UTR (45), but only the segment of the TPΨg corresponding to the protein-coding sequence is shown. In the center is an alignment of the TPΨg (in red) with prohibitin protein (in green). The graphic above it shows the position of the TPΨg (red segment) in the 3′-UTR of an mRNA that codes for a Zn-finger-containing protein (blue segment). (B) An example of a TPΨg that maps to a known globular protein domain. The TPΨg derives from the mRNA for the precursor sequence of mitochondrial 2-amino-3-ketobutyrate coenzyme A. The domain is from the closest-matching protein structure (from E. coli, PDB code 1fc4a). In the Molscript (54) picture, the protein chain trace color changes at the position of each disablement. The alignment of the E. coli domain sequence and the human TPΨg sequence is shown. The part of the sequence that maps to an EST (gi|6138420) is boxed and italicized.