
Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability.

Paul M Harrison, Deyou Zheng, Zhaolei Zhang, Nicholas Carriero, Mark Gerstein.

Abstract

Pseudogenes, in the case of protein-coding genes, are gene copies that have lost the ability to code for a protein; they are typically identified through annotation of disabled, decayed or incomplete protein-coding sequences. Processed pseudogenes (PPsigs) are made through mRNA retrotransposition. There is overwhelming genomic evidence for thousands of human PPsigs and also dozens of human processed genes that comprise complete retrotransposed copies of other genes. Here, we survey for an intermediate entity, the transcribed processed pseudogene (TPPsig), which is disabled but nonetheless transcribed. TPPsigs may affect expression of paralogous genes, as observed in the case of the mouse makorin1-p1 TPPsig. To elucidate their role, we identified human TPPsigs by mapping expressed sequences onto PPsigs and, reciprocally, extracting TPPsigs from known mRNAs. We consider only those PPsigs that are homologous to either non-mammalian eukaryotic proteins or protein domains of known structure, and require detection of identical coding-sequence disablements in both the expressed and genomic sequences. Oligonucleotide microarray data provide further expression verification. Overall, we find 166-233 TPPsigs (approximately 4-6% of PPsigs). Proteins/transcripts with the highest numbers of homologous TPPsigs generally have many homologous PPsigs and are abundantly expressed. TPPsigs are significantly over-represented near both the 5' and 3' ends of genes; this suggests that TPPsigs can be formed through gene-promoter co-option, or intrusion into untranslated regions. However, roughly half of the TPPsigs are located away from genes in the intergenic DNA and thus may be co-opting cryptic promoters of undesignated origin. Furthermore, TPPsigs are unlike other PPsigs and processed genes in the following ways: (i) they do not show a significant tendency to either deposit on or originate from the X chromosome; (ii) only 5% of human TPPsigs have potential orthologs in mouse. This latter finding indicates that the vast majority of TPPsigs are lineage-specific. This is likely linked to well-documented extensive lineage-specific SINE/LINE activity. The list of TPPsigs is available at: http://www.biology.mcgill.ca/faculty/harrison/tppg/bppg.tov (or) http://pseudogene.org.

Year: 2005 PMID: 15860774 PMCID: PMC1087782 DOI: 10.1093/nar/gki531

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The search for novel functional elements in the human genome is imperative and ongoing (1–3). Pseudogenes (gene copies that have lost their protein-coding ability) are a form of sequence of potential functional utility (4). Substantial progress has been made in the annotation of pseudogenes (5–11). There may be twice as many pseudogenes (derived from protein-coding genes) in the human genome as protein-coding genes (6–10). Pseudogenes (derived from protein-coding genes) are typically ‘diagnosed’ through searching for the ‘symptoms’ of a lack of protein-coding ability. These symptoms include: frame disablement (from premature stop codons and frameshifts), coding sequence decay (typically detectable through examination of non-synonymous and synonymous substitution rates) or incompleteness (either from sequence truncation or from the loss of essential signals for transcription, splicing and translation) (6–10). Processed pseudogenes (PΨgs) are made through retrotransposition of mRNAs. There is ubiquitous genomic evidence for thousands of PΨgs in mammals (5–10). Similarly, dozens of processed genes (i.e. genes made by retrotransposition of the complete sequence of other genes) have arisen in both the mouse and human genomes (12,13). This mass gene retrotransposition may arise, at least in part, as a by-product of long interspersed element (LINE) retrotransposition (14). Retrotransposition is clearly an active process in mammalian gene evolution (15). Here, we search for an intermediate type of retrotransposed gene sequence: the transcribed processed pseudogene (shortened as TPΨg), which is a PΨg that is disabled but nonetheless transcribed. Historically, there have been several isolated reports of transcribed pseudogenes, of either the duplicated or the processed form (16–21). Two recent studies have demonstrated that such transcribed pseudogenes can regulate transcription of homologous protein-coding genes. 
Transcription of a pseudogene in Lymnaea stagnalis that is homologous to the nitric oxide synthase gene decreases the expression levels for the gene through formation of an RNA duplex; this is thought to arise via a reverse-complement sequence found at the 5′ end of the pseudogene transcript (20). In a second example, transcription of the makorin1-p1 TPΨg in mouse was required for the stability of the mRNA from a homologous gene, makorin1 (21). This regulation was deduced to arise from an element in the 5′ areas of both the gene and the pseudogene (21). In addition to helping to elucidate such regulatory roles, annotation of TPΨgs will further add to our understanding of the dynamics of gene evolution through retrotransposition (15). Also, it is crucial to annotate TPΨgs correctly as a part of the ongoing process of correct cDNA/expressed sequence tag (EST) mapping during genome annotation, and for more accurate interpretation of microarray expression data (22,23). Here, we have performed a data-mining expedition for human TPΨgs using a rigorous method that applies stringent filters to avoid data pollution. TPΨgs have a markedly distinct distribution in the genome when compared with other PΨgs and processed genes. A key result is that TPΨgs are significantly more likely than expected to be found near the 5′ and 3′ ends of genes, implying that TPΨgs can be generated by co-option of promoter elements or by intrusion into untranslated regions (UTRs) as ‘molecular passengers’. Also, we find that the vast majority of TPΨgs are human-lineage specific compared with mouse.

Definitions and terms

An mRNA can be reverse transcribed and re-integrated into the genomic DNA, possibly as a by-product of LINE-1 retrotransposition (14). The parent gene of the mRNA need not be on the same chromosome as the retrotransposed copy. Such a retrotransposed mRNA has three possible fates in the present-day genome: (i) formation of a non-transcribed PΨg, (ii) formation of a TPΨg or (iii) formation of a processed gene (or part of a gene). A PΨg can be defined as any disrupted, decayed or incomplete copy of a gene that has arisen through such retrotransposition. In the process of evolution, PΨgs accumulate disablements (frameshifts and premature stop codons) in their apparent coding sequences. Procedures to annotate PΨgs using disablement detection have been described previously (4,5,7), and serve as the basis for the present analysis. Operationally, a TPΨg is defined as a PΨg for which an expressed sequence is mappable across any of its coding-sequence disablements, i.e. the disablement occurs in both the expressed sequence and the genomic sequence (see Methods for details). A processed gene is any undisrupted retrotransposed copy of a gene that also has low Ka/Ks values indicative of selection pressure on coding ability (see Methods for details). Each of our TPΨgs has ≥1 disablement verified by alignment of the expressed sequences to genomic DNA, in a region of the TPΨg that maps to a known structural protein domain, or to a protein sequence that is conserved in non-mammalian eukaryotes. This three-level verification procedure (genome:transcript:protein) is termed triple alignment. Each verified disablement has an estimated probability of being the result of a sequencing error of ≤10⁻⁶, since the error rate for the genomic sequence build is ≤10⁻⁴ (24) and the error rate for cDNAs/ESTs is ≤10⁻² (25,26). We made a subset of TPΨgs, termed the C set, which has further evidence of lack of coding ability. 
These have: (i) no continuous segment of sequence that can code for a protein domain (as defined in Methods); (ii) high Ka/Ks values (≥0.5). As it is possible that a fraction of the TPΨgs that map to introns arise from intron retention in cDNAs or ESTs in the source expressed sequence data, we analyzed all of the data both including and excluding the 67 TPΨgs that map to introns (see Table 3 and below). Our results are unaffected by such potential contamination, as explained below.

Table 3

Position of TPΨgs, other PΨgs and processed genes relative to annotated genes

Categories of sequence grouped by position relative to genes	Type of sequence
	TPΨgs		Other PΨgs		Processed genes
	Observed number^a	Expected number^b	Observed number^a	Expected number^b	Observed number^a	Expected number^b
Sequences that overlap gene annotations	18 (8%)	—	—	—	—	—
Sequences mapped to introns of annotated genes	67 (28%)	79.7	693 (22%)	1100.0^¶¶	3 (5%)	21.2^¶¶
Sequences <3000 nt 5′ of start codon of annotated genes	20 (9%)	6.8^**	78 (0.7%)	93.6	5 (8%)	1.9
Sequences <10 000 nt 5′ of start codon of annotated genes	36 (15%)	22.3^*	278 (9%)	307.8	7 (11%)	5.9
Sequences <3000 nt 3′ of translation stop of annotated genes	22 (9%)	6.7^**	55 (1.7%)	92.3^¶¶	0 (0%)	1.8
Sequences <10 000 nt 3′ of translation stop of annotated genes	42 (18%)	22.2^**	241 (7%)	306.4^¶¶	9 (14%)	5.8
Sequences that are in intergenic DNA^c	109 (47%)	129.8	2109	1371.7^**	43	31.4

^a These categories are not additive, as they are not mutually exclusive, i.e. a given TPΨg may be within 10 000 nt of the 5′ end of one gene, and be in the intron of another gene or within 10 000 nt of the 3′ end of a third gene.

^b Expected values are calculated assuming random insertion in the whole genome (without the genomic DNA for annotated genes). For significant over-representation, ** indicates P < 0.001, and * indicates P < 0.01 for a chi-squared test (1 degree of freedom) using Yates correction (similarly, ¶¶ is used for significant under-representation for P < 0.01).

^c Intergenic DNA is defined as all of the genomic DNA that does not comprise exons, introns or the regions of genes within 10 000 nt of the translation stop and start of gene coding sequences.
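
As a rough check on the significance markers in Table 3, the Yates-corrected chi-squared statistic can be recomputed from one observed and one expected count. The sketch below assumes a two-category goodness-of-fit test (sequences inside the region versus all others) with the 233 TPΨgs as the total; that framing and the function name `yates_chi2_p` are illustrative assumptions, not the authors' stated procedure.

```python
import math

def yates_chi2_p(observed, expected, total):
    """Two-category chi-squared test (1 d.f.) with Yates continuity correction.

    Categories: sequences inside the region vs. all remaining sequences.
    Returns (statistic, p_value).
    """
    obs = (observed, total - observed)
    exp = (expected, total - expected)
    stat = sum(max(abs(o - e) - 0.5, 0.0) ** 2 / e for o, e in zip(obs, exp))
    # Survival function of chi-squared with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2.0))

# Table 3, TPΨgs <3000 nt 5' of a start codon: 20 observed, 6.8 expected
stat_3k, p_3k = yates_chi2_p(20, 6.8, 233)
# Table 3, TPΨgs <10 000 nt 5' of a start codon: 36 observed, 22.3 expected
stat_10k, p_10k = yates_chi2_p(36, 22.3, 233)
```

Under these assumptions the first case falls below P = 0.001 and the second between 0.001 and 0.01, in line with the ** and * markers in the table.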

METHODS

Detection of TPΨgs

(i) Mapping expressed sequence data onto existing PΨgs annotations

PΨgs were annotated previously using a method based on the detection of disabled protein homology in genomic DNA (4,5,7). We mapped >6200 of these onto human genome build 34 (from ), through detection of 100% nucleotide sequence matches, removing overlap with coding exons. For each PΨg, the genomic sequence was extracted, both with and without a 6000 nt extension added on to either end to allow for homology matching to ‘pseudo-UTR’ regions. (These sets of genomic DNA are named genPΨg and genPΨg.) Three sources of expressed sequences (Refseq mRNAs, Unigene consensuses, and ESTs from dbEST) were downloaded from . They were mapped onto genPΨg and genPΨg, using BLASTN with low-complexity masking (E-value ≤ 10⁻¹⁰, minimum match length 100 nt) (27,28). From the resulting significant matches, those that align with ≥95% identity were used to generate a second BLASTN search against genPΨg and genPΨg, but this time without low-complexity masking, to ensure correct sequence identity. Matches to both genPΨg and genPΨg with ≥99% identity over >0.998 of the length of the expressed sequence were then extracted. These expressed sequence matches were filtered to ensure that they match more significantly to the PΨg than to any homologous gene. The matching expressed sequences were then re-aligned to the PΨg sequence using FASTY (29), to check that ≥1 disablement (frameshift or premature stop codon) in the PΨg occurs in both the genomic sequence and expressed sequence. Each disablement verified in this way has an estimated probability of being the result of a sequencing error of ≤1 × 10⁻⁶; this is because the genomic sequence error rate is ≤1 × 10⁻⁴, and the cDNA/EST sequencing error rate is ≤1 × 10⁻².
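
The final identity/coverage cut-off and the joint error estimate described above can be written out as a minimal sketch. The function names and argument layout are hypothetical; the thresholds are those quoted in the text. Multiplying the two error rates assumes that genomic and cDNA/EST sequencing errors are independent.

```python
def passes_mapping_filter(identity, aligned_len, query_len,
                          min_identity=0.99, min_coverage=0.998):
    """Keep a match with >=99% identity over >0.998 of the expressed sequence."""
    coverage = aligned_len / query_len
    return identity >= min_identity and coverage > min_coverage

def joint_error_probability(genomic_error=1e-4, est_error=1e-2):
    """Upper bound on the same miscall arising independently in both sequences."""
    return genomic_error * est_error

# A 500 nt EST aligned over its full length at 99.5% identity is retained;
# the same EST aligned over only 498 nt (coverage 0.996) is not.
keep_full = passes_mapping_filter(0.995, 500, 500)
keep_partial = passes_mapping_filter(0.995, 498, 500)
```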

(ii) Extraction of PΨgs that are in Refseq mRNAs

All human Refseq entries corresponding to known mRNAs (total = 20 741) were compiled from data downloaded from the NCBI website (). These were compared with all known, non-fragmentary human proteins in the SWISSPROT database (30), using a modification of the disabled protein homology-based procedure developed previously for PΨg annotation (4,5,7,31–33). To ensure that all of the candidate TPΨgs in Refseq mRNAs map to a single continuous piece of genomic DNA, we extracted the appropriate mRNA subsequences and mapped them to the human genome using BLASTN. Those segments that matched exactly over their complete length were retained. The resulting TPΨg data were then filtered along with those generated in (i), as detailed in (iii) below.

(iii) Filtering the (transcribed) PΨg data

We applied a set of filters to ensure that we were compiling a bona fide list of TPΨgs. All TPΨg data sets were filtered as follows:

(a) Removal of homologies to purely hypothetical proteins or fragmentary proteins: TPΨgs based only on homology to predicted reading frames or reading-frame fragments were removed through BLASTP comparisons (E-value ≤ 10⁻⁴) against a library of hypothetical or fragmentary proteins from SWISSPROT (30). These are removed because their disablements may be erroneous (which is inappropriate for the method employed here). Also, they may be inaccurately dated (values for Ks, Ka, etc., may be incorrect).

(b) Verification that the disablements are in conserved parts of a known protein sequence or domain: We verified that the disablements examined are in known conserved parts of sequences, as detailed below. This list of filters has an ‘if-else-if-else-if’ structure:

(1) First, we assigned protein structural domains to the TPΨgs, by comparing them with the ASTRALSCOP 95% identity set of protein domains (34), using BLASTP (27) (E-value ≤ 10⁻⁴). The total assigned TPΨg subsequence was determined (from the most N-terminal residue that was assigned to a domain, to the most C-terminal). This assigned subsequence was considered disabled if a frameshift or stop codon occurred >10 residues in from either terminus. This accounts for ∼54% of TPΨgs.

(2) Otherwise, secondly, TPΨgs not meeting criterion (1) were checked manually for occurrence of disablements in conserved domains using the InterPro () and CDD () domain annotation tools.

(3) Otherwise, thirdly, TPΨgs not meeting criterion (2) (<10% of the sequences) were checked for disablement within a part of the sequence that is conserved in other mammals, and in ≥1 non-mammalian species, using BLASTP (E-value ≤ 10⁻⁴).

(c) Removal of candidates with small introns: Putative PΨgs, TPΨgs, and the expressed sequences that match them, were filtered for small intron sequences. A library of introns of <1000 nt was made from genes on human genome build 34. TBLASTN (27) was used to annotate any significant matches to these introns of >0.80 of their length (E-value ≤ 10⁻⁴). Any such matches could either be from introns in an aberrant cDNA or be a previously disregarded small intron in the genome. Some additional examples of TPΨgs that map to introns may arise from intron retention in cDNAs or ESTs; however, the main points of the analysis reported in this paper are unaffected by such potential contamination, as explained below.

(d) Removal of possible duplications of single-exon genes and large exons: We wished to ensure that there were no single-exon gene duplications in our PΨg and TPΨg sets. To do this, all PΨgs and TPΨgs were compared, using BLASTP (E-value ≤ 10⁻⁴), to the set of proteins for build 34 (removing those whose genes overlap putative PΨgs) (27). All PΨgs and TPΨgs that had closest-matching homologies to single-exon proteins were removed. Furthermore, we ensured that they aligned to their closest-matching homologous human proteins around at least one ‘exon seam’, i.e. a position in a protein sequence that corresponds to an intron–exon boundary. This exon seam filter ensures that the pseudogenes considered are processed, and is particularly useful for removing homologies to genes with large exons (e.g. some zinc-finger-containing proteins).

(e) Filtering for processed genes: All TPΨgs were filtered for overlap with annotated processed genes (resulting in the removal of only one putative TPΨg).

After applying these rigorous filters, we had 3418 PΨgs (both transcribed and non-transcribed), and 233 TPΨgs, 218 from mapping expressed sequences to PΨgs and 15 from PΨg extraction from Refseq mRNAs. Almost half (97/233, 42%) of the TPΨg set represent 100% exact matches of expressed sequences to PΨgs. Restricting analysis to just these matches does not affect any of the main trends and results reported here.
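
The exon seam check described above reduces to an interval test on the parent protein. The sketch below is one possible reading, not the authors' code: `seam_positions` (residue indices after which an intron interrupts the parent coding sequence) and the `min_flank` parameter are hypothetical names introduced for illustration.

```python
def spans_exon_seam(align_start, align_end, seam_positions, min_flank=1):
    """True if the alignment on the parent protein crosses at least one exon seam.

    align_start, align_end: 1-based inclusive residue range on the parent protein.
    seam_positions: residue indices s such that an intron lies between s and s + 1.
    min_flank: residues required on each side of the seam (hypothetical parameter).
    """
    return any(align_start + min_flank - 1 <= s <= align_end - min_flank
               for s in seam_positions)

# A candidate aligned to residues 10-120 crosses the seam after residue 45,
# so it is consistent with a processed (intron-less) copy of a multi-exon gene.
crosses = spans_exon_seam(10, 120, [45, 200])
# A single-exon parent has no seams, so the candidate cannot be verified as processed.
single_exon = spans_exon_seam(1, 300, [])
```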

Making an obviously decayed C set of TPΨg sequences

We derived a ‘core set’ of TPΨgs that have further evidence of coding-sequence decay. These are dubbed the C set (totaling 177/233, 76%). This set is the union of the following two subsets: (i) TPΨgs without any continuous segment of sequence that can code for a protein domain (106/233 TPΨgs, 45%) and (ii) TPΨgs with high Ka/Ks values (≥0.5) indicative of lack of coding ability (127/233 TPΨgs, 54%).

(i) Lack of protein domain coding ability

We parsed each TPΨg into subsequences according to the positions of its disablements. If all subsequences could be labeled as ‘unlikely to code for a protein domain’, then the TPΨg was included in the C set. This resulted in inclusion of 106 TPΨgs in the C set. We labeled a subsequence as ‘unlikely to code for a protein domain’ if:

(a) Its length was ≤32 residues. The vast majority (95%) of non-cysteine-rich protein domains in the ASTRALSCOP 40% identity set have sequence lengths >32 residues (34). Cysteine-rich domains (which are likely disulfide-bridged or metal-chelating) are defined as having cysteine concentration >0.077/residue, a value suggested by a bimodality in cysteine concentration, in surveys of cysteine and cystine occurrence in proteins (35,36). Condition (a) was not applied to any fragments that were adjudged cysteine-rich.

(b) It contained a disrupted SCOP domain, as defined in part (iii)(b)(1) above. Such fragments are unlikely to be large enough to constitute a complete domain; the reasoning behind this criterion is that evolution has defined and refined the integrity of a body of recurrent folding units (protein domains) (34), and we can therefore use their disruption to evaluate whether a piece of sequence is no longer protein-coding.
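
The parsing rule above can be sketched as follows. This is a simplified illustration: disablements are given as 0-based cut positions in a conceptual translation, cysteine-rich is taken to mean a cysteine fraction above 0.077 per residue, and the SCOP-domain check of condition (b) is reduced to a boolean flag. All function and constant names are hypothetical.

```python
CYS_RICH_THRESHOLD = 0.077   # cysteines per residue, as quoted in the text
MAX_SHORT_LENGTH = 32        # segments of <=32 residues are treated as too short

def split_at_disablements(translation, disablement_positions):
    """Cut a conceptual translation at 0-based disablement positions."""
    cuts = [0] + sorted(disablement_positions) + [len(translation)]
    return [translation[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]

def is_cysteine_rich(segment):
    """High cysteine content suggests a disulfide-bridged or metal-chelating domain."""
    return segment.count("C") / len(segment) > CYS_RICH_THRESHOLD

def unlikely_domain(segment, has_disrupted_scop_domain=False):
    """Condition (a): short and not cysteine-rich; condition (b): disrupted SCOP domain."""
    too_short = len(segment) <= MAX_SHORT_LENGTH and not is_cysteine_rich(segment)
    return too_short or has_disrupted_scop_domain

def in_c_set_by_domain_criterion(translation, disablement_positions):
    """A TPΨg qualifies only if every subsequence is unlikely to code for a domain."""
    segments = split_at_disablements(translation, disablement_positions)
    return all(unlikely_domain(s) for s in segments)
```

For example, a 50-residue translation with one disablement at position 20 yields segments of 20 and 30 residues, both short enough to satisfy condition (a) unless they are cysteine-rich.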

(ii) Ka/Ks analysis

We calculated the Ka/Ks values for whole TPΨgs, using the Yang and Nielsen method in PAML (37), using the present-day gene sequence to compare against the pseudogene, as described previously (7). Also, similarly, we calculated Ka/Ks values for subsequences of TPΨgs (≥50 residues) derived by parsing at disablement positions. This parsing allows for the possibility that some of the pseudogene subsequences have coding ability, while others do not, i.e. we can test for a coding ability ‘imbalance’. From these Ka/Ks calculations, we found that only ∼4% of both PΨgs and TPΨgs have two adjacent regions where one is <0.25 (potentially coding) and the other >0.5 (potentially non-coding), indicating that such imbalance is rare. From consulting independent analysis of populations of human genes and PΨgs (6), we ascertained that for a threshold value of Ka/Ks ≥ 0.5, >95% of sequences are predicted to be PΨgs and not genes. We use this as the expectation for the distribution of PΨgs in general. Calculation of Ka/Ks values for gene/pseudogene pairs errs on the side of under-estimation of coding-sequence decay (7).
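
The coding ability ‘imbalance’ test can be expressed as a scan over adjacent subsequences. The sketch assumes that Ka/Ks has already been computed for each subsequence (for example with PAML) and is supplied in sequence order; the function name is hypothetical, and the thresholds are the 0.25 and 0.5 values quoted above.

```python
def has_coding_imbalance(kaks_by_segment, coding_max=0.25, noncoding_min=0.5):
    """True if two adjacent segments straddle the coding / non-coding thresholds."""
    for left, right in zip(kaks_by_segment, kaks_by_segment[1:]):
        low, high = min(left, right), max(left, right)
        if low < coding_max and high > noncoding_min:
            return True
    return False

# One potentially coding segment (0.1) directly next to a decayed one (0.8)
imbalanced = has_coding_imbalance([0.1, 0.8])
# The same extremes separated by an intermediate segment are not adjacent
not_adjacent = has_coding_imbalance([0.1, 0.4, 0.8])
```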

Conservation of TPΨg in mouse

For each human TPΨg, we searched for potentially orthologous mouse TPΨgs. These ‘moTPΨgs’ were derived by mapping expressed sequences (Refseq mRNAs, Unigene consensus sequences and ESTs) for mouse onto a previously derived set of mouse PΨgs (8), in a similar manner to the human mappings (see above). These were pooled with any existing moTPΨg annotations, and a small number of mouse genes that might be misannotated moTPΨgs. A potentially orthologous moTPΨg was required to match ≥0.5 of the length of the human TPΨg (for BLASTP matches, E-value ≤ 10⁻⁴), and to share the same closest-matching human protein with any potential human TPΨg homologs. We did not require that the retrotranspositions be in syntenic positions, since orthologous gene retrotranspositions are not necessarily syntenic (38).
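
The orthology criteria above amount to three conditions on each human/mouse BLASTP match. The sketch below encodes them with a hypothetical `Match` record; the field names are illustrative, and the accession strings in the example are used only as labels.

```python
from dataclasses import dataclass

@dataclass
class Match:
    aligned_len: int    # residues of the human TPΨg covered by the BLASTP match
    human_len: int      # length of the human TPΨg
    e_value: float      # BLASTP E-value of the match
    human_parent: str   # closest-matching human protein of the human TPΨg
    mouse_parent: str   # closest-matching human protein of the mouse TPΨg

def is_potential_ortholog(m, min_fraction=0.5, max_e=1e-4, require_same_parent=True):
    """Apply the length, significance and shared-parent criteria from the text."""
    covered = m.aligned_len / m.human_len >= min_fraction
    significant = m.e_value <= max_e
    same_parent = m.human_parent == m.mouse_parent
    return covered and significant and (same_parent or not require_same_parent)

same = Match(180, 300, 1e-20, "P04406", "P04406")
different = Match(180, 300, 1e-20, "P04406", "P62937")
```

Setting `require_same_parent=False` corresponds to the relaxed comparison in which the human and mouse sequences need not share the same closest-matching human protein.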

Processed genes

We mapped an independently derived list of processed genes (13) to human genome build 34. In addition to the criteria in (13), we required Ka/Ks values <0.25, and coverage of ≥0.95 of the parent gene's length. Any examples that overlap the TPΨg data set of annotations were removed; vice versa, any TPΨgs that have Ka/Ks < 0.25 and cover ≥0.95 of their parent gene were deleted from the TPΨgs list. Our definitions give two distinct sets of processed genes and TPΨgs; naturally, we miss some sequences that cannot be classified as either a TPΨg or a processed gene.

RESULTS AND DISCUSSION

Number of TPΨgs

In total, we found 233 human TPΨgs (Table 1). These TPΨgs form a subset of 3418 previous PΨg annotations that were mapped to build 34 of the human genome (7). These PΨgs were filtered in the same way as the TPΨgs (from a starting total of ∼6200), to remove predicted reading frames, retained introns and potential duplications of single-exon genes or large exons. Using these data, we can estimate that ∼6% (218/3418) of PΨgs are TPΨgs. An additional 15 TPΨgs were derived from a reciprocal process of searching for PΨgs in known Refseq mRNAs, followed by subsequent mapping to the genome.

Table 1

Summary of numbers of TPΨgs

Set or subset of TPΨgs	Total number	Total number (without those mapped to introns)
Mappings to existing pseudogene annotations	218	154
Pseudogene extraction from Refseq mRNAs	15	12
Total TPΨgs	233	166
Expressed sequence support
TPΨgs that are supported by Refseq mRNAs	18 (8%)	16 (10%)
TPΨgs that are supported by Unigene consensus sequences	74 (32%)	50 (30%)
TPΨgs that are supported by dbEST expressed sequence tags	167 (72%)	111 (67%)
TPΨgs that are supported by dbEST expressed sequence tags and by either a Refseq mRNA or a Unigene consensus	38 (16%)	25 (16%)
TPΨgs that are additionally supported by oligonucleotide microarray data	75 (32%)	53 (32%)
Further evidence of decay
TPΨgs that have no continuous segment likely to code for a protein domain	106 (45%)	70 (42%)
TPΨgs that have K_a/K_s ≥ 0.5	127 (54%)	88 (53%)
C set (TPΨgs that have no continuous segment likely to code for a protein domain or K_a/K_s ≥ 0.5)	177 (76%)	123 (74%)

A small fraction of the TPΨgs (8%) corresponds to known Refseq mRNAs (Table 1). About a third are supported by Unigene consensus sequences, with a large fraction (72%) matching individual ESTs [of this last group, about a quarter (∼23%) are supported by a Refseq mRNA or a Unigene consensus; Table 1]. We sought additional expression verification from a series of high-density oligonucleotide microarrays, composed of ∼52 million 36mers (23). These microarrays were applied to probe the transcriptionally active regions of the human genome, in a strand-sensitive way. Using the same data and statistical method (i.e. a sign test) for scoring the genes' transcriptional activity (22,23), we found that 75/233 (32%) TPΨgs were transcriptionally active in liver (P < 0.05) (Table 1). In comparison, 64% of genes from RefSeq mRNAs, 57% of Ensembl annotated genes (39), and 35% of genes predicted with the program GENSCAN (40), were found to be transcribed in liver. The C set of more obviously decayed TPΨgs comprises 76% of the total population; 45% of TPΨgs have no continuous segment likely to code for a protein domain, and 54% of TPΨgs have Ka/Ks ≥ 0.5 (Table 1). Obvious degradation of the coding sequences is demonstrated for this set from analysis of protein-domain mapping and Ka/Ks (see Methods). Additionally, other factors (not examined in the present analysis) are expected to cause lack of coding ability in TPΨgs or arise as further consequences. It is likely that TPΨgs will not have appropriate start codon context (41), therefore leading to little or no efficient translation initiation. Also, those TPΨgs that are inserted into 3′-UTRs of mRNAs will be unlikely to become protein-coding through being downstream of a clearly defined coding sequence (although it is conceivable that they may be translatable in the 5′-UTR). 
Furthermore, a consequence of any frameshift in a sequence is the likelihood of an additional 20 residues or so of non-coding DNA, added onto the end of the sequence truncation (on average, in randomly picked, conceptually translated intergenic DNA, a stop codon will appear ∼20 residues downstream of any starting point); such additional sequence may lead to aggregation or misfolding in the cell. The proportions of TPΨgs break down in a similar fashion to that just described above for the total data set, when the 67 examples that map to introns are removed (Table 1).
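
The figure of roughly 20 residues before a stop codon follows from a simple geometric argument. The sketch below assumes equal base frequencies and independent positions, which real intergenic DNA does not strictly obey, so it is an order-of-magnitude check rather than a genome-specific estimate.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}

# Enumerate all 64 codons and count the stops: p_stop = 3/64
all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
p_stop = sum(codon in STOP_CODONS for codon in all_codons) / len(all_codons)

# Geometric distribution: mean number of sense codons read before the first stop
expected_residues = (1 - p_stop) / p_stop
```

With three stop codons out of 64, the expected run of translated residues is 61/3, or about 20.3, consistent with the value quoted in the text.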

Closest matching human proteins for TPΨgs

TPΨgs were grouped according to their closest-matching human protein (Table 2). Each table entry represents a single ‘parent gene’. The total counts are also shown for the TPΨgs that do not map to introns (in square brackets, Table 2). Table 2 lists five human proteins that have ≥4 homologous TPΨgs. The highest number of TPΨgs (5) occurs for cyclophilin A, which is required for cis-peptide isomerization (42). All of these proteins arise from highly expressed mRNAs. They also occur in the top 20 proteins when apportioning all PΨgs in the same way (7).

Table 2

Human proteins with four or more homologous TPΨgs

Number^a	Name of human protein^b
5 [4]	Peptidyl-prolyl cis-trans isomerase A (Cyclophilin A) [P62937]
4 [3]	Prohibitin [P35232]
4 [3]	40S ribosomal protein S12 [P25398]
4 [3]	Actin, cytoplasmic 2 (Gamma-actin) [P63261]
	Glyceraldehyde-3-phosphate dehydrogenase [P04406, P00354]

^a The totals in square brackets are for when those mapping to introns are removed.

^b The Swissprot accession numbers are given in square brackets.

TPΨg position relative to genes and the implications for their expression mechanisms

A number of mechanisms for TPΨg expression are plausible. First, TPΨgs may co-opt nearby promoter elements of protein-coding genes. Secondly, they may intrude into the UTRs of another mRNA, as a sort of ‘molecular passenger’. Thirdly, they may make use of cryptic promoter elements in the intergenic DNA; such promoter elements may have originated from transposable elements, or from genomic duplication of genic promoter regions, or sporadically (de novo). Such mechanisms for TPΨg expression may have a bearing on their overall positional distribution in the genome relative to genes. To investigate this, we classified the TPΨgs into those that: (i) overlap existing coding-sequence exons; (ii) appear inserted in introns; (iii) are inserted in a 3000 or 10 000 nt region 5′ to annotated genes; (iv) are inserted in a 3000 or 10 000 nt region 3′ to annotated genes. Table 3 summarizes these data. A minor proportion (8%) of TPΨgs entail gene coding-sequence annotations, i.e. they are erroneously annotated reading frames. There are 67 TPΨgs that map to introns (Table 3); it is unclear how many of these may arise from intron retention in cDNAs or ESTs. Expectations based on random insertion in the genome were calculated for classes (ii) to (iv). We focus on (iii) and (iv) in particular. TPΨgs are significantly more likely than random (P < 0.01, chi-squared tests) to be inserted in the regions 5′ and 3′ of annotated genes; this effect is most obvious in the 3000 nt regions 5′ and 3′ to genes, but is still significant up to 10 000 nt in either direction (Table 3). Similar results are observed for the C set of more obviously decayed TPΨgs. The enrichment of TPΨgs observed in the 5′ and 3′ areas of genes can be seen as a simple logical consequence of randomly inserted PΨgs having an increased probability of being transcribed, and is clear support for either co-option of genic promoter elements, or insertion into UTRs as molecular passengers, leading to TPΨg expression. 
This result is also unaffected by possible contamination from intron retention in cDNAs/ESTs, as, in general, PΨgs are significantly under-represented in introns (Table 3); if one assumed that, in the extreme, all of the TPΨg mappings near the 5′ and 3′ ends of genes were actually mappings to introns, then this would make their over-representation even more significant. The general dearth of PΨgs in introns may be a reflection of an overall genomic tendency for a lack of retroelement insertion in introns (43). Roughly half of the TPΨgs are located away from genes (>10 000 nt 5′ and 3′ to genes, and overlapping neither an exon nor an intron; Table 3). These thus may be co-opting cryptic promoters of unknown origin in the intergenic DNA, such as those derivable from transposable elements. In summary, the distribution of TPΨgs in the vicinity of genes is significantly different from that observable for other non-transcribed PΨgs (that have no transcription evidence), and for processed genes in the following ways (Table 3): (i) TPΨgs are significantly over-represented in the 10 000 nt 5′ and 3′ to genes, whereas other PΨgs and processed genes are not; (ii) other PΨgs are significantly over-represented in intergenic DNA and significantly under-represented in introns, and processed genes are significantly under-represented in introns; (iii) TPΨgs show no such trends for introns or intergenic DNA. In addition, there is a dearth of PΨgs 3′ to genes (Table 3). The reasons for this are unclear; there may be a compositional effect, similar to the relationship between genomic G+C content and ribosomal-protein PΨg insertion, observed previously (44). We examined the distribution of 20 TPΨgs that are directly mappable onto known Refseq mRNAs. Thirteen of these overlap an erroneously predicted open reading frame, and two are already annotated as transcribed pseudogenes. None of the five remaining TPΨgs are inserted in the 5′-UTR of a messenger RNA. 
One explanation for this absence in 5′-UTRs is that a TPΨg would introduce upstream ORFs that interfere with translation initiation (41). The five TPΨgs inserted in the 3′-UTRs of mRNAs are all in the forward direction (i.e. they are all on the same DNA strand as the annotated coding sequence). An example of this is discussed below. In addition, we checked the list of TPΨgs 3′ to annotated genes and within 3000 nt of the end of the coding sequence (Table 3), for additional examples of this ‘passenger’ phenomenon, through manual examination of cDNAs or ESTs for the 5′ genes, but could find no further examples of cDNAs with polyadenylation signals to define the end of the mRNA. Such analysis is complicated by the fact that, in some cases, it may not be possible to distinguish between the original polyadenylation signal of the gene, and an inserted polyadenylation signal arising from the TPΨg.

Distribution on chromosomes

Analysis of the distribution of processed genes in the human and mouse genome has indicated that the X chromosome is a marked outlier, both for processed gene deposition onto the X chromosome and origination from X (13). A similar outlier preference was observed for PΨg deposition onto the X chromosome (but not origination from X) (13). These phenomena may be due to selection pressures to compensate for X-chromosome inactivation during spermatogenesis, in combination with some unaccounted-for mutational biases (13,38). To compare with this previous analysis, we examined the distribution of TPΨg ‘parent genes’ on each chromosome, and also the distribution of the number of TPΨgs per chromosome (Figure 1A and B). Figure 1A indicates the data for origination of TPΨgs, and Figure 1B shows the trend for deposition of TPΨgs onto each chromosome. In each case (origination and deposition), the X chromosome is not an outlier. This may indicate that, in general, TPΨg formation is deleterious, unlike processed gene and non-transcribed PΨg formation, which are arguably, by comparison, beneficial and selectively neutral, respectively. Interestingly, there is some outlier behavior for TPΨg origination from chromosome 12. The same result is obtained, if the 67 TPΨgs that map to introns are removed.

Figure 1

Origination and deposition of TPΨgs for different chromosomes. (A) Origination of TPΨgs: this plot shows the number of parent genes of TPΨgs in a chromosome versus the chromosome size (in Mb). (B) Deposition of TPΨgs: this shows the number of TPΨgs per chromosome versus chromosome size (in Mb). Only retrotranspositions from one chromosome to another are considered in each plot. The X chromosome is ringed. Note that for each plot we have corrected for the probability of X and Y chromosome inclusion in gametes [i.e. the size of X is multiplied by 0.75 and Y by 0.25; for comparison see figure 1 in (13)].
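The gamete-inclusion correction mentioned in the caption can be illustrated with a short sketch. The weights 0.75 (X) and 0.25 (Y) are taken from the caption; the function names and the idea of expressing counts as a density per corrected Mb are assumptions made here for illustration, and no real chromosome sizes or counts are used.

```python
# Gamete-inclusion weights from the Figure 1 caption; autosomes are unweighted.
GAMETE_WEIGHT = {"X": 0.75, "Y": 0.25}


def effective_size_mb(chromosome: str, size_mb: float) -> float:
    """Chromosome size (Mb) scaled by its gamete-inclusion weight."""
    return size_mb * GAMETE_WEIGHT.get(chromosome, 1.0)


def density_per_mb(count: int, chromosome: str, size_mb: float) -> float:
    """Number of elements per corrected Mb (a hypothetical summary statistic)."""
    return count / effective_size_mb(chromosome, size_mb)


if __name__ == "__main__":
    # Illustrative round numbers only, not data from the paper.
    for name, size, count in [("12", 100.0, 10), ("X", 100.0, 10)]:
        print(name, effective_size_mb(name, size), density_per_mb(count, name, size))
```

With this weighting, an X chromosome with the same raw size and count as an autosome shows a higher density, which is why the correction matters when judging whether X is an outlier.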

Search for potential orthologs in mouse

We investigated mouse/human cross-species conservation of TPΨgs, as an indicator of human-lineage specificity. The 233 human TPΨgs were compared against a set of 215 putative mouse TPΨgs (moTPΨgs) (see Methods for details). We found that 5% (11/233) have potential orthologous TPΨgs. Four of these are for the metabolic enzyme glyceraldehyde-3-phosphate dehydrogenase, which is ubiquitously and highly expressed, giving this sequence the status of a notable ‘parent gene’ for TPΨgs (see also Table 2). If the human and mouse TPΨgs are not restricted to having the same closest-matching human gene homolog, 28/237 (12%) have potential orthologs. These results suggest that a minor fraction of TPΨgs could be used in conserved functional roles in mammals. However, given that ∼40% of human PΨgs are conserved in the mouse genome (8), these results imply that TPΨgs are significantly under-conserved between human and mouse (P < 0.001 using binomial statistics) compared with PΨgs in general, and also compared with processed genes (13), which are at most ∼20% lineage-specific. The vast majority of TPΨgs are thus human lineage-specific compared with mouse; indeed, both Alus (which are primate-specific) and PΨgs can be made as by-products of LINE retrotransposition (14), and have similar overall age profiles in the genome (8). These results are also evidence for a general evolutionary selection pressure to delete TPΨgs. This may be because they form a source of transcriptional interference for adjacent genes or homologous genes. However, one must stress that, in the future, increased cDNA coverage for both the mouse and human genomes may modify these statistics somewhat. This lack of saturation in current databases of expressed sequences can be demonstrated using a simple sampling analysis. Sampling of TPΨg-matching expressed sequences from random fractional subsets of the total expressed sequence database used in the present analysis (i.e. ESTs + Unigene consensuses + Refseq mRNAs) indicates that we are not near finding all of the TPΨgs in the human genome (or, at least, those discoverable through mapping of expressed sequences). (This sampling analysis is presented in Supplementary Figure 1.)
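The under-conservation comparison can be checked with a simple binomial tail calculation. The sketch below assumes a one-sided test in which each of the 233 TPΨgs is independently conserved with the background probability of 0.40 quoted for PΨgs in general. The text states only that binomial statistics were used, so the exact form of the test is an assumption here.

```python
from math import comb


def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Counts from the text: 11 of 233 TPΨgs with potential mouse orthologs,
# against a background conservation rate of about 40% for PΨgs.
p_value = binomial_lower_tail(11, 233, 0.40)
print(f"one-sided P = {p_value:.3g}")
```

Under these assumptions about 93 conserved TPΨgs would be expected, so observing 11 gives a tail probability far below the 0.001 threshold quoted above.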

Examples of TPΨgs

A TPΨg derived from the prohibitin gene is shown in Figure 2A. It is inserted into the 3′-UTR of an mRNA that codes for a Zn-finger-containing protein. Prohibitin is highly and ubiquitously expressed, and is involved in inhibition of DNA synthesis; its mRNA contains a putative functional RNA element in its own 3′-UTR (45). It is beyond the scope of the present study to ascertain whether this RNA element is intact in this TPΨg, as it has not yet been characterized extensively by mutational and biophysical analysis. This TPΨg is one of four that derive from the prohibitin gene (Table 2).

Figure 2

Examples of TPΨgs. (A) A TPΨg derived from the human prohibitin gene. The prohibitin gene contains both a protein-coding region and an RNA element in its 3′-UTR (45), but only the segment of the TPΨg corresponding to the protein-coding sequence is shown. In the center is an alignment of the TPΨg (in red) with prohibitin protein (in green). The graphic above it shows the position of the TPΨg (red segment) in the 3′-UTR of an mRNA that codes for a Zn-finger-containing protein (blue segment). (B) An example of a TPΨg that maps to a known globular protein domain. The TPΨg derives from the mRNA for the precursor sequence of mitochondrial 2-amino-3-ketobutyrate coenzyme A ligase. The domain is from the closest-matching protein structure (from E. coli, PDB code 1fc4a). In the Molscript (54) picture, the protein chain trace color changes at the position of each disablement. The alignment of the E. coli domain sequence and the human TPΨg sequence is shown. The part of the sequence that maps to an EST (gi|6138420) is boxed and italicized.

The second example is derived from the precursor sequence of mitochondrial 2-amino-3-ketobutyrate coenzyme A ligase. The crystal structure of the Escherichia coli homolog of this enzyme is known (PDB code 1fc4a). We have indicated how the genomic sequence, EST and known protein-domain sequence overlap in a ‘triple alignment’ (Figure 2B). The protein chain is divided into colored segments, with each disablement defining a segment boundary. One can clearly see that the triple alignment covers two disablements in the TPΨg.
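The requirement that the same disablements appear in both the genomic and the expressed sequence, within the region covered by the expressed sequence, can be sketched as a simple set comparison. The record type, field names and coordinates below are hypothetical and chosen only for illustration; the actual annotation pipeline is described in Methods and is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Disablement:
    position: int  # coordinate along the pseudogene (assumed convention)
    kind: str      # e.g. "frameshift" or "stop"


def shared_disablements(genomic: List[Disablement],
                        expressed: List[Disablement],
                        est_start: int, est_end: int) -> List[Disablement]:
    """Genomic disablements inside the EST-covered interval that are also
    present, with the same position and kind, in the expressed sequence."""
    expressed_set = set(expressed)
    return [d for d in genomic
            if est_start <= d.position <= est_end and d in expressed_set]


# Hypothetical example: two of three genomic disablements fall in the EST
# interval and are confirmed by the expressed sequence.
genomic = [Disablement(120, "frameshift"), Disablement(310, "stop"),
           Disablement(560, "stop")]
expressed = [Disablement(120, "frameshift"), Disablement(310, "stop")]
confirmed = shared_disablements(genomic, expressed, est_start=100, est_end=400)
print(len(confirmed))
```

A candidate in which the list of confirmed disablements is empty would have expression evidence but no direct evidence that the transcript itself is disabled.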

CONCLUSIONS

Diverse efforts to map novel elements of potential functional utility in our genome are ongoing (1–3). In the spirit of such endeavors, we have derived a rigorous procedure for annotating a specific novel type of element of potential functional utility, the TPΨg. Applying this method to the human genome, we discovered 166–233 TPΨgs, which represent ∼4–6% of all PΨgs (the lower total arises from setting aside any examples that map to introns). We might have missed some TPΨgs, e.g. those without extensive homology to a coding sequence (i.e. those consisting largely of UTR homologies), TPΨgs formed from single-exon and large-exon genes, or TPΨgs that are transcribed at a level too low to be detected through EST/cDNA sequencing. TPΨgs are significantly more likely to occur in regions close to the 5′ and 3′ ends of genes, compared both with a random insertion model for them throughout the genome and with the distribution observed in general for PΨgs. Furthermore, if one assumes that these 5′ and 3′ regions are actually introns, the significance of the increased 5′ and 3′ density of TPΨgs improves. (This indicates that the increased 5′ and 3′ density is not an artifact of intron retention in cDNA/EST libraries.) This increased density provides evidence that TPΨgs may be expressed through co-option of genic promoter elements or through insertion into UTRs as ‘molecular passengers’. Specific detailed evidence was found for molecular passengers in the 3′-UTRs of known mRNAs; an example of this derived from the prohibitin gene was illustrated (Figure 2A). TPΨgs could thus also have a role as intermediates in protein-coding sequence evolution. A reasonable hypothesis that can be further investigated is that TPΨgs may represent a source of evolutionary protein novelty, either as ‘molecular passengers’ or as part of alternative splicings (46), through being temporarily released from coding-sequence selection pressures (31,47–50).
Use of additional sequence segments may underlie the influence of the [PSI+] prion on phenotypic variability in budding yeast (31,51); analogs of this phenomenon are possible in mammals. Two examples of regulation by transcribed pseudogenes of homologous genic transcripts have been observed (20,21). Transcriptional analysis showed that the stability of the makorin1 mRNA in mouse relies upon the expression of its homologous makorin1-p1 TPΨg, through the action of an element at the 5′ end of the makorin1-p1 sequence. However, makorin1-p1 only seems to be conserved in one line of Mus, and has not been found in the rat genome (52). In a second example, transcription of a pseudogene in Lymnaea stagnalis that is homologous to the nitric oxide synthase gene decreases expression levels for the gene; this is thought to arise via a reverse-complement sequence found at the 5′ end of the pseudogene transcript (20). Alternatively, TPΨgs near genes or in UTRs may also exert a controlling or interfering influence on the genes' transcription and translation, through upstream ORF formation or the action of other undiscovered elements. These TPΨgs could exert such effects through co-option as alternative splicings, as has been observed for Alus (53). Also, it is possible that some TPΨgs produce a short peptide that does not misfold or aggregate in the cell, but is still targeted and serves an alternative function as a truncated peptide. Certainly, TPΨgs represent a source of transcriptional ‘noise’, which may have implications for selection pressures on transcription levels, and the degree of variation on which such pressures can act. Our survey provides evidence for the existence in the human genome of a small population of TPΨgs, which are an intermediate class of retrosequence derived from genes, since they have expression evidence (like genes), but also have evidence of lack of coding ability (like other pseudogenes).
The distribution of TPΨgs near the 5′ and 3′ ends of genes indicates that TPΨgs can co-opt genic promoters or intrude into UTRs; furthermore, this is a robust observation that verifies our expression-data mappings. One must also point out, however, that about half of the TPΨgs are located away from genes in intergenic DNA (Table 3), and thus may be co-opting cryptic promoters of undesignated origin. Also, TPΨgs differ from other PΨgs (without transcription evidence) and from processed genes in terms of their distribution per chromosome and their projected conservation in mouse. Our analysis indicates that, unlike processed genes and other PΨgs, the vast majority (∼95%) of TPΨgs are human lineage-specific. In combination, the chromosomal distribution and mouse conservation of TPΨgs suggest that there is some general evolutionary pressure to delete TPΨgs from the genome. The cDNA coverage of both genomes is far from complete (as illustrated here with a simple sampling analysis), so the analysis of conservation in mouse should be regarded as tentative. This TPΨg analysis has important implications for genome annotation. It is still common practice to assume that an mRNA contains one undisrupted open reading frame; however, it is clear that one should routinely check for TPΨgs in the manner described here. Also, this TPΨg annotation is useful for improved interpretation of microarray expression data (22,23). The list of TPΨgs is available at: (or) .

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

52 in total

1. Flexible sequence similarity searching with the FASTA3 program package.

Authors: W R Pearson
Journal: Methods Mol Biol Date: 2000

2. Initial sequencing and analysis of the human genome.

Authors: E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; H 
Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki
Journal: Nature Date: 2001-02-15 Impact factor: 49.962

3. Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22.

Authors: Paul M Harrison; Hedi Hegyi; Suganthi Balasubramanian; Nicholas M Luscombe; Paul Bertone; Nathaniel Echols; Ted Johnson; Mark Gerstein
Journal: Genome Res Date: 2002-02 Impact factor: 9.043

4. Common exon duplication in animals and its role in alternative splicing.

Authors: Ivica Letunic; Richard R Copley; Peer Bork
Journal: Hum Mol Genet Date: 2002-06-15 Impact factor: 6.150

5. Initial sequencing and comparative analysis of the mouse genome.

Authors: Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; 
Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

Review 6. Studying genomes through the aeons: protein families, pseudogenes and proteome evolution.

Authors: Paul M Harrison; Mark Gerstein
Journal: J Mol Biol Date: 2002-05-17 Impact factor: 5.469

7. Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome.

Authors: Zhaolei Zhang; Paul Harrison; Mark Gerstein
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

8. A small reservoir of disabled ORFs in the yeast genome and its implications for the dynamics of proteome evolution.

Authors: Paul Harrison; Anuj Kumar; Ning Lan; Nathaniel Echols; Michael Snyder; Mark Gerstein
Journal: J Mol Biol Date: 2002-02-22 Impact factor: 5.469

9. Genes, pseudogenes, and Alu sequence organization across human chromosomes 21 and 22.

Authors: Chingfer Chen; Andrew J Gentles; Jerzy Jurka; Samuel Karlin
Journal: Proc Natl Acad Sci U S A Date: 2002-02-26 Impact factor: 11.205

10. The sequence of the human genome.

Authors: J C Venter; M D Adams; E W Myers; P W Li; R J Mural; G G Sutton; H O Smith; M Yandell; C A Evans; R A Holt; J D Gocayne; P Amanatides; R M Ballew; D H Huson; J R Wortman; Q Zhang; C D Kodira; X H Zheng; L Chen; M Skupski; G Subramanian; P D Thomas; J Zhang; G L Gabor Miklos; C Nelson; S Broder; A G Clark; J Nadeau; V A McKusick; N Zinder; A J Levine; R J Roberts; M Simon; C Slayman; M Hunkapiller; R Bolanos; A Delcher; I Dew; D Fasulo; M Flanigan; L Florea; A Halpern; S Hannenhalli; S Kravitz; S Levy; C Mobarry; K Reinert; K Remington; J Abu-Threideh; E Beasley; K Biddick; V Bonazzi; R Brandon; M Cargill; I Chandramouliswaran; R Charlab; K Chaturvedi; Z Deng; V Di Francesco; P Dunn; K Eilbeck; C Evangelista; A E Gabrielian; W Gan; W Ge; F Gong; Z Gu; P Guan; T J Heiman; M E Higgins; R R Ji; Z Ke; K A Ketchum; Z Lai; Y Lei; Z Li; J Li; Y Liang; X Lin; F Lu; G V Merkulov; N Milshina; H M Moore; A K Naik; V A Narayan; B Neelam; D Nusskern; D B Rusch; S Salzberg; W Shao; B Shue; J Sun; Z Wang; A Wang; X Wang; J Wang; M Wei; R Wides; C Xiao; C Yan; A Yao; J Ye; M Zhan; W Zhang; H Zhang; Q Zhao; L Zheng; F Zhong; W Zhong; S Zhu; S Zhao; D Gilbert; S Baumhueter; G Spier; C Carter; A Cravchik; T Woodage; F Ali; H An; A Awe; D Baldwin; H Baden; M Barnstead; I Barrow; K Beeson; D Busam; A Carver; A Center; M L Cheng; L Curry; S Danaher; L Davenport; R Desilets; S Dietz; K Dodson; L Doup; S Ferriera; N Garg; A Gluecksmann; B Hart; J Haynes; C Haynes; C Heiner; S Hladun; D Hostin; J Houck; T Howland; C Ibegwam; J Johnson; F Kalush; L Kline; S Koduru; A Love; F Mann; D May; S McCawley; T McIntosh; I McMullen; M Moy; L Moy; B Murphy; K Nelson; C Pfannkoch; E Pratts; V Puri; H Qureshi; M Reardon; R Rodriguez; Y H Rogers; D Romblad; B Ruhfel; R Scott; C Sitter; M Smallwood; E Stewart; R Strong; E Suh; R Thomas; N N Tint; S Tse; C Vech; G Wang; J Wetter; S Williams; M Williams; S Windsor; E Winn-Deen; K Wolfe; J Zaveri; K Zaveri; J F Abril; R Guigó; M J Campbell; K V 
Sjolander; B Karlak; A Kejariwal; H Mi; B Lazareva; T Hatton; A Narechania; K Diemer; A Muruganujan; N Guo; S Sato; V Bafna; S Istrail; R Lippert; R Schwartz; B Walenz; S Yooseph; D Allen; A Basu; J Baxendale; L Blick; M Caminha; J Carnes-Stine; P Caulk; Y H Chiang; M Coyne; C Dahlke; A Deslattes Mays; M Dombroski; M Donnelly; D Ely; S Esparham; C Fosler; H Gire; S Glanowski; K Glasser; A Glodek; M Gorokhov; K Graham; B Gropman; M Harris; J Heil; S Henderson; J Hoover; D Jennings; C Jordan; J Jordan; J Kasha; L Kagan; C Kraft; A Levitsky; M Lewis; X Liu; J Lopez; D Ma; W Majoros; J McDaniel; S Murphy; M Newman; T Nguyen; N Nguyen; M Nodell; S Pan; J Peck; M Peterson; W Rowe; R Sanders; J Scott; M Simpson; T Smith; A Sprague; T Stockwell; R Turner; E Venter; M Wang; M Wen; D Wu; M Wu; A Xia; A Zandieh; X Zhu
Journal: Science Date: 2001-02-16 Impact factor: 47.728
