Françoise Thibaud-Nissen, Shu Ouyang, C Robin Buell.
Abstract
BACKGROUND: The Osa1 Genome Annotation of rice (Oryza sativa L. ssp. japonica cv. Nipponbare) is the product of a semi-automated pipeline that does not explicitly predict pseudogenes. As such, it is likely to mis-annotate pseudogenes as functional genes. A total of 22,033 gene models in Osa1 Release 5 were investigated as potential pseudogenes because each exhibits at least one feature potentially indicative of a pseudogene: lack of transcript support, a short coding region, a long untranslated region, or, for genes residing within a segmentally duplicated region, lack of a paralog or a coding region significantly shorter than that of the corresponding paralog.
Year: 2009 PMID: 19607679 PMCID: PMC2724416 DOI: 10.1186/1471-2164-10-317
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Table 1. Genes with pseudogene features (GPFs) and pseudogenes
| Category | No. of GPFs | Pseudogenes (%)¹ | Transcribed pseudogenes |
| Unsupported² | 17792 | 1191 (7%) | 101 (8.5%) |
| Long UTR³ | 831 | 104 (12%) | 35 (34%) |
| Short CDS⁴ | 734 | 5 (4%) | 0 (0%) |
| Poly-A tail⁵ | 475 | 30 (6%) | 1 (3%) |
| Segmentally duplicated⁶ | 248 | 40 (16%) | 14 (35%) |
| Single-exon singletons⁷ | 4833 | 202 (4%) | 31 (15%) |
| Total (non redundant) | 22033 | 1439 (6.5%) | 170 (13%) |
¹ Pseudogenes (with parent gene and at least one frameshift or premature stop codon)
² GPFs not supported by cDNA or EST evidence
³ The UTRs of the GPFs are longer than mean + 2 standard deviations
⁴ The CDS of the GPFs are shorter than 50 amino acids
⁵ The GPFs contain a stretch of 18 adenines in a 20-base window, within -200 to 400 bases from the end of the annotated UTR, or within 600 bases of the stop codon if no UTR is annotated
⁶ The CDS of the GPFs are significantly shorter than their respective paralog, or the GPFs have a significantly smaller number of exons
⁷ The GPFs contain a single exon and are within a segmentally duplicated region but have no paralog in the duplicated region
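The thresholds in footnotes 3 to 5 can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, the standard deviation is assumed to be the sample standard deviation, "a stretch of 18 adenines in a 20-base window" is read as "at least 18 A's in some window of 20 bases", and the caller is assumed to have already extracted the search region around the UTR end or stop codon.

```python
from statistics import mean, stdev


def long_utr_threshold(utr_lengths):
    """Cutoff in the spirit of footnote 3: mean + 2 standard deviations.

    Assumption: sample standard deviation; the source does not say which.
    """
    return mean(utr_lengths) + 2 * stdev(utr_lengths)


def has_short_cds(protein_length_aa, min_aa=50):
    """Footnote 4: the CDS encodes fewer than 50 amino acids."""
    return protein_length_aa < min_aa


def has_poly_a(region, window=20, min_a=18):
    """Footnote 5, as interpreted here: some 20-base window holds >= 18 A's.

    `region` is the already-extracted search region (hypothetical input).
    """
    region = region.upper()
    if len(region) < window:
        return False
    # Count A's in the first window, then slide one base at a time.
    count = region[:window].count("A")
    if count >= min_a:
        return True
    for i in range(window, len(region)):
        count += (region[i] == "A") - (region[i - window] == "A")
        if count >= min_a:
            return True
    return False
```

A gene model would be flagged as a GPF if any of these checks (or the other criteria in the table) is met; the sliding count keeps the poly-A scan linear in the length of the region.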
Table 2. Origin of the pseudogenes
| Category* | Known: Duplicated | Known: Retrotransposed | Unknown: Single-exon pseudogene | Unknown: Multi-exon pseudogene |
| Unsupported | 507 | 162 | 453 | 69 |
| Long UTR | 62 | 11 | 25 | 6 |
| Short CDS | 1 | 0 | 4 | 0 |
| Poly-A tail | 9 | 2 | 16 | 3 |
| Segmentally duplicated | 36 | 2 | 1 | 1 |
| Single-exon singletons | 39 | 34 | 115 | 14 |
| Total (non redundant) | 627 | 189 | 539 | 84 |
*See Table 1 for a description of the different categories
Table 3. Characteristics of the pseudogenes
| Pseudogene origin | Length (aa) | Nucleotide identity (%) | Protein similarity (%) | Coverage (%) | Disablements/pseudogene | Disablements/1000 bases |
| Duplicated | 492 | 70.3 | 73.3 | 89.9 | 1.85 | 5.03 |
| Retrotransposed | 398 | 59.2 | 63.8 | 85.8 | 1.52 | 5.69 |
| Unknown, single-exon pseudogene | 257 | 71.6 | 70.8 | 91 | 1.83 | 10.32 |
| Unknown, multi-exon pseudogene | 455 | 63.0 | 63.7 | 89.7 | 1.96 | 6.91 |
Figure 1. Number of disablements per pseudogene. The number of disablements is represented on the x-axis and the log normal of the number of pseudogenes on the y-axis.
Figure 2. Distribution of log(Ka/Ks) ratios for the pseudogenes (solid line) and for a control set of functional paralogous genes (dotted line).
Figure 3. Fate of the community-annotated pseudogenes in our annotation process. Number of candidates passing each step in our pseudogene identification method.
Table 4. Twenty most significantly over-represented GO terms in pseudogenes
| GO term | Number of pseudogenes | Percent of pseudogenes | Percent of Osa1 Gene Complement | p-value | GO term description |
| GO:0019748 | 250 | 36.4 | 11.7 | 2.5E-63 | Secondary metabolic process |
| GO:0009058 | 277 | 40.3 | 16.9 | 1.1E-47 | Biosynthetic process |
| GO:0008150 | 186 | 27.1 | 9.6 | 3.8E-39 | Biological process |
| GO:0006519 | 162 | 23.6 | 9.0 | 3.0E-30 | Amino acid and derivative metabolic process |
| GO:0007165 | 341 | 49.6 | 29.0 | 3.5E-30 | Signal transduction |
| GO:0016301 | 418 | 60.8 | 41.7 | 2.3E-24 | Kinase activity |
| GO:0005739 | 248 | 36.1 | 19.6 | 4.0E-24 | Mitochondrion |
| GO:0030246 | 101 | 14.7 | 5.2 | 7.0E-21 | Carbohydrate binding |
| GO:0009987 | 254 | 37.0 | 21.5 | 7.1E-21 | Cellular process |
| GO:0016740 | 216 | 31.4 | 17.7 | 7.9E-19 | Transferase activity |
| GO:0007582 | 228 | 33.2 | 19.5 | 1.2E-17 | Physiological process |
| GO:0009719 | 300 | 43.7 | 29.3 | 5.8E-16 | Response to endogenous stimulus |
| GO:0016020 | 280 | 40.8 | 26.8 | 8.2E-16 | Membrane |
| GO:0006629 | 103 | 15.0 | 6.5 | 3.2E-15 | Lipid metabolic process |
| GO:0005515 | 189 | 27.5 | 16.3 | 6.1E-14 | Protein binding |
| GO:0005618 | 165 | 24.0 | 13.9 | 5.6E-13 | Cell wall |
| GO:0004872 | 51 | 7.4 | 2.5 | 8.7E-12 | Receptor activity |
| GO:0030154 | 57 | 8.3 | 3.2 | 5.2E-11 | Cell differentiation |
| GO:0005886 | 73 | 10.6 | 4.7 | 1.3E-10 | Plasma membrane |
| GO:0006464 | 138 | 20.1 | 11.8 | 1.8E-10 | Protein modification process |
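The table above reports p-values for over-representation but, in this excerpt, not the test that produced them. As an illustration only, the sketch below uses a one-sided hypergeometric test, one common choice for GO enrichment, and is not necessarily the authors' method. The function names and the counts in the usage lines are hypothetical; only the two percentages passed to `fold_enrichment` come from the table (GO:0019748).

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: annotated genes in the background set
    K: background genes carrying the GO term
    n: annotated pseudogenes (the sample)
    k: pseudogenes carrying the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    # comb() returns 0 when n - i exceeds N - K, so impossible draws add nothing.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def fold_enrichment(pct_in_pseudogenes, pct_in_gene_complement):
    """Ratio of the two percentage columns of the table."""
    return pct_in_pseudogenes / pct_in_gene_complement


# Hypothetical counts, chosen only to show the call shape.
p = hypergeom_upper_tail(k=15, n=50, K=100, N=1000)

# Percentages for GO:0019748 taken from the table.
ratio = fold_enrichment(36.4, 11.7)
```

Exact integer binomials keep the tail sum free of rounding until the final division; for genome-scale counts a statistics library would usually be faster.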
Figure 4. Number of pseudogenes per paralogous family. Pseudogenes were associated with paralogous families based on their parents. Families discussed in the text are labeled with their number and the name of the associated Pfam domain, if characterized. BTB-MATH: Bric-a-Brac/Tramtrack/Broad Complex and Meprin and TRAF homology domain. DUF: domain of unknown function. The straight line represents the linear regression of the number of pseudogenes per family over the number of functional genes per family. In the inset plot, the y-axis has a greater range to represent family 3724.
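The regression line described in the Figure 4 caption can be sketched as an ordinary least-squares fit of pseudogene counts on functional gene counts. The family sizes below are hypothetical and the function names are illustrative; the source does not give the per-family data or the fitting software.

```python
def least_squares_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def residuals(xs, ys, slope, intercept):
    """Observed minus fitted pseudogene count for each family."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Hypothetical families: functional genes per family and pseudogenes per family.
functional = [10, 25, 40, 80, 120]
pseudogenes = [1, 2, 4, 15, 9]

slope, intercept = least_squares_line(functional, pseudogenes)
excess = residuals(functional, pseudogenes, slope, intercept)
```

Families with a large positive residual carry more pseudogenes than their size alone would predict, which is the kind of outlier a plot like Figure 4 makes visible.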