Zhan Yu, David Morais, Mahine Ivanga, Paul M Harrison.
Abstract
BACKGROUND: The dynamics of gene evolution are influenced by several genomic processes. One such process is retrotransposition, where an mRNA transcript is reverse-transcribed and reintegrated into the genomic DNA.
Year: 2007 PMID: 17718914 PMCID: PMC2048973 DOI: 10.1186/1471-2105-8-308
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Pipeline summarizing the annotation of PRs and retropseudogenes. The pipeline for PR annotation is summarized. An inset at the bottom summarizes the tests for local gene order and chromosomal milieu.
Figure 2. Rapid annotation of retropseudogenes. (1) TBLASTN matches (e-value ≤ 10^-4) of the annotated proteome against the genomic DNA are sorted by coordinates and collated for each protein to form a set of matches {M}. (2) The sets {M} are filtered using length-based heuristics. (3) Each protein is realigned to the genomic DNA using FASTY; at each point, the best-matching protein that has disablements and that matches >70% of the length of the parent sequence is picked as a retropseudogene annotation.
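Step (1) of the caption above, together with the >70% length criterion of step (3), can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Hit` record, the function names `collate_matches` and `covers_parent`, and the assumption that hits have already been parsed from TBLASTN output are all hypothetical, and the length-based heuristics of step (2) and the FASTY realignment are not reproduced.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed TBLASTN match (hypothetical record layout)."""
    protein: str   # query protein identifier
    chrom: str     # chromosome or contig of the match
    start: int     # genomic start coordinate
    end: int       # genomic end coordinate
    evalue: float  # TBLASTN e-value


def collate_matches(hits, max_evalue=1e-4):
    """Keep hits with e-value <= max_evalue and group them per protein.

    Returns a dict mapping each protein to its hits sorted by genomic
    coordinates, i.e. one set of matches {M} per protein.
    """
    by_protein = defaultdict(list)
    for hit in hits:
        if hit.evalue <= max_evalue:
            by_protein[hit.protein].append(hit)
    return {
        protein: sorted(group, key=lambda h: (h.chrom, h.start, h.end))
        for protein, group in by_protein.items()
    }


def covers_parent(aligned_length, parent_length, min_fraction=0.70):
    """True if the alignment spans more than min_fraction of the parent."""
    return aligned_length > min_fraction * parent_length
```

With toy input, `collate_matches` drops weak hits and orders the rest along the genome, so that downstream filtering can work on one sorted list per protein.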
Overview of gene retrotransposition analysis for eight vertebrates
| 22219 | 631 (3%) | 78 (12%) | 36 (6%) | 504 | 2493 | 545 | 145/631 (23%) | |
| 20980 | 476 (2%) | 17 (4%) | 5 (1%) | 339 | 1889 | ---- | ||
| 18199 | 409 (2%) | 18 (4%) | 25 (6%) | 363 | 3505 | ---- | ||
| 23147 | 790 (3%) | 46 (6%) | 104 (13%) | 479 | 1996 | ---- | ||
| 25021 | 663 (3%) | 31 (5%) | 75 (11%) | 518 | 2969 | 533 | 58/663 (9%) | |
| 22157 | 567 (3%) | 21 (4%) | 62 (11%) | 492 | 4520 | ---- | ||
| 17707 | 321 (2%) | 15 (5%) | 26 (8%) | 267 | 720 | ---- | ||
| 28005 | 227 (1%) | 4 (2%) | 10 (4%) | 203 | 644 | ---- |
* The number of gene annotations (both those labelled 'known' and 'novel') for the genome version downloaded from Ensembl [15]; see Methods for details.
** TE = transposable element; see Methods for details.
*** FLE = fraction of largest exon; see Methods for details.
**** Number of PRs with complete Refseq mRNAs or complete Unigene consensus sequences (percentage of these in brackets); see Methods for details. Significantly less of this transcription information maps to PRs than to the whole annotated proteome for these organisms.
Figure 4. Lineage-specific lists of PRs: The number of species-specific PRs relative to other species. PRs specific relative to other species were obtained by comparison of Ks between the PR and its parent and the Ks between the parent (Ks) and the ortholog of the parent in the other species (Ks). PRs with Ks
Figure 3. Ks distributions: (A) Ks distribution for human PRs meeting the local gene order test with a threshold of N = 0, from comparison to their parent sequences. Labelled are the median values for the 'Human-specific' set, and those PRs formed between divergence from dog and from chimp [see panel (C)]. A similar distribution is observed with an N threshold of ≤ 1 for the local gene order test. (B) Ks distribution for mouse PRs meeting the local gene order test with a threshold of N = 0, from comparison to their parent sequences. Labelled are the median values for the 'Mouse-specific' set, and those PRs formed between divergence from dog and from chimp [see panel (C)]. A similar distribution is observed with an N threshold of ≤ 1 for the local gene order test.
Figure 5. Distributions of percentage protein sequence identity between PRs and parents. (A) Distribution of % protein sequence identity for all human PRs that pass the local gene order test (N ≤ 1). These are broken down into 'transcribed' and 'not transcribed'. (B) The fraction that are transcribed in each bin of the histogram in panel A. (C) Distribution of % protein sequence identity for all mouse PRs that pass the local gene order test (N ≤ 1). These are broken down into 'transcribed' and 'not transcribed'. (D) The fraction that are transcribed in each bin of the histogram in panel C.
Figure 6. Ka/Ks distributions for PRs and for retropseudogenes (RψGs). (A) Distribution of Ka/Ks for human PRs (n = 262) meeting the local gene order test (N ≤ 1), compared to Ka/Ks for the RψGs (n = 183). All sequences were required to have protein sequence identity ≥ 60.0% with their parent sequences. (B) As in (A), but for mouse PRs (n = 318) and RψGs (n = 220).
Results of analysis for reading-frame conservation (RFC)
| Human-specific relative to chimp | 10/40 (25%) | |
| Others in human, that were formed since divergence from dog | 49/171 (29%) | |
| Other older PRs | 162/378 (43%) | |
| TOTAL | 221/589 (38%) | |
| Mouse-specific relative to rat | 35/240 (15%) | |
| Others in mouse, that were formed since divergence from dog | 17/58 (29%) | |
| Other older PRs | 123/233 (53%) | |
| TOTAL | 175/531 (33%) |
† These are the sets illustrated in Figure 4. All of the PRs that pass the local gene order test, allowing for N ≤ 1, were analysed for reading-frame conservation.
†† For PRs formed since divergence from dog, RFC simulations were performed using ancestral sequences calculated using PAML [20]. For PRs formed before divergence from dog, RFC simulations were conservatively performed to half of the nucleotide-level divergence between the PR and its assigned parent sequence.
Most common Gene Ontology (GO) functional terms for different sets of sequences *
| GO:0005515, protein binding (2360) | GO:0008270, zinc ion binding (49) | ||
| GO:0008270, zinc ion binding (2069) | GO:0008270, zinc ion binding (189) | GO:0006355, regulation of transcription, DNA-dependent (35) | GO:0003677, DNA binding (10) |
| GO:0006355, regulation of transcription, DNA-dependent (2029) | GO:0006355, regulation of transcription, DNA-dependent (166) | GO:0005509, calcium ion binding (25) | GO:0006355, regulation of transcription, DNA-dependent (9) |
| GO:0005524, ATP binding (1687) | GO:0005525, GTP binding (21) | GO:0005525, GTP binding (5) | |
| GO:0003677, DNA binding (1339) | GO:0005515, protein binding (21) | GO:0003823, antigen binding (5) | |
| GO:0007165, signal transduction (1264) | GO:0005515, protein binding (114) | GO:0004842, ubiquitin-protein ligase activity (21) | GO:0003676, nucleic acid binding (5) |
| GO:0016740, transferase activity (1263) | GO:0003677, DNA binding (110) | GO:0003677, DNA binding (20) | GO:0030145, manganese ion binding (4) |
| GO:0004872, receptor activity (1242) | GO:0005524, ATP binding (93) | GO:0003676, nucleic acid binding (20) | GO:0020037, heme binding (4) |
| GO:0016787, hydrolase activity (1171) | GO:0046872, metal ion binding (63) | GO:0016757, transferase activity, transferring glycosyl groups (4) | |
| GO:0003700, transcription factor activity (1052) | GO:0000166, nucleotide binding (57) | GO:0003723, RNA binding (13) | GO:0005509, calcium ion binding (4) |
| GO:0005515, protein binding (2502) | GO:0005515, protein binding (17) | ||
| GO:0004872, receptor activity (1923) | GO:0005524, ATP binding (8) | ||
| GO:0006355, regulation of transcription, DNA-dependent (1571) | GO:0008270, zinc ion binding (12) | GO:0005515, protein binding (7) | |
| GO:0008270, zinc ion binding (1481) | GO:0006355, regulation of transcription, DNA-dependent (12) | GO:0016740, transferase activity (6) | |
| GO:0005524, ATP binding (1252) | GO:0005524, ATP binding (12) | GO:0016491, oxidoreductase activity (6) | |
| GO:0016740, transferase activity (1036) | GO:0005509, calcium ion binding (12) | GO:0006355, regulation of transcription, DNA-dependent (6) | |
| GO:0003677, DNA binding (1017) | GO:0005515, protein binding (101) | GO:0016740, transferase activity (10) | GO:0016853, isomerase activity (5) |
| GO:0016787, hydrolase activity (911) | GO:0016787, hydrolase activity (9) | GO:0016787, hydrolase activity (5) | |
| GO:0000166, nucleotide binding (873) | GO:0005524, ATP binding (78) | GO:0003677, DNA binding (9) | GO:0003677, DNA binding (5) |
| GO:0003676, nucleic acid binding (872) | GO:0003676, nucleic acid binding (9) | GO:0016874, ligase activity (3) | |
* The most abundant Gene Ontology 'molecular function' terms are listed for each set of sequences, in decreasing order of abundance. The GO term number and a brief description are followed by the number of occurrences (in brackets). Significant overrepresentation of GO terms was calculated as described previously using binomial statistics, with a Bonferroni correction for multiple hypothesis testing (P' < 0.05) [37]. 'Structural constituent of the ribosome' (in italics) is the only term that is significantly overrepresented in all three sets of putatively retrotransposed sequences.
** 'Structural constituent of the ribosome' remains the most abundant GO category in this column when PRs from parents with large exons (FLE > 0.67) are removed, or when a more stringent threshold of Nhomologs = 0 is used.
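The overrepresentation test mentioned in the first footnote (binomial statistics with a Bonferroni correction, P' < 0.05) could look roughly like the sketch below. The exact procedure is the one described in reference [37]; here a one-sided upper-tail binomial test against a background term frequency is assumed, and the function names and toy numbers are hypothetical. The exact summation is intended only for small illustrative counts.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), by exact summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def bonferroni_overrepresented(counts, n_sample, background_freq, alpha=0.05):
    """Test each GO term for overrepresentation in a sample.

    counts          : term -> occurrences in the sample (e.g. a PR set)
    n_sample        : number of annotated sequences in the sample
    background_freq : term -> frequency in the background (e.g. whole proteome)

    Returns term -> (corrected P', significant?), where P' is the raw
    upper-tail binomial P-value multiplied by the number of terms tested
    (Bonferroni) and capped at 1.
    """
    n_tests = len(counts)
    results = {}
    for term, k in counts.items():
        p_raw = binomial_upper_tail(k, n_sample, background_freq[term])
        p_corrected = min(1.0, p_raw * n_tests)
        results[term] = (p_corrected, p_corrected < alpha)
    return results
```

For example, with a toy sample of 10 sequences in which one term occurs 9 times against a background frequency of 0.1, the corrected P' is far below 0.05, whereas a term occurring once at the same background frequency is not significant.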