Sandra R Richardson, Carmen Salvador-Palomeque, Geoffrey J Faulkner.
Abstract
Gene retrocopies are generated by reverse transcription and genomic integration of mRNA. As such, retrocopies present an important exception to the central dogma of molecular biology, and have substantially impacted the functional landscape of the metazoan genome. While an estimated 8,000-17,000 retrocopies exist in the human genome reference sequence, the extent of variation between individuals in terms of retrocopy content has remained largely unexplored. Three recent studies by Abyzov et al., Ewing et al. and Schrider et al. have exploited 1,000 Genomes Project Consortium data, as well as other sources of whole-genome sequencing data, to uncover novel gene retrocopies. Here, we compare the methods and results of these three studies, highlight the impact of retrocopies in human diversity and genome evolution, and speculate on the potential for somatic gene retrocopies to impact cancer etiology and genetic diversity among individual neurons in the mammalian brain.
Keywords: pseudogene; retrocopy; retrogene; retrotransposition
Year: 2014 PMID: 24615986 PMCID: PMC4314676 DOI: 10.1002/bies.201300181
Source DB: PubMed Journal: Bioessays ISSN: 0265-9247 Impact factor: 4.345
Figure 1. Illustration of retrocopy mechanism and detection strategy. A: Generation of L1 insertions and gene retrocopies by the L1 machinery. The typical L1 retrotransposition pathway is indicated by black arrows; gray arrows denote the less-frequent mobilization of cellular mRNAs. Retrotransposition begins with the transcription of a full-length L1 in the genome. The L1 mRNA (wavy black line) is exported from the nucleus and translated, giving rise to the L1 encoded proteins ORF1p (green circles) and ORF2p (blue oval). ORF1p and ORF2p exhibit a strong cis-preference for binding their encoding mRNA, resulting in formation of the L1 ribonucleoprotein particle (RNP). Occasionally, the L1-encoded proteins mobilise cellular mRNAs (dashed red line) in trans. Regardless of the RNA template, insertions generated by the L1-encoded enzymatic machinery undergo target-site primed reverse transcription, resulting in a new L1 copy (blue rectangle), or a gene retrocopy (multi-coloured rectangle), at a distinct genomic location. L1 insertions and retrocopy insertions bear the hallmarks of target-primed reverse transcription, including poly-A tails and target-site duplications (purple arrows). B: Characteristics and detection of gene retrocopies. A typical parent gene (above; coloured rectangles denote exons) contains introns (grey lines) and resides at a particular genomic location (heavy black line). Paired-end sequencing reads (dashed lines) wherein one end maps to a gene (red and green rectangles) and the other to its known genomic location (black rectangles) are termed concordant. Conversely, paired-end sequencing reads wherein one end maps to a gene (red and green rectangles), but the other end maps to a distal genomic location (blue rectangles), are termed discordant (denoted by red X's). Discordant paired-end reads are indicative of a gene retrocopy (below), and allow mapping of the retrocopy to its genomic location.
Gene retrocopies are distinguished from parent genes by their discrete genomic location (heavy blue line), the presence of retrotransposition hallmarks (target site duplications (TSDs), purple arrows; and a poly-A tail, An) and a lack of introns. Sequencing reads which span exon-exon junctions (bi-coloured rectangles) are also indicative of gene retrocopies.
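The concordant/discordant distinction described in panel B can be illustrated with a minimal Python sketch. This is not the pipeline used by any of the three studies: the `Interval` type, the `classify_pair` function, and the `max_insert` tolerance (how far outside the gene a mate may map and still count as lying at the gene's known location) are hypothetical names and values chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Interval:
    """Genomic span of a parent gene (hypothetical helper type)."""
    chrom: str
    start: int
    end: int

    def contains(self, chrom: str, pos: int, slop: int = 0) -> bool:
        # True if the position lies within the span, widened by `slop` bp.
        return chrom == self.chrom and self.start - slop <= pos <= self.end + slop


def classify_pair(gene: Interval,
                  mate1: Tuple[str, int],
                  mate2: Tuple[str, int],
                  max_insert: int = 1000) -> str:
    """Label a read pair relative to one parent gene.

    Each mate is (chromosome, position). `max_insert` is an assumed
    tolerance, not a value taken from the studies.
    """
    in1 = gene.contains(*mate1)
    in2 = gene.contains(*mate2)
    if not (in1 or in2):
        return "unrelated"      # neither mate touches the gene
    if in1 and in2:
        return "concordant"     # both mates map at the parent locus
    other = mate2 if in1 else mate1
    if gene.contains(other[0], other[1], slop=max_insert):
        return "concordant"     # partner maps to the gene's own flank
    return "discordant"         # partner maps to a distal location


gene = Interval("chr1", 10_000, 20_000)
print(classify_pair(gene, ("chr1", 15_000), ("chr7", 500_000)))
```

The sketch ignores exon structure, so it does not detect junction-spanning reads; those would require comparing read alignments against annotated exon boundaries.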
Comparison of three recent studies exploiting data from the 1,000 Genomes Project Consortium to uncover novel gene retrocopies
| Study | Human genome sequencing data used | Requirements for calling novel retrocopies | Validation criteria | Novel retrocopies discovered |
|---|---|---|---|---|
| Abyzov et al. | 1,000 Genomes: two deep-sequenced trios, analysed per individual; 968 shallow-sequenced individuals, analysed as pools based on population | Require exon-exon junctions. Calling parameters optimised for each individual or population using a null model based on shifted GENCODE annotations | Read-depth support; insertion site found within HuRef genome assembly; PCR validation; DNA sequencing; genotyped in additional samples by finding supportive reads | 149 retrocopies absent from human genome reference; 38 with known insertion site. 27 retrocopies present in human genome reference but absent from sequenced genomes |
| Ewing et al. | 1,000 Genomes: 939 shallow-sequenced individuals, analysed as one pool. The Cancer Genome Atlas (TCGA): 85 paired tumour/non-tumour genomes | Require insertion site. Require ≥8 read pairs spanning retrocopy and insertion location; ≥2 read pairs spanning each end of the retrocopy | Precise break points; hallmarks of L1-mediated retrotransposition; exon-exon junctions | 48 retrocopies absent from human genome reference (39 present in 1,000 Genomes data, 9 exclusive to TCGA data); 48 with known insertion site. 10 retrocopies present in the human genome reference but absent from sequenced genomes |
| Schrider et al. | 1,000 Genomes: 164 total individuals, including two deep-sequenced trios. Two additional genomes sequenced with SOLiD3 technology | Require insertion site or exon-exon junction. For insertion point, require ≥5 paired-end reads spanning retrocopy and insertion location. For exon-exon junctions, require ≥1 junction-spanning read with ≥10 bp crossing the junction, or ≥2 distinct reads with ≥5 bp crossing the junction | PCR validation; DNA sequencing; genotyped in additional samples by finding supportive reads | 73 retrocopies absent from human genome reference; 21 with known insertion site. 18 retrocopies present in the human genome reference but absent from sequenced genomes |
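The read-support thresholds listed in the table for Ewing et al. and Schrider et al. can be expressed as simple predicates. The Python sketch below is only an illustration of those numeric cut-offs; the function names and input representation are hypothetical, and it assumes that the ≥8 spanning pairs and the ≥2 pairs per end are counted as independent requirements, which the table does not state explicitly. The null-model-based parameter optimisation of Abyzov et al. is not modelled.

```python
from typing import Sequence


def ewing_style_call(spanning_pairs: int,
                     left_end_pairs: int,
                     right_end_pairs: int) -> bool:
    """>=8 read pairs linking retrocopy and insertion location,
    and >=2 read pairs spanning each end of the retrocopy."""
    return spanning_pairs >= 8 and left_end_pairs >= 2 and right_end_pairs >= 2


def schrider_style_junction(overhangs_bp: Sequence[int]) -> bool:
    """`overhangs_bp` holds, for each distinct junction-spanning read,
    the number of bases crossing the exon-exon junction.
    Pass if >=1 read crosses by >=10 bp, or >=2 reads cross by >=5 bp."""
    if any(o >= 10 for o in overhangs_bp):
        return True
    return sum(1 for o in overhangs_bp if o >= 5) >= 2


def schrider_style_call(insertion_pairs: int,
                        overhangs_bp: Sequence[int]) -> bool:
    """Insertion-site evidence (>=5 spanning paired-end reads)
    or exon-exon junction evidence is sufficient."""
    return insertion_pairs >= 5 or schrider_style_junction(overhangs_bp)


# A candidate with 6 insertion-spanning pairs (1 and 3 at the two ends)
# and junction reads crossing by 6 and 7 bp.
print(ewing_style_call(6, 1, 3))
print(schrider_style_call(6, [6, 7]))
```

Under these assumptions the same candidate can fail the Ewing-style criteria yet pass the Schrider-style criteria, which is one reason the three studies report different retrocopy cohorts.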
Figure 2. Overlap in retrogene cohorts discovered from 1,000 Genomes Project Consortium data by the three studies. Only retrogenes absent from the human genome reference, but present in 1,000 Genomes Project Consortium data, are represented. Nine retrocopies present only in TCGA data discovered by Ewing et al. are excluded from this comparison. Within each segment, black numbers indicate the total number of novel retrogenes; below, red numbers indicate the number of retrogenes for which an insertion site was mapped by at least one study. For retrocopies with known insertion sites, overlap was confirmed by comparing insertion site coordinates. For those without known insertion sites, only gene names were used. Segments are not drawn to scale.