Michael J Milligan1, Erin Harvey2, Albert Yu2, Ashleigh L Morgan2, Daniela L Smith2, Eden Zhang2, Jonathan Berengut2, Jothini Sivananthan2, Radhini Subramaniam2, Aleksandra Skoric2, Scott Collins2, Caio Damski2, Kevin V Morris2, Leonard Lipovich1.
Abstract
Pseudogenes are abundant in the human genome and had long been thought of purely as nonfunctional gene fossils. Recent observations point to a role for pseudogenes in regulating genes transcriptionally and post-transcriptionally in human cells. To computationally interrogate the network space of integrated pseudogene and long non-coding RNA regulation in the human transcriptome, we developed and implemented an algorithm to identify all long non-coding RNA (lncRNA) transcripts that overlap the genomic spans, and specifically the exons, of any human pseudogenes in either sense or antisense orientation. As inputs to our algorithm, we imported three public repositories of pseudogenes: GENCODE v17 (processed and unprocessed, Ensembl 72), Retroposed Pseudogenes V5 (processed only), and Yale Pseudo60 (processed and unprocessed, Ensembl 60); two public lncRNA catalogs: Broad Institute and GENCODE v17; NCBI annotated piRNAs; and NHGRI clinical variants. The data sets were retrieved from the UCSC Genome Database using the UCSC Table Browser. We identified 2277 loci containing exon-to-exon overlaps between pseudogenes, both processed and unprocessed, and long non-coding RNA genes. Of these loci, 1167 had GenBank EST and full-length cDNA support, providing direct evidence of transcription on one or both strands with exon-to-exon overlaps. The analysis converged on 313 pseudogene-lncRNA exon-to-exon overlaps that were bidirectionally supported by both full-length cDNAs and ESTs. In the process of identifying transcribed pseudogenes, we generated a comprehensive, positionally non-redundant encyclopedia of human pseudogenes, drawing upon multiple, formerly disparate public pseudogene repositories. Collectively, these observations suggest that pseudogenes are pervasively transcribed on both strands and are common drivers of gene regulation.
Keywords: ESTs (expressed sequence tags); SNPs (single nucleotide polymorphisms); gene expression; human disease; lncRNA (long non-coding RNA); piRNA; pseudogenes; transcriptome
Year: 2016; PMID: 27047535; PMCID: PMC4805607; DOI: 10.3389/fgene.2016.00026
Source DB: PubMed; Journal: Front Genet; ISSN: 1664-8021; Impact factor: 4.599
Figure 1. Scoring formula for genomic-span overlaps. P1, proportion of gene 1's span overlapping with gene 2's span; P2, proportion of gene 2's span overlapping with gene 1's span; SS, span-span overlap score.
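The two proportions named in this caption can be illustrated with a minimal Python sketch. It assumes half-open (start, end) coordinates on the same chromosome, and the function name `span_overlap_proportions` is hypothetical. The caption does not state how P1 and P2 combine into SS, so the sketch stops at the two proportions.

```python
def span_overlap_proportions(span1, span2):
    """Return (P1, P2) for two genomic spans on the same chromosome.

    Assumption: spans are half-open (start, end) integer pairs with end > start.
    P1 is the fraction of span1 covered by span2; P2 is the converse.
    """
    start1, end1 = span1
    start2, end2 = span2
    # Overlap length is zero when the spans do not intersect.
    overlap = max(0, min(end1, end2) - max(start1, start2))
    p1 = overlap / (end1 - start1)
    p2 = overlap / (end2 - start2)
    return p1, p2


# Hypothetical spans: a 100 bp gene and a 250 bp gene sharing 50 bp.
p1, p2 = span_overlap_proportions((100, 200), (150, 400))
print(p1, p2)  # 0.5 0.2
```

Reporting both proportions matters because a short gene can be fully contained in a long one: P1 is then 1.0 while P2 stays small.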
Figure 2. Scoring formula for exon-to-exon overlaps. E1, proportion of an exon within gene 1 overlapping with an exon within gene 2; E2, proportion of an exon within gene 2 overlapping with an exon within gene 1; TE, total number of unique exons from both genes; EE, exon-exon overlap score.
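As a rough illustration of the quantities in this caption, the following Python sketch computes E1 and E2 for every overlapping exon pair and counts TE. It assumes half-open (start, end) exon coordinates and interprets "unique exons" as distinct coordinate pairs; both are assumptions, and the helper names are hypothetical. The way these terms combine into EE is not given in the caption and is not reproduced here.

```python
def exon_pair_proportions(exon1, exon2):
    """Return (E1, E2): the fraction of each exon covered by the other."""
    overlap = max(0, min(exon1[1], exon2[1]) - max(exon1[0], exon2[0]))
    return overlap / (exon1[1] - exon1[0]), overlap / (exon2[1] - exon2[0])


def overlapping_exon_pairs(exons1, exons2):
    """List (exon1, exon2, E1, E2) for every exon pair that overlaps."""
    pairs = []
    for e1 in exons1:
        for e2 in exons2:
            prop1, prop2 = exon_pair_proportions(e1, e2)
            if prop1 > 0:
                pairs.append((e1, e2, prop1, prop2))
    return pairs


def total_unique_exons(exons1, exons2):
    """TE, assuming 'unique' means distinct (start, end) coordinates."""
    return len(set(exons1) | set(exons2))


# Hypothetical exon lists for a pseudogene and an lncRNA.
gene1_exons = [(100, 200), (300, 400)]
gene2_exons = [(150, 250), (500, 600)]
pairs = overlapping_exon_pairs(gene1_exons, gene2_exons)
print(len(pairs))                                    # 1
print(total_unique_exons(gene1_exons, gene2_exons))  # 4
```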
Computationally derived intersections of all pseudogene, lncRNA, cDNA, and EST databases.
| GENCODE 17 Pseudogenes | GENCODE 17 lncRNA | 163 | 87 | 56 | 20 |
| GENCODE 17 Pseudogenes vs. GENCODE 17 lncRNA | cDNA | 64 | 33 | 4 | 27 |
| GENCODE 17 Pseudogenes vs. GENCODE 17 lncRNA | EST | 68 | 4 | 2 | 62 |
| GENCODE 17 Pseudogenes vs. GENCODE 17 lncRNA | cDNA and EST | 48 | 23 | 3 | 22 |
| GENCODE 17 Pseudogenes | Human lincRNA | 870 | 725 | 45 | 100 |
| GENCODE 17 Pseudogenes vs. Human lincRNA | cDNA | 371 | 216 | 15 | 140 |
| GENCODE 17 Pseudogenes vs. Human lincRNA | EST | 646 | 53 | 18 | 575 |
| GENCODE 17 Pseudogenes vs. Human lincRNA | cDNA and EST | 325 | 186 | 12 | 127 |
| Retro Ali5 | GENCODE 17 lncRNA | 211 | 79 | 120 | 12 |
| Retro Ali5 vs. GENCODE 17 lncRNA | cDNA | 78 | 35 | 17 | 26 |
| Retro Ali5 vs. GENCODE 17 lncRNA | EST | 162 | 30 | 16 | 116 |
| Retro Ali5 vs. GENCODE 17 lncRNA | cDNA and EST | 69 | 32 | 15 | 22 |
| Retro Ali5 | Human lincRNA | 557 | 405 | 108 | 44 |
| Retro Ali5 vs. Human lincRNA | cDNA | 129 | 61 | 26 | 42 |
| Retro Ali5 vs. Human lincRNA | EST | 381 | 64 | 34 | 283 |
| Retro Ali5 vs. Human lincRNA | cDNA and EST | 113 | 53 | 23 | 37 |
| Yale 60 | GENCODE 17 lncRNA | 105 | 66 | 25 | 14 |
| Yale 60 vs. GENCODE 17 lncRNA | cDNA | 40 | 25 | 2 | 13 |
| Yale 60 vs. GENCODE 17 lncRNA | EST | 46 | 2 | 1 | 43 |
| Yale 60 vs. GENCODE 17 lncRNA | cDNA and EST | 31 | 18 | 1 | 12 |
| Yale 60 | Human lincRNA | 547 | 468 | 24 | 55 |
| Yale 60 vs. Human lincRNA | cDNA | 192 | 123 | 6 | 63 |
| Yale 60 vs. Human lincRNA | EST | 372 | 38 | 12 | 322 |
| Yale 60 vs. Human lincRNA | cDNA and EST | 157 | 96 | 5 | 56 |
Directionality is relative to the pseudogene locus; complex loci are intersects that contain both sense and antisense overlaps.
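The directionality rule in this note can be sketched in Python as follows. The sketch assumes strands are recorded as "+" or "-" strings, and the function name `classify_locus` is hypothetical; it is an illustration of the stated definition, not the authors' code.

```python
def classify_locus(pseudogene_strand, lncrna_strands):
    """Classify a pseudogene locus by the strands of its overlapping lncRNAs.

    Assumption: strands are "+" or "-". An overlap is 'sense' when the lncRNA
    is on the pseudogene's strand and 'antisense' when it is on the other one.
    """
    has_sense = any(s == pseudogene_strand for s in lncrna_strands)
    has_antisense = any(s != pseudogene_strand for s in lncrna_strands)
    if has_sense and has_antisense:
        return "complex"
    if has_sense:
        return "sense"
    if has_antisense:
        return "antisense"
    return "none"  # no overlapping lncRNA at this locus


print(classify_locus("+", ["+", "+"]))  # sense
print(classify_locus("+", ["-"]))       # antisense
print(classify_locus("-", ["+", "-"]))  # complex
```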
Figure 3. Positional redundancy elimination. Redundant loci were identified by overlapping genomic spans and were eliminated so that only one locus could occupy the same genomic span.
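One way to picture the elimination step is the Python sketch below. It assumes loci are (chrom, start, end, name) tuples with half-open coordinates, ignores strand, and keeps the leftmost locus of each overlapping group. The caption does not say which locus is retained, so that tie-breaking rule is an assumption, as is the function name.

```python
def eliminate_positional_redundancy(loci):
    """Keep one locus per genomic span.

    Assumptions: loci are (chrom, start, end, name) tuples with half-open
    coordinates; strand is ignored; the leftmost locus in an overlapping
    group is retained and later overlapping loci are dropped.
    """
    kept = []
    for locus in sorted(loci, key=lambda l: (l[0], l[1], l[2])):
        chrom, start, _end, _name = locus
        if kept:
            last_chrom, _last_start, last_end, _last_name = kept[-1]
            if chrom == last_chrom and start < last_end:
                continue  # overlaps the last retained locus: redundant
        kept.append(locus)
    return kept


# Hypothetical loci: B overlaps A and is dropped; C starts after A ends.
loci = [
    ("chr1", 100, 200, "A"),
    ("chr1", 150, 250, "B"),
    ("chr1", 220, 300, "C"),
    ("chr2", 100, 200, "D"),
]
print([name for *_rest, name in eliminate_positional_redundancy(loci)])
# ['A', 'C', 'D']
```

Because retained loci never overlap and the input is sorted by start, comparing each candidate against only the most recently retained locus is sufficient.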
Directionality of lncRNA-supported pseudogene loci relative to the direction of pseudogene transcription in cDNA and EST data.
| cDNA | 374 | 164 | 629 |
| EST | 836 | 88 | 243 |
| EST and cDNA | 313 | 53 | 196 |
Figure 4. Gene Ontology comparison of all human protein-coding genes (Ensembl 79), the PsiCube pseudogene parental gene database (from pseudogene.org), and the parental genes of the 313 loci computationally identified as bidirectionally transcribed lncRNA-overlapping pseudogenes. Enrichments common to the 313 loci vs. both the whole genome and the parental gene database are in green. Enrichments found in all three comparisons are in red, and those unique to either pseudogene parental gene database when compared with the whole genome are in blue.
Figure 5. Venn diagram of the positionally non-redundant intersection of three major public pseudogene databases. The resulting non-redundant dataset (Supplementary Dataset 3) provides a more inclusive and comprehensive pseudogene database of 20945 pseudogene loci and alleviates problems due to accession number synonymy within and between the three databases. (Accession number synonyms point to the same pseudogene along the human genome, but without positionally non-redundant collapsing, downstream programs may misinterpret them as multiple pseudogenes.)
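The counts behind a Venn diagram of this kind could be derived along the lines of the Python sketch below. It assumes records of the form (source, chrom, start, end) with half-open coordinates and merges overlapping records transitively into one locus, which is one possible collapsing rule and not necessarily the one used for the figure. The source labels and the function name `venn_counts` are hypothetical.

```python
from collections import Counter


def venn_counts(records):
    """Count collapsed loci per combination of source databases.

    Assumptions: records are (source, chrom, start, end) tuples; any records
    whose spans overlap, directly or through a chain, collapse into one locus.
    """
    counts = Counter()
    cluster_chrom, cluster_end, sources = None, None, set()
    for source, chrom, start, end in sorted(records, key=lambda r: (r[1], r[2], r[3])):
        if chrom == cluster_chrom and start < cluster_end:
            # Same locus: extend the cluster and remember the contributing source.
            cluster_end = max(cluster_end, end)
            sources.add(source)
        else:
            if sources:
                counts[frozenset(sources)] += 1
            cluster_chrom, cluster_end, sources = chrom, end, {source}
    if sources:
        counts[frozenset(sources)] += 1
    return counts


# Hypothetical records from three databases with differing accessions.
records = [
    ("GENCODE", "chr1", 100, 200),
    ("Yale", "chr1", 120, 210),
    ("Retro", "chr1", 500, 600),
    ("GENCODE", "chr2", 100, 200),
    ("Retro", "chr2", 150, 250),
    ("Yale", "chr2", 180, 260),
]
counts = venn_counts(records)
print(sum(counts.values()))                  # 3
print(counts[frozenset({"GENCODE", "Yale"})])  # 1
```

Keying on position instead of accession is what lets synonymous accessions collapse: six input records become three loci here, each attributed to the set of databases that reported it.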