Jia-Yu Chen, Qing Sunny Shen, Wei-Zhen Zhou, Jiguang Peng, Bin Z He, Yumei Li, Chu-Jun Liu, Xuke Luan, Wanqiu Ding, Shuxian Li, Chunyan Chen, Bertrand Chin-Ming Tan, Yong E Zhang, Aibin He, Chuan-Yun Li.
Abstract
While some human-specific protein-coding genes have been proposed to originate from ancestral lncRNAs, the transition process remains poorly understood. Here we identified 64 hominoid-specific de novo genes and report a mechanism for the origination of functional de novo proteins from ancestral lncRNAs with precise splicing structures and specific tissue expression profiles. Whole-genome sequencing of dozens of rhesus macaque animals revealed that these lncRNAs are generally not more selectively constrained than other lncRNA loci. The existence of these newly originated de novo proteins is also not beyond anticipation under neutral expectation, as they generally have a longer theoretical lifespan than their current age: their GC-rich sequences enable stable ORFs with a lower chance of nonsense mutations. Interestingly, although the emergence and retention of these de novo genes are likely driven by neutral forces, a population genetics study in 67 human individuals and 82 macaque animals revealed signatures of purifying selection on these genes specifically in the human population, indicating that a proportion of these newly originated proteins are already functional in humans. We thus propose a mechanism for the creation of functional de novo proteins from ancestral lncRNAs during primate evolution, which may contribute to human-specific genetic novelties by taking advantage of existing genomic contexts.
Year: 2015 PMID: 26177073 PMCID: PMC4503675 DOI: 10.1371/journal.pgen.1005391
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Fig 3. Emergence of human de novo proteins from GC-rich lncRNA precursors.
(A) GC contents of randomly selected intergenic regions, all lncRNAs and lncRNA precursors in rhesus macaque are summarized in boxplots. (B) GC contents of different genomic regions are shown for de novo genes in human, as well as for the orthologous non-coding regions in chimpanzee and rhesus macaque. For lncRNA precursors, the pseudo-CDS and pseudo-UTR regions were defined according to their orthologous relationship with the corresponding CDS and UTR regions of human de novo proteins. (C) GC contents of the CDS regions of RefSeq proteins and de novo proteins in human are summarized in boxplots. A: all de novo genes, Y: younger de novo genes, O: older de novo genes. (D) Boxplot showing the distribution of fragile codon composition of de novo genes and RefSeq proteins in human. (E) Boxplot showing the distribution of the half-life of de novo genes and RefSeq proteins in human. (F) Dot plot showing the survival probability of the de novo ORFs. The probability of 0.05 is marked by a red dashed line.
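The two sequence properties summarized in panels A to D can be illustrated with a minimal Python sketch. It assumes that a "fragile codon" is a sense codon that a single-nucleotide substitution can turn into a stop codon under the standard genetic code; that definition, and the function names `gc_content`, `is_fragile` and `fragile_codon_fraction`, are assumptions made for illustration and are not taken from the record above.

```python
from itertools import product

# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}
BASES = "ACGT"


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def is_fragile(codon: str) -> bool:
    """True if one substitution can turn this sense codon into a stop codon.

    This definition of "fragile" is an assumption of the sketch.
    """
    if codon in STOP_CODONS:
        return False
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if mutant in STOP_CODONS:
                return True
    return False


# Precompute the fragile set once over all 64 codons.
FRAGILE_CODONS = {
    "".join(triplet)
    for triplet in product(BASES, repeat=3)
    if is_fragile("".join(triplet))
}


def fragile_codon_fraction(cds: str) -> float:
    """Fraction of sense codons in an in-frame CDS that are fragile.

    Stop codons and any trailing partial codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    codons = [c for c in codons if c not in STOP_CODONS]
    if not codons:
        return 0.0
    return sum(c in FRAGILE_CODONS for c in codons) / len(codons)
```

Under this definition, GC-rich codons tend to be less fragile because all three stop codons are AT-rich, which is consistent with the caption's link between GC content and ORF stability. For example, `fragile_codon_fraction("ATGTGGGCCTAA")` counts only `TGG` as fragile among the three sense codons.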
Fig 4. Profiling of polymorphisms in human and rhesus macaque.
(A) Comparison of human polymorphism sites profiled in this study with those in the 1000 Genomes Project. (B) The sequencing coverages of whole genome sequencing from one macaque animal and for the targeted re-sequencing of 82 macaque animals are summarized in green barplot and heatmaps inside the Circos map, respectively. The depths of the sequencing coverage are proportional to the color depth. Black rectangles outside the colored chromosome block represent the genomic locations of macaque orthologous regions of human de novo genes. The bottom panel illustrates the sequencing details of one region of interest. (C) Cumulative frequency of mean sequencing coverage on different genic regions of de novo genes is shown. Intergenic regions: 1-kb regions upstream and downstream of the gene. (D, E) Venn diagrams showing the distributions of macaque polymorphism sites identified by whole-genome sequencing and targeted re-sequencing, in terms of polymorphism sites (D) and genotypes (E).
Basic information of 64 de novo genes in the hominoid lineage.
| Gene ID | Age | Length | Expression | Peptides | Source |
|---|---|---|---|---|---|
| ENSG00000178803 | H | 159 | Kidney, 4 | 7 [ | [ |
| ENSG00000204626 | H | 163 | Cerebellum, 7 | 8 [ | [ |
| ENSG00000145063 | H | 174 | Brain, 1 | 12 [ | [ |
| ENSG00000172927 | H | 313 | Breast, 3 | 14 [ | [ |
| ENSG00000177822 | H-C | 148 | Adipose, 4 | 7 [ | [ |
| ENSG00000179522 | H-C | 230 | Prostate, 5 | 9 [ | [ |
| ENSG00000215071 | H-C | 121 | Testes, 14 | 2 [ | [ |
| ENSG00000182457 | H-C | 135 | Ovary, 17 | 3 [ | [ |
| ENSG00000174407 | H-C-O | 99 | Heart, 5 | 6 [ | [ |
| ENSG00000203930 | H | 103 | Cerebellum, 2 | 2 [ | [ |
| ENSG00000204091 | H-C-O | 100 | Testes, 3 | 3 [ | [ |
| ENSG00000204666 | H-C | 122 | Brain, 2 | 8 [ | [ |
| ENSG00000204674 | H-C | 123 | Cerebellum, 13 | 10 [ | [ |
| ENSG00000212736 | H-C-O | 115 | Adrenal, 16 | 8 [ | [ |
| ENSG00000167747 | H-C-O | 117 | Testes, 17 | 11 [ | [ |
| ENSG00000214112 | H-C-O | 72 | Heart, 2 | 3 [ | [ |
| ENSG00000214130 | H-C | 149 | Heart, 2 | 5 [ | [ |
| ENSG00000118267 | H | 423 | Colon, 17 | 33 [ | [ |
| ENSG00000215458 | H | 302 | Blood, 5 | 18 [ | [ |
| ENSG00000215494 | H | 152 | Breast, 1 | 12 [ | [ |
| ENSG00000215848 | H | 161 | Brain, 3 | 6 [ | [ |
| ENSG00000221953 | H | 237 | Brain, 1 | 15 [ | [ |
| ENSG00000221891 | H-C-O | 157 | Testes, 4 | 13 [ | [ |
| ENSG00000221899 | H | 166 | Lymph_node, 10 | 17 [ | [ |
| ENSG00000205056 | H | 121 | Blood, 1 | 2 [ | [ |
| ENSG00000198547 | H | 194 | Brain, 2 | 13 [ | [ |
| ENSG00000136242 | H | 128 | Testes, 16 | 4 [ | [ |
| ENSG00000162968 | H | 151 | Brain, 3 | 4 [ | [ |
| ENSG00000175913 | H | 147 | Cerebellum, 2 | 9 [ | [ |
| ENSG00000176833 | H | 126 | Testes, 1 | 7 [ | [ |
| ENSG00000176911 | H | 134 | Breast, 1 | 5 [ | [ |
| ENSG00000180838 | H | 131 | Prostate, 3 | 4 [ | [ |
| ENSG00000187488 | H | 221 | Testes, 17 | 2 [ | [ |
| ENSG00000196273 | H | 105 | Testes, 1 | 10 [ | [ |
| ENSG00000197916 | H | 129 | Adipose, 16 | 3 [ | [ |
| ENSG00000204079 | H | 141 | Adrenal, 1 | 6 [ | [ |
| ENSG00000204292 | H | 150 | Testes, 1 | 2 [ | [ |
| ENSG00000204380 | H | 155 | Cerebellum, 10 | 6 [ | [ |
| ENSG00000205373 | H | 219 | Testes, 13 | 4 [ | [ |
| ENSG00000205557 | H | 149 | Cerebellum, 3 | 2 [ | [ |
| ENSG00000205965 | H | 175 | Kidney, 2 | 4 [ | [ |
| ENSG00000206028 | H | 164 | Brain, 5 | 8 [ | [ |
| ENSG00000206096 | H | 127 | Testes, 4 | 5 [ | [ |
| ENSG00000206110 | H | 129 | Brain, 1 | 8 [ | [ |
| ENSG00000206113 | H | 213 | Testes, 1 | 13 [ | [ |
| ENSG00000212693 | H | 131 | Thyroid, 6 | 3 [ | [ |
| ENSG00000214780 | H | 195 | Brain, 1 | 3 [ | [ |
| ENSG00000218478 | H | 158 | Kidney, 16 | 2 [ | [ |
| ENSG00000223857 | H | 131 | Brain, 12 | 7 [ | [ |
| ENSG00000224013 | H | 164 | Cerebellum, 6 | 8 [ | [ |
| ENSG00000225021 | H | 144 | Liver, 7 | 4 [ | [ |
| ENSG00000225860 | H | 175 | Brain, 1 | 8 [ | [ |
| ENSG00000225917 | H | 269 | Brain, 15 | 15 [ | [ |
| ENSG00000230294 | H | 119 | Testes, 1 | 8 [ | [ |
| ENSG00000235766 | H | 142 | Lung, 12 | 4 [ | [ |
| ENSG00000236314 | H | 156 | Testes, 17 | 10 [ | [ |
| ENSG00000260456 | H-C-G | 158 | Testes, 15 | 6 [ | [ |
| ENSG00000149443 | H-C | 151 | Testes, 1 | 11 [ | [ |
| ENSG00000167159 | H-C-O | 157 | Cerebellum, 7 | 1 [ | [ |
| ENSG00000008517 | H-C | 188 | Kidney, 17 | 3 [ | [ |
| ENSG00000183250 | H-C-O | 204 | Brain, 7 | 7 [ | [ |
| ENSG00000244291 | H-C | 216 | Testes, 17 | 9 [ | [ |
| ENSG00000205913 | H-C | 107 | Brain, 3 | 4 [ | [ |
| ENSG00000221990 | H-C-G-O | 119 | Testes, 13 | 15 [ | [ |
Gene ID: gene IDs from the original studies.
Age: H: human, C: chimpanzee, G: gorilla and O: orangutan.
Length: stop codons are excluded.
Expression: tissue in which the de novo gene is most highly expressed, followed by the number of tissues (up to 17) in which the de novo gene is expressed (RPKM > 0.5).
*This study.
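
The Expression column combines two values per gene: the tissue with the highest expression and the number of tissues passing the RPKM > 0.5 cutoff. The following Python sketch shows one way such a summary could be derived from per-tissue RPKM values. The function name `summarize_expression` and the example values are hypothetical; only the 0.5 cutoff comes from the note above.

```python
from typing import Dict, Tuple


def summarize_expression(
    rpkm_by_tissue: Dict[str, float], threshold: float = 0.5
) -> Tuple[str, int]:
    """Return (most highly expressed tissue, number of tissues with RPKM > threshold)."""
    if not rpkm_by_tissue:
        raise ValueError("at least one tissue is required")
    # Tissue with the maximum RPKM value.
    top_tissue = max(rpkm_by_tissue, key=rpkm_by_tissue.get)
    # Strict inequality, matching the "RPKM > 0.5" cutoff.
    n_expressed = sum(value > threshold for value in rpkm_by_tissue.values())
    return top_tissue, n_expressed


# Hypothetical RPKM values for a single gene (not data from the table).
example = {"Kidney": 3.2, "Liver": 0.7, "Brain": 0.5, "Testes": 0.1}
top, count = summarize_expression(example)
print(f"{top}, {count}")
```

The comparison is strict, so a tissue at exactly 0.5 RPKM is not counted as expressed.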