Meili Chen1, Yibo Hu2, Jingxing Liu1, Qi Wu2, Chenglin Zhang3, Jun Yu1, Jingfa Xiao1, Fuwen Wei2, Jiayan Wu1.
Abstract
High-quality and complete gene models are the basis of whole genome analyses. The giant panda (Ailuropoda melanoleuca) genome was the first genome sequenced solely on the basis of short reads, but its annotation lacked the support of transcriptomic evidence. In this study, we applied RNA-seq to globally improve the genome assembly completeness and to detect novel expressed transcripts in 12 tissues from giant pandas, using a transcriptome reconstruction strategy that combined reference-based and de novo methods. Several aspects of genome assembly completeness in the transcribed regions were effectively improved by the de novo assembled transcripts, including genome scaffolding, the detection of small-size assembly errors, the extension of scaffold/contig boundaries, and gap closure. Through expression and homology validation, we detected three groups of novel full-length protein-coding genes. A total of 12.62% of the novel protein-coding genes were validated by proteomic data. GO annotation analysis showed that some of the novel protein-coding genes were involved in pigmentation, anatomical structure formation and reproduction, which might be related to the development and evolution of the black-white pelage, pseudo-thumb and delayed embryonic implantation of giant pandas. The updated genome annotation will help further giant panda studies from both structural and functional perspectives.
Year: 2015 PMID: 26658305 PMCID: PMC4676012 DOI: 10.1038/srep18019
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Diagram of the improvement in genome assembly completeness.
(A) Scaffolding improvement; (B) Scaffolding inconsistencies; (C) Nest assembly errors; (D) Boundary extensions; (E) Gap closure.
Evaluation of the scaffolding using the Trinity-assembled transcripts.
| OK_join | 499 | 2,195 | 2,083 | 4,777 | 1,205 | 741 |
| OK_merge | 56 | 47 | 95 | 198 | 152 | 79 |
| PB_merge | 605 | 589 | 1,269 | 2,463 | 987 | 1,503 |
| Total | 1,160 | 2,831 | 3,447 | 7,438 | 2,106 | 2,317 |
1. The mapping strands of two ordinal, aligned segments resulted in transcripts located across multiple scaffolds, which were oriented ‘+/+’, ‘−/−’, or ‘+/−’.
2. OK_join: the Trinity-assembled transcript alignment results suggested that these scaffold sequences were adjacent, which was used to improve scaffolding.
3. OK_merge: the Trinity-assembled transcript alignment results suggested that the most likely situation was that one genomic sequence filled the gap in another.
4. PB_merge: the Trinity-assembled transcript alignment results suggested that mis-assembly existed within these scaffolds/contigs.
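The counts in the scaffolding table can be checked for internal consistency. The sketch below is only an illustration: the column headers are not preserved in this record, so it assumes that the first three numeric columns are per-orientation counts and the fourth is their sum. The variable and function names are hypothetical.

```python
# Rows copied from the scaffolding table above.
# ASSUMPTION: the first three values are per-orientation counts and the
# fourth is their total; the original column headers are not available.
rows = {
    "OK_join": (499, 2195, 2083, 4777),
    "OK_merge": (56, 47, 95, 198),
    "PB_merge": (605, 589, 1269, 2463),
}
reported_total = (1160, 2831, 3447, 7438)  # first four values of the "Total" row


def row_is_consistent(counts):
    """Return True if the leading counts add up to the last value."""
    *parts, total = counts
    return sum(parts) == total


# Column-wise sums over the three categories, for comparison with "Total".
column_sums = tuple(sum(column) for column in zip(*rows.values()))

for name, counts in rows.items():
    print(name, "consistent" if row_is_consistent(counts) else "inconsistent")
print("column sums match Total row:", column_sums == reported_total)
```

The last two columns of the table are left out on purpose: their values do not sum to the "Total" row, so they presumably count something non-additive.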
Figure 2. Comparison of the Cufflinks- and Trinity-assembled transcripts of the giant panda with known gene models.
“Genome” represents known gene models from the Ensembl automated annotation system. In total, 43,838 Trinity-assembled transcripts could not be aligned back to the giant panda draft genome, and 102,742 Trinity-assembled transcripts mapped to scaffolds that did not cover any known gene models.
Figure 3. Expression pattern analysis of the giant panda novel protein-coding genes.
(A) Expression breadth of the Trinity-defined novel protein-coding genes; (B) Expression distribution of the Trinity-defined novel protein-coding genes; (C) A comparison of the GC content distributions between all the novel protein-coding genes and the known gene models (step size was 5%) of the giant panda; (D) A comparison of the CDS length distribution between all novel protein-coding genes and the known gene models (step size was 0.1 KB) of the giant panda.
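The binning behind panels (C) and (D) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the toy sequences are hypothetical, and CDS length is binned in integer steps of 100 bp (0.1 kb) to avoid floating-point edge cases.

```python
from collections import Counter


def gc_percent(seq):
    """GC content of a nucleotide sequence, as a percentage (0-100)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = sum(1 for base in seq if base in "GC")
    return 100.0 * gc / len(seq)


def bin_start(value, step):
    """Lower edge of the bin of width `step` that contains `value`."""
    return int(value // step) * step


def histogram(values, step):
    """Count how many values fall into each bin of width `step`."""
    return Counter(bin_start(v, step) for v in values)


# Toy coding sequences, invented purely for illustration.
toy_cds = ["ATGGCCGGCTAA", "ATGAAATTTTAA", "ATGGGGCCCAAATTTTGA"]

gc_hist = histogram([gc_percent(s) for s in toy_cds], step=5)    # 5% GC bins
len_hist = histogram([len(s) for s in toy_cds], step=100)        # 100 bp bins

for start in sorted(gc_hist):
    print(f"GC {start}-{start + 5}%: {gc_hist[start]} sequence(s)")
```

Running the same two histograms on the novel genes and on the known gene models would give the pairs of distributions that the panels compare.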
Figure 4. GO functional annotation analysis of the giant panda novel protein-coding genes.
Proteomic results of five selected tissues (pallium, pituitary gland, tongue, testis and ovary) for the validation of Trinity-defined novel protein-coding genes.
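| Gene group | Genes validated by proteomic data | Novel protein-coding genes | Percentage validated |
| --- | --- | --- | --- |
| Homology-based genes | 56 | 342 | 16.37% |
| Unknown genes | 780 | 5,407 | 14.43% |
| Hypothetical genes | 547 | 5,213 | 10.49% |
| Total | 1,383 | 10,962 | 12.62% |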
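The percentages in the table follow directly from the two count columns. The short sketch below reproduces them, assuming that the first count is the number of genes validated by proteomic data and the second is the total number of novel protein-coding genes in that group. The names used are hypothetical.

```python
# (validated, total) pairs copied from the table above.
# ASSUMPTION: first value = genes with proteomic support,
# second value = all novel protein-coding genes in the group.
groups = {
    "Homology-based genes": (56, 342),
    "Unknown genes": (780, 5407),
    "Hypothetical genes": (547, 5213),
}


def percent_validated(validated, total):
    """Share of validated genes, as a percentage rounded to two decimals."""
    return round(100 * validated / total, 2)


all_validated = sum(v for v, _ in groups.values())
all_genes = sum(t for _, t in groups.values())

for name, (validated, total) in groups.items():
    print(f"{name}: {percent_validated(validated, total)}%")
print(f"Total: {percent_validated(all_validated, all_genes)}%")
```

The overall figure agrees with the 12.62% reported in the abstract.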