Mickael Orgeur, Marvin Martens, Stefan T Börno, Bernd Timmermann, Delphine Duprez, Sigmar Stricker.
Abstract
The sequence of the chicken genome, like several other draft genome sequences, is currently incomplete. Gaps, contigs assigned with low confidence and uncharacterized chromosomes result in gene fragmentation and imprecise gene annotation. Transcript abundance estimation from RNA sequencing (RNA-seq) data relies on read quality, library complexity and expression normalization. In addition, the quality of the genome sequence used to map sequencing reads, and of the gene annotation that defines gene features, must be taken into account. A partially covered genome sequence causes sequencing reads to be lost at the mapping step, while an inaccurate definition of gene features leads to imprecise read counts at the assignment step. Both effects can significantly bias the interpretation of RNA-seq data. Here, we describe a dual transcript-discovery approach combining a genome-guided gene prediction and a de novo transcriptome assembly. This dual approach enabled us to increase the assignment rate of RNA-seq data by nearly 20% compared with using only the chicken reference annotation, thereby contributing to a more accurate estimation of transcript abundance. More generally, this strategy could be applied to any organism with a partial genome sequence and/or lacking a manually curated reference annotation in order to improve the accuracy of gene expression studies.
Keywords: Chicken genome annotation; Gallus gallus; Gene prediction; Genome-guided transcript discovery; RNA sequencing; Transcriptome reconstruction
Year: 2018 PMID: 29183907 PMCID: PMC5827264 DOI: 10.1242/bio.028498
Source DB: PubMed Journal: Biol Open ISSN: 2046-6390 Impact factor: 2.422
RNA-seq read pair assignment
Fig. 1. Dual transcript-discovery approach. (A) Region surrounding the genes RABEP1 and HSD3B7 on chromosome 19. The RNA-seq signal on the plus strand (green), which does not overlap any gene from the UCSC and Ensembl reference annotations, corresponds to the gene COL26A1. (B) RNA-seq signal (orange) on the minus strand of an uncharacterized contig, delimiting three exons of the gene FLNA. (C) Region of the gene WNT11 on chromosome 1. As visible from the RNA-seq signal on the plus strand (green), both the UCSC and Ensembl reference annotations lack an exon of the 5′-UTR and display a shorter 3′-UTR. (D) The dual transcript-discovery approach combined a genome-guided gene prediction with a de novo transcriptome reconstruction. This dual approach enabled us to correct for gene fragmentation (orange), to identify missing gene candidates (red) and to adjust or validate existing annotated genes (green, blue), thus improving the assignment rate of RNA-seq read pairs. (E) Workflow to design the comprehensive gene annotation model.
Fig. 2. Characteristics of the new gene annotation model. (A) The dual transcript-discovery approach combining genome-guided gene prediction (light green) and de novo transcriptome reconstruction (dark green) raised the read-pair assignment rate by 19.3% compared with using the UCSC and Ensembl reference annotations (red). The proportion of read pairs coming from the RCAS-BP(A) replication-competent retroviruses is depicted in black. (B) Proportion of gene locations on chromosomes and contigs of the chicken reference genome galGal4. Of the identified gene candidates, 9.2% are fragmented due to their location on multiple chromosomes and contigs. (C) Proportion of annotated gene biotypes. Most of the annotated gene candidates potentially encode proteins (78.3%). Putative proteins correspond to gene candidates for which at least one protein domain could be detected (3.1%). Uncharacterized proteins are gene candidates with an ORF of ≥100 amino acids in which no protein domain was identified (6.6%). Gene candidates without a sufficiently long predicted ORF (<100 amino acids) are classified as non-coding RNAs (20.7%). Gene candidates encoding spliceosome complex members and ribosomal RNAs, as well as pseudogenes, are classified as miscellaneous genes (1.0%).
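The biotype rules given in the legend of Fig. 2C can be read as a small decision procedure. The sketch below is one possible reading, not the authors' implementation. The `GeneCandidate` fields, the `matches_known_protein` flag and the order of the checks are assumptions made for illustration; only the 100-amino-acid ORF threshold and the category names come from the legend.

```python
from dataclasses import dataclass

# ORF length threshold (in amino acids) stated in the Fig. 2C legend.
MIN_ORF_AA = 100


@dataclass
class GeneCandidate:
    """Hypothetical record for one gene candidate (field names are assumptions)."""
    name: str
    longest_orf_aa: int                  # length of the longest predicted ORF
    has_protein_domain: bool             # at least one protein domain detected
    matches_known_protein: bool = False  # assumed flag for already characterized proteins
    is_miscellaneous: bool = False       # spliceosome member, ribosomal RNA or pseudogene


def classify_biotype(candidate: GeneCandidate) -> str:
    """Assign a biotype label following one reading of the Fig. 2C legend.

    The order of the checks is an assumption: the legend lists the
    categories but does not say which rule takes precedence.
    """
    if candidate.is_miscellaneous:
        return "miscellaneous"
    if candidate.matches_known_protein:
        return "protein"
    if candidate.longest_orf_aa < MIN_ORF_AA:
        return "non-coding RNA"
    if candidate.has_protein_domain:
        return "putative protein"
    return "uncharacterized protein"


if __name__ == "__main__":
    # Toy candidates with made-up values, for illustration only.
    examples = [
        GeneCandidate("cand_A", 250, has_protein_domain=True),
        GeneCandidate("cand_B", 150, has_protein_domain=False),
        GeneCandidate("cand_C", 60, has_protein_domain=False),
    ]
    for cand in examples:
        print(cand.name, classify_biotype(cand))
```

In this reading, the ORF-length check runs before the domain check, so a short ORF is labelled non-coding RNA regardless of any domain hit; a different precedence would be equally compatible with the legend.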
RNA-seq read pair assignment against galGal5
Length coverage of gene candidates as compared to galGal5 reference genes
Comparison of galGal4 gene candidates to galGal5 reference genes