Nina F Ockendon1,2, Lauren A O'Connell3, Stephen J Bush1,2, Jimena Monzón-Sandoval1,2, Holly Barnes1,2, Tamás Székely1,2, Hans A Hofmann4, Steve Dorus5, Araxi O Urrutia1,2.
Abstract
Next-generation sequencing methods, such as RNA-seq, have permitted the exploration of gene expression in a range of organisms which have been studied in ecological contexts but lack a sequenced genome. However, the efficacy and accuracy of RNA-seq annotation methods using reference genomes from related species have yet to be robustly characterized. Here we conduct a comprehensive power analysis employing RNA-seq data from Drosophila melanogaster in conjunction with 11 additional genomes from related Drosophila species to compare annotation methods and quantify the impact of evolutionary divergence between the transcriptome and the reference genome. Our analyses demonstrate that, regardless of the level of sequence divergence, direct genome mapping (DGM), where transcript short reads are aligned directly to the reference genome, significantly outperforms the widely used de novo and guided assembly-based methods in both the quantity and accuracy of gene detection. Our analysis also reveals that DGM recovers a more representative profile of Gene Ontology functional categories, which are often used to interpret emergent patterns in genome-wide expression analyses. Lastly, analysis of available primate RNA-seq data demonstrates the applicability of our observations across diverse taxa. Our quantification of annotation accuracy and reduced gene detection associated with sequence divergence thus provides empirically derived guidelines for the design of future gene expression studies in species without sequenced genomes.
Keywords: Drosophila; RNA-seq; gene ontology; nonmodel species; primate; transcriptome assembly
Year: 2015 PMID: 26358618 PMCID: PMC4982090 DOI: 10.1111/1755-0998.12465
Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 7.090
Figure 1. Flow chart outlining pipelines for transcriptome annotation. De novo and reference sequence-guided transcriptome assembly strategies are shown alongside a simpler direct read-to-genome mapping approach where quality-controlled short transcriptome reads are aligned directly against the closest available annotated reference sequence. *Reference sequences used to guide transcriptome assembly, or onto which reads are mapped directly, may or may not be annotated. If they are not annotated, further information is required providing the coordinates of genomic features of interest. Boxes with squared corners indicate processes; boxes with rounded corners indicate data sets.
Reference genome species and their respective sequence divergence (total substitutions per site) from the transcriptome species
| Reference genome species | Divergence from transcriptome species (substitutions per site) |
|---|---|
| A: | |
| | 0.0972 |
| | 0.0952 |
| | 0.2265 |
| | 0.2149 |
| | 1.0991 |
| | 1.1619 |
| | 1.1705 |
| | 1.1895 |
| | 1.2243 |
| | 1.2315 |
| | 1.2268 |
Figure 2. Direct genome mapping (DGM) detects more genes than alternative assembly methods. The efficacy of each transcriptome annotation strategy at recovering genes was assessed using both the same reference species and reference sequences at increasing levels of sequence divergence. (A) Total numbers of genes detected by each strategy (complete stacks) when Drosophila melanogaster RNA-seq sequences are annotated using its own genome. DGM: direct genome mapping; DNT: de novo assembly using Trinity; DNV: de novo assembly using Velvet Oases; GGV: guided assembly using Velvet Columbus. Genes detected by single-match sequences are indicated by wide striped sections. (B) The proportion of orthologous genes that are detected (of the total orthologous genes in Drosophila melanogaster) at increasing levels of sequence divergence by DGM (stars), guided assemblies (diamonds), and de novo assembly using Velvet Oases (inverted triangles) or Trinity (filled circles).
Figure 3. Direct genome mapping (DGM) results in lower gene detection error than alternative assembly methods. Single-match sequences display significantly lower gene detection error compared to multimatch sequences. The proportion of orthologous genes incorrectly detected by (A) single-match sequences (unassembled reads or assembled contigs) and (B) multimatch sequences is the lowest for DGM, compared to the assembly methods. Results for DGM (stars), guided assemblies (diamonds), de novo assembly using Velvet Oases (inverted triangles) and de novo assembly using Trinity (filled circles) are indicated.
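The detection and error proportions plotted in Figures 2B and 3 can be illustrated with a small sketch. This record does not give the exact formulas, so the definitions below are assumptions: a gene counts as detected if at least one read assigned to a reference-species gene has an ortholog in the transcriptome species, and a detected gene counts as incorrect if no read was assigned to it when the same reads were annotated against the species' own genome. The function name, argument names and gene identifiers are hypothetical.

```python
def detection_and_error(benchmark_hits, reference_hits, ortholog_of):
    """Return (proportion detected, proportion of detected genes that are incorrect).

    benchmark_hits: read id -> gene assigned using the species' own genome
    reference_hits: read id -> gene assigned using the divergent reference
    ortholog_of:    reference-species gene -> orthologous gene in the
                    transcriptome species
    All three mappings are hypothetical inputs used for illustration only.
    """
    all_orthologs = set(ortholog_of.values())
    if not all_orthologs:
        raise ValueError("no orthologous genes supplied")

    # Translate hits on the divergent reference into transcriptome-species genes.
    detected = {
        ortholog_of[gene]
        for gene in reference_hits.values()
        if gene in ortholog_of
    }
    # Genes supported by the benchmark annotation (own-genome mapping).
    supported = set(benchmark_hits.values())

    # Assumed error definition: detected genes with no benchmark support.
    incorrect = detected - supported
    prop_detected = len(detected) / len(all_orthologs)
    prop_incorrect = len(incorrect) / len(detected) if detected else 0.0
    return prop_detected, prop_incorrect


# Toy example with made-up identifiers (not data from the study).
benchmark = {"r1": "geneA", "r2": "geneB", "r3": "geneC"}
reference = {"r1": "refA", "r2": "refB", "r3": "refD"}
orthologs = {"refA": "geneA", "refB": "geneB", "refC": "geneC", "refD": "geneD"}
prop_detected, prop_incorrect = detection_and_error(benchmark, reference, orthologs)
```

In the toy input, read `r3` lands on a different gene in the divergent reference than in the benchmark, so it contributes one detected but unsupported gene.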
Figure 4. Gene Ontology (GO) annotations for genes detected by different annotation methods. (A) Heatmap of the proportion of genes detected in Drosophila melanogaster by different annotation methods (DGM: direct genome mapping, DNT: de novo assembly using Trinity, DNV: de novo assembly using Velvet Oases, GGV: guided assembly using Velvet Columbus) relative to all protein-coding genes in that species. The total number of genes detected with each method in D. melanogaster is shown in brackets. (B) Heatmap of the proportion of genes detected using each alternative Drosophila genome relative to the genes detected in D. melanogaster by each annotation method. Colours tending towards black indicate a similar number of genes detected per GO slim term relative to the respective D. melanogaster background population, while colours tending towards white indicate a relative decrease in genes assigned per GO slim term (greater depletion). Drosophila phylogeny is shown at the top left, as available from FlyBase (www.flybase.org).
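The per-term proportions behind a heatmap of this kind can be sketched as follows. This is a minimal illustration, assuming that each cell is the number of detected genes annotated with a GO slim term divided by the number of background genes annotated with that term; the function name, gene identifiers and term labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def go_slim_proportions(detected_genes, background_genes, terms_of):
    """Proportion of background genes per GO slim term that were detected.

    terms_of maps a gene id to the set of GO slim terms annotated to it.
    """
    background = set(background_genes)
    # Only genes that belong to the background population are counted.
    detected = set(detected_genes) & background

    background_counts = Counter(
        term for gene in background for term in terms_of.get(gene, ())
    )
    detected_counts = Counter(
        term for gene in detected for term in terms_of.get(gene, ())
    )
    # A Counter returns 0 for terms with no detected genes.
    return {
        term: detected_counts[term] / total
        for term, total in background_counts.items()
    }


# Toy annotation with made-up genes and terms.
terms = {
    "g1": {"transport"},
    "g2": {"transport", "signaling"},
    "g3": {"signaling"},
    "g4": {"transport"},
}
proportions = go_slim_proportions({"g1", "g2"}, terms.keys(), terms)
```

Values near 1 correspond to terms whose genes are recovered about as well as in the background; lower values correspond to depleted terms.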
Figure 5. Direct genome mapping (DGM) characteristics in primates. (A) Proportion of human orthologues detected (of the total orthologous genes in human) by single-match reads when human RNA-seq reads were mapped to the alternative nonhuman primate genomes. (B) Proportion of orthologous genes incorrectly annotated, using the human genome annotation of reads as the benchmark.
Figure 6. Impact of sequence divergence on gene detection and misidentification rates using direct genome mapping (DGM): a reference guide for future studies. Trend lines display the proportion of Drosophila reference species genes detected by single-match reads (open triangles) and associated error rates (filled triangles) as a function of sequence divergence. Detection and error rates are estimated to be comparable at approximately 4.0 substitutions per site. The relationship between the efficacy of detection and inherent misidentification of genes can be of use to investigators in future experimental design and data analysis if they have an accurate estimate of nucleotide divergence between their study species and the reference genome they are utilizing.
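Estimating the divergence at which two trend lines meet can be sketched as below. This is an illustration only: the caption does not state the model behind the trend lines, so a straight-line least-squares fit is an assumption, and the data points are synthetic values chosen for the example, not measurements from the study. The names `fit_line` and `crossover` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossover(line_a, line_b):
    """x value at which two (slope, intercept) lines intersect."""
    (slope_a, intercept_a), (slope_b, intercept_b) = line_a, line_b
    if slope_a == slope_b:
        raise ValueError("parallel lines never cross")
    return (intercept_b - intercept_a) / (slope_a - slope_b)


# Synthetic, exactly linear points (NOT data from the study).
divergence = [0.0, 0.5, 1.0, 1.5]
detection = [1.0 - 0.2 * d for d in divergence]  # falls with divergence
error = [0.1 * d for d in divergence]            # rises with divergence

crossing = crossover(fit_line(divergence, detection), fit_line(divergence, error))
```

With real data the crossing point usually lies beyond the sampled divergences, so it is an extrapolation and depends heavily on the chosen trend model.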