André E Minoche1,2,3, Juliane C Dohm1,2,3,4, Jessica Schneider5, Daniela Holtgräwe5, Prisca Viehöver5, Magda Montfort2,3, Thomas Rosleff Sörensen5, Bernd Weisshaar6, Heinz Himmelbauer7,8,9,10.
Abstract
We develop a method to predict and validate gene models using PacBio single-molecule, real-time (SMRT) cDNA reads. Ninety-eight percent of full-insert SMRT reads span complete open reading frames. Gene model validation using SMRT reads is developed as an automated process. Optimized training and prediction settings, together with mRNA-seq noise reduction of the assisting Illumina reads, result in increased gene prediction sensitivity and precision. Additionally, we present an improved gene set for sugar beet (Beta vulgaris) and the first genome-wide gene set for spinach (Spinacia oleracea). The workflow and guidelines are a valuable resource for obtaining comprehensive gene sets for newly sequenced genomes of non-model eukaryotes.
Year: 2015 PMID: 26328666 PMCID: PMC4556409 DOI: 10.1186/s13059-015-0729-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Identification of full-insert cDNA sequences in SMRT sequencing data. Colors refer to the different types of sequences that can be encountered within the read data, that is, 5’ and 3’ cDNA synthesis primers, PacBio SMRT library preparation adapter, and cDNA sequences consisting of 5’ UTR, open reading frame (ORF), 3’ UTR, and poly(A) tail. Initially, reads were subclassified into two groups: SMRT reads consisting of several subreads (left) or individual subreads (right). Reads from both groups were error-corrected and used to identify full-length cDNA sequences.
Proportion of error-corrected SMRT reads containing cDNA primer, poly(A) tail, and canonical polyadenylation signal (AAUAAA). Low levels of the latter are expected, since RNA processing in plants generally shows a decreased dependence on the AAUAAA signal [36]
| Dataset | All | With primer (%) | With poly(A) (%) | With primer and poly(A) (%) | With poly(A) signal (%) |
|---|---|---|---|---|---|
| CCS 1-2 kb | 36,143 | 92.2 | 64.3 | 60.9 | 21.2 |
| CCS 2-3 kb | 20,795 | 86.4 | 86.4 | 79.3 | 27.3 |
| CCS >3 kb | 22,027 | 86.0 | 89.1 | 81.9 | 29.9 |
| Unmerged subreads 1-2 kb | 181,522 | 13.4 | 28.5 | 28.5 | 9.7 |
| Unmerged subreads 2-3 kb | 223,925 | 10.2 | 33.5 | 33.5 | 11.6 |
| Unmerged subreads >3 kb | 221,424 | 10.6 | 36.2 | 36.2 | 12.6 |
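The read classes counted in this table can be illustrated with a small sketch. It is not the authors' procedure: the minimum tail length of 15 A's, the 40-base search window, and the function name `classify_read` are all assumptions made for illustration, and the read is assumed to be in sense orientation with primers and adapters already trimmed. In cDNA the AAUAAA signal appears as AATAAA.

```python
import re

# Assumed threshold: a poly(A) tail is a run of at least 15 A's at the 3' end.
POLY_A_TAIL = re.compile(r"A{15,}$")
# AAUAAA written in DNA letters, as it appears in a cDNA read.
POLY_A_SIGNAL = "AATAAA"


def classify_read(read, window=40):
    """Return (has_poly_a_tail, has_signal) for one sense-oriented read.

    The signal is only searched for in the `window` bases directly
    upstream of the tail (window size is an assumption).
    """
    match = POLY_A_TAIL.search(read.upper())
    if match is None:
        return False, False
    tail_start = match.start()
    upstream = read.upper()[max(0, tail_start - window):tail_start]
    return True, POLY_A_SIGNAL in upstream


with_signal = "GATTACA" * 5 + "AATAAA" + "GCTAGCTAGCTAGCTAGC" + "A" * 20
tail_only = "GATTACA" * 5 + "A" * 20
no_tail = "GATTACAGATTACC"

for name, seq in [("with_signal", with_signal),
                  ("tail_only", tail_only),
                  ("no_tail", no_tail)]:
    print(name, classify_read(seq))
```

Counting the `True` values per column over a read set and dividing by the number of reads would give percentages of the kind listed in the table.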
SMRT reads covering full ORFs
| Dataset | SMRT reads overlapping with full-ORF sugar beet genes (a) | SMRT reads fully covering ORFs (%) | SMRT reads fully covering ORFs and at least 10 UTR bases (%) |
|---|---|---|---|
| CCS 1-2 kb | 17,717 | 94.5 | 94.2 |
| CCS 2-3 kb | 14,846 | 91.6 | 91.2 |
| CCS >3 kb | 17,706 | 92.8 | 92.5 |
| Unmerged subreads 1-2 kb | 45,678 | 41.5 | 40.7 |
| Unmerged subreads 2-3 kb | 70,792 | 33.6 | 32.9 |
| Unmerged subreads >3 kb | 77,803 | 34.6 | 33.9 |
(a) Interspecific comparison with four other eudicot plants resulted in 7,286 sugar beet genes with bona fide complete ORFs
Fig. 2 Transcript length distribution. a Length distribution of 29,831 transcript models supported by evidence previously annotated in the RefBeet-1.1 assembly [13]. b Length distribution of SMRT CCS representing full-length transcripts. c Length distribution of transcripts annotated in RefBeet-1.1 that were matched by CCS representing full-length transcripts.
Accuracy of SMRT transcript sequences before and after error correction using the proovread software
| Dataset | Number of sequences | Sequence accuracy before correction | Sequence accuracy after correction |
|---|---|---|---|
| CCS | 78,965 | 97.2 % | 99.0 % |
| Full-insert CCS | 56,546 | 97.4 % | 98.1 % |
| Unmerged subreads | 626,871 | 85.2 % | 95.9 % |
| Unmerged full-insert subreads | 53,374 | 86.7 % | 94.9 % |
Fig. 3 Alignment of full-insert SMRT sequences to identify reliable gene structures. Multiple independent SMRT reads derived from the same gene were used to (a) confirm genes previously predicted using AUGUSTUS default parameters and to (b) identify new gene models without prior annotation. Gene predictions were considered as validated if all aligning SMRT sequences indicated the same intron boundaries. For new gene models, the most abundant isoform per locus supported by at least two reads was reported. c Prediction artefact through intronic transposable elements and corrected prediction in BeetSet-2. Numbers next to gene names indicate the percentage of predicted gene features supported by expression evidence.
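The two rules stated in the Fig. 3 caption (a prediction is validated when all aligning SMRT reads show the same intron boundaries, and a new model is the most abundant isoform with at least two supporting reads) can be sketched as follows. This is a minimal illustration, not the published scripts: the exon-list representation and the function names are hypothetical, and each read is assumed to span the whole gene so that intron chains can be compared directly.

```python
from collections import Counter


def intron_chain(exons):
    """Turn a list of (start, end) exon intervals into a tuple of introns.

    Each intron is (end of one exon, start of the next exon). Outer exon
    ends are ignored, so reads with different UTR lengths still agree.
    """
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def is_validated(predicted_exons, read_alignments):
    """True if at least one read aligns and all reads show the predicted introns."""
    predicted = intron_chain(predicted_exons)
    chains = [intron_chain(read) for read in read_alignments]
    return bool(chains) and all(chain == predicted for chain in chains)


def most_abundant_isoform(read_alignments, min_reads=2):
    """Return the most frequent intron chain if >= min_reads reads support it."""
    counts = Counter(intron_chain(read) for read in read_alignments)
    if not counts:
        return None
    chain, support = counts.most_common(1)[0]
    return chain if support >= min_reads else None


# Hypothetical coordinates for illustration only.
prediction = [(100, 200), (300, 400), (500, 650)]
reads_at_known_locus = [
    [(90, 200), (300, 400), (500, 700)],
    [(100, 200), (300, 400), (500, 640)],
]
reads_at_new_locus = [
    [(10, 50), (80, 120)],
    [(12, 50), (80, 125)],
    [(10, 60), (80, 120)],
]

print(is_validated(prediction, reads_at_known_locus))
print(most_abundant_isoform(reads_at_new_locus))
```

Comparing intron chains rather than full exon coordinates is a design choice of this sketch: it tolerates variable transcript ends while still requiring identical splice sites.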
Fig. 4 Gene model validation. An initial gene set was calculated based on the sugar beet reference genome (RefBeet-1.2) and publicly available gene expression data [5, 13] using AUGUSTUS default parameters. Genes from the initial gene set were validated using PacBio SMRT sequences and by manual curation. Additional gene models were determined solely from SMRT full-insert sequences. The latter were included to train the parameter set for the final BeetSet-2 gene prediction.
Parameter training and results on ab initio performance (parameter evaluation on 542 test genes)
| Training genes | Setting | Exon sensitivity | Exon precision | Transcript sensitivity | Transcript precision | Sum | Rank | UTR base sensitivity | UTR base precision |
|---|---|---|---|---|---|---|---|---|---|
| 288 | Manual only | 0.759 | 0.486 | 0.356 | 0.201 | 1.802 | 9 | 0.484 | 0.359 |
| 800 | - | 0.800 | 0.509 | 0.424 | 0.233 | 1.966 | 7 | 0.509 | 0.350 |
| 1,200 | - | 0.808 | 0.511 | 0.445 | 0.244 | 2.008 | 3 | 0.502 | 0.358 |
| 1,200 | New species | 0.812 | 0.491 | 0.380 | 0.213 | 1.900 | 8 | 0.480 | 0.357 |
| 1,200 | SMRT only | 0.820 | 0.517 | 0.448 | 0.245 | 2.030 | 2 | 0.515 | 0.374 |
| 1,200 | No UTR | 0.808 | 0.511 | 0.445 | 0.244 | 2.008 | 3 | 0.502 | 0.358 |
| 1,200 | 3 opt. rounds | 0.808 | 0.511 | 0.445 | 0.244 | 2.008 | 3 | 0.502 | 0.358 |
| 1,200 | 6 opt. rounds | 0.808 | 0.511 | 0.445 | 0.244 | 2.008 | 3 | 0.502 | 0.358 |
| 2,000 | - | 0.830 | 0.515 | 0.458 | 0.248 | 2.051 | 1 | 0.508 | 0.363 |
| A. thaliana parameters | - | 0.810 | 0.391 | 0.384 | 0.151 | 1.736 | 10 | 0.623 | 0.287 |
Manual only: refers to 400 manually curated genes of which 288 were used for training and the remainder as test set. A. thaliana parameters: refers to the default A. thaliana parameters as provided by Stanke et al. Rank: calculated from the sum of the sensitivity and the precision for exons, transcripts, and UTR bases. New species: refers to calculating B. vulgaris parameters from scratch using ‘new_species.pl’ (part of the AUGUSTUS pipeline). Opt. round: refers to the number of optimization rounds when running optimize_augustus.pl; the default was nine rounds. SMRT only: refers to training including only SMRT-validated genes. No UTR: refers to not setting the ‘--UTR=on’ parameter when using optimize_augustus.pl
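As a worked check of the Sum and Rank columns, the sketch below recomputes them from the table values. It rests on an observation rather than a stated definition: the printed Sum values match the sum of exon-level and transcript-level sensitivity and precision (UTR columns not included), and ranks follow competition ranking, where tied rows share a rank. The row labels and function names are illustrative. The 'New species' row sums to 1.896 rather than the printed 1.900, presumably because the printed inputs are rounded.

```python
# (exon sens., exon prec., transcript sens., transcript prec.) copied from the table.
rows = {
    "288 manual only": (0.759, 0.486, 0.356, 0.201),
    "800": (0.800, 0.509, 0.424, 0.233),
    "1,200": (0.808, 0.511, 0.445, 0.244),
    "1,200 new species": (0.812, 0.491, 0.380, 0.213),
    "1,200 SMRT only": (0.820, 0.517, 0.448, 0.245),
    "1,200 no UTR": (0.808, 0.511, 0.445, 0.244),
    "1,200 3 opt. rounds": (0.808, 0.511, 0.445, 0.244),
    "1,200 6 opt. rounds": (0.808, 0.511, 0.445, 0.244),
    "2,000": (0.830, 0.515, 0.458, 0.248),
    "A. thaliana parameters": (0.810, 0.391, 0.384, 0.151),
}


def summed_score(metrics):
    """Sum of the four metrics, rounded to three decimals like the table."""
    return round(sum(metrics), 3)


def competition_ranks(scores):
    """Rank 1 is the highest score; tied scores share the same rank."""
    return {
        name: 1 + sum(other > score for other in scores.values())
        for name, score in scores.items()
    }


scores = {name: summed_score(metrics) for name, metrics in rows.items()}
ranks = competition_ranks(scores)

for name in sorted(rows, key=lambda n: ranks[n]):
    print(f"{ranks[name]:>2}  {scores[name]:.3f}  {name}")
```

Because four 1,200-gene settings have identical values, they tie at rank 3 and the next rank assigned is 7, which agrees with the Rank column.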
Sensitivity and precision of predicted genes after applying different settings
| Setting | Sensitivity (%) (a) | Precision (%) (b) |
|---|---|---|
| Default | 71.7 | 42.4 |
| + Training on SMRT and manually validated genes (c) | 82.3 | 58.9 |
| + Hint masking | 82.9 | 62.3 |
| + Hint masking enforcement | 83.6 | 71.8 |
| + Additional mRNA-seq hints | 76.4 | 44.6 |
| + mRNA-seq noise reduction | 84.7 | 73.5 |
| + Higher weighting of introns | 85.0 | 73.9 |
| + SMRT reads as anchors (d) | 91.1 | 77.9 |
Settings marked by ‘+’ were added to the setting of the previous line
(a) Percent of correctly predicted transcripts in the set of SMRT derived test genes not overlapping the training gene set
(b) Percent of wrongly predicted genes of all correctly and wrongly predicted gene models in genomic regions of SMRT derived test genes
(c) Training based on SMRT and manually validated genes
(d) ‘SMRT reads as anchors’ only affected genes covered by SMRT sequences
Expression support of sugar beet genes
| Source | Input sequences | Supported genes |
|---|---|---|
| ESTs | 35,523 | 10,222 |
| Roche/454 sequences | 282,169 | 12,681 |
| SMRT full-insert | 109,793 | 3,874 |
| KWS2320 mRNA-seq (all reads) | 923.8 million | 26,369 |
| KWS2320 salt stress | 86.2 million | 21,974 |
| KWS2320 heat stress | 91.6 million | 22,166 |
| KWS2320 light stress | 130.0 million | 23,041 |
| Sum | 924.2 million | 26,409 |
Supported genes: genes partially or completely supported by expression evidence
Fig. 5 mRNA-seq coverage of sugar beet genes. Each dot represents one sugar beet gene. x-axis: mRNA-seq coverage as in the annotation based on the RefBeet-1.1 assembly; y-axis: mRNA-seq coverage for BeetSet-2 genes. The mRNA-seq data used in the RefBeet-1.1 annotation consisted chiefly of Illumina reads from genotype KWS2320, plus reads from other accessions (total amount: 616.3 million reads). The mRNA-seq data used to generate BeetSet-2 included KWS2320 reads plus isogenic reads from plants grown under stress conditions and their controls (total amount: 923.8 million reads). The overall mRNA-seq coverage increased in BeetSet-2, which improved the prediction of lowly expressed genes.
Final parameter training using 2,794 training genes and results on ab initio performance (parameter evaluation on 349 test genes)
| Training parameter | Exon sensitivity | Exon precision | Transcript sensitivity | Transcript precision | Sum | Rank | UTR base sensitivity | UTR base precision |
|---|---|---|---|---|---|---|---|---|
| | 0.810 | 0.527 | 0.368 | 0.192 | 1.899 | 2 | 0.678 | 0.281 |
| | 0.842 | 0.664 | 0.461 | 0.341 | 2.308 | 1 | 0.547 | 0.508 |
Number of predicted genes in sugar beet and spinach
| Expression evidence | Sugar beet: annotation in RefBeet-1.1 | Sugar beet: BeetSet-2 | Spinach: SpiSet-1 |
|---|---|---|---|
| 100 % evidence | 16,508 | 17,434 | 12,664 |
| 1-99 % evidence | 15,556 | 8,975 | 7,868 |
| 0 % evidence | 80,439 | 13,181 | 19,777 |
| 0 % evidence and 1:1 orthology (a) | n.a. | 514 | 1,171 |
| Final gene number | 27,421 (b) | 26,923 (c) | 21,703 (c) |
(a) One-to-one orthology identified between BeetSet-2 and SpiSet-1 predictions
(b) Genes annotated in RefBeet-1.1 with 1-100 % expression evidence excluding those with transposable element homology
(c) Sum of genes with 1-100 % expression support and one-to-one orthologs
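The final gene numbers for BeetSet-2 and SpiSet-1 can be reproduced from the table by the rule in footnote c: genes with 100 % evidence, plus genes with 1-99 % evidence, plus genes without evidence that have a one-to-one ortholog. The short sketch below performs that arithmetic; the dictionary keys and function name are illustrative.

```python
# Counts copied from the table above.
gene_sets = {
    "BeetSet-2": {
        "full_evidence": 17434,
        "partial_evidence": 8975,
        "orthologs_without_evidence": 514,
    },
    "SpiSet-1": {
        "full_evidence": 12664,
        "partial_evidence": 7868,
        "orthologs_without_evidence": 1171,
    },
}


def final_gene_number(counts):
    """Genes with 1-100 % expression support plus one-to-one orthologs."""
    return (counts["full_evidence"]
            + counts["partial_evidence"]
            + counts["orthologs_without_evidence"])


for name, counts in gene_sets.items():
    print(f"{name}: {final_gene_number(counts):,}")
```

For BeetSet-2 the first two terms alone give 26,409, the number of supported genes reported in the expression support table.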
Fig. 6 Workflow of our analyses to improve eukaryotic gene predictions, including the scripts that are part of this publication (highlighted orange). Input and output data are highlighted in bold lettering.