Alison C Testa1,2, James K Hane3, Simon R Ellwood4, Richard P Oliver5.
Abstract
BACKGROUND: The impact of gene annotation quality on functional and comparative genomics makes gene prediction an important process, particularly in non-model species, including many fungi. Sets of homologous protein sequences are rarely complete with respect to the fungal species of interest and are often small or unreliable, especially when closely related species have not been sequenced or annotated in detail. In these cases, protein homology-based evidence fails to correctly annotate many genes or to significantly improve ab initio predictions. Generalised hidden Markov models (GHMMs) have proven to be invaluable tools in gene annotation and, recently, RNA-seq has emerged as a cost-effective means to significantly improve the quality of automated gene annotation. As these methods do not require sets of homologous proteins, improving gene prediction from these resources is of benefit to fungal researchers. While many pipelines now incorporate RNA-seq data in training GHMMs, there has been relatively little investigation into additionally combining RNA-seq data at the point of prediction, and room for improvement in this area motivates this study.
Year: 2015 PMID: 25887563 PMCID: PMC4363200 DOI: 10.1186/s12864-015-1344-4
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. CodingQuarry flow diagram. Examples are shown of correct annotations of coding sequences (A) and of a typical CodingQuarry input, assembled transcripts aligned to the genome (B). The stages used within CodingQuarry to predict coding sequences are shown (C-G). Firstly, coding sequences are predicted from transcript sequences (introns are removed) using a GHMM (C). Possible prediction errors after this step are coloured red, and notes show how these are identified (D). These error-prone predicted genes are discarded (E), and regions are selected for prediction from genome sequence (F). The resulting prediction is output by CodingQuarry (G), which merges the retained predictions from transcript sequences (E) with the predictions from selected areas of the genome sequence (F). Sections of the example genome sequence and annotations have been labelled i-x in each part of the diagram (A-G) and marked with vertical dotted lines. These sections are labelled to facilitate in-text references to the diagram in the Implementation section of this manuscript. Labels i-x correspond to the same genome sections through A-G.
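The merge described for stages E to G can be illustrated with a minimal sketch. It is not CodingQuarry's implementation: the `Gene` record, the `error_prone` flag and the rule that a genome-stage gene is accepted only where it overlaps no retained transcript-stage gene are all assumptions made for illustration, standing in for the region selection shown in panel F.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    seqid: str
    start: int
    end: int
    strand: str
    error_prone: bool = False  # assumed flag set by the checks in panel D


def overlaps(a: Gene, b: Gene) -> bool:
    """True if two genes share at least one base on the same sequence."""
    return a.seqid == b.seqid and a.start <= b.end and b.start <= a.end


def merge_predictions(transcript_genes, genome_genes):
    """Combine transcript-stage and genome-stage predictions (sketch only)."""
    # (E) discard transcript-stage genes flagged as likely errors
    retained = [g for g in transcript_genes if not g.error_prone]
    # (F) assumed rule: keep genome-stage genes only in regions
    # not already covered by a retained transcript-stage gene
    added = [g for g in genome_genes
             if not any(overlaps(g, r) for r in retained)]
    # (G) merged output, ordered by position
    return sorted(retained + added, key=lambda g: (g.seqid, g.start))


stage1 = [Gene("chr1", 100, 500, "+"),
          Gene("chr1", 700, 1200, "+", error_prone=True)]
stage2 = [Gene("chr1", 650, 1250, "+"), Gene("chr1", 300, 450, "+")]
merged = merge_predictions(stage1, stage2)
```

In this toy input the flagged gene at 700-1200 is dropped and replaced by the genome-stage gene at 650-1250, while the genome-stage gene at 300-450 is rejected because it overlaps a retained transcript-stage gene.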
Comparisons between predictions and high-confidence gene sets for Sc. pombe and S. cerevisiae
| | Nucleotide Sn | Nucleotide Sp | Exon Sn | Exon Sp | Intron Sn | Intron Sp | Gene Sn | Gene Sp |
|---|---|---|---|---|---|---|---|---|
| | | | | | | | | |
| CodingQuarry | | | | | 94.5 | 96.7 | | |
| AUGUSTUS | 99.2 | 99.1 | 92.0 | 91.4 | | 92.6 | 86.9 | 88.9 |
| TransDecoder | 95.4 | 99.3 | 84.5 | 86.3 | 88.5 | | 80.2 | 73.5 |
| | | | | | | | | |
| CodingQuarry | | | | 90.0 | | 67.6 | | 91.1 |
| AUGUSTUS | 97.5 | 99.7 | 84.7 | | 74.4 | | 85.0 | |
| TransDecoder | 92.2 | 99.5 | 79.9 | 74.8 | 73.9 | 67.4 | 80.1 | 68.0 |
Sensitivity (Sn) is the proportion of a given feature (nucleotides/exons/introns/genes) in the high-confidence set that are correctly predicted. Specificity (Sp) is the proportion of features in the predicted set that are correct. Sensitivity and specificity calculations for nucleotides are made on nucleotides within coding regions. Further descriptions of these measures are given in the Implementation subsection titled “Quantifying prediction results”. The highest scores in each column for Sc. pombe and S. cerevisiae are shown in boldface.
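The two measures defined above can be written as a short sketch. It assumes that a feature counts as correct only when it matches a reference feature exactly, and that features are represented as hashable tuples; both are illustrative choices, and the exact matching rules are those given in the “Quantifying prediction results” subsection. The coordinates below are made up.

```python
def sensitivity_specificity(reference, predicted):
    """Return (Sn, Sp) as percentages for two collections of features.

    Sn = correct / number of reference features
    Sp = correct / number of predicted features
    A feature is 'correct' here only if it is identical in both sets
    (an assumption made for this sketch).
    """
    reference, predicted = set(reference), set(predicted)
    correct = reference & predicted
    sn = 100.0 * len(correct) / len(reference) if reference else 0.0
    sp = 100.0 * len(correct) / len(predicted) if predicted else 0.0
    return sn, sp


# Hypothetical exons as (sequence, start, end, strand)
reference_exons = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 650, "-"),
    ("chr1", 800, 900, "-"),
}
predicted_exons = {
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 410, "+"),  # wrong 3' boundary, so not counted as correct
    ("chr1", 500, 650, "-"),
}

sn, sp = sensitivity_specificity(reference_exons, predicted_exons)
print(f"Sn = {sn:.1f}%, Sp = {sp:.1f}%")  # Sn = 50.0%, Sp = 66.7%
```

The same function applies to nucleotides, introns or genes once each is expressed as a set of comparable items, for example a set of coding positions for the nucleotide-level measure.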
Whole-genome comparisons between predictions and current Sc. pombe and S. cerevisiae annotations
| | Nucleotide Sn | Nucleotide Sp | Exon Sn | Exon Sp | Intron Sn | Intron Sp | Gene Sn | Gene Sp |
|---|---|---|---|---|---|---|---|---|
| | | | | | | | | |
| CodingQuarry | | 98.9 | | 89.4 | 92.6 | 95.2 | | 83.0 |
| AUGUSTUS | 98.0 | | 89.0 | | | 92.7 | 83.1 | |
| TransDecoder | 93.4 | 99.2 | 80.8 | 85.4 | 85.3 | | 76.3 | 72.5 |
| | | | | | | | | |
| CodingQuarry | | 99.5 | | 87.2 | | 65.8 | | 88.3 |
| AUGUSTUS | 95.4 | 99.6 | 71.1 | | 60.5 | | 71.5 | |
| TransDecoder | 87.8 | | 67.1 | 75.0 | 60.5 | 70.1 | 67.8 | 68.0 |
Sensitivity (Sn) is the proportion of a given feature (nucleotides/exons/introns/genes) in the annotation that are correctly predicted. Specificity (Sp) is the proportion of features in the predicted set that are correct. Sensitivity and specificity calculations for nucleotides are made on nucleotides within coding regions. Further descriptions of these measures are given in the Implementation subsection titled “Quantifying prediction results”. The highest scores in each column for Sc. pombe and S. cerevisiae are shown in boldface.
Figure 2. Changes in CodingQuarry gene prediction accuracy at various stages of prediction. The gene-level sensitivity and specificity are shown at various stages (see Figure 1 and Methods) within a CodingQuarry run. Results show comparisons with Sc. pombe where (A, left-hand panel) RNA-seq strand information was used and (B, right-hand panel) strand information was ignored. "Longest ORF" is the initial training set, found by taking the longest open reading frame in each transcript to be a gene. Stage 1 predictions are made from transcript sequences. Stage 2 adds to, and replaces some of, the stage 1 predictions by predicting from genome sequence. Filtering of likely false-positive genes (see Implementation section) takes place before a set of predicted genes is output as the “final output”. This output is the annotation generated by CodingQuarry.
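The "Longest ORF" starting point can be sketched as follows. This is an illustration under stated assumptions, not CodingQuarry's code: an ORF is taken to run from an ATG to the first in-frame stop codon (TAA, TAG or TGA), only complete ORFs are considered, and the function names and the `stranded` switch are hypothetical. When strand information is ignored, the reverse complement of the transcript is searched as well.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_orf_forward(seq):
    """Longest ATG-to-stop ORF in the three forward frames.

    Returns (start, end) as 0-based, end-exclusive coordinates that
    include the stop codon, or None if no complete ORF exists.
    """
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for a further ORF in this frame
    return best


def longest_orf(transcript, stranded=True):
    """Return (strand, start, end) of the longest ORF, or None.

    For strand '-', coordinates refer to the reverse-complemented
    transcript rather than to the input sequence.
    """
    seq = transcript.upper()
    candidates = [("+", longest_orf_forward(seq))]
    if not stranded:
        candidates.append(("-", longest_orf_forward(reverse_complement(seq))))
    candidates = [(s, c) for s, c in candidates if c is not None]
    if not candidates:
        return None
    strand, (start, end) = max(candidates, key=lambda sc: sc[1][1] - sc[1][0])
    return strand, start, end


transcript = "CCATGAAATTTGGGTAGCC"  # made-up sequence
orf = longest_orf(transcript)
```

For the made-up transcript above, the only complete ORF is ATG AAA TTT GGG TAG, so `orf` is `("+", 2, 17)`.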