Lex Overmars, Roland J Siezen, Christof Francke.
Abstract
The identification of translation initiation sites (TISs) is an important aspect of sequence-based genome analysis. An erroneous TIS annotation can impair the identification of regulatory elements and N-terminal signal peptides, and may also compromise the determination of descent for any particular gene. We have formulated a reference-free method to score TIS annotation quality. The method is based on a comparison of the observed and expected distributions of all TISs in a particular genome, given prior gene calling. We have assessed the TIS annotations for all available NCBI RefSeq microbial genomes and found that approximately 87% are of appropriate quality, whereas 13% need substantial improvement. We have analyzed a number of factors that could affect TIS annotation quality, such as GC-content, taxonomy, the fraction of genes with a Shine-Dalgarno sequence, and the year of publication. The analysis showed that only the first factor has a clear effect. We then formulated a straightforward Principal Component Analysis (PCA)-based TIS identification strategy to self-organize and score potential TISs. The strategy is independent of reference data and a priori calculations. A representative set of 277 genomes was subjected to the analysis, and we found a clear increase in TIS annotation quality for the genomes with a low quality score. The PCA-based annotation was also compared with the annotation produced by the current tool of reference, Prodigal. The comparison for the model genome of Escherichia coli K12 showed that the two methods supplement each other and that prediction agreement can be used as an indicator of a correct TIS annotation. Importantly, the data suggest that adding a PCA-based strategy to a Prodigal prediction can be used to 'flag' TIS annotations for re-evaluation and, in addition, to evaluate a given annotation when a Prodigal annotation is lacking.
Year: 2015 PMID: 26204119 PMCID: PMC4512697 DOI: 10.1371/journal.pone.0133691
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Three typical distributions of alternative start codons found for genomes in the NCBI RefSeq database.
(A) The distribution of alternative starts in Escherichia coli K12 MG1655; (B) Bacillus thuringiensis str. Al Hakam; and (C) Acinetobacter baumannii ATCC 17978. For all ORFs that included an annotated gene and TIS, the total number of alternative start codons was counted for each codon position relative to the annotated translation start. The green line represents the expected distribution as determined using formula 1. In genomes that adhere to Fig 1A, the observed and expected distributions are alike, whereas for genomes that adhere to B or C, the observed distribution of alternative start codons given the annotation clearly deviates from the expected distribution (green line). A comparison of the observed and expected distributions provides an inherent quality measure for genome-wide gene-prediction accuracy.
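The counting step described in this caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the start codon set (ATG, GTG, TTG), the stop codon set, the forward-strand-only handling, the `max_codons` window and the function names are all assumptions made for the example.

```python
from collections import Counter

START_CODONS = {"ATG", "GTG", "TTG"}  # assumed set of start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}   # assumed set of stop codons


def alternative_start_offsets(seq, tis, max_codons=50):
    """Return codon offsets, relative to the annotated TIS at index `tis`,
    of in-frame alternative start codons (forward strand only)."""
    offsets = []
    # Upstream: step back one codon at a time until an in-frame stop
    # codon (or the start of the sequence) closes the ORF.
    for k in range(1, max_codons + 1):
        pos = tis - 3 * k
        if pos < 0:
            break
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            offsets.append(-k)
    # Downstream: step forward inside the gene until a stop codon
    # (or the end of the sequence) is reached.
    for k in range(1, max_codons + 1):
        pos = tis + 3 * k
        codon = seq[pos:pos + 3]
        if len(codon) < 3 or codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            offsets.append(k)
    return offsets


def observed_distribution(genes, max_codons=50):
    """Sum alternative-start counts per relative codon position over
    an iterable of (sequence, annotated_tis_index) pairs."""
    counts = Counter()
    for seq, tis in genes:
        counts.update(alternative_start_offsets(seq, tis, max_codons))
    return counts
```

The resulting per-position counts correspond to the observed distribution; the expected distribution (formula 1) is not reproduced in this record and would have to be supplied separately.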
Fig 2. Correlation coefficients between observed alternative start frequencies and expected alternative start frequencies for microbial genomes.
(A) Spearman’s rho coefficients for all bacterial RefSeq genomes with > 500 ORFs. (B) Spearman’s rho coefficients for all Archaeal RefSeq genomes with > 500 ORFs.
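As an illustration of the quality measure, the sketch below computes Spearman's rho between an observed and an expected frequency vector using only the standard library. The function names are hypothetical, and the expected frequencies are taken as given input because formula 1 is not reproduced in this record.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(observed, expected):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(observed) != len(expected) or len(observed) < 2:
        raise ValueError("need two equally long vectors with at least 2 values")
    rx, ry = average_ranks(observed), average_ranks(expected)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined for a constant vector")
    return cov / (vx * vy) ** 0.5
```

A rho close to 1 would indicate that the observed alternative-start frequencies follow the expected ones, which is the situation the caption of Fig 1A describes.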
Fig 3. Effects of year of sequencing, GC-content and taxonomy on TIS-prediction accuracy.
The boxplots show the distribution of the calculated correlation values (between the observed and expected distribution of alternative TISs) (Y axis) for: (A) all bacterial and archaeal RefSeq genomes grouped by year of sequencing (NCBI Bioproject data; [38]); (B) The RefSeq genomes grouped into 6 bins according to their GC%; (C) The RefSeq genomes grouped according to phylum; and (D) 277 selected bacterial and archaeal genomes with varying SD-index (proportion of Shine-Dalgarno sequence-preceded genes) [4].
TIS annotation for E. coli K12 MG1655.
The NCBI RefSeq file contained 4141 annotated genes. The position of the TISs was compared between the PCA-based prediction, the Prodigal-based prediction and the RefSeq annotation. Recently, the EcoGene annotation has been updated and 13 TISs have been adjusted (b0259, b0552, b0656, b1994, b2030, b2192, b3218, b3505, b4543, b2803, b1331, b2982 and b3093). The adaptations were compared to the PCA-based and Prodigal-based predictions.
| Annotation consistency | Total | Verified set | EcoGene adjusted | EcoGene adjustment |
|---|---|---|---|---|
| RefSeq = PCA = Prodigal | 83% (3418) | 88.4% (811) | 1 | 12 nt upstream (b4543) |
| (RefSeq = Prodigal) ≠ PCA | 9.8% (406) | 7.8% (71) | 0 | |
| (RefSeq = PCA) ≠ Prodigal | 4% (173) | 2.2% (20) | 0 | |
| RefSeq ≠ (PCA = Prodigal) | 2% (88) | 1.4% (13) | 12 | All in agreement with PCA = Prodigal |
| RefSeq ≠ PCA ≠ Prodigal | 1% (54) | 0.2% (2) | 0 | |
(a) The majority of TISs that differ between the PCA-based and Prodigal-based annotations are located close to the RefSeq TIS. For the PCA-based predictions, 548 were not in agreement with RefSeq; 199 of these were within 30 nt distance and 56 at 3 nt distance. For the Prodigal predictions, 241 (6%) were not in agreement with RefSeq (and 74 (2%) were missed); 96 of these were within 30 nt distance and 30 at 3 nt distance.
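The five consistency classes in the table can be derived from per-gene TIS positions along these lines. This is a sketch under stated assumptions: the function names, the gene identifiers and the coordinates in the usage example are invented for illustration, and a gene missed by a predictor is represented here as `None`.

```python
from collections import Counter


def consistency_class(refseq, pca, prodigal):
    """Classify one gene by which of its three TIS positions agree.
    A position of None (prediction missing) never equals a real position."""
    if refseq == pca == prodigal:
        return "RefSeq = PCA = Prodigal"
    if refseq == prodigal:
        return "(RefSeq = Prodigal) ≠ PCA"
    if refseq == pca:
        return "(RefSeq = PCA) ≠ Prodigal"
    if pca == prodigal:
        return "RefSeq ≠ (PCA = Prodigal)"
    return "RefSeq ≠ PCA ≠ Prodigal"


def tally_consistency(annotations):
    """Count genes per class; `annotations` maps a gene identifier to a
    (refseq_tis, pca_tis, prodigal_tis) tuple of genome coordinates."""
    return Counter(consistency_class(*tis) for tis in annotations.values())


# Usage with made-up identifiers and coordinates:
example = {
    "geneA": (100, 100, 100),
    "geneB": (250, 262, 262),
    "geneC": (400, 388, None),  # hypothetical gene missed by Prodigal
}
counts = tally_consistency(example)
```

Dividing each count by the number of annotated genes gives percentages of the kind shown in the "Total" column.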
Fig 4. (A) The relative position of PCA-based TIS annotations that deviate from the RefSeq annotation for E. coli MG1655.
(B) The effect of sequence vector length on the number of matching PCA-based and RefSeq TIS annotations in E. coli K12 MG1655 and B. subtilis 168. The following vector lengths were compared (denoted as length upstream in nt & length downstream in nt): i) 60 & 60, ii) 36 & 36, iii) 30 & 24, iv) 30 & 18, v) 24 & 30, vi) 24 & 24, vii) 24 & 18, viii) 18 & 30, ix) 18 & 24 and x) 18 & 18.
Fig 5. A comparison of TIS prediction accuracy between RefSeq, PCA-based and Prodigal annotation.
Scatterplot of the correlation between observed and expected alternative start codon frequencies (i.e., the TIS annotation quality measure) for the original TIS annotation as found in the RefSeq database (Y axis) and the adjusted annotations (X axis) based on (A) our iterative PCA pipeline and (B) Prodigal. (C) Scatterplot for PCA-based annotation versus Prodigal. The color scale represents the GC% of the corresponding genome (blue: high, green: average, red: low).