Bert Ely, LaTia Etheredge Scott.
Abstract
Bacterial genome annotations are accumulating rapidly in the GenBank database, and the use of automated annotation technologies to create these annotations has become the norm. However, these automated methods commonly result in a small but significant percentage of genome annotation errors. To improve accuracy and reliability, we analyzed the Caulobacter crescentus NA1000 genome using the computer programs Artemis and MICheck to manually examine the third codon position GC content, alignment to a third codon position GC frame plot peak, and matches in the GenBank database. We identified 11 new genes, modified the start site of 113 genes, and changed the reading frame of 38 genes that had been incorrectly annotated. Furthermore, our manual method of identifying protein-coding genes allowed us to remove 112 non-coding regions that had been designated as coding regions. The improved NA1000 genome annotation resulted in a reduction in the use of rare codons, since non-coding regions with atypical codon usage were removed from the annotation and 49 new coding regions were added to the annotation. Thus, a more accurate codon usage table was generated as well. These results demonstrate that comparing the location of peaks in third codon position GC content to the location of protein-coding regions could be used to verify the annotation of any genome that has a GC content greater than 60%.
Year: 2014 PMID: 24621776 PMCID: PMC3951458 DOI: 10.1371/journal.pone.0091668
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Screen shot of Artemis showing the GC frame plot and a wrong reading frame.
The GC frame plot shows a sliding window of the third codon position GC content for the three forward reading frames. The red line in the GC frame plot corresponds to the +1 reading frame, the green line to the +2 reading frame, and the blue line to the +3 reading frame. The three reverse reading frames show the same pattern, with the blue, red, and green lines corresponding to the −1, −2, and −3 reading frames, respectively. Gene CCNA_01867 (blue bar) is in the wrong reading frame in the 23-DEC-2012 NA1000 annotation. The −3 open reading frame highlighted in pink is the corrected reading frame for gene CCNA_01867.
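The quantity plotted in a GC frame plot can be approximated with a few lines of Python. The sketch below is not Artemis's implementation; the function names, the 120-base window, and the toy sequence are illustrative assumptions. It computes, for each forward frame, the fraction of G or C at third codon positions inside a sliding window. Reverse frames could be handled by running the same function on the reverse complement.

```python
def gc3_frame_plot(seq, window=120, step=3):
    """Sliding-window GC fraction at third codon positions, per forward frame.

    Returns {frame: [(window_start, gc3_fraction), ...]} for frames 1, 2, 3.
    Frame f holds codons starting at 0-based offsets f-1, f+2, ..., so its
    third codon positions are the indices i with i % 3 == (f + 1) % 3.
    The window size is an arbitrary choice and must be at least 3.
    """
    seq = seq.upper()
    plot = {1: [], 2: [], 3: []}
    for start in range(0, len(seq) - window + 1, step):
        for frame in plot:
            thirds = [seq[i] for i in range(start, start + window)
                      if i % 3 == (frame + 1) % 3]
            gc = sum(base in ("G", "C") for base in thirds)
            plot[frame].append((start, gc / len(thirds)))
    return plot


def best_frame(plot, index):
    """Frame with the highest GC3 value at the given window index."""
    return max(plot, key=lambda frame: plot[frame][index][1])


# Toy input (not genome data): every codon in frame +1 is ATC,
# so only frame +1 carries G/C at the third codon position.
toy = "ATC" * 40
plot = gc3_frame_plot(toy, window=60)
print(best_frame(plot, 0))
```

In a high-GC genome, a coding region is expected to coincide with a peak in the line for its own frame, which is the criterion the figure illustrates.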
C. crescentus NA1000 genes with a changed reading frame.
| Gene | Genome Coordinates | Gene product | Source of matching genes |
| CCNA_00513 | 527386..527931 | Conserved hypothetical protein | Non-Caulobacter |
| CCNA_00581 | 609290..611077c | Conserved hypothetical protein | Non-Caulobacter |
| CCNA_00599 | 635020..635670 | Conserved hypothetical protein | Caulobacter |
| CCNA_00702 | 761965..762321 | Conserved hypothetical protein | Caulobacter |
| CCNA_00713 | 770295..770444 | Hypothetical protein | Caulobacter |
| CCNA_00786 | 850119..850442c | Hypothetical protein | Caulobacter |
| CCNA_00868 | 946717..947265 | Conserved hypothetical protein | Caulobacter |
| CCNA_01127 | 1231609..1232985 | Conserved hypothetical protein | Caulobacter |
| CCNA_01150 | 1254292..1254639c | Conserved hypothetical protein | Caulobacter |
| CCNA_01265 | 1395129..1395539c | Transposase | Non-Caulobacter |
| CCNA_01293 | 1418452..1418862 | Transposase | Non-Caulobacter |
| CCNA_01411 | 1529256..1529711 | Conserved hypothetical protein | Caulobacter |
| CCNA_01435 | 1549757..1550008c | Transglycosylase associated protein | Caulobacter |
| CCNA_01518 | 1627542..1628183c | Metal dependent phosphohydrolase | Caulobacter |
| CCNA_01720 | 1848134..1848523 | Conserved hypothetical protein | Caulobacter |
| CCNA_01867 | 2003818..2005050 | Conserved hypothetical protein | Caulobacter |
| CCNA_01871 | 2009957..2010232c | Hypothetical protein | Non-significant |
| CCNA_02079 | 2230242..2230706 | Conserved hypothetical protein | Caulobacter |
| CCNA_02114 | 2263326..2263493c | Hypothetical protein | Non-significant |
| CCNA_02168 | 2321658..2321954c | Conserved hypothetical protein | Caulobacter |
| CCNA_02323 | 2465238..2465669 | No database match | Non-significant |
| CCNA_02393 | 2536532..2536966 | Limonene-1,2-epoxide hydrolase | Caulobacter |
| CCNA_02524 | 267256..2673444c | Conserved hypothetical protein | Non-Caulobacter |
| CCNA_02536 | 2684042..2684413 | Conserved hypothetical protein | Caulobacter |
| CCNA_02585 | 2733698..2734024 | Conserved hypothetical protein | Non-Caulobacter |
| CCNA_02871 | 3021200..3021598c | Gene transfer agent (GTA)-like protein | Caulobacter |
| CCNA_02880 | 3027016..3028377c | Phage DNA packaging protein | Caulobacter |
| CCNA_02968 | 3125241..3125573c | Conserved hypothetical protein | Non-Caulobacter |
| CCNA_02990 | 3144126..3144536 | Transposase | Non-Caulobacter |
| CCNA_02998 | 3153124..3153330 | Conserved hypothetical protein | Caulobacter |
| CCNA_03251 | 3422211..3423167c | MarR family transcriptional regulator | Caulobacter |
| CCNA_03361 | 3539803..3540480c | Conserved hypothetical protein | Non-Caulobacter |
| CCNA_03411 | 3578916..3579266 | Conserved hypothetical protein | Caulobacter |
| CCNA_03427 | 3592615..3592957c | Conserved hypothetical protein | Caulobacter |
| CCNA_03470 | 3636805..3637563c | Conserved hypothetical protein | Caulobacter |
| CCNA_03566 | 3720403..3720825c | Conserved hypothetical protein | Caulobacter |
| CCNA_03654 | 3815982..3816086 | Conserved hypothetical protein | Caulobacter |
| CCNA_03785 | 3952640..3953299 | Conserved hypothetical protein | Caulobacter |
*A lower case “c” indicates that the coding sequence is on the complementary strand of the DNA.
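For anyone processing these tables programmatically, the coordinate notation can be parsed as sketched below. The `Location` type and `parse_location` function are hypothetical helpers, not part of Artemis or any library. The parser accepts two or three dots or an ellipsis character as the range separator and treats a trailing "c" as the complementary-strand flag described in the footnote.

```python
import re
from typing import NamedTuple


class Location(NamedTuple):
    start: int        # lower genome coordinate (1-based, inclusive)
    end: int          # higher genome coordinate (1-based, inclusive)
    complement: bool  # True when the entry carries a trailing "c"


# Two or three dots, or a single ellipsis character, separate the coordinates.
_COORD = re.compile(r"^\s*(\d+)\s*(?:\.{2,3}|…)\s*(\d+)(c?)\s*$")


def parse_location(text):
    """Parse a table entry such as '609290..611077c' into a Location."""
    match = _COORD.match(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate string: {text!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError(f"start exceeds end in {text!r}")
    return Location(start, end, match.group(3) == "c")


# Two rows from the table above.
print(parse_location("527386..527931"))
print(parse_location("609290..611077c"))
```

Raising an error on unrecognized strings, rather than guessing, makes malformed entries visible instead of silently producing wrong coordinates.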
New predicted genes.
| Temporary Gene ID | Gene Position | Predicted Gene Function |
| CCNA_01158B | 1264353..1264793 | Sugar Translocase |
| CCNA_01340B | 1452580..1453071 | Activator of Hsp90 ATPase 1-like protein |
| CCNA_01547B | 1658853..1659293c | Conserved hypothetical |
| CCNA_02123B | 2274650..2275060 | Transposase |
| CCNA_02393B | 2537009..2537347 | Metallo-beta-lactamase |
| CCNA_02648B | 2801810..2802232 | Hypothetical protein |
| CCNA_02871B | 3021598..3021903 | Phage packaging-like protein |
| CCNA_02880B | 3028202..3028606c | Conserved hypothetical |
| CCNA_02968B | 3125573..3125986c | Conserved hypothetical |
| CCNA_03112B | 3263371..3263808 | Conserved hypothetical |
| CCNA_03080B | 3228849..3229025c | Oligosaccharyl transferase subunit (alpha) |
*A lower case “c” indicates that the coding sequence is on the complementary strand of the DNA.
Deletion of previously annotated genes.
| Gene | Gene Position | Number of deleted codons |
| CCNA_00242 | 256209..256364c | 51 |
| CCNA_00258 | 271669..271875 | 68 |
| CCNA_00289 | 300549..300845c | 98 |
| CCNA_00325 | 338371..338499 | 42 |
| CCNA_00347 | 361909..362088 | 59 |
| CCNA_00409 | 424307..424450 | 47 |
| CCNA_00418 | 430572..430736 | 54 |
| CCNA_00577 | 601417..601599c | 60 |
| CCNA_00584 | 612811..613236 | 141 |
| CCNA_00606 | 642000..642233 | 77 |
| CCNA_00739 | 79853..798380c | 43 |
| CCNA_00771 | 826360..826482 | 40 |
| CCNA_00797 | 861100..861297c | 65 |
| CCNA_00816 | 879721..879912c | 63 |
| CCNA_00819 | 882074..882181c | 35 |
| CCNA_00829 | 892673..892915c | 80 |
| CCNA_00848 | 921330..921674c | 114 |
| CCNA_00877 | 954339..954947c | 202 |
| CCNA_00896 | 975549..975920c | 123 |
| CCNA_00949 | 1026741..1026965 | 74 |
| CCNA_00955 | 1032152..1032286c | 44 |
| CCNA_00960 | 1038046..1038144 | 32 |
*A lower case “c” indicates that the coding sequence is on the complementary strand of the DNA.
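The codon counts in this table appear to equal the feature length in codons minus one, which would correspond to excluding the stop codon. That reading is an inference from the listed rows, not something the source states. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def codon_count(start, end, include_stop=False):
    """Number of codons spanned by the inclusive coordinates start..end.

    Assumes the feature includes its stop codon; by default that codon is
    left out of the count, which matches the rows checked below.
    """
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError(f"length {length} is not a multiple of 3")
    codons = length // 3
    return codons if include_stop else codons - 1


# Rows from the table above: CCNA_00242, CCNA_00584, CCNA_00960.
print(codon_count(256209, 256364))    # 51
print(codon_count(612811, 613236))    # 141
print(codon_count(1038046, 1038144))  # 32
```

A length that is not a multiple of three raises an error, which is a quick way to flag a coordinate pair that has been mistyped or damaged.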
Genes with modified start sites.
| Gene | Gene Position | Modified Gene Position | Gene Function |
| CCNA_00156 | 164586..164951 | 164685..164951 | ArsR family transcriptional regulator |
| CCNA_00176 | 191399..191956 | 191468..191956 | Type II secretion pathway protein H |
| CCNA_00177 | 191925..192308 | 191937..192308 | General secretion pathway protein I |
| CCNA_00230 | 245765..247030c | 245765..247003c | Ribosomal large subunit pseudouridine synthase B |
| CCNA_00304 | 318031..319308c | 318031..319263c | 3-deoxy-D-manno-octulosonic-acid transferase |
| CCNA_00318 | 333101..334138 | 333179..334138 | Hypothetical protein |
| CCNA_00338 | 348245..350719 | 348479..350719 | TonB-dependent receptor |
| CCNA_00438 | 444889..445263c | 444889..445200c | Hypothetical protein |
| CCNA_00465 | 477921..479033 | 477936..479033 | UDP-galactopyranose mutase |
| CCNA_00481 | 497307..497597 | 497313..497597 | HipB transcriptional regulator |
| CCNA_00582 | 611119..611757 | 611257..611757 | Hypothetical protein |
| CCNA_00613 | 654176..655519c | 654176..655399c | Cyanophycinase |
| CCNA_00641 | 692376..692672c | 692376..692645c | Hypothetical protein |
| CCNA_00656 | 710639..712531 | 710696..712531 | Type I restriction-modification system, M subunit |
| CCNA_00661 | 718799..719233c | 718799..719176c | Transposase |
| CCNA_00690 | 747704..748261c | 747704..748207c | CarD-like transcriptional regulator |
| CCNA_00756 | 813842..814120c | 813842..814018c | Hypothetical protein |
| CCNA_00772 | 827021..827428c | 827021..827239c | Hypothetical protein |
| CCNA_00860 | 938619..938855c | 938622..938825c | Hypothetical protein |
| CCNA_00884 | 963806..964192c | 963806..964180c | Hypothetical protein |
*A lower case “c” indicates that the coding sequence is on the complementary strand of the DNA.
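The size of each start-site change can be derived from the two coordinate pairs. On the forward strand the start codon sits at the lower coordinate, while on the complementary strand it sits at the higher coordinate, so the comparison depends on the "c" flag. The helper below is a hypothetical sketch of that calculation, not the authors' procedure.

```python
def start_shift_codons(original, modified, complement):
    """Codons removed from the 5' end when the start site was moved.

    original and modified are (start, end) tuples in genome coordinates.
    A positive result means the gene was shortened; a negative result
    means it was extended upstream.
    """
    (o_start, o_end), (m_start, m_end) = original, modified
    shift = (o_end - m_end) if complement else (m_start - o_start)
    if shift % 3 != 0:
        raise ValueError(f"shift of {shift} bases is not a whole number of codons")
    return shift // 3


# CCNA_00156 (forward strand) and CCNA_00230 (complementary strand).
print(start_shift_codons((164586, 164951), (164685, 164951), complement=False))  # 33
print(start_shift_codons((245765, 247030), (245765, 247003), complement=True))   # 9
```

The function looks only at the start-site end of the feature, so a row whose stop coordinate also differs between the two columns would need separate inspection.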
Figure 2. Screen shot of Artemis showing the correction of two overlapping gene annotations.
The CCNA_00338 gene in the +2 reading frame has been shortened relative to the white open reading frame box that corresponds to the original CCNA_00338 reading frame. Note that the original start site was upstream of the CCNA_00337 stop codon.
Figure 3. Comparison of the frequency of rare codons and common codons in the original (O) and edited (E) NA1000 genome annotations.
The blue bar represents the number of rare codons, and the red bar represents the number of common codons, that have a codon usage frequency in the edited genome annotation (E) that is equal to, greater than, or less than the frequency in the original genome annotation (O). Nonsense codons were excluded from this analysis.
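The comparison summarized in this figure can be sketched as follows. The function names and the toy sequences are illustrative assumptions, not the authors' code or data. Codon frequencies are computed per thousand sense codons for two sets of coding sequences, with nonsense codons excluded as in the caption, and each codon is then classified by whether its frequency rose, fell, or stayed the same in the edited set.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_frequencies(cds_list):
    """Frequency per thousand of each sense codon across the given CDSs."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
        for i in range(0, usable, 3):
            codon = cds[i:i + 3]
            if codon not in STOP_CODONS:
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: 1000 * n / total for codon, n in counts.items()}


def compare_usage(original, edited):
    """Label each codon 'higher', 'lower', or 'equal' in edited vs original."""
    result = {}
    for codon in set(original) | set(edited):
        before, after = original.get(codon, 0.0), edited.get(codon, 0.0)
        if math.isclose(before, after):
            result[codon] = "equal"
        else:
            result[codon] = "higher" if after > before else "lower"
    return result


# Toy sequences (not genome data): dropping the second "gene", which is rich
# in the codon TTA, lowers TTA usage and raises the share of the remaining codons.
original = codon_frequencies(["ATGCTGGCCTAA", "ATGTTATTATAA"])
edited = codon_frequencies(["ATGCTGGCCTAA"])
print(compare_usage(original, edited))
```

Which codons count as rare or common would have to be decided separately, for example with a frequency threshold; the source does not state the cutoff used.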