Jonathan L Klassen, Cameron R Currie.
Abstract
Ongoing technological advances in genome sequencing are allowing bacterial genomes to be sequenced at ever-lower cost. However, nearly all of these new techniques concomitantly decrease genome quality, primarily due to the inability of their relatively short read lengths to bridge certain genomic regions, e.g., those containing repeats. Fragmentation of predicted open reading frames (ORFs) is one possible consequence of this decreased quality. In this study we quantify ORF fragmentation in draft microbial genomes and its effect on annotation efficacy, and we propose a solution to ameliorate this problem.
Year: 2012 PMID: 22233127 PMCID: PMC3322347 DOI: 10.1186/1471-2164-13-14
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Quality of draft genomes available in the NCBI database as of May 10, 2011. The number of ORF fragments predicted by Prodigal and the percent of the total number of predicted ORFs (including ORF fragments) composed of ORF fragments, respectively, ordered by increasing number of ORF fragments (a and b) or plotted versus N50, the size of the contig for which 50% of the genome is contained in contigs of greater than or equal size (c and d). The number of ORF fragments, the percent of the total number of predicted ORFs composed of ORF fragments and N50 were logarithmically transformed to de-emphasize extreme values.
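The two quality measures named in this caption can be computed directly from an assembly. The following is a minimal sketch, assuming contig lengths and ORF counts are already available as plain numbers; the function names `n50` and `percent_fragments` are illustrative and are not taken from the study.

```python
def n50(contig_lengths):
    """Return the N50 (bp): the contig size at which contigs of that size
    or larger together contain at least 50% of the assembly."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def percent_fragments(complete_orfs, partial_orfs):
    """Percent of all predicted ORFs (complete plus partial) that are partial."""
    total = complete_orfs + partial_orfs
    if total == 0:
        raise ValueError("no ORFs predicted")
    return 100.0 * partial_orfs / total
```

For example, `n50([40, 30, 20, 10])` walks 40 and then 30 before reaching half of the 100 bp total, so it returns 30.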
Effect of genome quality on annotation efficacy
| Correction | Dataset | Comparison | Pfam | COG | KEGG |
|---|---|---|---|---|---|
| Uncorrected | Including PP-C42 | % partial ORF fragments vs. % all ORFs annotated | | | |
| Uncorrected | Including PP-C42 | Mean ORF length vs. % all ORFs annotated | | | |
| Uncorrected | Excluding PP-C42 | % partial ORF fragments vs. % all ORFs annotated | | | |
| Uncorrected | Excluding PP-C42 | Mean ORF length vs. % all ORFs annotated | | | |
| Corrected using matched partial ORF sets | Including PP-C42 | % partial ORF fragments vs. % all ORFs annotated | | | |
| Corrected using matched partial ORF sets | Including PP-C42 | Mean ORF length vs. % all ORFs annotated | | | |
| Corrected using matched partial ORF sets | Excluding PP-C42 | % partial ORF fragments vs. % all ORFs annotated | | | |
| Corrected using matched partial ORF sets | Excluding PP-C42 | Mean ORF length vs. % all ORFs annotated | | | |
Pearson correlations between annotation frequency and genome quality, as represented by the percent of the predicted ORFs composed of partial sequences and mean ORF length. Complete genomes are excluded in all cases; including them has essentially no effect.
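A Pearson correlation of this kind takes one value per genome for each of the two variables. The sketch below shows the standard formula in plain Python; `pearson_r` is an illustrative name, and no values from the study are reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)
```

Here `xs` would hold, for instance, the percent of partial ORFs per genome and `ys` the percent of all ORFs annotated in the same genomes.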
Figure 2. Mean partial and complete ORFs for each COG superfamily. The values shown are expressed as a percentage of the total number of partial or complete ORFs in the draft Streptomyces genomes examined. COG superfamily single-letter abbreviations are bracketed.
Efficacy of partial ORF linkage
| Strain | True matching fragments linked (%) | False positive linkages (%) |
|---|---|---|
| | 127 (22.1) | 2 (0.4) |
| | 58 (31.2) | 0 (0) |
| | 40 (9.3) | 0 (0) |
| | 12 (20.7) | 2 (3.5) |
| | 182 (28.0) | 0 (0) |
| | 326 (27.5) | 2 (0.2) |
| | 220 (20.2) | 1 (0.1) |
| | 117 (27.9) | 2 (0.5) |
| | 207 (20.7) | 0 (0) |
| | 182 (46.2) | 7 (1.8) |
| | 192 (41.6) | 8 (1.7) |
| | 0 (0) | 0 (0) |
| | 146 (17.9) | 9 (1.1) |
| | 158 (17.7) | 2 (0.2) |
| | 46 (12.0) | 0 (0) |
| | 887 (10.6) | No scaffolds |
| | 145 (15.8) | 12 (1.3) |
| | 109 (12.8) | 6 (0.7) |
| | 200 (28.1) | 7 (1.0) |
| | 102 (31.1) | 8 (2.4) |
For each query genome, all other Streptomyces genomes except Streptomyces sp. PP-C42 were used as references to link partial ORFs in that query genome. The parameters used were: ≥ 20% identity and ≥ 60% coverage between the query and reference sequences, and ≥ 50% similarity between the identities of each partial fragment to the reference sequence.
False positive linkages were identified by their incongruence with the scaffold information.
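The thresholds in the table note can be read as a filter on pairs of partial ORFs that hit the same reference protein. The sketch below is one possible reading, not the authors' implementation. It assumes that coverage means the percent of each fragment aligned to the reference, and that "similarity between the identities" means the lower identity expressed as a percent of the higher one. The `Hit` class and all function names are hypothetical.

```python
from dataclasses import dataclass

# Thresholds quoted in the table note (all in percent).
MIN_IDENTITY = 20.0
MIN_COVERAGE = 60.0
MIN_IDENTITY_SIMILARITY = 50.0


@dataclass(frozen=True)
class Hit:
    """Best hit of one partial ORF against one reference protein (hypothetical)."""
    fragment_id: str
    reference_id: str
    identity: float  # percent identity, 0-100
    coverage: float  # assumed: percent of the fragment aligned, 0-100


def passes_thresholds(hit):
    return hit.identity >= MIN_IDENTITY and hit.coverage >= MIN_COVERAGE


def identity_similarity(a, b):
    """Assumed reading: lower identity as a percent of the higher identity."""
    higher = max(a.identity, b.identity)
    if higher == 0:
        return 0.0
    return 100.0 * min(a.identity, b.identity) / higher


def can_link(a, b):
    """True if two different fragments may be linked via a shared reference."""
    return (
        a.fragment_id != b.fragment_id
        and a.reference_id == b.reference_id
        and passes_thresholds(a)
        and passes_thresholds(b)
        and identity_similarity(a, b) >= MIN_IDENTITY_SIMILARITY
    )
```

Under this reading, two fragments with 80% and 70% identity to the same reference are linkable, whereas fragments with 80% and 30% identity are not, because 30 is only 37.5% of 80.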
Figure 3. The effect of genome relatedness on partial ORF linkage. Cumulative and genome-specific linkage of ORF fragments are plotted according to increasing AAI divergence between S. roseosporus NRRL15998 and the other genomes in the Streptomyces test dataset. Reference genomes were ordered for analysis by their decreasing AAI to S. roseosporus NRRL15998.
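The ordering described in this caption can be sketched as follows, assuming each reference genome comes with its AAI to the query and the set of query fragments it links. The function name and the tuple layout are illustrative only.

```python
def cumulative_linkage(references):
    """references: iterable of (name, aai_to_query, linked_fragment_ids).

    Returns one row per reference, ordered by decreasing AAI:
    (name, fragments linked by this genome, fragments linked so far).
    """
    ordered = sorted(references, key=lambda ref: ref[1], reverse=True)
    linked_so_far = set()
    rows = []
    for name, _aai, linked in ordered:
        # Fragments already linked by a closer genome are not counted twice.
        linked_so_far |= set(linked)
        rows.append((name, len(linked), len(linked_so_far)))
    return rows
```

The second element of each row corresponds to the genome-specific curve and the third to the cumulative curve.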
Figure 4. Phylogenetic tree of the Streptomyces genomes. This neighbor-joining phylogenetic tree is based on the average amino acid identity (AAI), calculated using each non-fragmented bidirectional best BLAST pair having ≥ 30% identity over ≥ 70% of both protein lengths. The tree was rooted based on a 16S rRNA gene tree constructed for all actinobacterial type strains, which was consistent in all major respects with the tree presented here.
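The AAI calculation in this caption amounts to filtering protein pairs and averaging their identities. The sketch below assumes the bidirectional best BLAST pairs have already been found and are given as (identity, coverage of protein A, coverage of protein B) tuples in percent; this input layout and the function name are assumptions, and tree building is not shown.

```python
def average_amino_acid_identity(pairs, min_identity=30.0, min_coverage=70.0):
    """Mean percent identity over pairs that pass both caption thresholds.

    pairs: iterable of (identity, coverage_a, coverage_b), all in percent.
    Returns None when no pair passes the filters.
    """
    kept = [
        identity
        for identity, coverage_a, coverage_b in pairs
        if identity >= min_identity
        and coverage_a >= min_coverage
        and coverage_b >= min_coverage
    ]
    if not kept:
        return None
    return sum(kept) / len(kept)
```

A pair with 25% identity, or one aligned over only 60% of either protein, is dropped before averaging.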
Characteristics of the Streptomyces genomes used in this study
| Strain | Contigs/scaffolds | Genome size (bp) | N50 (bp) | Complete/partial ORFs (% partial) | Ref.; NCBI project accession number |
|---|---|---|---|---|---|
| | 501/2 | 6,619,469 | 23,675 | 5442/575 (9.6) | The Broad Institute, unpublished; PRJNA37045 |
| | Complete | 9,119,895 | - | 7676/0 (0) | [ |
| | Complete | 11,936,683 | - | 10035/0 (0) | [ |
| | 279/2 | 8,528,397 | 50,310 | 6893/186 (2.6) | [ |
| | 597/158 | 6,729,086 | 16,887 | 5692/430 (7.0) | The Broad Institute, unpublished; PRJNA28551 |
| | 89/5 | 9,134,976 | 203,387 | 7567/58 (0.8) | [ |
| | Complete | 9,054,847 | - | 8153/0 (0) | [ |
| | Complete | 7,656,104 | - | 6763/0 (0) | The Joint Genome Institute, unpublished; PRJNA37207 |
| | 616/3 | 8,223,278 | 24,170 | 7166/649 (8.3) | The Broad Institute, unpublished; PRJNA37041 |
| | 927/1 | 7,364,052 | 14,024 | 6076/1187 (16.3) | The Broad Institute, unpublished; PRJNA37185 |
| | Complete | 8,545,929 | - | 7136/0 (0) | [ |
| | 4/2 | 8,731,583 | 603,0272 | 7374/5 (0.1) | [ |
| | 783/2 | 10,466,286 | 27,794 | 8647/1092 (11.2) | The Broad Institute, unpublished; PRJNA37181 |
| | 333/1 | 8,190,887 | 49,556 | 7183/419 (5.5) | The Broad Institute, unpublished; PRJNA37179 |
| | 844/1 | 7,633,609 | 17,558 | 6283/1000 (13.7) | The Broad Institute, unpublished; PRJNA36845 |
| | 280/2 | 7,763,119 | 57,471 | 6737/394 (5.5) | The Broad Institute, unpublished; PRJNA65085 |
| | 371/1 | 7,560,086 | 38,584 | 6493/462 (6.6) | The Broad Institute, unpublished; PRJNA55545 |
| | Complete | 10,148,695 | - | 8746/0 (0) | Wellcome Trust Sanger Institute, unpublished; PRJNA35395 |
| | 652/4 | 7,916,041 | 25,608 | 6906/815 (10.6) | The Broad Institute, unpublished; PRJNA37177 |
| | 716/13 | 7,146,196 | 21,378 | 5780/893 (13.4) | The Broad Institute, unpublished; PRJNA38087 |
| | 466/127 | 7,105,723 | 25,079 | 6344/384 (5.7) | The Broad Institute, unpublished; PRJNA36841 |
| | 7074/- | 6,467,850 | 1,414 | 2503/8392 (77.0) | [ |
| | 845/2 | 6,505,970 | 13,132 | 5401/917 (14.5) | The Broad Institute, unpublished; PRJNA36843 |
| | 694/4 | 6,897,976 | 20,678 | 5723/853 (13.0) | The Broad Institute, unpublished; PRJNA37173 |
| | 552/1 | 9,055,790 | 33,523 | 7921/711 (8.2) | The Broad Institute, unpublished; PRJNA36847 |
| | 226/1 | 10,988,130 | 109,490 | 7473/328 (4.2) | The Broad Institute, unpublished; PRJNA37183 |
Predicted proteins for complete genomes were downloaded directly from the NCBI database; all others were predicted using Prodigal.
Figure 5. Flow diagram of the partial ORF linkage approach used in this work.