Fangqing Zhao, Fanggeng Zhao, Tao Li, Donald A Bryant.
Abstract
Gap closing is considered one of the most challenging and time-consuming tasks in bacterial genome sequencing projects, especially with the emergence of new sequencing technologies, such as pyrosequencing, which may result in large amounts of data without the benefit of large insert libraries for contig scaffolding. We propose a novel algorithm to align contigs with more than one reference genome at a time. This approach can successfully overcome the limitations of low degrees of conserved gene order for the reference and target genomes. A pheromone trail-based genetic algorithm (PGA) was used to search globally for the optimal placement for each contig. Extensive testing on simulated and real data sets shows that PGA significantly outperforms previous methods, especially when assembling genomes that are only moderately related. An extended version of PGA can predict additional candidate connections for each contig and can thus increase the likelihood of identifying the correct arrangement of each contig. The software and test data sets can be accessed at http://sourceforge.net/projects/pga4genomics/.
Year: 2008 PMID: 18445633 PMCID: PMC2425481 DOI: 10.1093/nar/gkn168
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The flow chart of the pheromone trail-based genetic algorithm developed for genome assembly of contigs into scaffolds by comparison to one or more reference genome(s).
Figure 2. Relationships between PGA performance and various parameters. The X-axis shows the number of iterations, while the Y-axis shows the fitness score. (A) Comparison of the performance of genetic algorithms with different crossover operators. (B–D) Parameter settings for the relative importance of pheromone trail and the visibility (β), the pheromone trail persistence (ρ) and the probability for pseudo-random-proportional selection (q0).
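The three parameters named in Figure 2 (β, ρ and q0) are the usual controls of an ant-colony-style search. The Python sketch below illustrates how such parameters are commonly used; it is not taken from the PGA source code. The function names `choose_next` and `update_pheromone`, the matrix layout, and the exact update rule (multiply every trail by ρ, then add a fixed deposit along the chosen order) are assumptions made for illustration.

```python
import random


def choose_next(current, candidates, pheromone, visibility, beta, q0, rng=random):
    """Pick the contig to place after `current` (hypothetical helper).

    pheromone[i][j]  : learned trail strength for placing j right after i
    visibility[i][j] : heuristic desirability of that adjacency
    beta             : relative importance of visibility versus pheromone
    q0               : probability of greedily taking the best candidate
    All weights are assumed to be positive.
    """
    weights = [pheromone[current][j] * visibility[current][j] ** beta
               for j in candidates]
    if rng.random() < q0:
        # Exploitation: take the candidate with the largest weight.
        best = max(range(len(candidates)), key=lambda k: weights[k])
        return candidates[best]
    # Exploration: sample a candidate with probability proportional to weight.
    return rng.choices(candidates, weights=weights, k=1)[0]


def update_pheromone(pheromone, order, rho, deposit):
    """Decay all trails by the persistence rho, then reinforce `order`.

    Assumed rule: tau <- rho * tau, plus `deposit` on each adjacent pair
    of the given contig order. The matrix is modified in place.
    """
    n = len(pheromone)
    for i in range(n):
        for j in range(n):
            pheromone[i][j] *= rho
    for a, b in zip(order, order[1:]):
        pheromone[a][b] += deposit
```

With q0 close to 1 the search mostly follows the strongest trail; with q0 close to 0 it samples adjacencies in proportion to their weights, which keeps more diversity in the candidate orders.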
Figure 3. Performance comparison between the BLAST-end method (plotted points) and the PGA method (vertical bars) using different simulated data sets. (A) The assembly accuracies on data sets containing different numbers of simulated contigs of random fragments derived from Synechococcus sp. WH8102. The genomes of Synechococcus sp. CC9605 (S9605), Synechococcus sp. CC9902 (S9902) and P. marinus MIT 9313 (P9313) were used as the reference genomes, respectively. (B) The assembly accuracies on data sets containing different numbers of simulated contigs of random fragments derived from Shewanella sp. ANA-3. The genomes of S. oneidensis (Sone), S. amazonensis (Sama), and S. frigidimarina (Sfri) were used as references, respectively. (C) The overall success rate for gap closure attained using the four best predictions from the relaxed PGA method (PGA-extended, see text for detail) with the same data sets used in Figure 3A. (D) The overall success rate for gap closure attained using the four best predictions from the relaxed PGA method (PGA-extended, see text for detail) with the same data sets used in Figure 3B.
Comparison of contig assembly accuracy on authentic data sets of contigs from green sulfur bacterial genomes using PGA, PGA-extended, BLAST-end, Projector2 and OSLay
| Genome | Contigs | Method | Ctep (Best) | Ctep (Average) | Cpha (Best) | Cpha (Average) | Plut (Best) | Plut (Average) | 2 or 3 Refs (Best) | 2 or 3 Refs (Average) |
|---|---|---|---|---|---|---|---|---|---|---|
| Clim | 37 | PGA | 0.324 | 0.276 ± 0.040 | 0.405 | 0.373 ± 0.032 | 0.378 | 0.346 ± 0.026 | 0.514 | 0.443 ± 0.040 |
| Clim | 37 | PGA-ext. | 0.514 | NA | 0.541 | NA | 0.568 | NA | NA | NA |
| Clim | 37 | BLAST-end | 0.108 | NA | 0.135 | NA | 0.135 | NA | NA | NA |
| Clim | 37 | Projector2 | 0.189 | NA | 0.189 | NA | 0.162 | NA | NA | NA |
| Clim | 37 | OSLay | 0.135 | NA | 0.162 | NA | 0.108 | NA | NA | NA |
| Cvib | 26 | PGA | 0.385 | 0.331 ± 0.031 | 0.462 | 0.454 ± 0.015 | 0.769 | 0.738 ± 0.015 | 0.731 | 0.731 ± 0.000 |
| Cvib | 26 | PGA-ext. | 0.577 | NA | 0.731 | NA | 0.885 | NA | NA | NA |
| Cvib | 26 | BLAST-end | 0.115 | NA | 0.385 | NA | 0.538 | NA | NA | NA |
| Cvib | 26 | Projector2 | 0.231 | NA | 0.308 | NA | 0.577 | NA | NA | NA |
| Cvib | 26 | OSLay | 0.000 | NA | 0.154 | NA | 0.423 | NA | NA | NA |
| Cpar | 58 | PGA | 0.690 | 0.679 ± 0.014 | 0.431 | 0.400 ± 0.020 | 0.586 | 0.559 ± 0.018 | 0.741 | 0.738 ± 0.007 |
| Cpar | 58 | PGA-ext. | 0.914 | NA | 0.621 | NA | 0.724 | NA | NA | NA |
| Cpar | 58 | BLAST-end | 0.534 | NA | 0.190 | NA | 0.172 | NA | NA | NA |
| Cpar | 58 | Projector2 | 0.224 | NA | 0.121 | NA | 0.155 | NA | NA | NA |
| Cpar | 58 | OSLay | 0.534 | NA | 0.052 | NA | 0.103 | NA | NA | NA |
a. Assembly of C. limicola (Clim), C. vibrioforme (Cvib) and C. parvum (Cpar) contigs.
b. Chlorobium tepidum (Ctep), C. phaeobacteroides (Cpha) and P. luteolum (Plut) were used as the reference genomes.
c. Chlorobium phaeobacteroides (Cpha) and P. luteolum (Plut) were used as the reference genomes.
d. Chlorobium tepidum (Ctep) and P. luteolum (Plut) were used as the reference genomes.
e. The corresponding value indicates the overall success rate for gap closure attained using the four best predictions from the PGA-extended method.
NA, not applicable.
Figure 4. An example to illustrate the performance of the PGA. (A) Linear mapping between given target contigs (red bars) and a reference genome (blue bar). The purple lines represent the connections of the interior BLASTN matches and the light blue lines represent the connections of the terminal contig BLASTN matches. The name (number) of each contig is indicated below the red bars. (B, inset) The fitness matrix for these contigs derived from the reference genome. The smaller the matrix value, the shorter the distance between a pair of contigs. Red boxes indicate scores that provide correct information to PGA; the blue boxes indicate scores that provide false information, which may predict an incorrect assembly.
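To make the role of the fitness matrix in Figure 4B concrete, the Python sketch below scores a candidate contig order by summing the matrix entries of adjacent contigs, so that a lower total means a tighter arrangement. The additive scoring rule, the function names `order_fitness` and `best_order_bruteforce`, and the 3 × 3 matrix are assumptions for illustration only; the excerpt does not state how PGA combines matrix entries, and the values are not from the paper.

```python
from itertools import permutations


def order_fitness(order, distance):
    """Sum the pairwise distances of contigs that are adjacent in `order`.

    distance[a][b] is the matrix entry for placing contig b right after
    contig a; smaller values mean the two contigs sit closer together.
    """
    return sum(distance[a][b] for a, b in zip(order, order[1:]))


def best_order_bruteforce(distance):
    """Try every linear order and return the one with the lowest total.

    Only feasible for a handful of contigs; the factorial growth of the
    search space is why a heuristic search is needed for real data sets.
    """
    contigs = range(len(distance))
    return min(permutations(contigs), key=lambda o: order_fitness(o, distance))


# Hypothetical matrix for three contigs (illustrative values only).
example = [
    [0, 1, 9],
    [1, 0, 2],
    [9, 2, 0],
]
best = best_order_bruteforce(example)
print(best, order_fitness(best, example))
```

A misleading entry (a blue box in the figure) corresponds to a small value for a pair of contigs that are not truly adjacent, which can pull a purely distance-driven search toward an incorrect order.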