Morgan G I Langille, William W L Hsiao, Fiona S L Brinkman.
Abstract
BACKGROUND: Genomic islands (GIs) are clusters of genes in prokaryotic genomes of probable horizontal origin. GIs are disproportionately associated with microbial adaptations of medical or environmental interest. Recently, multiple programs for automated detection of GIs have been developed that utilize sequence composition characteristics, such as G+C ratio and dinucleotide bias. To robustly evaluate the accuracy of such methods, we propose that a dataset of GIs be constructed using criteria that are independent of sequence composition-based analysis approaches.
Year: 2008 PMID: 18680607 PMCID: PMC2518932 DOI: 10.1186/1471-2105-9-329
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Pipeline showing how negative and positive datasets of GIs are derived from a single query genome. (A) The inputs are the query genome and a pre-computed genome distance matrix generated with CVTree. (B) If enough suitable reference genomes are selected for comparison with the query genome, the query and reference genomes are aligned in a Mauve multiple genome alignment and all conserved regions are extracted into a negative dataset of GIs. (C) The positive dataset is constructed by aligning the query genome pairwise with each reference genome; all overlapping unaligned regions found in the query genome across the pairwise alignments are then filtered using NCBI BLAST to ensure that they are truly unique to the query genome.
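The positive-dataset step in panel C can be illustrated with a minimal sketch. It assumes that "overlapping unaligned regions" means the parts of the query genome left unaligned in every pairwise alignment, which is one reading of the caption rather than a confirmed detail of the published pipeline. The function names, the half-open coordinate convention, and the `min_length` parameter are hypothetical, and the subsequent BLAST filter is not shown.

```python
from typing import List, Tuple

# Half-open interval [start, end) in query-genome coordinates (an assumed convention).
Interval = Tuple[int, int]


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of non-overlapping intervals."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def candidate_unique_regions(
    unaligned_per_reference: List[List[Interval]],
    min_length: int = 0,  # hypothetical size cutoff, not taken from the source
) -> List[Interval]:
    """Return query regions that are unaligned against every reference genome.

    Each inner list holds the unaligned query regions from one pairwise
    alignment and is assumed to contain no self-overlapping intervals.
    """
    if not unaligned_per_reference:
        return []
    common = sorted(unaligned_per_reference[0])
    for regions in unaligned_per_reference[1:]:
        common = intersect(common, sorted(regions))
    return [(s, e) for s, e in common if e - s >= min_length]


# Made-up unaligned regions from three pairwise alignments.
example = [
    [(100, 500), (900, 1200)],
    [(300, 600), (1000, 1100)],
    [(250, 450), (950, 1300)],
]
print(candidate_unique_regions(example))
```

Regions returned by such a step would still be candidates only; in the pipeline described in the caption they are subsequently checked with NCBI BLAST.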
Average number of GI predictions and accuracy measurements of several GI prediction tools.
| 232.7 | 92.3 | 33.0 | 86.3 | |
| 170.7 | 85.8 | 35.6 | 86.2 | |
| 163.2 | 68.0 | 32.2 | 83.7 | |
| 171.3 | 61.3 | 27.6 | 82.4 | |
| 444.4 | 54.8 | 53.3 | 82.2 | |
| 1264.8 | 38.0 | 77.0 | 70.8 | |
| 639.4 | 100 | 87.0 | 96.3 | |
*Results are averaged from 118 chromosomes in 117 different strains [see Additional file 6], except for the "Literature" GIs, which were averaged over 5 strains: Escherichia coli O157:H7, E. coli O157:H7 EDL933, Salmonella enterica Typhi str. CT18, S. enterica Typhimurium LT2, and Streptococcus pyogenes MGAS315 [21-25].
Figure 2. Effect of cutoffs on a sample genome tree. (A) First, for each query genome, any genomes that are too distant from the query genome or too closely related to each other are removed (dotted lines). (B) Second, we ensure that at least one genome (bold line) is close enough to the query genome so that the GIs we identify were inserted within similar time frames and are not biased by the genomes that are currently available. (C) We also require that at least one genome (bold line) is distant enough from the query genome to ensure that the backbone sequences we identify were not inserted recently. (D) Finally, we require a minimum of 3 reference genomes that have met all other criteria. The reference genomes that pass all the cutoffs are used for comparison with the query genome.
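The four cutoffs in the Figure 2 caption can be sketched as a filter over a pairwise distance matrix. This is an illustrative sketch only: the cutoff values, parameter names, and the greedy handling of closely related genomes (keeping the one nearest the query) are assumptions, since the caption does not give thresholds or a tie-breaking rule. Only the minimum of 3 reference genomes comes from the caption.

```python
from typing import Dict, List, Optional, Tuple

Distances = Dict[Tuple[str, str], float]


def dist(d: Distances, a: str, b: str) -> float:
    """Look up a symmetric distance stored under either key order."""
    return d[(a, b)] if (a, b) in d else d[(b, a)]


def select_references(
    query: str,
    genomes: List[str],
    d: Distances,
    max_dist: float,       # hypothetical: beyond this a genome is "too distant"
    min_pair_dist: float,  # hypothetical: below this two references are "too close"
    near_cutoff: float,    # hypothetical: at least one reference within this distance
    far_cutoff: float,     # hypothetical: at least one reference beyond this distance
    min_refs: int = 3,     # minimum number of references, from the caption
) -> Optional[List[str]]:
    """Return usable reference genomes, or None if the query fails a cutoff."""
    # (A) Drop genomes that are too distant from the query.
    candidates = sorted(
        (g for g in genomes if g != query and dist(d, query, g) <= max_dist),
        key=lambda g: dist(d, query, g),
    )
    # (A) Drop genomes too closely related to an already kept genome
    # (greedy, nearest-to-query first; an assumed strategy).
    kept: List[str] = []
    for g in candidates:
        if all(dist(d, g, k) >= min_pair_dist for k in kept):
            kept.append(g)
    # (B) At least one reference must be close enough to the query.
    if not any(dist(d, query, g) <= near_cutoff for g in kept):
        return None
    # (C) At least one reference must be distant enough from the query.
    if not any(dist(d, query, g) >= far_cutoff for g in kept):
        return None
    # (D) Require a minimum number of references.
    if len(kept) < min_refs:
        return None
    return kept


# Made-up distances for illustration; real values would come from CVTree.
toy: Distances = {
    ("Q", "R1"): 0.10, ("Q", "R2"): 0.12, ("Q", "R3"): 0.30,
    ("Q", "R4"): 0.45, ("Q", "R5"): 0.90,
    ("R1", "R2"): 0.01, ("R1", "R3"): 0.25, ("R1", "R4"): 0.40,
    ("R3", "R4"): 0.35,
}
refs = select_references(
    "Q", ["R1", "R2", "R3", "R4", "R5"], toy,
    max_dist=0.5, min_pair_dist=0.05, near_cutoff=0.15, far_cutoff=0.4,
)
print(refs)
```

Returning `None` when any cutoff fails mirrors the caption's framing that a query genome is only analyzed if suitable reference genomes exist.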