Antonio Camilo da Silva Filho, Roberto Tadeu Raittz, Dieval Guizelini, Camilla Reginatto De Pierri, Diônata Willian Augusto, Izabella Castilhos Ribeiro Dos Santos-Weiss, Jeroniza Nunes Marchaukoski.
Abstract
Tools for genomic island prediction rely on comparative genomic analysis and sequence composition analysis. Comparative analysis aims to identify unique regions in the genomes of related organisms, whereas sequence composition analysis compares the composition of specific regions with that of other regions in the genome. The goal of this study was to qualitatively and quantitatively evaluate extant genomic island predictors. We chose tools reported to produce significant results using sequence composition, comparative genomics, and hybrid methods. To maintain diversity, the tools were applied to eight complete genomes of organisms with distinct characteristics and belonging to different families. Escherichia coli CFT073 was used as a control and considered the gold standard because its islands were previously curated in vitro. The predictions for the gold standard were manually curated, and the content and characteristics of each predicted island were analyzed. For the other organisms, we created a GenBank (GBK) file for each predicted island using Artemis software. We copied only the amino acid sequences of the coding sequences and constructed a multi-FASTA file for each predictor. We used BLASTp to compare all results and generate hits to evaluate similarities and differences among the predictions. Comparison of the results with the gold standard revealed that GIPSy produced the best results, covering ~91% of the composition and regions of the islands, followed by Alien Hunter (81%), IslandViewer (47.8%), Predict Bias (31%), GI Hunter (17%), and Zisland Explorer (16%). The tools with the best results in the analyses of the full set of organisms were the same ones that performed best in the tests with the gold standard.
Keywords: genomic islands; genomic signature; horizontal gene transfer; mobility genes; pathogenic islands; virulence factors
Year: 2018 PMID: 30631340 PMCID: PMC6315130 DOI: 10.3389/fgene.2018.00619
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
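The multi-FASTA step described in the abstract can be sketched as follows. This is a minimal illustration in Python, not the authors' script: the function name `to_multi_fasta`, the record identifiers, and the short peptide strings are all hypothetical, and the 60-character line width is only a common convention.

```python
from typing import Iterable, Tuple


def to_multi_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format (identifier, amino acid sequence) pairs as one multi-FASTA string."""
    lines = []
    for identifier, sequence in records:
        # Each record starts with a header line introduced by '>'.
        lines.append(f">{identifier}")
        # Wrap the sequence at a fixed width, as is conventional for FASTA.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Hypothetical translations copied from the CDS features of one predictor's islands.
example_records = [("island1_cds1", "MKTAYIAKQR"), ("island1_cds2", "MSLNNPQ")]
fasta_text = to_multi_fasta(example_records)
```

A file written from `fasta_text` for each predictor could then serve as a BLASTp query or database, which is how the abstract describes the comparison between predictions.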
Figure 1. Main characteristics of genomic islands and possible functions.
Comparative characteristics of the tools.
| Tool | Method | Sequence composition bias | Genomic island function | Year of publication |
|---|---|---|---|---|
| Alien Hunter | Seq. Comp. | IVOM, k-mers | No | 2006 |
| GI Hunter | Seq. Comp. | IVOM, k-mers | No | 2014 |
| Predict Bias | Seq. Comp. | G+C%, codon usage, dinucleotides | Yes (only pathogenicity islands) | 2008 |
| Zisland Explorer | Seq. Comp. | G+C%, codon usage | No | 2016 |
| GIPSy | Hybrid | G+C%, codon usage | Yes (classifies GIs with functions of pathogenicity, resistance, metabolism, and symbiosis) | 2016 |
| IslandViewer4 | Hybrid | Codon usage, dinucleotides | R. G. only | 2009–2017 |
Seq. Comp., sequence composition; R. G. only, related genes only.
Description of selected organisms.
| Family | G+C (%) | Gram-positive |
|---|---|---|
| Corynebacteriaceae | 53.50 | + |
| Streptococcaceae | 38.50 | + |
| Staphylococcaceae | 32.90 | + |
| Enterobacteriaceae | 50.80 | |
| Enterobacteriaceae | 50.50 | |
| Aeromonadaceae | 61.50 | |
| Pseudomonadaceae | 66.60 | |
| Vibrionaceae | 47.70 | |
| Vibrionaceae | 46.90 | |
(Organisms chosen to evaluate the processing time of the predictors: gram-positive and gram-negative organisms that represent the largest and smallest base-pair contents of the entire group).
(Organism chosen as a gold standard set).
Data of GIs, PAIs, and regions with bacteriophage DNA curated in vitro from the reference organism Escherichia coli CFT073.
| No. | Island | Locus tags | CDS | tRNA | G+C (%) |
|---|---|---|---|---|---|
| 1 | GI-CFT073-leuX | c5386–c5371 | 15 | leuX | 48.15 |
| 2 | PAI-CFT073-pheU | c5216–c5143 | 61 | pheU | 47.57 |
| 3 | GI-CFT073-selC | c4581–c4491 | 70 | selC | 47.04 |
| 4 | PAI-CFT073-pheV | c3698–c3556 | 124 | pheV | 47.08 |
| 5 | PAI-CFT073-metV | c3410–c3385 | 25 | metV | 53.37 |
| 6 | ϕ-CFT073-smpB | c3206–c3143 | 49 | | 49.32 |
| 7 | GI-CFT073-cobU | c2528–c2482 | 37 | | 49.68 |
| 8 | GI-CFT073-asnW | c2475–c2449 | 26 | asnW | 53.12 |
| 9 | PAI-CFT073-asnT | c2436–c2418 | 15 | asnT | 58.27 |
| 10 | PAI-CFT073-serU | c2416–c2392 | 19 | serU | 37.65 |
| 11 | PAI-CFT073-icdA | c1601–c1518 | 74 | | 50.23 |
| 12 | ϕ-CFT073-ycfD | c1507–c1481 | 14 | | 49.78 |
| 13 | ϕ-CFT073-potB | c1475–c1400 | 51 | | 50.97 |
| 14 | PAI-CFT073-serX | c1293–c1165 | 102 | serX | 48.76 |
| 15 | ϕ-CFT073-b0847 | c0979–c0932 | 42 | | 50.45 |
| 16 | PAI-CFT073-aspV | c0368–c0253 | 83 | aspV | 47.43 |
Island, genomic island; Locus tags, identifiers applied to each gene; CDS, number of coding sequences; tRNA, transfer RNA; G+C (%), percentage of guanine and cytosine content in the region. GI, genomic island; PAI, pathogenicity island; ϕ, island containing predominantly bacteriophage DNA.
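The guanine and cytosine percentage reported per region can be illustrated with a short Python sketch. The function name `gc_percent` and the toy sequences are hypothetical, and the sketch assumes that every character of the input counts toward the region length, which may differ from how the authors handled ambiguous bases.

```python
def gc_percent(sequence: str) -> float:
    """Return the percentage of G and C bases in a nucleotide sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Count guanine and cytosine; all other characters count only toward the length.
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


# Toy region, not taken from the CFT073 genome.
toy_region = "GGCCAATT"
toy_gc = gc_percent(toy_region)
```

Composition-based predictors compare a value like this for a candidate region against the rest of the genome; regions whose composition deviates are candidate islands.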
Figure 2. Steps in dataset creation.
Main features of each tool.
| Tool | Platform | Input | Output | Genome without annotation | Reference genome | Incomplete genomes (drafts/contigs) |
|---|---|---|---|---|---|---|
| Alien Hunter | OS | .FASTA | .TXT | Yes | No | No |
| GI Hunter | OS Linux (Console) | .FNA | .TXT/PLOT | No | No | No |
| GIPSy | OS Linux/Win (GUI) | .GBK | .TXT | No | Yes | No |
| IslandViewer4 | Web | .GBK/.EMBL | .GBK/.FASTA/.PLOT | No | No | Yes |
| Predict Bias | Web | .GBK | .TXT/.PLOT/HTML | No | No | No |
| Zisland Explorer | OS Linux/Win/Web/Mac (GUI) | .FNA/.PTT | .TXT/.PLOT | No | No | No |
(Genome without annotation),
(Reference genome),
(Incomplete genomes drafts/contigs),
(Operating system),
(Base sequences),
(Text file),
(European Molecular Biology Laboratory),
(Score of prediction results),
(Nucleotide sequences),
(Location and attributes of proteins),
(Location and attributes of transfer ribonucleic acids),
(graphical user interface),
(GenBank genomic sequence format),
(Hypertext Markup Language).
Additional information of each tool.
| Tool | Integrated tools | Databases |
|---|---|---|
| Alien Hunter | No | No |
| GI Hunter | IVOM | No |
| GIPSy | SIGI-HMM | PFAM |
| IslandViewer4 | SIGI-HMM/MAUVE | VFDB |
| Predict Bias | No | VFDB |
| Zisland Explorer | GC-Profile | No |
(Interpolated variable order motifs),
(Genomic data statistical analysis tool),
(Sequence lookup tool),
(Tool to compare sequence information in amino acids),
(Protein family database),
(Microbial database of protein toxins, virulence factors, and antibiotic resistance genes for bio-defense applications),
(Antibiotic resistance genes database),
(Comprehensive antibiotic resistance database),
(Clusters of orthologous groups of proteins),
(database for genes and mutants involved in symbiosis),
(Transfer RNA database),
(Genome sequence alignment tool),
(Virulence factors database),
(Bacterial bioinformatics database and analysis resource),
(Pathogen-host interaction data integration and analysis system database).
Figure 3. The organism with the largest genome, Pseudomonas aeruginosa PAO1 (6,264,404 base pairs), was compared with the one with the smallest genome, Streptococcus pyogenes M1 GAS (1,852,433 base pairs).
Figure 4. The circular genome was plotted with DNA Plotter from the Artemis tool, with the position of each predicted island highlighted in red, GC content in yellow (above) and purple (below), and GC skew in green (below) and blue (above). The GC content of each predicted island was examined together with the results of each predictor. The symbol ϕ represents islands containing predominantly bacteriophage DNA.
Features of gold standard GI 16 vs. predictors.
| Tool | tRNA | | G+C (%) | Virulence genes | Hypothetical proteins | Uncharacterized proteins | CDS |
|---|---|---|---|---|---|---|---|
| IslandViewer4 | Present | 12 | 47.43 | 5 | 38 | 5 | 100 |
| GIPSy | Present | 12 | 47.38 | 5 | 38 | 5 | 99 |
| Alien Hunter | Absent | 8 | 46.97 | 4 | 33 | 5 | 84 |
| Zisland Explorer | Absent | 8 | 46.35 | 4 | 35 | 4 | 81 |
| GI Hunter | Absent | 8 | 46.34 | 3 | 32 | 4 | 79 |
| Predict Bias | Absent | 7 | 46.30 | 3 | 31 | 4 | 77 |
In bold, the 16th island of the gold standard Escherichia coli CFT073.
(Gold Standard),
(Transfer RNA),
(Percentage of guanine and cytosine content in the region),
(Virulence genes),
(Hypothetical proteins),
(Uncharacterized proteins),
(Coding sequences).
Products of the 16 gold standard GIs of Escherichia coli CFT073 based on reference articles.
| GI | tRNA | | | G+C (%) | Hypothetical proteins | Uncharacterized proteins | Virulence genes | CDS |
|---|---|---|---|---|---|---|---|---|
| 1–GI | leuX | 0 | 1 | 48.15 | 6 | 0 | 0 | 15 |
| 2–PAI | pheU | 11 | 3 | 47.57 | 16 | 4 | 1 | 61 |
| 3–GI | selC | 10 | 2 | 47.04 | 23 | 3 | 1 | 70 |
| 4–PAI | pheV | 20 | 3 | 47.08 | 41 | 8 | 9 | 124 |
| 5–PAI | metV | 0 | 0 | 53.37 | 8 | 1 | 3 | 25 |
| 6–ϕ | Absent | 1 | 1 | 49.32 | 20 | 5 | 0 | 49 |
| 7–GI | Absent | 7 | 1 | 49.68 | 19 | 1 | 2 | 37 |
| 8–GI | asnW | 2 | 1 | 53.12 | 4 | 0 | 1 | 26 |
| 9–PAI | asnT | 1 | 1 | 58.27 | 1 | 0 | 1 | 15 |
| 10–PAI | serU | 0 | 1 | 37.65 | 12 | 5 | 1 | 19 |
| 11–PAI | Absent | 4 | 2 | 50.23 | 23 | 4 | 1 | 74 |
| 12–ϕ | Absent | 0 | 1 | 49.78 | 9 | 4 | 0 | 14 |
| 13–ϕ | Absent | 1 | 1 | 50.97 | 28 | 5 | 0 | 51 |
| 14–PAI | serX | 12 | 3 | 48.76 | 46 | 5 | 4 | 102 |
| 15–ϕ | Absent | 0 | 1 | 50.45 | 15 | 1 | 0 | 42 |
| 16–PAI | aspV | 12 | 0 | 47.43 | 38 | 5 | 5 | 83 |
(Gold standard GIs),
(Transfer RNA),
(Percentage of guanine and cytosine content in the region),
(Hypothetical proteins),
(Uncharacterized proteins),
(Virulence genes),
(Total encoding sequences),
ϕ: (Island containing predominant bacteriophage DNA).
PAI, Pathogenicity island.
Total of relevant CDS present in the 16 described islands of the gold standard compared with the results of the predictors.
| Tool | tRNA | | | Hypothetical proteins | Uncharacterized proteins | CDS | Gold standard CDS covered (%) |
|---|---|---|---|---|---|---|---|
| GIPSy | 11 | 16 | 80 | 276 | 40 | 738 | 91 |
| Alien Hunter | 5 | 13 | 70 | 241 | 37 | 655 | 81 |
| IslandViewer4 | 3 | 12 | 75 | 239 | 37 | 636 | 78 |
| Predict Bias | 1 | 4 | 26 | 101 | 17 | 182 | 22 |
| GI Hunter | 0 | 0 | 15 | 56 | 11 | 143 | 17 |
| Zisland Explorer | 0 | 1 | 13 | 47 | 9 | 137 | 16 |
In bold, total products present in the 16 islands of the gold standard.
(Gold standard),
(Transfer RNA),
(Hypothetical proteins),
(Uncharacterized proteins),
(Number of coding sequences).
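As a consistency check, the final column of the table above appears to equal each tool's CDS count divided by 807, the sum of the CDS counts of the 16 gold standard islands, with the fraction truncated. This reading is an inference from the numbers, not a formula stated in the source; the names `GOLD_CDS`, `PREDICTED_CDS`, and `coverage_percent` are hypothetical.

```python
# CDS counts of the 16 gold standard islands, copied from the table of island products.
GOLD_CDS = [15, 61, 70, 124, 25, 49, 37, 26, 15, 19, 74, 14, 51, 102, 42, 83]

# CDS recovered by each tool, copied from the comparison table.
PREDICTED_CDS = {
    "GIPSy": 738,
    "Alien Hunter": 655,
    "IslandViewer4": 636,
    "Predict Bias": 182,
    "GI Hunter": 143,
    "Zisland Explorer": 137,
}


def coverage_percent(recovered: int, gold_total: int) -> int:
    """Share of gold standard CDS recovered, truncated to a whole percent (assumed)."""
    return recovered * 100 // gold_total


gold_total = sum(GOLD_CDS)
coverage = {tool: coverage_percent(n, gold_total) for tool, n in PREDICTED_CDS.items()}
```

Under this assumption the computed values reproduce the table's percentages for all six tools.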
Precision, recall, and F-score for the 16 gold standard GIs.
| Metric | P1 | P2 | P3 | P4 | P5 | P6 |
|---|---|---|---|---|---|---|
| Precision | 16% | 17% | 34% | 17% | 27% | 8% |
| Recall | 81% | 19% | 81% | 81% | 19% | 38% |
| F-Score | 0.263 | 0.176 | 0.481 | 0.277 | 0.222 | 0.130 |
Predictors followed by the total number of correct predictions: P1, 13; P2, 3; P3, 13; P4, 13; P5, 3; P6, 6.
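The recall row is consistent with dividing each predictor's number of correct predictions by the 16 gold standard islands, and the F-score with the usual harmonic mean of precision and recall. The Python sketch below uses those standard definitions, which are assumed here and not quoted from the source; the function names are hypothetical. Because the precision values are taken from the table already rounded, the recomputed F-scores differ slightly from the printed ones.

```python
GOLD_ISLANDS = 16

# Correct predictions per predictor, from the table legend.
CORRECT = {"P1": 13, "P2": 3, "P3": 13, "P4": 13, "P5": 3, "P6": 6}

# Precision as printed in the table (rounded percentages converted to fractions).
PRECISION = {"P1": 0.16, "P2": 0.17, "P3": 0.34, "P4": 0.17, "P5": 0.27, "P6": 0.08}


def recall(correct: int, total_gold: int) -> float:
    """Fraction of gold standard islands that were correctly predicted."""
    return correct / total_gold


def f_score(precision: float, recall_value: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall_value == 0:
        return 0.0
    return 2 * precision * recall_value / (precision + recall_value)


recalls = {name: recall(n, GOLD_ISLANDS) for name, n in CORRECT.items()}
f_scores = {name: f_score(PRECISION[name], recalls[name]) for name in CORRECT}
```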
Figure 5. BLASTp hits of the tool intersections. The blue bars gradually display the tool intersections. The black circles show which tools intersect, and the black bars show how many times each intersection occurred.
Hit intersections between predictors in all organisms and total number of islands predicted.
| Total of hits | 684 | 533 | 289 | 249 | 209 | 70 |
| Total of predicted GIs | 333 | 320 | 161 | 89 | 81 | 33 |
| Common islands | 108 (32%) | 223 (70%) | 140 (87%) | 66 (74%) | 40 (49%) | 21 (64%) |
| Unique islands | 225 (67%) | 97 (30%) | 21 (13%) | 23 (26%) | 41 (51%) | 12 (36%) |
Total of hits is the number of result intersections between predictors; total of predicted GIs is the number of GIs predicted by each tool; common islands are the GIs predicted by two or more tools; unique islands are the GIs predicted by only one tool.
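The common and unique shares in the table above can be recomputed from the counts. This Python sketch keeps the six columns in table order, because the tool names for the columns are not given here; the names `COLUMNS` and `shares` are hypothetical.

```python
# (total predicted GIs, common islands, unique islands) for each column, in table order.
COLUMNS = [
    (333, 108, 225),
    (320, 223, 97),
    (161, 140, 21),
    (89, 66, 23),
    (81, 40, 41),
    (33, 21, 12),
]


def shares(total: int, common: int, unique: int) -> tuple:
    """Return (common %, unique %) after checking that the counts add up."""
    if common + unique != total:
        raise ValueError("common and unique islands must sum to the total")
    return 100 * common / total, 100 * unique / total


results = [shares(*column) for column in COLUMNS]
```

In every column the common and unique counts sum to the total. The recomputed unique share of the first column is about 67.6%, which the table prints as 67%.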