EGID: an ensemble algorithm for improved genomic island detection in genomic sequences.

Dongsheng Che, Mohammad Shabbir Hasan, Han Wang, John Fazekas, Jinling Huang, Qi Liu.   

Abstract

Genomic islands (GIs) are genomic regions that were originally transferred from other organisms. The detection of genomic islands in genomes can lead to many applications in industrial, medical and environmental contexts. Existing computational tools for GI detection suffer from either low recall or low precision, leaving room for improvement. In this paper, we report the development of our Ensemble algorithm for Genomic Island Detection (EGID). EGID takes the prediction results of existing computational tools, filters them, and generates consensus prediction results. Performance comparisons between our ensemble algorithm and existing programs show that it outperforms each of the existing programs tested in overall accuracy (performance coefficient and F-measure). EGID was implemented in Java, and was compiled and executed on Linux operating systems. EGID is freely available at http://www5.esu.edu/cpsc/bioinfo/software/EGID.

Keywords:  Bacterial genomes; Ensemble algorithm; Genomic islands

Year:  2011        PMID: 22355228      PMCID: PMC3280502          DOI: 10.6026/007/97320630007311

Source DB:  PubMed          Journal:  Bioinformation        ISSN: 0973-2063


Background:

Genomic islands are chromosomal regions that show evidence of horizontal gene transfer. The study of genomic islands is important to biomedical research, because such knowledge can help explain why some strains of bacteria within the same species are pathogenic while others are not, or why some strains can adapt to extreme environments while others cannot. Current approaches to detecting genomic islands include comparative genomic analysis and sequence composition analysis. Comparative genomic analysis consists of collecting the genome sequences of phylogenetically closely related species, aligning these genome sequences, and then considering those genome segments that are present in a query genome but absent from the others to be GIs [1]. Because this type of approach requires a sufficient number of phylogenetically closely related reference genomes, it cannot be applied to all genomes. The second kind of approach, the sequence composition-based approach, does not require reference genomes and can be applied to any genome. It is generally believed that each genome has a unique genomic sequence signature, and thus genomic islands can be detected by analyzing sequence composition. Existing sequence composition-based tools include AlienHunter [2], Centroid [3], COLOMBO SIGI-HMM [4], IslandPath [5], INDeGenIUS [6], and PAI-IDA [7]. Recent assessments of these computational tools have shown that none of them can predict genomic islands accurately in all genomes [1]. Langille [8] further suggested that a computational framework combining the prediction results of multiple existing programs should be developed for more accurate genomic island prediction. In this paper, we present our ensemble program for improved genomic island prediction, based on the predicted results of five existing GI programs.
The framework of our approach includes: (a) collecting prediction results from existing programs; (b) analyzing and filtering the predicted results; and (c) generating final consensus GI results (Figure 1). Experimental tests on benchmark datasets have shown that our ensemble program can improve prediction accuracy, and thus it may be used for future GI prediction.
Figure 1

The flowchart of our computational framework for GI prediction

Methodology:

Data sets:

Genomic sequences used for GI prediction were collected from the National Center for Biotechnology Information (NCBI) FTP server (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). The genomic islands used for performance evaluation of GI tools were obtained with IslandPick [1].

Prediction of GIs with existing tools:

In our framework, we used the predicted GI results from five GI tools: AlienHunter, COLOMBO SIGI-HMM, INDeGenIUS, IslandPath, and PAI-IDA. All five programs take genome sequences as input, and some of them require additional inputs such as gene annotations. The prediction results from these programs were used in our ensemble method.

Ensemble method:

Since GIs can range in size from several kilobase pairs (kb) to several hundred kb, it is very unlikely that two different GI prediction tools predict exactly the same genomic islands. Instead, the GIs predicted by different tools often overlap only partially, making it difficult to vote directly on the predicted GI regions. To handle this problem, we considered the genes within the predicted GI regions to be GI genes, and all other genes to be non-GI genes. We collected GI and non-GI gene information based on the prediction results of multiple GI tools. A simple voting scheme could then be applied by selecting a threshold value and considering any region in which all contained genes meet the threshold to be a GI region, as shown in Figure 2. This approach may work well for candidate GI regions that are far apart (Figure 2B), but not for those that are close to each other (Figure 2A), which are supposed to form one large GI region [8].
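The simple voting scheme described above can be sketched as follows. This is a minimal illustration in Python, not the EGID implementation (which was written in Java). The function names, the coordinate representation, and the rule that a gene counts as a GI gene only when it lies entirely inside a predicted island are assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) coordinates, inclusive


def gene_votes(gene_spans: List[Interval],
               predictions_by_tool: List[List[Interval]]) -> List[int]:
    """Count, for each gene, how many tools place it inside a predicted GI."""
    votes = []
    for g_start, g_end in gene_spans:
        count = 0
        for islands in predictions_by_tool:
            # Assumption: the gene must lie entirely within one island.
            if any(i_start <= g_start and g_end <= i_end
                   for i_start, i_end in islands):
                count += 1
        votes.append(count)
    return votes


def threshold_regions(votes: List[int], threshold: int) -> List[Interval]:
    """Return (first_gene, last_gene) index pairs of maximal runs in which
    every gene has at least `threshold` votes."""
    regions = []
    start = None
    for idx, v in enumerate(votes):
        if v >= threshold:
            if start is None:
                start = idx
        elif start is not None:
            regions.append((start, idx - 1))
            start = None
    if start is not None:
        regions.append((start, len(votes) - 1))
    return regions


# Hypothetical toy input: four genes and three tools.
genes = [(1, 100), (150, 300), (350, 500), (600, 800)]
tools = [[(1, 320)], [(140, 520)], []]
votes = gene_votes(genes, tools)
regions_one_vote = threshold_regions(votes, 1)
regions_two_votes = threshold_regions(votes, 2)
```

Raising the threshold shrinks the regions that survive, which is why a pure per-gene vote can split one large island into fragments.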
Figure 2

Illustrative examples of candidate GI regions, where two candidate GI regions are close in (A) and distant in (B). Each vertical bar represents the vote of a GI gene by multiple GI tools. The candidate GI regions meeting the threshold value are underlined.

To resolve this problem, we propose classifying a region as GI or non-GI based on the overall score of all genes in the region, rather than on individual gene scores. We first form candidate GI regions, G1, G2, G3, …, Gm, where two neighboring GI regions, Gi and Gi+1, are separated by a non-GI region (i.e., a region that none of the programs predicted to be a GI). We then merge two neighboring GI regions, Gi and Gi+1, if the average score of all genes (the genes in Gi and Gi+1, and the genes between the two regions) meets a predefined threshold value T1. With this measure, two close GI regions are merged (Figure 2A), whereas distant GI regions are not (Figure 2B). If two GI regions are merged into a newly formed GI region Gi,i+1, then Gi,i+1 and Gi+2 are picked for the next merging test; otherwise, Gi+1 and Gi+2 are selected. The merging process is repeated until it reaches the last GI region, yielding a set of GI regions G'1, G'2, G'3, …, G'n. We further filter out GIs from the previous step if (a) the GI is short (i.e., it contains fewer than eight genes); and (b) the percentage of genes with high GI scores (i.e., >1) does not meet a threshold value T2, so that predicted GIs are supported by multiple programs. The determination of the threshold values T1 and T2 is described in the Supplementary Material.
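The merging and filtering steps can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the authors' Java code: a gene's score is taken to be the number of tools that voted for it, "meets the threshold" is read as greater than or equal to, a region is dropped if it fails either filter criterion, and the T1 and T2 values used in the example are placeholders (the actual values are given in the Supplementary Material).

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (first_gene_index, last_gene_index), inclusive


def candidate_regions(scores: List[int]) -> List[Region]:
    """Maximal runs of genes with score > 0, i.e. runs separated by genes
    that no tool predicted to be in a GI."""
    regions, start = [], None
    for idx, s in enumerate(scores):
        if s > 0:
            if start is None:
                start = idx
        elif start is not None:
            regions.append((start, idx - 1))
            start = None
    if start is not None:
        regions.append((start, len(scores) - 1))
    return regions


def merge_regions(scores: List[int], regions: List[Region],
                  t1: float) -> List[Region]:
    """Merge neighboring regions when the average score over both regions
    and the gap between them reaches t1."""
    if not regions:
        return []
    merged = [regions[0]]
    for start, end in regions[1:]:
        cur_start, _ = merged[-1]
        span = scores[cur_start:end + 1]  # both regions plus the gap
        if sum(span) / len(span) >= t1:
            # The merged region is tested against the next neighbor.
            merged[-1] = (cur_start, end)
        else:
            merged.append((start, end))
    return merged


def filter_regions(scores: List[int], regions: List[Region],
                   t2: float, min_genes: int = 8) -> List[Region]:
    """Keep regions that are long enough and well supported.
    Assumption: failing either criterion removes the region."""
    kept = []
    for start, end in regions:
        span = scores[start:end + 1]
        high_fraction = sum(1 for s in span if s > 1) / len(span)
        if len(span) >= min_genes and high_fraction >= t2:
            kept.append((start, end))
    return kept


# Hypothetical per-gene scores; T1 = 1.5 and T2 = 0.5 are placeholders.
scores = [2, 3, 2, 0, 2, 3, 3, 2, 2, 0, 0, 0, 0, 0, 0, 1, 1]
candidates = candidate_regions(scores)
merged = merge_regions(scores, candidates, t1=1.5)
final = filter_regions(scores, merged, t2=0.5)
```

In this toy input the first two candidates are separated by a single unscored gene and merge, while the third lies beyond a long gap, so the average drops below the placeholder T1 and it stays separate before being removed as too short.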

Performance evaluation:

To evaluate the performance of our model, we compared the predicted GIs with the benchmark dataset [1]. The benchmark dataset contains GIs picked from 118 genomes, and we predicted GIs on these 118 genomes using our EGID algorithm. True positives (TP) are the nucleotides in the positive benchmark dataset that are predicted to be genomic islands. True negatives (TN) are the nucleotides in the negative benchmark dataset that are predicted to be non-genomic islands. False positives (FP) are the nucleotides in the negative benchmark dataset that are predicted to be genomic islands. False negatives (FN) are the nucleotides in the positive benchmark dataset that are not predicted to be genomic islands. We focus on four validation measures: recall = TP/(TP+FN), precision = TP/(TP+FP), performance coefficient (PC) = TP/(TP+FP+FN), and F-measure = 2*recall*precision/(recall + precision).
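The four measures follow directly from the nucleotide counts. The short Python sketch below computes them; the function name, the zero-denominator handling, and the example counts are illustrative assumptions and are not taken from the paper's results.

```python
from typing import Dict


def evaluate(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute recall, precision, performance coefficient (PC) and
    F-measure from nucleotide-level counts."""
    # Assumption: a measure with a zero denominator is reported as 0.0.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    pc = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    if recall + precision:
        f_measure = 2 * recall * precision / (recall + precision)
    else:
        f_measure = 0.0
    return {"recall": recall, "precision": precision,
            "pc": pc, "f_measure": f_measure}


# Hypothetical counts, for illustration only.
result = evaluate(tp=60, fp=20, fn=40)
```

TN does not enter any of the four measures, so a tool cannot raise its score simply by predicting very few islands in a genome that is mostly non-GI sequence.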

Discussion:

We collected 118 prokaryotic genomes from the National Center for Biotechnology Information (NCBI) FTP server, ran our EGID program, and generated GI locations (http://www5.esu.edu/cpsc/bioinfo/software/EGID) for each genome. We used the genomic islands obtained by IslandPick [1] as the benchmark to evaluate the GIs predicted by EGID. We also collected the predicted GI results of the five component programs in EGID, and summarized all performance results in Table 1 (see supplementary). As Table 1 shows, both COLOMBO SIGI-HMM and IslandPath have relatively high precision but low recall. AlienHunter, on the other hand, has relatively high recall but low precision. EGID strikes a balance between the two, reaching relatively high recall (0.630) and precision (0.630). Since PC and F-measure each capture both recall and precision in a single accuracy measurement, their values reflect overall performance more accurately. EGID improves on the best existing program, AlienHunter, by 12.14% in PC and 7.88% in F-measure, indicating the performance gain of our ensemble method. To view the predicted GIs, we displayed the GI locations using a popular visualization tool, Circos [9]. As Figure 3 shows, EGID always picks GIs predicted by multiple programs, which supports the reliability of the selected GIs. The circular representations of the other 117 genomes can also be found on our website.
Figure 3

Circular representation of the Escherichia coli O157:H7 str. Sakai genome (NC_002695) showing predicted GIs, with each circle showing the predictions of one program. From the outer to the inner circle, the predicted GIs are from EGID, AlienHunter, COLOMBO SIGI-HMM, INDeGenIUS, IslandPath, and PAI-IDA. The shaded parts show the GIs predicted by EGID and the supporting GIs predicted by the other programs.

Conclusion:

In this paper, we have reported the development of an ensemble algorithm, EGID, for more accurate GI detection. We hope that our improved GI prediction program will aid studies of molecular evolution and horizontal gene transfer.
References:

1.  Evaluation of genomic island predictors using a comparative genomics approach.

Authors:  Morgan G I Langille; William W L Hsiao; Fiona S L Brinkman
Journal:  BMC Bioinformatics       Date:  2008-08-05       Impact factor: 3.169

2.  Interpolated variable order motifs for identification of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands.

Authors:  Georgios S Vernikos; Julian Parkhill
Journal:  Bioinformatics       Date:  2006-07-12       Impact factor: 6.937

3.  Identification of compositionally distinct regions in genomes using the centroid method.

Authors:  Issaac Rajan; Sarang Aravamuthan; Sharmila S Mande
Journal:  Bioinformatics       Date:  2007-08-27       Impact factor: 6.937

4.  Score-based prediction of genomic islands in prokaryotic genomes using hidden Markov models.

Authors:  Stephan Waack; Oliver Keller; Roman Asper; Thomas Brodag; Carsten Damm; Wolfgang Florian Fricke; Katharina Surovcik; Peter Meinicke; Rainer Merkl
Journal:  BMC Bioinformatics       Date:  2006-03-16       Impact factor: 3.169

5.  IslandPath: aiding detection of genomic islands in prokaryotes.

Authors:  William Hsiao; Ivan Wan; Steven J Jones; Fiona S L Brinkman
Journal:  Bioinformatics       Date:  2003-02-12       Impact factor: 6.937

6.  INDeGenIUS, a new method for high-throughput identification of specialized functional islands in completely sequenced organisms.

Authors:  Sakshi Shrivastava; Ch V Siva Kumar Reddy; Sharmila S Mande
Journal:  J Biosci       Date:  2010-09       Impact factor: 1.826

7.  Detecting pathogenicity islands and anomalous gene clusters by iterative discriminant analysis.

Authors:  Qiang Tu; Dafu Ding
Journal:  FEMS Microbiol Lett       Date:  2003-04-25       Impact factor: 2.742

Review 8.  Detecting genomic islands using bioinformatics approaches.

Authors:  Morgan G I Langille; William W L Hsiao; Fiona S L Brinkman
Journal:  Nat Rev Microbiol       Date:  2010-05       Impact factor: 60.633

9.  Circos: an information aesthetic for comparative genomics.

Authors:  Martin Krzywinski; Jacqueline Schein; Inanç Birol; Joseph Connors; Randy Gascoyne; Doug Horsman; Steven J Jones; Marco A Marra
Journal:  Genome Res       Date:  2009-06-18       Impact factor: 9.043
