Rui Kong1, Xinnan Xu1, Xiaoqing Liu2, Pingan He3, Michael Q Zhang4,5, Qi Dai6,7.
Abstract
BACKGROUND: Genomic islands are associated with microbial adaptations, carrying genomic signatures different from the host. Some methods perform an overall test to identify genomic islands based on their local features. However, regions of different scales will display different genomic features.
Keywords: Boundary detection; Genomic island detection; Genomic signature; Large scale test; Small scale test
Year: 2020 PMID: 32349677 PMCID: PMC7191778 DOI: 10.1186/s12859-020-3501-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparison of the window-based methods Centroid, INDeGenIUS, SigHunt, and the proposed 2SigFinder on classification of GIs/non-GI datasets. The precision, recall and overall accuracy of each method are calculated based on the number of overlapping nucleotides in both published GIs and predicted GIs
| Method | Predicted GI total length | Predicted GI total number | Predicted GI average length | Nucleotide-level accuracy (%) | Nucleotide-level precision (%) | Nucleotide-level recall (%) |
|---|---|---|---|---|---|---|
| Centroid | 5,573,339 | 320 | 17,417 | 82.37 | 61.35 | 27.63 |
| INDeGenIUS | 3,641,371 | 277 | 13,146 | 82.43 | 67.94 | 19.99 |
| SigHunt | 5,813,441 | 758 | 4670 | 80.54 | 51.00 | 23.95 |
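
The caption above says the nucleotide-level measures are computed from the nucleotides shared by published and predicted GIs. A minimal sketch of that bookkeeping follows, assuming 1-based inclusive coordinates and the usual definition of accuracy as (TP + TN) / genome length; neither convention is stated in this record, and the function names and toy coordinates are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Interval = Tuple[int, int]  # assumed 1-based, inclusive (start, end)


def covered_positions(intervals: Iterable[Interval]) -> Set[int]:
    """Return every genome position covered by at least one interval."""
    positions: Set[int] = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_metrics(
    known: Iterable[Interval],
    predicted: Iterable[Interval],
    genome_length: int,
) -> Dict[str, float]:
    """Accuracy, precision and recall counted per nucleotide."""
    known_pos = covered_positions(known)
    pred_pos = covered_positions(predicted)
    tp = len(known_pos & pred_pos)   # predicted and known
    fp = len(pred_pos - known_pos)   # predicted but not known
    fn = len(known_pos - pred_pos)   # known but missed
    tn = genome_length - tp - fp - fn
    return {
        "accuracy": (tp + tn) / genome_length,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy example with made-up coordinates on a 100 bp "genome".
metrics = nucleotide_metrics(known=[(11, 30)], predicted=[(21, 40)], genome_length=100)
```

Position sets keep the sketch short; for whole genomes an interval-merging approach would use far less memory.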
Fig. 1 Performance of the proposed 2SigFinder (2SF), SIGI-HMM (SH), Alien_Hunter (AH), Centroid (CE), IslandPath-DIMOB (IPA), INDeGenIUS (IN), SigHunt (SI) and IslandPick (IPI) on the detection of genomic islands in P. aeruginosa LESB58. (a) Predicted GIs found by all of the methods; the known genomic islands are shown as vertical grey bars. (b) Overall length of the predicted genomic islands, true positives and false positives of all of the evaluated methods at the nucleotide level. (c) Precision, false positive rate (FPR) and F1-score of all of the evaluated methods at the island level, calculated from the number of known GIs that are more than 50% covered by the results of the prediction methods
Total length, average length and number of genomic islands predicted by 2SigFinder, SIGI-HMM, Alien_Hunter, Centroid, IslandPath-DIMOB, INDeGenIUS, SigHunt and IslandPick in P. aeruginosa LESB58, together with the total number of nucleotides shared by known and predicted GIs and the number of known GIs that are at least 50% covered by the results of each prediction method
| Method | Predicted GI length | Predicted GI number | Predicted GI average length | Nucleotides in both RGIs and PGIs^a | RGIs/PGIs^b (> 50%) |
|---|---|---|---|---|---|
| IslandPick | 275,178 | 16 | 17,199 | 209,001 | 5 |
| IslandPath-DIMOB | 95,919 | 10 | 9592 | 59,146 | 3 |
| Sigi-HMM | 110,465 | 21 | 5260 | 83,573 | 0 |
| Alien Hunter | 822,570 | 71 | 11,585 | 292,823 | 6 |
| Centroid | 308,000 | 14 | 22,000 | 121,503 | 4 |
| INDeGenIUS | 160,000 | 10 | 16,000 | 88,473 | 3 |
| SigHunt | 292,029 | 29 | 10,070 | 78,836 | 2 |
^a Total number of overlapping nucleotides in both known GIs and predicted GIs
^b Number of known GIs that are more than 50% covered by the results of the prediction method
Precision, false positive rate (FPR) and F1-score of the proposed method 2SigFinder, SIGI-HMM, Alien_Hunter, Centroid, IslandPath-DIMOB, INDeGenIUS, SigHunt and IslandPick on detection of genomic islands in P. aeruginosa LESB58. All three measures are calculated from the number of known GIs that are more than 50% covered by the results of each prediction method
| Method class | Method type | Method | Precision | FDR | F1-score |
|---|---|---|---|---|---|
| Comparative genomics | | IslandPick | 31.25 | 68.75 | 37.04 |
| Sequence composition | HMM-based methods | IslandPath-DIMOB | 30 | 70 | 28.57 |
| Sequence composition | HMM-based methods | Sigi-HMM | 0 | 100 | 0 |
| Sequence composition | HMM-based methods | Alien Hunter | 8.45 | 91.55 | 14.63 |
| Sequence composition | Window-based methods | Centroid | 28.57 | 71.43 | 32 |
| Sequence composition | Window-based methods | INDeGenIUS | 30 | 70 | 28.57 |
| Sequence composition | Window-based methods | SigHunt | 6.90 | 93.10 | 10 |
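
The island-level values in this table can be reproduced from the counts in the preceding LESB58 table: precision is the number of known GIs that are more than 50% covered divided by the number of predicted GIs, and the FDR column equals 100 minus precision. The sketch below is an illustration, not the authors' code. The total number of known GIs is not given in this record; `N_KNOWN = 11` is an assumption, chosen because it is the value that reproduces the reported F1-scores, and the function name is hypothetical.

```python
from typing import Tuple

# Assumption: not stated in this record; 11 reproduces the reported F1-scores.
N_KNOWN = 11


def island_level_scores(
    n_predicted: int, n_hits: int, n_known: int = N_KNOWN
) -> Tuple[float, float, float]:
    """Return (precision, FDR, F1) in percent, rounded to two decimals.

    n_predicted: number of predicted GIs
    n_hits:      number of known GIs covered by more than 50%
    n_known:     total number of known GIs (assumed, see above)
    """
    precision = n_hits / n_predicted if n_predicted else 0.0
    recall = n_hits / n_known if n_known else 0.0
    fdr = 1.0 - precision
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return (round(100 * precision, 2), round(100 * fdr, 2), round(100 * f1, 2))


# Counts taken from the LESB58 table: (predicted GIs, known GIs > 50% covered).
islandpick = island_level_scores(16, 5)
sighunt = island_level_scores(29, 2)
```

With these inputs the IslandPick row comes out as 31.25, 68.75 and 37.04, matching the table.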
Summary of functional features predicted by 2SigFinder, SIGI-HMM, Alien_Hunter, Centroid, IslandPath-DIMOB, INDeGenIUS, SigHunt and IslandPick on detection of genomic islands in S. enterica Typhi CT18. The functional features are counted as the number of related genes in the real genomic islands that are more than 50% covered by the results of each prediction method
| Source | Method | Pathogenicity | Integrase | Phage | tRNA | HEG | Transposase | Virulence | Repeats | IS |
|---|---|---|---|---|---|---|---|---|---|---|
| Genuine GI | | 27 | 5 | 130 | 1 | 36 | 9 | 2 | 2 | 3 |
| Predicted GIs | IslandPick | 0 | 1 | 10 | 0 | 8 | 0 | 0 | 0 | 0 |
| Predicted GIs | IslandPath-DIMOB | 0 | 2 | 58 | 0 | 10 | 1 | 0 | 0 | 0 |
| Predicted GIs | Sigi-HMM | 16 | 0 | 5 | 0 | 6 | 3 | 0 | 0 | 0 |
| Predicted GIs | Alien_Hunter | 65 | 1 | 23 | 6 | 2 | 2 | 3 | | |
| Predicted GIs | Centroid | 4 | 0 | 3 | 1 | 3 | 0 | 2 | 1 | 0 |
| Predicted GIs | INDeGenIUS | 5 | 2 | 1 | 1 | 7 | 0 | 2 | 0 | 0 |
| Predicted GIs | SigHunt | 0 | 1 | 15 | 0 | 1 | 2 | 0 | 2 | 0 |
Ten pathogenicity islands (PAIs) reported in S. enterica Typhi CT18. The name, start position, end position, size and function of each PAI are summarized from the pathogenicity island database (PAIDB)
| Name | Start | End | Size (bp) | Function |
|---|---|---|---|---|
| SPI-1 | 2,858,736 | 2,900,586 | 41,851 | Type III secretion system, invasion into epithelial cells, apoptosis (InvA, OrgA, SptP, SipA, SipB, SipC, SipD, SopE, prgH) |
| SPI-2 | 1,624,920 | 1,666,524 | 41,605 | Type III secretion system, required for systemic infection and intracellular pathogenesis by facilitating replication of intracellular bacteria within membrane-bound Salmonella-containing vacuoles |
| SPI-3 | 3,883,613 | 3,900,553 | 16,941 | Invasion, survival in monocytes, Mg2+ uptake (MgtC, B, MarT, MisL) |
| SPI-4 | 4,322,993 | 4,346,383 | 23,391 | Type I secretion system, putative toxin secretion, apoptosis, required for intracellular survival in macrophages, genes weakly similar to RTX-like toxins |
| SPI-5 | 1,085,068 | 1,092,563 | 7496 | Effector proteins for SPI-1 and SPI-2 (SopB, SigD, PipB) |
| SPI-6 | 302,092 | 360,757 | 58,666 | safA-D and tcsA-R chaperone-usher fimbrial operons |
| SPI-7 | 4,409,511 | 4,543,148 | 133,638 | Vi exopolysaccharide, SopE prophage and a type IVB pilus operon |
| SPI-8 | 3,132,530 | 3,139,414 | 6885 | Two bacteriocin pseudogenes, genes conferring immunity to the bacteriocins |
| SPI-9 | 2,743,495 | 2,759,190 | 15,696 | Type I secretory apparatus, large RTX-like protein |
| SPI-10 | 4,683,605 | 4,716,538 | 32,934 | Phage 46 and the sefA-R chaperone-usher fimbrial operon |
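
The sizes in this table are consistent with inclusive coordinates, that is, size = end - start + 1. The sketch below uses that convention to compute what percentage of an island is covered by a set of predicted intervals, which is the quantity behind a "more than 50% covered" criterion. It is an illustration only: the function names and the predicted intervals in the example are hypothetical, not results from any of the evaluated tools.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # inclusive (start, end)


def island_size(start: int, end: int) -> int:
    """Size in bp under inclusive coordinates."""
    return end - start + 1


def overlap_percentage(island: Interval, predicted: Iterable[Interval]) -> float:
    """Percentage of the island covered by the union of predicted intervals."""
    start, end = island
    # Keep only intervals that touch the island, clipped to its bounds.
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in predicted
        if s <= end and e >= start
    )
    covered = 0
    last_end = start - 1  # last position already counted
    for s, e in clipped:
        if e <= last_end:
            continue  # fully inside an earlier interval
        s = max(s, last_end + 1)  # skip the part already counted
        covered += e - s + 1
        last_end = e
    return 100.0 * covered / island_size(start, end)


# SPI-5 coordinates from the table; the predicted intervals are made up.
spi5 = (1_085_068, 1_092_563)
hypothetical_predictions = [(1_084_000, 1_087_000), (1_086_500, 1_088_815)]
pct = overlap_percentage(spi5, hypothetical_predictions)
```

Overlapping predictions are merged on the fly, so a nucleotide predicted twice is counted once.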
Fig. 2 Overlap percentages between the reported PAIs and the predicted genomic islands
Precision, recall and overall accuracy of SigHunt and INDeGenIUS, with significance levels from 0.05 to 0.2 used as cut-off values to evaluate their performance. All evaluation indexes are calculated at the nucleotide level
Overall length of the predicted genomic islands, true positives and false positives of all of the evaluated methods at the nucleotide level in S. enterica Typhi CT18
| Method | True positives | False positives | Overall length of PGIs |
|---|---|---|---|
| IslandPick | 106,587 | 206 | 106,793 |
| IslandPath | 233,096 | 58,168 | 291,264 |
| SIGI-HMM | 137,308 | 103,846 | 241,154 |
| Alien_Hunter | 449,085 | 531,001 | 980,086 |
| Centroid | 68,483 | 105,517 | 174,000 |
| INDeGenIUS | 61,214 | 58,786 | 120,000 |
| SigHunt | 102,160 | 155,840 | 258,000 |
| 2SigFinder | 357,218 | 97,551 | 454,769 |
PGI denotes predicted genomic islands
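
In every row of this table the overall PGI length equals true positives plus false positives, so a nucleotide-level precision follows directly as TP / (TP + FP). A small sketch of that arithmetic, using the counts copied from the table (the function names are hypothetical):

```python
from typing import Dict, Tuple

# (true positives, false positives) in nucleotides, copied from the table above.
COUNTS: Dict[str, Tuple[int, int]] = {
    "IslandPick": (106_587, 206),
    "IslandPath": (233_096, 58_168),
    "SIGI-HMM": (137_308, 103_846),
    "Alien_Hunter": (449_085, 531_001),
    "Centroid": (68_483, 105_517),
    "INDeGenIUS": (61_214, 58_786),
    "SigHunt": (102_160, 155_840),
    "2SigFinder": (357_218, 97_551),
}


def predicted_length(tp: int, fp: int) -> int:
    """Overall length of predicted GIs: every predicted base is a TP or an FP."""
    return tp + fp


def nucleotide_precision(tp: int, fp: int) -> float:
    """Precision in percent at the nucleotide level."""
    total = tp + fp
    return 100.0 * tp / total if total else 0.0


precision_by_method = {
    name: nucleotide_precision(tp, fp) for name, (tp, fp) in COUNTS.items()
}
```

Recall cannot be derived from this table alone, because the total length of the known GIs is not listed here.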
Fig. 3 Overview of the 2SigFinder algorithm. (a) The workflow of the small-scale t-test with large-scale feature selection, in which signatures of the host are extracted using the confidence interval of window variances, and core signatures are selected based on ordered kurtosis. During an iteration, we score each window using the two-sample t-test and select the windows whose scores are large enough to be considered statistically significant. (b) The workflow of the large-scale statistical test using dynamic signals from small-scale feature selection. Starting from the higher moments of each tetranucleotide, we select signatures of the host using the confidence interval of window variances and select dynamic core signatures using large sliding windows. During an iteration, we score each sliding long window with an accumulative score and select the windows whose scores are large enough to be considered statistically significant
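
The caption mentions tetranucleotide signatures, higher moments and an ordering by kurtosis. As a rough illustration of those ingredients only, the sketch below computes k-mer frequencies in non-overlapping windows, takes the excess kurtosis of each k-mer's frequency across windows, and sorts the k-mers by it. This is not the 2SigFinder implementation: the window layout, the use of population excess kurtosis, the descending order and all function names are assumptions, and the confidence-interval and t-test steps are not shown.

```python
from collections import Counter
from typing import Dict, List, Tuple


def kmer_frequencies(window: str, k: int = 4) -> Dict[str, float]:
    """Relative frequency of each overlapping k-mer in one window."""
    counts = Counter(window[i:i + k] for i in range(len(window) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def excess_kurtosis(values: List[float]) -> float:
    """Population excess kurtosis m4 / m2**2 - 3; 0.0 for constant input."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        return 0.0
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 * m2) - 3.0


def rank_signatures_by_kurtosis(
    genome: str, window_size: int, k: int = 4
) -> List[Tuple[str, float]]:
    """Order k-mers by the kurtosis of their per-window frequencies."""
    # Assumption: non-overlapping windows; a trailing partial window is dropped.
    windows = [
        genome[i:i + window_size]
        for i in range(0, len(genome) - window_size + 1, window_size)
    ]
    freqs = [kmer_frequencies(w, k) for w in windows]
    kmers = sorted({kmer for f in freqs for kmer in f})
    scores = {
        kmer: excess_kurtosis([f.get(kmer, 0.0) for f in freqs]) for kmer in kmers
    }
    # Assumption: highest kurtosis first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy sequence: nine AC-rich windows and one G-only window, dinucleotides for brevity.
toy = "ACACACAC" * 9 + "GGGGGGGG"
ranked = rank_signatures_by_kurtosis(toy, window_size=8, k=2)
```

A k-mer whose frequency is flat across most windows but spikes in a few gets a high kurtosis, which is why such a statistic can point at signatures concentrated in atypical regions.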