Gordana M Pavlović-Lazetić, Nenad S Mitić, Milos V Beljanski.
Abstract
The paper presents a novel, n-gram-based method for analysis of bacterial genome segments known as genomic islands (GIs). Identification of GIs in bacterial genomes is an important task since many of them represent inserts that may contribute to bacterial evolution and pathogenesis. In order to characterize and distinguish GIs from the rest of the genome, binary classification of islands based on n-gram frequency distribution has been performed. It consists of testing the agreement of the islands' n-gram frequency distributions with those of the complete genome and the backbone sequence. In addition, a statistic based on the maximal order Markov model is used to identify significantly overrepresented and underrepresented n-grams in islands. The results may be used as a basis for Zipf-like analysis suggesting that some of the n-grams are overrepresented in a subset of islands and underrepresented in the backbone, or vice versa, thus complementing the binary classification. The method is applied to strain-specific regions in the Escherichia coli O157:H7 EDL933 genome (O-islands), resulting in two groups of O-islands with different n-gram characteristics. It refines a characterization based on other compositional features such as G+C content and codon usage, and may help in identification of GIs, and also in research and development of adequate drugs targeting virulence genes in them.
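As an illustration of the n-gram frequency distribution the abstract refers to, here is a minimal Python sketch. The function name `ngram_percentages`, the use of overlapping windows, and the choice to skip windows containing symbols other than A, C, G, T are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def ngram_percentages(seq: str, n: int) -> dict:
    """Percent occurrence of every possible n-gram over A, C, G, T (assumed definition)."""
    seq = seq.upper()
    # Count overlapping n-grams; windows containing other symbols (e.g. N) are skipped.
    counts = Counter(
        seq[i:i + n]
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) <= set(ALPHABET)
    )
    total = sum(counts.values())
    all_ngrams = ("".join(p) for p in product(ALPHABET, repeat=n))
    # n-grams that never occur get 0.0, so every sequence yields a vector of 4**n values.
    return {g: (100.0 * counts[g] / total if total else 0.0) for g in all_ngrams}
```

For n = 5 this yields 1024 values per sequence, which matches the number of different 5-grams mentioned in the Fig. 1 caption.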
Year: 2008 PMID: 19101056 PMCID: PMC7185697 DOI: 10.1016/j.cmpb.2008.10.014
Source DB: PubMed Journal: Comput Methods Programs Biomed ISSN: 0169-2607 Impact factor: 5.428
Results of the Mann–Whitney two-sample U-test.
(a)

| OI# | Position | z | Asymp. Sig. | G + C content | CU bias | Related to | Coding for |
|---|---|---|---|---|---|---|---|
| 7 | 241549..277463 | −1.434 | 0.152 | 51.96% | 0.163 | Rhs element | Put. Macrophage tox.; put. protease Rhs element associated |
| 8 | 300060..310645 | −2.685 | | | | Phage or prophage related (CP-933H) | Phage related proteins |
| 28 | 579589..604753 | −8.804 | | | | | Put. RTX family proteins; put. membrane transport proteins |
| 30 | 664172..675785 | −2.770 | | 52.11% | 0.251 | Rhs element | Put. Rhs protein; put. receptor |
| 35 | 843271..857236 | −1.450 | 0.147 | 47.50% | 0.156 | | Put. regulator; put. enzyme; put. Transport |
| 36 | 892772..931359 | −0.945 | 0.345 | 49.32% | 0.185 | Phage or prophage related (CP-933K) | Phage related proteins |
| 43 | 1058620..1146182 | −0.183 | 0.855 | 47.96% | 0.179 | Phage or prophage related (P4-phage) | Phage related proteins; insertion sequence associated proteins; putative phage inhibition; colicin resistance; tellurite resistance |
| 44 | 1250302..1295546 | −1.760 | 0.078 | 51.94% | 0.176 | Phage or prophage related (CP-933M) | Phage related proteins |
| 45 | 1330836..1392498 | −0.096 | 0.923 | 49.37% | 0.192 | Phage or prophage related (BP-933W) | Phage related proteins (shiga-like toxin, tRNA, etc.) |
| 47 | 1420969..1452695 | −0.941 | 0.347 | 46.24% | 0.099 | | Transcriptional regulator MarT; hemagglutinin/hemolysin-related protein |
| 48 | 1454242..1541789 | −0.179 | 0.858 | 47.90% | 0.178 | Phage or prophage related (P4 family) | Insertion sequence associated proteins; put. urease; put. phage inhibition; colicin resistance; tellurite resistance |
| 50 | 1626570..1673884 | −1.354 | 0.176 | 48.40% | 0.205 | Phage or prophage related (CP-933N) | Phage related proteins |
| 51 | 1678561..1694142 | −1.968 | | 50.96% | 0.217 | Phage or prophage related (CP-933C) | Phage related proteins; sensor protein PhoQ |
| 52 | 1701990..1756455 | −0.858 | 0.391 | 47.18% | 0.159 | Phage or prophage related (CP-933X) | Phage related proteins |
| 57 | 1849324..1929825 | −0.461 | 0.645 | 51.29% | 0.156 | Phage or prophage related (CP-933O) | Phage related proteins (put. intestinal colonization factor tRNA phage or prophage related) |
| 71 | 2271618..2329601 | −1.280 | 0.200 | 50.85% | 0.183 | Phage or prophage related (CP-933P) | Phage related proteins (put. RNA identical to DicF cryptic prophage encoded); tRNA genes |
| 76 | 2668112..2689231 | −1.120 | 0.263 | 50.08% | 0.157 | Phage or prophage related (CP-933T) | Phage related proteins |
| 84 | 2843565..2857752 | −6.392 | | | | PAI related | Surface polysaccharides and antigens (O-antigen) |
| 93 | 2966157..3015072 | −1.041 | 0.298 | 51.66% | 0.200 | Phage or prophage related (CP-933V) | Phage related proteins (shiga-like toxin) |
| 102 | 3264256..3277062 | −1.207 | 0.227 | 44.74% | 0.291 | Phage or prophage related (P22, APSE-1, pM3) | Phage related proteins; enzymes; transport proteins |
| 108 | 3545770..3567450 | −3.210 | | | | Phage or prophage related (CP-933Y) | Phage related proteins |
| 115 | 3786306..3803253 | −5.830 | | | | PAI related | Type III secretion apparatus Proteins |
| 122 | 3919348..3942802 | −1.085 | 0.280 | 46.30% | 0.226 | PAI related | IS related proteins (put. pathogenicity island Integrase); PagC-like membrane protein |
| 138 | 4399382..4414774 | −3.783 | | 54.68% | 0.156 | PAI related | enzymes |
| 148 | 4649862..4693279 | −2.929 | | | | Phage or prophage related (CP-933L) PAI (LEE pathogenicity Island) | Phage related proteins; LEE PAI integrase; transport and secretory proteins; put. intimin receptor protein |
| 172 | 5377088..5421521 | −0.262 | 0.793 | 47.35% | 0.158 | | IS sequences related proteins; myosin heavy chain ( |
The Mann–Whitney two-sample U-test of equality of 5-gram frequency distributions is applied to OIs of the E. coli EDL933 of length >10 kb as the first sample, and the complete genome as the second sample (a). Included are OI number (OI#), its position in the genome (Position), z-value and p-value (Asymptotic Significance) of the test, G + C content and CU bias of the sequence, as well as possible origin of a sequence it is related to (Related to) and the element(s) it codes for (Coding for)—according to [5], [64].
The same test is applied to sequences from the backbone, as the first sample, and the complete genome E. coli EDL933, as the second sample (b). Included are the sequence number (No.), its position in the genome (Start, End), z-value and Asymptotic Significance (p-value) of the test, and the G + C content of the sequence.
Asymptotic Significance in bold denotes a low p-value (<0.05); G + C content data in bold denote a high percentage (>58%) and in italic a low percentage (<42%); CU bias data in bold denote a high CU bias (>0.30).
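The z-value and Asymptotic Significance columns can be illustrated with a short Python sketch of the Mann–Whitney U-test under the normal approximation. It assumes that each sample is a vector of n-gram percentages (1024 values for 5-grams), that ties receive average ranks with the usual tie correction, and that the smaller U is reported so that z is never positive; the function names are hypothetical, and the software actually used by the authors is not stated in this excerpt.

```python
import math

def asymptotic_p(z: float) -> float:
    """Two-sided p-value of a standard normal z (asymptotic significance)."""
    return math.erfc(abs(z) / math.sqrt(2))

def mann_whitney_z(x, y):
    """Return (z, p) of the Mann-Whitney U-test using the normal approximation."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1..j and all receive the average rank.
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        i = j
    r1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)  # smaller U, so z <= 0
    n = n1 + n2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return 0.0, 1.0  # all values identical: no evidence of a difference
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return z, asymptotic_p(z)
```

Under this approximation `asymptotic_p(-1.434)` is about 0.152, which agrees with the first row of table (a).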
Fig. 1. Distribution of 5-gram percentages in E. coli EDL933. Distribution of 5-gram percentages in the E. coli EDL933 complete genome, the backbone sequence, OIs from the AC class (a) and OIs from the 5DC class (b). Labeled coordinate axes are presented only for the complete genome subfigure. The histogram's x-axis shows percentage intervals; the y-axis shows the number of different 5-grams with percent occurrence falling into specific intervals (e.g., in the complete genome there are 18 5-grams with percentages in the interval 0.00–0.025%). x-Values range from 0 to the percent occurrence of the most frequent 5-gram in the corresponding sequence (e.g., 0.28% for the complete E. coli EDL933 genome, corresponding to the most frequent 5-gram, CCAGC). y-Values range from 0 to the number of different 5-grams with percent occurrence belonging to the modal interval (e.g., in the complete genome, there are 132 5-grams with percentages in the interval 0.08125–0.09375%); y-values sum up to 1024 (the number of different 5-grams). Axes scales for all the subfigures are the same as the ones presented in the complete genome subfigure. Similarity in shape is noticeable among sequences from the AC class represented in (a), as well as their dissimilarity with sequences in the 5DC class represented in (b).
The most overrepresented and the most underrepresented tetragrams in the complete E. coli EDL933 genome, backbone and the 9 O-islands from the 5DC class.
Rank of tetragrams.
| Tetragram | Backbone | EDL933 | #30 | #8 | #84 | #138 | #51 | #115 | #108 | #28 | #148 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CGCC | 1 | 1 | 10 | 4 | 3 | 4 | 59 | 72 | 13 | 5 | 15 |
| GGCG | 2 | 2 | 81 | 63 | 37 | 2 | 4 | 140 | 19 | 16 | 10 |
| CCAG | 3 | 6 | 16 | 9 | 62 | 41 | 127 | 68 | 78 | 142 | 61 |
| CTTC | 4 | 5 | 9 | 23 | 172 | 129 | 238 | 4 | 54 | 139 | 33 |
| GAAG | 5 | 3 | 51 | 56 | 69 | 11 | 21 | 133 | 29 | 7 | 44 |
| CTGG | 6 | 4 | 74 | 200 | 85 | 1 | 1 | 77 | 18 | 15 | 40 |
| GATG | 7 | 7 | 8 | 12 | 65 | 10 | 23 | 150 | 60 | 9 | 2 |
| CATC | 8 | 8 | 2 | 8 | 2 | 76 | 18 | 6 | 2 | 22 | 11 |
| TTGC | 9 | 11 | 59 | 102 | 130 | 17 | 2 | 65 | 55 | 58 | 72 |
| GCAA | 10 | 13 | 41 | 5 | 54 | 7 | 6 | 126 | 124 | 8 | 45 |
| GGAC | 247 | 247 | 241 | 65 | 139 | 197 | 180 | 180 | 207 | 236 | 185 |
| TATA | 248 | 248 | 251 | 235 | 131 | 226 | 225 | 256 | 206 | 195 | 254 |
| GATC | 249 | 253 | 249 | 255 | 204 | 235 | 254 | 207 | 253 | 237 | 253 |
| CTAG | 250 | 249 | 111 | 222 | 179 | 240 | 168 | 255 | 248 | 225 | 246 |
| CGCG | 251 | 252 | 245 | 168 | 233 | 252 | 217 | 175 | 211 | 255 | 239 |
| CAAG | 252 | 250 | 193 | 108 | 207 | 225 | 230 | 82 | 236 | 252 | 223 |
| CTTG | 253 | 251 | 218 | 216 | 120 | 237 | 106 | 112 | 200 | 159 | 161 |
| TTGG | 254 | 255 | 252 | 239 | 112 | 254 | 255 | 213 | 254 | 200 | 244 |
| CCAA | 255 | 254 | 253 | 237 | 163 | 255 | 251 | 123 | 228 | 249 | 202 |
| GGCC | 256 | 256 | 248 | 226 | 225 | 256 | 253 | 216 | 255 | 256 | 251 |
(a) The 10 topmost overrepresented and the 10 topmost underrepresented (according to a Markov model of order 2) tetragrams in the complete E. coli EDL933 genome, backbone and the 9 OIs from the 5DC class. Tetragrams in the first row of each group are overrepresented, while tetragrams in the second row are underrepresented. Tetragrams that are themselves palindromes are shaded in orange, while pairs of tetragrams and their reverse complements (in the same OI, backbone or E. coli EDL933 genome) are shaded in red, blue, green or yellow. All palindromes belong to the underrepresented group. Also, tetragrams and their reverse complements are either both underrepresented or both overrepresented.
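The palindrome and reverse-complement relations used in this caption can be checked with a few lines of Python. This is a small sketch; the helper names are chosen for this example only.

```python
from itertools import product

# Map each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(word: str) -> str:
    """Complement every base, then reverse the word."""
    return word.upper().translate(COMPLEMENT)[::-1]

def is_palindrome(word: str) -> bool:
    """A DNA palindrome equals its own reverse complement."""
    return word.upper() == reverse_complement(word)

# All palindromic tetragrams among the 256 possible ones.
PALINDROMIC_TETRAGRAMS = [
    "".join(p) for p in product("ACGT", repeat=4) if is_palindrome("".join(p))
]
```

For instance, GATC, CTAG, CGCG, TATA and GGCC from the rank table are palindromes, while CGCC and GGCG, or CCAG and CTGG, are reverse complements of each other.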
(b) Rank of tetragrams in the complete genome E. coli EDL933 and in 9 OIs from the 5DC class, relative to the 10 topmost overrepresented and underrepresented tetragrams in the backbone. While the most overrepresented tetragrams in the backbone have low rank in some of the 5DC OIs (e.g., CTTC in OI# 84, #138, #51, #28, or CTGG in OI# 8), the most underrepresented tetragrams in the backbone have low rank in all the other sequences, too.
Over–underrepresented tetragrams according to z-value and their relation to restriction enzyme sites.
| No. | Tetragram sequence | OI# | z-value in OI | z-value in backbone | Restriction enzymes (E. coli) | Restriction enzymes (other bacteria) |
|---|---|---|---|---|---|---|
| Overrepresented in an OI and underrepresented in the backbone | | | | | | |
| 1. | CGGC | 8 | 2.250 | −6.776 | Eco52I, Eco56I | FseI, NaeI, NotI, Sse232I, XmaIII, EagI, EclXI |
| 2. | TGGA | 28 | 7.395 | −10.732 | | BpmI |
| 3. | CGTG | 84 | 2.654 | −23.930 | Eco72I | PmaCI |
| 4. | GGCT/AGCC | 108 | 3.820 | −13.573 | | TaqII |
| 5. | GTGT/ACAC | 115 | 1.888 | −10.799 | | RLeAI |
| 6. | GCAC | 148 | 2.760 | −15.092 | | ApaLI |
| 7. | GTAC | 30 | 3.042 | −11.083 | Eco255I | ScaI, Bsp1407I |
| 8. | ATGC | 51 | 4.085 | −6.797 | EcoT22I | SphI, AvaIII |
| 9. | GAGC | 138 | 2.169 | −17.301 | Eco53kI, EcoICRI | SacI, Ecl136II |
| Underrepresented in an OI and overrepresented in the backbone | | | | | | |
| 10. | CTTC/GAAG | 51 | −2.418 | 52.641 | Eco57I | Eam1104I, EarI |
| 11. | GTAA | 30 | −2.351 | 8.637 | | PI-SceI |
| 12. | TTAT | 138 | −2.642 | 13.506 | | PsiI |
| 13. | ACTA | 8 | −2.912 | 14.456 | | SpeI |
| 14. | TCAT | 28 | −4.408 | 10.197 | | BspHI |
| 15. | TATC | 84 | −1.125 | 22.921 | EcoRV, Eco32I | BfuI |
| 16. | ATCC | 84 | −2.985 | 1.821 | | BamHI |
| 17. | GGCA | 108 | −1.404 | 38.056 | | PI-SceI |
| 18. | AGAG/CTCT | 115 | −1.846 | 22.564 | Eco31I, EcoA4I, EcoO44I | Esp3I |
| 19. | GAGA | 148 | −2.599 | 20.247 | | Eam1104I, EarI |
| Uniquely overrepresented in one OI and underrepresented in the backbone | | | | | | |
| 20. | ACAC | 28 | 4.204 | −11.712 | | TaqII |
| 21. | CGGC | 8 | 2.250 | −6.776 | Eco52I, Eco56I | FseI, NaeI, NotI, Sse232I, XmaIII, EagI, EclXI |
| 22. | ATCT | 28 | 2.186 | −16.522 | | BglII |
| 23. | CAAA | 30 | 1.795 | −5.137 | (not found in DB) | |
| 24. | GCAC | 148 | 2.760 | −15.092 | | ApaLI |
| 25. | TTAG | 148 | 1.904 | −17.928 | (not found in DB) | |
| 26. | GTTT | 138 | 1.172 | −11.686 | | PmeI |
| Uniquely underrepresented in one OI and overrepresented in backbone | | | | | | |
| 27. | GTAA | 30 | −2.351 | 8.637 | | PI-SceI |
| 28. | CTTC/GAAG | 51 | −2.418 | 52.641 | Eco57I | Eam1104I, EarI |
| 29. | ACCA | 138 | −1.206 | 2.861 | (not found in DB) | |
| 30. | ACTA | 8 | −2.912 | 14.456 | | SpeI |
| 31. | CCGG/CCGG | 28 | −3.418 | 21.776 | Eco56I, Eco52I, EcoHK31I | EclRI, SrfI, Sse232I, AgeI, BetI, BspMII, Cfr10I, EaeI, EaeAI, FseI, HpaII, NaeI, SgrAI, SmaI, EciIXI |
| 32. | GGTA | 84 | −1.137 | 19.760 | | KpnI |
| 33. | GGCA | 108 | −1.404 | 38.056 | | PI-SceI |
| 34. | AGAG/CTCT | 115 | −1.846 | 22.564 | Eco31I, EcoA4I, EcoO44I | Esp3I |
| 35. | GGAG | 148 | −1.627 | 6.808 | | BpmI, BseRI |
The first part of this table lists tetragrams which are among the most overrepresented in an OI (high z-value) while being among the most underrepresented in the backbone (low z-value); the second part lists tetragrams which are among the most underrepresented in an OI (low z-value) while being among the most overrepresented in the backbone (high z-value); the third part lists some of the tetragrams which are uniquely overrepresented in an OI and underrepresented in the backbone, while the fourth part lists some of the tetragrams which are uniquely underrepresented in an OI and overrepresented in the backbone. As expected, the first and the third part of the table have tetragrams in common, as do the second and the fourth part.
Restriction enzyme recognition sequences for some E. coli strains, as well as for other bacteria, are given when known. Restriction enzyme data are taken from Genscript.com [70].
A tetragram with its reverse complement is given in pair (xxxx/yyyy) when the tetragram (xxxx) corresponds to the “OI#” and “z-value in OI” data, while its complement (yyyy) corresponds to the “Restriction enzymes” data.
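The selection behind the first two parts of the table can be sketched in Python as a filter on two z-value dictionaries. The threshold of ±1 is taken from the Fig. 4 caption (boundary of under-overrepresentation); the function name and the dictionary layout are assumptions of this sketch, and the table itself shows only the most extreme tetragrams rather than every one that passes such a filter.

```python
def contrasting_tetragrams(z_oi, z_backbone, threshold=1.0):
    """Split tetragrams into (over in OI / under in backbone, under in OI / over in backbone)."""
    over_in_oi = sorted(
        w for w in z_oi
        if z_oi[w] > threshold and z_backbone.get(w, 0.0) < -threshold
    )
    under_in_oi = sorted(
        w for w in z_oi
        if z_oi[w] < -threshold and z_backbone.get(w, 0.0) > threshold
    )
    return over_in_oi, under_in_oi

# z-values for OI #8 as listed in the table; the AAAA entry is a made-up
# neutral example added only to show a tetragram that is filtered out.
z_oi_8 = {"CGGC": 2.250, "ACTA": -2.912, "AAAA": 0.2}
z_backbone = {"CGGC": -6.776, "ACTA": 14.456, "AAAA": 0.1}
over, under = contrasting_tetragrams(z_oi_8, z_backbone)
```

With these inputs `over` contains CGGC and `under` contains ACTA, in line with rows 1 and 13 of the table.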
Fig. 2. Comparative Zipf analysis. The 20 topmost overrepresented and underrepresented tetragrams in the complete E. coli EDL933 genome (subfigures (1), (4)), OI #28 (subfigures (2), (5)) and OI #51 (subfigures (3), (6)), with the corresponding z-values for the biased (according to the maximal order Markov model) tetragram frequency. The red line marked with triangles (▴) represents z-values of tetragram frequencies in descending order for the chosen sequence (complete genome, OI), while the blue line marked with circles (●) represents z-values of the corresponding tetragram frequencies in the backbone, and all the other lines represent z-values of the corresponding tetragram frequencies in other sequences. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)
Fig. 3. Comparative Zipf analysis. All the tetragrams (x-axis) are sorted according to descending z-value of the z statistic (y-axis) in the complete genome sequence (thick line), and z-values for the corresponding tetragrams in all the other sequences are presented. Although the most overrepresented tetragrams may deviate highly among the sequences (left part of the figure), the most underrepresented ones tend to coincide (right part of the figure).
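The z statistic and the descending ordering used in the Zipf plots can be sketched in Python. The excerpt does not give the formula, so this sketch assumes a commonly used maximal order (order 2) Markov approximation: the expected count of w1w2w3w4 is N(w1w2w3)·N(w2w3w4)/N(w2w3), with variance E·(N(w2w3) − N(w1w2w3))·(N(w2w3) − N(w2w3w4))/N(w2w3)². The authors' exact statistic may differ, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def count_words(seq, k):
    """Counts of all overlapping k-letter words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def tetragram_z(seq):
    """z-value of every tetragram under an order-2 Markov model (assumed formula)."""
    seq = seq.upper()
    c2, c3, c4 = count_words(seq, 2), count_words(seq, 3), count_words(seq, 4)
    z = {}
    for letters in product("ACGT", repeat=4):
        w = "".join(letters)
        left, right, mid = c3[w[:3]], c3[w[1:]], c2[w[1:3]]
        if mid == 0:
            z[w] = 0.0  # no information about this word
            continue
        expected = left * right / mid
        variance = expected * (mid - left) * (mid - right) / mid ** 2
        z[w] = (c4[w] - expected) / math.sqrt(variance) if variance > 0 else 0.0
    return z

def zipf_order(z):
    """Tetragrams sorted by descending z-value; index + 1 is the rank."""
    return sorted(z, key=z.get, reverse=True)
```

Plotting `z[w]` for `w in zipf_order(z_genome)` gives the reference curve; looking up the same tetragrams, in the same order, in the z dictionaries of the other sequences gives the remaining curves.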
O-islands signature (overrepresented tetragrams).
| Tetragram | OI# | z-value (O-island) | Rank (O-island) | z-value (backbone) | Rank (backbone) |
|---|---|---|---|---|---|
| ATAC | 8 | 1.361 | 34 | −25.463 | 228 |
| CCTA | 8 | 1.079 | 46 | −25.620 | 229 |
| CGGC | 8 | 2.250 | 11 | −6.776 | 152 |
| ACAC | 28 | 4.204 | 40 | −11.712 | 178 |
| ATCT | 28 | 2.186 | 71 | −16.522 | 197 |
| TGGA | 28 | 7.395 | 13 | −10.732 | 169 |
| TTAA | 28 | 3.246 | 52 | −10.080 | 168 |
| AAAC | 30 | 1.228 | 62 | −11.112 | 176 |
| CAAA | 30 | 1.795 | 40 | −5.137 | 147 |
| GCGG | 30 | 1.982 | 32 | −2.159 | 134 |
| ACTG | 51 | 1.473 | 45 | −12.218 | 180 |
| CGCA | 51 | 1.017 | 71 | −26.859 | 233 |
| TGTA | 84 | 2.157 | 14 | −15.204 | 193 |
| GGTT | 115 | 1.048 | 73 | −33.843 | 244 |
| GTGT | 115 | 1.888 | 30 | −10.799 | 171 |
| GTTT | 138 | 1.172 | 71 | −11.686 | 177 |
| GCAC | 148 | 2.760 | 31 | −15.092 | 192 |
| TTAG | 148 | 1.904 | 54 | −17.928 | 207 |
All the tetragrams that are uniquely overrepresented in one of the 5DC class OIs and underrepresented in the backbone, along with the z-value and rank of the tetragram in both sequences, are listed.
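One possible reading of "uniquely overrepresented" is sketched below in Python: a tetragram qualifies if its z-value exceeds the threshold in exactly one OI of the class and falls below the negative threshold in the backbone. The threshold of ±1, the function name and the data layout are assumptions of this sketch, not a statement of the authors' procedure.

```python
def unique_overrepresented(z_by_oi, z_backbone, threshold=1.0):
    """Map each signature tetragram to the single OI in which it is overrepresented."""
    signature = {}
    for word, z_back in z_backbone.items():
        if z_back >= -threshold:
            continue  # not underrepresented in the backbone
        hits = [oi for oi, z in z_by_oi.items() if z.get(word, 0.0) > threshold]
        if len(hits) == 1:
            signature[word] = hits[0]
    return signature

# CGGC in OI #8 and GCAC in OI #148 use z-values from the table; every other
# number below is hypothetical and only illustrates the filtering.
z_by_oi = {
    8: {"CGGC": 2.250, "GCAC": 0.3, "TTTT": 1.5},
    148: {"CGGC": -0.2, "GCAC": 2.760, "TTTT": 1.7},
}
z_backbone = {"CGGC": -6.776, "GCAC": -15.092, "TTTT": -3.0}
signature = unique_overrepresented(z_by_oi, z_backbone)
```

Here TTTT is excluded because it is overrepresented in both islands. The mirrored case (uniquely underrepresented in one OI, overrepresented in the backbone) follows by flipping the inequality signs.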
O-islands signature (underrepresented tetragrams).
| Tetragram | OI# | z-value (O-island) | Rank (O-island) | z-value (backbone) | Rank (backbone) |
|---|---|---|---|---|---|
| ACTA | 8 | −2.912 | 256 | 14.456 | 68 |
| AACA | 8 | −1.758 | 233 | 29.532 | 19 |
| CGGG | 8 | −1.997 | 246 | 10.116 | 83 |
| GTCA | 8 | −1.797 | 236 | 13.755 | 71 |
| TCTT | 8 | −1.150 | 213 | 22.012 | 40 |
| TGCC | 8 | −1.387 | 221 | 38.883 | 11 |
| AGCA | 28 | −1.521 | 167 | 25.525 | 31 |
| CCGG | 28 | −3.418 | 210 | 21.776 | 43 |
| GTGG | 28 | −1.611 | 172 | 7.747 | 95 |
| GTTA | 28 | −1.964 | 181 | 7.667 | 96 |
| TAGT | 28 | −2.350 | 193 | 14.291 | 69 |
| TCCT | 28 | −1.840 | 178 | 7.251 | 100 |
| TGGC | 28 | −2.732 | 198 | 18.462 | 56 |
| CTAC | 30 | −1.138 | 199 | 34.370 | 18 |
| GTAA | 30 | −2.351 | 232 | 8.637 | 89 |
| TAAC | 30 | −1.045 | 192 | 8.668 | 8 |
| TTAC | 30 | −1.801 | 219 | 10.524 | 80 |
| CCCT | 51 | −1.080 | 194 | 5.075 | 108 |
| CTTC | 51 | −2.418 | 238 | 52.641 | 4 |
| GGGA | 51 | −1.206 | 199 | 16.351 | 65 |
| CACT | 84 | −1.699 | 231 | 8.861 | 87 |
| GGTA | 84 | −1.137 | 210 | 19.760 | 49 |
| TATC | 84 | −1.125 | 209 | 22.921 | 35 |
| GGCA | 108 | −1.404 | 208 | 38.056 | 13 |
| TCAC | 108 | −1.229 | 201 | 17.896 | 58 |
| TCCC | 108 | −1.470 | 213 | 13.943 | 70 |
| AGAG | 115 | −1.846 | 233 | 22.564 | 36 |
| GGTG | 115 | −1.051 | 192 | 38.467 | 12 |
| ACCA | 138 | −1.206 | 187 | 2.861 | 118 |
| GGAG | 148 | −1.627 | 193 | 6.808 | 103 |
All the tetragrams that are uniquely underrepresented in one of the 5DC class OIs and overrepresented in the backbone, along with the z-value and rank of the tetragram in both sequences, are listed.
Fig. 4. z values in 10 kb windows for tetragrams overrepresented in the backbone of the E. coli EDL933 and underrepresented in 5DC OIs. z-Plots of the most underrepresented tetragrams in each of the 5DC OIs, which are overrepresented in the backbone, are presented (ACTA is the most underrepresented in the OI #8 among the tetragrams overrepresented in the backbone, and similarly TCAT in the OI #28, GTAA in the OI #30, CTTC in the OI #51, ATCC in the OI #84, CGTT in the OI #108, ATGT in the OI #115, TTAT in the OI #138, GAGA in the OI #148). Tetragrams are ordered upward by the corresponding OIs ordering (first the tetragram for the OI #8, followed by tetragrams for OI #28, #30, #51, etc.). Narrow vertical rectangles delimit each of the OIs, and the color of each of them is the same as that of the z-plot of the tetragram most underrepresented in it. Horizontal lines represent z values of −1 and +1 (boundary of under-overrepresentation).
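A windowed z-plot of this kind could be produced along the following lines. This Python sketch assumes non-overlapping 10 kb windows and the same order-2 Markov approximation of the z statistic as is common for tetragram analysis; neither the window step nor the exact statistic is specified in the caption, and all function names are hypothetical.

```python
import math

def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    return sum(1 for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i))

def word_z(seq, word):
    """z-value of one tetragram in seq under an order-2 Markov model (assumed formula)."""
    observed = count_overlapping(seq, word)
    left = count_overlapping(seq, word[:-1])
    right = count_overlapping(seq, word[1:])
    mid = count_overlapping(seq, word[1:-1])
    if mid == 0:
        return 0.0
    expected = left * right / mid
    variance = expected * (mid - left) * (mid - right) / mid ** 2
    return (observed - expected) / math.sqrt(variance) if variance > 0 else 0.0

def window_z(seq, word, size=10_000, step=10_000):
    """List of (window start, z-value) pairs; step == size gives non-overlapping windows."""
    seq = seq.upper()
    return [
        (start, word_z(seq[start:start + size], word))
        for start in range(0, len(seq) - size + 1, step)
    ]
```

Windows whose z-value for a given tetragram drops below −1 while the backbone value is above +1 would then be candidates for the corresponding OI.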