Oleg N Reva, Burkhard Tümmler.
Abstract
BACKGROUND: Complete sequencing of bacterial genomes has become a common technique in present-day microbiology. Data mining of the complete sequence is then an essential step. New in silico methods are needed that rapidly identify the major features of genome organization and facilitate prediction of the functional class of ORFs. We tested the usefulness of local oligonucleotide usage (OU) patterns for recognizing and differentiating types of atypical oligonucleotide composition in DNA sequences of bacterial genomes.
Year: 2005 PMID: 16225667 PMCID: PMC1274298 DOI: 10.1186/1471-2105-6-251
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. OUV of different heptanucleotide usage patterns, from n0_7 mer to n6_7 mer, determined for complete bacterial genomes.
Figure 2. Distances D between local n0_4 mer patterns and the global n0_4 mer patterns in the A). Local patterns were calculated for sequence fragments of 8 kbp with a sliding-window step of 2 kbp. The 90% confidence interval of D values is depicted by horizontal lines. Loci with D values exceeding the genomic confidence interval are considered gene islands. The abscissa indicates the coordinates of the bacterial chromosomes as published in the NCBI database [27].
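The windowed comparison described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the caption does not give the formula for D, so `pattern_distance` uses half the summed absolute difference of tetranucleotide frequencies as a stand-in, and all function names are hypothetical. The defaults of 8 kbp fragments and 2 kbp steps are taken from the caption, and the sequence is assumed to be upper-case.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Relative frequency of every k-mer over ACGT in seq (overlapping counts)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Words containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[w] for w in words)
    return {w: (counts[w] / total if total else 0.0) for w in words}


def pattern_distance(local, reference):
    """Stand-in for D (an assumption): half the summed absolute difference."""
    return 0.5 * sum(abs(local[w] - reference[w]) for w in reference)


def sliding_distances(genome, window=8000, step=2000, k=4):
    """Return (start, distance) for each window versus the whole sequence."""
    reference = kmer_frequencies(genome, k)
    results = []
    for start in range(0, len(genome) - window + 1, step):
        local = kmer_frequencies(genome[start:start + window], k)
        results.append((start, pattern_distance(local, reference)))
    return results
```

Windows whose distance falls outside a chosen genomic interval would then be the candidate atypical loci; how that interval is derived is left open here.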
Figure 3. Curves of D:n0_4 mer, PS:n0_4 mer and OUV:n1_4 mer in a locus of the . Local OU patterns were analyzed in 5 kbp sliding windows with steps of 1 kbp. Curves are specified by a color code: blue for D, green for PS and brown for OUV. Protein-coding genes are shown by red bars and genes for ribosomal RNAs are shown in black. The abscissa indicates the coordinates of the locus in the chromosome. The upper horizontal line shows the upper boundary of the 95% confidence interval of intragenomic deviation of D values. The lower horizontal line separates genes by their direction of transcription.
Figure 4. Dot-plot presentation of 8 kbp genomic fragments of A). Fragments of 8 kbp were generated with a sliding-window step of 2 kbp. Each dot represents the D:n0_4 mer, OUV:n1_4 mer and PS:n0_4 mer values of one fragment. The last of these parameters (PS) is shown by a color code, keyed by the bar at the right of the figure. The grey lines indicate the borders of the inner quartiles of values for the corresponding OU statistical parameters.
Genetic repertoire of loci characterized by atypical tetranucleotide usage patterns and extreme OUV (section III in Fig. 4) identified in bacterial chromosomes
| Genes or sequence features | Left coordinate* | Locus length (bp) | ΔD† | ΔOUV‡ |
| putative hemagglutinin/hemolysin-related protein | 923,008 | 11,136 | 3.11 | 4.13 |
| non-coding multiple repeats TTTAGAAA | 2,448,000 | 5,600 | 2.24 | 17.33 |
| BB1186: putative hemolysin | 1,268,967 | 10,041 | 5.13 | 4.12 |
| | 3,592,327 | 17,058 | 3.17 | 4.65 |
| | 3,930,196 | 10,326 | 6.23 | 5.02 |
| | 4,106,955 | 12,387 | 4.39 | 4.95 |
| | 6,017,600 | 12,633 | 5.04 | 6.16 |
| | 962,711 | 8,919 | 2.85 | 3.85 |
| | 2,541,750 | 9,069 | 2.88 | 5.42 |
| DR1461-1462: hypothetical proteins | 1,465,188 | 10,000 | 2.19 | 8.27 |
| non-coding tandem repeats CCCGCCC | 519,833 | 8,415 | 7.06 | 8.42 |
| Z0609, Z0615: RTX family exoproteins | 581,356 | 20,160 | 1.82 | 9.43 |
| Rv0272c-Rv0279c: hypothetical Gly-, Ala-rich proteins | 328,573 | 10,499 | 1.52 | 9.15 |
| Rv0297-Rv0304c: hypothetical Gly-, Ala-, Asn-rich proteins | 361,332 | 11,431 | 8.79 | 7.91 |
| Rv0355c: Asn-rich protein | 424,775 | 9,903 | 8.31 | 10.91 |
| Rv0573c-Rv0578c: hypothetical Gly-rich proteins | 665,849 | 10,066 | 0.60 | 4.72 |
| Rv0742-Rv0747: hypothetical Gly-rich proteins | 832,979 | 7,876 | 1.24 | 3.97 |
| Rv1060-Rv1068c: hypothetical Gly-, Ala-rich proteins | 1,183,506 | 8,641 | 1.04 | 5.54 |
| Rv1084-Rv1092c: hypothetical proteins | 1,207,634 | 11,395 | 2.19 | 6.44 |
| multiple repeats CCGCCGCCA | 1,630,636 | 7,592 | 2.33 | 8.84 |
| Rv2490c-Rv2494: hypothetical Gly-rich proteins | 2,801,252 | 7,482 | 2.60 | 5.50 |
| PA1874: hypothetical protein | 2,036,441 | 7,407 | 2.61 | 5.61 |
| PP0168: Thr-rich surface adhesion protein | 194,494 | 26,046 | 2.58 | 6.97 |
| PP0806: surface adhesion protein | 926,690 | 18,930 | 1.17 | 4.39 |
| PSPTO3229: filamentous hemagglutinin | 3,629,677 | 18,825 | 2.34 | 7.87 |
| RB3077: putative cyclic nucleotide binding protein | 1,588,083 | 18,024 | 1.62 | 6.19 |
| RB4375: large polymorphic membrane protein, probable extracellular nuclease | 2,242,933 | 9,171 | 3.23 | 7.09 |
| RB11769: probable aggregation factor core protein MAFp3 | 6,335,006 | 24,522 | 5.25 | 6.31 |
| conserved hypothetical protein | 1,459,664 | 9,891 | 2.61 | 3.38 |
| conserved hypothetical protein | 1,475,303 | 13,008 | 2.89 | 4.18 |
| non-coding tandem repeats GAATTGAAAG | 1,228,221 | 12,238 | 1.94 | 15.25 |
| | 1,253,000 | 5,000 | 1.50 | 8.67 |
| | 1,305,242 | 5,000 | 1.89 | 12.39 |
| | 1,437,928 | 20,142 | 4.04 | 10.07 |
| SA2447: similar to streptococcal hemagglutinin | 2,755,253 | 6,816 | 3.03 | 9.29 |
| SC8F4.01c: Ala/Glu-rich protein | 586,509 | 3,981 | 2.16 | 5.40 |
| SC2H4.02: hypothetical protein | 6,836,057 | 6,552 | 2.86 | 4.80 |
| | 2,374,740 | 11,886 | 3.22 | 6.61 |
| non-coding sequence, multiple | 1,183,606 | 11,095 | 1.31 | 9.81 |
| repeats (GGT)n | 1,447,312 | 11,139 | 1.37 | 10.91 |
| | 2,082,143 | 10,134 | 1.06 | 9.78 |
| | 2,501,956 | 10,374 | 1.41 | 11.79 |
| | 2,654,642 | 15,867 | 4.27 | 6.05 |
| | 3,747,888 | 11,133 | 2.66 | 8.60 |
| y3579: putative filamentous hemagglutinin | 3,961,333 | 9,888 | 3.31 | 4.32 |
* Left coordinate of the locus in the chromosomal sequence.
† Deviation of the D:n0_4 mer value calculated for the locus from the mean genomic D:n0_4 mer, expressed in standard deviations.
‡ Deviation of the OUV:n1_4 mer value calculated for the locus from the mean genomic OUV:n1_4 mer, expressed in standard deviations.
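The ΔD and ΔOUV columns are deviations from the genomic mean measured in standard deviations, which amounts to a z-score. A minimal sketch follows, assuming the genome-wide values come from the sliding-window fragments and that a population standard deviation is used. The source does not say which estimator was applied, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def deviation_in_sd(locus_value, genome_values):
    """Deviation of a locus value from the genomic mean, in standard deviations.

    Assumption: population standard deviation (pstdev); the source does not
    specify whether a sample or population estimator was used.
    """
    mu = mean(genome_values)
    sigma = pstdev(genome_values)
    if sigma == 0:
        raise ValueError("genome-wide values show no variation")
    return (locus_value - mu) / sigma
```

For example, with hypothetical genome-wide values `[2, 4, 4, 4, 5, 5, 7, 9]` (mean 5, standard deviation 2), a locus value of 8 gives a deviation of 1.5.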
Figure 5. Gene islands in the . Genome fragments of 8 kbp were generated with a sliding window in steps of 2 kbp. Red bars in panel A indicate protein-coding genes and black bars indicate hypothetical genes. The horizontal line in panel A separates genes by direction of transcription. The yellow-shaded 8 kbp fragment in A corresponds to the red dot indicated by an arrow in B.
Figure 6. Structural analysis of the complete sequence of the plasmid pKLC102 by local trinucleotide usage patterns. Local OU patterns were analyzed in 1.2 kbp sliding windows with steps of 0.2 kbp. The scale indicates the coordinates of the plasmid sequence and separates genes by their direction of transcription. Red bars depict protein-coding genes and black bars hypothetical genes. Grey bars along the D and OUV axes depict the 3-sigma ranges of fluctuation of D:n0_3 mer and OUV:n1_3 mer in a randomly generated sequence of the same length and mononucleotide content as pKLC102.
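The random baseline mentioned in the Figure 6 caption can be illustrated with a short sketch. It rests on assumptions: the random sequence is produced by shuffling the original, which preserves length and mononucleotide content exactly but may not be how the authors generated theirs, and `statistic` is a hypothetical callable standing in for D or OUV, whose formulas are not given here. Window and step default to the 1.2 kbp and 0.2 kbp stated in the caption.

```python
import random
from statistics import mean, pstdev


def shuffled_like(seq, rng):
    """Random sequence with the same length and base composition as seq."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def three_sigma_range(seq, statistic, window=1200, step=200, seed=0):
    """3-sigma range of a per-window statistic on one shuffled copy of seq.

    `statistic` is a placeholder (assumption) mapping a window string to a
    number; substitute the actual D or OUV computation.
    """
    if len(seq) < window:
        raise ValueError("sequence is shorter than one window")
    null_seq = shuffled_like(seq, random.Random(seed))
    values = [
        statistic(null_seq[start:start + window])
        for start in range(0, len(null_seq) - window + 1, step)
    ]
    mu, sigma = mean(values), pstdev(values)
    return mu - 3 * sigma, mu + 3 * sigma
```

Values observed in the real sequence that fall outside the returned range would correspond to fluctuations larger than those expected from base composition alone.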
Correlation coefficients between D, PS and OUV of n0_4 mer local patterns with those of the corresponding n1, n2 and n3 normalized patterns
| | n1_4 mer | n2_4 mer | n3_4 mer |
| plasmid pKLC102, window 5,000 bp, step 2,500 bp | | | |
| D:n0_4 mer | 0.85* | 0.82 | 0.40 |
| PS:n0_4 mer | 0.40 | 0.60 | 0.10 |
| OUV:n0_4 mer | 0.89 | 0.83 | 0.39 |
| 1 Mbp-2 Mbp locus of | | | |
| D:n0_4 mer | 0.94 | 0.84 | 0.63 |
| PS:n0_4 mer | 0.88 | 0.75 | 0.53 |
| OUV:n0_4 mer | 0.61 | 0.46 | 0.35 |
*Values in the cells of the table indicate the correlation coefficients between respective OU statistical parameters D, PS and OUV determined for n0 patterns and the normalized patterns n1, n2 and n3. For example, 0.85 is the correlation coefficient between series of values D:n0_4 mer and D:n1_4 mer determined for overlapping 5 kbp fragments of pKLC102.
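Each cell in the table is a correlation between two series of per-window values, such as D:n0_4 mer and D:n1_4 mer over the same fragments. The sketch below computes a Pearson coefficient. That choice is an assumption, since the source does not name the coefficient used, and the function name is hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations about the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)
```

Here `xs` would hold the n0 values and `ys` the values of one normalized pattern, with index i referring to the same window in both series.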