Hong-Yu Ou, Rebecca Smith, Sacha Lucchini, Jay Hinton, Roy R Chaudhuri, Mark Pallen, Michael R Barer, Kumar Rajakumar.
Abstract
ArrayOme is a new program that calculates the size of genomes represented by microarray-based probes and facilitates recognition of key bacterial strains carrying large numbers of novel genes. Protein-coding sequences (CDS) that are contiguous on annotated reference templates and classified as 'Present' in the test strain by hybridization to microarrays are merged into inferred contigs (ICs). These ICs are then extended to account for flanking intergenic sequences. Finally, the lengths of all extended ICs are summed to yield the 'microarray-visualized genome (MVG)' size. We tested and validated ArrayOme using both experimental and in silico-generated genomic hybridization data. MVG sizing of five sequenced Escherichia coli and Shigella strains gave estimates within 97-99% of the true genome sizes when the comprehensive ShE.coli meta-array gene sequences (6239 CDS) were used for in silico hybridization analysis. However, the E.coli CFT073 genome size was underestimated by 14% because this meta-array lacked probes for many CFT073 CDS. ArrayOme permits rapid recognition of discordances between PFGE-measured genome sizes and MVG sizes, thereby enabling high-throughput identification of strains rich in novel genes. Gene discovery studies focused on these strains will greatly facilitate characterization of the global gene pool accessible to individual bacterial species.
Year: 2005 PMID: 15640440 PMCID: PMC546176 DOI: 10.1093/nar/gni005
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Microarray-Assisted mobilome Prospecting (MAmP): a method for determining the discrepancy between the physical genome size and that accounted for by known genes represented on an expanded species-specific microarray. (a) A schematic representation of a microarray-based CGI output for a hypothetical region of the genome in a strain under investigation by the MAmP technique. Scanned raw data are normalized, permitting classification of microarray-represented genes as ‘Present (+)’ or ‘Absent (−)’ in individual test strains. (b) The arrows represent the genetic organization of these CDS within E.coli K-12 MG1655, E.coli K-12 W3110, E.coli O157 EDL933, S.flexneri 2a Sf301 or other virulence-associated gene clusters included on the MG1655, W3110 or ShE.coli microarrays. (c) Contiguous CDS classified as ‘Present’ are merged into an IC, and the intergenic non-coding segments between the contiguous CDS are included in the corresponding IC. Each IC is then extended in both directions by lengths equal to half the flanking 5′ and 3′ intergenic segments, as indicated by the double-headed arrows of lengths 0.5x and 0.5y in the examples shown. When the ShE.coli meta-array sequences were used, each probe was mapped onto a single source reference template to allow for the generation of specific ICs. (d) The size of the MVG is calculated as the sum of all IC lengths. Consequently, the size of the non-microarray-borne novel mobilome is estimated as the discrepancy between the physical genome size determined by pulsed-field gel electrophoresis and the MVG size. Figure not drawn to scale.
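The IC merging, half-gap extension and summation described in panels (c) and (d) can be illustrated with a minimal Python sketch. This is not the ArrayOme source code: the `CDS` record, the `mvg_size` function and the coordinates are hypothetical, a single linear reference template is assumed, and no extension is applied at the ends of the template.

```python
from dataclasses import dataclass


@dataclass
class CDS:
    start: int      # first base of the CDS on the reference template (1-based)
    end: int        # last base of the CDS (inclusive)
    present: bool   # True if the CDS was classified as 'Present'


def mvg_size(cds_list):
    """Sum the lengths of extended ICs on one linear reference template."""
    cds = sorted(cds_list, key=lambda c: c.start)
    total = 0.0
    i, n = 0, len(cds)
    while i < n:
        if not cds[i].present:
            i += 1
            continue
        # Grow the IC over the run of contiguous 'Present' CDS.
        j = i
        while j + 1 < n and cds[j + 1].present:
            j += 1
        ic_start = cds[i].start
        ic_end = max(c.end for c in cds[i:j + 1])
        length = ic_end - ic_start + 1  # includes internal intergenic segments
        # Extend by half of each flanking intergenic segment (0.5x and 0.5y).
        if i > 0:
            left_gap = max(0, ic_start - cds[i - 1].end - 1)
            length += 0.5 * left_gap
        if j + 1 < n:
            right_gap = max(0, cds[j + 1].start - ic_end - 1)
            length += 0.5 * right_gap
        total += length
        i = j + 1
    return total


# Hypothetical region: two 'Present' CDS, one 'Absent' CDS, one 'Present' CDS.
example = [
    CDS(1, 100, True),
    CDS(151, 250, True),
    CDS(301, 400, False),
    CDS(501, 600, True),
]
print(mvg_size(example))  # 425.0
```

In this toy region the first IC spans bases 1 to 250 and gains half of the 50 bp gap on its right (275 bp), while the second IC spans 501 to 600 and gains half of the 100 bp gap on its left (150 bp). Under the caption's definition, the novel mobilome estimate would then be the PFGE-determined genome size minus this sum.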
Figure 2. A flow diagram of the logic steps used for in silico CGI. CDS, P, H and S denote CDS mapped onto specific probes, individual microarray-borne amplicon probes, H-values and Bit scores, respectively. P_i and P_j denote distinct individual probes, with the subscripts i and j identifying matching CDS, H-values and Bit scores. The steps involved in the primary similarity search are shown in (a), while the cross-hybridization algorithm that re-categorizes CDS from ‘Putatively Present’ to ‘Present’ or ‘Absent’ is shown in (b). A direct comparison of in silico-generated data and true experimental CGI data for E.coli EDL933 obtained using the MG1655 microarray was used to validate the algorithm. The key threshold values, H0 = 0.40 and H = 0.96, were set to optimize the sensitivity and specificity of identifying genes as ‘Present’ when using the in silico as compared to the experimental approach (see Figure 3).
Figure 3. Distribution of H-values for the 4264 amplicon probes spotted onto the MG1655 microarray, obtained from a BLASTN similarity search against the EDL933 chromosomal sequence. Interval-grouped H-values are plotted with the data stratified into the experimental CGI categories of ‘Present’ (3775), ‘Absent’ (400) and ‘Indeterminate’ (89). The selected threshold values for in silico CGI, H0 = 0.40 and H = 0.96, are indicated. The numbers in the boxes at the top right corner correspond to the ‘Number of CDS’ associated with the two bars that extend beyond the limits of the graph. The inset table shows a direct comparison of experimental CGI data derived by Anjum et al. (18) and in silico CGI data for the CDS classified as ‘Present’ or ‘Absent’ only. The sensitivity (S) and specificity (S) of identifying genes as ‘Present’ when using the in silico as compared to the experimental approach are shown on the right-hand side.
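For readers unfamiliar with the two measures, the following Python sketch shows one conventional way to compute the sensitivity and specificity of ‘Present’ calls when the experimental CGI result is taken as the reference. The function name and the toy labels are hypothetical and are not the data behind the inset table. CDS that are ‘Indeterminate’ in the experimental data are skipped, mirroring the caption's restriction to ‘Present’ and ‘Absent’ CDS.

```python
def sensitivity_specificity(experimental, in_silico):
    """Return (sensitivity, specificity) of in silico 'Present' calls.

    Both arguments are equal-length sequences of labels. Entries whose
    experimental label is neither 'Present' nor 'Absent' are ignored.
    Raises ZeroDivisionError if one reference class is empty.
    """
    tp = fn = tn = fp = 0
    for exp, pred in zip(experimental, in_silico):
        if exp == "Present":
            if pred == "Present":
                tp += 1   # experimentally present, predicted present
            else:
                fn += 1   # experimentally present, missed in silico
        elif exp == "Absent":
            if pred == "Present":
                fp += 1   # experimentally absent, predicted present
            else:
                tn += 1   # experimentally absent, predicted absent
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for seven probes (not taken from the paper).
exp_calls = ["Present", "Present", "Present", "Present",
             "Absent", "Absent", "Indeterminate"]
sim_calls = ["Present", "Present", "Present", "Absent",
             "Absent", "Present", "Present"]
print(sensitivity_specificity(exp_calls, sim_calls))  # (0.75, 0.5)
```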
The ArrayOme-predicted MVG sizes of E.coli and Shigella strains based on data derived by in silico hybridization of the complete genomes against the ShE.coli microarray amplicon probe sequences
| Strain | Size of the complete chromosome (kb) (a) | MG1655 CDS ‘Present’ (b) | EDL933-specific CDS ‘Present’ (b) | Sf301-specific CDS ‘Present’ (b) | Other chromosomal CDS ‘Present’ (b) | Plasmid-borne CDS ‘Present’ (b, c) | Size of the MVG (kb) (c) | Discrepancy between the chromosome length and size of the MVG (kb) (% error) (d) |
|---|---|---|---|---|---|---|---|---|
| EDL933 | 5528 | 3783 | 1097 | 162 | 50 | 5 | 5566 [4192] | +38 (+0.7) |
| MG1655 | 4639 | 4288 | 58 | 59 | 17 | 8 | 4771 [4639] | +132 (+2.8) |
| Sakai | 5498 | 3780 | 1083 | 164 | 50 | 5 | 5556 [4191] | +58 (+1.1) |
| Sf301 | 4607 | 3541 | 122 | 515 | 16 | 14 | 4519 [3882] | −88 (−1.9) |
| 2457T | 4599 | 3534 | 123 | 499 | 17 | 14 | 4507 [3882] | −92 (−2.0) |
| CFT073 | 5231 | 3638 | 278 | 174 | 26 | 9 | 4498 [4013] | −733 (−14.0) |
(a) The lengths of the complete chromosomes, shown to the nearest kilobase (kb), are based on genome sequences lodged with GenBank.
(b) A total of 4264 E.coli K-12 MG1655 CDS, 1101 E.coli EDL933 CDS, 516 S.flexneri 2a Sf301 CDS, 132 other E.coli chromosomal virulence-associated CDS and 226 plasmid-borne E.coli virulence genes are classified as ‘Present’ or ‘Absent’ based on in silico CGI analysis. The 25 MG1655 CDS not represented on the microarray were assigned based on logic rules described in the text, with any remaining ‘Indeterminate’ CDS classed as ‘Present’ for simplicity, leading to the classification of all 4289 MG1655 CDS.
(c) The data corresponding to plasmid-borne genes were omitted when calculating MVG sizes, as these CDS would normally be considered to be borne on episomal entities other than the main chromosome given their original identified location. The ArrayOme-predicted sizes of ShE.coli-MVGs (left) and MG1655-MVGs (right, square brackets) are shown to the nearest kilobase (kb).
(d) The percentage errors between the reported lengths of the chromosomes and the sizes of the ShE.coli-MVGs are shown within parentheses.
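The discrepancy column can be rechecked from the other two size columns. The short Python sketch below is an illustration rather than part of ArrayOme: it assumes the percentage error is (MVG − chromosome) / chromosome × 100, rounded to one decimal place, and uses the chromosome and ShE.coli-MVG sizes (kb) from the six table rows in order. The names `TABLE_ROWS` and `discrepancy` are hypothetical.

```python
# (chromosome size in kb, ShE.coli-MVG size in kb), one tuple per table row.
TABLE_ROWS = [
    (5528, 5566),
    (4639, 4771),
    (5498, 5556),
    (4607, 4519),
    (4599, 4507),
    (5231, 4498),
]


def discrepancy(chromosome_kb, mvg_kb):
    """Return (difference in kb, percentage error relative to the chromosome)."""
    diff = mvg_kb - chromosome_kb
    return diff, round(100 * diff / chromosome_kb, 1)


for chrom, mvg in TABLE_ROWS:
    diff, pct = discrepancy(chrom, mvg)
    print(f"{chrom} kb chromosome, {mvg} kb MVG: {diff:+d} kb ({pct:+.1f}%)")
```

Under that assumption the recomputed values agree with the tabulated ones, including the −733 kb (−14.0%) underestimate in the last row.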
The MVG sizes of ‘virtual genomes’ constructed by precise deletion of Islander-defined GIs
| Virtual strain designation | No. of GIs deleted (a) | Total length of deletion (bp) | Size of the derivative virtual genome (kb) | MG1655 CDS ‘Present’ (b) | EDL933-specific CDS ‘Present’ (b) | Sf301-specific CDS ‘Present’ (b) | Other chromosomal CDS ‘Present’ (b) | Plasmid-borne CDS ‘Present’ (b, c) | Size of the MVG (kb) (c) | Discrepancy between the sizes of genome and MVG (kb) (% error) (d) |
|---|---|---|---|---|---|---|---|---|---|---|
| EDL933-V9 | 9 (398) | 321 129 | 5207 | 3776 | 905 | 144 | 48 | 3 | 5351 | +144 (+2.8) |
| MG1655-V3 | 3 (81) | 65 825 | 4573 | 4219 | 53 | 55 | 17 | 5 | 4705 | +132 (+2.9) |
| Sakai-V8 | 8 (327) | 235 511 | 5263 | 3774 | 888 | 142 | 48 | 3 | 5333 | +70 (+1.3) |
| Sf301-V3 | 3 (70) | 75 968 | 4531 | 3538 | 112 | 469 | 16 | 11 | 4445 | −86 (−1.9) |
| 2457T-V4 | 4 (68) | 82 476 | 4517 | 3527 | 113 | 448 | 17 | 11 | 4424 | −93 (−2.1) |
| CFT073-V6 | 6 (435) | 395 611 | 4836 | 3606 | 214 | 138 | 20 | 6 | 4357 | −479 (−9.9) |
The ArrayOme-facilitated MVG size prediction was based on data derived by in silico CGI hybridization of the complete genomes against the ShE.coli microarray amplicon probe sequences.
(a) The GIs that were deleted from the corresponding complete genomes were based on the Islander-derived data of Mantri and Williams (36). The details of these GIs, along with their precise boundaries, were downloaded from the Islander database (36). The total numbers of CDS contained within the deleted GIs are shown within parentheses.
(b) A total of 4264 E.coli K-12 MG1655 CDS, 1101 E.coli EDL933 CDS, 516 S.flexneri 2a Sf301 CDS, 132 other E.coli chromosomal virulence-associated CDS and 226 plasmid-borne E.coli virulence genes are classified as ‘Present’ or ‘Absent’ based on in silico CGI analysis. The 25 MG1655 CDS not represented on the microarray were assigned based on logic rules described in the text, with any remaining ‘Indeterminate’ CDS classed as ‘Present’ for simplicity.
(c) The data corresponding to plasmid-borne genes were omitted when calculating MVG sizes, as these CDS would normally be considered to be borne on episomal entities other than the main chromosome given their original identified location. ArrayOme-predicted MVGs are shown to the nearest kilobase (kb).
(d) The percentage errors between the lengths of virtual chromosomes and the sizes of the ShE.coli-MVGs are shown within parentheses.
ArrayOme-predicted MVG sizes of three sequenced E.coli and Shigella genomes based on in silico CGI data derived using the E.coli K-12 MG1655 microarray
| Strain | Size of chromosome (bp) | Total size of non-MG1655 sequences (bp) (a) | No. of CDS classified as ‘Present’ (b) | Size of the MVG (kb) | Size of the non-MG1655 mobilome (kb) (% error) (c) |
|---|---|---|---|---|---|
| Sakai | 5 498 450 | 1 393 070 | 3822 | 4191 | 1307 (−6.2) |
| CFT073 | 5 231 428 | 1 306 391 | 3662 | 4013 | 1218 (−6.7) |
| Sf301 | 4 607 203 | ∼700 000 | 3553 | 3882 | ∼725 (∼+3.6) |
(a) The total lengths of strain-specific sequences with respect to the E.coli K-12 MG1655 genome were reported by the authors of the published sequences. The extent of non-MG1655 sequences present in S.flexneri Sf301 was reported by Jin et al. (4) as ∼0.7 Mb.
(b) The 4264 annotated E.coli K-12 MG1655 CDS represented on the microarray were classified as ‘Present’ or ‘Absent’ based on the in silico CGI analysis described in the text. The 25 MG1655 genes not represented on the microarray were assigned based on logic rules described in the text, with any remaining ‘Indeterminate’ CDS classed as ‘Present’ for simplicity.
(c) The percentage errors between the reported lengths of the non-MG1655 sequences and the sizes of the non-MG1655 mobilomes as determined using ArrayOme are shown within parentheses.
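The mobilome column follows from the two size columns as chromosome size minus MVG size. A minimal Python sketch of that arithmetic is given below, assuming the percentage error is taken relative to the reported non-MG1655 sequence length. The names are hypothetical, the third row uses the approximate ∼700 000 bp figure, and because the tabulated MVG sizes are already rounded to the nearest kb, the recomputed percentages can differ from the tabulated ones by about 0.1.

```python
# (chromosome bp, reported non-MG1655 bp, MVG kb, tabulated % error) per row.
ROWS = [
    (5_498_450, 1_393_070, 4191, -6.2),
    (5_231_428, 1_306_391, 4013, -6.7),
    (4_607_203, 700_000, 3882, 3.6),   # reported size is approximate (~0.7 Mb)
]


def mobilome_estimate(chromosome_bp, reported_bp, mvg_kb):
    """Return (mobilome size in kb, % error versus the reported size)."""
    mobilome_kb = chromosome_bp / 1000 - mvg_kb
    reported_kb = reported_bp / 1000
    pct_error = 100 * (mobilome_kb - reported_kb) / reported_kb
    return round(mobilome_kb), round(pct_error, 1)


for chrom_bp, reported_bp, mvg_kb, tabulated in ROWS:
    size_kb, pct = mobilome_estimate(chrom_bp, reported_bp, mvg_kb)
    print(f"mobilome {size_kb} kb, error {pct:+.1f}% (tabulated {tabulated:+.1f}%)")
```

The recomputed mobilome sizes round to the tabulated 1307, 1218 and 725 kb.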