Ben Carter, Guanghui Wu, Martin J Woodward, Muna F Anjum.
Abstract
BACKGROUND: Microarray-based comparative genomic hybridisation (CGH) experiments have been used to study numerous biological problems, including genome plasticity in pathogenic bacteria. Such experiments typically produce large data sets that are difficult for biologists to handle. Although some programmes are available for interpreting bacterial transcriptomics data, and for using CGH microarray data to examine genetic stability in oncogenes, none is designed specifically to characterise the mosaic nature of bacterial genomes. Consequently, a bottleneck still persists in the accurate processing and mathematical analysis of these data. To address this shortfall, we have produced a simple and robust CGH microarray data analysis process, which may be automated in the future, for understanding bacterial genomic diversity.
Year: 2008 PMID: 18230148 PMCID: PMC2262894 DOI: 10.1186/1471-2164-9-53
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. The analysis process for CGH studies. The minimum stages required to take a CGH study from raw data to a validated output, with confidence in the robustness of the results, are outlined. Stages include data cleaning, normalisation and a decision on the presence or divergence of each gene on the array. The validation step provides a metric for assessing the process against sequenced data.
Figure 2. Distribution of the log. (a) Distribution of log2(Cy3/Cy5) data for MG1655. (b) Distribution of log2(Cy3/Cy5) data for EDL933. (c) Distribution of log2(Cy3/Cy5) data for Sakai.
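The log2(Cy3/Cy5) values plotted in Figure 2 and the naïve cut-off decision can be illustrated with a minimal sketch. Several things here are assumptions rather than details taken from the source: that Cy3 carries the test strain and Cy5 the reference, that a gene is called conserved when its ratio is at or above the cut-off, and the names `log2_ratio`, `call_genes` and the toy intensities, which are hypothetical.

```python
import math


def log2_ratio(cy3, cy5):
    """Return log2(Cy3/Cy5) for one spot; both intensities must be positive."""
    if cy3 <= 0 or cy5 <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(cy3 / cy5)


def call_genes(spots, cutoff=0.25):
    """Label each gene 'conserved' or 'variable' using a fixed ratio cut-off.

    Assumption (not stated in the source): a gene is conserved when
    Cy3/Cy5 >= cutoff, i.e. log2(Cy3/Cy5) >= log2(cutoff).
    """
    threshold = math.log2(cutoff)
    return {
        gene: "conserved" if log2_ratio(cy3, cy5) >= threshold else "variable"
        for gene, (cy3, cy5) in spots.items()
    }


# Hypothetical (Cy3, Cy5) intensities for three genes.
spots = {
    "geneA": (1800.0, 2000.0),  # ratio 0.90
    "geneB": (150.0, 2100.0),   # ratio ~0.07
    "geneC": (600.0, 2000.0),   # ratio 0.30
}
calls = call_genes(spots, cutoff=0.25)
```

Under these assumptions, lowering the cut-off can only move genes from "variable" to "conserved", which is consistent with the way TP and FP both rise as the cut-off falls in the comparison tables.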
Comparison of the six algorithms using the microarray hybridization data from the MG1655 sequenced strain. The results of each algorithm were compared to BLASTN data and are shown below.
| Algorithm | Setting | TP | FP | TN | FN | Sensitivity | Specificity | M-Score |
| Naïve cut-off | 0.50 | 4231 | 3 | 1452 | 58 | 98.65 | 99.78 | 98.90 |
| Naïve cut-off | 0.33 | 4244 | 6 | 1449 | 45 | 98.95 | 99.59 | 99.11 |
| Naïve cut-off | 0.25 | 4249 | 10 | 1445 | 40 | 99.07 | 99.31 | 99.12 |
| Naïve cut-off | 0.20 | 4253 | 20 | 1435 | 36 | 99.16 | 98.63 | 99.02 |
| Naïve cut-off | 0.10 | 4265 | 125 | 1330 | 24 | 99.44 | 91.40 | 97.41 |
| GENCOM* | | 4203 | 3 | 1452 | 86 | 97.99 | 99.79 | 98.45 |
| GACK** | EPP = 50 | 4116 | 1 | 1454 | 173 | 95.97 | 99.93 | 96.97 |
| GACK** | EPP = 0 | 4202 | 3 | 1452 | 87 | 97.97 | 99.79 | 98.43 |
| Porwollik | | 4219 | 2 | 1453 | 70 | 98.37 | 99.86 | 98.75 |
| MKD*** | | 4243 | 12 | 1443 | 46 | 98.93 | 99.18 | 98.99 |
| Mixture | Bimodal | 4256 | 20 | 1435 | 33 | 99.23 | 98.63 | 99.07 |
| Mixture | Trimodal | 4236 | 4 | 1451 | 53 | 98.76 | 99.73 | 99.01 |
The numbers of genes correctly identified as conserved (true positives, TP), genes identified as conserved that are actually variable (false positives, FP), genes correctly identified as variable (true negatives, TN), and genes falsely identified as variable (false negatives, FN) are given. The sensitivity, specificity, and M-Score are also calculated, where sensitivity = TP/(TP+FN), specificity = TN/(FP+TN), and M-Score = sensitivity × prevalence + specificity × (1 − prevalence).
* Institute of Food Research method (GENCOM)
** Genotyping Analysis by Charlie Kim method (GACK)
*** Minimum Kernel Density method (MKD)
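The three summary metrics can be recomputed from the four counts in each row. The sketch below is illustrative only: the function name `cgh_metrics` is hypothetical, and it assumes that prevalence means the fraction of genes that are truly conserved, (TP+FN)/total, which the source does not define. Sensitivity is computed as TP/(TP+FN), the form that reproduces the tabulated values.

```python
def cgh_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity and M-Score as percentages.

    Assumption: prevalence = (TP + FN) / total, i.e. the share of genes
    that are truly conserved according to the sequence comparison.
    """
    total = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    prevalence = (tp + fn) / total
    m_score = sensitivity * prevalence + specificity * (1 - prevalence)
    return {
        "sensitivity": round(100 * sensitivity, 2),
        "specificity": round(100 * specificity, 2),
        "m_score": round(100 * m_score, 2),
    }


# Counts from the MG1655 naive cut-off 0.33 row.
row = cgh_metrics(tp=4244, fp=6, tn=1449, fn=45)
```

With prevalence defined this way, the M-Score reduces algebraically to (TP+TN)/total, the overall fraction of genes classified correctly.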
Comparison of the six algorithms using the microarray hybridization data from the Sakai sequenced strain. The results of each algorithm were compared to BLASTN data and are shown below.
| Algorithm | Setting | TP | FP | TN | FN | Sensitivity | Specificity | M-Score |
| Naïve cut-off | 0.50 | 5285 | 2 | 416 | 41 | 99.23 | 99.52 | 99.25 |
| Naïve cut-off | 0.33 | 5297 | 6 | 412 | 29 | 99.45 | 98.56 | 99.39 |
| Naïve cut-off | 0.25 | 5302 | 8 | 410 | 24 | 99.55 | 98.09 | 99.44 |
| Naïve cut-off | 0.20 | 5308 | 12 | 406 | 18 | 99.66 | 97.13 | 99.48 |
| Naïve cut-off | 0.10 | 5316 | 64 | 354 | 10 | 99.81 | 84.69 | 98.71 |
| GENCOM* | | 5238 | 2 | 416 | 88 | 98.34 | 99.52 | 98.43 |
| GACK** | EPP = 50 | 5137 | 1 | 417 | 189 | 96.45 | 99.76 | 96.69 |
| GACK** | EPP = 0 | 5261 | 1 | 417 | 65 | 98.78 | 99.76 | 98.85 |
| Porwollik | | 5277 | 1 | 417 | 49 | 99.07 | 98.76 | 99.13 |
| MKD*** | | 5297 | 6 | 412 | 29 | 99.45 | 98.56 | 99.39 |
| Mixture | Bimodal | 5296 | 8 | 413 | 30 | 99.44 | 98.80 | 99.39 |
| Mixture | Trimodal | 5244 | 1 | 417 | 82 | 98.46 | 99.76 | 98.55 |
The numbers of genes correctly identified as conserved (true positives, TP), genes identified as conserved that are actually variable (false positives, FP), genes correctly identified as variable (true negatives, TN), and genes falsely identified as variable (false negatives, FN) are given. The sensitivity, specificity, and M-Score are also calculated, where sensitivity = TP/(TP+FN), specificity = TN/(FP+TN), and M-Score = sensitivity × prevalence + specificity × (1 − prevalence).
* Institute of Food Research method (GENCOM)
** Genotyping Analysis by Charlie Kim method (GACK)
*** Minimum Kernel Density method (MKD)
Summary of the performance of each algorithm using CGH microarray data for the MG1655 sequenced strain. The sensitivity, specificity, and M-score generated from each of the cut-off algorithms from the CGH data were summarized for comparison.
| Algorithm | M-Score | Sensitivity | Specificity |
| Naive Cut-off (0.25) | 99.12 | 99.07 | 99.31 |
| Mixture Model (Bimodal) | 99.07 | 99.23 | 98.63 |
| MKD*** | 98.99 | 98.93 | 99.18 |
| Porwollik | 98.75 | 98.37 | 99.86 |
| GENCOM* | 98.45 | 97.99 | 99.79 |
| GACK** EPP = 0 | 98.43 | 97.97 | 99.79 |
* Institute of Food Research method (GENCOM)
** Genotyping Analysis by Charlie Kim method (GACK)
*** Minimum Kernel Density method (MKD)
Figure 3. Histogram of MG1655 hybridisation data. A histogram of the MG1655 microarray hybridisation data is shown (slide number 12842588). The data are displayed on the raw scale (a) and on the log2 scale, with the scaled kernel density superimposed (b).
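The kernel density superimposed in Figure 3(b) suggests how a density-based cut-off might be chosen. This excerpt names a Minimum Kernel Density (MKD) method but does not describe it, so the following is only one plausible reading and not the authors' procedure: estimate a Gaussian kernel density of the log2 ratios and take the cut-off at the density minimum between the two largest modes. The function names, bandwidth, grid size and toy data are all hypothetical.

```python
import math


def gaussian_kde(data, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `data` on `grid`."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in data)
        for g in grid
    ]


def density_minimum_cutoff(data, bandwidth=0.3, points=401):
    """Return the grid value with the lowest density between the two
    highest density peaks (assumed to be the divergent and conserved modes)."""
    lo, hi = min(data) - bandwidth, max(data) + bandwidth
    grid = [lo + (hi - lo) * i / (points - 1) for i in range(points)]
    dens = gaussian_kde(data, grid, bandwidth)
    # Interior local maxima of the estimated density.
    peaks = [
        i for i in range(1, points - 1)
        if dens[i - 1] < dens[i] >= dens[i + 1]
    ]
    if len(peaks) < 2:
        raise ValueError("density is not bimodal at this bandwidth")
    # Keep the two highest peaks, ordered left to right.
    left, right = sorted(sorted(peaks, key=lambda i: dens[i], reverse=True)[:2])
    cut = min(range(left, right + 1), key=lambda i: dens[i])
    return grid[cut]


# Toy log2 ratios: a divergent cluster near -3 and a conserved cluster near 0.
divergent = [-3.0 + 0.1 * k for k in range(-5, 6)]
conserved = [0.1 * k for k in range(-5, 6)]
cutoff = density_minimum_cutoff(divergent + conserved)
```

The bandwidth controls how many modes the estimate shows, so in practice the choice of bandwidth would need to be justified against the data rather than fixed as it is here.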
Comparison of the six algorithms using the microarray hybridization data from the EDL933 sequenced strain. The results of each algorithm were compared to BLASTN data and are shown below.
| Algorithm | Setting | TP | FP | TN | FN | Sensitivity | Specificity | M-Score |
| Naïve cut-off | 0.50 | 5202 | 24 | 475 | 43 | 99.18 | 95.19 | 98.83 |
| Naïve cut-off | 0.33 | 5211 | 31 | 468 | 34 | 99.35 | 93.79 | 98.87 |
| Naïve cut-off | 0.25 | 5218 | 37 | 462 | 27 | 99.49 | 92.59 | 98.89 |
| Naïve cut-off | 0.20 | 5222 | 43 | 456 | 23 | 99.56 | 91.38 | 98.58 |
| Naïve cut-off | 0.10 | 5235 | 127 | 372 | 10 | 99.81 | 74.55 | 97.61 |
| GENCOM* | | 5167 | 21 | 478 | 78 | 98.31 | 95.79 | 98.09 |
| GACK** | EPP = 50 | 5083 | 22 | 477 | 162 | 96.91 | 95.59 | 96.80 |
| GACK** | EPP = 0 | 5197 | 24 | 475 | 48 | 99.08 | 95.19 | 98.75 |
| Porwollik | | 5185 | 23 | 476 | 57 | 98.91 | 95.39 | 98.60 |
| MKD*** | | 5207 | 29 | 470 | 38 | 99.28 | 94.18 | 98.84 |
| Mixture | Bimodal | 5215 | 37 | 462 | 30 | 99.42 | 92.59 | 98.83 |
| Mixture | Trimodal | 5207 | 27 | 472 | 38 | 99.27 | 94.59 | 98.86 |
The numbers of genes correctly identified as conserved (true positives, TP), genes identified as conserved that are actually variable (false positives, FP), genes correctly identified as variable (true negatives, TN), and genes falsely identified as variable (false negatives, FN) are given. The sensitivity, specificity, and M-Score are also calculated, where sensitivity = TP/(TP+FN), specificity = TN/(FP+TN), and M-Score = sensitivity × prevalence + specificity × (1 − prevalence).
* Institute of Food Research method (GENCOM)
** Genotyping Analysis by Charlie Kim method (GACK)
*** Minimum Kernel Density method (MKD)
Summary of the performance of each algorithm using CGH microarray data for the EDL933 sequenced strain. The sensitivity, specificity, and M-score generated from each of the cut-off algorithms from the CGH data were summarized for comparison.
| Algorithm | M-Score | Sensitivity | Specificity |
| Naive Cut-off (0.25) | 98.89 | 99.49 | 92.59 |
| Mixture Model (trimodal) | 98.86 | 99.27 | 94.59 |
| MKD*** | 98.84 | 99.28 | 94.18 |
| Porwollik | 98.60 | 98.91 | 95.39 |
| GACK** EPP = 0 | 98.75 | 99.08 | 95.19 |
| GENCOM* | 98.09 | 98.31 | 95.79 |
* Institute of Food Research method (GENCOM)
** Genotyping Analysis by Charlie Kim method (GACK)
*** Minimum Kernel Density method (MKD)
Summary of the performance of each algorithm using CGH microarray data for the Sakai sequenced strain. The sensitivity, specificity, and M-score generated from each of the cut-off algorithms from the CGH data were summarized for comparison.
| Algorithm | M-Score | Sensitivity | Specificity |
| Naive Cut-off (0.20) | 99.48 | 99.66 | 97.13 |
| MKD*** | 99.39 | 99.45 | 98.56 |
| Mixture Model (bimodal) | 99.39 | 99.44 | 98.80 |
| Porwollik | 99.13 | 99.07 | 98.76 |
| GACK** EPP = 0 | 98.85 | 98.78 | 99.76 |
| GENCOM* | 98.43 | 98.34 | 99.52 |
* Institute of Food Research method (GENCOM)
** Genotyping Analysis by Charlie Kim method (GACK)
*** Minimum Kernel Density method (MKD)
Figure 4. A scatter plot matrix of unknown test strains. The three reference strains and two test strains (X1 and X2 represent strains 0864/00 and 0330/01) were compared in a scatter plot matrix to identify the test strains most similar to each sequenced strain. The lower left panels present the scatter plots with smoothing splines and the right-hand panels display the Pearson's correlation coefficient.
Analysis of microarray hybridisation data from 19 unsequenced test strains. The CGH microarray data were cleaned and normalised, and the cut-off was assessed using the MKD method; the Pearson's correlation coefficient was then calculated between each test strain and each reference strain. The highest correlation for each test strain is shown below. The strain ID, source and slide codes for each of the test strains are included.
| Strain | Source | Slide replicate numbers | Most correlated reference strain | Pearson correlation |
| 0023/99 | Bovine | 13248842, 12842688 | EDL933 | 0.78 |
| 0059/99 | Bovine | 12842681, 13252965 | EDL933 | 0.83 |
| 0144/99 | Ovine | 12842593, 13248844 | EDL933 | 0.83 |
| 0445/99 | Ovine | 12842605, 12842576 | EDL933 | 0.85 |
| 0796/00 | Bovine | 12842610, 12842591 | EDL933 | 0.78 |
| 1299/00 | Human | 12842608, 13248843 | EDL933 | 0.83 |
| 1463/00 | Human | 12842620, 12842580 | EDL933 | 0.85 |
| 1464/00 | Human | 12842621, 12842581 | EDL933 | 0.79 |
| 1471/00 | Human | 13252961, 13248846, 12842615 | EDL933 | 0.81 |
| 1472/00 | Human | 13252963, 12842582, 12842614 | EDL933 | 0.86 |
| 1484/00 | Bovine (Burger) | 12842669, 13248838 | EDL933 | 0.66 |
| 1489/00 | Bovine (Steak) | 12842668, 12842577 | EDL933 | 0.75 |
| 1812/00 | Bovine | 12842666, 12842578 | EDL933 | 0.81 |
| 1585/00 | Bovine | 12842603, 12842586 | EDL933 | 0.88 |
| 0945/00 | Bovine | 12842682, 12842583 | EDL933 | 0.79 |
| 0330/01 | Bovine | 12842680, 13248839 | EDL933 | 0.84 |
| 0864/00 | Bovine | 13248841, 12842613 | EDL933 | 0.82 |
| 1070/00 | Bovine | 12842678, 13248840 | Sakai | 0.59 |
| 1176/00 | Bovine | 12842601, 12842585 | MG1655 | 0.85 |
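Choosing the most correlated reference strain for a test strain, as tabulated above, amounts to computing a Pearson correlation against each reference profile and keeping the largest. The sketch below assumes each strain is represented by one value per gene (for example a processed log2 ratio) in a shared gene order. The source does not say exactly which values were correlated, and the function names and toy profiles are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def most_correlated(test_profile, references):
    """Return (name, r) for the reference profile best correlated with the test."""
    scores = {name: pearson(test_profile, prof) for name, prof in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical per-gene values for five genes (same gene order throughout).
references = {
    "MG1655": [0.1, -2.5, 0.0, -3.0, 0.2],
    "EDL933": [0.0, 0.1, -2.8, 0.2, -0.1],
    "Sakai": [0.1, 0.0, -2.6, -2.9, 0.0],
}
test_strain = [0.2, 0.0, -2.4, 0.1, 0.0]
best_name, best_r = most_correlated(test_strain, references)
```

Reporting only the best match hides how close the runner-up is, so retaining the full `scores` dictionary may be preferable when two references are similarly correlated.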
Figure 5. The empirical density of the Sakai strain partitioned into single and multiple copy genes. (a) The theoretical location of the four modes, assuming a constant coefficient of hybridisation and labelling. (b) The empirical density of the Sakai data. The primary mode consists of 3,755 and 240 genes present in all three of the sequenced strains on the Cy5 channel, shown as single and multiple gene copies (solid line and broken line, respectively). The secondary mode includes genes that are specific to the Sakai strain and consists of 948 and 586 genes in single and multiple copies (solid and broken lines, respectively).
Figure 6. The Sakai log. The K12 chromosomal backbone and single and multiple contiguous gene deletions harboured in Sakai with respect to the MG1655 chromosome are shown.