Sacha A F T van Hijum, Richard J S Baerends, Aldert L Zomer, Harma A Karsens, Victoria Martin-Requena, Oswaldo Trelles, Jan Kok, Oscar P Kuipers.
Abstract
BACKGROUND: Array-based comparative genome hybridization (aCGH) is commonly used to determine the genomic content of bacterial strains. Because prokaryotic genome sequences are generally less conserved than eukaryotic ones, sequence divergence between the genes of the genomes compared in an aCGH experiment obstructs the determination of genome variations (e.g. deletions). Current normalization methods do not take sequence divergence between the target and the microarray features into account and therefore cannot distinguish whether a difference in signal is due to systematic errors in the data or to sequence divergence.
Year: 2008 PMID: 18267014 PMCID: PMC2275246 DOI: 10.1186/1471-2105-9-93
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Flow diagram of the S-Lowess procedure. The S-Lowess procedure consists of two phases: 1) determine or upload likely conserved genes (LCGs) and 2) normalize a microarray dataset with the LCGs. If de novo prediction of LCGs is selected in phase 1, the user has to upload the microarray feature sequences and select one or more genomes (in this study, 3 Streptococcus genomes). The optimal parameters for selecting LCGs from a BLAT sequence comparison of the array features against multiple reporter genomes are difficult to predict. Selection of an LCG set is therefore facilitated by cycling through at most 2 of the following parameters: (i) alignment length cutoff, (ii) E-value cutoff, (iii) percentage nucleotide identity cutoff, (iv) maximum number of hits within the same genome (to account for paralogous genes or duplicated genome fragments), (v) minimum number of hits across genomes (to account for gene conservation in multiple genome sequences). Array feature sequences that meet the criteria (here, a significant BLAT hit in at least 2 out of 3 genomes; one hit over at least 100 bp with at least 80% nucleotide identity) are marked as LCGs and added to the conserved array feature list. In phase 2, the LCGs are used to normalize an uploaded aCGH microarray dataset. The result of phase 2 is a normalized dataset and diagnostic plots.
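The phase 1 selection criteria can be illustrated with a minimal Python sketch. It is not the S-Lowess implementation: the `Hit` record, the `select_lcgs` function and all parameter names are hypothetical, the hits are assumed to have already been parsed from BLAT output, and the E-value cutoff is left out for brevity. The sketch also assumes that a genome counts towards conservation only if it holds between one and `max_hits_per_genome` qualifying hits, which is one possible reading of criteria (iv) and (v).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment of an array feature to a genome (hypothetical record)."""
    feature: str     # array feature (amplicon) identifier
    genome: str      # genome in which the hit was found
    length: int      # alignment length in bp
    identity: float  # percent nucleotide identity


def select_lcgs(hits, min_length=100, min_identity=80.0,
                max_hits_per_genome=1, min_genomes=2):
    """Return the sorted feature IDs that pass the LCG criteria."""
    # Count qualifying hits per feature and per genome.
    counts = defaultdict(lambda: defaultdict(int))
    for h in hits:
        if h.length >= min_length and h.identity >= min_identity:
            counts[h.feature][h.genome] += 1

    lcgs = []
    for feature, per_genome in counts.items():
        # Assumption: a genome with too many qualifying hits (possible
        # paralogs or duplicated fragments) does not count as support.
        supporting = [g for g, n in per_genome.items()
                      if n <= max_hits_per_genome]
        if len(supporting) >= min_genomes:
            lcgs.append(feature)
    return sorted(lcgs)


example_hits = [
    Hit("f1", "A", 150, 90.0), Hit("f1", "B", 120, 85.0),  # conserved in 2 genomes
    Hit("f2", "A", 150, 90.0),                             # only 1 genome
    Hit("f3", "A", 150, 90.0), Hit("f3", "A", 140, 88.0),  # 2 hits in genome A
    Hit("f3", "B", 130, 95.0), Hit("f3", "C", 90, 99.0),   # C hit is too short
]
lcg_set = select_lcgs(example_hits)
```

With the default cutoffs, which mirror the values quoted in the caption, only `f1` is kept. Cycling through parameters, as the caption describes, would amount to calling `select_lcgs` repeatedly with different cutoff values and comparing the resulting sets.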
Figure 2. MA-plots of aCGH data after applying different normalization methods. The log-transformed ratios of slide 11 [17] are plotted against the log-transformed sum of the green (negative M values; MG1363 signals) and red (positive M values; IL1403 signals) channels. A: non-normalized data. B: grid-based Lowess normalization. C: S-Lowess normalization based on the LCG set obtained from the comparison of L. lactis IL1403 amplicon sequences to the ORFs of three S. pneumoniae strains. D: S-Lowess normalization with a stringent LCG set (99% identity over 100 bp).
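The idea behind an MA-plot and a subset-based normalization can be sketched as follows. This is an illustration under stated assumptions, not the published S-Lowess code: `ma_values`, `local_fit` and `subset_normalize` are hypothetical names, M and A follow the common convention M = log2(R/G) and A = 0.5 * log2(R * G), and the smoother is a simplified tricube-weighted local linear fit without the robustness iterations of a full Lowess. The trend is fitted on the LCG subset only and then subtracted from every feature, so that divergent or deleted genes do not pull the curve.

```python
import math


def ma_values(red, green):
    """Return (M, A) lists for paired red/green intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def local_fit(x0, xs, ys, frac=0.5):
    """Tricube-weighted linear fit of ys on xs, evaluated at x0."""
    k = min(len(xs), max(2, math.ceil(frac * len(xs))))
    dists = sorted(abs(x - x0) for x in xs)
    # Bandwidth slightly beyond the k-th nearest point so it keeps weight > 0.
    h = dists[k - 1] * 1.0001 + 1e-12
    w = [max(0.0, 1.0 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx < 1e-12:  # no spread in x: fall back to the weighted mean
        return my
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)


def subset_normalize(m, a, subset_idx, frac=0.5):
    """Subtract the trend fitted on the subset (e.g. LCGs) from all M values."""
    xs = [a[i] for i in subset_idx]
    ys = [m[i] for i in subset_idx]
    return [mi - local_fit(ai, xs, ys, frac) for mi, ai in zip(m, a)]


# Toy data: six conserved features with an intensity-dependent bias
# M = 0.5 + 0.1 * A, plus one feature with an extra offset of -2.
a_vals = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 10.5]
m_vals = [0.5 + 0.1 * x for x in a_vals[:6]] + [0.5 + 0.1 * 10.5 - 2.0]
normalized = subset_normalize(m_vals, a_vals, subset_idx=range(6))
```

In this toy example the six conserved features end up at approximately 0 after normalization, while the seventh feature keeps its offset of about -2, which is the behaviour one would want for a deleted or divergent gene.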
Figure 3. Performance of the different normalization methods in the identification of deletions in . Blue: number of deletions correctly called (here, a cutoff of 1.5-fold is used). Purple: number of deletions missed. S-L: S-Lowess. S-L Sp: S-Lowess normalization based on the LCG set obtained from the comparison of L. lactis IL1403 amplicon sequences to the ORFs of three S. pneumoniae strains. The total heights of the bars indicate the total number of amplicons for missing ORFs in L. lactis MG1363 with at least 5 aCGH measurements. They thus indicate the total number of missing ORFs that could be detected based on the aCGH data.
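The counting behind such a bar chart can be approximated with a short sketch. It rests on several assumptions that the caption does not state: the normalized log2 ratios are oriented so that a deletion in the test strain gives a positive value, replicate measurements are summarized by their median, and the names `call_deletions` and `score_calls` are hypothetical. Only the 1.5-fold cutoff and the minimum of 5 measurements come from the caption.

```python
import math
from statistics import median


def call_deletions(log_ratios, fold_cutoff=1.5, min_measurements=5):
    """Map amplicon ID -> True if called deleted, for amplicons with enough data.

    log_ratios: dict of amplicon ID -> list of normalized log2 ratios,
    assumed to be oriented so that a deletion yields positive values.
    """
    threshold = math.log2(fold_cutoff)
    calls = {}
    for amplicon, values in log_ratios.items():
        if len(values) < min_measurements:
            continue  # too few measurements to be considered
        calls[amplicon] = median(values) >= threshold
    return calls


def score_calls(calls, known_deleted):
    """Return (correctly called, missed) among the known deletions with a call."""
    called = sum(1 for amp in known_deleted if calls.get(amp) is True)
    missed = sum(1 for amp in known_deleted if calls.get(amp) is False)
    return called, missed


example = {
    "a": [1.0, 1.1, 0.9, 1.0, 1.2],    # clearly above log2(1.5)
    "b": [0.1, 0.2, 0.0, -0.1, 0.3],   # below the cutoff
    "c": [2.0, 2.0],                   # fewer than 5 measurements
}
calls = call_deletions(example)
called, missed = score_calls(calls, known_deleted={"a", "b", "c"})
```

Here `called + missed` corresponds to the total bar height in the figure: the number of known deletions that have enough measurements to be detectable at all.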
Figure 4. Correlation plots of the normalized ratios (signals for labeled gDNA of . Sp: S-Lowess normalization based on the LCG set obtained from the comparison of L. lactis IL1403 amplicon sequences to the ORFs of three S. pneumoniae strains. The R² values indicate the quality of the regression curve fit (where higher is better).