Literature DB >> 20236987

Evaluating oligonucleotide properties for DNA microarray probe design.

Xiao-Qin Xia¹, Zhenyu Jia, Steffen Porwollik, Fred Long, Claudia Hoemme, Kai Ye, Carsten Müller-Tidow, Michael McClelland, Yipeng Wang.

Abstract

Most current microarray oligonucleotide probe design strategies are based on probe design factors (PDFs), which include probe hybridization free energy (PHFE), probe minimum folding energy (PMFE), dimer score, hairpin score, homology score and complexity score. The impact of these PDFs on probe performance was evaluated using four sets of microarray comparative genome hybridization (aCGH) data, which included two array manufacturing methods and the genomes of two species. Since most of the hybridizing DNA is equimolar in CGH data, such data are ideal for testing the general hybridization properties of almost all candidate oligonucleotides. In all our data sets, PDFs related to probe secondary structure (PMFE, hairpin score and dimer score) are the most significant factors linearly correlated with probe hybridization intensities. PHFE, homology and complexity score are correlating significantly with probe specificities, but in a non-linear fashion. We developed a new PDF, pseudo probe binding energy (PPBE), by iteratively fitting dinucleotide positional weights and dinucleotide stacking energies until the average residue sum of squares for the model was minimized. PPBE showed a better correlation with probe sensitivity and a better specificity than all other PDFs, although training data are required to construct a PPBE model prior to designing new oligonucleotide probes. The physical properties that are measured by PPBE are as yet unknown but include a platform-dependent component. A practical way to use these PDFs for probe design is to set cutoff thresholds to filter out bad quality probes. Programs and correlation parameters from this study are freely available to facilitate the design of DNA microarray oligonucleotide probes.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：
Oligonucleotide Probes

Year: 2010 PMID： 20236987 PMCID： PMC2887943 DOI： 10.1093/nar/gkq039

Source DB: PubMed Journal: Nucleic Acids Res ISSN： 0305-1048 Impact factor: 16.971

INTRODUCTION

Microarray technology surveys many thousands of genes to investigate gene expression (1), transcription factor binding profiles (2–5), DNA methylation profiles (4–6), DNA copy numbers (5) and genomic sequences (7). Oligonucleotide probes provide higher hybridization specificity than longer PCR products (8–10). Falling costs of oligonucleotide synthesis, along with the development of new microarray manufacturing technologies, such as the NimbleGen maskless array synthesizer (11) and Agilent's ink-jet oligonucleotide synthesizer, make custom long (>50 bases) oligonucleotide arrays possible for many experimental applications. Optimal probe design algorithms are consequently desirable. Hybridization on an array is characterized by several interconnected processes, including the affinity of a target for a probe, formation of stem–loop structures of a probe, formation of secondary structures (loops and helices) of a target, and probe-to-probe dimerization (12–16). There are a variety of factors governing these processes, including probe hybridization free energy (PHFE) (17), probe minimum folding energy (PMFE) (18), probe dimer and hairpin scores (19), as well as homology and complexity scores (20). Most of the current oligonucleotide probe design software packages estimate these properties (20–28). To systematically and quantitatively study how these factors influence probe performance in microarrays, we collected a large amount of array CGH data and used these data to evaluate the utility of each PDF for probe selection. Using aCGH data, a novel PDF, pseudo probe binding energy (PPBE), was developed. PPBE is more accurate in predicting probe performance than all other factors and can thus be used for iterative improvement of the choice of oligonucleotides on the array. While the specific physical properties measured by PPBE remain unknown, they encompass platform-specific parameters.

METHODS

Microarray CGH data sets

Four comparative genome hybridization microarray data sets were used in the study (Table 1). Human genomic DNA (Data sets 1, 2 and 4) and Salmonella genomic DNA (Data set 3) samples were hybridized to their corresponding arrays. The array platforms include NimbleGen arrays (3′ end of the oligo is linked to the solid phase) and in-house spotted oligonucleotide arrays (5′ end of oligos is linked to the solid phase). The majority of probes on the arrays we used are 50 bases in length. However, there are also probes of different lengths, e.g. there are 9989 46-mer probes and 4721 55-mer probes on the array for data set 4. We found that the correlations of PDFs to probe sensitivities for these probes were very similar to those for the 50-mer probes (data not shown). In order to make data comparable across platforms, only data from 50-mer oligonucleotide probes were used. Hybridization intensity values were natural log transformed before fitting the linear models.

Table 1.

Array CGH data set used in this study

Data Set	Microarray platform	Sample	Manufacturer	Designer	Oligos	Bases	Role of data set in the analysis	Number of samples
1	NimbleGen HG18 whole genome CGH Array	Normal human male genomic DNA	NimbleGen Inc.	NimbleGen Inc.	137 280	50	Sensitivity	6
2	NimbleGen Human Promoter Array (custom design)	Human prostate cell line (PC3M, 267B1) genomic DNA	NimbleGen Inc.	authors	220 475	50	Sensitivity	4
3	NimbleGen Salmonella Whole Genome Array (custom design)	Salmonella LT2 genomic DNA	NimbleGen Inc.	authors	288 238	50	Sensitivity, specificity	4
4	In-house Spotted Human Promoter Array (custom design)	Normal human lung tissue genomic DNA	authors	authors	11 653	50	Sensitivity, reproducibility	205

Array CGH data set used in this study Data set 3 used pooled Salmonella genomic DNA XbaI restriction fragments, representing half of the genome in 3-fold excess, in one channel, and whole genomic DNA in the other. Data set 4 contain 205 replicates of human lung tissue genomic DNA hybridizations which were used as control channel in two-color hybridizations experiments.

PDF

The following DNA microarray PDFs were included in this study.

PHFE

PHFE was calculated based on the dinucleotide stacking energies. where n is the oligonucleotide length, is the k-th position dinucleotide stacking energy, and εhead and εtail are the terminal nucleotide stacking energies. The salt concentrations for the calculations were set to 1 M Na+, 0 M Mg++, and the temperature was set to 40, 50 or 60°C for the computation of PHFE. The dinucleotide stacking energies are computed according to SantaLucia (17) and shown in Supplementary Table 1.

PPBE

For a probe sequence with n bases, the PPBE model is parameterized by dinucleotide stacking energies ε and position-dependent weights ω, . The position-dependent weight ω is first estimated by fitting the linear model, employing dinucleotide stacking energies (as used in the PHFE model) as initial values. Then, with the same linear model fitting scheme, the pseudo dinucleotide stacking energies ε are approximated by treating previously estimated weights as known quantities. Such a process of ‘reciprocal’ estimation was iteratively carried out three times, at which point the average residue sum of squares (ARSS) for the PPBE model reached its minimum or near-minimum (see also the ‘Linear modeling’ section below, and Figure 1A).

Figure 1.

ARSS, positional weights and pseudo stacking energies of the PPBE model for data set 1. (A) Convergence of the PPBE model after three cycles of iterative fitting of both of positional weights and pseudo dinucleotide stacking energies (six cycles total); (B) Plot of positional weights; (C) Comparison of traditional dinucleotide stacking energies and pseudo dinucleotide stacking energies.

PMFE

PMFE is the minimum folding energy of a single strand DNA sequence and represents the stability of the secondary structure of a given sequence. PMFE were computed by using the MFOLD program (18). The program hybrid-ss-min was downloaded from http://www.bioinfo.rpi.edu/applications/hybrid/download.php and executed on GNU/Linux. The parameters were set as DNA–DNA hybridization, 1 M Na+, 0 M Mg++, and the temperature was set to 40, 50 or 60 for the calculation of PMFE.

Probe dimer score, hairpin score

The calculation of the probe dimer score and the hairpin score was described as part of the AutoDimer program based on a sliding algorithm (19). For screening probe dimers, two probe sequences are incrementally overlapped, and the presence or absence of base pairing is evaluated and tabulated. A dimer score value was then determined by combining the number of Watson–Crick base pairs (+1) with mismatches (−1). Hairpin secondary structures were screened by using the probe sequence to check for the presence of 4 and 5 base loops. A minimum of a 2-base stem was deemed to be necessary in a hairpin structure. Hairpin scores were sums of matched base pairs (+1) in hairpin stems where mismatches are not permitted.

Homology score

The homology score for each oligonucleotide estimates the degree of cross-hybridization, and is based on a BLAST search of the input sequence against a species-specific database. The calculation of the homology score was similar to the one used in the OligoWiz program (20). where L is the length of the oligonucleotide, m is the number of Blast hits considered in position i of the oligonucleotide and is the bit score in position i. Oligonucleotides with 100% identity to any considered BLAST hit along the full length received a score of 0. Percentages of identity <70% or shorter than 15 bp were removed, resulting in perfect homology scores of 1 for those oligos.

Complexity score

Complexity scores were calculated for estimating the degree of common sequence fragments in a given oligonucleotide, as described in the OligoWiz program (20). The information content can be calculated by the following equation: where is the number of occurrences of a pattern in the genome, represents the pattern length, and nt is the total number of patterns found in DNA sequences present in the target pool, for example, the whole genome in an array comparative genomic hybridization. The following equation was used to calculate the complexity score for each oligonucleotide probe: where L is the length of the oligonucleotide, w is the pattern in position i and norm is a function that normalizes the summed information to a value between 1 and 0 by dividing them by the maximum value. A complexity score of 0 indicates an oligonucleotide with very low complexity. Pattern lengths of 2, 5, 8 and 11 bases were tested in this study.

Oligonucleotide specificity and reproducibility

Data set 3, with known expected oligonucleotide signal ratios (3-fold changes) between the two channels, was used for estimating oligonucleotide probe specificity. The observed ratios were log2 base transformed for further analysis. Coefficient of variation (CV) was used for estimating probe reproducibility.

Linear modeling and model validation

R language (http://www.r-project.org) was used for linear modeling (29–31). In the four microarray data sets, simple linear models were used to evaluate each individual PDF, and multivariate models were used to estimate all PDFs together. The ARSS, which reflects the model fitness, was defined as , where g was the observed ln-transformed intensity for probe i, was the predicted ln-transformed intensity for probe i, and n was the number of probes. For model selection, the stepAIC function in the MASS package (http://www.r-project.org) was used to reduce the full model to the optimal one. This Akaike information criterion (AIC) is a measure of the quality of the fit of an estimated statistical model and balances the complexity of an estimated model with the accuracy with which the model fits the data (32). The models were validated in two ways: within one data set and across different data sets. In both cases, the leave-many-out cross-validation (33) was used. Within-dataset validation uses half of the data from one data set to train the models and the other half for testing of the models. Cross-dataset validation uses different data sets, which may vary in array platforms and sample species, for training and testing.

RESULTS

Array CGH data is a valuable source for studying microarray oligonucleotide probe performance because it can be assumed that most of the probes in these experiments hybridize to approximately equimolar target amounts, resulting in relatively uniform hybridization signals. Four large aCGH data sets on different array platforms, with a total of 657,646 oligos of 50 bases in length and 219 samples, were used in this study to evaluate PDFs and to develop new algorithms (Table 1).

Correlation of individual PDFs with probe hybridization intensities

The models examined are all presented in the ‘Methods’ section and will not be repeated here. All 10 PDFs, i.e. PHFE, PMFE, hairpin score, probe dimer score, homology score, complexity score (2 bases), complexity score (5 bases), complexity score (8 bases), complexity score (11 bases) and PPBE, showed highly significant correlation with probe hybridization intensities, as shown in Figure 2 (data set 1) and Supplementary Figure 1 (data sets 2, 3 and 4). The correlation coefficients (r), ARSS, intercepts and slopes for these linear regression models are listed in Table 2 and Supplementary Table 2.

Figure 2.

Table 2.

Simple linear model average residue square sum (ARSS) and correlation coefficients (r) for the correlation of individual PDFs with probe hybridization intensities

	Data Set 1		Data Set 2		Data Set 3		Data Set 4
	r	ARSS	r	ARSS	r	ARSS	r	ARSS
PHFE	0.11	0.168	0.03	0.504	0.03	0.460	0.13	1.668
PMFE	0.29	0.156	0.27	0.468	0.32	0.414	0.28	1.568
HairpinScore	0.21	0.162	0.22	0.479	0.20	0.442	0.21	1.621
DimerScore	0.19	0.164	0.23	0.478	0.17	0.448	0.15	1.660
ComplexityScore-2B	0.08	0.169	0.05	0.503	0.02	0.461	0.09	1.684
ComplexityScore-5B	0.04	0.170	0.11	0.498	0.01	0.461	0.02	1.698
ComplexityScore-8B	0.01	0.170	0.15	0.493	0.01	0.461	0.12	1.675
ComplexityScore-11B	0.01	0.170	0.10	0.498	0.02	0.461	0.10	1.683
BlastScore	0.02	0.170	0.11	0.498	0.01	0.461	0.18	1.641
PPBE	0.36	0.148	0.30	0.460	0.65	0.269	0.48	1.301

Box plots (black line) show the correlation of individual PDFs with observed oligonucleotide probe hybridization intensities for data set 1. The density curve (red line) is computed using kernel density estimates and shows the distribution of individual PDFs. The secondary Y-axis represent the density of different PDFs. Simple linear model average residue square sum (ARSS) and correlation coefficients (r) for the correlation of individual PDFs with probe hybridization intensities The ARSS values of linear models based on individual PDFs were compared, as shown in Figure 3. Among these factors, PPBE generated the lowest ARSS, suggesting that this factor is superior to the traditional factors in correlating with probe hybridization intensity. PPBE was modeled by iteratively fitting dinucleotide stacking energies and positional weights, with the conventional dinucleotide stacking energies as initial values. The ARSS values from the PPBE model tend to stabilize after three cycles of iterative fitting of both positional weights and pseudo dinucleotide stacking energies (Figure 1 and Supplementary Figure 2). The positional weights and pseudo dinucleotide stacking energies generated from the different data sets are entirely different, reflecting the empirical nature of the model. The positional weights and pseudo stacking energies for PPBE models from different data sets are listed in Supplementary Tables 3 and 4, and the positional weights illustrate the effect of the distance of the dinucleotide to the solid phase. The positional weights of data sets 2 and 4, for example, showed inverse correlation to the distance to the probe’s 5′ end, which may be due to the fact that these platforms differed in the ends of oligos that were linked to the solid phase (5′ versus 3′).

Figure 3.

Relative ARSS of different models for different data sets.

Relative ARSS of different models for different data sets. In most data sets, the best individual traditional factors were PMFE, dimer score and hairpin score. All these three PDFs showed that less stable probe secondary structure positively correlates with probe hybridization intensity, suggesting that the formation of secondary structure can severely hinder the probe hybridization capabilities. PHFE’s linear correlation with probe hybridization intensity was less significant, suggesting that hybridization behavior on microarrays might be different from that in solution. Moreover, quadratic rather than linear relationships were observed for data sets 1 and 3, and the mode (the peak points shown in Figure 2A and Supplementary Figure 1 and 2A) varies among these two data sets, suggesting that hybridization conditions were not the same for the two data sets. We tried to use quadratic equations to fit the data sets 1 and 3, but the ARSS values generated from these models were bigger than those obtained using simple linear models (data not shown). This is probably due to the fact that the majority of PHFE data points is clustered within a very narrow range, where the relationship between PHFE and intensities may be better described by a linear equation. In future studies, once there are sufficiently large data sets with a higher PHFE data spread across a wider range of values, more advanced models can be applied to scrutinize the relationship between PHFE and hybridization intensities in a non-linear fashion. Blast score and complexity scores (2, 5, 8 and 11 bases) correlated least significantly with the probe hybridization intensity among the PDFs tested. No obvious differences were observed among the scores obtained for 2, 5, 8 and 11 bases when correlating them with probe hybridization intensity (Table 2). Among all four data sets, PPBE, PMFE, dimer score and hairpin score showed positive correlations with probe hybridization intensity, and are therefore the more reliable indicators of probe sensitivity. The other PDFs displayed inconsistencies in correlation for different data sets. For example, PHFE is positively correlated with probe intensity in data sets 2 and 3, but is negatively correlated with probe intensity in data sets 1 and 4. More complex models might be developed for blast score and complexity scores (2, 5, 8 and 11 bases), but that is beyond the scope of this article. As shown in Supplementary Table 2, enormous variations were observed among individual data sets for the trend coefficients (e.g. intercept and slope), possibly due to differences in array manufacture, sample and array processing and other factors. The values of PHFE and PMFE are dependent on parameters such as hybridization temperature and concentrations of sodium, most of which were unavailable to us. However, we computed PHFE and PMFE using various potential parameters, and changes in parameters did not cause significant differences in correlation assessments; the average difference of ARSS value are 0.0058 (0.010 for PHFE and 0.001 for PMFE) among different temperature settings. 60°C was used for the computation of PHFE, and 40°C was used for computation of PMFE for all data sets, because they slightly outperformed other temperatures.

Multivariate linear modeling

For each data set, a multivariate linear model with PPBE (W. PPBE model) was built based on all PDFs for predicting probe hybridization intensity and comparing the significance of the individual PDFs. This multivariate model showed significant improvement over all individual models based on each individual PDF (Figure 3, Supplementary Figure 3). The W. PPBE model parameters are shown in Supplementary Table 5. Increasing the number of free parameters obviously improves the fit. On the other hand, overfitting is very likely to happen and reduces or destroys the ability of the model to generalize beyond the data it is built upon. The AIC is an operational way of trading off the complexity of an estimated model against how well the model fits the data (32). It not only rewards improvement of fit, but also includes a penalty that is an increasing function of the number of estimated parameters and thereby discourages overfitting. In this study, stepwise selection with AIC was used to search for the optimal model which only contains covariates (individual PDFs) related to the outcome (probe hybridization intensity). Stepwise model selection analysis showed that all PDFs contributed to the prediction of probe hybridization intensity in all data sets with only one exception in which the complexity score (2 bases) was not significant in data set 1 (Supplementary Figure 4). The most significant factor is PPBE, followed by PMFE in all data sets. The order of significance of other PDFs varied among different data sets.

Generality of linear models

Two multivariate models, the W. PPBE model (includes all factors) and the W/O PPBE model (including all factors except PPBE), were developed using a training data set, and tested on independent data sets to determine if the models can be reliably used as a probe design tool. Applying within-dataset validation, Figure 4 illustrates that the models developed from the training set can predict the performance of oligos in the test set almost as accurately as it can predict performance in the training set. The W. PPBE model outperformed the W/O PPBE in all cases, suggesting that PPBE is a reliable factor although it is generated by an empirical approach.

Figure 4.

Comparisons of ARSS for within-dataset validations using the multivariate models W/O PPBE or W. PPBE.

Comparisons of ARSS for within-dataset validations using the multivariate models W/O PPBE or W. PPBE. Cross-dataset validations (Supplementary Table 6) resulted in extremely high ARSS values in the test data sets when the W/O PPBE and W. PPBE models were applied, even when the array manufacture technique and sample species were identical between test and training set. The complex multivariate models developed from one data set can therefore not be directly and simply applied on other data sets. The adverse performance was not caused by PPBE, as there were no obvious differences between W/O PPBE and W. PPBE models. The substantial variations in correlation intercepts and slopes for each individual PDF, as observed in Supplementary Table 2, severely hinder the cross-dataset probe intensity predictions using multivariate linear models.

Probe specificity

Probe specificity is a measurement of the capability of a probe to discriminate between its specific target sequences in a complex set of non-specific sequences. In a two-channel hybridization experiment, if one channel includes the target sequence and the other does not, then the probe with specificity for the target can be expected to yield a high ratio of hybridization signal intensity between the two channels, which is a measure of probe specificity in the mixture. We estimated the oligonucleotide specificity using data set 3, where the targets in one channel included a 3-fold overrepresentation of approximately half of the Salmonella genome and 3-fold underrepresentation for the other half of the genome. Therefore, there are 3-fold differences in the target concentration between the two channels for all probes, and the expected hybridization ratio is 3 for specific hybridization. This was achieved by XbaI digestion of stationary phase Salmonella enterica sv Typhimurium LT2 genomic DNA, separation of the seven resulting fragments using pulsed field gel electrophoresis, capturing those fragments and pooling the six smaller fragments, while keeping the big fragment separate. Genomic DNA preparations from stationary phase LT2 were then supplemented either with the big fragment or with the pooled six smaller fragments, creating overrepresentations of the different halves of the genome. Probes with stronger hybridization intensities displayed better specificity (Figure 5A). When each individual PDF and the predicted probe hybridization intensities were compared with the observed ratios, significant correlation was detected between the ratios and all the factors (Supplementary Figure 6), most significantly for PHFE, PMFE, PPBE and Complexity Score (8 bases). The Pearson correlation coefficients are listed in Supplementary Table 7. Note that PHFE is significantly and positively correlated with probe specificity. Probes with low PHFE values displayed both low specificity and relatively low sensitivity (as shown in Supplementary Figure 1–2).

Figure 5.

Correlation of probe hybridization intensity with probe specificity and reproducibility. (A) Correlation of probe hybridization intensity with probe specificity (observed log2 base transformed ratio) for data set 3. Gray line indicates no change; (B). Correlation of oligonucleotide probe hybridization intensity with probe reproducibility for data set 4, represented as coefficient of variation (CV). As shown in Supplementary Figure 5, the relationships between log2 based ratios and some PDFs seem to be non-linear. For the sake of simplicity, only linear equations were considered in the current study.

Probe reproducibility

Data set 4, which includes 205 replicated hybridizations, was used to estimate probe reproducibility using CV. High probe reproducibility (corresponding to low CV values) is positively correlated with the observed probe hybridization intensities (Figure 5B). When examined individually, each PDF shows a significant but distinct level of association with CV (Supplementary Figure 6). PPBE and PHFE are the most significant factors. Correlation coefficients are listed in Supplementary Table 7.

Software

Programs for computing of PHFE, PMFE, probe dimer score and hairpin score, blast score and complexity score were written in Python. All programs, including parameters for computation, are freely available upon request.

DISCUSSION

Microarray probe hybridization signals are determined by the equilibrium of probe–target complex formation and probe–probe hybridization capability, and are also influenced by non-specific binding from the complex target. The PDFs we studied here covered these three aspects. While Affymetrix Chips are designed for one-sample-for-one-array, it is very common to apply multiple samples onto the same array on customized platforms, including in-house spotted arrays and Nimblegen arrays. The natural log transformed intensity values from multiple arrays were averaged for each probe to minimize variation caused by sample processing and hybridization. We used genomic DNA samples because these hybridizations allow a comparison of probe performance under similar target concentrations. Linear models were selected to model the relationships between individual PDFs and probe performance based on our observation that most scatter plots generated from multiple data sets consistently showed a linear relationship. The actual relationships may be far more complex. Nevertheless, from a practical point of view, linear models are easy to handle and generate more accurate predictions than more complex models based on model diagnosis with ARSS (34). The finding of these correlations is a useful first step in trying to understand the physical phenomena, which are clearly not subsumed in all the parameters currently in use. In future research, we plan to identify more advanced models (for example non-linear association models) that may reduce the ARSS we have achieved in the current study. PMFE, dimer score and hairpin score are factors that estimate probe–probe hybridization capability. Of all the traditional PDFs (all factors except PPBE), PMFE correlated most significantly with probe hybridization intensity in all four data sets, followed by dimer score and hairpin score in most data sets. Although these three PDFs contain redundant information for estimation of the probe–probe hybridization capabilities, they cannot be simply replaced by each other as shown in the stepAIC analysis, which optimizes the complexity of the model versus the fit (32). All three PDFs therefore deliver unique information that needs to be considered for probe design. Probe hybridization free energy (PHFE) is a long-established parameter for measuring probe–target hybridization capability in solution. In our study, PHFE was not as reliable in predicting probe hybridization intensity as other factors (PMFE, dimer score and hairpin), which may be largely due to the linkage of probes to a solid phase in microarray hybridizations. To compensate for the attachment of one end of the probe to the matrix, we introduced PPBE, which modifies the PHFE calculation by adding a positional weight parameter and iteratively fitting positional weights and dinucleotide stacking energies. PPBE showed much better capabilities of predicting probe hybridization than all other PDFs, and was a tremendous improvement over PHFE. The drawback of PPBE is that it is platform-dependent, and preliminary aCGH data are required for developing the PPBE model prior to application. The quality of the training data is critical for the construction of an accurate PPBE model. There are many factors that may result in bad quality arrays, e.g., bad sample quality. In order to solve these problems, we suggest that multiple CGH be performed using genomes without copy number variation to minimize the noise caused by sample processing. Both PMFE and PHFE are sodium dependent. Generally, changes in free energy are linearly correlated to log-transformed sodium concentration (17), which has been confirmed by us on the Mfold web server (18) for PMFE and PHFE. That means that all oligonucleotide PMFE/PHFE values will change proportionally if the sodium concentration changes. Subsequently, these changes will be canceled out through adjustment of related coefficients in linear models. Therefore, changes in sodium concentration had no influence on the significance of linear modeling. The PPBE model is empirical by nature, similar to the positional-dependent nearest neighbor model, which was designed for the Affymetrix array platform (34). Parameters of this model similarly need to be empirically estimated based on hybridization data, and significantly vary among different Affymetrix array platforms. At this stage, we do not understand the physical properties governing the parameters, but present a practical approach to optimize oligo design. The position dependence of the weighting factors is a conspicuous feature in such models. In previous work, the sensitivity profiles of base C and base A change in a parabola-like fashion in a 25-base probe sequence, while the same profiles for G and T change monotonically (35–38). The overall position weighting factors change like the shape of a parabola, with peak and width varying across different GeneChip platforms (14,34,39). Our data reveal weight distribution patterns different from this previous work. Our data were obtained on two types of platforms: Nimblegen in situ synthesized oligonucleotide arrays and a spotted oligonucleotide array. For three Nimblegen platforms, the weights change linearly for the first 35∼45 bases or so from the 3′ end and get weaker at the free end (Figure 1B, Supplementary Figure 2B and 2E). In contrast, a parabola-like curve is observed on the other platform (Supplementary Figure 2H). Although it is not the object of this article to explore a physical explanation for these differences, we point out some facts that may be important in further studies: We are using platforms of 50-mer probes, while the quoted previous work used 25-mer Affymetrix GeneChip platforms. Lengthening of the sequence on the platform inevitably reduces the importance of each single base or position, and weakens the position dependence. Unlike Affymetrix platforms and Nimblegen platforms, the probes of the spotted array in this study are linked to the array at the 5′ end, and there are no terminal oligonucleotide linkers between probes and the array surface. The impact of this difference is unknown, but it may reduce the freedom of a probe and even its effective length, leading to a pattern of position dependence similar to platforms of lower probe length, e.g. Affymetrix platforms. For the fitting of the PPBE model, it is not critical whether weights or energies were fitted first. Either way, the final converged models reach similar ARSS values (average difference is <0.005). The final weights and pseudo-stacking energies are similar as well. We began to fit the models with the conventional dinucleotide stacking energies simply because the modes reached convergence faster. The dinucleotide stacking energies may express a relevant part of the physical properties underlying the model. However, further evidence is required to confirm this speculation. Blast and complexity scores reflect occurrences of sequence segments similar to the probe, and are used for evaluating probe specificity. It would be simpler and easier to use cutoff thresholds for these PDFs to filter out bad quality probes. In this study, we applied four different patterns for the complexity score calculation, which are based on 2, 5, 8 and 11 base patterns. The complexity score (8 bases) showed better correlation with probe specificity than other complexity score patterns and blast score. Langmuir isotherm oriented models were not included in our studies. Although Langmuir models were initially developed for adsorption of gases on glass surfaces (40), its variations have been widely applied in research for hybridization of oligonucleotides on DNA microarrays (13–16,41). In these models, the hybridization signal intensities were in essence divided into two parts: the hybridization of the probe with its perfect-matching target and the background noise. Although such models fit hybridization intensity values well for spiked-in genes and corresponding targets with controlled concentrations, they are of less help in screening probes for microarray design because they are based on the equilibrium constant or the change of standard Gibbs free energy ΔG°, which is a PDF of less sensitivity and specificity in comparison to PMFE and PPBE in our study. In contrast, platform-dependent empirical models based on pseudo free energies and position weights can make predictions very close to the observed hybridization intensities (34,39). This fact encouraged us to explore pure empirical models in microarray design. In summary, we used aCGH as a model system to study the correlation between individual PDFs and probe performance during microarray hybridization. These individual correlations can be used as guidance for designing microarray probes for other complex experimental setups such as gene expression analysis. In gene expression microarray hybridization, non-specific binding, probe–targets complex formation and probe–probe binding capability will all be influenced by the varying concentrations of the targets. Systematic study of probe performance in such systems is beyond the scope of this study. Nevertheless, if preliminary aCGH data is available, a complex multivariate linear model including the empirical factor PPBE can be developed and used for refining arrays. The model can predict a probe hybridization intensity value which will be an indicator of probe quality. Higher predicted intensity values will be equivalent to higher sensitivity, improved specificity and reproducibility. In practice, this strategy can be used for improving an existing array platform by replacing bad probes or by expanding the array by selecting probes predicted to perform well. If aCGH data are unavailable for microarray platform design, we suggest using each individual PDF to filter or rank probes instead of using a complex model, because the coefficient parameters (intercept and slopes) vary significantly among different data sets/platforms. PMFE, hairpin score and probe dimer score can be used to rank probe qualities. PHFE, blast score and complexity score can be used to filter probes with low specificity. We have provided all correlation parameters generated from four data sets to be used as a guideline for filtering or ranking probes. All the programs for calculating individual PDFs are also available from the authors.

Supplementary Data

Supplementary Data are available at NAR Online.

FUNDING

The Prostate Cancer Foundation, the Mary Kay Ash Foundation, National Institute of Health grants (R01CA68822, U01CA114810, R01AI052237, R01AI073971, R01AI075093) to M.M. DOD Breast Cancer Research Program, (BC073899). Funding for open access charge: R01CA68822 and BC073899. Conflict of interest statement. None declared.

36 in total

1. Probe selection for high-density oligonucleotide arrays.

Authors: Rui Mei; Earl Hubbell; Stefan Bekiranov; Mike Mittmann; Fred C Christians; Mei-Mei Shen; Gang Lu; Joy Fang; Wei-Min Liu; Tom Ryder; Paul Kaplan; David Kulp; Teresa A Webster
Journal: Proc Natl Acad Sci U S A Date: 2003-09-19 Impact factor: 11.205

2. AutoDimer: a screening tool for primer-dimer and hairpin structures.

Authors: Peter M Vallone; John M Butler
Journal: Biotechniques Date: 2004-08 Impact factor: 1.993

3. Base pair interactions and hybridization isotherms of matched and mismatched oligonucleotide probes on microarrays.

Authors: Hans Binder; Stephan Preibisch; Toralf Kirsten
Journal: Langmuir Date: 2005-09-27 Impact factor: 3.882

4. Chromatin modifications induced by PML-RARalpha repress critical targets in leukemogenesis as analyzed by ChIP-Chip.

Authors: Claudia Hoemme; Abdul Peerzada; Gerhard Behre; Yipeng Wang; Michael McClelland; Kay Nieselt; Matthias Zschunke; Christine Disselhoff; Shuchi Agrawal; Fabienne Isken; Nicola Tidow; Wolfgang E Berdel; Hubert Serve; Carsten Müller-Tidow
Journal: Blood Date: 2007-11-16 Impact factor: 22.113

5. Survey of differentially methylated promoters in prostate cancer cell lines.

Authors: Yipeng Wang; Qiuju Yu; Ann H Cho; Gaelle Rondeau; John Welsh; Eileen Adamson; Dan Mercola; Michael McClelland
Journal: Neoplasia Date: 2005-08 Impact factor: 5.715

6. YODA: selecting signature oligonucleotides.

Authors: Eric K Nordberg
Journal: Bioinformatics Date: 2004-11-30 Impact factor: 6.937

7. Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays.

Authors: Felix Naef; Marcelo O Magnasco
Journal: Phys Rev E Stat Nonlin Soft Matter Phys Date: 2003-07-16

8. Oligodb--interactive design of oligo DNA for transcription profiling of human genes.

Authors: Ralf Mrowka; Johannes Schuchhardt; Christoph Gille
Journal: Bioinformatics Date: 2002-12 Impact factor: 6.937

9. Relationship between gene expression and observed intensities in DNA microarrays--a modeling study.

Authors: G A Held; G Grinstein; Y Tu
Journal: Nucleic Acids Res Date: 2006-05-24 Impact factor: 16.971

10. Bayesian mixture model analysis for detecting differentially expressed genes.

Authors: Zhenyu Jia; Shizhong Xu
Journal: Int J Plant Genomics Date: 2008

8 in total

1. Optimization of signal-to-noise ratio for efficient microarray probe design.

Authors: Olga V Matveeva; Yury D Nechipurenko; Evgeniy Riabenko; Chikako Ragan; Nafisa N Nazipova; Aleksey Y Ogurtsov; Svetlana A Shabalina
Journal: Bioinformatics Date: 2016-09-01 Impact factor: 6.937

2. Identify potential miRNA-mRNA regulatory networks contributing to high-risk neuroblastoma.

Authors: Feng-Ling Shao; Qing-Qing Liu; Shan Wang
Journal: Invest New Drugs Date: 2021-03-05 Impact factor: 3.850

3. Targeted analysis of nucleotide and copy number variation by exon capture in allotetraploid wheat genome.

Authors: Cyrille Saintenac; Dayou Jiang; Eduard D Akhunov
Journal: Genome Biol Date: 2011-09-14 Impact factor: 13.583

4. A Transcriptome-Targeting EcoChip for Assessing Functional Mycodiversity.

Authors: Derek Peršoh; Alfons R Weig; Gerhard Rambold
Journal: Microarrays (Basel) Date: 2011-10-31

5. Sequence characteristics define trade-offs between on-target and genome-wide off-target hybridization of oligoprobes.

Authors: Olga V Matveeva; Aleksey Y Ogurtsov; Nafisa N Nazipova; Svetlana A Shabalina
Journal: PLoS One Date: 2018-06-21 Impact factor: 3.240

6. Microbial community profiling of ammonia and nitrite oxidizing bacterial enrichments from brackishwater ecosystems for mitigating nitrogen species.

Authors: Viswanathan Baskaran; Prasanna K Patil; M Leo Antony; Satheesha Avunje; Vinay T Nagaraju; Sudeep D Ghate; Suganya Nathamuni; N Dineshkumar; Shankar V Alavandi; Kizhakedath K Vijayan
Journal: Sci Rep Date: 2020-03-23 Impact factor: 4.379

7. Analysis and identification of novel biomarkers involved in neuroblastoma via integrated bioinformatics.

Authors: Bo Chen; Peng Ding; Zhongyan Hua; Xiuni Qin; Zhijie Li
Journal: Invest New Drugs Date: 2020-08-08 Impact factor: 3.850

8. Natural Allelic Variations in Highly Polyploidy Saccharum Complex.

Authors: Jian Song; Xiping Yang; Marcio F R Resende; Leandro G Neves; James Todd; Jisen Zhang; Jack C Comstock; Jianping Wang
Journal: Front Plant Sci Date: 2016-06-08 Impact factor: 5.753

8 in total