Mina Mohammadi-Kambs (1), Kathrin Hölz (2), Mark M Somoza (2), Albrecht Ott (1). (1) Biological Experimental Physics, Saarland University, Campus B2.1, 66123 Saarbrücken, Germany. (2) Institute of Inorganic Chemistry, Faculty of Chemistry, University of Vienna, Althanstraße 14 (UZA II), 1090 Vienna, Austria.
Abstract
DNA microarrays constitute an in vitro example system of a highly crowded molecular recognition environment. Although they are widely used in biology, some of the basic mechanisms of DNA hybridization remain poorly understood. On a microarray, cross-hybridization arises from sequence similarities and may introduce errors during the transmission of information. We experimentally determine an appropriate distance, called the minimum Hamming distance, by which the sequences of a set must differ. By applying an algorithm based on a graph-theoretical method, we find large orthogonal sets of sequences that are sufficiently different not to exhibit any cross-hybridization. To create such a set, we first derive an analytical solution for the number of sequences of a given length that include at least four guanines in a row and eliminate them from the list of candidate sequences. We experimentally confirm the orthogonality of the largest possible set with a size of 23 for a length of 7. We anticipate our work to be a starting point toward the study of signal propagation in highly competitive environments, besides its obvious application in DNA high-throughput experiments.
Molecular recognition
in the crowded environment of DNA microarrays
plays an important role in processing information. Recognition often
requires the discrimination of one specific molecule among many similar,
competing molecules. In 1894, Emil Fischer proposed the lock and key
model to describe the recognition of an enzyme and a substrate.[1] According to this model, the substrate possesses
the perfect size and shape to accommodate the active site of its complement.
However, in crowded environments, binding between noncomplementary
molecules may occur and introduce errors. For DNA,
specific binding of two single strands, that is, the formation of a
stable double helix, occurs only if the bases A and T as well as C
and G pair along the sequence. DNA microarrays are a widely used platform
that, besides many applications in medicine and biology, enables the
study of the fundamentals of DNA hybridization.[2−10] These microarrays consist of single-stranded DNA oligonucleotides
immobilized on a surface (probes). If these probes are exposed to
a bulk mixture of fluorescently labeled target sequences, only complementary
targets are expected to hybridize. However, hybridization of a probe
to a noncomplementary target still occurs, albeit with a lower binding
affinity than the corresponding perfectly matching sequence. Therefore,
similarities among probes can lead to a significant amount of nonspecific
cross-hybridization. On a DNA microarray with complex target mixtures,
imperfect recognition introduces noise and makes results difficult
to interpret.

The kinetics of hybridization in the presence
of competitors and
the importance of cross-hybridization for quantitative interpretation
of microarray data have been intensely studied,[11−13] especially
for the purpose of single nucleotide polymorphism detection and the
accurate assessment of gene expression levels.[14−17] One strategy to avoid cross-hybridization
is to construct sets of probes with minimized pairwise competition
so that they do not cross-hybridize. Such probes are often referred
to as orthogonal. Previous theoretical research[18−24] developed different strategies to find sets of orthogonal sequences.
The most intuitive approach to deciding which sequences cross-hybridize
is based on the free energy difference between the perfectly matched
and mismatched hybridization.[25] However,
estimating free energies led to poor predictions of hybridization
intensities on microarrays.[26] In this work,
we apply a well-known local search algorithm and implement graph-theoretical
methods to find such sets. Following the concept of Hamming distance
from coding theory, we consider that two sequences do not cross-hybridize
if they differ by at least a certain number of bases. This threshold
is called the minimum Hamming distance d.[27] We determine a suitable d experimentally.
One of the fundamental problems in coding theory is finding the maximum
size of a code, where a code is a set of codewords of length L and minimum Hamming distance d.[28] In analogy, here, we experimentally and theoretically
find maximal sets of independent (i.e., orthogonal) sequences (MIS)
with a certain minimum Hamming distance that can coexist on a microarray
without exhibiting cross-hybridization.
Results and Discussion
Theoretical Results
For a strand of L bases drawn from the four DNA bases (A, C, G, T),
there are 4^L distinct sequences. However,
some of these sequences exhibit undesired structures that prevent
them from binding to their complement. One example is sequences
with runs of at least four guanines, which we call 4G sequences. These
sequences are capable of forming complex structures such as G-quadruplexes,
which restrict hybridization. Moreover, they have abnormal affinities
and tend to show increased cross-hybridization and reduced target-specific
hybridization, which makes the measurement of gene expression unreliable.[29−31] Therefore, in this work, we eliminate 4G sequences and their complement
sequences 4C. The number of sequences of a given length L that exhibit at least one run of 4G is given by eq 1, where the sum
represents the number of sequences
that are not 4G. The quadrinomial coefficient equals the number of
arrangements of L − k guanines
within a sequence of length L (for the derivation
of eq 1, see section S1).

To verify eq 1, we numerically calculate N4G(L) by generating all 4^L sequences for a given L ≤
7 and counting the ones that contain a 4G run. Figure 1 shows N4G(L) relative to the total number
of sequences, 4^L, for different lengths.
As depicted in Figure 1, for L ≤ 7, this fraction stays below 1.5%,
whereas for longer lengths, it rises, so that for L ≥ 200, around 50% of all possible sequences contain 4G structures.
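The counting argument can be checked numerically. The displayed equation is not reproduced in this text, so the sketch below uses a form that is consistent with the description: N4G(L) = 4^L − Σ_k 3^k · q(k + 1, L − k), where k is assumed to count the non-guanine bases and q(m, n) is the number of ways to distribute n guanines over m gaps with at most three per gap (the coefficient of x^n in (1 + x + x^2 + x^3)^m). The function names are illustrative and do not come from the paper.

```python
from itertools import product
from math import comb


def count_4g_brute_force(length: int) -> int:
    """Count sequences of the given length that contain the run GGGG."""
    return sum("GGGG" in "".join(seq) for seq in product("ACGT", repeat=length))


def bounded_compositions(total: int, slots: int, cap: int = 3) -> int:
    """Ways to split `total` guanines over `slots` gaps, at most `cap` per gap.

    Equals the coefficient of x**total in (1 + x + ... + x**cap)**slots,
    evaluated here by inclusion-exclusion.
    """
    result = 0
    for j in range(slots + 1):
        remaining = total - j * (cap + 1)
        if remaining < 0:
            break
        result += (-1) ** j * comb(slots, j) * comb(remaining + slots - 1, slots - 1)
    return result


def count_4g_closed_form(length: int) -> int:
    """4**L minus the number of sequences without a GGGG run.

    Assumption: k non-G bases (3**k choices) create k + 1 gaps that hold
    the remaining L - k guanines, with no gap holding four or more.
    """
    without_run = sum(
        3 ** k * bounded_compositions(length - k, k + 1)
        for k in range(length + 1)
    )
    return 4 ** length - without_run


if __name__ == "__main__":
    for L in range(4, 8):
        n = count_4g_closed_form(L)
        print(L, n, count_4g_brute_force(L), f"{n / 4 ** L:.4%}")
```

For L = 7 both counts give 208 sequences, about 1.27% of the 4^7 candidates, which is consistent with the statement that the fraction stays below 1.5% for L ≤ 7.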
Figure 1
Fraction of all 4^L possible sequences
that contain 4G structures for different oligonucleotide lengths.
The inset shows that this fraction stays below 1.5% for L ≤ 7. For longer lengths, it rises. Dotted lines are a guide
for the eye.
The second category of
sequences that will not contribute to recognition
are self-complementary sequences. We neglect them as we work with
short sequences where self-complementarity only plays a minor role.[32−34] For longer lengths, however, this must be considered.

Coding
theory is a branch of mathematics that studies codes and
their properties for different applications. A code is a set of codewords.
The length of a codeword L is the number of letters
that create the codeword, where the letters are often taken from an
alphabet. In our case, DNA sequences are taken as the codewords, where L is the number of bases (A,C,G,T) that make up the sequence.
The number of positions at which two codewords of the same length differ
is the Hamming distance.[27] In the case of DNA
sequences, we define this distance as the number of bases by which
they differ. We assume that for every sequence of a given length there
is a minimum Hamming distance d such that
there is no cross-hybridization as long as the Hamming distance k is at least d. If two sequences
differ by fewer than d bases, they may cross-hybridize.
For a given sequence, N(d) is the
number of sequences from which one can choose a competitor with k ≥ d. N(d) decreases with increasing d (Figure 2). Equation 2 gives, for a given length L, the number of sequences P(k) at Hamming distance k. Figure 2 depicts N(d), obtained by summing P(k) over
all k ≥ d, for L = 7 and a given minimum Hamming distance.
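As a small illustration of these definitions, the sketch below computes the Hamming distance, P(k), and N(d). Equation 2 itself is not reproduced in this text; the code assumes the standard count P(k) = C(L, k) · 3^k (choose the k differing positions, then one of the three other bases at each). Function names are illustrative.

```python
from math import comb


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def sequences_at_distance(length: int, k: int) -> int:
    """P(k): sequences at Hamming distance exactly k from a fixed sequence.

    Assumed form: choose k positions, then one of 3 other bases at each.
    """
    return comb(length, k) * 3 ** k


def sequences_at_least(length: int, d: int) -> int:
    """N(d): sequences at Hamming distance k >= d from a fixed sequence."""
    return sum(sequences_at_distance(length, k) for k in range(d, length + 1))


if __name__ == "__main__":
    L = 7
    for d in range(L + 1):
        print(d, sequences_at_distance(L, d), sequences_at_least(L, d))
```

For L = 7 and d = 5 this gives N(5) = 5103 + 5103 + 2187 = 12393 of the 16384 possible sequences.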
Figure 2
N(d) is the number of sequences
with k ≥ d for a given sequence
with L = 7. This number decreases for larger d. The inset depicts the number of sequences P(k) for each Hamming
distance k. If d = 5, the shaded
region represents the sum over all sequences with k ≥ 5 which do not cross-hybridize with the given sequence.
Dotted lines are a guide for the eye.
The maximum independent set problem is
NP-hard.
No efficient exact algorithm is known for the general case; however, approximation methods exist.[35,36] Finding a maximal independent set (MIS) among the N(d) sequences is a problem from graph theory.[36,37] A graph consists of vertices, represented by red circles in Figure 3a. Two vertices are
called adjacent if they are connected by an edge (blue line). We represent
the probes by vertices. If two sequences cross-hybridize,
we connect them by an edge (Figure 3b). An independent set is a subset with no
adjacent vertices. If adding any further sequence to the set would destroy its
independence, the set is called a maximal independent set (MIS). The largest of all
maximal independent sets is the maximum independent set. Here, it corresponds
to the largest number of independent oligonucleotides that can be
found. For our approach, we create an adjacency matrix A for a given L and d, where the numbers of rows and columns
equal the number of sequences; thus, it is a 4^L × 4^L square matrix
(Figure 3c). If the
Hamming distance between sequences i and j is less than d, they cross-hybridize,
that is, they are connected by an edge. In this case, A_ij = 1; otherwise, A_ij = 0. Sequences are not self-adjacent, that
is, A_ii = 0 ∀ i.
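The construction of the adjacency matrix can be sketched as follows. This is a minimal pure-Python illustration for very short sequences, not the implementation used in this work; the function names and the nested-list representation are assumptions made for the example.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def adjacency_matrix(length: int, d: int):
    """Return all 4**length sequences and their 0/1 adjacency matrix.

    Entry [i][j] is 1 when sequences i and j differ by fewer than d bases
    (they are assumed to cross-hybridize); the diagonal stays 0.
    """
    seqs = ["".join(p) for p in product("ACGT", repeat=length)]
    n = len(seqs)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if hamming(seqs[i], seqs[j]) < d:
                matrix[i][j] = matrix[j][i] = 1
    return seqs, matrix


if __name__ == "__main__":
    seqs, A = adjacency_matrix(length=3, d=2)
    index = {s: i for i, s in enumerate(seqs)}
    trio = ["ATT", "TTC", "TCT"]
    # The three sequences differ pairwise in two positions, so with d = 2
    # no pair of them is connected by an edge.
    print([[A[index[a]][index[b]] for b in trio] for a in trio])
```

The matrix has 4^L × 4^L entries, so this dense representation is only practical for short sequences, in line with the memory limitation discussed in the text.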
Figure 3
Concept of the graph-theoretical approach. (a) Vertices of a
graph (red circles) along with the edges (blue lines) between adjacent
vertices. The vertices inside the dashed area form the maximum
independent set. (b) Sequences that are connected by blue lines
cross-hybridize. ATT, TTC, and TCT form the maximum
independent set. (c) Adjacency matrix for the corresponding set of
sequences.
We apply a constructive local
search algorithm[38,39] that iteratively adds orthogonal
sequences to an existing set until
the available sequences are depleted. To identify the orthogonal sequences,
the algorithm employs the adjacency matrix constructed beforehand.
The algorithm is not exhaustive, as it does not try all combinations of
sequences. Therefore, it does not necessarily find the maximum independent
set but proposes many maximal independent sets instead. We consider
the largest set among them as an approximate solution to the exact
size of the maximum independent set. All obtained set sizes are within
the known Singleton and Gilbert–Varshamov[28,40] bounds and are summarized in Tables S2 and S3 along with a comparison to literature values. The size of the adjacency
matrix increases exponentially with the sequence length, which requires
a large amount of memory. Therefore, we are limited to short sequences with L ≤ 7.

Figure 4 illustrates
the possible sizes of different independent sets for L = 7 and d = 5 before discarding 4G and 4C sequences
and afterward. The MIS size M(L, d) in both cases is 23. Removing these sequences for L ≤ 7 does not change the size of the MIS in most cases
(refer to Table S2). However, for longer
lengths, the fraction of 4G sequences rises, and we expect that discarding such
sequences reduces the size of an MIS (Figure 1). This algorithm creates independent sets
based on the pool of available sequences. Removing all sequences containing
4C and 4G changes this pool. Therefore, we obtain different independent
sets (blue columns) compared with the cases where we did not discard
these sequences (red columns). A significant trend toward smaller
or bigger set sizes by removing 4C and 4G sequences cannot be identified.
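A constructive search of this kind can be sketched as a randomized greedy procedure with restarts. This is a generic illustration and not necessarily the exact algorithm of refs 38 and 39: it compares sequences directly instead of reading a precomputed adjacency matrix, and the function names, the number of restarts, and the seed are assumptions made for the example.

```python
import random
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_pool(length: int, exclude_runs=("GGGG", "CCCC")):
    """All sequences of the given length without the excluded runs."""
    pool = ("".join(p) for p in product("ACGT", repeat=length))
    return [s for s in pool if not any(run in s for run in exclude_runs)]


def greedy_maximal_set(pool, d: int, rng: random.Random):
    """Visit the pool in random order and keep every sequence that is at
    least d away from all sequences kept so far. The result is maximal:
    each rejected sequence conflicts with some kept sequence."""
    order = list(pool)
    rng.shuffle(order)
    chosen = []
    for seq in order:
        if all(hamming(seq, other) >= d for other in chosen):
            chosen.append(seq)
    return chosen


def best_of_restarts(length: int, d: int, restarts: int = 20, seed: int = 0):
    """Run several greedy passes and return the largest set found."""
    rng = random.Random(seed)
    pool = candidate_pool(length)
    best = []
    for _ in range(restarts):
        candidate = greedy_maximal_set(pool, d, rng)
        if len(candidate) > len(best):
            best = candidate
    return best


if __name__ == "__main__":
    found = best_of_restarts(length=4, d=3)
    print(len(found), sorted(found))
```

Each pass returns one maximal independent set, and different random orders generally give different sizes, which is the kind of size distribution shown in Figure 4. The largest set found is only a lower bound on the maximum independent set.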
Figure 4
Size of
all independent sets for L = 7 with d = 5 before (red columns) and after removing 4G and 4C
sequences (blue columns). The height of each column gives the number
of possible sets for a given M(L, d).
Experimental Results
A suitable minimum Hamming distance d must be determined experimentally. Because the longest
sequences studied with our algorithm are 7-mers, we design a microarray
consisting of oligonucleotides of length 7 (plus four additional terminal
bases, see Materials and Methods). We immobilize
an arbitrary probe sequence that is complementary to a perfectly matching target (PM),
together with a number of related mismatched sequences. To study how
the hybridization probability depends on the positions of defects, we place
the mismatches at the ends, in the middle of the sequence, or distribute
them uniformly. Hybridizing the PM target on the microarray yields
the results shown in Figure 5. Each feature block, as depicted in Figure 5a–d, corresponds to a set of sequences
with one to four mismatches (MM1–MM4), respectively. They are
all surrounded by a frame of PMs. Each sequence appears 8 times within
a feature block. To improve the statistics, the hybridization intensities
of all replicates of each sequence are averaged, and their standard deviations σ
are calculated. Then, all intensities are normalized relative to the
average PM intensity on the microarray. The PM and mismatched sequences
are all subject to the same constant synthesis error rate (see Materials and Methods), which leads to an overall
loss of hybridization intensity. For the results presented in the
following, only the relative intensity matters, which is not affected
by this loss. Fluorescence intensity variations are due to inhomogeneities
of the microarray surface, fluorescent stains in the feature blocks,
or illumination gradients during synthesis.[9] For MM ≥ 4, we detect no hybridization signal other than that of the PM
(not shown).
Figure 5
Fluorescent intensity from a hybridized PM target on a
microarray.
Each feature block (a–d) corresponds to a set of sequences
with one to four mismatches, respectively. They are surrounded by
a frame of PMs. Each sequence appears eight times within a feature
block.
Figure 6 presents
the normalized fluorescent intensities of hybridization for a sequence
with one mismatch as a function of defect positions. The intensity
for sequences with single mismatches in the middle is smaller because
the defects in the middle of the duplex increase the base pair opening
probability and destabilize the duplex. This result agrees well with
previously reported work.[10,41]
Figure 6
Normalized hybridization
intensity for the sequences with one mismatch
as a function of their mismatch position. The intensity for sequences
including a single mismatch in the middle is smaller than for a MM
located at the end.
We assume all eight fluorescence
intensities of one probe measured
at different positions on the microarray to be normally distributed
and described by a standard deviation σ. To discriminate the
PM binding intensity from all other nonspecific binding, the normal
distributions of their hybridization intensities must be well separated.
Figure 7 shows the distributions of the fluorescence intensities of the PM and of the
sequences that exhibit the highest cross-hybridization intensities IMM,max for MM1–MM3. The normal distributions
are based on a statistical analysis of the microarrays shown in Figure 5. The peak centers
in Figure 7 correspond
to the average values of the fluorescence intensities and their widths
to the standard deviations (shown in Table 1). In DNA microarrays, the binding affinities
can vary widely, depending on the precise sequence and its concentration,[41] that is, fluorescence intensities of perfectly
matched sequences span a large range. To illustrate this, we determine
the hybridization free energy of the sample sequence 3′-CTATATATATC-5′
binding to its PM using the NUPACK software[42] and the corresponding expected fluorescence intensity using the
Langmuir isotherm.[9] As this sequence does
not contain any G or C bases within the seven core bases, its fluorescence
intensity is amongst the lowest of all possible sequences. In fact,
we find that it has just 16.5% of the fluorescence intensity, obtained
by the same procedure, for the PM sequence 3′-CTACCGTACTC-5′
used on the microarray shown in Figure 5. Accordingly, it should be expected that some perfectly
matched but weakly binding sequences will have lower hybridization
intensities than the 27% signal of IMM,max for three mismatches. This clearly shows that a minimum Hamming
distance of d = 3 cannot be used for a reliable discrimination
between PM and MM hybridization. Therefore, we investigate sets with d ≥ 4 in subsequent experiments. Table 1 shows the sequences and their
intensities as well as the corresponding standard deviations for each
number of mismatches.
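The argument can be made concrete with the values from Table 1. The sketch below uses one simple separation measure, the difference of the means divided by the combined standard deviation; this particular measure is an assumption chosen for illustration and is not taken from the paper. The weak PM level of 16.5% is the estimate quoted in the text for 3′-CTATATATATC-5′.

```python
from math import sqrt

# Normalized intensities (mean, standard deviation) from Table 1,
# keyed by the number of mismatches; 0 is the perfect match (PM).
TABLE_1 = {
    0: (1.00, 0.066),
    1: (0.63, 0.1),
    2: (0.37, 0.074),
    3: (0.27, 0.073),
}

# Estimated intensity of a weakly binding PM relative to the PM above.
WEAK_PM_LEVEL = 0.165


def separation(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Difference of the means in units of the combined standard deviation.

    Illustrative measure only; assumes independent normal distributions.
    """
    return (mean_a - mean_b) / sqrt(sd_a ** 2 + sd_b ** 2)


if __name__ == "__main__":
    pm_mean, pm_sd = TABLE_1[0]
    for mismatches in (1, 2, 3):
        mm_mean, mm_sd = TABLE_1[mismatches]
        print(mismatches, round(separation(pm_mean, pm_sd, mm_mean, mm_sd), 2))
    # A weak PM falls below the strongest three-mismatch signal.
    print(WEAK_PM_LEVEL < TABLE_1[3][0])
```

Against the strongly binding PM of Table 1, the three-mismatch signal is well separated. The problem arises only when a weakly binding PM is compared with it, since 16.5% lies below the 27% cross-hybridization level.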
Figure 7
Normal distributions of the PM and MM1–MM3 hybridization
intensities, with average intensity
(peak centers) and standard deviation σ as given in Table 1. Even the average
cross-hybridization intensity of IMM3,max = 27% is too high for accurate discrimination between PM binding and
unwanted cross-hybridization (compare main text).
Table 1
Sequences with Different Numbers of Mismatches That Yield the Highest Hybridization Intensities among All Probes within Each Feature Block (Figure 5), along with Their σ

number of mismatches    sequence             Imax ± σ
0                       3′-CTACCGTACTC-5′    1 ± 0.066
1                       3′-CTTCCGTACTC-5′    0.63 ± 0.1
2                       3′-CTACCGTCTTC-5′    0.37 ± 0.074
3                       3′-CTACCGACTTC-5′    0.27 ± 0.073
To test sets with d ≥ 4, we first design
a microarray consisting of 23 sequences (see Table S1) as predicted by our algorithm, corresponding to d = 5 (compare Figure 4). To verify its independence, we record the hybridization
intensities of three arbitrarily chosen PM targets of this set simultaneously. Figure 8a shows the measured
normalized hybridization intensities Iseq in a bar plot after background subtraction. The PM targets that are present in solution clearly hybridize only to their
corresponding complementary probes (green bars). Using the
highest hybridization intensity as a reference, the other two hybridized
PM sequences reach 24 and 31% of that level. On the other hand, the
measured hybridization intensities of all other probes (blue bars)
scatter with σ = 0.3% around their average value of zero, which
can be attributed to background fluorescence noise. Negative values
correspond to intensities below the average background level.
The intensities of the probes whose PM targets are not present in
solution stay well below 2%, even within a wide confidence interval (5σ).
To cross-check that sets with d ≤ 4 are not independent, we synthesize another microarray
comprising 83 sequences with d = 4. Hybridization
of one PM target leads to cross-hybridization with 11 other probes, whose intensities rise
above 2%, as shown by the red bars in Figure 8b. This underlines that d < 5 is not sufficient to achieve independence.
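The classification used in Figure 8 can be written down as a short sketch. The 2% threshold and σ = 0.3% are taken from the text; the probe names and intensity values in the example are hypothetical and serve only to show the logic.

```python
THRESHOLD = 0.02      # 2% of the reference intensity (from the text)
NOISE_SIGMA = 0.003   # sigma = 0.3% background scatter (from the text)


def classify_probes(intensities, targets_present, threshold=THRESHOLD):
    """Label each probe as 'PM', 'cross-hybridized', or 'background'.

    intensities: probe name -> normalized, background-subtracted intensity
    targets_present: names of probes whose PM target is in solution
    """
    labels = {}
    for probe, value in intensities.items():
        if probe in targets_present:
            labels[probe] = "PM"
        elif value > threshold:
            labels[probe] = "cross-hybridized"
        else:
            labels[probe] = "background"
    return labels


if __name__ == "__main__":
    # Hypothetical readout, for illustration only.
    readout = {"p1": 1.00, "p2": 0.31, "p3": 0.24, "p4": 0.004, "p5": -0.002}
    print(classify_probes(readout, targets_present={"p1", "p2", "p3"}))
    # A 5-sigma excursion of the background noise stays below the threshold.
    print(5 * NOISE_SIGMA < THRESHOLD)
```

With σ = 0.3%, a 5σ excursion corresponds to 1.5%, so background noise alone is very unlikely to push a probe above the 2% threshold.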
Figure 8
Two microarrays consisting
of sequences with two different minimum
Hamming distances: (a) independent set with d = 5
and (b) set with d = 4. In both cases, the green
bars represent the probes whose PM targets are present in solution.
Blue bars correspond to the hybridization intensities of sequences
with Iseq ≤ 2%. Red bars represent
the cross-hybridized sequences with Iseq > 2%.
Conclusion
In this work, we experimentally determined a minimum Hamming distance d between DNA oligonucleotides. Sequences that differ pairwise by at least
d can make up an orthogonal set, which means they
do not cross-hybridize. By applying a local search algorithm, we found
orthogonal sets for different L and d. For a length of 7, we determined an MIS of size 23 and
experimentally confirmed its orthogonality with an appropriate minimum
distance of 5. The small set size of 23 compared with the 4^7 possible sequences arises from the minimum Hamming distance of 5.
Optically directed synthesis introduces errors into
sequences.[43−46] Single-nucleotide polymorphism detection in bulk has been achieved,
albeit with higher synthesis fidelity and optimized experimental conditions.[47] Moreover, d can be reduced
by increasing the temperature to reduce nonspecific bindings, which
can improve the discrimination among the sequences of a set.[47] For longer sequence lengths, higher temperatures
are particularly important to increase the number of complementary
bases required for binding.[48] At a given
concentration, the discrimination increases near the melting temperature.

In the course of our experiments, we found a minimum Hamming distance
of five for a sequence length of 11 (7 core bases and four additional terminal
ones), in good agreement with the reported discrimination level of d ≈ L/2.[18,19] Our set size, on the other hand, does not gain from the four additional
bases. When our algorithm is extended to longer sequences, these extra
bases become redundant, and we expect that d ≈ L/2 will remain applicable. With the same d, larger lengths lead to larger set sizes than we have determined
here.

We also derived an analytical expression to calculate
the number
of 4G sequences. As we have shown, eliminating these sequences for
short lengths does not change the size of MIS in most cases. However,
we anticipate an impact for higher sequence lengths, as the fraction
of sequences containing 4G structures increases. Although we have
shown how to avoid cross-hybridization on our synthesized microarrays,
the approach cannot easily be transferred to real-world microarray applications
such as those developed by Affymetrix. Following the protocol for expression
studies, Affymetrix targets are very long compared with their surface
bound probes. Such sequence lengths introduce a large variety of conformations.
Therefore, in expression studies one should consider additional effects
such as the brush effect[49] and surface
density of probes.[7,50]
Materials and Methods
DNA Microarray Hybridization Experiment
The light-directed
in situ synthesis method and some of the analysis software were described
previously.[4,41,51] We use in-house synthesized DNA microarrays. Probes on a microarray
are tethered to the surface by their 3′ end. To increase
the hybridization probability at the given temperature, we extended
all sequences by adding four bases, CT at the 3′ end and TC at the
5′ end. The microarray synthesis used in our work has a stepwise
coupling efficiency of ≥99%. For the sequence length of
7, this leads to an estimated 93% of probes free of any synthesis
defects.[52] The remaining 7% have mostly
one defect. The targets are prepared at a concentration of 25 nM in a 5×
SSPE buffer solution. Their terminus is labeled by a Cy3 fluorescent
dye. Hybridization is performed in equilibrium with the buffer in
a chamber designed for that purpose. We use a UPlanApo 10× 0.40
NA objective for observation. Figure 9 shows the image of a hybridized microarray as obtained
after 100 s exposure time. Each probe sequence is
restricted to a small area, commonly called a feature. To determine the
amount of target bound to a probe, we measure the fluorescence intensities
(hybridization intensities) by taking images of the DNA microarray surfaces
with an electron-multiplying CCD camera (EM-CCD C9100-02, Hamamatsu). We correct for background fluorescence
originating from the unhybridized targets in the buffer by subtraction.
Microarray pictures shown in the Experimental Results are computationally reconstructed from these intensities; for
example, Figure 5a
is produced from Figure 9. The hybridization temperature is 32 °C.
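The estimate of defect-free probes follows from the stepwise coupling efficiency. The sketch below assumes that every coupling step fails independently with the same probability, which gives a binomial distribution for the number of defects; this model and the function name are assumptions made for illustration.

```python
from math import comb


def defect_probability(length: int, efficiency: float, defects: int) -> float:
    """Probability that a probe of the given length carries exactly
    `defects` synthesis defects, assuming independent coupling steps."""
    error = 1.0 - efficiency
    return comb(length, defects) * error ** defects * efficiency ** (length - defects)


if __name__ == "__main__":
    L, eff = 7, 0.99
    p0 = defect_probability(L, eff, 0)   # defect-free probes
    p1 = defect_probability(L, eff, 1)   # probes with exactly one defect
    print(f"defect-free: {p0:.1%}")
    print(f"one defect:  {p1:.1%}")
    print(f"share of defective probes with one defect: {p1 / (1 - p0):.1%}")
```

With an efficiency of 99% and seven coupling steps, 0.99^7 ≈ 93% of the probes are defect-free, and about 97% of the defective probes carry exactly one defect, in line with the statement that the remaining 7% have mostly one defect.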
Figure 9
Image of a hybridized
microarray. The bright features correspond
to the fluorescent intensities of hybridized targets.