
Hamming Distance as a Concept in DNA Molecular Recognition.

Mina Mohammadi-Kambs¹, Kathrin Hölz², Mark M Somoza², Albrecht Ott¹.

Abstract

DNA microarrays constitute an in vitro example system of a highly crowded molecular recognition environment. Although they are widely used in biological applications, some of the basic mechanisms of the hybridization processes of DNA remain poorly understood. On a microarray, cross-hybridization arises from similarities between sequences and may introduce errors during the transmission of information. Experimentally, we determine an appropriate minimum Hamming distance, that is, the smallest number of positions in which any two sequences of a set must differ. By applying an algorithm based on a graph-theoretical method, we find large orthogonal sets of sequences that are sufficiently different not to exhibit any cross-hybridization. To create such a set, we first derive an analytical solution for the number of sequences that include at least four guanines in a row for a given sequence length and eliminate them from the list of candidate sequences. We experimentally confirm the orthogonality of the largest set found, with a size of 23 for a length of 7. We anticipate our work to be a starting point toward the study of signal propagation in highly competitive environments, besides its obvious application in DNA high-throughput experiments.


Year: 2017 PMID: 28474009 PMCID: PMC5410656 DOI: 10.1021/acsomega.7b00053

Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343

Introduction

Molecular recognition in the crowded environment of DNA microarrays plays an important role in processing information. Recognition often requires the discrimination of one specific molecule among many similar, competing molecules. In 1894, Emil Fischer proposed the lock and key model to describe the recognition of an enzyme and a substrate.[1] According to this model, the substrate possesses the perfect size and shape to fit the active site of its complement. However, in crowded environments, binding between noncomplementary molecules may occur and introduce errors. For DNA, specific binding of two single strands, that is, the formation of a stable double helix, occurs only if the bases A and T as well as C and G pair along the sequence. DNA microarrays are a widely used platform that, besides many applications in medicine and biology, enables the study of the fundamentals of DNA hybridization.[2−10] These microarrays consist of single-stranded DNA oligonucleotides immobilized on a surface (probes). If these probes are exposed to a bulk mixture of fluorescently labeled target sequences, only complementary targets are expected to hybridize. However, hybridization of a probe to a noncomplementary target still occurs, albeit with a lower binding affinity than the corresponding perfectly matching sequence. Therefore, similarities among probes can lead to a significant amount of nonspecific cross-hybridization. On a DNA microarray with complex target mixtures, imperfect recognition introduces noise and makes results difficult to interpret.
The kinetics of hybridization in the presence of competitors and the importance of cross-hybridization for quantitative interpretation of microarray data have been intensely studied,[11−13] especially for the purpose of single nucleotide polymorphism detection and the accurate assessment of gene expression levels.[14−17] One strategy to avoid cross-hybridization is to construct sets of probes with minimized pairwise competition so that they do not cross-hybridize. Such probes are often referred to as orthogonal. Previous theoretical research[18−24] developed different strategies to find sets of orthogonal sequences. The most intuitive approach to deciding which sequences cross-hybridize is based on the free energy difference between the perfectly matched and mismatched hybridization.[25] However, estimating free energies led to poor predictions of hybridization intensities on microarrays.[26] In this work, we apply a well-known local search algorithm and implement graph-theoretical methods to find such sets. Following the concept of Hamming distance from coding theory, we consider that two sequences do not cross-hybridize if they differ by at least a certain number of bases. This threshold is called the minimum Hamming distance d.[27] We determine a suitable d experimentally. One of the fundamental problems in coding theory is finding the maximum size of a code, where a code is a set of codewords of length L and minimum Hamming distance d.[28] In analogy, here, we experimentally and theoretically find maximal sets of independent (i.e., orthogonal) sequences (MIS) with a certain minimum Hamming distance that can coexist on a microarray without exhibiting cross-hybridization.

Results and Discussion

Theoretical Results

For a given strand with L bases, there are 4^L distinct sequences, one for each arrangement of the four DNA bases (A, C, G, T). However, some of these sequences exhibit undesired structures that prevent them from binding to their complement. An example is sequences with runs of at least four guanines, which we call 4G sequences. These sequences are capable of forming complex structures such as G-quadruplexes, which restrict hybridization. Moreover, they have abnormal affinities and tend to show increased cross-hybridization and reduced target-specific hybridization, which makes the measurement of gene expression unreliable.[29−31] Therefore, in this work, we eliminate 4G sequences and their complements, the 4C sequences. The number of sequences of a given length L that exhibit at least one run of 4G is given as

N_{4G}(L) = 4^L - \sum_{k=0}^{L} 3^k \binom{k+1}{L-k}_4    (1)

where the sum represents the number of sequences that are not 4G. The quadrinomial coefficient \binom{k+1}{L-k}_4 equals the number of permutations of L–k guanines within a sequence of length L (for the derivation of eq 1, see section S1). To verify eq 1, we numerically calculate N4G(L) by generating all 4^L sequences for a given L ≤ 7 and discarding the ones that contain 4G sequences. Figure 1 illustrates N4G(L) in comparison to the total number of sequences, 4^L, for different lengths. As depicted in Figure 1, for L ≤ 7, this fraction stays below 1.5%, whereas for longer lengths, it rises, so that for L ≥ 200, around 50% of all possible sequences contain 4G structures.
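
The counting argument above can be checked with a short script. The following is a minimal sketch, not the authors' code: it assumes the closed form 4^L minus the sum over k of 3^k times a quadrinomial coefficient (k non-G bases, with the L–k guanines distributed over the k+1 gaps in runs of at most three), which is a reconstruction from the description in the text, and compares it with brute-force enumeration. The function names are illustrative.

```python
from itertools import product


def count_4g_brute(L):
    """Count length-L sequences with a run of at least four G by enumeration."""
    return sum("GGGG" in "".join(s) for s in product("ACGT", repeat=L))


def quadrinomial(n, m):
    """Coefficient of x^m in (1 + x + x^2 + x^3)^n."""
    coeffs = [1]
    for _ in range(n):
        new = [0] * (len(coeffs) + 3)
        for i, c in enumerate(coeffs):
            for j in range(4):  # each factor contributes 0..3 guanines
                new[i + j] += c
        coeffs = new
    return coeffs[m] if 0 <= m < len(coeffs) else 0


def count_4g_formula(L):
    """4^L minus the number of sequences without a 4G run (assumed closed form)."""
    # k non-G bases (3^k choices) leave k + 1 gaps that hold the L - k guanines,
    # with at most three guanines per gap.
    not_4g = sum(3 ** k * quadrinomial(k + 1, L - k) for k in range(L + 1))
    return 4 ** L - not_4g


if __name__ == "__main__":
    for L in range(1, 8):
        n = count_4g_formula(L)
        print(L, n, count_4g_brute(L), f"{n / 4 ** L:.4%}")
```

Enumeration is only practical for short lengths, since the number of sequences grows as 4^L; the closed form has no such limit.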

Figure 1

Fraction of all 4^L possible sequences that contain 4G structures for different oligonucleotide lengths. The inset shows that this fraction stays below 1.5% for L ≤ 7. For longer lengths, it rises. Dotted lines are a guide for the eye.

The second category of sequences that will not contribute to recognition are self-complementary sequences. We neglect them as we work with short sequences where self-complementarity only plays a minor role.[32−34] For longer lengths, however, this must be considered.

Coding theory is a branch of mathematics that studies codes and their properties for different applications. A code is a set of codewords. The length of a codeword L is the number of letters that create the codeword, where the letters are taken from an alphabet. In our case, DNA sequences are taken as the codewords, where L is the number of bases (A, C, G, T) that make up the sequence. The number of positions in which two codewords of the same length differ is the Hamming distance.[27] In the case of DNA sequences, we define this distance as the number of bases by which they differ. We assume that for every sequence of a given length there is a minimum Hamming distance d such that there is no cross-hybridization as long as the Hamming distance k is greater than or equal to d. If two sequences differ by fewer than d bases, they may cross-hybridize. For a given sequence, N(d) is the number of sequences from which one can choose a competitor with k ≥ d. N(d) decreases with increasing d (Figure 2). Equation 2 gives, for a given length L, the number of sequences P(k) at Hamming distance k from a given sequence:

P(k) = \binom{L}{k} 3^k    (2)

Figure 2 depicts N(d), obtained by summing P(k) over all k ≥ d, for L = 7 and a given minimum Hamming distance.
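
The quantities defined above are easy to compute directly. The sketch below assumes the standard count P(k) = C(L, k)·3^k (choose the k differing positions, then one of the three other bases at each) and sums it to obtain N(d); the function names are illustrative and not taken from the paper.

```python
from math import comb


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def p_k(k, L):
    """Number of sequences at Hamming distance exactly k from a given sequence."""
    return comb(L, k) * 3 ** k


def n_d(d, L):
    """Number of sequences at Hamming distance k >= d from a given sequence."""
    return sum(p_k(k, L) for k in range(d, L + 1))


if __name__ == "__main__":
    L = 7
    for d in range(L + 1):
        print(d, p_k(d, L), n_d(d, L))
```

Summing P(k) over all k from 0 to L recovers 4^L, which is a quick consistency check on the counting.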

Figure 2

N(d) is the number of sequences with k ≥ d for a given sequence with L = 7. This number decreases for larger d. The inset depicts the number of sequences P(k) for each Hamming distance k. If d = 5, the shaded region represents the sum over all sequences with k ≥ 5, which do not cross-hybridize with the given sequence. Dotted lines are a guide for the eye.

Solving the maximum independent set problem is NP-hard. No efficient general exact algorithm is known; however, there are approximations.[35,36] Finding a maximal independent set (MIS) in N(d) is a problem related to graph theory.[36,37] A graph consists of vertices, represented by red circles in Figure 3a. Two vertices are called adjacent if they are connected by an edge (blue line). We represent the probes by vertices. If two sequences cross-hybridize, we connect them by an edge (Figure 3b). An independent set is a subset with no adjacent vertices. If adding any further sequence to the set destroys its independence, the set is called a maximal independent set. The largest of all maximal independent sets is the maximum independent set. Here, MIS corresponds to the largest number of independent oligonucleotides that can be found. For our approach, we create an adjacency matrix A for a given L and d, where the numbers of rows and columns correspond to the number of sequences; thus, it is a 4^L × 4^L square matrix (Figure 3c). If the Hamming distance between sequences i and j is less than d, they cross-hybridize, that is, they are connected by an edge. In this case, A_ij = 1; otherwise, A_ij = 0. Sequences are not self-adjacent, that is, A_ii = 0 ∀ i.
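
To make the adjacency rule concrete, here is a minimal pure-Python sketch that builds the matrix for a small example (L = 3, d = 2), in which the three sequences named in the Figure 3 caption are mutually non-adjacent. It is an illustration under these assumed parameters, not the authors' implementation, and a dense list-of-lists matrix is only practical for short L.

```python
from itertools import product


def all_sequences(L, alphabet="ACGT"):
    """All 4^L sequences of length L in lexicographic order."""
    return ["".join(p) for p in product(alphabet, repeat=L)]


def adjacency_matrix(seqs, d):
    """A[i][j] = 1 if sequences i and j differ in fewer than d positions."""
    n = len(seqs)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = sum(x != y for x, y in zip(seqs[i], seqs[j]))
            if dist < d:
                A[i][j] = A[j][i] = 1  # symmetric; the diagonal stays 0
    return A


if __name__ == "__main__":
    seqs = all_sequences(3)
    A = adjacency_matrix(seqs, d=2)
    idx = [seqs.index(s) for s in ("ATT", "TTC", "TCT")]
    print([[A[i][j] for j in idx] for i in idx])
```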

Figure 3

Concept of the graph-theoretical approach. (a) Vertices of one graph (red circles) along with the edges (blue lines) between adjacent vertices. The vertices inside the dashed area form the maximum independent set. (b) Sequences that are connected by blue lines lead to cross-hybridization. ATT, TTC, and TCT form the maximum independent set. (c) Adjacency matrix for the corresponding set of sequences.

We apply a constructive local search algorithm[38,39] that iteratively adds orthogonal sequences to an existing set until the available sequences are depleted. To identify the orthogonal sequences, the algorithm employs the adjacency matrix constructed beforehand. The algorithm is restricted as it does not try all combinations of sequences. Therefore, it does not necessarily find the maximum independent set but proposes many maximal independent sets instead. We consider the largest set among them as an approximate solution to the exact size of the maximum independent set. All obtained set sizes are within the known Singleton and Gilbert–Varshamov[28,40] bounds and are summarized in Tables S2 and S3 along with a comparison to literature values. The size of the adjacency matrix increases exponentially with the sequence length, which requires a large amount of memory. Therefore, we are limited to short sequences, L ≤ 7. Figure 4 illustrates the possible sizes of different independent sets for L = 7 and d = 5 before and after discarding 4G and 4C sequences. The MIS size M(L, d) in both cases is 23. Removing these sequences for L ≤ 7 does not change the size of the MIS in most cases (refer to Table S2). However, for longer lengths, the fraction of 4G rises and we expect that discarding such sequences reduces the size of a MIS (Figure 1). This algorithm creates independent sets based on the pool of available sequences. Removing all sequences containing 4C and 4G changes this pool. Therefore, we obtain different independent sets (blue columns) compared with the cases where we did not discard these sequences (red columns). A significant trend toward smaller or larger set sizes upon removing 4C and 4G sequences cannot be identified.
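
The constructive idea described above can be sketched as a randomized greedy search with restarts. This is a simplified stand-in for the constructive local search of refs 38 and 39, not the authors' algorithm: it computes Hamming distances on the fly instead of storing the adjacency matrix, and the names `greedy_maximal_set` and `best_of_restarts`, the number of restarts, and the seed are all illustrative assumptions. Each run returns a maximal independent set; the largest one found is only a lower bound on the maximum.

```python
import random
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_maximal_set(candidates, d, rng):
    """Visit candidates in random order, keeping each one that is at distance
    >= d from everything kept so far. The result is maximal: every rejected
    candidate conflicts with a member of the final set."""
    pool = list(candidates)
    rng.shuffle(pool)
    chosen = []
    for s in pool:
        if all(hamming(s, t) >= d for t in chosen):
            chosen.append(s)
    return chosen


def best_of_restarts(L, d, restarts=20, seed=0, drop_runs=True):
    """Largest maximal independent set found over several random restarts."""
    candidates = ["".join(p) for p in product("ACGT", repeat=L)]
    if drop_runs:  # discard 4G sequences and their 4C complements
        candidates = [s for s in candidates
                      if "GGGG" not in s and "CCCC" not in s]
    rng = random.Random(seed)
    runs = (greedy_maximal_set(candidates, d, rng) for _ in range(restarts))
    return max(runs, key=len)


if __name__ == "__main__":
    best = best_of_restarts(L=7, d=5)
    print(len(best), best)
```

Collecting the sizes of all runs, rather than only the largest, would give a histogram of set sizes of the kind the text describes.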

Figure 4

Size of all independent sets for L = 7 with d = 5 before (red columns) and after removing 4G and 4C sequences (blue columns). The height of each column gives the number of possible sets for a given M(L, d).
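
As a plausibility check on the reported set size, the Singleton upper bound and the Gilbert–Varshamov lower bound mentioned in the text can be evaluated for a four-letter alphabet. The sketch below uses the textbook forms of both bounds; the function names are illustrative, and the supplementary tables may state the bounds differently.

```python
from math import comb


def singleton_bound(q, L, d):
    """Upper bound on the size of a q-ary code of length L and distance d."""
    return q ** (L - d + 1)


def gilbert_varshamov_bound(q, L, d):
    """Lower bound: q^L divided by the volume of a Hamming ball of radius d - 1,
    rounded up."""
    ball = sum(comb(L, j) * (q - 1) ** j for j in range(d))
    return -(-q ** L // ball)  # ceiling division on integers


if __name__ == "__main__":
    q, L, d = 4, 7, 5
    print(gilbert_varshamov_bound(q, L, d), singleton_bound(q, L, d))
```

For L = 7 and d = 5 these evaluate to 5 and 64, so the set size of 23 lies between them.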

Experimental Results

A suitable minimum Hamming distance d must be determined experimentally. Because the longest sequences studied with our algorithm are 7-mers, we design a microarray consisting of oligonucleotides of length 7 (plus four additional terminal bases, see Materials and Methods). We immobilize an arbitrary probe sequence, complementary to a perfectly matching (PM) target, together with some of its related mismatched sequences. To study how the hybridization probability depends on the positions of defects, we locate the mismatches at the ends, in the middle of the sequence, or distribute them uniformly. Hybridizing the PM target on the microarray yields the results shown in Figure 5. Each feature block, as depicted in Figure 5a–d, corresponds to a set of sequences with one to four mismatches, MM1–MM4, respectively. They are all surrounded by a frame of PMs. Each sequence appears eight times within a feature block. To improve the statistics, the hybridization intensities of all replicates of a sequence are averaged, and their standard deviations σ are calculated. Then, all intensities are normalized relative to the average PM intensity on the microarray. The PM and mismatched sequences are all subject to the same constant synthesis error rate (see Materials and Methods), which leads to an overall loss of hybridization intensity. For the results presented in the following, only the relative intensity is of importance, and it is not affected by this loss. Fluorescence intensity variations are due to inhomogeneities of the microarray surface, fluorescent stains in the feature blocks, or illumination gradients during synthesis.[9] For four or more mismatches (MM ≥ 4), we detect no hybridization intensity other than that of the PM (not shown).

Figure 5

Fluorescence intensity from a hybridized PM target on a microarray. Each feature block (a–d) corresponds to a set of sequences with one to four mismatches, respectively. They are surrounded by a frame of PMs. Each sequence appears eight times within a feature block.

Figure 6 presents the normalized fluorescence intensities of hybridization for a sequence with one mismatch as a function of the defect position. The intensity for sequences with a single mismatch in the middle is smaller because defects in the middle of the duplex increase the base pair opening probability and destabilize the duplex. This result agrees well with previously reported work.[10,41]

Figure 6

Normalized hybridization intensity for the sequences with one mismatch as a function of their mismatch position. The intensity for sequences including a single mismatch in the middle is smaller than for a mismatch located at the end.

We assume all eight fluorescence intensities of one probe measured at different positions on the microarray to be normally distributed and described by a standard deviation σ. To discriminate the PM binding intensity from all other nonspecific binding, the normal distributions of their hybridization intensities must be well separated. We show in Figure 7 the distributions of the fluorescence intensities of the PM and of the sequences that exhibit the highest cross-hybridization intensities IMM,max for MM1–MM3. The normal distributions are based on a statistical analysis of the microarrays shown in Figure 5. The peak centers in Figure 7 correspond to the average values of the fluorescence intensities and their widths to the standard deviations (shown in Table 1). In DNA microarrays, the binding affinities can vary widely, depending on the precise sequence and its concentration,[41] that is, fluorescence intensities of perfectly matched sequences span a large range. To illustrate this, we determine the hybridization free energy of the sample sequence 3′-CTATATATATC-5′ binding to its PM using Nupack software[42] and the corresponding expected fluorescence intensity using the Langmuir isotherm.[9] As this sequence does not contain any G or C bases within the seven core bases, its fluorescence intensity is among the lowest of all possible sequences. In fact, we find that it has just 16.5% of the fluorescence intensity obtained by the same procedure for the PM sequence 3′-CTACCGTACTC-5′ used on the microarray shown in Figure 5. Accordingly, it should be expected that some perfectly matched but weakly binding sequences will have lower hybridization intensities than the 27% signal of IMM,max for three mismatches. This clearly shows that a minimum Hamming distance of d = 3 cannot be used for a reliable discrimination between PM and MM hybridization. Therefore, we investigate sets with d ≥ 4 in subsequent experiments. Table 1 shows the sequences and their intensities as well as the corresponding standard deviations for each number of mismatches.
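
The separation argument can be illustrated numerically with the normalized intensities and standard deviations listed in Table 1. The sketch below models each intensity as a normal distribution, as the text assumes, and uses the overlap coefficient from Python's `statistics.NormalDist` as one possible separation measure; that choice of measure is an assumption made here, not something taken from the paper. It also rescales the PM mean by the 16.5% figure quoted above for the weakly binding sequence.

```python
from statistics import NormalDist

# Normalized I_max and sigma from Table 1, keyed by number of mismatches.
TABLE1 = {
    0: (1.00, 0.066),
    1: (0.63, 0.100),
    2: (0.37, 0.074),
    3: (0.27, 0.073),
}

pm = NormalDist(*TABLE1[0])

# Overlap coefficient (0 = fully separated, 1 = identical) between the PM
# distribution and each worst-case mismatch distribution.
overlaps = {mm: pm.overlap(NormalDist(*TABLE1[mm])) for mm in (1, 2, 3)}

# Expected mean of a weakly binding PM, using the 16.5% ratio from the text.
weak_pm_mean = 0.165 * TABLE1[0][0]

if __name__ == "__main__":
    for mm, ov in overlaps.items():
        print(f"PM vs MM{mm}: overlap = {ov:.3g}")
    print("weak PM below MM3 maximum:", weak_pm_mean < TABLE1[3][0])
```

Even though the strong PM used on the array is well separated from MM3, the rescaled weak PM mean falls below the MM3 maximum of 0.27, which is the point the text makes against d = 3.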

Figure 7

Normal distribution of the PM and MM1–MM3 hybridization intensities. The curves assume a normal distribution with the average intensity (peak centers) and standard deviation σ given in Table 1. Even the average cross-hybridization intensity of IMM3,max = 27% is too high for accurate discrimination between PM binding and unwanted cross-hybridization (compare main text).

Table 1

Sequences with Different Numbers of Mismatches, Which Yield the Highest Hybridization Intensities among All Probes within Each Feature Block (Figure 5) along with Their σ

number of mismatches	sequence	I_max ± σ
0	3′-CTACCGTACTC-5′	1 ± 0.066
1	3′-CTTCCGTACTC-5′	0.63 ± 0.1
2	3′-CTACCGTCTTC-5′	0.37 ± 0.074
3	3′-CTACCGACTTC-5′	0.27 ± 0.073

To test sets with d ≥ 4, we first design a microarray consisting of 23 sequences (see Table S1) as predicted by our algorithm, corresponding to d = 5 (compare Figure 4). To verify its independence, we record the hybridization intensities of three arbitrarily chosen PM targets of this set simultaneously. Figure 8a shows the measured normalized hybridization intensities Iseq in a bar plot after background subtraction. The PM targets that are present in solution clearly hybridize only to their corresponding complementary probes (green bars). Using the highest hybridization intensity as a reference, the other hybridized PM sequences reach 24 and 31% of that level. On the other hand, the measured hybridization intensities of all other probes (blue bars) scatter with σ = 0.3% around their average value of zero, which can be attributed to the background fluorescence noise. Negative values correspond to intensities below the average background level. The intensities of the probes whose PM targets are not present in solution stay well below 2% within a large confidence interval (5σ environment). To cross-check that sets with d ≤ 4 are not independent, we synthesize another microarray including 83 sequences with d = 4. Hybridization of one PM leads to cross-hybridization of 11 other probes that rise above 2%, as can be seen from the red bars in Figure 8b. This underlines that d < 5 is not sufficient to achieve independence.

Figure 8

Two microarrays consisting of sequences with two different minimum Hamming distances: (a) independent set with d = 5 and (b) set with d = 4. In both cases, the green bars represent the probes whose PM targets are present in solution. Blue bars correspond to the hybridization intensities of sequences with Iseq ≤ 2%. Red bars represent the cross-hybridized sequences with Iseq > 2%.

Conclusion

In this work, we experimentally determined a minimum Hamming distance d between DNA oligonucleotides. Sequences that differ pairwise by at least d can make up an orthogonal set, which means they do not cross-hybridize. By applying a local search algorithm, we found orthogonal sets for different L and d. For a length of 7, we determined a MIS with a size of 23 and experimentally confirmed its orthogonality with an appropriate minimum distance of 5. The small set size of 23 compared with the 4^7 = 16384 possible sequences arises from the minimum Hamming distance of 5. The technology of optically directed synthesis introduces errors into sequences.[43−46] Single-nucleotide polymorphism detection in bulk has been achieved, albeit with higher synthesis fidelity and optimized experimental conditions.[47] Moreover, d can be reduced by increasing the temperature to reduce nonspecific binding, which can improve the discrimination among the sequences of a set.[47] For longer sequence lengths, higher temperatures are particularly important to increase the number of complementary bases that enable binding.[48] At a given concentration, the discrimination increases near the melting temperature. In the course of our experiments, we found a minimum Hamming distance of five for a sequence length of 11 (seven core bases and four extra terminal ones), in good agreement with the reported discrimination level of d ≈ L/2.[18,19] Our set size, on the other hand, does not gain from the four additional bases. When our algorithm is extended to longer sequences, these extra bases become redundant, and we expect that d ≈ L/2 will remain applicable. With the same d, larger lengths lead to larger set sizes than we have determined here. We also derived an analytical expression to calculate the number of 4G sequences. As we have shown, eliminating these sequences for short lengths does not change the size of the MIS in most cases. However, we anticipate an impact for longer sequences, as the fraction of sequences containing 4G structures increases. Although we could show how to avoid cross-hybridization on our synthesized microarrays, the approach cannot easily be transferred to real-world microarray applications such as those developed by Affymetrix. Following the protocol for expression studies, Affymetrix targets are very long compared with their surface-bound probes. Such sequence lengths introduce a large variety of conformations. Therefore, in expression studies one should consider additional effects such as the brush effect[49] and the surface density of probes.[7,50]

Materials and Methods

DNA Microarray Hybridization Experiment

The light-directed in situ synthesis method and some of the analysis software were described previously.[4,41,51] We use in-house synthesized DNA microarrays. Probes on a microarray are tethered to the surface by their 3′-end. To increase the hybridization probability at the given temperature, we extended all sequences by adding four bases, CT at the 3′ end and TC at the 5′ end. The microarray synthesis used in our work has a stepwise coupling efficiency of ≥99%. For a sequence length of 7, this leads to an estimated 93% of probes being free of any synthesis defects.[52] The remaining 7% mostly have one defect. The targets are prepared at a concentration of 25 nM in a 5× SSPE buffer solution. Their terminus is labeled with a Cy3 fluorescent dye. Hybridization is performed in equilibrium with the buffer in a chamber designed for that purpose. We use a UPlanApo 10× 0.40 NA objective for observation. Figure 9 shows the image of a hybridized microarray as obtained after 100 s exposure time. Each probe sequence species is restricted to a small area, commonly called a feature. To determine the amount of target bound to a probe, we measure the fluorescence intensities (hybridization intensity) by taking images of the DNA microarray surface with an electron-multiplying CCD camera (EM-CCD C9100-02, Hamamatsu). We correct for background fluorescence originating from the unhybridized targets in the buffer by subtraction. Microarray pictures shown in the Experimental Results are computationally reconstructed by using these intensities; for example, Figure 5a is produced from Figure 9. The hybridization temperature is 32 °C.
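
The quoted fraction of defect-free probes follows from the stepwise coupling efficiency. The sketch below assumes that each of the seven coupling steps fails independently with the same probability, a simple binomial model adopted here for illustration rather than the error model of ref 52, and uses the lower limit of 99% efficiency.

```python
from math import comb


def defect_distribution(length, coupling_efficiency):
    """Probability of exactly k synthesis defects, k = 0..length, assuming
    independent coupling steps with identical efficiency (binomial model)."""
    p_err = 1.0 - coupling_efficiency
    return [comb(length, k) * p_err ** k * coupling_efficiency ** (length - k)
            for k in range(length + 1)]


if __name__ == "__main__":
    dist = defect_distribution(7, 0.99)
    print(f"defect-free:       {dist[0]:.3f}")
    print(f"exactly one defect: {dist[1]:.3f}")
    print(f"two or more:        {sum(dist[2:]):.4f}")
```

Under this model, about 93% of the 7-mers are defect-free, and nearly all of the remaining 7% carry a single defect, consistent with the figures stated in the text.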

Figure 9

Image of a hybridized microarray. The bright features correspond to the fluorescent intensities of hybridized targets.

33 in total

1. DNA sequence design based on template strategy.

Authors: Wenbin Liu; Shudong Wang; Lin Gao; Fengyue Zhang; Jin Xu
Journal: J Chem Inf Comput Sci Date: 2003 Nov-Dec

2. Optical study of DNA surface hybridization reveals DNA surface density as a key parameter for microarray hybridization kinetics.

Authors: Wolfgang Michel; Timo Mai; Thomas Naiser; Albrecht Ott
Journal: Biophys J Date: 2006-11-03

3. Competitive displacement of DNA during surface hybridization.

Authors: J Bishop; C Wilson; A M Chagovetz; S Blair
Journal: Biophys J Date: 2006-10-20

4. Short oligonucleotide probes containing G-stacks display abnormal binding affinity on Affymetrix microarrays.

Authors: Chunlei Wu; Haitao Zhao; Keith Baggerly; Roberto Carta; Li Zhang
Journal: Bioinformatics Date: 2007-05-30

5. Demonstration of a word design strategy for DNA computing on surfaces.

Authors: A G Frutos; Q Liu; A J Thiel; A M Sanner; A E Condon; L M Smith; R M Corn
Journal: Nucleic Acids Res Date: 1997-12-01

6. Efficiency, error and yield in light-directed maskless synthesis of DNA microarrays.

Authors: Christy Agbavwe; Changhan Kim; DongGee Hong; Kurt Heinrich; Tao Wang; Mark M Somoza
Journal: J Nanobiotechnology Date: 2011-12-08

7. Stability of double-stranded oligonucleotide DNA with a bulged loop: a microarray study.

Authors: Christian Trapp; Marc Schenkelberger; Albrecht Ott
Journal: BMC Biophys Date: 2011-12-13

8. Position dependent mismatch discrimination on DNA microarrays - experiments and model.

Authors: Thomas Naiser; Jona Kayser; Timo Mai; Wolfgang Michel; Albrecht Ott
Journal: BMC Bioinformatics Date: 2008-12-01

9. Tests of rRNA hybridization to microarrays suggest that hybridization characteristics of oligonucleotide probes for species discrimination cannot be predicted.

Authors: Alex Pozhitkov; Peter A Noble; Tomislav Domazet-Loso; Arne W Nolte; Rainer Sonnenberg; Peer Staehler; Markus Beier; Diethard Tautz
Journal: Nucleic Acids Res Date: 2006-05-17

10. Design and analysis of mismatch probes for long oligonucleotide microarrays.

Authors: Ye Deng; Zhili He; Joy D Van Nostrand; Jizhong Zhou
Journal: BMC Genomics Date: 2008-10-17