Needles in Haystacks: Understanding the Success of Selective Pairing of Nucleic Acids.

Carlos A Plata^1,2, Stefano Marni^3, Samir Suweis^1,4, Tommaso Bellini^3, Elvezia Maria Paraboschi^5,6.

Abstract

The effectiveness of several biological and biotechnological processes relies on the remarkably selective pairing of nucleic acids in contexts of molecular complexity. Relevant examples are the on-target binding of primers in genomic PCR and the regulatory efficacy of microRNA via binding on the transcriptome. Here, we propose a statistical framework that enables us to describe and understand such selectivity by means of a model that is extremely cheap from a computational point of view. By re-parametrizing the hybridization thermodynamics on three classes of base pairing errors, we find a convenient way to obtain the free energy of pairwise interactions between nucleic acids. We thus evaluate the hybridization statistics of a given oligonucleotide within a large number of competitive sites that we assume to be random, and we compute the probability of on-target binding. We apply our strategy to PCR amplification and microRNA-based gene regulation, shedding new light on their selectivity. In particular, we show the relevance of the defectless pairing of 3' terminals imposed by the polymerase in PCR selection. We also evaluate the selectivity afforded by the microRNA seed region, thus quantifying the extra contributions given by mechanisms beyond pairing statistics.

Entities: Chemical

Keywords: nucleic acid interactions; pairing statistics; stat-mech modeling

Substances:
MicroRNAs
Nucleic Acids

Year: 2022 PMID: 35328493 PMCID: PMC8954656 DOI: 10.3390/ijms23063072

Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923

1. Introduction

The selective pairing of nucleic acids is the key molecular property enabling genetic coding, gene expression and regulation, and heredity transmission. The extent of such selectivity becomes evident in processes in which complementary strands have to pair selectively amid a plethora of other nucleic acid polymers and oligomers. Relevant examples of such a successful “needle in the haystack” search performed by nucleic acids can be found in both biological and technological contexts. For instance, in the biological context, microRNAs (miRNAs) play a key role in gene expression and regulation. miRNAs are short RNA molecules (∼22 nt, where nt stands for nucleotides) that typically target specific messenger RNAs (mRNAs) among the molecular variety present in the cytoplasm, inducing mRNA degradation or translation halting. In the technological context, polymerase chain reaction (PCR) is the most used technique in molecular biology, allowing the exponential amplification of target DNA/cDNA regions thanks to the selective pairing between oligonucleotide primers and entire genomes/transcriptomes. In both cases, one short oligomer (of the order of 20 nt) has to search and find its complementary counterpart within much longer polymers (e.g., nt). Regarding the PCR technique, since Mullis’ first publication, the primer length considered effective in PCR has been in the range of 20–27 nt [1]. A simple statistical consideration is to count the possible sequences of a strand of length L and compare this number with the total length of the analyzed genome [2]. When $L = 20$, the number of possible nucleobase sequences is $4^{20} \approx 10^{12}$, much more than the length of the human genome (around $3 \times 10^{9}$ nt). However, this simple evaluation does not take into account the possibility of forming defected pairings, which is the most relevant form of potential failure in selective targeting.
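As a rough illustration of the counting argument above, the following Python sketch compares the number of distinct sequences of length L with the genome length. The genome size of about 3 × 10^9 nt and all function names are assumptions made for this example. As the text notes, this estimate ignores defected pairings, so it should be read only as an upper bound on selectivity.

```python
# Back-of-the-envelope count: distinct sequences of length L versus genome length.
# GENOME_LENGTH_NT is an assumed round value for the human genome, not a fitted parameter.
GENOME_LENGTH_NT = 3_000_000_000


def n_sequences(L: int) -> int:
    """Number of distinct sequences of length L over the alphabet {A, C, G, T}."""
    return 4 ** L


def expected_exact_matches(L: int, genome_length: int = GENOME_LENGTH_NT) -> float:
    """Approximate expected number of perfect matches of one L-mer in a random sequence."""
    return genome_length / n_sequences(L)


def shortest_unique_length(genome_length: int = GENOME_LENGTH_NT) -> int:
    """Smallest L for which 4**L exceeds the genome length."""
    L = 1
    while n_sequences(L) <= genome_length:
        L += 1
    return L


if __name__ == "__main__":
    for L in (12, 16, 20):
        print(L, n_sequences(L), expected_exact_matches(L))
    print("shortest L with 4**L > genome length:", shortest_unique_length())
```

Under these assumptions, perfect complementarity alone would already single out a unique site for oligomers well below the 20 nt used in practice, which is why the competition from defected pairings has to be modeled explicitly.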
In its current use, primer design is optimized through algorithms that control for GC content, secondary structure, or internal complementary regions [2]. On the other hand, for miRNA selectivity, the mechanism of action has different layers of complexity. First, miRNAs in cells function within a ribonucleoprotein complex called the RNA-induced silencing complex (RISC). The formation of the mature miRNA–RISC complex is not trivial, and requires the maturation of the miRNA molecule, the association with Argonaute (AGO) proteins, and the selection of the guide strand that takes the RISC to the target mRNAs, usually in their 3′ untranslated region (3′ UTR) [3]. Moreover, although the length of mature miRNAs is ∼22 nt, the “active” region, called the “seed”, is only 6–8 nt long [4]. Generally, the seed corresponds to nucleotides 2–7, and it is considered the minimal element required to bind the mRNA and repress its translation. This length must have been optimized by nature as a compromise between selectivity on the one hand and fast diffusion and accessibility to the target on the other. Despite the seed being recognized as a critical element in the miRNA mechanism, growing evidence indicates that sequences in the miRNA 3′ end play an important role in mRNA targeting [3]. Interestingly, structural studies have shown that, once the miRNA forms a complex with AGO proteins, only the seed is available to interact with the target site [5]. However, the binding of the miRNA–RISC complex to a target RNA induces a conformational change that unmasks the 3′ end of the miRNA, allowing further pairing outside the seed region [3,5], which can impact the specificity of targeting, the regulatory mechanism, and the stability of the miRNA itself. Finally, the presence of a mRNA–seed (or extended) pairing is not the only determinant of successful miRNA activity.
In fact, mRNAs in cells tend to form secondary structures, and to interact with RNA binding proteins, which can limit miRNA accessibility to the target. Site accessibility was demonstrated to be a key feature for miRNA-mediated translational repression: functional miRNA target sites are preferentially located in highly accessible regions, and this feature is conserved across genomes [6]. These notions need to be taken into account in the estimate of the total amount of sites on which miRNA may bind in competition to its targets. In spite of the differences and the complexity of the selectivity processes described above, they are both rooted in the selectivity of interactions between nucleic acids. A natural question thus arising from these remarkably successful examples of selectivity is how to model and understand these phenomena on the basis of the well-known thermodynamics of nucleic acid duplex formation [7]. Here, we tackle this problem by elaborating on a re-parametrization on three classes of base pairing error guided by the description of hybridization thermodynamics from the so-called “nearest-neighbor model” [8]. We then develop a mean field method to calculate the probability for the formation of perfect and defected duplexes in these two contexts. In particular, we focus on exploring the effect of the oligomer (primer and miRNA) length L in the efficiency of targeting their cognate sites within long random sequences, gaining new insights into the factors at play in both situations. The dependences on other relevant parameters such as the temperature are also analyzed.

2. Materials and Methods

Our strategy relies on the comparison between the Boltzmann statistical weights for on-target and off-target pairings in order to evaluate the success probability of the process. In the miRNA case, we study the pairing of the miRNA–RISC complex to the mRNA, where mainly the nucleotides within the “seed” region are available for Watson–Crick interactions; on the other hand, we consider the first annealing cycle of the PCR, being the most significant for the success of the technique. In the following subsections, we present the main ingredients for our physical statistical description of the PCR technique and miRNA gene expression regulation: firstly, we are able to obtain the average free energy of a certain quality of duplex, thanks to a parametrization of the pairing depending on the kinds of mismatches involved. Secondly, the same parametrization allows us to obtain the degeneracy of each kind of duplex, i.e., the total number of sequences with which the primer/miRNA can realize a duplex with the same combination of mismatched bases. For our purposes of general validity of the results, we neglect the sequence specificity of the genomic ssDNA or of the mRNA and we consider them as random sequences, where the 4 nitrogenous bases are equiprobable in each nucleotide of the off-target sites. Finally, we have combined the binding free energy and the degeneracy to compute the Boltzmann weight of the on-target and off-target pairings. Comparing these two terms, we obtain the on-target pairing probability.

2.1. Free Energy for Duplex Formation

Unlike other works on DNA hybridization that focus on the prediction of stable pairings as a function of the temperature, i.e., the study of “melting curves” [9,10,11], we aim here to characterize the probability of on-target binding of oligonucleotides in the presence of huge numbers of random competitive sites. To do so, we have to describe the binding free energy between any given pair of interacting oligomers, as well as the degeneracy of their potential pairing. The free energy difference between a nucleic acid duplex and its free constituent sequences can be split into an enthalpic and an entropic part, $\Delta G = \Delta H - T \Delta S$. Nevertheless, providing an accurate description of the thermodynamic parameters characterizing the interaction between nucleic acids is not an easy task. In the highly cited review by SantaLucia and Hicks [12], detailed energetic data for several DNA motifs can be found, comprising canonical Watson–Crick pairing and a long catalog of errors, including internal mismatches, terminal mismatches, terminal dangling ends, hairpins, bulges, internal loops, and multibranched loops. The extraction of such thermodynamic parameters, however, necessarily requires knowledge of the specific bases composing the two strands, and this information is typically not accessible, or it is simply unfeasible to compute when dealing with a multitude of random possible competing pairs. Moreover, since our aim is to unveil some fundamental properties based on thermodynamic arguments with a coarse-grained modeling to explain the effectiveness of selective bindings in nucleic acids, we assume that such properties do not depend on fine details such as the specific bases composing the interacting oligomers. This hypothesis is checked for specific cases (see Appendix A, and Appendix A.5 in particular), supporting the robustness and range of applicability of our description.
Remarkably, our assumption of two states, i.e., on–off hybridization with no intermediate state between unbound and paired, is justified for short oligomers [13], as is the case throughout this study. For these reasons, we develop here an effective energetic model that, by considering only three classes of base pairing errors and through a “mean field” approach where all possible combinations of interacting pairs are averaged, yields a simplified yet quantitatively fair description of DNA (or RNA) hybridization. For the sake of concreteness, let us focus on the pairing between a generic primer (an oligomer with length L) and a much longer polymer. Specifically, the length of the latter measures the number of ways in which the first oligomer can couple to it (the number of sites where it can attach). Once L is defined, in our description, the duplex is fully characterized through a three-component parameter vector. This vector carries the number of external mismatches at each of the two ends and the number of internal mismatches. Its definition thus consists of re-parametrizing the hybridization thermodynamics on three classes of base pairing errors. To this aim, we split the total enthalpy and entropy into different contributions stemming from the different interactions involved in the duplex, $\Delta H = \Delta H_{\mathrm{pm}} + \Delta H_{\mathrm{de}} + \Delta H_{\mathrm{im}}$ and $\Delta S = \Delta S_{\mathrm{pm}} + \Delta S_{\mathrm{de}} + \Delta S_{\mathrm{im}} + \Delta S_{\mathrm{salt}}$. Above, we have separated the contributions from the perfect match (pm), the dangling ends (de), and internal mismatches (im). Note that the entropy is additionally corrected for the salt concentration [14]. In the following subsections, we account for each contribution in detail.
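The split of the free energy into enthalpic and entropic parts follows the standard relation ΔG = ΔH − TΔS. The short Python sketch below evaluates it with the unit conventions used in the tables of this paper (enthalpy in kcal mol−1, entropy in cal mol−1 K−1). The function name is hypothetical, and the example values are the AA/TT row of Table A1; the temperature of 37 °C is chosen only for illustration.

```python
# Minimal sketch: free energy from enthalpy and entropy, dG = dH - T * dS.
# Units follow the tables: dH in kcal/mol, dS in cal/(mol K); the result is in kcal/mol.


def gibbs_free_energy(dH_kcal: float, dS_cal: float, T_celsius: float) -> float:
    """Return dG in kcal/mol at the given temperature (in degrees Celsius)."""
    T_kelvin = T_celsius + 273.15
    return dH_kcal - T_kelvin * dS_cal / 1000.0  # convert cal to kcal


if __name__ == "__main__":
    # AA/TT nearest-neighbor step from Table A1, evaluated at an illustrative 37 C.
    print(gibbs_free_energy(-7.6, -21.3, 37.0))
```

The factor of 1000 matters in practice: mixing kcal and cal without conversion is a common source of errors when combining tabulated ΔH and ΔS values.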

2.1.1. Perfect Match: Initiation and Nearest-Neighbor Canonical Base Pairs

Our starting point is the contribution of an ideal matched duplex. The nearest-neighbor model has been proven to provide a very good description of the enthalpy and entropy of duplexes [12]. This model starts from initiation values for the enthalpy and the entropy, which are complemented by additive contributions coming from each couple of neighboring base pairs. Such contributions depend on the specific bases considered. Nevertheless, in our coarse-grained description, we associate a single averaged enthalpy and entropy contribution with any couple of neighboring matched base pairs (see Appendix A along with Table A1 therein for further details on the averaging). Therefore, in our framework, the enthalpy and entropy of perfectly matched duplexes depend solely on the length L: each is given by the initiation value plus $L - 1$ times the averaged nearest-neighbor contribution, where we have taken into account that the number of couples of neighboring base pairs is $L - 1$.

Table A1

Thermodynamic parameters for canonical Watson–Crick base pairs for duplexes made by DNA/DNA (left panel) and RNA/RNA (right panel). Energy and entropy units are kcal mol−1 and cal mol−1 K−1, respectively.

Propagation			Propagation
Sequence	ΔH	ΔS	Sequence	ΔH	ΔS
AA/TT	−7.6	−21.3	AA/UU	−6.82	−19.0
AT/TA	−7.2	−20.4	AU/UA	−9.38	−26.7
TA/AT	−7.2	−21.3	UA/AU	−7.69	−20.5
CA/GT	−8.5	−22.7	CA/GU	−10.44	−26.9
GT/CA	−8.4	−22.4	GU/CA	−11.4	−29.5
CT/GA	−7.8	−21.0	CU/GA	−10.48	−27.1
GA/CT	−8.2	−22.2	GA/CU	−12.44	−32.5
CG/GC	−10.6	−27.2	CG/GC	−10.64	−26.7
GC/CG	−9.8	−24.4	GC/CG	−14.88	−36.9
GG/CC	−8.0	−19.9	GG/CC	−13.39	−32.7
EC(G)/G(C)E	0.1	−2.85	EC(G)/G(C)E	1.805	−0.75
EA(T)/T(A)E	2.3	4.05	EA(U)/U(A)E	5.525	9.75
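The averaging of Section 2.1.1 can be sketched in Python using the DNA/DNA propagation rows of Table A1. This is only an illustration under stated assumptions: it takes a plain unweighted mean over the ten tabulated rows, whereas the actual averaging is the one detailed in Appendix A, and the initiation terms are left as hypothetical arguments defaulting to zero.

```python
# DNA/DNA nearest-neighbor propagation parameters copied from Table A1:
# sequence -> (dH in kcal/mol, dS in cal/(mol K)).
DNA_NN = {
    "AA/TT": (-7.6, -21.3), "AT/TA": (-7.2, -20.4), "TA/AT": (-7.2, -21.3),
    "CA/GT": (-8.5, -22.7), "GT/CA": (-8.4, -22.4), "CT/GA": (-7.8, -21.0),
    "GA/CT": (-8.2, -22.2), "CG/GC": (-10.6, -27.2), "GC/CG": (-9.8, -24.4),
    "GG/CC": (-8.0, -19.9),
}


def mean_nn(params=DNA_NN):
    """Unweighted mean (dH, dS) over the tabulated steps (an assumption of this sketch)."""
    n = len(params)
    dH = sum(h for h, _ in params.values()) / n
    dS = sum(s for _, s in params.values()) / n
    return dH, dS


def perfect_match(L, dH_init=0.0, dS_init=0.0, params=DNA_NN):
    """Coarse-grained (dH, dS) of a perfect duplex: initiation + (L - 1) averaged steps.

    dH_init and dS_init are placeholders; the sketch does not fix their values.
    """
    dH_nn, dS_nn = mean_nn(params)
    return dH_init + (L - 1) * dH_nn, dS_init + (L - 1) * dS_nn


if __name__ == "__main__":
    print(mean_nn())
    print(perfect_match(20))
```

A weighted mean over all 16 dinucleotide steps, or a different treatment of the terminal rows of the table, would change the numbers slightly but not the linear dependence on L − 1.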

2.1.2. Dangling Ends: External Mismatches

This contribution takes into account that the duplex may form with some mismatched base pairs at its ends. Moreover, if the external base is well paired, there is a stacking contribution to the free energy, due to the base of the long polymer that is next to the pair. The number of external mismatches is counted separately at each of the two ends. Note that, in order to obtain at least one matched base pair, we need to enforce that the total number of external mismatches is smaller than L (see Figure 1). Our working hypothesis, motivated by the values typically found [12], is that external mismatches can be thought of as two dangling bases at the same end. Therefore, due to external mismatches, (i) neighboring base pairs are canceled out with respect to the perfect match and (ii) there is an extra contribution stemming from the first bases within a dangling end. Although, in reality, this contribution would depend on the identity of the bases, we consider an averaged enthalpy and entropy contribution for any dangling end (see Appendix A along with Table A2 therein for further insight on these values).

Figure 1

Sketch of a DNA primer interacting with a generic portion of a DNA single strand of a denatured genome. Gray rectangles represent the nucleobases. Canonical Watson–Crick pairing is marked in yellow. Shaded boxes mark pairing defects: internal mismatches (green shades) and terminal mismatches at the 3′ and 5′ ends (blue and purple shades, respectively). In this sketch, , and .

Table A2

Thermodynamic parameters for dangling ends for DNA/DNA interaction. Energy and entropy units are kcal mol−1 and cal mol−1 K−1, respectively.

		X
Dangling	Propagation	A		T		C		G
End	Sequence	ΔH	ΔS	ΔH	ΔS	ΔH	ΔS	ΔH	ΔS
5′-dangling	XA/T	0.2	2.3	−6.9	−20.0	0.6	3.3	−1.1	−1.5
	XT/A	−2.9	−7.7	−0.2	−0.3	−4.1	−13.2	−4.2	−15.1
	XC/G	−6.3	−17.2	−4.0	−11.0	−4.4	−12.5	−5.1	−14.1
	XG/C	−3.7	−10.1	−4.9	−13.8	−4.0	−11.8	−3.9	−10.8
3′-dangling	AX/T	−0.5	−1.2	−3.8	−12.7	4.7	14.25	−4.1	−13.2
	TX/A	−0.7	−0.7	2.9	10.3	4.4	14.8	−1.6	−3.5
	CX/G	−5.9	−16.4	−5.2	−15.1	−2.6	−7.4	−3.2	−10.3
	GX/C	−2.1	−3.8	−4.4	−13.1	−0.2	0.1	−3.9	−11.2

Therefore, by summing up the previous discussion, the contribution of external mismatches can be parametrized as follows: where takes the possible values depending on the external mismatches corresponding, respectively, to dangling ends without external mismatches, dangling ends plus external mismatches in one end, and dangling and external mismatches in both ends. When writing the cases above, we have kept in mind the binding of a primer inside a specific region of a longer DNA as in Figure 1. Nevertheless, this has to be modified if one is interested in studying selection by miRNA. As described in the Introduction, the active region of miRNA is finite, as represented in Figure 2. Therefore, when considering miRNA, we always assume , regardless of the number of external mismatches.

Figure 2

Sketch of a miRNA seed interacting with a generic portion of a mRNA. The nucleobases involved in the interaction with the AGO protein (red shading) are not available for pairing. Colored boxes have the same color code as Figure 1. In this sketch, , and .

2.1.3. Internal Mismatches

Now, we consider the effect of internal mismatches in the duplex. The integer parameter gives the number of internal mismatches within the duplex. When , the set of possible defining a possible duplex, with one matched base pair at least, fulfills the condition (see Figure 1). Besides the corresponding couples of neighboring base pairs that are canceled out, the contribution penalty that stems from single internal mismatches has been thoroughly studied [12]. This depends on the particular bases. Again, following the philosophy of our coarse-grained approach, we give an averaged contribution and to those eventualities (see Appendix A along with Table A3 therein for further details on the averaging). We assume that such a contribution does not vary when more than one internal mismatch is considered. Moreover, in order to prevent further complexity, we completely neglect the internal structure of the internal mismatches (number, sizes, and separation of adjacent internal mismatches). Specifically, we consider that the effects of additional internal mismatches are equivalent to considering those mismatches to be non-consecutive. Therefore, the thermodynamic parameters associated with internal mismatches are

Table A3

Thermodynamic parameters for internal errors for DNA/DNA interaction. Energy and entropy units are kcal mol−1 and cal mol−1 K−1, respectively.

		Y
Propagation		A		T		C		G
Sequence	X	ΔH	ΔS	ΔH	ΔS	ΔH	ΔS	ΔH	ΔS
AX/TY	A	1.2	1.7	WC	WC	2.3	4.6	−0.6	−2.3
	T	WC	WC	−2.7	10.8	−1.2	6.2	1.0	0.9
	C	5.3	14.6	0.7	0.2	0.0	−4.4	WC	WC
	G	−0.7	−2.3	−2.5	−8.3	WC	WC	−3.1	−9.5
TX/AY	A	4.7	12.9	WC	WC	3.4	8.0	0.7	0.7
	T	WC	WC	0.2	−1.5	1.0	0.7	−0.1	−1.7
	C	7.6	20.2	1.2	0.7	6.1	16.4	WC	WC
	G	3.0	7.4	−1.3	−5.3	WC	WC	1.6	3.6
CX/GY	A	−0.9	−4.2	WC	WC	1.9	3.7	−0.7	−2.3
	T	WC	WC	−5.0	−15.8	−1.5	−6.1	−4.1	−11.7
	C	0.6	−0.6	−0.8	−4.5	−1.5	−7.2	WC	WC
	G	−4.0	−13.2	−2.8	−8.0	WC	WC	−4.9	−15.3
GX/CY	A	−2.9	−9.8	WC	WC	5.2	14.2	−0.6	−1.0
	T	WC	WC	−2.2	−8.4	5.2	13.5	3.3	10.4
	C	−0.7	−3.8	2.3	5.4	3.6	8.9	WC	WC
	G	0.5	3.2	−4.4	−12.3	WC	WC	−6.0	−15.8

According to our modeling, each internal mismatch replaces two couples of next-neighbor canonical base pairs with two next-neighbor couples of mismatched base pairs. Note that we have carried out a strong approximation in the study of internal mismatches. Nevertheless, since the states with a significant statistical weight are those with low numbers of errors, we can argue that this approximation will not lead to significant errors. The results we present in this work are reasonable and physically sound, corroborating that our assumptions do not seem to misguide our analysis.

2.1.4. Salt Correction

Thermodynamic parameters are computed for a given reference salt concentration, usually 1 M of NaCl. Either an excess or a deficit of salt, or the presence of other ions, will imply a change in those parameters, affecting mainly the entropic contribution. Salt correction has been studied in detail in the literature [14]. In a nutshell, the most accepted proposals for this contribution assume that the entropy correction is a function of the salt concentration, usually through its logarithm. Again, this contribution has a dependence on the specific sequence that we neglect through averaging (see Appendix A for details).
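To make the logarithmic dependence concrete, the sketch below implements one widely used form of the entropy salt correction, the SantaLucia (1998) expression ΔS([Na+]) = ΔS(1 M) + 0.368 · (L − 1) · ln[Na+], with the entropy in cal mol−1 K−1 and the concentration in mol/L. This is an assumption of the example: it is not necessarily the correction of reference [14] or the averaged form used in Appendix A.

```python
import math

# Assumed form of the salt correction (SantaLucia 1998); not necessarily the paper's choice.
SALT_COEFF = 0.368  # cal/(mol K) per phosphate pair


def salt_corrected_entropy(dS_1M_cal: float, L: int, na_molar: float) -> float:
    """Entropy at sodium concentration na_molar, given the value at 1 M NaCl.

    L - 1 approximates half the number of phosphates of a duplex of two L-mers.
    """
    if na_molar <= 0:
        raise ValueError("salt concentration must be positive")
    return dS_1M_cal + SALT_COEFF * (L - 1) * math.log(na_molar)


if __name__ == "__main__":
    for conc in (1.0, 0.1, 0.05):
        print(conc, salt_corrected_entropy(0.0, 20, conc))
```

Because the concentration enters only through its logarithm, halving or doubling the salt shifts the entropy by a fixed, modest amount, in line with the remark that sizable concentration changes are needed to see significant effects.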

2.1.5. CG Contribution

In order to complement our averaged description, we also develop a more detailed, yet simple, approach that takes into account the effect of different sequences. Specifically, we assume that the energetic parameters are a function of the fraction of C or G bases in the DNA sequence, which may significantly change the thermal stability of the duplex. This description is primarily of interest for the PCR pairing statistics, since it highlights how the choice of specific sequences can influence the success of the PCR. Herein, we follow the IUPAC-IUB notation, where the bases are classified as either strong bases S = {C, G} or weak bases W = {A, T}. We then consider the fraction of S bases in the DNA sequence of interest. Our hypothesis is that the contribution coming from the next-neighbor canonical couples of base pairs is a function of this fraction. Specifically, we consider a linear interpolation in this fraction (see Appendix A for further details). When illustrating the effect of differences in the richness of strong bases, we present the results in terms of the number of S bases.
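A linear interpolation in the fraction of strong bases can be sketched as follows. The endpoint values (the nearest-neighbor contribution of an all-weak and of an all-strong sequence) are passed in as arguments because their actual definition is given in Appendix A; the numbers in the usage example are simple means of the W-only and S-only rows of Table A1 and are an assumption of this sketch.

```python
def strong_fraction(seq: str) -> float:
    """Fraction of strong bases (C or G) in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "CG" for base in seq) / len(seq)


def interpolate_nn(fraction: float, weak_value: float, strong_value: float) -> float:
    """Linear interpolation between the all-weak (fraction 0) and all-strong (fraction 1) values."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return (1.0 - fraction) * weak_value + fraction * strong_value


if __name__ == "__main__":
    # Hypothetical endpoints: mean dH (kcal/mol) of the AA/TT, AT/TA, TA/AT rows
    # and of the CG/GC, GC/CG, GG/CC rows of Table A1.
    dH_weak = (-7.6 - 7.2 - 7.2) / 3
    dH_strong = (-10.6 - 9.8 - 8.0) / 3
    f = strong_fraction("ACGTACGTACGTACGTACGT")
    print(f, interpolate_nn(f, dH_weak, dH_strong))
```

The same function applies to the entropy by passing the corresponding endpoint values.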

2.2. Degeneracy of Equivalent Duplexes

Given a specific sequence, there is only one well-defined complementary sequence, which corresponds with . On the contrary, with the same specific referential sequence, we can find many duplexes with the same , i.e., duplexes with errors are degenerate. Since there are 4 different possible bases, if we focus on one of them, there is only 1 exact complementary and 3 possible mismatches. Therefore, the degeneracy of a duplex with errors made by a selective molecule (primer or miRNA) of length L within a specific site of a much longer nucleic acid characterized by is where the binomial coefficient takes into account all possible combinations of the mismatches in the internal region of the duplex. Note that the simplicity of this degeneracy is partially due to our disregarding of the internal structure of internal mismatches.
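The counting argument can be checked by brute force for short oligomers. The sketch below assumes that external mismatches are the consecutive mismatches at each end (up to the first matched pair) and that internal mismatches lie strictly between the two outermost matched pairs. Under these assumptions the degeneracy is 3^(m3 + m5 + mi) times a binomial coefficient choosing mi positions among the L − m3 − m5 − 2 internal ones. The names m3, m5, mi and this closed form are a reconstruction made for the example, not necessarily the exact expression of the paper.

```python
from collections import Counter
from itertools import product
from math import comb


def degeneracy(L, m3, m5, mi):
    """Assumed closed form for the number of sites with mismatch counts (m3, m5, mi)."""
    inner = L - m3 - m5 - 2  # positions strictly between the outermost matched pairs
    if inner < 0:
        # A single matched pair leaves no internal region; anything shorter is impossible.
        return 3 ** (L - 1) if (m3 + m5 == L - 1 and mi == 0) else 0
    return 3 ** (m3 + m5 + mi) * comb(inner, mi)


def classify(site):
    """Map a site to (m3, m5, mi); 0 means matched, 1..3 one of the three wrong bases.

    Index 0 is taken as the 3' end by convention. Returns None if nothing matches.
    """
    matched = [i for i, b in enumerate(site) if b == 0]
    if not matched:
        return None
    first, last = matched[0], matched[-1]
    mi = sum(1 for b in site[first:last + 1] if b != 0)
    return first, len(site) - 1 - last, mi


def brute_force_counts(L):
    """Enumerate all 4**L sites and count how many fall in each mismatch class."""
    counts = Counter()
    for site in product(range(4), repeat=L):
        counts[classify(site)] += 1
    return counts


if __name__ == "__main__":
    L = 5
    counts = brute_force_counts(L)
    ok = all(n == degeneracy(L, *k) for k, n in counts.items() if k is not None)
    print("closed form matches enumeration:", ok)
```

The enumeration also confirms that the classes, together with the sites that share no matched base at all, exhaust the 4^L possible sequences.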

2.3. Quantifying Selectivity

In the annealing phase of PCR, a short primer of length L can pair to its complementary target or to an off-target site in the two genomic ssDNA strands. Similarly, a miRNA can pair to its specific target or to other available sites within different mRNA molecules. The specificity of this binding is key to guaranteeing the success of the selective process. Herein, we compute the probability of such successful binding using our model. Let us consider a duplex comprising one selective molecule (primer/miRNA) and a longer nucleic acid. This duplex can, in principle, be formed in many ways. Obviously, we expect that there is a preferred binding, which corresponds to the selective molecule binding to the target region of the longer nucleic acid. For generalization purposes, let us assume that this target region may appear more than once in the longer nucleic acids. The statistical weight of a specific binding j is given by the Boltzmann factor $e^{-\Delta G_j/(RT)}$, where $\Delta G_j$ is the free energy difference corresponding to such binding, R is the gas constant, and T the temperature used in the experiment. Therefore, the probability of a successful selection is the statistical weight of the desired hybridization of the primer/miRNA with a specific target region, divided by the sum of the weights of all possible pairings in the system. Note that this is the conditional probability of having a successful binding, given that a binding occurs. In other words, we implicitly assume that in typical conditions, concentration and temperature grant a good degree of PCR primer (or miRNA) binding to the longer nucleic acids. This probability should not be confused with a melting curve, e.g., means that, out of a total of bound primers/miRNA per long polymer, is on-target and are off-target. When computing it, we have assumed that the oligomers (primer/miRNA) in the system are mutually independent, i.e., they do not compete for the binding on each specific site.
Therefore, we are requiring implicitly that the total number of actual bindings per long polymer measured by the melting curve should not be much larger than . This constraint means that, on average, the number of primers/miRNA on target computed from , i.e., , does not exceed the number of target spots on the genome. In Appendix B, we provide a numerical check of in typical genomic PCR conditions, based on the assumption of independence of primers and the computation of the melting curve, validating our hypothesis. Thus, we interpret as a good estimator of pairing selectivity, expressing the ratio between on-target and off-target bindings. In order to compute , we need to quantify the different . On the one hand, we can compute through using the formalism introduced in the previous section considering . On the other hand, using a mean field approximation, we assign the averaged Boltzmann factor to the rest of the possible bindings, where the sum over runs for all possible external and internal mismatches. The denominator in (16) comes from , where the sum includes duplexes without a single complementary base pair. These duplexes can be considered within our energetic framework as impossible bindings to which we associate . These off-target pairs have a weight proportional to the total sites of pairing available in the system, i.e., the number of bases of the long polymer. Finally, we can rewrite the probability in (15) for pairing to the targets that are found in number in the system as Note that we have used that , which is true for both PCR and miRNA. This statistical approach allows us to provide a simple theoretical result with no knowledge of the specific sequences involved, which is computationally cheap. Although we are aware of the quantitative limitations of such an approach, we show here that our framework leads to a better understanding of the physics involved in selective processes such as the PCR technique or miRNA.
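The comparison of Boltzmann weights described in this section can be sketched in a few lines of Python. The averaged off-target weight is taken here as an input rather than computed from the mismatch model, and the names n_targets and n_sites are hypothetical labels for the number of target regions and the number of competing sites. The expression assumes that the number of sites is much larger than the number of targets.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K)


def boltzmann_weight(dG_kcal: float, T_kelvin: float) -> float:
    """Statistical weight exp(-dG / (R T)) of a binding with free energy dG."""
    return math.exp(-dG_kcal / (R_KCAL * T_kelvin))


def on_target_probability(w_target: float, mean_w_off: float,
                          n_targets: int, n_sites: float) -> float:
    """Probability that a bound oligomer sits on a target, given that it is bound.

    w_target   : weight of the perfect on-target duplex
    mean_w_off : averaged weight of a generic (mean field) off-target site
    """
    on = n_targets * w_target
    off = n_sites * mean_w_off  # valid when n_sites >> n_targets
    return on / (on + off)


if __name__ == "__main__":
    T = 333.15  # illustrative temperature in kelvin, not a value taken from the paper
    w_on = boltzmann_weight(-20.0, T)   # hypothetical on-target free energy
    w_off = boltzmann_weight(-2.0, T)   # hypothetical averaged off-target free energy
    print(on_target_probability(w_on, w_off, n_targets=1, n_sites=6e9))
```

The design choice worth noting is that selectivity depends only on the ratio of the on-target weight to the product of the site number and the mean off-target weight, which is why the dependence on the number of sites is much smoother than the dependence on length or temperature.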

3. Results

The theoretical framework introduced above enables the evaluation of the probability of successful binding in the two conditions we have identified as especially challenging for the selectivity. In this section, we compute, for both PCR and miRNA, the targeting efficiency as a function of the relevant parameters (i.e., the length of the oligonucleotides L, the temperature T, the number of competing sites, and the number of target sites in the system), by varying one parameter at a time while holding the others fixed at values chosen to mimic typical real conditions. The energetic parameters used in the calculations are obtained by averaging over the DNA and RNA thermodynamic dataset of the nearest-neighbor model, as detailed in Appendix A (Table A1, Table A2 and Table A3).

3.1. PCR

The application of our general framework to the selectivity of primers in PCR requires some specifications. First, primers are typically designed to pair to a single target position on the genome, i.e., the number of target sites is one. Second, in evaluating PCR efficiency, it is crucial to include the notion that the DNA polymerase needs a correct pairing between the target molecule and the 3′ terminal of the primer in order to start the amplification reaction [15]. This can be included in the model by splitting the average of off-target pairings into the weighted combination of two contributions, stemming from pairings with and without a correctly paired 3′ terminal, each normalized by the degeneracy of the duplexes in the corresponding case. Accordingly, in the total statistical weight of the pairing of the primer along the genome, these two off-target contributions enter with coefficients given by the frequencies with which correct and defected pairing occur in the 3′ terminal nucleobase, respectively. Thus, the off-target pairing probability splits into the probabilities of two classes, with and without correct pairing at the 3′ terminal. Since off-target pairing with errors at the 3′ terminal inhibits the amplification, the relevant quantity expressing the selectivity of PCR is the ratio of on-target pairing over all the primer–genome bindings with a defectless 3′ terminal. This ratio is normalized with only the possible pairings that are defectless at the 3′ terminal since, in the other cases, the PCR would not even start its amplification process. Figure 3 and Figure 4 show the primer length dependence for the PCR pairing statistics. Specifically, we display the curves for the on-target binding probability (blue dots in Figure 3), the probability of off-target binding with and without 3′ pairing errors (yellow and purple diamonds in Figure 3, respectively), and the renormalized on-target binding probability (Figure 4).
The computation of the different curves is performed holding fixed , since the primer can bind to both strands of the genomic DNA double helix, and °C, a typical annealing temperature; the salt concentration is mM, representing the standard salt concentration, according to a typical DNA polymerase manufacturer’s instructions. Due to the logarithmic dependence on the salt concentration, it is necessary to change the salt concentration notably in order to observe significant changes (see Appendix C for details). Both the on-target binding probability and its renormalized counterpart exhibit a rather sharp rise, indicating that the selectivity of the primer markedly changes upon lengthening or shortening the primer by a single nucleobase. The significant difference between the two is due to the remarkable difference between the probability of off-target binding and its sub-ensemble of off-target binding with a defectless 3′ terminal (red and purple diamonds, respectively, in Figure 3). This shows that such a defect is actually quite common in random binding, since terminal defects involve the smallest energy penalties [12] with a limited growth of degeneracy. In Figure 4, we also consider the effect of modifying the fraction of CG bases in the primer. Full dots are computed with a number of CG bases ; open dots correspond to .
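The renormalization used for PCR can be written compactly: only bindings with a well-paired 3′ terminal can start amplification, so the on-target probability is divided by the total probability of those bindings. The sketch below uses hypothetical argument names and purely illustrative probabilities; it is not a computation of the curves in Figure 3 or Figure 4.

```python
def renormalized_on_target(p_on: float, p_off_3ok: float) -> float:
    """On-target probability conditioned on a defectless 3' terminal.

    p_on      : probability of on-target binding
    p_off_3ok : probability of off-target binding with a correctly paired 3' terminal
    Off-target bindings with a 3' defect are excluded, since they cannot be extended.
    """
    denom = p_on + p_off_3ok
    if denom <= 0.0:
        raise ValueError("at least one extendable binding must have nonzero probability")
    return p_on / denom


if __name__ == "__main__":
    # Illustrative numbers only: most off-target bindings carry a 3' defect.
    p_on, p_off_3ok, p_off_3bad = 0.2, 0.05, 0.75
    assert abs(p_on + p_off_3ok + p_off_3bad - 1.0) < 1e-12
    print(renormalized_on_target(p_on, p_off_3ok))
```

With these illustrative numbers the renormalized selectivity is much higher than the raw on-target probability, which mirrors the qualitative gap between the two quantities discussed above.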

Figure 3

Dependence on the primer length L of the pairing probability for PCR. Fixed values are considered for temperature °C, total sites , salt concentration mM, and for CG fraction . Successful target binding (blue dots). Off-target binding (red dots). Off-target binding can be split into 2 contributions: off-target binding with no terminal defects at the 3′ end (purple diamonds), off-target binding with terminal defects at the 3′ end (yellow diamonds). The vertical gray line stands for , a typical primer length in PCR.

Figure 4

Dependence on the primer length L of the on-target pairing probability conditioned on the well-paired 3′ end for PCR. Fixed values are considered for temperature °C, salt concentration mM, and total sites , for different CG fractions in the primer. Full dots and solid line: CG fraction . Open dots and dashed line: CG fraction . Open dots and dotted line: CG fraction . Curves are computed using the average energetic description, detailed on the CG fraction (Equations (11) and (12)). The vertical gray line stands for , a typical primer length in PCR.

The temperature dependence for the PCR pairing statistics is analyzed in Figure 5. Therein, is plotted for either a balanced or unbalanced proportion of CG bases for and . When T increases, the fraction of on-target binding decreases, as expected since the energetic gain for Watson–Crick against defected pairing decreases with T. Again, we find a sharp transition between high and low , and the typical working temperature °C is indeed in the regime of high selectivity, but close to the transition to low selectivity.

Figure 5

Dependence on the temperature T of the on-target primer pairing probability conditioned on the well-paired 3′ end for PCR. Fixed values are considered for primer length , salt concentration mM, and total sites , for different CG fractions in the primer. Solid line: CG fraction . Dashed line: CG fraction . Dotted line: CG fraction . The gray line marks °C, a typical annealing temperature in the PCR experiments.

Finally, we compute the on-target binding probability as a function of the number of competing binding sites . This is shown in Figure 6, where we repeat our study for different proportions of CG bases while fixing and °C. In this case, the transition is much smoother, and relevant changes in the selectivity appear only when the number of competing sites changes by orders of magnitude. When considering the of the human genome, the selectivity of the PCR primers is found to be very high, as expected.

Figure 6

Dependence on the genome length of the on-target primer pairing probability conditioned on the well-paired 3′ end for PCR. Fixed values are considered for temperature °C, salt concentration mM, and primer length , for different CG fractions in the primer. Solid line: CG fraction . Dashed line: CG fraction . Dotted line: CG fraction . The gray line marks twice the length of the human genome, .

3.2. miRNA

To apply our theoretical approach to the selective binding of miRNA, we first need to assess the most appropriate values for and . miRNAs preferentially target 3′ UTRs, since the coding region is usually bound to other macromolecular complexes, e.g., exon junction complexes and ribosomal machinery, that would displace the RISC complex [16]. For this reason, we choose to include in our analysis only a portion of around 1000 nt, which corresponds to the median length of the 3′ UTR [17]. Moreover, to evaluate the total number of possible binding sites for each miRNA, we have to consider that not all the genes encoded in the human genome are actively transcribed within a cell. Transcriptome data in fact show that approximately 11,000 genes are simultaneously detectable within a specific cell type [18]. Thus, in evaluating the seed selectivity, we consider a reduced transcriptome length given by the product of these two quantities, . As for the evaluation of , it is relevant to notice that, unlike the PCR situation in which the primer is designed to target a single position in the genome, a single miRNA regulates the expression of several genes simultaneously. In particular, evidence suggests that the “targetome” of a miRNA is not random, but is generally constituted by transcripts sharing the same biological network. This fact suggests that miRNAs can regulate entire target pathways [19,20]. Thus, in order to provide a reasonable value for the seed length that ensures the required selectivity, we need to consider the mean number of genes targeted by each miRNA family (groups of miRNA sharing the same seed). Analyses of preferential conservation of the seed sequence in mammals against vertebrates have indicated that the average number of targets for each miRNA family is around 300 [16]. 
More recent studies based on the integration of miRNA target prediction and RNA sequencing data suggest an average of 90 targets for each miRNA, highlighting the high variability among individual miRNAs [21]. Therefore, in the application of our approach to miRNA selectivity, we consider to be in the range 100–300. The dependence of successful binding on the length of the miRNA seed region is shown in Figure 7, where the three conditions of , , and are considered for °C (the temperature in human cells) and ; the salt concentration is mM (to mimic the physiological salt concentration). The results obtained for clearly indicate that, if miRNAs were meant to regulate only one specific gene, the seed would have to be 4–5 nucleobases longer to achieve the required selectivity.
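The reduced transcriptome length described above is simple arithmetic, sketched below in Python. The inputs are the two figures quoted in the text (a median 3′ UTR length of about 1000 nt and approximately 11,000 simultaneously detectable genes); the constant and function names are illustrative and are not taken from the paper.

```python
# Figures quoted in the text; the names are hypothetical.
UTR_MEDIAN_LENGTH_NT = 1_000   # median 3' UTR length, in nucleotides [17]
EXPRESSED_GENES = 11_000       # genes detectable in one cell type [18]


def reduced_transcriptome_length(utr_length_nt: int, n_genes: int) -> int:
    """Product of the 3' UTR length and the number of expressed genes."""
    return utr_length_nt * n_genes


n_sites = reduced_transcriptome_length(UTR_MEDIAN_LENGTH_NT, EXPRESSED_GENES)
print(f"reduced transcriptome length: {n_sites:,} nt ({n_sites:.1e})")
```

With these inputs the product is 1.1 × 10^7 nt, roughly two orders of magnitude shorter than the human genome.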

Figure 7

Dependence on the miRNA length L of the target pairing probability for fixed temperature °C, total sites , salt concentration mM, and CG fraction . Curves correspond to different numbers of distinct miRNA targets . Yellow dots: . Red dots: . Blue dots: . The gray line marks , the typical length of the seed region of miRNA.

It is possible to recast the effect of and in Equation (17) in a single parameter , which is the ratio of the number of off-target to on-target binding sites and quantifies the required selectivity. With such a definition, the equation can be rewritten in terms of this single parameter. Therefore, two different systems where and scale with the same factor, and thus with the same , are completely equivalent in our theoretical framework. Finally, we present the relation between selectivity and the number of competing binding sites in Figure 8, i.e., the analogue of the dependence shown in Figure 6 for PCR. As already introduced above, since the role played by the length of the long polymer is always modulated by the number of targets, it suffices to study the dependence on the defined effective length . The inset shows that the dependence on T of is not as strong as that observed in the PCR case, i.e., the miRNA selectivity is less sensitive to temperature in the range of interest for the human body.
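The equivalence of systems that share the same off-target to on-target ratio can be checked numerically. The sketch below is not the paper's Equation (17), which is not reproduced here. It assumes a generic competitive-binding form in which each on-target site carries a statistical weight `w_on` and each off-target site an average weight `w_off`. The function names and the numerical weights are hypothetical.

```python
import math


def site_ratio(total_sites: float, n_targets: float) -> float:
    """Number of off-target sites per on-target site."""
    return (total_sites - n_targets) / n_targets


def p_on_target(total_sites: float, n_targets: float,
                w_on: float, w_off: float) -> float:
    """On-target probability under an ASSUMED competitive-binding form."""
    on = n_targets * w_on
    off = (total_sites - n_targets) * w_off
    return on / (on + off)


def p_from_ratio(ratio: float, w_on: float, w_off: float) -> float:
    """Same quantity written in terms of the site ratio only."""
    return 1.0 / (1.0 + ratio * w_off / w_on)


W_ON, W_OFF = 1.0e9, 1.0  # hypothetical weights, for illustration only

# Two systems whose site and target counts differ by the same factor of 3.
p_a = p_on_target(1.1e7, 100, W_ON, W_OFF)
p_b = p_on_target(3.3e7, 300, W_ON, W_OFF)

print(site_ratio(1.1e7, 100), site_ratio(3.3e7, 300))
print(math.isclose(p_a, p_b, rel_tol=1e-12))
```

Under this assumed form, the counts enter only through their ratio, so scaling both by a common factor leaves the probability unchanged.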

Figure 8

Dependence of the miRNA pairing probability on the reduced transcriptome length computed with for fixed temperature °C, salt concentration mM, CG fraction , and miRNA length . Inset: T dependence of in the same conditions and . The gray lines mark the reference values °C (inset) and (main figure).

4. Discussion

The statistical framework developed in this work has allowed the analysis of the effectiveness of selectivity in both PCR and miRNA. In spite of its simplicity, the model has helped to better understand the relevance of the mechanisms behind the selective process, enabling non-trivial predictions that appear to have quantitative agreement with experimental observations. Among these remarkable features, we highlight the steep dependence of selectivity on L in Figure 3, Figure 4, and Figure 7 and the complex L dependence of various families of defect duplexes (Figure 3), which are discussed below separately for the two cases of interest.

4.1. PCR

In the context of PCR, our results convey various insights on the nature of primer selectivity. If the selective mechanism were entirely provided by on-target vs. off-target binding, i.e., expressed by , longer primers would be needed, e.g., selectivity of entails from Figure 3. Nevertheless, we find that the constraint of Watson–Crick pairing at the 3′ terminal of the primers significantly changes the range of successful binding. This is because, out of the large number of expected off-target pairings (red dots in Figure 3), the fraction of primer that binds off-target with a well-formed 3′ terminal is small and has a non-trivial dependence on L, with a drop for . When only defectless 3′ terminal binding is considered, the successful binding of primers among the plethora of off-target positions offered by the human genome is approximately . While this figure is still far from 1, we argue that it is sufficient, since the PCR protocol makes use of a combination of two primers, designed to target the complementary strands of the region of interest. The double strands produced at the end of the first replication cycle are much shorter than the initial genome, effectively reducing the value of in our description. Thus, in the following replication cycles, the ratio between on-target and off-target positions increases, leading to a progressive increment of . Another outcome of our approach is the quantification of the effect of unbalancing CG and AT bases in the primer: Figure 6 shows that the reduction or addition of two CG bases markedly affects . This is in line with the experimental procedures: when the CG content of a primer is low, its length is usually extended to compensate for the loss of selectivity. The dependence of on found by our model in Figure 6 is weak, i.e., for a moderate change of , the pairing probability does not change. 
This indicates that the PCR primer’s length granting selectivity depends weakly on the complexity of the molecular target, and thus it does not need to be significantly changed depending on the nucleic acid environment. We have found that the on-target pairing probability decreases as T increases (Figure 5). This behavior is well grounded from a thermodynamic point of view, since increasing the temperature makes the free energy penalty associated with mispairing decrease, and the population of more entropic (defected) states is favored. However, this dependence appears in contradiction with the typical experience of the molecular biologist. In fact, when PCR efficiency is not very high, the annealing temperature is usually raised (typically by 2–3 °C), especially during the first cycles to improve specificity. We argue that this experimental strategy is not rooted in an increment in the selectivity at equilibrium (which is the quantity we compute), but rather it is a strategy to overcome kinetic barriers, i.e., to avoid off-target defected bindings having lifetimes comparable with the annealing time. Indeed, the increase in T by even a few degrees strongly reduces the lifetime of off-target bindings, thus speeding up the dynamics towards equilibrium (see Appendix D).

4.2. miRNA

We now discuss the application of our description to the miRNA selective process. The results in Figure 7 demonstrate that, in order to obtain a significant selectivity of around 0.8 over the targets, a seed region of length 9–10 would be required, depending on the number of targets. Unlike in PCR, where we obtained a primer length transition consistent with the typical experimental setting, the estimated length of the miRNA seed is larger than the actual value. This difference is not surprising, since miRNAs operate within a much more intricate biological network than in vitro PCR settings. Many factors contribute to the complexity of this system. miRNAs are part of a ribonuclear particle, where the interaction with the protein component plays an essential role not only in the mechanism of silencing that follows the binding, but also in the target recognition. Experiments exploiting AGO crosslinking and coimmunoprecipitation revealed extensive AGO-bound mRNAs in the absence of miRNA seed complementarity, thus suggesting that AGO proteins might have an RNA-binding property that allows them to recognize mRNA targets [22]. Moreover, once the RISC complex is bound to the target mRNA, the molecular machinery undergoes a conformational change that exposes a part of the miRNA 3′ region (nts 13–16), thus allowing a supplemental pairing with the target, and providing additional selectivity [3], which could be interpreted as an increment in L in our description. Furthermore, the interaction between a miRNA and its target is not solely dependent on the nucleic acid pairing, but also on the availability of the target in the cell. This implies that the target gene must be transcribed, and that its secondary structure must allow the landing of the RISC complex and the binding of the miRNA to the target region. 
These elements suggest that the actual seed length is a compromise between the selectivity provided by nucleic acid pairing and the complexity of the cellular environment that calls for a higher degree of flexibility of the system. Overall, the comparison between the estimated and actual miRNA seed length offers a quantification of the extra selectivity brought by the mechanisms at play beyond base pairing.

5. Conclusions

In this study, a theoretical framework has been developed to describe selective processes in complex nucleic acid environments and applied to the PCR technique and miRNA-based gene regulation. In both cases, the selective binding occurs in spite of a huge degeneracy of competing defected pairings. The theory is constructed around two main approximations: (i) the coarse-grained description of the duplex energetics, recast based on three classes of base pairing errors, and (ii) the statistics of competing binding sites, computed by assuming random sequences. Despite the complexity of the problem, our simple approach has led to a computationally cheap model that has enabled quantitative estimates of the selectivity in the two processes, while highlighting features that cannot be recognized in the absence of a quantitative framework: the sharpness of the transition in selectivity as a function of the length L of the oligomers at play; the relevance of the constraint of defectless 3′ terminals in PCR primer targeting; and the quantitative estimate of the contribution to miRNA target selectivity provided by processes beyond base pairing.

Table A4

Thermodynamic parameters for canonical Watson–Crick base pairs for duplexes made by DNA/DNA averaged over categories of bases. Energy and entropy units are kcal mol−1 and cal mol−1 K−1, respectively.

Propagation
Sequence	ΔH	ΔS
WW	−7.4	−21.1
SS	−9.1	−22.8
SW(WS)	−8.2	−22.1
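As an illustration of how the entries of Table A4 translate into free energies, the sketch below applies the standard relation ΔG = ΔH − TΔS to each averaged propagation step. Only this textbook relation is used: initiation, terminal, and salt corrections are ignored, and the dictionary and function names are illustrative rather than the authors' code. The two temperatures are arbitrary example values.

```python
# Averaged propagation parameters copied from Table A4:
# (dH in kcal/mol, dS in cal/(mol K)).
TABLE_A4 = {
    "WW": (-7.4, -21.1),
    "SS": (-9.1, -22.8),
    "SW(WS)": (-8.2, -22.1),
}


def delta_g(dh_kcal: float, ds_cal: float, temp_celsius: float) -> float:
    """Free energy in kcal/mol from dG = dH - T * dS."""
    temp_kelvin = temp_celsius + 273.15
    return dh_kcal - temp_kelvin * ds_cal / 1000.0  # cal -> kcal


for step, (dh, ds) in TABLE_A4.items():
    g_low = delta_g(dh, ds, 37.0)
    g_high = delta_g(dh, ds, 65.0)
    print(f"{step:7s} dG(37 C) = {g_low:6.2f}  dG(65 C) = {g_high:6.2f} kcal/mol")
```

Because ΔS is negative for every entry, each propagation step becomes less stabilizing as T rises.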

Table A5

Estimation of the relative decrease in the disassociation time , due to a temperature increment . The values of the table are computed using Equation (A5), with different values of the increment , primer length L, and external mismatch in , representing different levels of defectiveness of the duplex. °C, as a typical PCR temperature.

		$\tau_{L,\vec{\alpha}}(T)/\tau_{L,\vec{\alpha}}(T+\Delta T)$
ΔT	L	αe1=0	αe1=1	αe1=2	αe1=5	αe1=10	αe1=15
ΔT=+2 °C	L=20	4.4	4.2	3.8	3.1	2.1	1.4
	L=21	4.7	4.5	4.2	3.3	2.3	1.5
	L=22	5.1	4.8	4.5	3.6	2.4	1.7
ΔT=+3 °C	L=20	9.1	8.4	7.5	5.3	3.0	1.7
	L=21	10.2	9.5	8.4	6.0	3.4	1.9
	L=22	11.5	10.6	9.5	6.7	3.8	2.1

26 in total

1. The thermodynamics of DNA structural motifs.

Authors: John SantaLucia; Donald Hicks
Journal: Annu Rev Biophys Biomol Struct Date: 2004

2. Effects of sodium ions on DNA duplex oligomers: improved predictions of melting temperatures.

Authors: Richard Owczarzy; Yong You; Bernardo G Moreira; Jeffrey A Manthey; Lingyan Huang; Mark A Behlke; Joseph A Walder
Journal: Biochemistry Date: 2004-03-30 Impact factor: 3.162

3. The role of site accessibility in microRNA target recognition.

Authors: Michael Kertesz; Nicola Iovino; Ulrich Unnerstall; Ulrike Gaul; Eran Segal
Journal: Nat Genet Date: 2007-09-23 Impact factor: 38.330

4. Quantitative effects of position and type of single mismatch on single base primer extension.

Authors: Jer-Horng Wu; Pei-Ying Hong; Wen-Tso Liu
Journal: J Microbiol Methods Date: 2009-03-12 Impact factor: 2.363

5. Single-molecule kinetics reveal cation-promoted DNA duplex formation through ordering of single-stranded helices.

Authors: Nicholas F Dupuis; Erik D Holmstrom; David J Nesbitt
Journal: Biophys J Date: 2013-08-06 Impact factor: 4.033

6. Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs.

Authors: T Xia; J SantaLucia; M E Burkard; R Kierzek; S J Schroeder; X Jiao; C Cox; D H Turner
Journal: Biochemistry Date: 1998-10-20 Impact factor: 3.162