A computer program for the estimation of protein and nucleic acid sequence diversity in random point mutagenesis libraries.

Abstract

A computer program for the generation and analysis of in silico random point mutagenesis libraries is described. The program operates by mutagenizing an input nucleic acid sequence according to mutation parameters specified by the user for each sequence position and type of point mutation. The program can mimic almost any type of random mutagenesis library, including those produced via error-prone PCR (ep-PCR), mutator Escherichia coli strains, chemical mutagenesis, and doped or random oligonucleotide synthesis. The program analyzes the generated nucleic acid sequences and/or the associated protein library to produce several estimates of library diversity (number of unique sequences, point mutations, and single point mutants) and the rate of saturation of these diversities during experimental screening or selection of clones. This information allows one to select the optimal screen size for a given mutagenesis library, necessary to efficiently obtain a certain coverage of the sequence-space. The program also reports the abundance of each specific protein mutation at each sequence position, which is useful as a measure of the level and type of mutation bias in the library. Alternatively, one can use the program to evaluate the relative merits of preexisting libraries, or to examine various hypothetical mutation schemes to determine the optimal method for creating a library that serves the screen/selection of interest. Simulated libraries of at least 10⁹ sequences are accessible by the numerical algorithm with currently available personal computers; an analytical algorithm is also available which can rapidly calculate a subset of the numerical statistics in libraries of arbitrarily large size. A multi-type double-strand stochastic model of ep-PCR is developed in an appendix to demonstrate the applicability of the algorithm to amplifying mutagenesis procedures. Estimators of DNA polymerase mutation-type-specific error rates are derived using the model. Analyses of an alpha-synuclein ep-PCR library and NNS synthetic oligonucleotide libraries are given as examples.

Year: 2005 PMID: 15990391 PMCID: PMC1166583 DOI: 10.1093/nar/gki669

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Random point mutagenesis of a nucleic acid sequence is a useful technique for probing structure and function and for directed evolution of proteins, peptides and nucleic acids. Mutagenesis libraries can be created by several methods (1), including error-prone PCR (ep-PCR) (2–4), passage through mutator Escherichia coli strains (5), chemical mutagenesis (6) (e.g. sodium bisulfite, nitrous acid), and oligonucleotide synthesis (7) (NNN, NNS, or arbitrary doping; N = A, G, C, or T; S = C or G). A mutagenesis library is typically characterized by its size (number of independent clones) and by how heavily it is mutated. However, statistics such as library diversity (8) (e.g. number of unique sequences, point mutations, and single point mutants), amino acid mutation bias and the distribution of the number of mutations per sequence would in some cases be of more direct interest to the experimentalist, if they were available. The reverse issue of optimizing amino acid mutation bias in synthetic oligonucleotide encoded libraries has been explored previously (7,9–11). Prior to screening, estimates of these parameters could be used to evaluate the relative merits of preexisting libraries. They can also be used to evaluate a variety of potential mutagenesis schemes and levels, in order to determine the most appropriate type of library to construct, for the given objective. Subsequent to selecting or preparing the most appropriate library, one would ideally screen all of its available diversity. However, if the screening process is costly in terms of time or expense, knowledge of the amount of diversity covered as a function of the number of clones screened allows the investigator to determine the optimal screen size. This is the endpoint that allows sufficient and efficient exploration of a defined portion of the library diversity while avoiding inefficient oversampling. 
Finally, when the screen has been completed, one would like to know how much of sequence space was actually covered, and with what amino acid mutation bias. A number of library statistics are amenable to an analytical treatment. For example, amino acid mutation frequencies, the fraction of sequences that are wild-type, the length distribution of truncated proteins, and some simpler diversity statistics (the number of unique sequence-position-specific point mutations and single point mutants; Supplementary Appendix H) can be calculated analytically. In contrast, the distribution of the number of mutations per sequence and the overall sequence diversity (number of unique sequences) are in general not analytically tractable. Previous analytical work has shown that probability theory can be used to estimate the overall diversity of a nucleic acid library when the equations are simplified by requiring that all sequences are equiprobable, or that all mutations occur with a single frequency, independent of wild-type base identity, mutation (e.g. transition versus transversion), and position (12). However, this requirement is generally not upheld during random mutagenesis. In the case of ep-PCR (3,13) (Table 3) and chemical mutagenesis (6), mutation frequencies vary with wild-type base, mutation (there are 3 possible changes for each of the 4 bases, and therefore 12 total mutation types), and possibly with sequence position. While equiprobability of sequences can be experimentally specified during oligonucleotide synthesis (e.g. NNS), in general the composition of every position in the sequence is arbitrarily controlled. Furthermore, the diversity of the translated protein library is often what is of prime interest, but this prediction is even more difficult to handle mathematically (14): even if the underlying nucleic acid sequence variants are equiprobable, the degeneracy of the genetic code results in non-equiprobable amino acid sequence variants (7).

Table 3

Estimated polymerase paired incorporation frequencies for the α-synuclein ep-PCR DNA libraryᵃ

Specific event	Probability
p_aa, p_tt	0.00161
p_ag, p_tc	0.00037
p_ac, p_tg	0.00121
p_ga, p_ct	0.00026
p_gg, p_cc	0.00005
p_gt, p_ca	0.00043

ᵃThis set of six paired mutagenic incorporation frequencies was derived using the estimator of Appendix B with the sequencing data of Table 1, and n = 9 and λ = 0.88. The minimum value of k in the estimator was taken as zero, because this library can potentially incorporate initial template sequences (see Appendix B). The values of n and λ were chosen as reasonable estimates that also coincide with the experimental amplification factor of ∼300. The accuracy of this set of probabilities, as measures of the paired incorporation frequencies of a real polymerase, depends strongly on the accuracy of the estimated values of n and λ. In contrast, the results of the simulation/analysis are relatively insensitive to these two parameters (Supplementary Appendix C.II). Statistical uncertainty in these values is reflected in the confidence intervals of Table 2.
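The statement that n = 9 and λ = 0.88 coincide with an amplification factor of ∼300 can be checked with a short calculation. The sketch below assumes the common simplification that each cycle multiplies the number of molecules by (1 + λ); the variable names are illustrative and do not come from the program.

```python
# Assumption: each PCR cycle multiplies the template count by (1 + efficiency).
n_cycles = 9        # n from the Table 3 footnote
efficiency = 0.88   # lambda from the Table 3 footnote

amplification = (1.0 + efficiency) ** n_cycles
print(f"Predicted amplification factor: {amplification:.0f}")
```

Under this assumption the predicted factor is a little under 300, in line with the experimental value quoted in the footnote.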

Taking a numerical approach, two previously reported algorithms (14,15) use a Monte Carlo procedure and DNA translation, but allow only a single scalar value for all mutation frequencies at all positions and only a small number of iterations, and do not track library diversity (these programs were written in the early 1990s, when computer power was much more limiting). To our knowledge, estimates of protein/nucleic acid library diversity cannot be practically obtained with any currently available methods, analytical or numerical. We describe here a computer program that calculates statistics for, and diversity of, nucleic acid and protein random point mutagenesis libraries. The frequency of all possible nucleic acid mutations at every position in a sequence can be specified independently, enabling one to make predictions about library composition based on the mutation frequencies derived or expected from almost any type of random mutagenesis scheme.

MATERIALS AND METHODS

Construction of a library of randomly mutated α-synuclein cDNA molecules

The ep-PCR method of Cadwell and Joyce was used (3,16). Template for the ep-PCR was generated by standard PCR (non-error generating; Pfu polymerase; forward primer: cgagctctccatatggatgtattcatgaaaggac; reverse primer: cgagctctcaagcttggatggaacatctgtcagc) of the α-synuclein cDNA. The template extends from 12 bp upstream of the α-synuclein start codon through ∼60 bp downstream of the stop codon, and was purified using agarose gel-electrophoresis. Approximately 30 ng (100 fmol) of template was used in a 100 μl ep-PCR [10 mM Tris, pH 8.3, 50 mM KCl, 0.01% gelatin, 0.2 mM each dATP and dGTP, 1 mM each dCTP and dTTP, 7 mM MgCl2, 0.5 mM MnCl2, 0.3 μM forward and reverse primers (same sequences as above), 5 U Taq polymerase]. Thirty reaction cycles [94°C 1 min, 66°C 1 min, 72°C 75 s; this number of cycles is probably excessive, given the maximum amplification under these conditions of 300-fold, but see (3) and Supplementary Appendix D] were performed followed by product purification (Qiagen PCR purification kit). The insert was digested with NdeI and HindIII, the product was purified again (as above), and the mixture was ligated (Takara ligation kit) into a digested and phosphatased pT7-7 E.coli expression vector (17). The ligated DNA was purified into a low-salt buffer for electroporation (Qiagen PCR purification kit). Ligated DNA (10 μl) was added to 30 μl of electrocompetent E.coli [strain DH10B prepared by the Dower method (18)] on ice, and two 20 μl electroporations were carried out at 4°C. The electroporator and cuvette were constructed in the laboratory, and provide equivalent efficiency to commercially available devices. The electroporator individually charges each of a set of 12 serially connected capacitors to several hundred volts using an electrophoresis power supply; this provides a total end-to-end voltage in the kilovolt range. The cuvette is Plexiglas with stainless steel electrodes (1.4 mm gap). 
An initial electric field of ∼14 kV/cm with an exponential decay (time constant of 5 ms) was measured with an oscilloscope. Immediately following electroporation, the cells were diluted into 2 ml SOC medium and incubated with shaking for 1 h at 37°C. Aliquots were then plated on LB-ampicillin media to determine the number of transformants, and the remainder was grown in several hundred milliliters of LB-ampicillin media overnight for a midi-prep. Two additional libraries with similar properties were generated during optimization of this mutagenesis procedure (dNTP, Mn²⁺ and Mg²⁺ concentrations were never varied). Total overall mutation frequencies of these three individual libraries are similar (range of 0.006–0.009 mutations per base pair). The specific overall DNA mutation frequencies from each of these three libraries also appear to be similar [significant differences between the libraries were only detected in the mutation pair T→A, A→T (χ² significance level <0.05)]. Therefore, sequence data from all three sub-libraries were combined.

The computational algorithm

Inputs

First, the initial DNA sequence is read in from a text file, as are a set of mutation parameters. In the case of ep-PCR, these parameters are 12 probabilities p_xy, which describe the frequency with which the polymerase creates a mutation by misincorporating base y across from base x. Note the distinction between these incorporation probabilities and the actual mutation which results (e.g. p_aa creates an A→T mutation; mutation of base x to base y is denoted by x→y). The estimation of these probabilities from DNA sequencing data is discussed in Appendix B. Also, the number of PCR cycles n and the PCR efficiency λ are specified (rough estimates of these parameters are sufficient, see Supplementary Appendix C.II). In non-amplifying mutagenesis methods such as oligonucleotide synthesis, the input mutagenic parameters are the direct mutation frequencies, for example, the frequency with which wild-type base A is mutated to a G. These frequencies may be different at each position in the sequence and for each of the 12 mutation types (i.e. A→G, A→C, A→T, etc.). In both cases, the number of sequences to generate is also specified, and the option is given to proceed with a numerical (see sections below) or analytical algorithm (Supplementary Appendix H). The analytical algorithm has the advantage of speed and can work with an arbitrarily large library, but cannot calculate the total library diversity, the distribution of the number of mutations per sequence, or the distribution of the number of times sequences in the library were repeatedly generated.

Numerical production of a sequence

The program generates a random number with the Mersenne Twister algorithm (19). This number is used to decide whether to accept one of the three possible mutations or to leave the base as wild-type. The decision to accept a mutation occurs with a specified probability, discussed in the next section. The program then repeats this procedure, using a new random number and the applicable acceptance frequencies, for the second base, the third base, and so on, to the end of the sequence. This single-pass mutagenesis is in contrast with the multi-pass, amplifying process of PCR. One object of Appendix A is to define a method by which the result of the latter process is simulated using the former.
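The single-pass procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program itself (which was written in C): the function names, the dictionary layout of the acceptance frequencies and the example sequence are all hypothetical. CPython's `random` module happens to use the Mersenne Twister as its generator.

```python
import random

BASES = "ACGT"

def mutate_once(wild_type, freqs, rng):
    """Return one mutated copy of wild_type (single pass, one draw per base).

    freqs[i] maps each replacement base at position i to its acceptance
    frequency; a missing entry is treated as frequency 0.
    """
    out = []
    for i, wt in enumerate(wild_type):
        r = rng.random()            # one random number per position
        cumulative = 0.0
        chosen = wt                 # default: leave the base as wild-type
        for base in BASES:
            if base == wt:
                continue
            cumulative += freqs[i].get(base, 0.0)
            if r < cumulative:      # r falls in this mutation's interval
                chosen = base
                break
        out.append(chosen)
    return "".join(out)

def uniform_freqs(wild_type, per_type):
    """Hypothetical helper: same frequency for all three changes everywhere."""
    return [{b: per_type for b in BASES if b != wt} for wt in wild_type]

rng = random.Random(12345)          # seeded Mersenne Twister generator
wt = "ATGGATGTATTCATGAAAGGA"        # short illustrative sequence
library = [mutate_once(wt, uniform_freqs(wt, 0.01), rng) for _ in range(1000)]
```

Because the frequencies are supplied per position and per replacement base, the same routine covers position-specific doping as well as a uniform error rate.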

How the acceptance frequencies are determined from the inputs

The method of determining the acceptance frequencies differs for non-amplifying methods (e.g. oligonucleotide synthesis, chemical mutagenesis) and amplifying methods (ep-PCR, E.coli mutator strain). For the case of non-amplifying methods, the acceptance frequencies are simply the inputs directly specified by the user, as described above. In the case of ep-PCR, the acceptance frequencies are determined for each new sequence as follows: First, the generation number of the sequence and the strand (top or bottom) of its zeroth generation ancestor are chosen randomly, according to their probability distributions (these terms and their probability distributions are defined in Appendix A). Inclusion of the zeroth generation (initial templates) in the probability distribution used to choose generation is optional. In some experimental procedures, these molecules are not incorporated into the library, for example, if they are lacking restriction sites (for subsequent cloning) introduced using the PCR primers. Having decided these two values, the appropriate mutation acceptance rates are derived by a series of matrix transformations on the input polymerase incorporation frequencies (see Appendix A for details). These frequencies are then held constant throughout production of a single sequence, because all bases of a sequence share the same generation and ancestor strand.

Protein translation

If protein diversity is being examined, the mutated DNA is translated. The program uses the standard genetic code by default, but by altering a single line of the program code one can specify an alternative codon translation, for example, as necessary with amber stop codon suppression or mitochondrial protein synthesis.
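A minimal sketch of the translation step follows. The table is the standard genetic code; the override shown for amber suppression (TAG read as glutamine) is only one illustrative choice, and the names `STANDARD_CODE`, `AMBER_SUPPRESSED` and `translate` are hypothetical rather than taken from the program.

```python
# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}

# Alternative table: the amber stop codon TAG is read as glutamine (assumption).
AMBER_SUPPRESSED = dict(STANDARD_CODE, TAG="Q")

def translate(dna, code=STANDARD_CODE):
    """Translate dna codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        residue = code[dna[i:i + 3]]
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)

print(translate("ATGGATGTATTCATGAAAGGA"))
print(translate("ATGTAGGGA"), translate("ATGTAGGGA", AMBER_SUPPRESSED))
```

Passing the codon table as an argument mirrors the idea in the text that a single definition controls which code is used.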

Further iterations

This process for mutating a wild-type sequence is then repeated exactly as above, the number of times (number of sequences) specified by the user. Each iteration begins anew with a wild-type sequence and is independent of any previous iteration.

Library storage in memory

Every mutated sequence is stored in a binary search tree (20) as it is generated. Each mutation requires two bytes of memory, the low bits (5 for protein, 2 for DNA) of which store the mutation type, while the remaining high bits are used to store the sequence position. This scheme limits protein sequences to ∼2000 amino acids [2^(16−5) = 2048 positions], but this should be adequate for almost all cases. DNA alone can be examined to lengths of ∼16 kb [2^(16−2) = 16 384 positions]. Sequences are unambiguously ordered in the tree using a scheme based on the number, position, and types of mutation they contain. The number of times each sequence has been generated is also recorded. Memory is allocated dynamically for all critical parts of the algorithm, so that the problem size accessible to the program is only limited by machine hardware. For simulations which have memory requirements beyond the physical memory of the computer, an efficient disk caching routine was written which significantly extends the upper size limit of practical simulations. Reliance on operating system disk paging for this purpose would have been unacceptably slow; the non-locality of reference of the binary tree would cause continuous seeking on the hard disk.
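The two-byte encoding can be illustrated with simple bit operations. The sketch below is an assumption about how such packing might look, with hypothetical function names; the ordering key is one possible scheme based on the number, position and type of mutations, and is not claimed to be the one the program uses.

```python
PROTEIN_TYPE_BITS = 5   # 2**5 = 32 type codes in the low bits
DNA_TYPE_BITS = 2       # 2**2 = 4 type codes in the low bits

def pack_mutation(position, mutation_type, type_bits):
    """Pack a mutation into a 16-bit word: position high, type low."""
    if not 0 <= mutation_type < (1 << type_bits):
        raise ValueError("mutation type does not fit in the low bits")
    if not 0 <= position < (1 << (16 - type_bits)):
        raise ValueError("position does not fit in the high bits")
    return (position << type_bits) | mutation_type

def unpack_mutation(word, type_bits):
    """Recover (position, mutation_type) from a packed 16-bit word."""
    return word >> type_bits, word & ((1 << type_bits) - 1)

def ordering_key(packed_mutations):
    """Hypothetical total order: fewer mutations first, then by packed words."""
    words = tuple(sorted(packed_mutations))
    return (len(words), words)

word = pack_mutation(86, 9, PROTEIN_TYPE_BITS)
print(word, unpack_mutation(word, PROTEIN_TYPE_BITS))
```

With 5 type bits, 11 bits remain for the position, which gives the 2048-residue limit quoted in the text; with 2 type bits, 14 bits remain, giving 16 384 positions.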

Library analysis

Statistics of the nucleic acid and/or protein libraries are collected during and following the sequence iterations. For diversity estimation, these are the number of unique sequences, mutations, and single point mutants as a function of the total number of sequences generated, the distribution of the number of mutations per sequence, the distribution of the number of times sequences in the library were repeatedly generated, and a listing of the most often generated sequences. Additional statistics include the specific mutation frequencies generated at each position, the total mutation frequencies summed over all positions, the number of extended sequences (protein stop to sense mutation), and the distribution and number of truncations. When truncated proteins are generated, the occurrence and sequence position is recorded, but the sequence is not used in other statistics. The user has the option of discarding or keeping sequences in which a stop codon has been mutated to a sense codon.
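The three diversity statistics can be accumulated while sequences are generated, as in the following sketch. It uses a Python dictionary in place of the binary search tree, represents each sequence as a tuple of hypothetical (position, new residue) pairs, and omits truncation and extension handling; all names are illustrative.

```python
from collections import Counter

def diversity_curve(mutant_lists, checkpoints):
    """Track diversity as a function of the number of sequences generated.

    mutant_lists yields, per generated sequence, an iterable of
    (position, new_residue) pairs; an empty iterable means wild-type.
    Returns (curve, counts), where curve holds tuples of
    (n_generated, unique_sequences, unique_point_mutations, unique_single_mutants).
    """
    counts = Counter()            # times each non-wild-type sequence was generated
    unique_mutations = set()      # position-specific point mutations seen so far
    unique_single = set()         # sequences carrying exactly one mutation
    wanted = set(checkpoints)
    curve = []
    for n, muts in enumerate(mutant_lists, start=1):
        key = tuple(sorted(muts))
        if key:                   # wild-type copies add no diversity
            counts[key] += 1
            unique_mutations.update(key)
            if len(key) == 1:
                unique_single.add(key[0])
        if n in wanted:
            curve.append((n, len(counts), len(unique_mutations), len(unique_single)))
    return curve, counts

toy = [((5, "L"),), (), ((5, "L"),), ((5, "L"), (9, "A")), ((9, "A"),)]
curve, counts = diversity_curve(toy, checkpoints=[2, 5])
print(curve)
print(counts.most_common(1))
```

The slope between successive checkpoints of such a curve gives the number of new unique elements found per additional sequence.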

Library size limitations

With current personal computers, physical memory is the most likely factor to set the upper practical limit on numerical simulation size, which will typically be on the order of 10⁹–10¹⁰ sequences for average problems. Although the scratch disk function extends the upper limit well beyond that which would be possible with physical memory alone, the nature of the disk cache routine and the slowness of disk access speed still set an upper boundary which is a function of available physical memory (the frequency with which the scratch function is used is inversely correlated with physical memory size). Modern supercomputers often provide tens of GBs of physical memory and a large amount of scratch space, which should extend the currently possible simulation size to at least 10¹¹ sequences. Continual growth in available computing power may make ribosomal display size libraries (10¹² or more sequences) accessible to the algorithm within several years. The computer program was written in C, compiled using Microsoft Visual C++, and is portable with only minor changes (for 64-bit integer support). The program is available on our laboratory website.

RESULTS AND DISCUSSION

Example 1: ep-PCR of the α-synuclein gene

An α-synuclein ep-PCR random mutagenesis library was created, and will be analyzed below in order to demonstrate the type of results which can be expected from the algorithm and the types of conclusions that can be drawn from them. The library contains ∼3.8 × 10⁶ clones with full length, potentially expressible, inserts. A number of these were sequenced (Table 1), and the data together with the estimator of Appendix B were used to derive polymerase incorporation frequencies (Table 3).

Table 1

Summary of α-synuclein DNA sequencing data from 89 sequencesᵃ

	Number of times observed as:
Wild-type base:	A	G	C	T
A	11 233	64	19	73
G	20	11 979	2	12
C	8	2	6296	13
T	44	8	24	5975

ᵃThese data were derived from the plus strand of all available sequences (89 sequences, 36 kb; see Methods section) including those selected for by expression of purifiable protein as well as unselected sequences. Significant differences between the observed mutation levels of selected and unselected sequences were not detected (all six χ² values with Yates correction were of P-value >0.3). Sequencing data were taken from the final base of the forward primer through the stop codon of the synuclein cDNA. We observed almost no mutations in the first 22 bp downstream of the start codon, which are complementary to the forward primer. The reverse primer is complementary to the 3′-untranslated region and, therefore, does not influence mutation frequencies in the coding sequence. The observed overall mutation frequency in the library was 0.0081 mutations per base, or an average of 3.2 mutations per DNA sequence.
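The summary figures in the footnote follow directly from the counts in Table 1, as the short calculation below shows. The dictionary layout and variable names are illustrative; the numbers are copied from the table.

```python
# Table 1 counts: wild-type base -> {observed base: number of times observed}.
observed = {
    "A": {"A": 11233, "G": 64, "C": 19, "T": 73},
    "G": {"A": 20, "G": 11979, "C": 2, "T": 12},
    "C": {"A": 8, "G": 2, "C": 6296, "T": 13},
    "T": {"A": 44, "G": 8, "C": 24, "T": 5975},
}
n_sequences = 89

total_bases = sum(sum(row.values()) for row in observed.values())
mutations = sum(
    count
    for wild_type, row in observed.items()
    for base, count in row.items()
    if base != wild_type          # off-diagonal entries are mutations
)

frequency = mutations / total_bases
per_sequence = mutations / n_sequences
print(total_bases, mutations, round(frequency, 4), round(per_sequence, 1))
```

The totals reproduce the ∼36 kb of sequence, the overall frequency of 0.0081 mutations per base and the average of 3.2 mutations per DNA sequence.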

Using these incorporation frequencies, we estimated the characteristics of the protein library (Table 2) by generating and analyzing ∼3.8 × 10⁶ sequences with the computer algorithm. A larger library containing 10⁹ sequences was also generated with the program (not experimentally), for the purpose of comparison. The DNA diversity and statistics will not be discussed, since we are primarily interested in the protein translation of this library. The simulation of the library required 2–3 min and 100 MB of memory for ∼3.8 × 10⁶ sequences (both protein and DNA; the size of our actual library) and 10 h with 1 GB of RAM and the disk cache function enabled for 10⁹ sequences (protein only), using a 1.8 GHz Pentium IV computer.

Table 2

Estimated characteristics of the α-synuclein protein libraryᵃ

Property	Averageᵈ	Standard deviationᵈ	99% confidence intervalᵉ
Truncations (%)	15	0.02	12.0–18.3
Stop codon mutated to a sense codonᵇ (%)	3	0.01	2.4–3.5
Clones producing full length α-synuclein	3.1 × 10⁶	1.0 × 10³	3.0 × 10⁶–3.2 × 10⁶
Protein mutation frequency per amino acidᶜ	0.016	0.00001	0.013–0.018
Average number of mutations per proteinᶜ	2.1	0.001	1.7–2.4
Unmutated sequences (wild-type)ᶜ (%)	16	0.017	12.7–21.3
Number of unique proteinsᶜ	1.3 × 10⁶	0.8 × 10³	1.0 × 10⁶–1.5 × 10⁶
Number of unique point mutationsᶜ,ᶠ	1990	13	1766–2074
Number of unique single point mutantsᶜ	1566	12	1438–1660

ᵃAll data result from simulations of 3 770 580 sequences, which is the approximate number of independent, full-length, in-frame inserts in our library. The polymerase incorporation paired frequencies of Table 3 were used centrosymmetrically [E(N)cen, see Appendix B]. As with the experimental library, initial template sequences were allowed to be candidates for incorporation into the simulated library. The values of n and λ for the simulations were identical to those used in the estimator, 9 and 0.88, respectively.

ᵇAs a percentage of the untruncated clones.

ᶜCalculated after removing truncated proteins and proteins with the stop codon mutated to a sense codon. However, these occurrences are counted towards the total number of sequences generated (see Figure 1 legend). Data are derived from amino acid 8 through the stop codon; the forward PCR primer is complementary to the DNA corresponding to the first 7 amino acids.

ᵈAverages and estimates of the SD are based on 10 independent simulations.

ᵉCalculated by the method of Supplementary Appendix C.I with 2000 bootstrap replicates of 89 sequences each. The sampling distribution of the statistics showed that they were unbiased (Supplementary Appendices C and G). In rare cases (37 of 2000), the bootstrap sample contained no G→C or C→G mutations, resulting in very small negative values in the corresponding elements of E(N)cen. These were adjusted to zero as discussed in Appendix B.

ᶠThe bootstrap sampling distribution of this property was significantly skewed to the left. The other properties had sampling distributions which were approximately normal.

The standard deviations of the average values are quite low (Table 2); with simulations of this size or greater, random fluctuations in the output of the algorithm can be considered negligible, relative to errors from other sources. The width of the 99% confidence intervals on the estimated properties leads us to conclude that a modest amount of sequencing (e.g. Table 1) is sufficient for the algorithm to produce very useful estimates. Figure 1A shows the number of unique sequences (black diamonds), unique point mutations (red squares), and unique single point mutants (green triangles) which were produced during the simulation, as a function of the number of sequences generated. The slopes of the lines in Figure 1A are shown with an identical color scheme in Figure 1B. As discussed in the introduction, this information is important for judging the appropriate screen size to use, and the efficiency of the screening procedure in examining sequence space. For example, if one were interested in looking at as many point mutations as possible, without regard for whether multiple point mutations existed per sequence, the red curves in Figure 1 would be appropriate. In the first several thousand clones examined, the efficiency ranges from >1.6 new mutations examined per clone, down to ∼0.05, and a total of ∼700 different mutations can be expected after 2500 clones. Screening for new point mutations beyond this first several thousand sequences is very inefficient; only infrequently will a new point mutation be looked at. In this system, there are 2527 possible single point mutations (133 mutatable amino acids times 19). Therefore, if the screening procedure is not trivial in terms of time and expense, and if one wished to look beyond the ∼25% of those mutations which can be efficiently examined in this library, a different experimental approach to creating the library might be necessary. 
However, almost all of the possible point mutations would be accessible in a full screen of a library with the mutation frequencies shown in Table 3, and which contained 10⁹ independent clones. This is relevant if the screen/selection size is not limiting (e.g. phage display). If one wishes to consider only unique single point mutants, the situation is similar (Figure 1, green), but somewhat higher numbers of clones must be screened relative to the unrestricted case (Figure 1, red) in order to get equal coverage. A similar analysis of the number of unique sequence variants (Figure 1, black), shows that a library of 10⁹ clones can be mined for new variants with comparative efficiency throughout its entirety (at least one in five clones examined are new sequences).

Figure 1

α-Synuclein ep-PCR protein library diversity (A) and screening efficiency (B) as a function of the library/screen size. (A) The black diamonds show the total number of unique sequences (non-wild-type) generated (left-hand y-axis, logarithmically spaced). The red squares indicate the number of unique point mutations generated; a unique point mutation is defined as a particular amino acid change at a particular position of the sequence, which has not occurred before. A single sequence may contain multiple unique point mutations. The green triangles refer to the number of unique single point mutants generated; a single point mutant is defined as a sequence with only a single amino-acid mutation, occurring at a specified sequence position. The general term ‘unique elements’ is used on the y-axis and encompasses all three of the above terms. The theoretical maximum number of unique point mutations and unique single point mutants is 19 times the sequence length; both are referred to the right-hand, linear, y-axis. Proteins with truncations or extensions (mutated stop codon) are not included on either y-axis, but are counted as generated sequences in the x-axis. This mimics an actual screen: significantly truncated or extended proteins are often not effective candidates, yet they must still be screened because they cannot be easily removed from the library. Points with abscissa values below 40 000 are averages of values from three independent simulations. Sampling without replacement is assumed (see Supplementary Appendix E). (B) The efficiency of sequence space coverage as a function of number of sequences screened. The efficiency is the expected number of unique elements [a new sequence variant (black curve), new point mutations (red curve), or a new single point mutant (green curve)] which will be covered by screening one additional clone. These curves are the derivatives of the curves in (A). Values may be greater than one for the efficiency of discovering new point mutations (red curve), because multiple new mutations can occur in a single sequence. Note that while our actual experimental library contains ∼3.8 × 10⁶ sequences, the numerical simulations here were carried out for up to 10⁹ sequences, for the purpose of example. The parameters used in the simulations are the same as those described in Table 2.

The relative frequencies of each type of protein mutation from each wild-type amino acid are shown in Figure 2. As expected, some mutations are much more probable than others. One reason for this is that many protein mutations require two or three DNA base changes in a single codon, which is an infrequent event in ep-PCR. Other contributing factors are that certain DNA mutations are much more frequent than others (Table 3) (3,13), and that amino acids have varying degrees of degeneracy with respect to the genetic code. This mutation bias underlies, in part, the limitations on screening efficiency shown in Figure 1. The bias is further underscored by the output of the program on the most frequently generated protein sequences, which are a pool of several hundred single point mutants. The most common single point mutant, F87L, exists more than 10⁶ times in a library of size 10⁹, and the top 100 sequences make up 7% of all 10⁹ generated sequences. The computer program also reports the number of occurrences of each mutation type at each position of the sequence (data not shown). This allows for the examination of region or codon specific bias.

Figure 2

Mutation frequencies and bias as a function of wild-type amino acid in an α-synuclein ep-PCR library. A grey-scale is used to show the number of times (labeled every 10 000 units) a particular wild-type amino acid (x-axis) was observed to be changed to a particular mutated amino acid (y-axis) after numerical simulation of 3 770 580 sequences (see Table 2 legend for details), normalized (divided) by the number of times the wild-type amino acid appears in the sequence. Sense mutations are recorded only in untruncated, unextended sequences. Positions in the graph corresponding to wild-type (no mutation) have values of zero. Both the y- and x-axes are arranged in order of increasing hydropathy index (21). Mutation to a stop codon is represented by an asterisk. α-Synuclein does not contain cysteine, arginine, or tryptophan; therefore, these amino acids do not appear on the x-axis. The graph was created using HeatMap Builder.

Another way to examine the data in Figure 2 is by grouping amino acids with similar properties. Specifically, the x- and y-axes in Figure 2 are arranged in order of increasing hydropathy index (21). Consider Figure 2 divided into four quadrants: hydrophobic→hydrophobic, hydrophobic→ hydrophilic, hydrophilic→hydrophobic, hydrophilic→hydrophilic. All four are reasonably well populated; by this measure, the bias is not nearly as great as in the above analysis. In certain cases, e.g. optimization of structural stability, this latter approach may be the most appropriate bias indicator. In others, such as optimization of a catalytic site, the former may be more relevant.

Example 2: synthetic NNS oligonucleotides

Simulations were performed of synthetic NNS oligonucleotides of various lengths, in multiples of three, and after translation the peptide libraries were examined. The wild-type DNA sequence was specified as the appropriately sized poly-A, with 25% mutation frequencies to each of the three other bases (for N), or a 50% chance each of mutating to G or C (for S). Protein diversity, in terms of number of unique sequences, as a function of library size, is shown in Figure 3A. We simulated the lesser of 10 times the theoretical number of unique sequences or 10⁹ sequences, per peptide length; the latter typically required 8 h on the machine described above. These statistics are useful for estimating the number of peptides which need to be screened in order to approach a certain coverage of the available diversity. The slopes of the curves in Figure 3A are shown in Figure 3B, describing the efficiency of continued screening, as a function of number of clones screened. Note that because the NNS library is well-distributed, with relatively little mutation bias, the curves of Figure 3A are regular and evenly spaced (compare Figure 1A; this suggests that an analytic fit of this empirical data may be usefully extrapolated in order to solve larger, otherwise inaccessible problems). Still, in most cases the point at which efficiency drops off severely (Figure 3B) would not be entirely predictable without some sort of diversity sampling calculation. On the other hand, with a very large unbiased library, such as the NNS 10mer library, diversity predictions are superfluous; we know that essentially every additional clone examined will be unique.
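Some properties of NNS libraries can be worked out directly from the genetic code, which helps in reading the simulated curves. The sketch below enumerates the 32 NNS codons, counts how many encode each amino acid, and computes the expected number of unique peptides after a given number of draws. The expectation formula assumes independent draws with replacement from the NNS codon distribution, a simplification that is not the paper's numerical procedure; all names are hypothetical, and the enumeration is only practical for short peptides.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}

# NNS: N = A, C, G or T at the first two positions; S = C or G at the third.
nns_codons = ["".join(c) for c in product("ACGT", "ACGT", "CG")]
aa_counts = Counter(CODE[c] for c in nns_codons)
n_stop = aa_counts.pop("*")       # stop codons among the NNS codons

def stop_free_fraction(length):
    """Fraction of NNS sequences of `length` codons without a stop codon."""
    return (1.0 - n_stop / len(nns_codons)) ** length

def expected_unique(length, n_draws):
    """Expected unique stop-free peptides after n_draws independent draws.

    Assumption: draws are independent and with replacement. Enumerates all
    20**length peptides, so keep length small (about 3 or less).
    """
    total = 0.0
    for peptide in product(aa_counts.values(), repeat=length):
        p = 1.0
        for codons_for_residue in peptide:
            p *= codons_for_residue / len(nns_codons)
        total += 1.0 - (1.0 - p) ** n_draws
    return total

print(len(nns_codons), n_stop, len(aa_counts))
print(stop_free_fraction(10), 20 ** 10)
print(expected_unique(2, 1000))
```

The enumeration shows that the 32 NNS codons contain a single stop codon (TAG) and cover all 20 amino acids. A 10mer therefore has 20¹⁰ (about 10¹³) possible peptides, far more than the 10⁹ sequences simulated, which is consistent with the remark that essentially every additional clone from that library is unique.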

Figure 3

Peptide diversity and screening efficiency of NNS (S = G or C) oligonucleotide libraries. (A) Number of unique peptide sequences generated (logarithmically spaced y-axis) as a function of the number of NNS DNA sequences produced (log-spaced x-axis). Sequences which contained a stop codon were discarded during the simulation. Results with varying numbers of NNS triplets are labeled with amino acid length. Data for all 1 through 10mer peptides are shown, but 8–10, which have a much higher diversity than the maximum number of simulated sequences, are not separately discernible. Data points below 1000 represent average values from three separate runs, to reduce the noise inherent in very small simulations. Sampling without replacement is assumed (see Supplementary Appendix E). (B) Efficiency (y-axis; see Figure 1 legend for description) as a function of number of sequences screened (log-spaced x-axis). Colors are used to distinguish some of the overlapping curves (6mer, light green; 7mer, dark green; 8mer, blue; 9mer, pink; 10mer, red). The colors in (A) follow an identical scheme. The abrupt decrease in noise level at 100 000 in (B) is due to an increase in the Δx used to calculate slope from the points in (A), and is not inherent in the data. Note the decrease in initial efficiency [visible in (B), especially in the less noisy, colored, curves], as the size of the peptide increases: longer sequences have a greater chance of being eliminated due to incorporation of a nonsense codon.
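The effect of Δx on the noise in (B) can be illustrated with a finite-difference slope calculation. The sketch below assumes that efficiency is the slope of unique sequences versus sequences produced on linear axes; the function name and the `stride` parameter are hypothetical and are not taken from the program.

```python
def finite_difference_efficiency(xs, ys, stride=1):
    """Estimate slopes dy/dx between points that are `stride` positions apart.

    xs: numbers of sequences produced (increasing).
    ys: numbers of unique sequences observed at each x.
    Returns (midpoint_x, slope) pairs. A larger stride means a larger
    delta-x per slope estimate, which averages out sampling noise.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    slopes = []
    for i in range(len(xs) - stride):
        dx = xs[i + stride] - xs[i]
        dy = ys[i + stride] - ys[i]
        slopes.append((xs[i] + dx / 2, dy / dx))
    return slopes


if __name__ == "__main__":
    # Made-up curve, for illustration only.
    xs = [0, 10, 20, 30, 40]
    ys = [0, 9, 17, 24, 30]
    print(finite_difference_efficiency(xs, ys, stride=1))
    print(finite_difference_efficiency(xs, ys, stride=2))
```

Under the same assumption, and with uniform NNS sampling, each codon is TAG with probability 1/32, so a fraction (31/32)^n of n-codon sequences survives translation. This is the expected initial slope and is consistent with the lower initial efficiency seen for longer peptides.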

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

23 in total

1. Optimizing nucleotide mixtures to encode specific subsets of amino acids for semi-random mutagenesis.

Authors: A P Arkin; D C Youvan
Journal: Biotechnology (N Y) Date: 1992-03

2. Randomization of genes by PCR mutagenesis.

Authors: R C Cadwell; G F Joyce
Journal: PCR Methods Appl Date: 1992-08

3. Polymerase chain reaction: replication errors and reliability of gene diagnosis.

Authors: M Krawczak; J Reiss; J Schmidtke; U Rösler
Journal: Nucleic Acids Res Date: 1989-03-25 Impact factor: 16.971

4. High efficiency transformation of E. coli by high voltage electroporation.

Authors: W J Dower; J F Miller; C W Ragsdale
Journal: Nucleic Acids Res Date: 1988-07-11 Impact factor: 16.971

5. Modeling the polymerase chain reaction.

Authors: G Weiss; A von Haeseler
Journal: J Comput Biol Date: 1995 Impact factor: 1.479

6. The polymerase chain reaction and branching processes.

Authors: F Sun
Journal: J Comput Biol Date: 1995 Impact factor: 1.479

7. Biased random mutagenesis of peptides: determination of mutation frequency by computer simulation.

Authors: R Ophir; J M Gershoni
Journal: Protein Eng Date: 1995-02

8. RAMHA: a PC-based Monte-Carlo simulation of random saturation mutagenesis.

Authors: D P Siderovski; T W Mak
Journal: Comput Biol Med Date: 1993-11 Impact factor: 4.589

9. A simple method for displaying the hydropathic character of a protein.

Authors: J Kyte; R F Doolittle
Journal: J Mol Biol Date: 1982-05-05 Impact factor: 5.469

Review 10. Chemical and biochemical strategies for the randomization of protein encoding DNA sequences: library construction methods for directed evolution.

Authors: Cameron Neylon
Journal: Nucleic Acids Res Date: 2004-02-27 Impact factor: 16.971

6 in total

Review 1. Molecular engineering of antibodies for therapeutic and diagnostic purposes.

Authors: Frédéric Ducancel; Bruno H Muller
Journal: MAbs Date: 2012-07-01 Impact factor: 5.857

2. Relationships between the sequence of alpha-synuclein and its membrane affinity, fibrillization propensity, and yeast toxicity.

Authors: Michael J Volles; Peter T Lansbury
Journal: J Mol Biol Date: 2006-12-21 Impact factor: 5.469