Michael J Volles, Peter T Lansbury.
Abstract
A computer program for the generation and analysis of in silico random point mutagenesis libraries is described. The program operates by mutagenizing an input nucleic acid sequence according to mutation parameters specified by the user for each sequence position and type of point mutation. The program can mimic almost any type of random mutagenesis library, including those produced via error-prone PCR (ep-PCR), mutator Escherichia coli strains, chemical mutagenesis, and doped or random oligonucleotide synthesis. The program analyzes the generated nucleic acid sequences and/or the associated protein library to produce several estimates of library diversity (number of unique sequences, point mutations, and single point mutants) and the rate of saturation of these diversities during experimental screening or selection of clones. This information allows one to select the optimal screen size for a given mutagenesis library, necessary to efficiently obtain a certain coverage of the sequence-space. The program also reports the abundance of each specific protein mutation at each sequence position, which is useful as a measure of the level and type of mutation bias in the library. Alternatively, one can use the program to evaluate the relative merits of preexisting libraries, or to examine various hypothetical mutation schemes to determine the optimal method for creating a library that serves the screen/selection of interest. Simulated libraries of at least 10^9 sequences are accessible by the numerical algorithm with currently available personal computers; an analytical algorithm is also available which can rapidly calculate a subset of the numerical statistics in libraries of arbitrarily large size. A multi-type double-strand stochastic model of ep-PCR is developed in an appendix to demonstrate the applicability of the algorithm to amplifying mutagenesis procedures. Estimators of DNA polymerase mutation-type-specific error rates are derived using the model.
Analyses of an alpha-synuclein ep-PCR library and NNS synthetic oligonucleotide libraries are given as examples.
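The mutagenize-then-count idea in the abstract can be illustrated with a minimal sketch. This is not the authors' program: it assumes one uniform substitution rate for every position and mutation type (the real program takes per-position, per-type parameters), works at the nucleotide level only, and all function names are hypothetical.

```python
import random

BASES = "ACGT"


def mutagenize(template, rate, rng):
    """Copy `template`, substituting each base independently with probability `rate`.

    Assumption: a substituted base is drawn uniformly from the three other bases.
    """
    out = []
    for base in template:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_library(template, rate, size, seed=0):
    """Generate `size` mutagenized copies of `template` with a seeded RNG."""
    rng = random.Random(seed)
    return [mutagenize(template, rate, rng) for _ in range(size)]


def diversity(library, template):
    """Return (unique non-wild-type sequences, unique point mutations,
    unique single point mutants) for a nucleotide library."""
    unique_sequences = set(library) - {template}
    unique_point_mutations = {
        (i, new)
        for seq in unique_sequences
        for i, (old, new) in enumerate(zip(template, seq))
        if old != new
    }
    single_point_mutants = {
        seq
        for seq in unique_sequences
        if sum(old != new for old, new in zip(template, seq)) == 1
    }
    return len(unique_sequences), len(unique_point_mutations), len(single_point_mutants)
```

Calling `diversity(simulate_library("ATGGATGTATTC", 0.05, 1000), "ATGGATGTATTC")` returns the three diversity counts for a toy 12-base template; the template and rate here are arbitrary illustrations, not values from the paper.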
Year: 2005 PMID: 15990391 PMCID: PMC1166583 DOI: 10.1093/nar/gki669
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Table 3. Estimated polymerase paired incorporation frequencies for the α-synuclein ep-PCR DNA library^a
| Specific event | Probability |
|---|---|
| | 0.00161 |
| | 0.00037 |
| | 0.00121 |
| | 0.00026 |
| | 0.00005 |
| | 0.00043 |
^a This set of six paired mutagenic incorporation frequencies was derived using the estimator of Appendix B with the sequencing data of Table 1, and n = 9 and λ = 0.88. The minimum value of k in the estimator was taken as zero, because this library can potentially incorporate initial template sequences (see Appendix B). The values of n and λ were chosen as reasonable estimates that also coincide with the experimental amplification factor of ∼300. The accuracy of this set of probabilities, as measures of the paired incorporation frequencies of a real polymerase, depends strongly on the accuracy of the estimated values of n and λ. In contrast, the results of the simulation/analysis are relatively insensitive to these two parameters (Supplementary Appendix C.II). Statistical uncertainty in these values is reflected in the confidence intervals of Table 2.
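The statement that n = 9 and λ = 0.88 coincide with an amplification factor of about 300 can be checked with a short calculation. The sketch below assumes the common PCR efficiency model in which each cycle multiplies the number of molecules by (1 + λ) on average; the excerpt does not reproduce the model of Appendix B, so this formula and the function name are assumptions.

```python
def amplification_factor(n_cycles, efficiency):
    """Expected fold amplification after `n_cycles` cycles.

    Assumption: each cycle multiplies the molecule count by (1 + efficiency).
    """
    return (1.0 + efficiency) ** n_cycles


# n = 9 and lambda = 0.88 are the values quoted in the footnote above.
factor = amplification_factor(9, 0.88)
print(f"{factor:.1f}")  # about 293, close to the reported ~300
```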
Table 1. Summary of α-synuclein DNA sequencing data from 89 sequences^a
| Wild-type base | Observed as A | Observed as G | Observed as C | Observed as T |
|---|---|---|---|---|
| A | 11 233 | 64 | 19 | 73 |
| G | 20 | 11 979 | 2 | 12 |
| C | 8 | 2 | 6296 | 13 |
| T | 44 | 8 | 24 | 5975 |
^a These data were derived from the plus strand of all available sequences (89 sequences, 36 kb; see Methods section), including those selected for by expression of purifiable protein as well as unselected sequences. No significant differences were detected between the observed mutation levels of selected and unselected sequences (all six χ² tests with Yates correction gave P-values >0.3). Sequencing data were taken from the final base of the forward primer through the stop codon of the synuclein cDNA. We observed almost no mutations in the first 22 bp downstream of the start codon, which are complementary to the forward primer. The reverse primer is complementary to the 3′-untranslated region and, therefore, does not influence mutation frequencies in the coding sequence. The observed overall mutation frequency in the library was 0.0081 mutations per base, or an average of 3.2 mutations per DNA sequence.
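The overall mutation frequency quoted in the footnote can be reproduced directly from the counts in the table. The sketch below simply sums the off-diagonal (mutated) and total counts; the variable names are illustrative only.

```python
# Observed counts from the sequencing table: COUNTS[wild_type][observed].
COUNTS = {
    "A": {"A": 11233, "G": 64, "C": 19, "T": 73},
    "G": {"A": 20, "G": 11979, "C": 2, "T": 12},
    "C": {"A": 8, "G": 2, "C": 6296, "T": 13},
    "T": {"A": 44, "G": 8, "C": 24, "T": 5975},
}
N_SEQUENCES = 89  # number of sequenced clones, from the table caption

# Every base that was read, mutated or not.
total_bases = sum(sum(row.values()) for row in COUNTS.values())

# Off-diagonal entries are the observed point mutations.
mutations = sum(
    n for wild, row in COUNTS.items() for observed, n in row.items() if observed != wild
)

frequency_per_base = mutations / total_bases
mutations_per_sequence = mutations / N_SEQUENCES

print(total_bases, mutations)             # 35772 289
print(round(frequency_per_base, 4))       # 0.0081
print(round(mutations_per_sequence, 1))   # 3.2
```

The totals agree with the footnote: about 36 kb sequenced, 0.0081 mutations per base, and 3.2 mutations per sequence.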
Table 2. Estimated characteristics of the α-synuclein protein library^a
| Property | Average | Standard deviation | 99% Confidence interval |
|---|---|---|---|
| Truncations (%) | 15 | 0.02 | 12.0–18.3 |
| Stop codon mutated to a sense codon | 3 | 0.01 | 2.4–3.5 |
| Clones producing full length α-synuclein | 3.1 × 10^6 | 1.0 × 10^3 | 3.0 × 10^6–3.2 × 10^6 |
| Protein mutation frequency per amino acid | 0.016 | 0.00001 | 0.013–0.018 |
| Average number of mutations per protein | 2.1 | 0.001 | 1.7–2.4 |
| Unmutated sequences (wild-type) | 16 | 0.017 | 12.7–21.3 |
| Number of unique proteins | 1.3 × 10^6 | 0.8 × 10^3 | 1.0 × 10^6–1.5 × 10^6 |
| Number of unique point mutations | 1990 | 13 | 1766–2074 |
| Number of unique single point mutants | 1566 | 12 | 1438–1660 |
^a All data result from simulations of 3 770 580 sequences, which is the approximate number of independent, full-length, in-frame inserts in our library. The polymerase incorporation paired frequencies of Table 3 were used centrosymmetrically [E(N)cen, see Appendix B]. As with the experimental library, initial template sequences were allowed to be candidates for incorporation into the simulated library. The values of n and λ for the simulations were identical to those used in the estimator, 9 and 0.88, respectively.
^b As a percentage of the untruncated clones.
^c Calculated after removing truncated proteins and proteins with the stop codon mutated to a sense codon. However, these occurrences are counted towards the total number of sequences generated (see Figure 1 legend). Data are derived from amino acid 8 through the stop codon; the forward PCR primer is complementary to the DNA corresponding to the first 7 amino acids.
^d Averages and estimates of the SD are based on 10 independent simulations.
^e Calculated by the method of Supplementary Appendix C.I with 2000 bootstrap replicates of 89 sequences each. The sampling distribution of the statistics showed that they were unbiased (Supplementary Appendices C and G). In rare cases (37 of 2000), the bootstrap sample contained no G→C or C→G mutations, resulting in very small negative values in the corresponding elements of E(N)cen. These were adjusted to zero as discussed in Appendix B.
^f The bootstrap sampling distribution of this property was significantly skewed to the left. The other properties had sampling distributions that were approximately normal.
Figure 1. α-Synuclein ep-PCR protein library diversity (A) and screening efficiency (B) as a function of the library/screen size. (A) The black diamonds show the total number of unique sequences (non-wild-type) generated (left-hand y-axis, logarithmically spaced). The red squares indicate the number of unique point mutations generated; a unique point mutation is defined as a particular amino acid change at a particular position of the sequence, which has not occurred before. A single sequence may contain multiple unique point mutations. The green triangles refer to the number of unique single point mutants generated; a single point mutant is defined as a sequence with only a single amino-acid mutation, occurring at a specified sequence position. The general term ‘unique elements’ is used on the y-axis and encompasses all three of the above terms. The theoretical maximum number of unique point mutations and unique single point mutants is 19 times the sequence length; both are referred to the right-hand, linear y-axis. Proteins with truncations or extensions (mutated stop codon) are not included on either y-axis, but are counted as generated sequences in the x-axis. This mimics an actual screen: significantly truncated or extended proteins are often not effective candidates, yet they must still be screened because they cannot be easily removed from the library. Points with abscissa values below 40 000 are averages of values from three independent simulations. Sampling without replacement is assumed (see Supplementary Appendix E). (B) The efficiency of sequence space coverage as a function of number of sequences screened. The efficiency is the expected number of unique elements [a new sequence variant (black curve), new point mutations (red curve), or a new single point mutant (green curve)] which will be covered by screening one additional clone. These curves are the derivatives of the curves in (A). Values may be greater than one for the efficiency of discovering new point mutations (red curve), because multiple new mutations can occur in a single sequence. Note that while our actual experimental library contains ∼3.8 × 10^6 sequences, the numerical simulations here were carried out for up to 10^9 sequences, for the purpose of example. The parameters used in the simulations are the same as those described in Table 2.
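The relationship between the diversity curve and the efficiency curve (its derivative) can be sketched with finite differences. The code below is only an illustration of that relationship, not the authors' implementation: it counts unique sequences in the order they are screened and takes the slope over a hypothetical `step` parameter, which plays the role of Δx.

```python
def cumulative_unique(samples):
    """Number of distinct items seen after each successive sample."""
    seen = set()
    counts = []
    for item in samples:
        seen.add(item)
        counts.append(len(seen))
    return counts


def efficiency(counts, step=1):
    """Finite-difference slope of the diversity curve.

    Each value is the average number of new unique items gained per
    additional clone over a window of `step` clones.
    """
    return [
        (counts[i + step] - counts[i]) / step
        for i in range(0, len(counts) - step, step)
    ]


screened = ["a", "b", "a", "c", "c", "d"]   # toy screening order
curve = cumulative_unique(screened)          # [1, 2, 2, 3, 3, 4]
print(efficiency(curve))                     # [1.0, 0.0, 1.0, 0.0, 1.0]
print(efficiency(curve, step=2))             # [0.5, 0.5]
```

A larger `step` averages over more clones and therefore gives a smoother, less noisy efficiency estimate. When the counted elements are point mutations rather than whole sequences, one clone can add several new elements, so the slope can exceed one.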
Figure 2. Mutation frequencies and bias as a function of wild-type amino acid in an α-synuclein ep-PCR library. A grey-scale is used to show the number of times (labeled every 10 000 units) a particular wild-type amino acid (x-axis) was observed to be changed to a particular mutated amino acid (y-axis) after numerical simulation of 3 770 580 sequences (see Table 2 legend for details), normalized (divided) by the number of times the wild-type amino acid appears in the sequence. Sense mutations are recorded only in untruncated, unextended sequences. Positions in the graph corresponding to wild-type (no mutation) have values of zero. Both the y- and x-axes are arranged in order of increasing hydropathy index (21). Mutation to a stop codon is represented by an asterisk. α-Synuclein does not contain cysteine, arginine, or tryptophan; therefore, these amino acids do not appear on the x-axis. The graph was created using HeatMap Builder.
Figure 3. Peptide diversity and screening efficiency of NNS (S = G or C) oligonucleotide libraries. (A) Number of unique peptide sequences generated (logarithmically spaced y-axis) as a function of the number of NNS DNA sequences produced (log-spaced x-axis). Sequences which contained a stop codon were discarded during the simulation. Results with varying numbers of NNS triplets are labeled with amino acid length. Data for all 1 through 10mer peptides are shown, but 8–10, which have a much higher diversity than the maximum number of simulated sequences, are not separately discernible. Data points below 1000 represent average values from three separate runs, to reduce the noise inherent in very small simulations. Sampling without replacement is assumed (see Supplementary Appendix E). (B) Efficiency (y-axis; see Figure 1 legend for description) as a function of number of sequences screened (log-spaced x-axis). Colors are used to distinguish some of the overlapping curves (6mer, light green; 7mer, dark green; 8mer, blue; 9mer, pink; 10mer, red). The colors in (A) follow an identical scheme. The abrupt decrease in noise level at 100 000 in (B) is due to an increase in the Δx used to calculate slope from the points in (A), and is not inherent in the data. Note the decrease in initial efficiency [visible in (B), especially in the less noisy, colored, curves], as the size of the peptide increases: longer sequences have a greater chance of being eliminated due to incorporation of a nonsense codon.
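The drop in initial efficiency with peptide length follows from the composition of the NNS codon set, which can be enumerated with the standard genetic code. The sketch below assumes all NNS codons are equally likely (an idealization of real oligonucleotide synthesis); the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# NNS: any base at positions 1 and 2, G or C at position 3.
NNS_CODONS = [codon for codon in CODON_TABLE if codon[2] in "GC"]
NNS_STOPS = [codon for codon in NNS_CODONS if CODON_TABLE[codon] == "*"]
NNS_AMINO_ACIDS = {CODON_TABLE[codon] for codon in NNS_CODONS} - {"*"}


def stop_free_fraction(length):
    """Fraction of NNS sequences of `length` codons containing no stop codon.

    Assumption: every NNS codon is equally likely at every position.
    """
    return (1 - len(NNS_STOPS) / len(NNS_CODONS)) ** length


print(len(NNS_CODONS), NNS_STOPS, len(NNS_AMINO_ACIDS))  # 32 ['TAG'] 20
for k in (1, 5, 10):
    print(k, round(stop_free_fraction(k), 3))
```

Under this idealization, one of the 32 NNS codons is a stop codon, so about 97% of 1mers but only about 73% of 10mers survive the stop-codon filter.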