SEWAL: an open-source platform for next-generation sequence analysis and visualization.

Jason N Pitt, Indika Rajapakse, Adrian R Ferré-D'Amaré.

Abstract

Next-generation DNA sequencing platforms provide exciting new possibilities for in vitro genetic analysis of functional nucleic acids. However, the size of the resulting data sets presents computational and analytical challenges. We present an open-source software package that employs a locality-sensitive hashing algorithm to enumerate all unique sequences in an entire Illumina sequencing run (∼10⁸ sequences). The algorithm results in quasilinear time processing of entire Illumina lanes (∼10⁷ sequences) on a desktop computer in minutes. To facilitate visual analysis of sequencing data, the software produces three-dimensional scatter plots similar in concept to Sewall Wright and John Maynard Smith's adaptive or fitness landscape. The software also contains functions that are particularly useful for doped selections such as mutation frequency analysis, information content calculation, multivariate statistical functions (including principal component analysis), sequence distance metrics, sequence searches and sequence comparisons across multiple Illumina data sets. Source code, executable files and links to sample data sets are available at http://www.sourceforge.net/projects/sewal.

Year:  2010        PMID: 20693400      PMCID: PMC3001052          DOI: 10.1093/nar/gkq661

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Recent advances in DNA sequencing technology afford new opportunities and new challenges in nucleic acid sequence analysis (1). The ability to routinely generate gigabases of data in laboratories outside of sequencing centers has necessitated the advent of computationally and monetarily inexpensive software. We present here an open-source software package entitled SEWAL (Sequence Evolution With Adaptive Landscapes) designed to analyze Illumina deep-sequencing data and display it in the form of interactive three-dimensional (3D) scatter plots. To generate these plots, SEWAL also features a variety of sequence manipulation and analysis options designed to run with high speed on desktop computers. SEWAL is currently compiled for the 64-bit Apple Macintosh operating system; however, the code is open-source (http://www.sourceforge.net/projects/sewal), written in GNU C++, and should be easily compiled on any 64-bit operating system that supports OpenGL. SEWAL (Figure 1) analyzes deep-sequencing data sets generated from in vitro selection pools of functional nucleic acids and performs nearest-neighbor analysis of all sequences in a sequencing run using a locality-sensitive hashing (LSH) algorithm (2,3). SEWAL then tabulates the frequency of every observed sequence and displays the data from the entire sequencing run in the form of a 3D scatter plot, hereafter referred to as a mutant spectrum (4). The goal in this type of analysis is to leverage massive sequencing as a tool to perform functional genetics of nucleic acids in vitro. This idea is based on the assumption that the frequency of a sequence in an in vitro selection experiment is proportional to the fitness of that sequence. Using SEWAL and high-throughput biochemical analysis, we have recently shown this assumption to be true for one class of catalytic RNA molecules (J. N. Pitt and A. R. Ferré-D’Amaré, submitted for publication). Therefore, data from SEWAL can be used to construct empirical fitness landscapes.
A fitness landscape as originally conceived by Wright is analogous to a topographic map, depicting differing allele combinations in the horizontal axes with the height of any given allele combination representing Darwinian fitness of that allele combination in the vertical axis (5). The concept was extended to macromolecules by Maynard Smith (6). In this case, a given macromolecule sequence (allele) is characterized by a specific combination of nucleotide or amino acid substitutions (i.e. a haplotype) relative to a reference sequence. These fitness landscapes have attracted considerable theoretical interest, but constructing them experimentally involves phenotypic analysis of all of the alleles in the mutant spectrum, which can be technically challenging (7). In molecular evolution, the fitness landscape is a hyper-dimensional object with its dimensionality being a function of the length of the polymer in question (8). If one were able to generate fitness landscapes for a sequence of interest under a variety of environmental conditions, these could be used to infer accessible evolutionary paths and even predict the course of evolution for that sequence. To this end, SEWAL can compare the frequency of all unique sequences across multiple next-generation sequencing runs. While our application (J. N. Pitt and A. R. Ferré-D’Amaré, submitted for publication) has been limited to analyzing in vitro selection pools of a catalytic RNA (9), SEWAL should prove useful for any assay where the copy number of unique nucleic acids in a complex data set may be informative, for instance, viral quasispecies evolution in response to immune selection or drug therapy, or differential mRNA expression (10).
Figure 1.

SEWAL pipeline for sequence analysis. SEWAL currently reads qseq.txt files generated by the Illumina pipeline. SEWAL commands are shown in italics. File types are shown by white boxes. All files are tab-delimited text files with defined formatting. Inset graphic is a screenshot of the SEWAL interactive graphics window.

MATERIALS AND METHODS

The primary computational task in our analysis is the de novo comparison of every sequence in a data set to every other sequence in the data set in order to determine a frequency for each observed sequence. To determine the copy number of individual sequences in the data set, sequences first pass quality filtering and are then compared to each other to determine the frequency of each unique sequence. A brute-force comparison of all members in the set would result in a worst-case complexity of O(νn²), where ν is the length of the sequence and n is the total number of sequences in the data set. In practice, these searches become time limiting on single-processor systems for data sets exceeding 2 × 10⁵ sequences (ν = 58). LSH is a technique that preprocesses the set using k hash functions (h) and hash tables (3). (A hash function is a deterministic function that reduces a high dimensional data set, such as a long polynucleotide sequence, to a single data point, such as an integer.) The hashing step can be performed with random strings or with a search string of interest (q), with the latter hash value being a distance metric to q (defined in the sewal.prefs file or subsequently using the newdim command). Hashing is performed (k = 4) and the set sequentially sorted using a stable sort for all stored hash values, O[knν + kn log(kn)] (11). The probability of collision of non-equal sequences during hashing is a function of ν and k (Figure 2). Following hashing, the set is then traversed and copy number determined using pairwise comparison [limited to h(q) = h(n)] to prevent equating of non-equal sequences that collided during hashing. We determined empirically that for our typical Illumina data sets (n ∼ 10⁷, ν = 58) a pairwise comparison after sorting with k = 4 was faster than allowing the LSH to converge without pairwise comparison (k ∼ 20). The Qphred scores (12) of identical sequences are summed and stored.
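After the list is traversed it is pruned, and the mean Qphred score at each position of each unique sequence is calculated from the summed scores and the copy number as $\bar{Q}_{i} = \frac{1}{c}\sum_{j=1}^{c} Q_{i,j}$, where c is the copy number of the sequence and $Q_{i,j}$ is the Qphred score at position i in the jth observation of that sequence.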
Figure 2.

Expected mutant frequencies and empirically determined algorithm performance. SEWAL is designed to analyze pools of nucleic acids to determine the frequency of specific mutant sequences in the pool. The worst-case comparison of all observed mutants to all other observed mutants would have a complexity of O(νn²) where O is the upper bound of the growth of the function, n is the total number of sequences, and ν is the length of each sequence. The blue line is the probability of encountering any mutant in a pool composed of random RNAs as a function of ν. The red line is the probability of encountering a reference sequence with a pool mutagenized at a rate of 21% (7% each possible point mutation). The dashed red lines correspond to the probability of encountering a reference sequence with m number of specific point mutations. Black lines show the performance of the SEWAL sorting algorithm on test data sets (n = 10⁷) using k number of LSH functions expressed as the probability of a collision between any two nonidentical sequences with equal hash values. Solid black lines represent random test data sets; dashed lines, in silico generated data sets mutagenized at 21% per position.

RESULTS AND DISCUSSION

We chose to focus on the Illumina platform because its short read length and higher read density were advantageous in analyzing populations of small RNAs (13). The current release of SEWAL is designed to accept qseq.txt files from the Illumina Genome Analyzer v1.4 (Illumina, San Diego, CA, USA). These files are merged, filtered for quality, hashed, sorted, tabulated, merged with similar data sets generated under differing conditions and then formatted as a SEWAL aggregate frequency file, which can be analyzed using several sequence and graphic analysis features built into SEWAL. Experimentally generated test data sets are available for download at the NCBI Trace Archives (accession number SRA020870.1). An overview of the SEWAL computational pipeline is shown in Figure 1 and a detailed explanation is presented below. All SEWAL text commands are hereafter presented in italics.

Process

Illumina pipeline qseq files are first merged into a single data set using the SEWAL commands mergetiles and truncate to generate a fixed number of clusters that pass quality filtering for cross-sample or cross-lane comparison. This step is critical when inferring changes in the frequency of a particular sequence in multiple sequencing runs or flow-cell lanes. Similar comparisons can be achieved using the scale command to scale two data sets relative to each other based on the sequencing of a loading control sequence following construction of a SEWAL frequency file (readruler command). Quality filtering is user-controllable using a configuration file (sewal.prefs; see file header for details). Default quality-control settings ignore Illumina quality flags, but remove clusters containing one or more of the following errors: base calls flagged as gaps, homopolymer runs longer than six residues, and more than six positions with Qphred scores of 3 or lower. A Qphred score (12) is a logarithmic measure of the probability that a base call is incorrect: Qphred = −10 log10(ep), where ep is the error probability for that position. The Illumina pipeline encodes Qphred scores as an ASCII string with one character per base (Qphred = ASCII value − 64). A Qphred score of 3 corresponds to a 50% chance that the base is incorrectly assigned.
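The quality filter described above can be illustrated with a minimal Python sketch. It is not SEWAL's C++ implementation: the function names, the use of "." as the gap character and the default thresholds passed as parameters are assumptions made for illustration, and the sketch treats positions with a Qphred score at or below the cutoff as the low-quality positions.

```python
def decode_phred64(quality_string):
    """Convert an ASCII quality string to Qphred scores (ASCII value - 64)."""
    return [ord(ch) - 64 for ch in quality_string]


def error_probability(q):
    """Invert Qphred = -10 * log10(ep) to recover the error probability ep."""
    return 10 ** (-q / 10)


def longest_run(sequence):
    """Length of the longest homopolymer run in the sequence."""
    longest = current = 0
    previous = None
    for base in sequence:
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
        previous = base
    return longest


def passes_filter(sequence, quality_string, gap_char=".",
                  max_run=6, cutoff=3, max_low=6):
    """Apply the three default checks to one cluster (hypothetical helper)."""
    if gap_char in sequence:                 # base call flagged as a gap
        return False
    if longest_run(sequence) > max_run:      # homopolymer run too long
        return False
    low = sum(1 for q in decode_phred64(quality_string) if q <= cutoff)
    return low <= max_low                    # too many low-quality positions
```

For example, `error_probability(3)` evaluates to roughly 0.5, matching the statement that a Qphred score of 3 corresponds to a 50% chance of an incorrect base call.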

Locality sensitive hashing

We have solved the problem of comparison of every sequence in the data set to every other sequence using the approach of LSH to reduce the dimensionality of the data from ν sequence positions to 4 LSH values. Using four successive table sorts of the LSH values, the frequency of each sequence is then determined in quasilinear time (Table 1). This approach has the added advantage that the data set has been preprocessed, accelerating subsequent nearest-neighbor searches (finds). We set the number of LSH functions (k) to 4 because of the nature of our data; namely, mutant pools with high similarity (n ∼ 10⁷, ν = 58). The primary trade-off between small and large values of k is between preprocessing/sorting time and pair-wise comparison time. In practice, all sorts with k > 1 resulted in efficient tabulation of data of the scale currently generated by the Illumina platform (Table 1). A k of 4 translated into brute-force searches for ∼0.001% of our data set or 10 000 sequences (Figure 2). For larger data sets (>>n) a greater number of LSH values could be computed, increasing the preprocessing time and space linearly, and incorporating additional sorts, but decreasing the collision probability and requisite pair-wise comparisons. A useful feature of the LSH algorithm used in this context is that as the length of a random polymer (ν) increases, the performance of the algorithm also increases (Table 1 and refs 2 and 3). This behavior results from the LSH function and is dependent on both k and ν. Therefore, as Illumina read lengths improve, this sorting procedure will remain highly efficient.
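The hash, sort and traverse strategy can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SEWAL code: the hash function used here (Hamming distance to each of k reference strings) and all function names are hypothetical, chosen only because the text describes hash values as distances to random or user-supplied strings.

```python
import random
from itertools import groupby


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def make_references(length, k, seed=0):
    """Generate k random reference strings (assumed stand-ins for LSH functions)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(k)]


def hash_values(sequence, references):
    """Reduce a sequence to k integers, one distance per reference string."""
    return tuple(hamming(sequence, ref) for ref in references)


def count_unique(sequences, references):
    """Return ({sequence: copy number}, number of pairwise comparisons made)."""
    keyed = [(hash_values(s, references), s) for s in sequences]
    keyed.sort(key=lambda item: item[0])      # stable sort on hash values only
    counts, comparisons = {}, 0
    for _, bucket in groupby(keyed, key=lambda item: item[0]):
        distinct = []                         # [sequence, count] pairs in this bucket
        for _, seq in bucket:
            for entry in distinct:            # pairwise check guards against collisions
                comparisons += 1
                if entry[0] == seq:
                    entry[1] += 1
                    break
            else:
                distinct.append([seq, 1])
        for seq, copies in distinct:
            counts[seq] = copies
    return counts, comparisons
```

Pairwise comparisons are confined to sequences that share all k hash values, so adding hash functions shrinks the buckets at the cost of more hashing and sorting, which is the trade-off discussed above.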
Table 1.

Performance of the SEWAL LSH sorting algorithm as a function of k (number of hash functions) and ν (length of each sequence) for in silico generated populations (n = 10⁷)

k   ν     Processing time (s)   Library type   Collision probability   Total collisions   Total comparisons
1   5     25                    Random         0.8837                  8 838 480          10 001 677
2   5     55                    Random         0.3819                  3 819 630          10 000 378
3   5     75                    Random         0.3768                  3 779 490          10 029 480
4   5     106                   Random         0.0029                  29 165             10 029 165
1   5     27                    Doped          0.446                   4 460 760          10 001 360
2   5     48                    Doped          0.1805                  1 805 160          10 000 409
3   5     63                    Doped          0.1845                  1 853 660          10 048 838
4   5     83                    Doped          0.0004                  4474               10 004 474
1   10    21                    Random         0.9986                  15 475 100         15 496 479
2   10    46                    Random         0.9512                  11 955 500         12 568 309
3   10    75                    Random         0.6172                  6 715 260          10 880 872
4   10    103                   Random         0.3455                  3 600 590          10 422 430
1   10    25                    Doped          0.8302                  8 892 800          10 711 457
2   10    53                    Doped          0.5035                  5 183 970          10 295 630
3   10    76                    Doped          0.3278                  3 304 510          10 082 137
4   10    103                   Doped          0.1769                  1 775 630          10 040 221
1   50    78523                 Random         0.9999                  1.61607 × 10¹¹     1.61617 × 10¹¹
2   50    93                    Random         0.9813                  525 142 000        535 141 971
3   50    104                   Random         0.2522                  3 373 350          13 373 353
4   50    140                   Random         0.0036                  35 779             10 035 779
1   50    72806                 Doped          0.9999                  1.77183 × 10¹¹     1.77193 × 10¹¹
2   50    118                   Doped          0.9878                  811 783 000        821 773 604
3   50    106                   Doped          0.4301                  7 545 680          17 544 589
4   50    140                   Doped          0.0062                  62 612             10 062 405
1   100   58                    Random         0.9987                  78 376 300         78 481 204
2   100   63                    Random         0.9359                  33 599 300         35 900 313
3   100   108                   Random         0.1513                  1 690 980          11 174 301
4   100   129                   Random         0.0029                  29 433             10 027 028
1   100   91994                 Doped          0.9999                  1.44037 × 10¹¹     1.44047 × 10¹¹
2   100   81                    Doped          0.9793                  473 721 000        483 721 331
3   100   88                    Doped          0.2376                  3 116 040          13 116 042
4   100   122                   Doped          0.0032                  32 331             10 032 331

Library type is random (a 25% probability of each nucleotide at each position) or doped (a 79% probability of a wild-type residue at each position and a 7% probability of each possible mutant residue).

Processall

Following LSH, SEWAL generates a frequency file, sorted by copy number, which is equivalent to the tab-delimited qseq file with the Qphred string replaced by the mean Qphred string, and a unique ID tag, sequence copy number and hash values appended. The processall command compares the frequency of all sequences between two or more frequency files (maximum number of experiments/Illumina lanes is currently 7). SEWAL generates an aggregate frequency file by concatenating the individual frequency files and then sorting the sequences based on the stored LSH values. This sorting of data sets from multiple Illumina lanes is memory-intensive, and in the current implementation requires a 64-bit OS with >20 GB physical memory. After sorting, the duplicates in the list are collapsed as above, and a SEWAL aggregate frequency file created. This file contains each sequence with a unique ID number, a mean Qphred score string, the copy number of the observed sequence in each lane, the four LSH scores, the observed copy numbers from each lane, and the lane number containing the maximal copy number.
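The aggregation step can be pictured with a small Python sketch. It assumes each lane's frequency file has already been reduced to a mapping from sequence to copy number; the function name, the in-memory dictionaries and the tuple layout are illustrative assumptions and do not reproduce SEWAL's file format or its sort on stored LSH values.

```python
def aggregate(lane_tables):
    """Merge per-lane {sequence: copy number} tables into aggregate rows.

    Each row is (unique ID, sequence, copy numbers per lane, lane with the
    maximal copy number), with lanes numbered from 1.
    """
    n_lanes = len(lane_tables)
    merged = {}
    for lane_index, table in enumerate(lane_tables):
        for seq, copies in table.items():
            # Sequences absent from a lane keep a copy number of 0 there.
            merged.setdefault(seq, [0] * n_lanes)[lane_index] = copies
    rows = []
    for seq_id, (seq, copies) in enumerate(sorted(merged.items()), start=1):
        dmax_lane = copies.index(max(copies)) + 1
        rows.append((seq_id, seq, copies, dmax_lane))
    return rows
```

Called on two lanes such as `[{"ACGT": 5, "AAAA": 1}, {"ACGT": 2, "CCCC": 7}]`, the sketch yields one row per unique sequence, each carrying both copy numbers and the lane in which the sequence was most abundant.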

Data analysis

Histogram, hamminghistogram

These commands generate histograms of a user-defined bin size and bin number. The output is a tab-delimited text file containing the limits for each bin and the number of sequences in each bin. Both the number of unique sequences and the total number of observations of all members in the bin are given in the table. The difference between the hamminghistogram and histogram commands is the distance metric used to determine the bin placement for each sequence. Bins for the histogram command are measured using one of the four LSH values (user-defined), while hamminghistogram takes an input string for distance comparison and the outputted histogram bins are based on the Hamming distance from the query string (14).
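As an illustration of the binning described above, the following Python sketch builds a Hamming-distance histogram from a table of sequence copy numbers. The function names and the returned tuple layout are assumptions for this example; SEWAL itself writes a tab-delimited file.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def hamming_histogram(frequencies, query, bin_size, n_bins):
    """Bin sequences by Hamming distance from the query string.

    frequencies maps sequence -> copy number. Returns one tuple per bin:
    (lower limit, upper limit, unique sequences, total observations).
    Sequences beyond the last bin are ignored.
    """
    bins = [[0, 0] for _ in range(n_bins)]
    for seq, copies in frequencies.items():
        index = hamming(seq, query) // bin_size
        if index < n_bins:
            bins[index][0] += 1          # one more unique sequence
            bins[index][1] += copies     # all observations of that sequence
    return [(i * bin_size, (i + 1) * bin_size - 1, unique, total)
            for i, (unique, total) in enumerate(bins)]
```

Reporting both counts per bin separates how many distinct mutants lie at a given distance from how heavily those mutants are represented in the pool.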

Finda, findg, finds

These three search commands, finda, findg and finds, find sequences in the aggregate frequency file based on specific search queries. The finda (find all) and finds (find sequence) commands return a new aggregate frequency file containing sequences that existed in a specified region of the mutant spectrum or are similar to a specified search sequence, respectively. The findg (find and graph) command is similar to the finda command but returns either a .graph, .vector or .fit file. The .graph files are used to generate 3D scatter plots using the graph command. In these plots, each sequence is represented by a single dot, with one axis being the copy number of the sequence and the other two axes being user-selected LSH values (Figure 3). The .vector files have these same axes but contain vectors showing the changes in frequency of each sequence in each analyzed lane (Figure 4A), while the .fit files depict the magnitude of the vector between any two Illumina lanes as a scatter plot (Figure 4B).
Figure 3.

Orthogonal views of 3D mutant spectra generated by SEWAL. Each colored ball represents a unique sequence in the pool. Vertical axis is the log10 of the observed frequency of the sequence. Horizontal axes are the SEWAL LSH values using a wild-type sequence (Projection 1) or a sequence with no similarity to the data set (a segment of the Ornithorhynchus anatinus lysozyme gene; Projection 2). Sequences have been colored using the colorbase command based on the identity of the base at position 21 in the sequence (G, green, A, red, T,U, blue, C, orange). Curved arrow depicts the relative orientation of the views.

Figure 4.

3D Vector and fitness plots of genotype distributions in four separate sequencing experiments generated by SEWAL. (A) Each vertical line represents a unique sequence present in four separate sequencing experiments. Vectors are color coded to show the frequency of each sequence in each experimental condition. Sequences sorted using the SEWAL findg command to display only sequences that have a maximal distribution in condition 1 (red). (B) Sequences from the same data set plotted to display the magnitude of the difference between condition 1 (high stringency selection) and condition 4 (low stringency selection).

Covary, buildmutant

One of the main uses of SEWAL is to understand the mapping between the sequence of a nucleic acid and its fitness. To display this mapping, SEWAL can generate mutation frequency tables based on defined regions of sequence space using the buildmutant command. The buildmutant command outputs a tab-delimited text file that contains a table of observed point mutant frequencies from a user provided reference sequence. The table also contains positional information content, in bits (15). This information content calculation can be adjusted to account for sequence bias in the starting population (16). In a structured nucleic acid, many positions are not conserved at the primary sequence level but are required to maintain Watson–Crick base pairing (17). At these positions, a mutation can be said to ‘covary’ with a mutation at another position in the RNA to maintain base pairing. These data can be used to infer the secondary structure of the nucleic acid. SEWAL can analyze the data set to look for instances of covariation using the covary command. The covary command takes a reference sequence and a defined region of the mutant spectrum and uses the sequence data from the deep sequencing run to output a covariation matrix for all of the sequenced positions that are separated by a minimum of three residues.
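The positional information content reported by buildmutant can be sketched as follows. This is an assumption-laden illustration rather than the SEWAL routine: it uses the relative-entropy form, the sum over bases of p log2(p/b) with b the background frequency, which reduces to 2 minus the Shannon entropy for a uniform background; it restricts the alphabet to A, C, G and T; and it omits any small-sample correction. The function name and the `background` parameter are hypothetical.

```python
import math


def position_information(frequencies, background=None):
    """Information content in bits at each position of equal-length sequences.

    frequencies maps sequence -> copy number. background maps each base to its
    expected frequency in the starting pool (uniform if omitted).
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    length = len(next(iter(frequencies)))
    total = sum(frequencies.values())
    info = []
    for pos in range(length):
        counts = {base: 0 for base in "ACGT"}
        for seq, copies in frequencies.items():
            counts[seq[pos]] += copies       # weight each sequence by copy number
        bits = 0.0
        for base, count in counts.items():
            if count:                        # 0 * log(0) is taken as 0
                p = count / total
                bits += p * math.log2(p / background[base])
        info.append(bits)
    return info
```

A fully conserved position scores 2 bits under a uniform background, while a position split evenly between two bases scores 1 bit. Passing a doped background (for example 0.79 for the wild-type base and 0.07 for each mutant) is one way to account for library bias.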

Singular value decomposition and principal component analysis

A specified region of the genotype frequency plot may be expressed in terms of a matrix A, consisting of m rows of positional nucleotide frequencies or information content, over z columns of differing sequencing lanes (e.g. z differing selective pressures). Any m × z matrix A can be factored into $A = UDV^{T}$ (18). The columns of U (m × m) are eigenvectors of $AA^{T}$, and the columns of V (z × z) are eigenvectors of $A^{T}A$. The r singular values on the diagonal of D (m × z) are the square roots of the nonzero eigenvalues of both $AA^{T}$ and $A^{T}A$. The matrix $A^{T}A$ is known as the covariance matrix of A, and it has familiar interpretations in statistics and other fields (18). When a data matrix has been centered, by subtracting the mean of each column from the entire column, this process is known as ‘principal component analysis’ (PCA). The right singular vectors are the components, and the scaled left singular vectors are the scores. PCAs are described in terms of the eigenvalues and eigenvectors of the covariance matrix, $A^{T}A$ (19). The first singular vectors or principal components, when applied to positional nucleotide frequencies for a single defined peak in the mutant spectrum, provide eigenvectors corresponding to the reference sequence of the spectrum (J. N. Pitt and A. R. Ferré-D’Amaré, submitted for publication, 20). The svd and pca commands return the matrices U, D and V for A(m, z), where m is specified by the user as either the information content of a position or the library bias adjusted base frequency at each position over z sequencing lanes. The commands svddmax and pcadmax perform identical calculations; however, sequences are presorted depending on their maximum observed distribution across lanes, and are binned in terms of their maximal distribution (z(i) = sequences with a maximal distribution in lane dmax(z):1, …, z).
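The decomposition can be sketched with NumPy, which is an assumption of this example since SEWAL is written in C++. The function name `svd_and_pca` and the example matrix are hypothetical. The code factors A into U, D and the transpose of V, then repeats the factorization on the column-centered matrix to obtain principal components and scores.

```python
import numpy as np


def svd_and_pca(A):
    """Return (U, singular values, V^T, PCA components, PCA scores) for matrix A."""
    A = np.asarray(A, dtype=float)
    # Full SVD: U is m x m, Vt is z x z, singular values lie on the diagonal of D.
    U, singular_values, Vt = np.linalg.svd(A, full_matrices=True)
    # PCA: center each column (lane) by subtracting its mean, then factor again.
    centered = A - A.mean(axis=0)
    Uc, sc, Vct = np.linalg.svd(centered, full_matrices=False)
    components = Vct        # rows are the right singular vectors (components)
    scores = Uc * sc        # left singular vectors scaled by singular values
    return U, singular_values, Vt, components, scores


# Hypothetical example: 4 positional frequencies (rows) over 3 lanes (columns).
example = np.array([[0.70, 0.60, 0.50],
                    [0.10, 0.20, 0.30],
                    [0.10, 0.10, 0.10],
                    [0.10, 0.10, 0.10]])
U, s, Vt, components, scores = svd_and_pca(example)
```

The squared singular values returned here equal the eigenvalues of the product of the transposed matrix with the matrix itself, which is the relationship between singular values and eigenvalues stated in the text.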

Covariance

The covariance command can be used to generate $AA^{T}$ or $A^{T}A$ for a specified region of the mutant spectrum. $AA^{T}$ provides the correlation of each possible mutant residue in the sequence with every other possible mutant residue. For mutant spectra containing RNA secondary structures, this could be interpreted as Watson–Crick base-pairing covariance or any other structural feature dependent on more than one position. The magnitude of the covariance indicates critical (i.e. invariant) interactions, in that they are independent of the selective condition, or in terms of the matrix, highly correlated. Conversely, $A^{T}A$ provides the dependence or independence of a region of the mutant spectrum across sequencing lanes or selective conditions, indicating the correlation between selective regimes as indicated by the mutant spectra.

Symmetrized Kullback–Leibler divergence

The relative entropy or Kullback–Leibler (KL) divergence is a measure in statistics (21,22) that quantifies how close a probability distribution p is to a model (or candidate) distribution q: $KL(p \| q) = \sum_{i} p_{i} \log_{2} (p_{i}/q_{i})$. KL divergence is often used as a measure of the difference between two distributions (21,22). The KL divergence is not symmetric; thus we define the symmetrized Kullback–Leibler divergence (SKL) (23), which combines $KL(p \| q)$ and $KL(q \| p)$ so that the result does not depend on the order of the two distributions. SKL is a useful metric for quantifying the difference between mutant spectra sequenced under differing conditions. When the selective pressures are equal, SKL should approach 0, and under differing selective pressures SKL is much greater than 0. SKL can be performed on specified regions of the mutant spectrum to determine regions of maximal or minimal divergence. The SKL command can also bin the sequences depending on their dmax across multiple sequencing lanes, providing a measure of the relative entropy in bits between these populations in the form of a symmetric matrix A, for example between the dmax binned populations from four in vitro selection experiments with decreasing selective stringency.
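A short Python sketch shows how such a divergence might be computed between two lanes. Several choices here are assumptions rather than SEWAL's definition: the symmetrized value is taken as the average of the two directed divergences (some authors use the sum), logarithms are base 2 so that the result is in bits, and a pseudocount is added so that sequences missing from one lane do not cause a division by zero. All function names are hypothetical.

```python
import math


def to_distribution(counts, keys, pseudocount=1.0):
    """Normalize copy numbers over a shared key list into probabilities."""
    raw = [counts.get(key, 0) + pseudocount for key in keys]
    total = sum(raw)
    return [value / total for value in raw]


def kl_divergence(p, q):
    """Directed Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def skl(p, q):
    """Symmetrized divergence, assumed here to be the mean of both directions."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def skl_matrix(lane_counts):
    """Symmetric matrix of pairwise SKL values between lanes."""
    keys = sorted(set().union(*lane_counts))
    dists = [to_distribution(counts, keys) for counts in lane_counts]
    return [[skl(a, b) for b in dists] for a in dists]
```

With identical lanes every entry of the matrix is 0, and the off-diagonal entries grow as the copy-number distributions of two lanes diverge.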

Similarity

The similarity command returns several traditional population-based similarity indexes (Jaccard, Sørensen and Mountford) frequently used in ecology to quantify the total relationship between two different populations (in this case, Illumina lanes) (24).
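Two of these indexes have simple closed forms and can be sketched in Python. The example treats each lane as the set of unique sequences observed in it (presence or absence only), which is an assumption about how the populations are represented; the Mountford index requires an iterative solution and is omitted. Function names are illustrative.

```python
def jaccard(lane_a, lane_b):
    """Jaccard index: shared sequences divided by all distinct sequences."""
    a, b = set(lane_a), set(lane_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sorensen(lane_a, lane_b):
    """Sørensen index: twice the shared sequences over the summed set sizes."""
    a, b = set(lane_a), set(lane_b)
    denominator = len(a) + len(b)
    return 2 * len(a & b) / denominator if denominator else 0.0
```

Both indexes range from 0 (no sequences in common) to 1 (identical sets of unique sequences).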

Graph, fit, select, colorbase

SEWAL consists of two windows, a text-based interface launched from the terminal/command line and an interactive 3D graphical interface that is activated by running the graph command from the SEWAL command line. Sequences are displayed as a mutant spectrum (Figure 3) using the open command to load a .graph (Figure 3), .vector (Figure 4A) or .fit file (Figure 4B) generated using findg (above). Multiple files can be loaded simultaneously, and the software has successfully produced graphs of as many as 7 × 10⁶ sequences. Alternatively, using the .vector file format, the frequency of each sequence in different sequencing experiments is displayed visually as a color-coded vector (Figure 4). The .fit files are particularly useful when the difference in the frequency of a genotype across lanes is known (or hypothesized) to be correlated with a difference in that genotype’s phenotypic fitness, because this plot then represents an empirical fitness landscape (J. N. Pitt and A. R. Ferré-D’Amaré, submitted for publication). From the graphical window, sequences can be selected using the mouse and exported using the select command to generate a new aggregate frequency file. Sequences can be colored using specified sequence parameters using the colorbase command (Figure 3). The graphical window contains a menu system to adjust how elements are displayed, and the 3D plot can be rotated and zoomed using the mouse and captured and exported as a .tif file. The graphical window also generates a histogram for each of the three axes that can be displayed.

CONCLUSIONS

Visualization of hyperdimensional objects is not trivial; however, several software applications have been written to visualize multivariate data (25,26). We chose to build a new interactive 3D interface into SEWAL for several reasons. First, the standard software graphing packages available in most laboratories simply cannot graph more than 10⁵–10⁶ points. Illumina data sets routinely generate 10⁷–10⁸ sequences in a single run, and therefore all of these data cannot be visualized simultaneously. SEWAL has been used to graph (albeit slowly) data sets approaching 10⁷ points, and there is no practical limit to the number it could graph at reduced frame rates. Second, using the 3D frequency plots to identify patterns or regions of interest in the data and exporting those DNA sequences is a useful feature. SEWAL supports this point-and-click methodology. Third, graphical representation of DNA sequence information in multiple colors and dimensions reveals interesting patterns in the data not provided by other applications. SEWAL provides a new platform for DNA sequence analysis and visualization of large data sets generated by massive sequencing to perform functional in vitro genetics. The visualization incorporates the underlying LSH information in a visually intuitive manner to generate mutant spectra. When the differences in these plots resulting from altered selective pressures are visualized, they can represent projections of the hyperdimensional fitness landscape, if the frequency of the sequences has been demonstrated to correlate with phenotypic fitness (8).

FUNDING

The W.M. Keck Foundation; Howard Hughes Medical Institute (HHMI); Mentored Quantitative Research Career Development Award (K25) from National Institutes of Health grant 1K25DK082791-01A109 (to I.R.). Funding for open access charge: Howard Hughes Medical Institute. Conflict of interest statement. None declared.
  15 in total

1.  Efficient large-scale sequence comparison by locality-sensitive hashing.

Authors:  J Buhler
Journal:  Bioinformatics       Date:  2001-05       Impact factor: 6.937

2.  Empirical fitness landscapes reveal accessible evolutionary paths.

Authors:  Frank J Poelwijk; Daniel J Kiviet; Daniel M Weinreich; Sander J Tans
Journal:  Nature       Date:  2007-01-25       Impact factor: 49.962

3.  The 30th anniversary of quasispecies. Meeting on 'Quasispecies: past, present and future'.

Authors:  Esteban Domingo; Simon Wain-Hobson
Journal:  EMBO Rep       Date:  2009-04-03       Impact factor: 8.807

4.  Natural selection and the concept of a protein space.

Authors:  J M Smith
Journal:  Nature       Date:  1970-02-07       Impact factor: 49.962

5.  Information content of binding sites on nucleotide sequences.

Authors:  T D Schneider; G D Stormo; L Gold; A Ehrenfeucht
Journal:  J Mol Biol       Date:  1986-04-05       Impact factor: 5.469

6.  RNA sequence analysis using covariance models.

Authors:  S R Eddy; R Durbin
Journal:  Nucleic Acids Res       Date:  1994-06-11       Impact factor: 16.971

7.  Inferring weak population structure with the assistance of sample group information.

Authors:  Melissa J Hubisz; Daniel Falush; Matthew Stephens; Jonathan K Pritchard
Journal:  Mol Ecol Resour       Date:  2009-04-01       Impact factor: 7.090

8.  Structure-guided engineering of the regioselectivity of RNA ligase ribozymes.

Authors:  Jason N Pitt; Adrian R Ferré-D'Amaré
Journal:  J Am Chem Soc       Date:  2009-03-18       Impact factor: 15.419

9.  A large genome center's improvements to the Illumina sequencing system.

Authors:  Michael A Quail; Iwanka Kozarewa; Frances Smith; Aylwyn Scally; Philip J Stephens; Richard Durbin; Harold Swerdlow; Daniel J Turner
Journal:  Nat Methods       Date:  2008-12       Impact factor: 28.547

10.  Informational complexity and functional activity of RNA structures.

Authors:  James M Carothers; Stephanie C Oestreich; Jonathan H Davis; Jack W Szostak
Journal:  J Am Chem Soc       Date:  2004-04-28       Impact factor: 15.419

  9 in total

1.  Fitness analyses of all possible point mutations for regions of genes in yeast.

Authors:  Ryan Hietpas; Benjamin Roscoe; Li Jiang; Daniel N A Bolon
Journal:  Nat Protoc       Date:  2012-06-21       Impact factor: 13.491

Review 2.  Deep mutational scanning: assessing protein function on a massive scale.

Authors:  Carlos L Araya; Douglas M Fowler
Journal:  Trends Biotechnol       Date:  2011-05-10       Impact factor: 19.536

3.  Time-lapse imaging of molecular evolution by high-throughput sequencing.

Authors:  Nam Nguyen Quang; Clément Bouvier; Adrien Henriques; Benoit Lelandais; Frédéric Ducongé
Journal:  Nucleic Acids Res       Date:  2018-09-06       Impact factor: 16.971

4.  In vitro evolution of coenzyme-independent variants from the glmS ribozyme structural scaffold.

Authors:  Matthew W L Lau; Adrian R Ferré-D'Amaré
Journal:  Methods       Date:  2016-04-26       Impact factor: 3.608

Review 5.  Applications of High-Throughput Sequencing for In Vitro Selection and Characterization of Aptamers.

Authors:  Nam Nguyen Quang; Gérald Perret; Frédéric Ducongé
Journal:  Pharmaceuticals (Basel)       Date:  2016-12-10

Review 6.  Implementation of High-Throughput Sequencing (HTS) in Aptamer Selection Technology.

Authors:  Natalia Komarova; Daria Barkova; Alexander Kuznetsov
Journal:  Int J Mol Sci       Date:  2020-11-20       Impact factor: 5.923

7.  High-throughput sequence analysis reveals structural diversity and improved potency among RNA inhibitors of HIV reverse transcriptase.

Authors:  Mark A Ditzler; Margaret J Lange; Debojit Bose; Christopher A Bottoms; Katherine F Virkler; Andrew W Sawyer; Angela S Whatley; William Spollen; Scott A Givan; Donald H Burke
Journal:  Nucleic Acids Res       Date:  2012-12-14       Impact factor: 16.971

Review 8.  Visualizing genome and systems biology: technologies, tools, implementation techniques and trends, past, present and future.

Authors:  Georgios A Pavlopoulos; Dimitris Malliarakis; Nikolas Papanikolaou; Theodosis Theodosiou; Anton J Enright; Ioannis Iliopoulos
Journal:  Gigascience       Date:  2015-08-25       Impact factor: 6.524

9.  FASTAptamer: A Bioinformatic Toolkit for High-throughput Sequence Analysis of Combinatorial Selections.

Authors:  Khalid K Alam; Jonathan L Chang; Donald H Burke
Journal:  Mol Ther Nucleic Acids       Date:  2015-03-03       Impact factor: 10.183
