
Clustering of circular consensus sequences: accurate error correction and assembly of single molecule real-time reads from multiplexed amplicon libraries.

Felix Francis^1,2, Michael D Dumas^1, Scott B Davis^1, Randall J Wisser^3,4.

Abstract

BACKGROUND: Targeted resequencing with high-throughput sequencing (HTS) platforms can be used to efficiently interrogate the genomes of large numbers of individuals. A critical issue for research and applications using HTS data, especially from long-read platforms, is error in base calling arising from technological limits and bioinformatic algorithms. We found that the community standard long amplicon analysis (LAA) module from Pacific Biosciences is prone to substantial bioinformatic errors that raise concerns about findings based on this pipeline, prompting the need for a new method.
RESULTS: A single molecule real-time (SMRT) sequencing-error correction and assembly pipeline, C3S-LAA, was developed for libraries of pooled amplicons. By uniquely leveraging the structure of SMRT sequence data (comprised of multiple low quality subreads from which higher quality circular consensus sequences are formed) to cluster raw reads, C3S-LAA produced accurate consensus sequences and assemblies of overlapping amplicons from single sample and multiplexed libraries. In contrast, despite read depths in excess of 100X per amplicon, the standard long amplicon analysis module from Pacific Biosciences generated unexpected numbers of amplicon sequences with substantial inaccuracies in the consensus sequences. A bootstrap analysis showed that the C3S-LAA pipeline per se was effective at removing bioinformatic sources of error, but in rare cases a read depth of nearly 400X was not sufficient to overcome minor but systematic errors inherent to amplification or sequencing.
CONCLUSIONS: C3S-LAA uses a divide and conquer processing algorithm for SMRT amplicon-sequence data that generates accurate consensus sequences and local sequence assemblies. Solving the confounding bioinformatic source of error in LAA allowed for the identification of limited instances of errors due to DNA amplification or sequencing of homopolymeric nucleotide tracts. For research and development in genomics, C3S-LAA allows meaningful conclusions and biological inferences to be made from accurately polished sequence output.


Keywords: Divide and conquer; Long-range PCR; PacBio amplicon analysis; Resequencing; Sequence error; Target enrichment


Year: 2018 PMID: 30126356 PMCID: PMC6102811 DOI: 10.1186/s12859-018-2293-0

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

High-throughput sequencing (HTS) platforms have revolutionized the study of genomes and genomic variation. However, HTS platforms are prone to base calling errors [1]. Even perfectly accurate sequence reads may be improperly assembled or incorrectly aligned to a reference sequence when read lengths are too short. Such errors can lead to incorrect results and misleading conclusions in a variety of settings ranging from scientific investigation [2] to clinical diagnostics [3]. Single molecule real-time (SMRT) sequencing by Pacific Biosciences (PacBio) generates long-read data, which, if error corrected (raw SMRT sequence reads have an error rate as high as 20% [4]), can help to produce complete de novo assemblies and accurate alignments to a reference genome. SMRT sequencing also exhibits relatively little sequence coverage bias, allowing regions of the genome with large differences in sequence complexity to be fully traversed [4]. Therefore, SMRT sequencing facilitates assembly, resequencing, haplotype phasing, and characterization of isoforms and structural variation, all of which are more prone to errors with “short-read” data [5]. For targeted resequencing applications, SMRT sequencing of tiled amplicons allows kilobase or larger-scale target regions of a genome to be sequenced at great depth, providing the opportunity to generate highly accurate consensus assemblies [6]. In combination with molecular barcoding, sequencing of multiplexed amplicon libraries facilitates studies across broad biological disciplines [7-11]. However, such studies can be affected by confounding sources of errors arising from library preparation, sequencing and data analysis [12]. Isolating the sources and types of errors is crucial to progress in the development of sequencing technologies, sequence analysis methods and interpretation of sequence data.
Several computational pipelines have been developed for automated processing and analysis of amplicon sequence data produced on different HTS platforms, such as PyroNoise [13], mothur [14] and Long Amplicon Analysis (LAA) [15]. LAA is the standard pipeline for analysis of SMRT sequence data from amplicon libraries. LAA uses a “coarse clustering” approach to group raw reads according to pairwise similarity estimated from BLASR alignments. The Quiver consensus calling framework [16] is then used to generate an error-corrected consensus sequence for each cluster. When we first used LAA to process amplicon sequences as part of a previous study [6], several of the consensus sequences outputted by LAA were incorrect. We found that clustering of high quality circular consensus sequences (i.e. clustering of CCS reads, which we refer to as C3S) to group the corresponding raw read data prior to performing analysis with Quiver recovered all of the expected sequences with high fidelity. Here, we investigated this further and present a new, open-source pipeline for processing tiled amplicon resequence data from multiplexed libraries.

Methods

Sequence data

PacBio sequence data (RS II chemistry P6/C4) from two amplicon libraries, a single sample library (SRX2880716) and a multiplex sample library (SRX3474979), were used for this study. SMRTbell libraries were constructed according to PacBio’s amplicon library protocol [17]. Sequencing was performed on a PacBio RS II instrument with one SMRT Cell used for each library, using P6/C4 chemistry with a 6 h movie. SMRTbell library preparation and sequencing were carried out by the University of Delaware Sequencing and Genotyping Center (Newark, DE). Sequence data from the single sample library was from a previous study [6] and comprised nine amplicons, which were amplified from the maize inbred line B73. The maximum expected amplicon size was 4954 bp, such that the raw reads, which had a mean length of 23,794 bp, consisted of an average of approximately nine subreads per amplicon (Additional file 1: Table S1). The multiplex sample library produced for this study comprised a tiling path of six amplicons spanning approximately 23,000 bp of the maize genome, which were amplified from six different maize inbred lines (B73, CML277, Hp301, Mo17, P39, Tx303). The primer pairs used for the multiplex library had distinct symmetric barcodes for each sample and amplicon, along with a shared 5’ GTTAG padding sequence (Additional file 1: Table S2). The maximum expected amplicon size was 7752 bp and the raw reads consisted of an average of nine subreads per amplicon (Additional file 1: Table S1).

Clustering of circular consensus sequences for long amplicon analysis

A cluster and assembly pipeline was developed in which raw reads are clustered based on circular consensus sequences (CCS) prior to running error correction with Quiver. We refer to this divide and conquer approach as C3S-LAA, for Clustering of Circular Consensus Sequence (C3S) Long Amplicon Analysis (Fig. 1).

Fig. 1

Graphical representation of the C3S-LAA process and pipeline. a Raw reads comprised of multiple subreads are depicted for three different amplicons [green, fuchsia and blue boxes; different shades of color are used to portray variable subread sequence qualities (darker shading portrays higher quality)]. Subreads are separated by a shared adapter sequence (grey boxes). The higher quality CCS read for each raw read is used to cluster the corresponding raw reads into CCS-based cluster groups. Error correction is performed per CCS-based cluster, producing top quality consensus sequences, followed by assembly of any overlapping consensus sequences. b A single run parameters file is used by all components of the pipeline. The grey highlighted rectangles represent two main steps of C3S-LAA. (i) Using the CCS reads generated by the SMRT analysis reads of insert protocol, C3S clusters the raw reads according to each barcode-primer pair combination, producing files of read identifiers to whitelist the corresponding raw reads. (ii) Raw read clusters are passed to Quiver to generate amplicon-specific consensus sequences, which are then passed to Minimus for sequence assembly. Rectangles with folded corners represent single files or multiple files (depicted as stacks of files) and those with rounded edges represent scripts and tools. Arrows indicate output files that are generated. Connecting lines with dots at one end depict input files, with the dot corresponding to the source data for the connected script or tool.

Clustering is performed as follows. The reads of insert protocol in SMRT Portal is used to generate CCS reads (run settings: minimum of 1 subread at 90% CCS read accuracy). These are higher quality sequences formed from the corresponding raw reads based on their multiple subreads. Therefore, the CCS reads are used to cluster the data.
Clustering is performed by a simple match function that identifies CCS reads containing both the forward and reverse primer sequences for each amplicon (considering the sense and antisense primer sequences). From this, a list of CCS read identifiers belonging to each amplicon cluster is produced. This list is then used to subset the corresponding raw reads, using the whitelist option in LAA, such that Quiver-based consensus calling [16] occurs on only the raw reads belonging to a given amplicon-specific cluster. Consensus sequences formed from clusters comprised of fewer than 100 subreads were eliminated when all available reads were used; this setting was adjusted to 0 for evaluation of accuracy (see below). The pipeline can be used to perform one-level clustering for non-barcoded amplicon libraries or two-level clustering for barcoded amplicon libraries. Because barcodes or other sequences may precede the primer sequence and may vary in length, the primer search space was designed as a user input parameter, which, for this study, was set to 21 bases at both the ends of the sequence. The pipeline proceeds to an assembly step (Fig. 1). The C3S-LAA consensus sequences are automatically merged into a Multi-FASTA format file and assembled (per barcode if barcoding is used) using Minimus based on the overlap-layout-consensus paradigm [18]. To trim extraneous sequences (e.g. padding or barcodes) for downstream analysis, a user input parameter (trim_bp) is specified to remove the corresponding number of bases from each end of the consensus sequences while writing them to the FASTA file. The assembly is then carried out among all trimmed consensus sequences, and mismatches between any two overlapping sequences are represented as Ns in the assembly sequence. Where there are more than two overlapping sequences with mismatches, the most frequent base will be represented in the assembly. 
In the case of barcoded sequencing libraries, the assembly is carried out separately for each barcode.
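
The primer-based clustering and end trimming described above can be illustrated with a minimal Python sketch. This is not the released C3S-LAA code: the function names (`reverse_complement`, `matches_amplicon`, `cluster_ccs_reads`, `trim_ends`) are hypothetical, exact string matching is assumed for the "simple match function", and primers are assumed to be given 5'→3' as they would be ordered for PCR.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def matches_amplicon(read, fwd, rev, search_space):
    """Check whether both primers occur within `search_space` bases of the
    read ends, in either the sense or the antisense orientation.
    Assumes search_space is a positive integer and exact matching."""
    head, tail = read[:search_space], read[-search_space:]
    sense = fwd in head and reverse_complement(rev) in tail
    antisense = rev in head and reverse_complement(fwd) in tail
    return sense or antisense


def cluster_ccs_reads(ccs_reads, primer_pairs, search_space):
    """Group CCS read identifiers by the amplicon whose primer pair they carry.

    ccs_reads: dict of read identifier -> CCS sequence
    primer_pairs: dict of amplicon name -> (forward primer, reverse primer)
    Returns a dict of amplicon name -> list of read identifiers, which could
    then serve as a whitelist for subsetting the raw reads.
    """
    clusters = defaultdict(list)
    for read_id, seq in ccs_reads.items():
        for amplicon, (fwd, rev) in primer_pairs.items():
            if matches_amplicon(seq, fwd, rev, search_space):
                clusters[amplicon].append(read_id)
                break  # assign each read to at most one amplicon
    return dict(clusters)


def trim_ends(seq, trim_bp):
    """Remove trim_bp bases (e.g. padding or barcodes) from each end."""
    return seq[trim_bp:len(seq) - trim_bp] if trim_bp else seq
```

For a barcoded library, the same idea would extend to two-level clustering by including the barcode in the sequences that are searched for, with a correspondingly wider search space.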

Evaluating the accuracy of C3S-LAA

First, evaluation of the performance of LAA was carried out on the sequence data from the single sample library. LAA v1 was run on SMRT Portal, using the following settings: minimum subread length: 2000 bp; maximum number of subreads: 2000 (default); ignore primer sequence when clustering: 0 bp (default); trim ends of sequences: 0 bp (default); provide only the most supported sequences: 0 (0=disabled filter; default); coarse cluster subreads by gene family: yes (default); phase alleles: no; split results from each barcode into independent output files: no; barcode: no. The minimum subread length was reduced from the default value of 3000 bp to 2000 bp since the sequencing library had one amplicon of 3330 bp, such that partial sequences may also be considered. Phasing of alleles was not used since the amplicons were produced from homozygous individuals (inbred lines). The resulting LAA consensus sequences were aligned using BLASTn [19] to the B73 v3 reference genome of maize [20] (BLASTn parameter settings: max target sequences: 10, E-value threshold: 1e−4, word size: 11, match/mismatch scores: 1/-2). YASS [21] was used to generate dot plots for alignments between the incorrect (partial matches) consensus sequences formed by LAA and their expected amplicon sequence using the following score parameter settings: Scoring matrix (match: +5, transversion: -4, transition: -3, composition bias correction: -4), Gap costs (opening: -16, extension: -4), E-value threshold: 10 and X-drop threshold: 30. The same sequence data was also processed using C3S-LAA. In addition, the relationship between subread depth and the accuracy of consensus sequence construction as well as assembly was evaluated for the output from C3S-LAA. For each amplicon, sample sets of 1, 2, 3, ..., 40 CCS read identifiers were randomly selected with replacement from among the eight amplicons.
Using the corresponding raw reads of each CCS read set, C3S-LAA was used to create consensus sequences per amplicon cluster and assemblies from the corresponding group of consensus sequences belonging to a sampled CCS read set. This was repeated 25 times, such that a total of 8000 consensus sequences were generated in addition to the corresponding Minimus assemblies. BLASTn alignments with the B73 v3 reference genome were used to determine the map location and compute the percent identity for each of the amplicon-specific consensus sequences and corresponding assembly sequences. From these alignments, the number of mismatches and gaps were also recorded to characterize the types of errors present in the sequences. For each cluster of sequences, the number of subreads used to derive the consensus sequence was recorded. The minimum number of subread counts for a set of overlapping amplicons that produced an assembly was used as the number of subreads for that assembly. The performance of LAA versus C3S-LAA was also evaluated using the multiplex library. LAA was used to generate consensus sequences under the same settings indicated above, with an additional selection of the barcode demultiplexing option. Since the amplicons were barcoded using PacBio’s standard barcodes, the default pre-set in SMRT Portal pointing to PacBio barcodes with padding in the reference directory was used. C3S-LAA was used to perform two-level clustering of the CCS reads, using the primer and barcode sequence information. A search space of 121 bp was used for identifying barcode-primer sequences in order to cluster the CCS reads. Since one of the lines (B73) has a reference genome available, LAA and C3S-LAA consensus sequences associated with B73 were aligned using BLASTn to the B73 v3 reference genome, using the same BLASTn parameter settings indicated above. The C3S-LAA assembly for B73 was also compared to this reference.
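
The bootstrap design described above (sample sizes of 1 to 40 CCS read identifiers, drawn with replacement, repeated 25 times for each of eight amplicons) can be sketched as follows. The function name `bootstrap_read_sets`, the fixed seed, and the reading that sampling is done within each amplicon cluster are assumptions made for illustration; the sketch only produces the read identifier sets and does not run Quiver or Minimus.

```python
import random


def bootstrap_read_sets(cluster_ids, max_depth=40, replicates=25, seed=0):
    """Yield (replicate, depth, amplicon, sampled_ids) tuples.

    cluster_ids: dict of amplicon name -> list of CCS read identifiers
    Each sample holds `depth` identifiers drawn with replacement from the
    identifiers of one amplicon cluster.
    """
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    for replicate in range(replicates):
        for depth in range(1, max_depth + 1):
            for amplicon, ids in cluster_ids.items():
                # random.choices samples with replacement
                yield replicate, depth, amplicon, rng.choices(ids, k=depth)
```

With eight amplicon clusters this yields 8 × 40 × 25 = 8000 read sets, matching the total number of consensus sequences reported in the text.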

Results and discussion

Improving the accuracy of amplicon sequence analysis

PacBio SMRT sequence data from a pooled library of long-range PCR amplicons was previously produced and used for part of this study [6]. The data was processed with PacBio’s LAA protocol under default settings using all of the raw read data. This did not produce a consensus sequence for all of the expected amplicons and included seven artifactual sequences (Table 1). Dot-plot visualization of the alignments between the incorrect consensus sequences and the reference sequence indicated the presence of spurious inverted duplications for six of these sequences and a truncated consensus sequence for the remaining one (Additional file 1: Figure S1).

Table 1

Comparison of LAA and C3S-LAA consensus sequences for B73 amplicons

Library type^a	Method	Number of consensus sequences	Complete match^b (100% identity)	Truncated match (100% identity)	Partial match (<100% identity)
Single	LAA	14	7	1	6
Single	C3S-LAA	9	9	0	0
Multiplex	LAA	8	4	1	3
Multiplex	C3S-LAA	6	5	0	1

^a The single library had nine expected consensus sequences, whereas the multiplex library had six expected consensus sequences.

^b For the multiplex sample library, the B73 v3 assembly contained a gap relative to one of the five amplicon sequences, leading to one C3S-LAA sequence having a partial match. This gap was filled in the latest B73 v4 release.

The above errors led us to inspect LAA, which uses a custom algorithm based on the raw read data to pre-cluster similar sequences for analysis by Quiver. However, PacBio raw reads have relatively low accuracy [4] and overlapping or repetitive sequences could be present, either of which may cause errors in cluster formation (our speculation based on results presented below). Moreover, the primer sequences used to produce the amplicons in a library are not considered. Therefore, we hypothesized that using the higher quality CCS reads to group the corresponding raw reads into amplicon-specific clusters based on the expected primer sequences would improve the consensus sequence analysis. A bioinformatic pipeline, C3S-LAA, was developed to carry out such clustering (Fig. 1). The divide and conquer principle used by C3S-LAA simplifies the determination of consensus sequences by only operating on raw reads for which there is a high degree of certainty that they were derived from the same locus. Indeed, our results indicated that C3S-LAA rectified the errors generated by the standard LAA protocol. For a typical use case, where all the reads from a sequencing library are used, C3S-LAA could resolve and accurately call the consensus sequence for every amplicon in the single sample library with no extraneous sequences (Table 1). In contrast, more than half of the consensus sequences generated by LAA had truncated or partial matches to the reference genome, and LAA could only fully resolve six of the nine amplicons in the single sample library.
Based on these results, we recommend C3S-LAA for analysis of PacBio amplicon sequence data. Moreover, the C3S concept may be used in other situations where some portion of the sequences are known in advance.

Assembly of overlapping amplicon sequences

For tiled amplicon resequencing, C3S-LAA can also be used to assemble overlapping segments that may exist among the consensus sequences outputted for a given genotype (Fig. 1). We bootstrapped the read data from the single sample library to examine the accuracy of the assemblies, as well as the underlying consensus sequences, produced by C3S-LAA as a function of subread depth. All C3S-LAA alignments of the resulting amplicon consensus and assembly sequences mapped to the expected target region. The minimum subread depth from which amplicon-clustered consensus sequences were outputted by LAA was 21, which corresponds to approximately 2 CCS reads for our 3–5 kb amplicons (mean number of passes was 9.39; Additional file 1: Table S1). Accuracy of the consensus sequences from bootstrapped samples of amplicon-clustered data was generally high, with accuracies ranging from 99.72 to 100% (Fig. 2a). By extension, Minimus assemblies of these consensus sequences were similarly accurate (Fig. 2b). Despite an increase in accuracy with subread depth, not all of the bootstrap replications from high CCS sample depths included completely accurate consensus sequences or assemblies. Even at a subread depth of nearly 400X, some bootstrap samples included imperfect assemblies (Fig. 3). This was primarily due to a specific error in one locus (locus_6_7045710_7052049) that was observed among some of the bootstrapped samples at different CCS sampling depths (rare instances of locus_1_25390617_25396540 also showed minor inaccuracies). For instance, at a CCS sample depth of 40, the consensus sequence for locus_6_7045710_7052049 contained a 2 bp insertion in two of the 25 bootstrap samples. This same type of insertion error occurred for both loci and was embedded within homopolymeric regions of the sequences (Fig. 4), indicating this was due to PCR or sequencing and not the pipeline per se.
Among all the assemblies generated from bootstrapping (n=3787), insertions, deletions and single-nucleotide mismatches accounted for 66.7, 17.2 and 16.1% of the total errors, respectively (note that these are fractions of the total errors, which constitute no more than 0.3% of the C3S-LAA consensus sequences).

Fig. 2

Sequence accuracy as a function of subread depth. a Accuracy of consensus and b assembly sequences. Data from all the amplicons were pooled together to evaluate the consensus calling accuracy as a function of depth of coverage of SMRT raw reads. The vertical line shows the minimum read depth of the consensus sequences used for assemblies.

Fig. 3

Total number of accurate bootstrap assemblies per CCS sample size. At each level of the CCS read depth sample (1-40), the figure shows the total number of bootstrapped assemblies that were 100% identical to the reference sequence. This was determined for the four target regions (25 bootstrap assemblies at each of 4 loci, giving rise to a maximum of 100 on the x-axis) formed from the consensus sequences among the eight overlapping amplicons.

Fig. 4

Sequence alignment highlighting a recurring insertion error in some bootstrap samples. The alignment corresponds to the consensus sequence for a part of the amplicon from a locus_6_7045710_7052049 (Query) and b locus_1_25390617_25396540 (Query) on maize chromosome 6 and 1 respectively compared to the B73 v3 reference sequence (Sbjct).

Processing multiplexed sequence data

For the multiplex sample library, the number of consensus sequences formed by LAA differed from the expected number for four of the six samples, and LAA generated consensus sequences for barcodes that were not used to make the library (Table 2). In contrast, C3S-LAA produced the exact number of expected consensus sequences per sample and per barcode. As with the single sample library, comparing the B73-barcode derived consensus sequences to the B73 reference genome showed substantial errors in the consensus sequences from LAA but not C3S-LAA, where LAA only resolved four of the amplicons from B73 (Table 1); the one C3S-LAA consensus sequence with an imperfect match was due to two separate 1 bp insertions embedded within homopolymeric regions. Another C3S-LAA consensus sequence aligned to the expected region of chromosome 1 with 100% identity but spanned a 531 bp assembly gap in the reference genome. This gap was filled in the recent v4 release of the B73 reference genome [22] and was a perfect match to the C3S-LAA consensus sequence. None of the other results were changed when using the B73 v4 reference sequence. C3S-LAA also produced assemblies for each sample from the corresponding set of consensus sequences. The 23,300 bp C3S-LAA assembly for B73 differed from the expected B73 reference genome sequence only by the differences indicated above.

Table 2

The number of consensus sequences generated from the multiplex library, following barcode demultiplexing

Sample^a	Barcode ID	LAA consensus	C3S-LAA consensus
B73	32	8	6
CML277	35	6	6
Hp301	31	6	6
Mo17	20	7	6
P39	2	7	6
Tx303	4	7	6
N/A	8	7	0
N/A	23	5	0
N/A	49	1	0
N/A	82	2	0
N/A	85	1	0
N/A	91	6	0
N/A	92	3	0

^a No samples were associated with the N/A barcodes.
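
As a quick check on the counts in Table 2, the following Python sketch tallies how many samples received an unexpected number of LAA consensus sequences and how many consensus sequences LAA assigned to barcodes that were not used in the library. The values are copied from Table 2; the variable names are illustrative only.

```python
EXPECTED_PER_SAMPLE = 6  # six tiled amplicons were expected per sample

# LAA consensus counts per sample (Table 2)
laa_counts = {"B73": 8, "CML277": 6, "Hp301": 6, "Mo17": 7, "P39": 7, "Tx303": 7}

# LAA consensus counts for barcode IDs with no associated sample (Table 2)
laa_unused_barcodes = {8: 7, 23: 5, 49: 1, 82: 2, 85: 1, 91: 6, 92: 3}

# Samples where LAA deviated from the expected number of consensus sequences
mismatched = [s for s, n in laa_counts.items() if n != EXPECTED_PER_SAMPLE]

# Consensus sequences attributed to barcodes that were never used
spurious_total = sum(laa_unused_barcodes.values())

print(f"{len(mismatched)} of {len(laa_counts)} samples deviate: {mismatched}")
print(f"{spurious_total} consensus sequences on {len(laa_unused_barcodes)} unused barcodes")
```

The tally agrees with the statement that LAA differed from the expected number for four of the six samples.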


Other considerations

C3S-LAA clearly outperformed LAA for the data examined in this study. We have observed the same performance using C3S-LAA on data from another multiplex library including 21 individuals amplified across multiple overlapping amplicons (not shown). Nevertheless, a potential limitation of C3S-LAA is that it requires that the CCS reads have both the barcode and primer sequences intact. Accuracy of CCS reads is a function of the number of subreads [23]. Thus, for very long amplicons where only one or a few subreads are sequenced, reliance on CCS reads will limit the number of sequences used from the available data. It may be possible to use a less stringent clustering algorithm; however, the fragment lengths of most amplicon libraries are expected to be well below the current and increasingly long read lengths of PacBio data, such that highly accurate CCS reads would be available for clustering. C3S-LAA is expected to be applicable for SMRT sequence data of amplicon libraries or where flanking sequences can be predefined. C3S-LAA was developed as part of an extension to tiled amplicon resequencing projects facilitated by ThermoAlign [6] and is released under an open source license.

Conclusions

This study shows that CCS-facilitated clustering of raw reads vastly improves the analysis of SMRT sequence data. This method directs error correction and consensus sequence analysis to be performed only on sequences derived from the same amplicon and sample, leading to accurate consensus sequences and local assemblies. The community standard LAA module could not resolve all of the expected amplicons from the sequence data evaluated in this study, and several spurious consensus sequences were generated by LAA during barcode demultiplexing and sequence clustering. Long amplicon analysis uses BLASR for pairwise alignment of all reads, which are then clustered based on their similarity using a Markov Model [15]. The underlying principles of LAA and C3S-LAA are essentially the same (use clustering to group reads from which consensus sequences should be formed), yet only C3S-LAA produced correct output, which indicates that the clustering algorithm of LAA is prone to error. This release of C3S-LAA provides users with a more accurate processing pipeline for SMRT sequence data, which addresses a critical gap in the analysis of amplicon sequence data.

Additional file 1: Table S1. PacBio reads of insert protocol output metrics. Table S2. Padded and barcoded primer sequences used for amplification of six maize lines. Figure S1. Dot-plots of alignments between amplicon reference sequences and inaccurate consensus sequences generated by LAA. (PDF 321 kb)

20 in total

1. Accurate determination of microbial diversity from 454 pyrosequencing data.

Authors: Christopher Quince; Anders Lanzén; Thomas P Curtis; Russell J Davenport; Neil Hall; Ian M Head; L Fiona Read; William T Sloan
Journal: Nat Methods Date: 2009-08-09 Impact factor: 28.547

2. The B73 maize genome: complexity, diversity, and dynamics.

Authors: Patrick S Schnable; Doreen Ware; Robert S Fulton; Joshua C Stein; Fusheng Wei; Shiran Pasternak; Chengzhi Liang; Jianwei Zhang; Lucinda Fulton; Tina A Graves; Patrick Minx; Amy Denise Reily; Laura Courtney; Scott S Kruchowski; Chad Tomlinson; Cindy Strong; Kim Delehaunty; Catrina Fronick; Bill Courtney; Susan M Rock; Eddie Belter; Feiyu Du; Kyung Kim; Rachel M Abbott; Marc Cotton; Andy Levy; Pamela Marchetto; Kerri Ochoa; Stephanie M Jackson; Barbara Gillam; Weizu Chen; Le Yan; Jamey Higginbotham; Marco Cardenas; Jason Waligorski; Elizabeth Applebaum; Lindsey Phelps; Jason Falcone; Krishna Kanchi; Thynn Thane; Adam Scimone; Nay Thane; Jessica Henke; Tom Wang; Jessica Ruppert; Neha Shah; Kelsi Rotter; Jennifer Hodges; Elizabeth Ingenthron; Matt Cordes; Sara Kohlberg; Jennifer Sgro; Brandon Delgado; Kelly Mead; Asif Chinwalla; Shawn Leonard; Kevin Crouse; Kristi Collura; Dave Kudrna; Jennifer Currie; Ruifeng He; Angelina Angelova; Shanmugam Rajasekar; Teri Mueller; Rene Lomeli; Gabriel Scara; Ara Ko; Krista Delaney; Marina Wissotski; Georgina Lopez; David Campos; Michele Braidotti; Elizabeth Ashley; Wolfgang Golser; HyeRan Kim; Seunghee Lee; Jinke Lin; Zeljko Dujmic; Woojin Kim; Jayson Talag; Andrea Zuccolo; Chuanzhu Fan; Aswathy Sebastian; Melissa Kramer; Lori Spiegel; Lidia Nascimento; Theresa Zutavern; Beth Miller; Claude Ambroise; Stephanie Muller; Will Spooner; Apurva Narechania; Liya Ren; Sharon Wei; Sunita Kumari; Ben Faga; Michael J Levy; Linda McMahan; Peter Van Buren; Matthew W Vaughn; Kai Ying; Cheng-Ting Yeh; Scott J Emrich; Yi Jia; Ananth Kalyanaraman; An-Ping Hsia; W Brad Barbazuk; Regina S Baucom; Thomas P Brutnell; Nicholas C Carpita; Cristian Chaparro; Jer-Ming Chia; Jean-Marc Deragon; James C Estill; Yan Fu; Jeffrey A Jeddeloh; Yujun Han; Hyeran Lee; Pinghua Li; Damon R Lisch; Sanzhen Liu; Zhijie Liu; Dawn Holligan Nagel; Maureen C McCann; Phillip SanMiguel; Alan M Myers; Dan Nettleton; John Nguyen; Bryan W Penning; Lalit Ponnala; Kevin L 
Schneider; David C Schwartz; Anupma Sharma; Carol Soderlund; Nathan M Springer; Qi Sun; Hao Wang; Michael Waterman; Richard Westerman; Thomas K Wolfgruber; Lixing Yang; Yeisoo Yu; Lifang Zhang; Shiguo Zhou; Qihui Zhu; Jeffrey L Bennetzen; R Kelly Dawe; Jiming Jiang; Ning Jiang; Gernot G Presting; Susan R Wessler; Srinivas Aluru; Robert A Martienssen; Sandra W Clifton; W Richard McCombie; Rod A Wing; Richard K Wilson
Journal: Science Date: 2009-11-20 Impact factor: 47.728

3. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.

Authors: Chen-Shan Chin; David H Alexander; Patrick Marks; Aaron A Klammer; James Drake; Cheryl Heiner; Alicia Clum; Alex Copeland; John Huddleston; Evan E Eichler; Stephen W Turner; Jonas Korlach
Journal: Nat Methods Date: 2013-05-05 Impact factor: 28.547

Review 4. Advancements in Next-Generation Sequencing.

Authors: Shawn E Levy; Richard M Myers
Journal: Annu Rev Genomics Hum Genet Date: 2016-06-09 Impact factor: 8.929

Review 5. The role of replicates for error mitigation in next-generation sequencing.

Authors: Kimberly Robasky; Nathan E Lewis; George M Church
Journal: Nat Rev Genet Date: 2013-12-10 Impact factor: 53.242

6. Mutation detection by clonal sequencing of PCR amplicons and grouped read typing is applicable to clinical diagnostics.

Authors: Philip A Chambers; Lucy F Stead; Joanne E Morgan; Ian M Carr; Kate M Sutton; Christopher M Watson; Victoria Crowe; Helen Dickinson; Paul Roberts; Clive Mulatero; Matthew Seymour; Alexander F Markham; Paul M Waring; Philip Quirke; Graham R Taylor
Journal: Hum Mutat Date: 2012-10-11 Impact factor: 4.878

7. Ultrasensitive measurement of hotspot mutations in tumor DNA in blood using error-suppressed multiplexed deep sequencing.

Authors: Azeet Narayan; Nicholas J Carriero; Scott N Gettinger; Jeannie Kluytenaar; Kevin R Kozak; Torunn I Yock; Nicole E Muscato; Pedro Ugarelli; Roy H Decker; Abhijit A Patel
Journal: Cancer Res Date: 2012-05-10 Impact factor: 12.701

8. Minimus: a fast, lightweight genome assembler.

Authors: Daniel D Sommer; Arthur L Delcher; Steven L Salzberg; Mihai Pop
Journal: BMC Bioinformatics Date: 2007-02-26 Impact factor: 3.169

9. YASS: enhancing the sensitivity of DNA similarity search.

Authors: Laurent Noé; Gregory Kucherov
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

10. ThermoAlign: a genome-aware primer design tool for tiled amplicon resequencing.

Authors: Felix Francis; Michael D Dumas; Randall J Wisser
Journal: Sci Rep Date: 2017-03-16 Impact factor: 4.379
