Literature DB >> 21685062

Error correction of high-throughput sequencing datasets with non-uniform coverage.

Paul Medvedev¹, Eric Scott, Boyko Kakaradov, Pavel Pevzner.

Abstract

MOTIVATION: The continuing improvements to high-throughput sequencing (HTS) platforms have begun to unfold a myriad of new applications. As a result, error correction of sequencing reads remains an important problem. Though several tools do an excellent job of correcting datasets where the reads are sampled close to uniformly, the problem of correcting reads coming from drastically non-uniform datasets, such as those from single-cell sequencing, remains open.
RESULTS: In this article, we develop the method Hammer for error correction without any uniformity assumptions. Hammer is based on a combination of a Hamming graph and a simple probabilistic model for sequencing errors. It is a simple and adaptable algorithm that improves on other tools on non-uniform single-cell data, while achieving comparable results on normal multi-cell data. AVAILABILITY: http://www.cs.toronto.edu/~pashadag. CONTACT: pmedvedev@cs.ucsd.edu.

Entities: Chemical Disease Species

Mesh：

Year: 2011 PMID： 21685062 PMCID： PMC3117386 DOI： 10.1093/bioinformatics/btr208

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The continuing improvements to high-throughput sequencing (HTS) platforms have begun to unfold a myriad of new and exciting applications such as transcriptome analysis, metagenomics, single-cell assembly and variation detection. In addition, classical problems such as assembly are becoming feasible for large-scale endeavours, such as the Genome 10K project which seeks to assemble the genomes of 10 000 novel species (Genome 10K Community of Scientists, 2009). All these projects face the same difficulty of handling the base errors that are inevitably prevalent in HTS datasets. Error correction of reads has thus come to the forefront as an essential problem, with a slew of publications in the last 2 years (Ilie ; Kelley ; Qu ; Salmela, 2010; Schroder ; Shi ); Wijaya ; Yang ; Zhao ;). A common approach of error correction of reads is to determine a threshold and correct k-mers whose multiplicities fall below the threshold (Chaisson and Pevzner, 2008; Kelley ; Pevzner ; Shi ; Zhao ). Choosing the correct threshold is crucial since a low threshold will result in too many uncorrected errors, while a high threshold will result in the loss of correct k-mers. The histogram of the multiplicities of k-mers will show a mixture of two distributions—that of the error-free k-mers, and that of the erroneous k-mers. When the coverage is high and uniform, these distributions are centered far apart and can be separated without much loss using a cutoff threshold; such methods therefore achieve excellent results (Kelley ). Though HTS platforms provide relatively uniform coverage in many standard sequencing experiments, in some of the more challenging applications, such as transcriptome sequencing, the coverage remains drastically uneven (Li ; Qu ). Another prominent emerging example is that of single-cell sequencing, which has enabled the investigation (Kvist ; Mussmann ) of a diverse range of uncultivated bacteria, from the surface ocean environment (Yooseph ) to the human body (Gill ). Though these bacteria are not amenable to normal sequencing, recent advances in DNA amplification technology have enabled genome sequencing directly from individual cells without requiring growth in culture (Raghunathan ; Rodrigue ). The read datasets thus obtained from single-cells suffer from amplification bias, resulting in orders of magnitude difference in coverage (Raghunathan ) and the complete absence of coverage in some regions. Previous, single-cell studies have used error correction tools (Margulies ; Zerbino and Birney, 2008) that assume near-uniform coverage. However, such approaches are not as effective and result in decreased quality of assemblies when the two multiplicity distributions of erroneous and error-free k-mers are not separable using a simple threshold. Thus, despite the initial technological difficulties, the challenges facing single-cell genomics are increasingly computational rather than experimental (Rodrigue ). For applications such as these, it becomes paramount to develop better error-correction algorithms that do not assume uniformity of coverage. Recent papers have introduced a powerful idea that has the potential to remove uniformity assumptions ((Ilie ; Qu ; Salmela, 2010; Schroder ; Wijaya ; Yang ). Consider two k-mers x and y that are within a small Hamming distance and present in the read dataset. If our genome does not contain many non-exact repeats, then it is likely that both x and y were generated by the k-mer among them that has higher multiplicity. In this way, we can correct the lower multiplicity k-mer to the higher multiplicity one, without relying on uniformity. This can be further generalized by considering the notion of the Hamming graph, whose vertices are the distinct k-mers (annotated with their multiplicities) and edges connect k-mers with a Hamming distance less than or equal to a parameter τ (Yang ). In this article, we develop the method Hammer for error correction without any uniformity assumptions. Hammer is based on a combination of the Hamming graph and a simple probabilistic model. It is a simple and adaptable algorithm that improves on other tools on non-uniform single-cell Escherichia coli data, while achieving comparable results on normal multi-cell data.

2 METHODS

We will first present the simple probabilistic model on which Hammer is based, and then present the algorithm.

2.1 Model

We model sequencing as a process that repeatedly identifies a genomic position at random with an arbitrary (not necessarily uniform) distribution. Then for each position of u, it introduces a mutation with probabilty ϵ, and generates the resulting k-mer for the dataset. We denote G(u) as the set of all k-mers that were generated by u, and we call u a generating k-mer. Let us assume for the moment that we are given a multi-set C=G(u) of all the k-mers that were generated by u, but we do not know the identity of u. We wish to find the k-mer x that maximizes the likelihood of being the generator for C. The likelihood that x generates C is given by where d(x,y) is the Hamming distance between x and y. It follows that maximizing L(x) is equivalent to minimizing which we call the score. Define the consensus of C as a k-mer that at position i contains the most frequent nucleotide at that position in C, with ties broken arbitrarily. Finding the k-mer x that minimizes S(x) turns out to be easy (Jones and Pevzner, 2004):

Observation 1.

The k-mer x minimizing S(x) is the consensus of C. This observation is at the basis of the Hammer algorithm. In reality, we do not know the multi-sets C=G(u); however, we can use the connected components of the Hamming graph to approximate the multi-sets C. We now present the steps of the algorithm.

2.2 Algorithm

Hammer first uses a sort to identify the set K of distinct k-mers in the reads as well as a table m(x) storing the multiplicity of each k-mer. Next, it finds the connected components of the Hamming graph. It is computationally prohibitive to do an all-pairs comparison between the k-mers; however, this is a classical problem from combinatorial pattern matching for which there exist well-studied solutions that rely on spaced seeds (Jones and Pevzner, 2004):

Observation 2.

Let s1,…,sτ+1 be a disjoint partition of {1,…,k}. These represent spaced seeds into k-mers, and we denote the restriction of a k-mer x to the seed s as x(s). Then, if d(x,y)≤τ for some k-mers x and y, there exists j such that x(s)=y(s). Based on this observation, we create τ+1 spaced seeds s={j,j+τ+1,j+2(τ+1),…}. Then we make τ+1 copies of the k-mer set K with the j−th copy containing the restriction of K to the indices s. Each copy is then lexicographically sorted. In order to account for double-strandedness, we include the reverse complements of every k-mer in our set. Next, we linearly scan through each copy, and identify equivalence blocks, i.e. sets of k-mers that are equal when restricted to s. For every equivalence block, we compute the Hamming distances of all pairs of corresponding full-length k-mers. If a pair of k-mers is within a Hamming distance τ, their components are joined. The components are implemented using a union-find data structure (see (Cormen ) for details). This algorithm finds the connected components of the Hamming graph, which for simplicity we will refer to as clusters. It is similar to the algorithm of Reptile (Yang ), however, an important difference is that Reptile uses seeds that are consecutive blocks. It is known from sequence alignment (Ma ) that using spaced rather than consecutive seeds make it less likely for two k-mers to be equal. Because our index sets are spaced out across the whole range of k, the equivalence blocks should be smaller, allowing for a faster running time in practice. Finally, we process the clusters one at a time. Let C be the current cluster. If the cluster does not have size one and the consensus string is unique, we correct every k-mer in the cluster to the consensus string (Fig. 1).

Fig. 1.

On the left is an example of a typical cluster with good coverage. There are five k-mers clustered together, with five loci having mis-alignments. We compute the consensus string (taking multiplicities into account), which we find is already in the cluster (boxed in). All the k-mers are then corrected to the consenus. On the right is an example of a common cluster in low coverage regions. The generating k-mer was sequenced three times but each time with a single error. There are three k-mers in the cluster, but the consensus (boxed in) has not been sequenced and therefore is not in the cluster. Nevertheless, we correct all the k-mers to the consensus, allowing Hammer to reconstruct new k-mers that are not present in the original data. We refer to clusters which consist of only a single k-mer as singletons. This k-mer is trivially the consensus. It must either be the only element in the generating set of its generating k-mer, or it simply contains more than τ errors than the other members of the set. In another case, there are multiple consensus strings. which typically occurs in small clusters of size two or three. We refer to such clusters as ambiguous. In both of these cases, the Hamming graph does not provide good information regarding the correctness of the k-mers of C. We therefore make a decision solely based on the multiplicity of each k-mer x∈C, weighed by the quality values (Kelley ). If it is greater than a threshold singletonCutoff, we keep x in the dataset; otherwise we remove it. The choice of singletonCutoff should be based on minimum coverage—if we expect certain regions to have low coverage, singletonCutoff should be around one (we use one in our single-cell experiment). We also have a parameter saveCutoff, which makes sure that k-mers with a weighed multiplicity greater than or equal to saveCutoff are never removed, even if they are not the consensus of their cluster. This helps to address a shortcoming of our model, namely, that it supposes that there is only one correct k-mer in a cluster.

2.3 Time/space analysis and parameters

Hammer's running time and memory requirements are asymptotically (and in practice) dominated by the initial sorting of distinct k-mers. The sorting is done by storing the reads in memory and representing each k-mer as a pointer to its location in the reads. When n is the number of reads, ℓ is the read length, and n=nℓ−k=1) is the number of k-mers, the sort takes O(nlogn) time and O(nlogn) memory (here we treat ℓ and k as constants). There are more efficient alternatives for sorting k-mers, including map-reduce (Kelley ), hash functions and Graphical Processing Unit (GPU)-based methods (Shi ). However, for the purposes of the bacterial genomes we study in this article, we found that a sort sufficed. After building the set of distinct k-mers K, we no longer need to deal with all the n k-mers, but instead only with the distinct k-mers and their multiplicities. The second step of sorting the τ+1 copies of the k-mers therefore takes time O(n*logn*), where n*=|K| is the number of distinct k-mers in the dataset. Finding the connected components requires us to maintain a union find structure of size O(n*). Scanning through the sorted sets requires an all-pair comparison within each equivalence block, which in worst case can take O((n*)2) time; in practice, though, the size of the blocks is small. Finally, computing the consensuses of the clusters can be done in O(n*) time. Note that the steps of finding the connected components and their consensus are easily parallelizable, with almost no shared memory requirement. The three parameters of Hammer are: τ, the joining distance for k-mers, singletonCutoff, the threshold for singleton and ambiguous clusters, saveCutoff, the threshold above which k-mers are never removed, and the k-mer size. The trade-offs with the k-mer size are similar to those in assembly. Too small of a k results in reads being chopped up too much, with important information about the genome being lost. However, setting k too large results in low k-mer coverage and lowers the accuracy of the algorithm. The value of τ should be large enough as to connect most of the generating set of a k-mer into one connected component. Varying the value of singletonCutoff or saveCutoff represents a direct trade-off between sensitivity and specificity.

3 Results

To test the effectiveness of Hammer on non-uniform datasets, we used reads generated from a single-cell of E.coli K-12 strain, using one lane of the Illumina GAII pipeline (Chitsaz, H. et al., 2011, submitted for publication). They total 600× coverage mostly by 100 bp long reads. We mapped the reads to the E.coli genome using bowtie (Langmead ) and found extremely non-uniform coverage (Fig. 2). There were 94K positions that are not covered by any read, with an additional 116K being covered by less than six reads. In contrast, in an alternate dataset that was generated on a normal multi-cell sample with the same coverage and read length, there were no uncovered regions.

Fig. 2.

The coverage of E.coli by single-cell reads. There are 210K positions with coverage by less than six reads, while the average coverage is 600.

The coverage of E.coli by single-cell reads. There are 210K positions with coverage by less than six reads, while the average coverage is 600. Prior to working with this dataset, we trimmed the first three bases and any trailing bases with a quality value of 2; we then removed any remaining reads with ambiguous bases. We use a value of k=55 to run Hammer on this dataset, motivated by the fact that it was found to work best for assemblers on this data (Chitsaz, H. et al., 2011, submitted for publication). The distance τ used in building the Hamming graph was set to two, since we found larger values did not significantly decrease the number of singleton clusters. For the singletonCutoff, we used the default value of 1, which, roughly speaking, means that any singleton k-mer that appears at least twice with decent quality values is kept. However, to demonstrate that Hammer can be made super-sensitive if needed, we also ran Hammer with a singletonCutoff of 0.5. For the saveCutoff, we used a value of 10. The results are shown in Figure 3 and Table 1. To measure the quality of a set of reads, we measure its sensitivity and positive predictive value (PPV) with respect to the source genome. Let K be the set of distinct k-mers present in the reads, and let S be the k-spectrum of E.coli (i.e. the set of all distinct k-mers). The sensitivity is measured as and reflects the percentage of k-mers from the genome that are present in the reads. The positive predictive value is and reflects the fraction of k-mers in the dataset that do not contain errors. Note that we do not measure the quality of the corrected reads against the quality of uncorrected reads directly, but by measuring the quality of each separately against the E.coli genome.

Fig. 3.

Sensitivity and positive predictive values for single-cell reads. We show the results of varying the singletonCutoff in Hammer (0.5 and 1.0) and the cutoff in Quake (0.3–4.0). We also show the uncorrected dataset and the effect of simply trimming the reads down to 90 or 70 bases.

Table 1.

Sensitivity and positive predictive values for our experiments

E.coli	Method	Total distinct	Total correct	PPV	Sensitivity
dataset		k-mers	k-mers	(%)	(%)
single-cell	Uncorrected	87 579 268	4 422 073	5	96.86
	Hammer (cutoff = 1)	5 732 025	4 345 536	76	95.18
	Quake (cutoff = 4)	6 122 140	4 326 126	71	94.76
	Hammer (cutoff = 0.5)	17 790 512	4 398 487	25	96.34
	HiTEC (trimmed)	8 047 103	4 241 039	53	92.90
	Hammer (trimmed)	4 727 996	4 244 005	90	92.96
Normal	Uncorrected	77 421 925	4 564 319	6	99.98
	Hammer	4 787 644	4 545 158	95	99.56
	Quake	4 661 707	4 563 956	98	99.97

3.1 Comparison to other tools

We compared our results against Quake, which is a cutoff threshold-based method that was recently shown to have superior performance on normal (multi-cell) sequencing data (Kelley ). Quake works by modeling the weighted multiplicity histogram of k-mers as a combination of two separate distributions (one from erroneous reads, one from accurate reads), and using that model to determine the best cutoff threshold. It could not find any cutoff on the single-cell data, likely because the distribution model is based on uniformity assumptions that do not hold. However, we could manually try different values, and we plot the results in Figure 3 and show the raw numbers in Table 1. Hammer improves on Quake, retaining about 19K more genomic reads while reducing the number of erroneous reads by ~390 K. Though Quake was designed with uniformity in mind, an alternate method called HiTEC was recently published, which does not strongly rely on uniformity assumptions (Ilie ). HiTEC is based on a similar idea of using clusters to identify correct reads, albeit using suffix trees instead of the Hamming graph. Since HiTEC currently does not support ambiguous bases or variable length reads, in order to run it on our data we had to trim our reads down to 70 bp and remove all reads containing ambiguous bases. We then ran Hammer on the same altered dataset, with the results shown in Table 1. HiTEC sensitivity was nearly the same (a difference of 0.06%, or ~3 K reads, in favor of Hammer), but its PPV was significantly lower (only 53% compared to 90% by Hammer).

3.2 Analysis of Results

Out of all the correct distinct k-mers present in the data, 10% were in singletons. The distribution of their weighted multiplicities is shown in Figure 4. Most of the incorrect k-mers have weights <1, however, there is also ~53K correct k-mers that weigh between 0.5 and 1. This difference is apparent when we ran Hammer with a singletonCutoff of 0.5 (Table 1), allowing us to increase 1.2% in sensitivity.

Fig. 4.

The histogram showing the number of singletons with the given weighted multiplicity, on a vertical log scale.

The histogram showing the number of singletons with the given weighted multiplicity, on a vertical log scale. We categorize each non-singleton and non-ambiguous cluster according to whether the consensus is correct (i.e. exists in E.coli) and the number of correct k-mers present in the cluster prior to correction (Table 2). In 90% of the clusters, the consensus is accurate, and in 99% of those cases the consensus was present and was the only correct k-mer in the cluster.

Table 2.

Cluster classification

Cluster category		Number of clusters
Consensus	Num. correct	Single-cell	Normal
accurate?	k-mers in cluster	E.coli	E.coli
Yes	>1	12 717	13 056
Yes	1	3924 955	4 530 503
Yes	0	14 603	8427
No	>0	643	816
No	0	425 948	241 628

Classification of non-singleton and non-ambiguous clusters for both single-cell and normal data. Each cluster will have a consensus that is either correct or not, and will have a number of k-mers that were correct prior to correcion. We show the number of such clusters in Hammer's run on single-cell and on normal E.coli.

Cluster classification Classification of non-singleton and non-ambiguous clusters for both single-cell and normal data. Each cluster will have a consensus that is either correct or not, and will have a number of k-mers that were correct prior to correcion. We show the number of such clusters in Hammer's run on single-cell and on normal E.coli. The second largest category of clusters (10%) was where no correct k-mer was present and the chosen consensus was also incorrect. When the consensus is not present in the cluster, it is usually very small. However, in 3% of cases where the consensus is not present, it turned out to be correct, representing an important gain in sensitivity. These cases allow us to reconstruct k-mers that were not present in the original genome. It remains an outstanding problem to find a way to accurately distinguish whether the consensus of a cluster is correct or not, when it is not present in the cluster. Another category of interest is where the consensus was correct but there was more than one correct k-mer in the cluster. These cases occur when the genome has non-identical repeats, and hence there are correct k-mers with a small Hamming distance. Because this is a bacteria dataset, there were only 13K such cases, with a large fraction of them retained because of the saveCutoff. In eukaryotic genomes, however, this will represent a more serious problem, and thus extending Hammer for eukaryotic genomes is important continuing work.

3.3 Multi-cell normal data

We also tested Hammer on data generated from a normal multi-cell lane of E.coli with 600× coverage and 100 bp long reads (accession number ERX002508). We applied the same pre-filtering as for the single-cell data. The coverage of this dataset was close to uniform, with every single base covered by at least one read. As expected, Quake performed extremely well (Table 1), achieving 98% PPV and 99.98% sensitivity. Hammer did nearly as well, with 95% PPV and 99.56% sensitivity. There were still 242K consensus k-mers chosen when there were no correct k-mers in the cluster at all (Table 1). Again, it will be important to develop a way to detect these clusters without assuming uniformity. Though Hammer's results are comparable, Quake currently remains a better choice when it is known that the dataset is uniform.

3.4 Computational requirements

We used a single Intel Xeon 1.8 GHz machine with four dual-cores and 32 G of RAM. For our dataset, the run took approximately 4 h and needed 20 G of RAM. The longest step was the initial sorting of the k-mers, which took about 2 h.

4 DISCUSSION

Despite the success of many error-correction tools, emerging areas such as single-cell sequencing fuel the need to develop better algorithms for situations of non-uniform coverage. Recent approaches have taken steps in the right direction by looking together at all k-mers within a small Hamming neighborhood. Some rely on maximum likelihood estimation (Wijaya ), some on complex sets of rules (Yang ), and some on greedy local heuristics (Qu ). With the notable exception of Qu ), these algorithms still rely on uniformity of coverage. Moreover, most of these algorithms make decisions about k-mers independent of the decisions made about other k-mers in the same cluster. In this article, we have developed an algorithm that makes a decision about all the k-mers with a cluster in a unified but simple manner, allowing us to handle non-uniform single-cell datasets. In Hammer, we proposed the simple choice of the consensus string as a way to correct the k-mers within a cluster. However, this step is easily customizable with whatever choice may best suit the data, and we believe that this is a strength of our approach. In particular, one of Hammer's limitations is that it assumes that there is at most one correct k-mer in a cluster. Though this assumption results in relatively small losses in E.coli, and is furthermore offset by the use of the saveCutoff, it becomes a more serious problem in highly repetitive eukaryotic genomes. It is possible, however, to develop a more sophisticated approach that allows the identification of multiple correct k-mers within a cluster (Wijaya ). Another shortcoming is that Hammer chops the reads up into k-mers—while this is not be a problem for many downstream assemblers who do this anyway, it may be for other tools that require mapping. There are numerous existing techniques for reconstructing reads from the set of corrected k-mers (Chaisson and Pevzner, 2008; Kelley ; Yang ), and these may be incorporated into Hammer in the future. Newer platforms, such as the PacBio sequencer that has a large amount of indel errors, will also undoubtedly challenge Hammer's current approach; however, we believe that the idea of looking at all the k-mers within a single cluster jointly is a powerful one and will continue to be useful.

23 in total

1. An Eulerian path approach to DNA fragment assembly.

Authors: P A Pevzner; H Tang; M S Waterman
Journal: Proc Natl Acad Sci U S A Date: 2001-08-14 Impact factor: 11.205

2. Reptile: representative tiling for short read error correction.

Authors: Xiao Yang; Karin S Dorman; Srinivas Aluru
Journal: Bioinformatics Date: 2010-08-16 Impact factor: 6.937

3. Genome sequencing in microfabricated high-density picolitre reactors.

Authors: Marcel Margulies; Michael Egholm; William E Altman; Said Attiya; Joel S Bader; Lisa A Bemben; Jan Berka; Michael S Braverman; Yi-Ju Chen; Zhoutao Chen; Scott B Dewell; Lei Du; Joseph M Fierro; Xavier V Gomes; Brian C Godwin; Wen He; Scott Helgesen; Chun Heen Ho; Chun He Ho; Gerard P Irzyk; Szilveszter C Jando; Maria L I Alenquer; Thomas P Jarvie; Kshama B Jirage; Jong-Bum Kim; James R Knight; Janna R Lanza; John H Leamon; Steven M Lefkowitz; Ming Lei; Jing Li; Kenton L Lohman; Hong Lu; Vinod B Makhijani; Keith E McDade; Michael P McKenna; Eugene W Myers; Elizabeth Nickerson; John R Nobile; Ramona Plant; Bernard P Puc; Michael T Ronan; George T Roth; Gary J Sarkis; Jan Fredrik Simons; John W Simpson; Maithreyan Srinivasan; Karrie R Tartaro; Alexander Tomasz; Kari A Vogt; Greg A Volkmer; Shally H Wang; Yong Wang; Michael P Weiner; Pengguang Yu; Richard F Begley; Jonathan M Rothberg
Journal: Nature Date: 2005-07-31 Impact factor: 49.962

4. Genomic DNA amplification from a single bacterium.

Authors: Arumugham Raghunathan; Harley R Ferguson; Carole J Bornarth; Wanmin Song; Mark Driscoll; Roger S Lasken
Journal: Appl Environ Microbiol Date: 2005-06 Impact factor: 4.792

5. Specific single-cell isolation and genomic amplification of uncultured microorganisms.

Authors: Thomas Kvist; Birgitte K Ahring; Roger S Lasken; Peter Westermann
Journal: Appl Microbiol Biotechnol Date: 2006-11-16 Impact factor: 4.813

6. Short read fragment assembly of bacterial genomes.

Authors: Mark J Chaisson; Pavel A Pevzner
Journal: Genome Res Date: 2007-12-14 Impact factor: 9.043

7. The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.

Authors: Shibu Yooseph; Granger Sutton; Douglas B Rusch; Aaron L Halpern; Shannon J Williamson; Karin Remington; Jonathan A Eisen; Karla B Heidelberg; Gerard Manning; Weizhong Li; Lukasz Jaroszewski; Piotr Cieplak; Christopher S Miller; Huiying Li; Susan T Mashiyama; Marcin P Joachimiak; Christopher van Belle; John-Marc Chandonia; David A Soergel; Yufeng Zhai; Kannan Natarajan; Shaun Lee; Benjamin J Raphael; Vineet Bafna; Robert Friedman; Steven E Brenner; Adam Godzik; David Eisenberg; Jack E Dixon; Susan S Taylor; Robert L Strausberg; Marvin Frazier; J Craig Venter
Journal: PLoS Biol Date: 2007-03 Impact factor: 8.029

8. Metagenomic analysis of the human distal gut microbiome.

Authors: Steven R Gill; Mihai Pop; Robert T Deboy; Paul B Eckburg; Peter J Turnbaugh; Buck S Samuel; Jeffrey I Gordon; David A Relman; Claire M Fraser-Liggett; Karen E Nelson
Journal: Science Date: 2006-06-02 Impact factor: 47.728

9. Modeling non-uniformity in short-read rates in RNA-Seq data.

Authors: Jun Li; Hui Jiang; Wing Hung Wong
Journal: Genome Biol Date: 2010-05-11 Impact factor: 13.583

10. Insights into the genome of large sulfur bacteria revealed by analysis of single filaments.

Authors: Marc Mussmann; Fen Z Hu; Michael Richter; Dirk de Beer; André Preisler; Bo B Jørgensen; Marcel Huntemann; Frank Oliver Glöckner; Rudolf Amann; Werner J H Koopman; Roger S Lasken; Benjamin Janto; Justin Hogg; Paul Stoodley; Robert Boissy; Garth D Ehrlich
Journal: PLoS Biol Date: 2007-09 Impact factor: 8.029

45 in total

1. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.

Authors: Anton Bankevich; Sergey Nurk; Dmitry Antipov; Alexey A Gurevich; Mikhail Dvorkin; Alexander S Kulikov; Valery M Lesin; Sergey I Nikolenko; Son Pham; Andrey D Prjibelski; Alexey V Pyshkin; Alexander V Sirotkin; Nikolay Vyahhi; Glenn Tesler; Max A Alekseyev; Pavel A Pevzner
Journal: J Comput Biol Date: 2012-04-16 Impact factor: 1.479

2. The MaSuRCA genome assembler.

Authors: Aleksey V Zimin; Guillaume Marçais; Daniela Puiu; Michael Roberts; Steven L Salzberg; James A Yorke
Journal: Bioinformatics Date: 2013-08-29 Impact factor: 6.937

3. Using structural and evolutionary information to detect and correct pyrosequencing errors in noncoding RNAs.

Authors: Vladimir Reinharz; Yann Ponty; Jérôme Waldispühl
Journal: J Comput Biol Date: 2013-10-17 Impact factor: 1.479

Review 4. From next-generation resequencing reads to a high-quality variant data set.

Authors: S P Pfeifer
Journal: Heredity (Edinb) Date: 2016-10-19 Impact factor: 3.821

5. KCMBT: a k-mer Counter based on Multiple Burst Trees.

Authors: Abdullah-Al Mamun; Soumitra Pal; Sanguthevar Rajasekaran
Journal: Bioinformatics Date: 2016-06-09 Impact factor: 6.937

6. Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes.

Authors: Nathan D Olson; Todd J Treangen; Christopher M Hill; Victoria Cepeda-Espinoza; Jay Ghurye; Sergey Koren; Mihai Pop
Journal: Brief Bioinform Date: 2019-07-19 Impact factor: 11.622