Literature DB >> 28638050

De novo yeast genome assemblies from MinION, PacBio and MiSeq platforms.

Francesca Giordano1, Louise Aigrain2, Michael A Quail2, Paul Coupland3, James K Bonfield2, Robert M Davies2, German Tischler4, David K Jackson2, Thomas M Keane2, Jing Li5, Jia-Xing Yue5, Gianni Liti5, Richard Durbin2, Zemin Ning2.   

Abstract

Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base- pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well characterized genome, the Saccharomyces cerevisiae S288C strain using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.

Entities:  

Mesh:

Year:  2017        PMID: 28638050      PMCID: PMC5479803          DOI: 10.1038/s41598-017-03996-z

Source DB:  PubMed          Journal:  Sci Rep        ISSN: 2045-2322            Impact factor:   4.379


Introduction

The advent of next generation sequencing technologies (NGS) has marked the start of a new era in genomics research. Compared to the previous Sanger technology[1], NGS has significantly lowered the cost of sequencing using massively parallel sequencing methods[2, 3]. In a typical NGS run, DNA molecules are sheared into small fragments and then clonally amplified before being sequenced. After DNA amplification, multiple fragments of the sequences obtained may cover the same genome region, so that computational algorithms can be used to concatenate and assemble such reads like a jigsaw puzzle and generate a consensus to correct for the occasional sequencing errors. The typical length of the DNA fragments sequenced is between 50 and 400 bases long[2], and as a result, the assembly obtained from such short reads is fragmented in contigs much smaller than the actual chromosome sizes. In particular, short reads are not able to solve complex genome features like repeated regions (repeats) longer than the fragment length or copy number variations, with the typical outcome that (almost-) identical repeats are collapsed into a single element in the assembly. To overcome the high fragmentation of NGS-based assemblies and to help resolve long repeats, long-read sequencing technologies have been developed and recently adopted by the genomics community. The main characteristic of these new platforms is to work with long DNA molecules and provide reads with lengths up to hundreds of kilobases (kb). Reads of such length can be exploited in various ways. Particularly in the genome assembly field they can be used for de novo assembly with long-read data only, or for scaffolding of NGS-based assemblies by bridging gaps between contigs or spanning long repeats thus resolving them. A major drawback of long-read technologies is the higher rate of sequencing errors (5–20%) compared to NGS data (<1%)[2]. Such an error profile could negatively affect the assembly accuracy, but because the errors are mostly randomly distributed the majority of long-read assemblers adopt the strategy of correcting base errors algorithmically before attempting to assemble the reads. The most established long-read technology is from Pacific Biosciences (PacBio), which uses a sequencing by synthesis approach where phospholinked nucleotides are used to synthesize the complement strand of a single stranded DNA template. For the PacBio RS II machine, commercialized since 2011, the reads are characterized by lengths in the 5–60 kb range, with an average length around 12 kb[4]. The error rate for raw reads is about 13%[4] with errors, mostly indels, randomly distributed. After a base error correction step the reads can reach an accuracy of 99.9% and higher across the read length. Throughput can reach up to about 1 Gigabases (Gb) per run[4]. Presently, the cost of sequencing a large genome with the PacBio technology is still quite high, as it involves a high initial cost for the platform (about $700k) and about $300 per Gigabases[4]. But PacBio has recently released a new platform, Sequel, that promises to increase the throughput to up to 10 Gb per run, maintaining a long average read length and decreasing the cost per Gb sequenced. More recently, since 2014, the Oxford Nanopore Technology company (ONT) started distributing a new innovative sequencer, the MinION. The MinION is the first handheld sequencer and measures only 2 cm × 4 cm × 9 cm and weighs about 100 g. Within a MinION flowcell, single-stranded DNA molecules are guided across a membrane through protein-based nanopores. The membrane is immersed in a saline solution with a fixed voltage across it, so that an ionic current constantly passes through the pores. The DNA-strand motion through a pore causes a variation of the current, which is constantly monitored. A basecaller then determines the sequence of bases through the pore according to the observed variation in ionic current. To date, various basecallers are available: from an older Hidden-Markov-Model based (HMM) one to more recent ones based on Recurrent Neural Networks (f.i. Nanonet https://github.com/nanoporetech/nanonet). For the data analysed in this study, we used the HMM-based basecaller, which was the first made available by Oxford Nanopore. This basecalling package infers the sequence of successive k-mer words (5-mer or 6-mer) that passed through a particular pore by analyzing the current variations in time. According to their estimated qualities, the reads are collected into a ‘Fail’ or a ‘Pass’ directory, the latter only including the highest quality bi-directional (2D) reads, for which both strands (template and complement) of a DNA molecule have been sequenced and then merged in a single read. Like PacBio, the MinION sequencing errors are mostly randomly distributed. They can be corrected given enough read depth, with the exception of long sequences of the same base (homomers), that are often collapsed into shorter sequences by the basecaller. Nanopore sequencing is still in its rapid development stage, and since the data produced in this study many progresses have been made regarding error rate and throughput stability. Because of its portable size, low price ($1000 initial investment for the MinION, and <$300/Gb for flowcells bought in bundles), low computing requirements and relatively low wet-lab work load, the MinION has the potential to revolutionize the genomic discipline, in particular with regard to field and clinical sequencing. A number of preliminary attempts have already been reported, ranging from in situ outbreak analysis and control strategy[5-7], field DNA collection and sequencing[8], exploration of metagenomics for clinical use[9-11], and many more (see https://nanoporetech.com/publications). Moreover, higher-throughput machines that incorporate 5 (GridION, https://nanoporetech.com/products/gridion) or 48 (PromethION, https://nanoporetech.com/products/promethion) flowcells are in early access stage, and they are foreseen to deliver throughput similar to that of Illumina HiSeq per run. In this paper, we report a comparative study on the yeast genome assemblies from three different sequencing platforms: MiSeq from Illumina (NGS), and the long-read platforms PacBio RS II and ONT MinION. Apart from the reference yeast strain, Saccharomyces cerevisiae S288C, we also sequenced and assembled three other yeast strains, SK1, and the two Saccharomyces paradoxus N44 and CBS432. We explored the results in terms of accuracy, time and memory consumption of a variety of existing pipelines for long-reads-only de novo assembly, and for scaffolding of MiSeq-based assemblies using long reads. For the reference strain S288C, a comparison between the results from an ONT and a PacBio dataset with same read depth and similar read length distribution and accuracy is presented to assess the performance of the various pipelines in correcting the error types typical of the two technologies.

Results

Data Sequencing and Production

We have sequenced the genome of the four yeast strains, N44, SK1, CBS432 and S288C, with three different sequencing platforms: Illumina MiSeq, PacBio RS II, and MinION mk1. For each strain, we obtained about 80X depth of 2 × 150 bp MiSeq paired reads and between 120X and 250X depth from various PacBio runs. For the ONT runs, we obtained a total (‘Fail’ + ‘Pass’) 2D throughput (2D-All) between 12X and 61X depth depending on the strain, but only a throughput between 4X and 31X was achieved for the 2D-Pass data. Most Nanopore runs were performed using the R7.3 flowcells, available at the time of sequencing. For the reference strain S288C, the sequencing reads included two testing runs using the newly released R9 flowcell, in addition to the data of four runs from R7.3. The R9 flowcells generated data with higher accuracy than that of the R7.3 ones (91% against a R7.3 average of 88%). The R9 total 2D-All and 2D-Pass throughputs, 700 Mb and 60 Mb respectively, were too low for an independent study, so we merged the R7.3 and R9 reads to get a dataset of 61X depth 2D-All and a dataset of 31X depth 2D-Pass, upon which our assembly comparisons are based. To compare the capabilities of the assembly pipelines we used the S288C strain data, so that assembly quality and accuracy could be easily and directly determined by mapping against its well known reference. Our S288C ONT and PacBio datasets have very different read depths: ONT only 31X (2D-Pass), while PacBio about 120X. By assembling the whole PacBio dataset we obtained contiguous assemblies with accuracy up to 99.95%, as shown in the Supplementary Note. For our pipeline comparison though, we decided to subset the PacBio dataset to the same depth of the ONT dataset, so that differences between platform results are driven by platform differences and not different sample sizes. If we selected the PacBio reads randomly until reaching the desired depth, the subset would follow a similar read length distribution of the original dataset, which has a smaller average read length than ONT. This would make it more difficult to interpret the assembly continuity as due to platform read intrinsic features like error rate and distribution, or different read lengths. Because of this, we decided to reduce the number of variables for assembly assessment by selecting a PacBio subsample of the S288C reads with the same depth (31X) but also similar read length distribution as the S288C 2D-Pass ONT reads: the PacBio ONT-Emu sample (Figure 1-(c)). More information on how we extracted the subset can be found in the Supplementary Note. To study the dependence of the assembly pipelines on read depth, we also selected two subsets with 10X and 20X depth for the S288C ONT and PacBio data, again with similar read length distributions. For the total PacBio S288C dataset, which consists of 120X read-depth, we performed a more detailed depth study shown in the Supplementary Note for subsets of 10X, 20X, 31X, 61X, 80X and the whole sample at 120X depth. Details about the sequenced datasets presented in this paper are summarized in Table 1 for ONT, and in Table 2 for PacBio samples which ‘emulate’ the ONT read length profiles (ONT-Emu). The read length distributions for the S288C, N44, CBS432 and SK1 strains of the 2D-Pass ONT datasets are shown in Fig. 1-(a) and those of the PacBio datasets in Fig. 1-(b).
Figure 1

Read length distributions for the ONT and PacBio datasets. Read length distributions for the four yeast strains, S288C, N44, CBS432 and SK1 of the 2D-Pass ONT datasets in (a) and the PacBio datasets in (b). Comparison of read length distributions for the S288C strain of the 31X datasets ONT 2D-Pass and PacBio ONT-emulating 31X-subset in (c).

Table 1

Statistic information for the 2D-Pass ONT datasets for the S288C, N44, CBS432 and SK1 strains.

Oxford Nanopore Datasets
StrainDatasetBases (Mb)ReadsAverage (b)Longest (b)N50 (b)Identity
S288C2D-Pass: 31X38342,3259,04056,47711,69393.3%
2D-Pass: 20X12113,3669,05456,02811,71692.0%
2D-Pass: 10X24226,7219,05756,47711,65992.8%
N442D-Pass: 11X13015,6548,29237,8379,861NA
CBS4322D-Pass: 9X11012,2118,95246,48111,201NA
SK12D-Pass: 4X515,9388,58936,79110,971NA

For the S288C strain, also shown are a 20X and a 10X subsets of randomly selected reads from the immediately larger 2D-Pass dataset.

Table 2

Statistic information for the PacBio datasets for the S288C, N44, CBS432 and SK1 strains.

Pacfic Biosciences Datasets
StrainDatasetBases (Mb)ReadsAverage (b)Longest (b)N50 (b)Identity
S288C120X1,463239,4086,10935,1968,65692.5%
ONT-Emu: 31X37542,1808,89335,19611,19691.9%
ONT-Emu: 20X24226,7869,03535,19611,61591.7%
ONT-Emu: 10X12113,4568,99331,62711,58291.2%
N44148X1,794371,0254,83433,9066,800NA
CBS432135X1,639324,4145,05334,1737,212NA
SK1248X3,019697,9894,32534,0806,184NA

For the S288C strain also the ONT-emulating subsets are shown: ‘ONT-Emu’ 31X, 20X and 10X subsets, selected to match the 31X, 20X and 10X ONT S288C datasets for depth and read length distribution.

Read length distributions for the ONT and PacBio datasets. Read length distributions for the four yeast strains, S288C, N44, CBS432 and SK1 of the 2D-Pass ONT datasets in (a) and the PacBio datasets in (b). Comparison of read length distributions for the S288C strain of the 31X datasets ONT 2D-Pass and PacBio ONT-emulating 31X-subset in (c). Statistic information for the 2D-Pass ONT datasets for the S288C, N44, CBS432 and SK1 strains. For the S288C strain, also shown are a 20X and a 10X subsets of randomly selected reads from the immediately larger 2D-Pass dataset. Statistic information for the PacBio datasets for the S288C, N44, CBS432 and SK1 strains. For the S288C strain also the ONT-emulating subsets are shown: ‘ONT-Emu’ 31X, 20X and 10X subsets, selected to match the 31X, 20X and 10X ONT S288C datasets for depth and read length distribution.

De novo Assembly and Scaffolding Pipelines

It has been shown already[12] that contiguous and accurate yeast assemblies can be generated de novo solely using ONT data. Here, we focus on assessing the various existing pipelines and on comparing the results obtained from ONT and PacBio data with particular attention to the similarities and differences between the two platforms. The long reads provided by PacBio and ONT can be used to generate a de novo assembly either by themselves or in conjunction with Illumina (NGS) data. In this paper, we show examples of assemblies from long reads only, and from’hybrid’ pipelines that use the Illumina reads either to correct the long reads or to generate a short-read assembly that is later scaffolded with long reads. We selected eight assembly pipelines for long reads. PBcR[13] (with Self- or MiSeq-based error correction: PBcR-Self, PBcR-MiSeq), Canu[14] and Falcon[15] are based on an Overlap-Layout-Consensus (OLC) algorithm; ABruijn[16] is based on a generalized De-Bruijn graph algorithm. All of them include a base-error correction step on the reads before assembling them. SMARTdenovo (available from https://github.com/ruanjue/smartdenovo) is also based on a OLC algorithm, but does not include a base-error correction step. Miniasm[17] chooses the best path from a string graph created from the overlap of uncorrected reads: it is the only pipeline here that includes neither a base-error correction nor a consensus step. Racon[18] aligns the raw long reads to a Miniasm assembly and generates a consensus, significantly increasing the initial accuracy. We selected and assessed also three scaffolding pipelines: npScarf[19], HybridSPAdes[20] and SMIS (available from https://github.com/fg6/smis.git), to scaffold an NGS-based assembly from SPAdes[21]. Details about these pipelines and the parameters used for running them can be found in the Supplementary Note.

Assemblies from different pipelines and platforms

The de novo assemblies of the whole PacBio datasets for S288C and the other strains are shown and discussed in the Supplementary Note, while here we compare the performances of long-read assemblers when run on a 31X depth ONT or PacBio datasets with similar read length distributions. For this purpose we selected a subsample from the PacBio S288C dataset with the same depth as the ONT 2D-Pass sample (31X) and with similar read length distribution, as described above and shown in Fig. 1-(c). The statistic information for the two S288C samples can be compared if looking at the sample S288C ‘2D-Pass: 31X’ in Table 1 for the ONT case, and at the S288C ‘ONT-Emu 31X’ in Table 2 for the PacBio case emulating the ONT sample. As shown in these tables, the two samples also have similar average error rates (92% average identity for PacBio and 93% for ONT datasets). Using these two similar samples, except for normal fluctuations, differences in assembler’s performances can be attributed to the peculiar features of the reads from the two platforms, for instance to error types and distributions. The assembly information for the ONT dataset ‘2D-Pass 31X’ and for the PacBio ‘ONT-Emu 31X’ can be found in Tables 3 and 4, respectively. With 31X depth the pipelines generated assemblies with similar features when running on PacBio or ONT data.
Table 3

Statistic information about the de novo assemblies for the S288C ONT datasets for the hybrid pipeline PBcR-MiSeq, the pipeline Miniasm, with no base error correction nor consensus step, and for the non-hybrid pipelines: Racon (on a Miniasm draft assembly), Falcon, SMARTdenovo, ABruijn, PBcR-Self and Canu for, from top to bottom: all the 2D-Pass reads (2D-Pass 31X), the 2D-Pass 20X subset and the 2D-Pass 10X subsets.

Oxford Nanopore S288C Datasets
DatasetAssemblerBases (Mb)ContigsN50 (kb)Reference CoverageSNPs, Indels (#per kb)IdentityMisAssNa50 (kb)Genes (6,615)CPU Time (h)Memory (GB)
2D-Pass 31XPBcR-MiSeq11.97630599.08% 0.1, 0.2 99.94% 18 273 6,514 147 17
Miniasm11.82773994.85%34, 6789.42%263623,353 0.1 5
Racon12.02775298.80%0.4, 11 98.76% 245346,53385
Falcon11.94371799.09%0.5, 2197.79%27 546 6,526 19 71
SMARTdenovo 12.1 2862599.54%0.3, 1498.50%255316,5562 5
ABruijn12.4 26 769 98.89% 0.1, 1598.49%315366,533448
PBcR-Self12.96461699.21%0.2, 1798.24%925256,55269523
Canu 12.1 29698 99.62% 0.1, 1798.30%34530 6,566 8014
+Nanopolish12.32970999.63%0.1, 499.57%355386,5841,83512
2D-Pass 20XPBcR-MiSeq11.86626999.09% 0.1, 0.2 99.94% 8 2626,5229513
Miniasm11.63941894.66%34, 6789.36%242863,271 0.1 3
Racon11.83942398.11%0.7, 13 98.56% 263936,4785 2
Falcon10.78421090.64%0.6, 2197.56%171945,9461044
SMARTdenovo11.9 29 656 98.99%0.8, 1698.23%244556,52814
ABruijn 12.0 29 46898.55%0.3, 1698.28%124366,495297
PBcR-Self12.972545 99.32% 0.3, 1898.08%74452 6,550 34220
Canu11.93154498.99%0.2, 1898.10%254416,5254110
2D-Pass 10XPBcR-MiSeq11.3123 161 95.75% 0.1, 0.2 99.94% 13 146 6,310 337
Miniasm7.91585867.90%24, 4689.26%12432,2560.020.002
Racon8.11586070.33%2, 1397.72%15584,52031
Falcon1.41131715.90%0.1, 397.43%616901324
SMARTdenovo10.4 114 11588.71%5, 2496.58% 12 1045,610 1 1
ABruijn8.58611172.68%1, 1697.47%21974,711168
PBcR-Self 11.5 16710691.33%1, 22 97.40% 641025,957713
Canu10.711513491.52%1, 2397.32%181125,955136

For the 2’D-Pass 31X’ dataset also the results from Nanopolish on the Canu assembly is shown. In each column the best value is highlighted in bold. For the identity column the best value is always for the hybrid assembly PBcR-MiSeq, but we also highlighted (bold and underlined) the best value for the non-hybrid pipelines, ignoring Nanopolish as it is the only polishing tool. For the 10X datasets, we ignored assemblies with less than 80% reference coverage when choosing the best values.

Table 4

As Table 3 but for PacBio-based assemblies from the ONT-Emu PacBio subsets at, from top to bottom, 31X, 20X and 10X depth.

Pacific Biosciences S288C Datasets
DatasetAssemblerBases (Mb)ContigsN50 (kb)Reference CoverageSNPs, Indels (# per kb)IdentityMisAssNa50 (kb)Genes (6,615)CPU Time (h)Memory (GB)
ONT-Emu 31XPBcR-MiSeq11.97627098.68% 0.1, 0.1 99.97% 9 2706,52613217
Miniasm12.53556396.10%19, 8889.37%531063,226 0.1 5
Racon 12.1 3454499.09%0.3, 499.50%224296,540195
Falcon12.03554998.18%0.3, 299.78%284366,5081364
SMARTdenovo12.3 20 929 99.97% 0.2, 399.66%275496,5962 4
ABruijn12.32666699.30% 0.1, 199.87%434696,565197
PBcR-Self12.43975199.50% 0.1, 199.92%435486,5906324
Canu12.32860799.92% 0.1, 1 99.93% 29534 6,601 1510
ONT-Emu 20XPBcR-MiSeq11.76430498.73% 0.1, 0.1 99.97% 7 2646,5018613
Miniasm12.08620293.08%18, 8489.57%53693,255 0.04 3
Racon11.68619495.35%1, 799.18%251896,24110 2
Falcon9.915211582.22%0.3, 299.65%291125,3415.641
SMARTdenovo 12.2 38 545 99.80% 0.5, 8 99.09%244346,53413
ABruijn11.75827296.06% 0.1, 299.72%362586,325239
PBcR-Self12.34450299.03%0.2, 299.78%354286,5603020
Canu 12.2 4245499.47%0.2, 2 99.79% 28432 6,565 87
ONT-Emu 10XPBcR-MiSeq 10.9 144 117 92.26% 0.1, 0.1 99.97% 9 111 6,058 46 7
Miniasm4.01203533.92%6, 2789.61%4191,0350.020.1
Racon3.81203435.48%1, 598.28%8332,09551
Falcon0.659148.99%0.1, 0.299.41%1013421123
SMARTdenovo8.51576171.29%3, 2296.43%8554,27111
ABruijn4.7677141.45%0.4, 498.87%10672,631137
PBcR-Self9.72325778.99%1, 798.96%35555,1111218
Canu8.91786275.35%0.4, 699.14%19594,81134
Statistic information about the de novo assemblies for the S288C ONT datasets for the hybrid pipeline PBcR-MiSeq, the pipeline Miniasm, with no base error correction nor consensus step, and for the non-hybrid pipelines: Racon (on a Miniasm draft assembly), Falcon, SMARTdenovo, ABruijn, PBcR-Self and Canu for, from top to bottom: all the 2D-Pass reads (2D-Pass 31X), the 2D-Pass 20X subset and the 2D-Pass 10X subsets. For the 2’D-Pass 31X’ dataset also the results from Nanopolish on the Canu assembly is shown. In each column the best value is highlighted in bold. For the identity column the best value is always for the hybrid assembly PBcR-MiSeq, but we also highlighted (bold and underlined) the best value for the non-hybrid pipelines, ignoring Nanopolish as it is the only polishing tool. For the 10X datasets, we ignored assemblies with less than 80% reference coverage when choosing the best values. As Table 3 but for PacBio-based assemblies from the ONT-Emu PacBio subsets at, from top to bottom, 31X, 20X and 10X depth. The fastest pipeline is Miniasm, which does not include a base error correction nor a consensus step. For both PacBio and ONT datasets it only took 4–5 minutes to run but only achieved 89% accuracy. Because of the high number of indels in the ONT and PacBio data, Miniasm reconstructed only about 95–96% of the genome, significantly less than the other pipelines. Globally, the assembly structure from the Miniasm pipeline is correct, indicating that the missing regions are due to local base errors. Part of these base errors are recovered by the other pipelines either by building a consensus from the raw reads (Racon, SMARTdenovo) or during their base-error correction steps (PBcR-MiSeq, PBcR-Self, Canu, Falcon and ABruijn). The assemblies with the highest accuracy, 99.94–99.97% are the ones from the only hybrid assembler, PBcR-MiSeq, that uses MiSeq reads to correct the ONT or PacBio reads. But PBcR-MiSeq also provided the most fragmented assemblies, with Na50s only 270 kb long, where Na50s are the N50s after breaking the contigs at the misassembly points found by Quast[22]. None of the other assemblers uses Illumina reads, and we refer to them as the non-hybrid pipelines. The non-hybrid pipelines reconstructed 98–99% of the genome, with Na50s in the 400–500 kb range long, and accuracy up to 98.76% for ONT reads and up to 99.93% for the PacBio reads. The highest accuracy between the non-hybrid assemblies for the ONT data is from Racon, with 98.76%, immediately followed by SMARTdenovo and ABruijn with about 98.50%. For the PacBio datasets the Celera-based assemblers (Canu, PBcR-Self) provided the highest accuracy, 99.92–99.93%, followed by ABruijn at 99.87% and Falcon at 99.78%. To assess the completeness of the newly generated assemblies, we checked for the reconstruction of a list of known S288C genes. From the list of ORF coding sequences reported in the Saccharomyces Genome Database (http://www.yeastgenome.org), we selected 6,615 coding sequences that mapped to the S288C reference for at least 90% of their length with at least 90% mapping identity. We used BWA[23] to align each gene against the new assembly under study, and declared a gene ‘reconstructed’ if at least 90% of it was assembled with at least 90% accuracy. Tables 3 and 4 for ONT and PacBio data respectively, show a correlation between the assembly continuity and the number of mapped genes, i.e. an assembly from a particular pipeline with longer contigs is likely to have reconstructed more genes. For example, the SMARTdenovo assemblies consist of 20 contigs when using 31X PacBio reads and 38 contigs when using 20X PacBio data, and the numbers of reconstructed genes are 6,596 and 6,534 respectively. For the ONT data at 31X, the contig number is 28, and it is associated with 6,556 genes; at read coverage of 20X, the contig number is increased to 29, while mapped genes dropped to 6,528. Even though there is no single pipeline that outperforms at every statistic gathered, SMARTdenovo and Canu assemblies have the longest reference coverage, best or near-best average identity, highest number of genes found and long Na50s for both PacBio and ONT data. Apart from Miniasm, SMARTdenovo is also the pipeline using the least resources, with CPU running time ≤2 h and ≤5 GB of memory. Racon, Falcon and ABruijn are the next fastest pipelines. PBcR is the pipeline typically requiring the longest running times especially when using ONT data.

Missing homomers

Ignoring Miniasm which has no error correction nor a consensus step, and the hybrid pipeline PBcR-MiSeq, which corrects the reads using Illumina data, it is clear that the PacBio assemblies have higher accuracies, from 99.50% to 99.93%, while the ONT assemblies reach a maximum accuracy of 98.76%. Because the average read accuracy, the depth and the read length distribution of the ONT and PacBio datasets used here are very similar, the final higher accuracies for the PacBio assemblies are likely due to platform-intrinsic features of the data. When no base-error correction nor consensus step is performed, like in the Miniasm pipeline, the resulting assemblies are about 89% accurate for both the ONT and PacBio datasets, and present lots of mismatches and indels, with prevalence of indels. Adding a correction or a consensus step like the other non-hybrid pipelines significantly reduces the number of mismatches and indel errors for both ONT and PacBio data. While the number of mismatches is reduced to a similar level for ONT and PacBio, the number of indels in ONT assemblies remains very high. The ONT assembly with fewer indels (Racon, indels = 11 per kb) has about 11 times more indels than the best PacBio case (Canu, indels = 1 per kb), and about 3 times more indels than the worst PacBio case (Racon, indels = 4 per kb). The high accuracy reached with the PacBio data shows that most PacBio errors can be corrected by generating a consensus between the reads that cover the same genomic region, and is a clear indication that PacBio read errors are mostly randomly distributed. For the ONT data the situation is more challenging: while it is true that some of the errors have a random distribution and are corrected by the various pipelines, a good portion of them, mainly indels, seem to escape the correction attempts, pointing to a possible systematic source of errors. Indeed most of the remaining errors are due to missing homomers in the ONT reads. For instance, the assembly from Canu contains far fewer 5-homomers (“AAAAA”, “TTTTT”, “CCCCC” and “GGGGG”) than the reference, as shown in Fig. 2, where the blue bars represent the Reference’s counts and the orange bars Canu’s counts. The figure also shows that if the Canu assembly is polished with Nanopolish[24] (green bars) many of the missing homomers are recovered, and the final assembly accuracy reaches 99.57%, as shown in Table 3 for the ‘2D-Pass 31X’ sample. This table also shows the downside of Nanopolish: it needs 1,835 CPU hours to polish the Canu assembly, which makes this pipeline impractical for larger genomes.
Figure 2

Homomer counts. Counts for the 5 bases homomers in the Reference (blue), in the Canu assembly (orange), and in the Canu assembly after polishing with Nanopolish (green).

Homomer counts. Counts for the 5 bases homomers in the Reference (blue), in the Canu assembly (orange), and in the Canu assembly after polishing with Nanopolish (green).

Mitochondrial genome reconstruction

While all of the 16 yeast nuclear chromosomes are well reconstructed by every tested pipeline with respect to continuity, accuracy and completeness, many pipelines failed to reconstruct the mitochondrial genome (Supplementary Tables S5 and S6). This chromosome is the smallest one in the reference assembly, only containing 85,779 bases, but appears to be very challenging to assemble, especially with ONT datasets. Using ONT data, only PBcR-MiSeq was able to reconstruct it almost completely (96%), and only Falcon and Canu were able to reconstruct at least half of it (67%, 64%, respectively). Using PacBio data there are three full reconstructions from Racon, SMARTdenovo and Canu, while Miniasm manages to reconstruct 77% of it. There are a number of possible reasons for the mitochondrial genome to be more challenging to reconstruct than the other chromosomes. The GC content in the mitochondrial genome is only 17%, much lower than the average value of 38% for the nuclear chromosomes. There are also a lot of highly repetitive small AT k-mers, which contribute to the difficulty of assembly. We found that while for the ONT dataset the mitochondrial read depth was consistent with that of the other chromosomes, the PacBio mitochondrial depth was about 6 times higher which could help explain why the mitochondrial genome reconstruction is more complete with the PacBio data. To exclude the possibility that this is due to a bias in our subsetting, we looked at the whole PacBio dataset and found that, similarly to the subset, the mitochondrial genome depth is about 7 times higher than the nuclear chromosomes. This is likely due to the high copy number of mitochondria in a cell. Possibly, the PacBio library preparation preserves this higher copy number, while the ONT library preparation does not. For instance, it is possible that DNA molecules have been less vigorously sheared in the ONT case, so that many of the very short mitochondrial genome remained circular preventing the ligation of the adapters needed for sequencing. In addition to a lower depth with respect to PacBio, the mitochondrial genome assembly from ONT data is also hindered by an higher than average homomer content: using the S288C reference, we estimated that the mitochondrial genome has at least 30% more homomers than the other chromosomes, for k-mer lengths 3, 4, 5, 6 and 7. Figure 3 shows the 5-homomer counts (#5A + #5T + #5G + #5C) ratios between each chromosome and the mitochondrial genome, each count normalized by the chromosome length in bases.
Figure 3

Chromosome homomer rate with respect to that of the mitochondrial genome. Ratios of 5–homomer counts normalized by the chromosome length between each chromosome and the mitochondrial genome (mt).

Chromosome homomer rate with respect to that of the mitochondrial genome. Ratios of 5–homomer counts normalized by the chromosome length between each chromosome and the mitochondrial genome (mt).

De novo Assembly by Varying the Read Depth

In order to assess the scalability of the assembly pipelines and to determine how the performances of the assemblers vary with read depth, we selected and analyzed subsets of the ONT and the PacBio full datasets. A dedicated study focusing only on the higher depth PacBio dataset is discussed in the Supplementary Note. Here, we compared the assemblies at 31X mentioned in the previous section, with 20X and 10X subsets from the ONT 2D-Pass and similar subsets from PacBio reads emulating the read length distribution of the ONT subsets, as we did for the 31X case. These datasets are presented in Tables 1 and 2 for ONT and PacBio respectively, where the PacBio subsets are labeled as ONT-Emu. The results of the pipelines are summarized for the 31X, 20X and 10X in Table 3 for ONT and Table 4 for PacBio data. At very low read depth, 10X, the hybrid pipeline, PBcR-MiSeq, was the best performing pipeline for both ONT and PacBio data providing assemblies which cover more than 92% of the genome with accuracies larger than 99.9%, and the longest Na50s, in the 100s of kb. At 20X depth the Na50 is about doubled, and 98–99% of the genes are reconstructed. For the non-hybrid pipelines, the Celera-based ones, Canu and PBcR-Self, were the best performers at low read depths. At 10X, Canu and PBcR-Self performed better on ONT than PacBio data. These assemblies cover about 90% (75–79%) of the reference with an accuracy of 97% (99%) and a Na50 of 100 (60) kb with ONT (PacBio) data. All the other assemblies cover significantly less proportions of the reference genome, and in particular Falcon seems to have the most difficulties at such low read depth. At 20X, all the Celera-based pipelines still performed very well, with Canu and PBcR-Self providing Na50s in the range of 400s kb, as well as SMARTdenovo. The assembly from Racon distinguishes itself on ONT data for the highest accuracy, immediately followed by ABruijn and SMARTdenovo, while for PacBio data Canu, PBcR-Self and ABruijn assemblies have the highest accuracies. Already at 20X, Canu, PBcR-Self, PBcR-MiSeq and SMARTdenovo reconstructed about 99% of the genes. At 31X depth the non-hybrid pipelines increased the Na50s up to 500s kb and slightly improved their final accuracies.

Genome scaffolding using long reads

We explored the scaffolding performance of three pipelines that use long ONT/PacBio reads to bridge and merge contigs from a NGS assembly. The NGS assembly used here has been generated from a dataset of 80X of 2 × 150 bp MiSeq paired-reads by SPAdes, as one of the scaffolding pipelines is embedded with it (HybridSPAdes). As expected, the assembly from SPAdes is fragmented but has very high accuracy: it has 206 contigs and its N50 is only 125 kb, as shown at the top of Table 5. The scaffolding pipelines were able to bridge a number of contigs significantly increasing the assemblies’ N50s while maintaining a high accuracy for both the PacBio and ONT samples. The npScarf pipeline achieved N50s much longer than HybridSPAdes or SMIS, 771 kb for ONT and 715 kb for PacBio, but the Quast analysis of these assemblies revealed that they are affected by a number of misassemblies, and their final Na50 is 514 kb for the ONT data and 413 kb for PacBio. Even after correcting for the misassemblies npScarf provided the more contiguous assemblies, but the HybridSPAdes and SMIS ones were able to maintain a higher accuracy than npScarf, 99.97–99.98% against 99.91%. The assemblies from HybridSPAdes have a higher coverage over the reference and reconstruct more genes than SMIS or npScarf. The npScarf tool has the advantage of requiring the least resources as it is faster than the other two pipelines and it uses less memory.
Table 5

MiSeq-only assembly from SPAdes in top row.

S288C Datasets: Scaffolding Pipelines
DatasetAssemblerBases (Mb)ContigsN50 (kb)Reference CoverageSNPs, Indels (# per kb)IdentityMisAssNa50 (kb)Genes (6,615)CPU Time (h)Memory (GB)
MiSeqSPAdes11.620612598.3%0.04, 0.0399.98%51256,399512
ONT 2D-Pass 31XnpScarf 11.9 21 771 99.8%0.4, 0.399.91%69 514 6,559 3 4
HybridSPAdes11.864444 99.97% 0.1, 0.04 99.97% 7 416 6,582 1812
SMIS11.88554998.4% 0.04, 0.04 99.98% 134936,41113 4
PacBio ONT-Emu 31XnpScarf 11.7 22 715 98.5%0.3, 0.499.91%67 413 6,458 2 3
HybridSPAdes 11.7 68364 99.9% 0.1, 0.04 99.97% 5 317 6,583 2712
SMIS 11.7 8954698.8% 0.04, 0.04 99.97% 403096,39996

MiSeq-only assembly from SPAdes scaffolded by the npScarf, HybridSPAdes and SMIS pipelines using the ‘2D-Pass 31X’ ONT sample (Middle) and the ‘ONT-Emu 31X’ PacBio subset (Bottom).

MiSeq-only assembly from SPAdes in top row. MiSeq-only assembly from SPAdes scaffolded by the npScarf, HybridSPAdes and SMIS pipelines using the ‘2D-Pass 31X’ ONT sample (Middle) and the ‘ONT-Emu 31X’ PacBio subset (Bottom). We ran the same scaffolding pipelines on the three other yeast strains N44, CBS432 and SK1 with 2D-Pass ONT data at the lower depths of 11X, 9X and 4X, respectively. The scaffolding results are shown in Table 6. As a reference is not available for these strains, we could not evaluate the accuracy and misassemblies of the scaffolds, but, as for the S288C case, we can expect high accuracy, >99.95%, as these assemblies are based on Illumina data. The longer N50s from npScarf are likely significantly affected by misassemblies, as observed in the S288C case, while the N50s estimated by HybridSPAdes and SMIS are expected to be more accurate. A comprehensive structure variation analysis of these strains can be found in ref. 25.
Table 6

Statistic information on the de novo assemblies from the MiSeq-only SPAdes pipeline and the same SPAdes assembly scaffolded with npScarf, HybridSPAdes and SMIS for the N44 (top panel), CBS432 (middle panel), and SK1 (bottom panel) strains.

Oxford Nanopore Datasets
DatasetAssemblerBases (Mb)ContigsN50 (kb)Genes (6,615)CPU Time (h)Memory (GB)
N44 2D-Pass: 11XSPAdes11.61871175,475713
npScarf11.7198985,53812
HybridSPAdes11.7613245,547813
SMIS11.7585115,47425
CBS432 2D-Pass: 9XSPAdes11.61811505,498512
npScarf11.4199285,44311
HybridSPAdes11.7495155,611612
SMIS11.7646585,49915
SK1 2D-Pass: 4XSPAdes11.62401186,341612
npScarf11.7435076,4350.31
HybridSPAdes11.71112276,444512
SMIS11.71423586,34113
Statistic information on the de novo assemblies from the MiSeq-only SPAdes pipeline and the same SPAdes assembly scaffolded with npScarf, HybridSPAdes and SMIS for the N44 (top panel), CBS432 (middle panel), and SK1 (bottom panel) strains.

Discussion

We have shown that the yeast genome can be de novo assembled with Na50s up to 550 kb with 31X read depth from PacBio or ONT platforms, reaching an accuracy of up to 99% for PacBio and 98% for ONT, when not using Illumina data. More fragmented assemblies but with an higher accuracy (up to 99.98%) can be achieved when using long reads in conjunction with Illumina reads in hybrid or scaffolding assemblies. Miniasm was the fastest pipeline, requiring only few minutes to run, but because of the lack of a correction or consensus step it generated assemblies with very poor accuracy (~89%). After Miniasm, SMARTdenovo is the pipeline requiring the least resources and in particular the least time for running, <2 h, but providing accuracies close to the highest one achieved without help from Illumina (98.5% for ONT and 99.7% for PacBio). The non-hybrid Celera-based assemblers, PBcR-Self and Canu, generated continuous and accurate assemblies, and appeared to be the most performing pipelines as the read coverage decreased. In our experience, PacBio and ONT platforms provided reads with similar error rates and lengths in the thousands of bases. The average PacBio lengths were generally smaller than the ONT ones, but the ONT throughput per run was significantly lower than the one from PacBio’s runs. This was mainly because we used early MinION flowcells and chemistries: the throughput improved significantly both in size and stability with more recent MinION kits and software. In addition, a fairer throughput comparison should be done between two benchtop devices, e.g. the PacBio RSII and the Oxford Nanopore GridION, or PromethION, both not available at the time of this study. Both platforms provided error-prone reads, but missing homomers in the ONT data represent a major difference with respect to PacBio. Because of this difference, while PacBio reads could be corrected with enough depth to reach assembly accuracies >99.9%, for ONT data increasing the read depth only helped to reach an accuracy up to 98–99%. To improve further a very time consuming polishing step is needed, at least until the missing homomers issue is reduced or solved, possibly already with a new Neural Network-based basecaller, Scrappie, presently in development at Oxford Nanopore.

Methods

Library Preparation

Library preparations for Oxford Nanopore sequencing

Two ug of genomic DNA was sheared to approximately 18,000 bp by centrifugation at 4000 rpm in a gTUBE. Sequencing libraries were prepared according to the SQK-MAP006 or SQK-NSK007 Sequencing Kit protocol, including the NEBNext FFPE DNA repair step.

MinIONTM flow cell preparation and sample loading

The sequencing mix was prepared with 6 uL of the DNA library, water, the Fuel Mix and the running buffer according to the SQK-MAP006 or the SQK-MAP007 protocols. The sequencing mix was added to the R7 or R9 flowcell for a 24–48 hour run. Typically the flowcells were reloaded after 24 hours as data yield had plateaued.

Pacific Biosciences sequencing library preparation

PacBio sequencing libraries were prepared as follows. Five ug of genomic DNA was sheared to approximately 15,000 bp by centrifugation at 5200 rpm in a gTUBE. DNA was repaired with damage repair reagent and end-repaired using end repair mix before ligation to PacBio blunt end adapter. Unligated material was digested with Exo III and Exo VII then library fragments purified via two consecutive Ampure clean-ups and size selection on Blue Pippin (SageScience, Beverley, MA, USA) with a 0.75% agarose cassette to purify fragments from 12–25 kb.

Illumina PCR-free library preparation and sequencing

DNA (1 ug) was sonicated to a 400 to 600 bp size range using a Covaris LE220 acoustic shearing device (Covaris, Woburn, MA, USA). Fragments were end-repaired using the NEBNext EndRepair Module (New England Biolabs, Ipswich, MA, USA) and A-tailed with the NEBNext dA-Tailing Module. Illumina adapters were added using the NEBNext Quick Ligation Module. Ligation products were purified with AMPure XP beads (Beckman Coulter Genomics, Danvers, MA, USA). Libraries were quantified by qPCR using the KAPA Library Quantification Kit for Illumina Libraries (Kapa Biosystems, Wilmington, MA, USA) and library profiles were assessed using a DNA High Sensitivity LabChip kit on an AgilentBioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Libraries were sequenced on an Illumina MiSeq instrument (San Diego, CA, USA) using paired 150 base read chemistry.

Assembly Assessment and Other Tools

The ONT fastq sequencing reads were extracted from the fast5 files using Poretools[26]. For the assembly accuracy and other summary statistics we used dnadiff from MUMmer[27], while the number of misassemblies and the Na50s, i.e. the N50s after breaking the contigs at the misassembly points, were calculated using Quast[22]. We ran Quast with default parameters except for the Miniasm assemblies. For Miniasm we required the additional parameter: –min-identity 85, to enable Quast to align the low accuracy assemblies to the reference. Because of such low accuracies, Quast seemed to overestimate the presence of misassemblies for Miniasm, and reported Na50s smaller than the expected ones if comparing the Na50s with those of the higher accuracy Racon assemblies, based on Miniasm. We used the R Biostrings package (available from https://bioconductor.org/packages/release/bioc/html/Biostrings.html) to count the number of homomers in some of the assemblies, using the function oligonucleotideFrequency. Statistics and assessment for all the assemblies have been estimated after eliminating contigs shorter than 1 kb. Supplementary Note
  22 in total

Review 1.  Coming of age: ten years of next-generation sequencing technologies.

Authors:  Sara Goodwin; John D McPherson; W Richard McCombie
Journal:  Nat Rev Genet       Date:  2016-05-17       Impact factor: 53.242

2.  Phased diploid genome assembly with single-molecule real-time sequencing.

Authors:  Chen-Shan Chin; Paul Peluso; Fritz J Sedlazeck; Maria Nattestad; Gregory T Concepcion; Alicia Clum; Christopher Dunn; Ronan O'Malley; Rosa Figueroa-Balderas; Abraham Morales-Cruz; Grant R Cramer; Massimo Delledonne; Chongyuan Luo; Joseph R Ecker; Dario Cantu; David R Rank; Michael C Schatz
Journal:  Nat Methods       Date:  2016-10-17       Impact factor: 28.547

3.  Assembly of long error-prone reads using de Bruijn graphs.

Authors:  Yu Lin; Jeffrey Yuan; Mikhail Kolmogorov; Max W Shen; Mark Chaisson; Pavel A Pevzner
Journal:  Proc Natl Acad Sci U S A       Date:  2016-12-12       Impact factor: 11.205

4.  Hybrid error correction and de novo assembly of single-molecule sequencing reads.

Authors:  Sergey Koren; Michael C Schatz; Brian P Walenz; Jeffrey Martin; Jason T Howard; Ganeshkumar Ganapathy; Zhong Wang; David A Rasko; W Richard McCombie; Erich D Jarvis
Journal:  Nat Biotechnol       Date:  2012-07-01       Impact factor: 54.908

5.  Poretools: a toolkit for analyzing nanopore sequence data.

Authors:  Nicholas J Loman; Aaron R Quinlan
Journal:  Bioinformatics       Date:  2014-08-20       Impact factor: 6.937

6.  Fast and accurate de novo genome assembly from long uncorrected reads.

Authors:  Robert Vaser; Ivan Sović; Niranjan Nagarajan; Mile Šikić
Journal:  Genome Res       Date:  2017-01-18       Impact factor: 9.043

7.  Scaffolding and completing genome assemblies in real-time with nanopore sequencing.

Authors:  Minh Duc Cao; Son Hoang Nguyen; Devika Ganesamoorthy; Alysha G Elliott; Matthew A Cooper; Lachlan J M Coin
Journal:  Nat Commun       Date:  2017-02-20       Impact factor: 14.919

8.  Rapid metagenomic identification of viral pathogens in clinical samples by real-time nanopore sequencing analysis.

Authors:  Alexander L Greninger; Samia N Naccache; Scot Federman; Guixia Yu; Placide Mbala; Vanessa Bres; Doug Stryke; Jerome Bouquet; Sneha Somasekar; Jeffrey M Linnen; Roger Dodd; Prime Mulembakani; Bradley S Schneider; Jean-Jacques Muyembe-Tamfum; Susan L Stramer; Charles Y Chiu
Journal:  Genome Med       Date:  2015-09-29       Impact factor: 11.117

9.  Nanopore Sequencing as a Rapidly Deployable Ebola Outbreak Tool.

Authors:  Thomas Hoenen; Allison Groseth; Kyle Rosenke; Robert J Fischer; Andreas Hoenen; Seth D Judson; Cynthia Martellaro; Darryl Falzarano; Andrea Marzi; R Burke Squires; Kurt R Wollenberg; Emmie de Wit; Joseph Prescott; David Safronetz; Neeltje van Doremalen; Trenton Bushmaker; Friederike Feldmann; Kristin McNally; Fatorma K Bolay; Barry Fields; Tara Sealy; Mark Rayfield; Stuart T Nichol; Kathryn C Zoon; Moses Massaquoi; Vincent J Munster; Heinz Feldmann
Journal:  Emerg Infect Dis       Date:  2016-02       Impact factor: 6.883

10.  Real-time, portable genome sequencing for Ebola surveillance.

Authors:  Joshua Quick; Nicholas J Loman; Sophie Duraffour; Jared T Simpson; Ettore Severi; Lauren Cowley; Joseph Akoi Bore; Raymond Koundouno; Gytis Dudas; Amy Mikhail; Nobila Ouédraogo; Babak Afrough; Amadou Bah; Jonathan Hj Baum; Beate Becker-Ziaja; Jan-Peter Boettcher; Mar Cabeza-Cabrerizo; Alvaro Camino-Sanchez; Lisa L Carter; Juiliane Doerrbecker; Theresa Enkirch; Isabel Graciela García Dorival; Nicole Hetzelt; Julia Hinzmann; Tobias Holm; Liana Eleni Kafetzopoulou; Michel Koropogui; Abigail Kosgey; Eeva Kuisma; Christopher H Logue; Antonio Mazzarelli; Sarah Meisel; Marc Mertens; Janine Michel; Didier Ngabo; Katja Nitzsche; Elisa Pallash; Livia Victoria Patrono; Jasmine Portmann; Johanna Gabriella Repits; Natasha Yasmin Rickett; Andrea Sachse; Katrin Singethan; Inês Vitoriano; Rahel L Yemanaberhan; Elsa G Zekeng; Racine Trina; Alexander Bello; Amadou Alpha Sall; Ousmane Faye; Oumar Faye; N'Faly Magassouba; Cecelia V Williams; Victoria Amburgey; Linda Winona; Emily Davis; Jon Gerlach; Franck Washington; Vanessa Monteil; Marine Jourdain; Marion Bererd; Alimou Camara; Hermann Somlare; Abdoulaye Camara; Marianne Gerard; Guillaume Bado; Bernard Baillet; Déborah Delaune; Koumpingnin Yacouba Nebie; Abdoulaye Diarra; Yacouba Savane; Raymond Bernard Pallawo; Giovanna Jaramillo Gutierrez; Natacha Milhano; Isabelle Roger; Christopher J Williams; Facinet Yattara; Kuiama Lewandowski; Jamie Taylor; Philip Rachwal; Daniel Turner; Georgios Pollakis; Julian A Hiscox; David A Matthews; Matthew K O'Shea; Andrew McD Johnston; Duncan Wilson; Emma Hutley; Erasmus Smit; Antonino Di Caro; Roman Woelfel; Kilian Stoecker; Erna Fleischmann; Martin Gabriel; Simon A Weller; Lamine Koivogui; Boubacar Diallo; Sakoba Keita; Andrew Rambaut; Pierre Formenty; Stephan Gunther; Miles W Carroll
Journal:  Nature       Date:  2016-02-03       Impact factor: 69.504

View more
  56 in total

1.  Relative Performance of MinION (Oxford Nanopore Technologies) versus Sequel (Pacific Biosciences) Third-Generation Sequencing Instruments in Identification of Agricultural and Forest Fungal Pathogens.

Authors:  Kaire Loit; Kalev Adamson; Mohammad Bahram; Rasmus Puusepp; Sten Anslan; Riinu Kiiker; Rein Drenkhan; Leho Tedersoo
Journal:  Appl Environ Microbiol       Date:  2019-10-16       Impact factor: 4.792

Review 2.  Long-read sequencing in deciphering human genetics to a greater depth.

Authors:  Mohit K Midha; Mengchu Wu; Kuo-Ping Chiu
Journal:  Hum Genet       Date:  2019-09-19       Impact factor: 4.132

3.  Complete Genome of Micromonospora sp. Strain B006 Reveals Biosynthetic Potential of a Lake Michigan Actinomycete.

Authors:  Jana Braesel; Camila M Crnkovic; Kevin J Kunstman; Stefan J Green; Mark Maienschein-Cline; Jimmy Orjala; Brian T Murphy; Alessandra S Eustáquio
Journal:  J Nat Prod       Date:  2018-08-15       Impact factor: 4.050

Review 4.  Nanopore sequencing technology, bioinformatics and applications.

Authors:  Yunhao Wang; Yue Zhao; Audrey Bollas; Yuru Wang; Kin Fai Au
Journal:  Nat Biotechnol       Date:  2021-11-08       Impact factor: 54.908

5.  Long-read sequencing data analysis for yeasts.

Authors:  Jia-Xing Yue; Gianni Liti
Journal:  Nat Protoc       Date:  2018-05-03       Impact factor: 13.491

6.  MetaPlatanus: a metagenome assembler that combines long-range sequence links and species-specific features.

Authors:  Rei Kajitani; Hideki Noguchi; Yasuhiro Gotoh; Yoshitoshi Ogura; Dai Yoshimura; Miki Okuno; Atsushi Toyoda; Tomomi Kuwahara; Tetsuya Hayashi; Takehiko Itoh
Journal:  Nucleic Acids Res       Date:  2021-12-16       Impact factor: 16.971

Review 7.  Into the wild: new yeast genomes from natural environments and new tools for their analysis.

Authors:  D Libkind; D Peris; F A Cubillos; J L Steenwyk; D A Opulente; Q K Langdon; A Rokas; C T Hittinger
Journal:  FEMS Yeast Res       Date:  2020-03-01       Impact factor: 2.796

8.  LazyB: fast and cheap genome assembly.

Authors:  Thomas Gatter; Sarah von Löhneysen; Jörg Fallmann; Polina Drozdova; Tom Hartmann; Peter F Stadler
Journal:  Algorithms Mol Biol       Date:  2021-06-01       Impact factor: 1.405

9.  Linear Peptides-A Combinatorial Innovation in the Venom of Some Modern Spiders.

Authors:  Lucia Kuhn-Nentwig; Heidi E L Lischer; Stano Pekár; Nicolas Langenegger; Maria J Albo; Marco Isaia; Wolfgang Nentwig
Journal:  Front Mol Biosci       Date:  2021-07-06

Review 10.  The long reads ahead: de novo genome assembly using the MinION.

Authors:  Carlos de Lannoy; Dick de Ridder; Judith Risse
Journal:  F1000Res       Date:  2017-07-07
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.