Literature DB >> 35068060

High molecular weight DNA extraction strategies for long-read sequencing of complex metagenomes.

Florian Trigodet¹, Karen Lolans¹, Emily Fogarty², Alon Shaiber³, Hilary G Morrison⁴, Luis Barreiro¹, Bana Jabri¹, A Murat Eren^1,2,3,4.

Abstract

By offering extremely long contiguous characterization of individual DNA molecules, rapidly emerging long-read sequencing strategies offer comprehensive insights into the organization of genetic information in genomes and metagenomes. However, successful long-read sequencing experiments demand high concentrations of highly purified DNA of high molecular weight (HMW), which limits the utility of established DNA extraction kits designed for short-read sequencing. The challenges associated with input DNA quality intensify further when working with complex environmental samples of low microbial biomass, which requires new protocols that are tailored to study metagenomes with long-read sequencing. Here, we use human tongue scrapings to benchmark six HMW DNA extraction strategies that are based on commercially available kits, phenol-chloroform (PC) extraction and agarose encasement followed by agarase digestion. A typical end goal of HMW DNA extractions is to obtain the longest possible reads during sequencing, which is often achieved by PC extractions, as demonstrated in sequencing of cultured cells. Yet our analyses that consider overall read-size distribution, assembly performance and the number of circularized elements found in sequencing results suggest that column-based kits with enzyme supplementation, rather than PC methods, may be more appropriate for long-read sequencing of metagenomes.

Entities: Chemical

Keywords: high-molecular-weight DNA; long-read sequencing; metagenomics; nanopore

Mesh：

Substances：
DNA

Year: 2022 PMID： 35068060 PMCID： PMC9177515 DOI： 10.1111/1755-0998.13588

Source DB: PubMed Journal: Mol Ecol Resour ISSN： 1755-098X Impact factor: 8.678

INTRODUCTION

High‐throughput sequencing of metagenomes offers unprecedented insights into the diversity and gene pool of naturally occurring microbes and viruses that occupy soils (Nesme et al., 2016), marine habitats (Gregory, Zayed, et al., 2019; Sunagawa et al., 2015) and host‐associated environments (Gregory, Zablocki, et al., 2020; Human Microbiome Project Consortium, 2012). The high accuracy and the high throughput of the modern sequencing platforms are afforded by read lengths that typically remain below 250 bases. These relatively short reads pose significant constraints on the data utility, especially in metagenomics (Wommack et al., 2008). By stitching together the short reads that partially overlap, metagenomic assembly can reconstruct orders of magnitude longer contiguous segments of input DNA (Nurk et al., 2017) and enable the recovery of microbial genomes from metagenomes (Tyson et al., 2004). In recent years, this strategy has become a primary tool in microbiology to study the ecology and evolution of naturally occurring microbial populations (Al‐Shayeb et al., 2020; Delmont et al., 2018; Hug et al., 2016; Pasolli et al., 2019; Spang et al., 2015). However, metagenomic assembly is inherently challenging (Ayling et al., 2020) and the assembly of complex environments often leads to highly fragmented assemblies (Olson et al., 2019). These fragmented assemblies increase the likelihood of generating composite genomes that include contigs from multiple distinct populations (Chen et al., 2020), which risk erroneous insights into microbial ecology and evolution (Chen et al., 2020; Shaiber & Eren, 2019). By circumventing the problems associated with short‐read assembly, long‐read sequencing offers a compelling solution to the ideal of reconstructing complete genomes from metagenomes (Driscoll et al., 2017; White et al., 2016). Nanopore sequencing, which resolves the identity and order of nucleotides based on changes in ionic current as a single‐stranded RNA or DNA molecule passes through a tiny pore (Kasianowicz et al., 1996), has rapidly gained popularity among researchers (Wang et al., 2014) thanks to its availability through affordable and easy to operate sequencing devices, such as MinION by Oxford Nanopore Technologies (“The Long View on Sequencing”, 2018). Despite the high error rates and relatively lower sequencing depth, long reads from nanopore sequencing of metagenomes led to key insights from challenging systems (Pessi et al., 2020; Reveillaud et al., 2019) and enabled the recovery of circular, complete genomes from metagenomes (Cusco et al., 2020; Moss et al., 2020; Nicholls et al., 2019; Sanderson et al., 2018; Singleton et al., 2021; Somerville et al., 2019). The efficacy of long‐read sequencing depends heavily on the structural integrity of the input DNA (Schalamun et al., 2019), which poses a new and significant challenge. Commercial DNA extraction kits that emerged during the era of short‐read sequencing typically include steps that physically disrupt cells through mechanical lysis and generate highly fragmented DNA molecules. While these commercial kits improve short‐read sequencing outcomes as they ensure maximum yield and coverage of DNA in a sample, they set a critical limit to the outcomes of long‐read sequencing. Hence, establishing DNA extraction strategies that afford (i) preservation of high‐molecular‐weight (HMW) molecules, (ii) a high degree of sample purity and (iii) increased overall DNA yields have become critical considerations for the successful application of long‐read sequencing. Phenol–chloroform DNA extractions, first popularized by Sambrook et al. (1989), have been making a resurgence as a “go‐to” method for extracting HMW DNA (Maghini et al., 2020; Quick & Loman, 2018). While recent studies have used this approach to recover ultralong DNA fragments (e.g., >100 kb) from cultured organisms (Cicha et al., 2020; Hosoe et al., 2020; Kinoshita et al., 2020; Takeshita et al., 2020; Tippelt et al., 2020), the utility of phenol–chloroform extractions for metagenomics is not yet clear. In parallel, long‐read sequencing surveys of metagenomes have largely focused on high‐microbial‐biomass samples including human stool (Moss et al., 2020) or activated sludge (Singleton et al., 2021), and best practices to study metagenomes of lower biomass samples are yet to emerge. Increasing the breadth of long‐read sequencing requires DNA extraction protocols that can both produce long reads and can scale to a range of systems, including those that are associated with low microbial biomass. Here we designed six DNA extraction protocols and examined their relative effectiveness to extract HMW DNA in terms of total yield, concentration, purity, integrity and applicability for subsequent long‐read sequencing. Using an Oxford Nanopore MinION sequencer, we benchmarked our protocols based on the extent of host contamination, fragment size distribution, microbial taxonomy and metagenomic assembly outcomes across multiple sequencing runs, and compared taxonomic insights that emerged from long‐read sequencing to those from Illumina short‐read sequencing. We chose the human oral cavity as a model system, since it offers a challenging environment to benchmark DNA extraction strategies as it is home to complex microbial communities (Dewhirst et al., 2010) with relatively low biomass (Duran‐Pinedo & Frias‐Lopez, 2015) and is typically mixed with eukaryotic host DNA (Marotz et al., 2018).

MATERIALS AND METHODS

Tongue dorsum sample collection

A single healthy individual self‐collected scraping of their tongue dorsum on 13 separate occasions (one per day) over the course of 3 weeks. We used BreathRx Gentle Tongue Scrapers (Philips Sonicare) for sample collection, which was performed prior to eating, drinking or performing oral hygiene. Starting as far back as possible on the tongue, the scraper was passed forward over the entire surface three sequential times. We transferred the collected material to 520 µl of PBS (phosphate‐buffered saline) and immediately stored at −20°C until processing. The pooled 13 tongue dorsum samples were used for HMW DNA extraction comparisons, while another self‐collected tongue dorsum sample (named TD) from this same individual was obtained 2 weeks later for short‐read sequencing (amplicon and metagenomics).

DNA extraction methods

We compared six DNA extraction methods tailored for HMW DNA recovery. We included both commercially available kits and nonkit methods used in the published literature which incorporated different combinations of cell lysis mechanisms and DNA purification methodologies. To facilitate direct comparison between all extraction methods, we thawed and pooled the 13 samples immediately prior to DNA extraction. After homogenizing the pooled sample by vortexing for 15 s, we used a 500‐µl aliquot as the starting material for each extraction method, which was performed in duplicate. We resuspended the isolated genomic DNA from each method in a final 100 µl volume in a 1.5‐ml LoBind microfuge tube (Eppendorf). We sought to maximize read lengths by implementing best practices for handling HMW DNA throughout all the methods. We eliminated vortexing and mixing by pipetting, when possible, in favour of end‐over‐end tube rotation to minimize velocity gradients. We used wide‐bore pipette tips with gentle pipetting to reduce DNA breakage and avoided unnecessary freeze–thaw cycles by storing DNA at 4°C until sequence analysis.

DNeasy PowerSoil isolation kit with modified bead beating (PB)

DNeasy PowerSoil DNA isolation kits (Qiagen) are commonly used in metagenomics (Human Microbiome Project Consortium, 2012; Shaiber et al., 2020) to extract high‐quality DNA from environmental matrices. We sought to determine its compatibility with nanopore sequencing protocols by amending the PowerSoil DNA isolation protocol in two ways. In our first modification to the DNeasy PowerSoil DNA Isolation kit, we incorporated a modified bead beating step (Edwards et al., 2019), as a way to minimize velocity gradients and reduce DNA shearing, therefore improving fragment length. We transferred 500 µl of the pooled sample to the kit provided with a PowerBead tube, which we inserted flat into IKA Works MS2S8 Minishaker for Bioanalyzer DNA chips. Samples were agitated for 10 min (in 1 min pulse increments) at 2400 rpm (Edwards et al., 2019). We then followed the remainder of the manufacturer's instructions for DNA isolation and purification.

DNeasy PowerSoil Isolation kit supplemented with an enzymatic treatment (PE)

In the second modification to the DNeasy PowerSoil DNA Isolation kit, we fully replaced the use of mechanical cell lysis (or bead beating) with a heated, enzymatic treatment step (Yuan et al., 2012). We added 500 µl of the pooled sample to the kits’ PowerBeads tubes including the tube's solution (but lacking the beads). We added a lytic cocktail to facilitate cell lysis: 125 µl lysozyme (10 mg ml–1, Sigma‐Aldrich), 37.5 µl mutanolysin (10 KU ml–1, Sigma‐Aldrich), 7.5 µl lysostaphin (4000 U ml–1, Sigma‐Aldrich) and 5 µl RNase A (10 mg ml–1, Sigma‐Aldrich). We incubated for 1 h at 37°C. Then we added 50 µl of proteinase K (20 mg ml–1, Sigma‐Aldrich) alongside kit solution C1, and followed by incubation for 30 min at 56°C. After centrifugation of the tubes at 10,000 g for 30 s at room temperature (step 6), we continued the rest of the isolation protocol as described by the manufacturer.

DNeasy UltraClean Microbial Kit (UC)

We extracted genomic DNA using the DNeasy UltraClean Microbial Kit (Qiagen) and replaced its bead beating step with the manufacturer's alternative lysis procedure to reduce shearing of the DNA. This commercial kit was evaluated as a result of a direct recommendation by Oxford Nanopore. We added 500 µl of pooled sample to 300 µl of PowerBead solution. After the addition of 50 µl Solution SL, we incubated the sample for 10 min at 65°C. After centrifuging the tubes at 10,000 g for 30 s at room temperature (step 5), we continued the rest of the isolation protocol as described by the manufacturer.

Qiagen Genomic Tip 20/G supplemented with an enzymatic treatment (GT)

We extracted DNA using the Qiagen Genomic Tip 20/G (Qiagen) and followed the manufacturer's protocol for “Preparation Gram‐negative and some Gram‐positive Bacterial Samples.” We centrifuged 500 µl of pooled sample for 10 min at 10,000 g and resuspended the pellet in Buffer B1 supplemented with 20 µl DNAse‐free RNAse (10 mg ml–1, Sigma‐Aldrich), 45 µl Proteinase K (20 mg ml–1, Sigma‐Aldrich) and 20 µl lysozyme (100 mg ml–1, Sigma‐Aldrich), as outlined. We modified the lytic cocktail to include 9 µl lysostaphin (4000 U ml–1, Sigma‐Aldrich) and 45 µl mutanolysin (10 KU ml–1 Sigma‐Aldrich) in order to improve the lysis potential in Gram‐positive bacteria. After incubation for 2 h at 37°C, we added Buffer B2 and extended the incubation at 50°C to 90 min. As the lysate had not cleared after this initial period, we extended the incubation for an additional 2 h. We removed any remaining particulate matter by centrifugation at 5000 g for 10 min as recommended by the manufacturer. Finally, we used the columns in the kit according to the described isolation protocol.

Phenol/chloroform extraction (PC)

We extracted DNA using a phenol/chloroform extraction protocol modified from Chapter 6, protocol 1 of Sambrook et al. (1989). For SDS cell lysis, we added 500 µl of pooled sample to 10 ml of TLB (10 mm Tris‐HCL pH 8.0, 25 mm EDTA pH 8.0, 100 mm NaCl, 0.5% [w/v] SDS, 20 µg ml–1 RNase A) and vortexed at full speed for 5 s followed by an incubation at 37°C for 1 h. Proteinase K (Qiagen) was added to a final concentration of 200 µg ml–1 and we mixed the sample by slow inversion three times, followed by 2 h at 50°C with gentle mixing every 30 min. We purified the lysate with 10 ml buffer saturated phenol using phase‐lock gel falcon tubes, followed by phenol/chloroform‐isoamyl alcohol (1:1). We precipitated the DNA by adding 4 ml of 5 m ammonium acetate and 30 ml ice‐cold ethanol. We recovered DNA by one of the following methods: a glass hook followed by washing twice in 70% ethanol (if a DNA mass was visible) or centrifugation at 4500 g for 10 min followed by washing twice in 70% ethanol (if no DNA mass was visible). After spinning down at 10,000 g, we removed ethanol by drying at ambient temperature for 10 min. We added 100 µl EB (elution buffer, 10 mm Tris‐HCl, pH 8.5) to the DNA and left it at 4°C overnight to resuspend the pellet.

Agarose plug encasement and extraction (AE)

From the 500‐µl pooled sample aliquot, we pelleted the cells by centrifugation at 10,000 g for 10 min and removed the supernatant. We performed DNA extraction on the pellet using a pulsed‐field gel electrophoresis (PFGE)‐based agarose encasement extraction protocol (Matushek et al., 1996). The pelleted cells were resuspended in 300 µl 2× lysis solution (12.5 mm Tris‐HCl pH 7.6, 2 m NaCl, 20 mm EDTA pH 8.0, 1% [w/v] Brij 58, 1% [w/v] deoxycholate, 1% [w/v] sodium lauroyl sarcosine) to which we added lysozyme (1 mg ml–1) and DNAse‐free RNAse (30 µg ml–1) on the day of the experiment. We combined the entire suspension with 300 µl of molten 1.6% low‐melting point (LMP) agarose (TopVision, Thermo Fisher Scientific) which we pipetted into a plug mould and allowed to solidify. We placed each plug into 3 ml of 1× lysis solution containing: lysozyme, 0.5 mg ml–1; DNAse‐free RNAse, 100 µg ml–1; lysostaphin, 50 U ml–1; and mutanolysin, 0.3 KU ml–1. We added the enzymes on the day of the experiment and incubated the plugs overnight at 37°C with gentle shaking. After incubation, we replaced the lysis solution sequentially with ESP (10 mm Tris‐HCl pH 7.6, 1 mm EDTA) to which we added proteinase K (Sigma) at a final concentration of 100 µg ml–1 and 1% sodium dodecyl sulfate at 50°C for 1 h; and then two rounds of washing in sterile, dilute TE (10 mm Tris‐HCl pH 7.6, 0.1 mm EDTA) first at 50°C for 1 h and then at 35°C for 30 min with gentle shaking. We transferred the plug to 5 ml of fresh dilute TE in a clean tube for storage until we performed a β‐agarase I digestion of the agarose and isopropanol DNA precipitation according to the manufacturer's instructions (New England Biolabs). We resuspended the precipitated DNA in 100 µl EB (10 mm Tris‐HCl, pH 8.5).

Determination of DNA yield, purity metrics and fragment size distribution

For each extraction, we quantified the DNA yield on a Qubit 1.0 Fluorometer (Thermo Scientific), using the dsDNA HS (high sensitivity, 0.2–100 ng) Assay kit according to the manufacturer's protocols; a sample volume of 1 µl was added to 199 µl of a Qubit working solution. We assessed the purity of the extracted nucleic acids with the A260/280 and A260/230 absorbance ratios obtained using a NanoDrop spectrophotometer (NanoDrop Technologies). We assessed the DNA fragment size distribution by electrophoresis (90 V for 1.5 h) of genomic DNA (22 or 44 ng, as available) on a 0.8% (w/v) agarose gel followed by staining with ethidium bromide and UV light visualization. We used λ‐HindIII DNA size standards to estimate fragment sizes.

MinION library preparation and sequencing

Upon receipt, and again immediately prior to sequencing, we measured the flowcell pore counts using the Platform QC script (minknow). We stored the flowcells in their original packaging, which we resealed with parafilm and tape, at 4°C until use. Unless stated otherwise, we omitted size selection and the optional shearing step to allow us to evaluate the full distribution of fragment sizes produced by each method. We performed the library preparation using the Ligation Sequencing Kit (SQK‐LSK108) and Native Barcoding Kit (EXP‐NBD103) for genomic DNA, according to the standard 1D Native barcoding protocol provided by the manufacturer (Oxford Nanopore) (version: NBE_9006_v103_rev2_21DEC16) unless indicated. We carried out all reactions at room temperature. We followed the input DNA mass recommendation of 1.0 µg genomic DNA (gDNA) and performed DNA repair (NEBNext FFPE DNA Repair Mix, NEB M6630) to maximize read length. For the 1.0× AMPure clean‐up step, we used gentle rotation and an extended elution time (minimum 20 min) to assist in the release of long DNA fragments from the beads. We quantified 1‐µl aliquots by fluorometry (Qubit) after each phase of the library preparation (i.e., damage repair, end preparation, barcoding, pooling and adaptor ligation) to quantify DNA recovery, and identified substantial DNA loss after each step (Table 1).

TABLE 1

Summary of DNA concentration and quality metrics. Technical replicates are denoted as “XX_1” and “XX_2”

		Extraction methodology
		PB		PE		UC		GT		PC		AE
Method	Metric	PB_1	PB_2	PE_1	PE_2	UC_1	UC_2	GT_1	GT_2	PC_1	PC_2	AE_1	AE_2
Qubit	DNA concentration (µg ml^–1)	2.34	1.71	4.08	4.21	30.7	36.2	110	110	75.8	77.9	90.6	95.5
Nanodrop	A260/A280 ^a	6.01	5.53	1.75	2.27	1.95	1.74	1.74	1.82	1.97	1.95	1.8	1.76
Nanodrop	A260/A230 ^b	0.26	0.16	0.33	0.08	1.98	2.02	1.86	2.27	2.07	2.07	1.56	1.36

Primary measure of nucleic acid purity—expected value for “pure” DNA: ~1.8 while >2 indicates RNA contamination.

Secondary measure of nucleic acid purity—evaluates residual chemical contamination (phenol, guanidine HCl, carbohydrate carryover). Expected values are in the range of 1.8–2.2. Values significantly lower indicate chemical contamination and are undesired.

Summary of DNA concentration and quality metrics. Technical replicates are denoted as “XX_1” and “XX_2” Primary measure of nucleic acid purity—expected value for “pure” DNA: ~1.8 while >2 indicates RNA contamination. Secondary measure of nucleic acid purity—evaluates residual chemical contamination (phenol, guanidine HCl, carbohydrate carryover). Expected values are in the range of 1.8–2.2. Values significantly lower indicate chemical contamination and are undesired. We used two sequencing runs (three samples multiplexed on each flowcell) for the sequencing. We loaded the completed libraries onto R9.4 flowcells as per instructions from ONT and scheduled sequencing runs for 48 h. Sequencing continued until time expired or until pore exhaustion (defined by <10 functional pores).

BluePippin size selection

To quantify the magnitude of its impact, if incorporated into a long‐read sequencing workflow, we performed size selection using DNA from GT_1 and GT_2 with the BluePippin (Sage Science) system with 0.75% dye‐free agarose cassettes and marker S1. We selected fragments >6 kb in high‐pass collection mode (an approach enabling the collection of DNA fragments above a user‐defined size). Our 3.3‐µg sample DNA input was less than the recommended 5 µg; however, this input was maximized given sample DNA concentrations and loading volume constraints.

Long‐read sequence analysis

We uploaded the raw MinION FAST5 files produced with the minknow software (versions 1.15.4 to 3.1.19) to our cluster to perform the base‐calling and demultiplexing with guppy version 2.3.1. FastQ files were generated only for reads meeting a minimum quality threshold (quality score of 7). We used minimap2 version 2.14 (Li, 2018) and samtools version 1.9 (Li et al., 2009) to remove sequences that mapped to the human genome build 38 (GRCh38) from NCBI and estimate the amount of host contamination. We used anvi’o version 6.2 (2021) and the contig snakemake workflow to compute the sequence metrics (Köster & Rahmann, 2012). Briefly, the workflow created a contigs database with “anvi‐gen‐contigs‐database,” which used prodigal version 2.6.3 (Hyatt et al., 2010) with the metagenome mode to identify open reading frames. It used “anvi‐run‐hmm” to detect the single‐copy core genes from bacteria (n = 71, modified from Lee, 2019), archaea (n = 76, Lee, 2019), eukarya (n = 83, http://merenlab.org/delmont‐euk‐scgs) and ribosomal RNAs (rRNAs) (n = 12, modified from https://github.com/tseemann/barrnap). We used “anvi‐display‐contigs‐stats” to obtain the number of sequences, total length, N50, longest sequence, number of genes, number of single‐copy core genes and ribosomal genes. We used blast against the NCBI’s nr/nt database to get the best taxonomy, percentage identity and query alignment for each longest read per extraction method. We used “anvi‐get‐sequences‐for‐hmm‐hits” to recover 16S rRNA genes for each condition and used the Human Oral Microbiome Database (HOMD) online blast tool for taxonomic assignment. We assembled the long reads with flye and the metagenomic option (Kolmogorov et al., 2020), which takes into account the uneven coverage nature of metagenomes. We used short‐reads generated using HMW DNA extraction sample PB_2 (see method below) to polish the assemblies using pilon (Walker et al., 2014). We created anvi’o contigs databases, as described above, with the polished flye’s contigs to summarize the assembly metrics.

16S rRNA gene amplicon DNA extraction, library preparation, sequencing and analysis

For the 16S rRNA gene amplicon sequencing, we used a sample (TD), which was collected from the same individual 2 weeks after the pooled samples that were used for HMW DNA extraction. We performed sample DNA extraction using the DNeasy Powersoil kit (Qiagen) following the manufacturer's protocol. We amplified the V4–V5 hypervariable regions of the bacterial small subunit (SSU) rRNA gene using degenerate primers: 518F (CCAGCAGCYGCGGTAAN) and 926R (CCGTCAATTCNTTTRAGT, CCGTCAATTTCTTTGAGT, and CCGTCTATTCCTTTGANT). Amplification was done with fusion primers containing the 16S‐only sequences fused to Illumina adapters. The forward primers included a 5‐nt multiplexing barcode and the reverse primers a 6‐nt index. We generated PCR (polymerase chain reaction) amplicons in triplicate 33‐µl reaction volumes with an amplification cocktail containing 0.67 U SuperFi Taq Polymerase (Invitrogen), 1× enzyme buffer (includes MgCl2), 200 µm dNTP mix (ThermoFisher), and 0.3 µm of each primer. We added ~10–25 ng template DNA to each PCR and ran a no‐template control for each primer pair. Amplification conditions were: initial 94°C, 3‐min denaturation step; 30 cycles of 94°C for 30 s, 57°C for 45 s, and 72°C for 60 s; final 2‐min extension at 72°C. The triplicate PCRs were pooled after amplification, visualized with the negative controls on a Caliper LabChipGX, and purified using Ampure followed by PicoGreen quantification and Ampure size selection. Libraries were sequenced on an Illumina Miseq 250‐cycle paired‐end run. We used illumina‐utils version 2.7 (Eren et al., 2015) for the quality filtering, following the recommendations of Minoche et al. (2011), and the merging of the paired‐end reads. We used vsearch to remove chimeric sequences (Rognes et al., 2016) and Minimum Entropy Decomposition (MED) (Eren et al., 2015) to cluster the merged reads into oligotypes. We assigned taxonomy using dada2 (Callahan et al., 2016) and the Silva version 132 nonredundant database (Quast et al., 2013).

Short‐read metagenomic library preparation, sequencing and analysis

For the short‐read metagenomic sequencing, we used both the sample TD collected 2 weeks after the initial sampling, and the HMW DNA extraction sample PB_2 (the closest methodology to gold standard short‐read sequencing extraction methodology). Sample DNA concentrations, determined via a PicoGreen assay, were 67 ng µl–1 (TD) and 0.75 ng µl–1 (PB_2). We used 100 and 28 ng, respectively, for library construction. DNA was sheared to ~400 bp using the Covaris S2 acoustic platform and libraries were constructed using the Nugen Ovation Ultralow kit. Each required an amplification step: seven cycles (TD) or 11 cycles (PB_2). The products were visualized on an Agilent Tapestation 4200 and size‐selected to an average of 482 bp using BluePippin (Sage Biosciences). The final library pool was quantified with the Kapa Biosystems qPCR protocol and sequenced on the Illumina NextSeq500 in a 2 × 150 paired‐end sequencing run using dedicated read indexing. We used anvi’o version 6.2 and the metagenomics snakemake workflow for the assembly and analysis of the short reads. Briefly, the workflow uses illumina‐utils for quality filtering followed by a metagenomics assembly with idba‐ud version 1.1.3 (Peng et al., 2012) and generates contig databases as described in the long‐read sequence analysis section. The workflow also uses bowtie2 version 2.3.5.1 (Langmead & Salzberg, 2012) to map short reads on the assembled contigs and samtools (Li et al., 2009) to sort, index and convert sam files into bam files used by anvi’o to generate profiles databases. To compute taxonomic profiles of metagenomes we used the anvi’o program “anvi‐estimate‐scg‐taxonomy,” which aligns ribosomal proteins found in a metagenomic assembly to those that are found in reference genomes from the Genome Taxonomy Database (GTDB) (Parks et al., 2020).

Visualization

We generated the figures in R with ggplot2 (Wickham, 2009) and modified them with inkscape.

RESULTS

Our benchmarks here include four DNA extraction protocols based on three commercially available Qiagen DNA extraction kits (each incorporated modifications to their cell lysis procedures): the Qiagen DNeasy PowerSoil with a modified bead beating step (PB) or with bead beating replaced by enzymatic cell lysis (PE); the Qiagen DNeasy UltraClean Microbial Kit (UC) using the manufacturer's alternative lysis procedure to reduce DNA shearing; and the Qiagen Genomic Tip 20/G extraction kit (GT) augmented with additional enzymatic cell lysis. The remaining two protocols included in our study are a phenol–chloroform protocol (PC), and a pulsed‐field gel electrophoresis (PFGE)‐based agarose encasement extraction protocol (AE) followed by agarase digestion. Throughout the text we refer to these protocols as PB, PE, UC, GT, PC and AE, and we denote technical replicates as “XX_1” or “XX_2.” To benchmark the protocols that are detailed in the Methods section, we used a tongue dorsum sample pooled together from 13 samples collected from the same individual. An additional sample (TD) collected from the same individual has been used to compare taxonomic composition between short and long‐read sequencing.

Yield and quality metrics of isolated DNA vary between protocols

To ascertain their suitability for long‐read sequence analysis, we first analysed the quantity and quality of DNA isolated from each extraction protocol using both fluorometric (Qubit) and spectrophotometric (Nanodrop) methods (Table 1). The fluorescent dye used by Qubit binds specifically to its target molecule, dsDNA, and provides the most accurate DNA quantification. DNA concentrations were comparable between technical replicates for all methods, and three extraction protocols (GT, PC and AE) were distinguished from the others with concentrations >75 µg ml–1. These concentrations were comparable to previously published HMW DNA extractions from oral samples (Yahara et al., 2021). GT exhibited the highest sample concentrations with a mean of 110 µg ml–1. In comparison, the mean concentrations from PB, PE and UC were far less at 2.03, 4.15 and 33.45 µg ml–1, respectively. The MinION sequencing protocol for the Ligation Sequencing Kit (SQK‐LSK108) advises using 1–1.5 µg DNA. As we resuspended DNA into a final volume of 100 µl, only UC, GT, PC and AE had sufficient DNA yield to meet the recommended input. We then considered the absorption spectra to assess sample purity and identify potential non‐nucleic acid contamination. The A260/A280 ratio indicates DNA purity with expected values around 1.8 for pure DNA. All extraction methods fell within the desired range (1.74–1.97), except for both PB replicates and one replicate of PE; however, samples with concentrations approaching the lower limit of 2 µg ml–1 may result in unacceptable A260/A280 ratios. The A260/A230 ratio signifies possible residual chemical contamination such as EDTA, phenol, guanidine salts (often used in column‐based kits) or carbohydrate carryover. UC, GT and PC had ratios close to the expected value of 2. Overall, GT had the best combined yield and purity metrics; PC and AE were deemed suitable alternatives. The purity metrics and congruence of Qubit and Nanodrop concentrations in UC were also desirable, yet UC’s ~2–3‐fold lower DNA yield proved less appealing. We ran an agarose gel electrophoresis to visually assess the crude DNA fragment size distribution (Figure 1). All DNA samples migrated predominantly as a single HMW DNA band that aligned with (or was larger) than the reference 23.1‐kbp fragment of HindIII‐digested lambda DNA. A light smear, visible to 2.0 kb, was observed in GT, indicating the presence of smaller fragments. Recovery of larger DNA fragments by PC was denoted by the predominant band running higher than the 23.1‐kbp fragment.

FIGURE 1

Agarose gel electrophoresis of genomic DNA isolated from a pool of tongue dorsum samples. Genomic DNA was electrophoresed on a 0.8% (w/v) agarose gel. PB, PE and UC with replicates (22 ng input, left panel) and GT, PC and AE replicates (44 ng input, right panel) are shown. Different DNA inputs were used based on overall sample availability. λ‐HindIII, Lambda DNA, digested with the restriction endonuclease HindIII, was used to assess fragment size distribution. PB, DNeasy PowerSoil with modified bead beating; PE, DNeasy PowerSoil with enzymatic treatment; UC, DNeasy UltraClean Microbial Kit; GT, Qiagen Genomic Tip 20/G with enzymatic treatment; PC, phenol–chloroform; AE, agarose encasement. The designations “_1” and “_2” indicate replicate 1 and replicate 2, respectively Low DNA yield can result in sample loss during library preparation or low pore occupancy on MinION during sequencing. While working with samples of low DNA yield is inevitable, spiking in known DNA (such as lambda DNA) to “pad” samples with low DNA yields may be used to start sequencing, as we demonstrated previously (Reveillaud et al., 2019). However, to minimize the need for additional DNA to “pad” samples with low DNA yield due to the extraction protocol, we eliminated protocols PB, PE and UC from any further evaluation as they consistently resulted in mediocre DNA yield. Next, we sequenced the DNA from GT, PC and AE using two MinION flow cells to ensure the replicates of the same protocol were sequenced on different runs (Run 1: GT_1, PC_1 and AE_1; Run 2: GT_2, PC_2 and AE_2). The sequencing runs generated 4.84 and 7.79 Gbp, respectively, which were within the expected range of MinION sequencing output (Cusco et al., 2020; Moss et al., 2020). The increase in the output in the second run could be attributed to less sample loss during library preparation and subsequent input of three times more DNA (142.8 vs. 466.2 ng) into the flow cell. After performing a quality filtering step (using a minimum Q‐score of 7), the percentage of reads passing the quality check (Pass_Reads) was similar between extraction methods within each flow cell (96% and 93%–94% respectively, Table S1a). However, a comparison between runs demonstrated that a higher percentage of sequences were removed (Fail_Reads) in the second run (Table S1a). The total number of nucleotides was comparable between GT and PC within sequencing runs (1,699,213,259 and 1,641,389,967 on average, respectively), and was much smaller for AE (949,615,901 on average) in both runs.

Eukaryotic contamination is enriched in the pool of shorter fragments

Samples from the human oral cavity, prepared for metagenomic sequencing, are typically associated with extensive eukaryotic contamination, which can account for up to 45% of the short‐read sequencing product (Shaiber et al., 2020). Therefore, we assessed the amount of host contamination in each DNA sample by mapping reads to a reference human genome from the NCBI (GRCh38). Host DNA contamination was high for all protocols. AE had the least amount of human DNA (on average, 63%) compared to GT (on average, 75%) and PC (on average, 81%) (Table 2). This trend persisted when comparing their cumulative sequence lengths, although our analysis of read length distribution showed that the human reads were predominantly composed of shorter fragments (Figure 2, insets). The increased representation of human DNA in PC compared to other extraction methods is probably due to the use of detergents vs. enzymes. Lytic detergents exert their effect on both bacterial and eukaryotic cells, while lytic enzymes target bacterial cells only. PC lacked lytic enzymes and included SDS, a stronger lytic detergent than those used in GT (Tween 20, Triton X‐100) and AE (SLS, Brij, deoxycholate).

TABLE 2

The impact of HMW DNA extraction protocol on proportional read numbers and sequence lengths according to read type (microbial vs. human)

	Qiagen Genomic Tip (GT)		Phenol–chloroform (PC)		Agarose encasement (AE)
	GT_1	GT_2	PC_1	PC_2	AE_1	AE_2
All reads
Number of reads	2,052,300	3,681,083	2,008,795	4,338,575	739,186	962,065
Sequencing yield (Gbp)	1.440	1.959	1.311	1.972	0.767	1.133
Human reads
Number of reads	1,584,229	2,674,013	1,664,447	3,433,184	490,377	585,733
%	77.19	72.64	82.86	79.13	66.34	60.88
Sequencing yield (Gbp)	0.908	1.199	1.026	1.536	0.405	0.463
%	63.07	61.23	78.26	77.88	52.79	40.91
Microbial reads
Number of reads	468,071	1,007,070	344,348	905,391	248,809	376,332
%	22.81	27.36	17.14	20.87	33.66	39.12
Sequencing yield (Gbp)	0.532	0.759	0.285	0.436	0.362	0.669
%	36.93	38.77	21.74	22.12	47.21	59.09

FIGURE 2

The impact of DNA extraction protocol on the distribution of human (lighter colour) and microbial (darker colour) read lengths from MinION sequencing. These histograms visualize the total accumulative length (total number of nucleotides sequenced) per range of individual read lengths. The x‐axis represents the read length on a log scale and the y‐axis represents the cumulative length for a given size bin (bar width). The main panel shows the size distribution of reads >2500 bp for GT (green), PC (yellow) and AE (blue) while the inset panel shows the size distribution of all reads, using the same data. Results are outlined vertically by extraction method (replicate 1, top panel; replicate 2, lower panel)

The impact of HMW DNA extraction protocol on proportional read numbers and sequence lengths according to read type (microbial vs. human) The impact of DNA extraction protocol on the distribution of human (lighter colour) and microbial (darker colour) read lengths from MinION sequencing. These histograms visualize the total accumulative length (total number of nucleotides sequenced) per range of individual read lengths. The x‐axis represents the read length on a log scale and the y‐axis represents the cumulative length for a given size bin (bar width). The main panel shows the size distribution of reads >2500 bp for GT (green), PC (yellow) and AE (blue) while the inset panel shows the size distribution of all reads, using the same data. Results are outlined vertically by extraction method (replicate 1, top panel; replicate 2, lower panel)

Read size distribution is not uniform across extraction methods

After the removal of sequences that match the human genome, we assumed that the vast majority of the sequences were of microbial origin. We further focused on sequences that were longer than 2500 nucleotides, and quantified the number of reads and their length distribution across GT, PC and AE (Table 3). Overall, we observed a size distribution that was coherent with other long‐read metagenomics studies, with reads reaching 100,000 bp (Moss et al., 2020; Somerville et al., 2019).

TABLE 3

Microbial read size distribution. All percentages are relative to the total reads (or sequencing yield) of the quality filtered reads, prior to removal of human reads

	Qiagen Genomic Tip (GT)		Phenol–chloroform (PC)		Agarose encasement (AE)
	GT_1	GT_2	PC_1	PC_2	AE_1	AE_2
All reads
Number of reads	2,052,300	3,681,083	2,008,795	4,338,575	739,186	962,065
Sequencing yield (Gbp)	1.440	1.959	1.311	1.972	0.767	1.133
All microbial reads
Number of reads	468,071	1,007,070	344,348	905,391	248,809	376,332
%	22.81	27.36	17.14	20.87	33.66	39.12
Sequencing yield (Gbp)	0.532	0.759	0.285	0.436	0.362	0.669
%	36.93	38.77	21.74	22.12	47.21	59.09
N50	2810	1929	1116	449	2524	3649
L50	38,513	62,126	34,874	155,142	38,768	52,182
Median length (bp)	468	326	427	307	782	837
Longest microbial reads (bp)	73,029	90,424 ^a	163,320	180,460	68,189	116,730
The top hit for the longest microbial read on NCBI’s nr database (identity/alignment)	Streptococcus salivarius (93.6%/99%)	Streptococcus salivarius (92%/98%) ^a	Streptococcus sp. (89.9%/81%)	Veillonella dispar (88.7%/69%)	Veillonella nakazawae (90.2%/89%)	Prevotella histicola (92.5%/96%)
Microbial reads >2.5 kb
Number of reads	43,173	49,572	13,752	10,182	39,270	82,221
%	2.10	1.35	0.68	0.23	5.31	8.55
Sequencing yield (Gbp)	0.278	0.352	0.109	0.117	0.182	0.426
%	19.32	17.98	8.29	5.92	23.77	37.58

Showing the second longest read as the first longest read (92,515 bp) had no hits on NCBI.

Microbial read size distribution. All percentages are relative to the total reads (or sequencing yield) of the quality filtered reads, prior to removal of human reads Showing the second longest read as the first longest read (92,515 bp) had no hits on NCBI. Our comparison of the five longest reads per replicate revealed that PC produced the longest fragments (130,355–180,460 bp) while GT ranged from 68,275 to 92,515 bp and AE had a single 116,730‐bp read followed by significantly shorter reads (28,218–68,189 bp) (Table 3). The replicates of PC yielded 24 and 38 reads that were over 100,000 bp (Table S1b), but this extraction method was also associated with the smallest N50 score due to the large fraction of shorter reads (Figure 2, insets). The contribution of reads above 2500 bp to the sequencing yield (total number of nucleotides) was greater in GT (mean, 18.65%) and AE (mean, 30.68%), while dropping to 7.11% for PC (Table 3, Figure 2). AE had a few short reads contributing to the sequencing yield, but also lacked long reads as there were only 40 and 186 reads above 20 kbp in the two runs (Table S1b). We find it surprising that AE, an extraction strategy that can yield over a million base pair DNA fragments (Anand, 1986), produced only four reads that were longer than 50 kbp when we used identical DNA mass inputs for all extractions. We speculate that, in the absence of a fragmentation step, the very long fragments AE might have produced may have been lost during library preparation steps or become stuck in the pores during sequencing.

Size selection has limited utility and leads to substantial loss of biomass

Despite adding the nanopore recommended 1× volume of solid‐phase reversible immobilization (SPRI) beads to reaction mixtures to remove small fragments during library preparation steps, there was a large contribution of shorter reads (<2500 bp) in our initial sequencing effort. Thus, we evaluated additional options to reduce their numbers and increase the representation of longer fragments. We sought to determine the effect of using the BluePippin system on fragment size metrics by utilizing the “high‐pass collection mode,” which enables the collection of DNA molecules above a certain user‐defined size. To maximize sequences that might contain rRNA genes, which are particularly useful for taxonomic assignments (Camanocha & Dewhirst, 2014), we chose 6 kbp as our minimum threshold value. Due to its ample material availability, we used replicates of GT to compare the size‐selected sequencing metrics with data from the previous untreated sequencing runs. High sample loss is a known drawback of BluePippin high‐pass size filtering as the manufacturer warns to expect a loss between 20% and 50% of the sample input. However, we were able to recover only 16%–18% of the 3300 ng of input DNA for each replicate after the size selection step, in agreement with low recovery rates (25%–35%) also reported by others (Schalamun et al., 2019). Even though we started the size selection step with more than three times the amount of DNA than was used for the untreated workflow, our recovery post‐size selection was 550–600 ng, meaning that our sample input into the start of library preparation was half of the DNA input that went into the untreated sequencing run. Strikingly, the number of reads and sequencing yields were reduced by 80%–90% compared to the untreated sample (Table 4; Table S1c), probably as a consequence of the reduced sample input. Other notable shifts included a reduced proportion of human contamination and an increased N50 (Table 4). The number of microbial reads above 2.5 kbp and their cumulative length were comparable for both approaches, which is quite impressive given the low amount of input DNA in the size‐selected samples. However, this parallel did not persist when evaluating longer DNA fragments as the cumulative length of microbial reads above 20 kbp was greater for nonsize‐selected samples. Propelled by vastly reduced read numbers (and despite the superior N50), the size selection step did not result in the substantial improvements we hypothesized in overall read lengths and the cumulative nucleotide sequences. Considering the additional demands on (i) sample requirements, (ii) reagent/personnel costs, and (ii) sample handling and processing times, we consider size selection of this type to have limited utility in this context.

TABLE 4

Comparison of sequencing run read metrics between untreated and BluePippin size‐selected samples. All percentages are relative to the total reads (or sequencing yield) of the quality filtered reads

	Untreated		BluePippin high‐pass size selection
	GT_1	GT_2	GT_1 SS	GT_2 SS
All reads
Total reads	2,052,300	3,681,083	221,344	430,986
Sequencing yield (Gbp)	1.440	1.959	0.410	0.450
Human reads
Number of reads	1,584,229	2,674,013	117,071	282,113
%	77.19	72.64	52.89	65.46
Sequencing yield (Gbp)	0.908	1.199	0.104	0.162
%	63.07	61.23	25.31	35.92
Microbial reads
Number of reads	468,071	1,007,070	104,273	148,873
%	22.81	27.36	47.11	34.54
Sequencing yield (Gbp)	0.532	0.759	0.306	0.289
%	36.93	38.77	74.69	64.08
N50	2810	1929	6106	5594
Microbial reads >2.5 kb
Number of reads	43,173	49,572	40,248	34,949
%	2.10	1.35	18.18	8.11
Sequencing yield (Gbp)	0.278	0.352	0.252	0.215
%	19.32	17.98	61.26	47.74
Microbial reads >20 kb
Number of reads	1269	2294	300	407
%	0.06	0.06	0.14	0.09
Sequencing yield (Gbp)	0.035	0.067	0.008	0.011
%	2.42	3.41	1.86	2.49

GT, Qiagen Genomic Tip 20/G with enzymatic treatment; SS size‐selection.

Comparison of sequencing run read metrics between untreated and BluePippin size‐selected samples. All percentages are relative to the total reads (or sequencing yield) of the quality filtered reads GT, Qiagen Genomic Tip 20/G with enzymatic treatment; SS size‐selection.

Extraction method and read length distribution alter taxonomic profiles

Extraction methods may have distinctive biases that can impact the determination of microbial community from metagenomic data, such as a differential ability to lyse Gram‐negative vs. Gram‐positive organisms. To investigate the microbial community composition of our long‐read metagenomes we used bacterial, archaeal and eukaryotic single‐copy core genes (SCGs) and rRNAs. Given the high rate of insertion or deletion errors in nanopore sequencing, predicting genes accurately is a significant challenge in uncorrected reads. While we assumed that these biases would similarly impact all extraction methods and elected to not correct our raw reads, we suggest the use of appropriate bioinformatics tools to correct frame‐shift errors in long‐read sequencing results (Arumugam et al., 2019; Huang et al., 2020). In agreement with the sequencing yield of bacterial reads across extraction strategies, GT had the most and PC had the fewest number of genes predicted (Table 5). PC also yielded the lowest number of 16S rRNA genes with an average of 123 genes, as compared to GT and AE, which yielded an average of 418 and 398 genes, respectively. Read size distribution played an important role in the recovery of SCGs and rRNA genes: for example, PC_2 and AE_1 had a comparable sequencing yield of microbial reads (436 and 361 Mbp respectively, Table 3) but the numbers of SCGs and rRNA genes were two times larger in AE_1, despite a smaller number of predicted genes. These two extraction strategies differed drastically in their read size distribution (Figure 2) with AE_1 having fewer very long reads (>50,000 bp, Table S1b) but more medium‐sized ones (>2.5 kbp, Table 3). To complement this observation, AE_2 had the most SCGs and rRNA genes out of all extractions and also the most reads above 2.5 kbp (Table 3), despite having very few long reads (only three >50 kbp, Table S1b) and being the second smallest sample in terms of number of reads and sequencing yield before the removal of human reads (Table 3).

TABLE 5

Results of prodigal gene calling and HMM hits for single‐copy core genes (SCGs) and ribosomal RNAs (rRNAs). Analysis was performed after removal of human‐reads

	Qiagen Genomic Tip (GT)		Phenol–chloroform (PC)		Agarose encasement (AE)
	GT_1	GT_2	PC_1	PC_2	AE_1	AE_2
Number of genes	676,577	994,050	391,675	665,986	440,845	771,653
Bacterial SCGs	2893	3131	1309	1210	2136	3559
rRNAs (per 1000 genes)	901 (1.33)	1107 (1.11)	295 (0.75)	307 (0.46)	655 (1.49)	1355 (1.76)
Bacterial 16S rRNA	373	462	120	125	249	547

Results of prodigal gene calling and HMM hits for single‐copy core genes (SCGs) and ribosomal RNAs (rRNAs). Analysis was performed after removal of human‐reads We used the HOMD to assign taxonomy to the 16S rRNA genes found in our long reads. The top genera included Prevotella, Rothia, Streptococcus and Veillonella, which are commonly found in oral samples (Mark Welch et al., 2016; Zaura et al., 2009). We observed a comparable taxonomic profile between GT and AE, with Streptococcus, Veillonella, Prevotella and Actinomyces identified as the most abundant genera (Figure 3). PC differed with relatively more Prevotella and Haemophilus (both Gram‐negative bacteria) and less Streptococcus (a Gram‐positive bacterium).

FIGURE 3

Relative abundance of 16S rRNA at the genus level. We used the Human Oral Microbiome Database (HOMD) to assign taxonomy to the 16S rRNA from the MinION reads. For the short‐read metagenomes, we used the taxonomy of the ribosomal gene S7 with the Genome Taxonomy Database (GTDB). We processed the 16S rRNA amplicons with the Minimum Entropy Decomposition (MED) algorithm and used Silva version 132 to assign taxonomy. Genera representing less than 1% of a sample were pooled as rare (in light grey). Samples noted as TD correspond to an additional sampling performed 2 weeks after the initial pool of samples used for the long‐read extractions. PB, DNeasy PowerSoil with modified bead beating; GT, Qiagen Genomic Tip 20/G with enzymatic treatment; PC, phenol–chloroform; AE, agarose encasement. The designations “_1” and “_2” indicate replicate 1 and replicate 2, respectively We then compared these profiles to the short‐read sequencing of 16S rRNA gene amplicons found in a sample (TD) collected for the same individual 2 weeks after the initial sampling. While long‐read and amplicon sequencing results were similar to each other qualitatively, the relative abundance estimates between these approaches differed (Figure 3; Table S2). Multiple sources of bias may have contributed to these differences, including the variable detection of rare taxa due to differences in depth of sequencing, differences in sampling time, the use of a single time point for TD vs. the pooling of samples from all other sites, and the use of primers. For instance, our 16S rRNA gene primers did not match to TM7, a prevalent taxon in the human oral cavity, which explains their absence in the amplicon sequencing. Finally, we compared taxonomic profiles across protocols to the short‐read sequencing of two metagenomes from the same individual generated from (i) the sample used for PB_2 and (ii) the sample used for 16S rRNA gene amplicon sequencing (sample TD). To estimate the relative abundance of taxa in metagenomes, we used short read coverage of single‐copy core genes matching to ribosomal protein S7, which was the most frequently found ribosomal protein in the assemblies of short‐read metagenomes (Table S3). While the genus composition was very comparable with long‐read metagenomes, we observed a lower relative abundance of Streptococcus and higher relative abundance of TM7, which is probably due to copy number of the rRNA operons in these genera (Stoddard et al., 2015) that skew the relative abundance estimates based on short‐read amplicons and long‐read metagenomes. We also investigated the distribution and taxonomic assignment of ribosomal proteins in long‐read metagenomes (Table S3, Figure S1), but the relatively low sequencing depth of these data and challenges associated with gene calling in uncorrected MinION sequences prevented reliable insights. Overall, these results highlight (i) the differences observed between extraction methods (i.e., Gram‐negative/positive biases), but also (ii) the critical role that the read size distribution has on the recovery of SCGs, rRNAs and, consequently, its impact on taxonomic profiling.

Read size distribution dramatically influences assembly outcomes

We finally compared different extraction methods by assembling microbial reads using a long‐read assembler, flye (Kolmogorov et al., 2020). We also used short reads generated with PB_2 to polish the assemblies using pilon (Walker et al., 2014). GT and AE resulted in larger assemblies with an average size of 31,518,599 and 28,354,223 bp respectively, while the average assembly size of PC was much smaller at 11,269,952 bp (Table 6). The extraction method GT stood out for its high N50, but also by having the longest contig assembled (1,176,789 bp) and the greatest number of contigs above 100 kbp. The number of predicted genes was directly related to the assembly size. For that reason, GT had on average more genes predicted and more SCGs and rRNAs, which led to more bacterial genomes expected in the assembly (based on the most frequently occurring count of SCG hits). The assembly step did not result in any circular bacterial chromosomes but there were a few circular plasmids and phage genomes. PC had the fewest circular contigs (Table 6), and while AE had more, they were shorter. GT stood out again with the most circular contigs.

TABLE 6

flye assembly statistics. Assemblies were polished using pilon and short‐reads from the extraction PB_2

	Qiagen Genomic Tip (GT)		Phenol–chloroform (PC)		Agarose encasement (AE)
	GT_1	GT_2	PC_1	PC_2	AE_1	AE_2
Total length (bp)	28,002,213	35,034,984	10,630,822	11,909,082	17,557,138	39,151,308
Number of contigs	401	466	137	106	483	855
No.of contigs >5 kb	369	412	130	101	467	775
No. of contigs >10 kb	347	383	124	100	440	714
No. of contigs >20 kb	299	336	107	97	354	564
No. of contigs >50 kb	159	194	73	67	84	203
No. of contigs >100 kb	63	84	33	35	21	69
Longest contig (bp)	1,025,627	1,176,789	504,505	676,998	304,763	875,613
Shortest contig (bp)	512	528	2103	1025	913	511
N50 (bp)	129,677	155,366	122,828	166,496	44,461	71,755
Number of genes	32,540	41,818	11,628	13,025	21,392	48,784
Single‐copy core genes
Bacteria_71	845	1072	298	339	551	1163
Archaea_76	428	554	139	167	283	614
Protista_83	34	48	11	14	26	54
Ribosomal RNAs	75	116	30	43	44	100
No. of expected bacterial genome	12	12	5	5	7	15
Circular contigs
Number of circular contigs	20	32	8	5	9	24
Max. length	86,329	155,366	155,422	155,411	24,494	88,098

flye assembly statistics. Assemblies were polished using pilon and short‐reads from the extraction PB_2

DISCUSSION

Prior to undertaking this evaluation, we established two key conditions that a successful HMW DNA extraction method to study host‐associated environments with low microbial biomass would need to fulfil: (i) optimizing a read length distribution profile to improve downstream sequence analyses and (ii) minimizing the proportion of host‐associated DNA contamination. While the utility of minimizing nontarget DNA in sequencing libraries is relatively obvious from a resource conservancy point of view, the impact of the read length distribution on studies of metagenomes may escape attention when a sequencing strategy focuses on obtaining the longest possible reads. Based on the assembly of 2267 genomes and the average length of rDNA operons, Koren et al. (2013) had suggested 7 kbp as the ideal read length that could span through most repeats for dramatic improvements in genome assembly. The modified column‐based extraction method (GT) consistently yielded the greatest number of reads over 10 kbp and led to most successful downstream assemblies, as indicated by the total size, the length of longest contigs and the number of circularized elements. The representation of host DNA contamination varied between the extraction methods, and reached its maximum in the phenol–chloroform extraction (PC). As a consequence, the phenol–chloroform extraction resulted in fewer microbial reads for an equal amount of DNA input (Table S2). While host DNA was enriched in the pool of shorter sequences and thus could be eliminated through size selection, our analyses revealed a high cost for this step as size selection removed many of the very long reads (>60 kbp) and required three times more DNA as input. Size selection can also be performed using SPRI, which holds great potential for improvements on read‐size distributions. Indeed, a recent analysis of diverse and customized SPRI formulations demonstrated significant increases in the sizing threshold of SPRI beads from 150–800 bp to 1.5–7 kbp (Stortchevoi et al., 2020). The modified PowerSoil kit methods resulted in low DNA yield, which was probably due to our modifications of the manufacturer protocols regarding the bead beating step. Extraction methods that rely on mechanical sample lysis will probably result in much lower yields when modified to diverge from the manufacturer's guidelines to maximize the recovery of HMW DNA. A unique advantage of nanopore sequencing platforms is the real‐time access to the product that is being sequenced, which promotes new methods for reference‐based identification of nontarget DNA molecules and their real‐time rejection from an active pore to minimize their impact on sequencing results (Kovaka et al., 2020). Maximizing the input DNA quality, and the use of novel assembly algorithms (Kolmogorov et al., 2020; Koren et al., 2017; Li, 2016; Ruan & Li, 2020) and effective polishing strategies (Loman et al., 2015; Morisse et al., 2021; Vaser et al., 2017; Walker et al., 2014), will lead to new insights in metagenomics especially with complete circular MAGs.

AUTHOR CONTRIBUTIONS

F.T., K.L. and A.M.E. conceived the study. K.L. developed and implemented long‐read sequencing protocols. F.T. performed analyses of short‐ and long‐read sequencing data. K.L., E.F. and A.S. helped with data analysis. H.G.M. performed short‐read sequencing. H.G.M., L.B., B.J. and A.M.E. supervised research. F.T., K.L. and A.M.E. wrote the paper with critical input from all authors.

OPEN RESEARCH BADGES

This article has earned an Open Data Badge for making publicly available the digitally‐shareable data necessary to reproduce the reported results. The reproducible workflow and the data can found at the URL https://merenlab.org/data/hmw‐dna‐extraction‐strategies/. Fig S1 Click here for additional data file. Table S1 Click here for additional data file. Table S2 Click here for additional data file. Table S3 Click here for additional data file.

74 in total

1. Minimap2: pairwise alignment for nucleotide sequences.

Authors: Heng Li
Journal: Bioinformatics Date: 2018-09-15 Impact factor: 6.937

2. Beyond microbial community composition: functional activities of the oral microbiome in health and disease.

Authors: Ana E Duran-Pinedo; Jorge Frias-Lopez
Journal: Microbes Infect Date: 2015-04-07 Impact factor: 2.700

3. Improved high-molecular-weight DNA extraction, nanopore sequencing and metagenomic assembly from the human gut microbiome.

Authors: Dylan G Maghini; Eli L Moss; Summer E Vance; Ami S Bhatt
Journal: Nat Protoc Date: 2020-12-04 Impact factor: 13.491

4. Long-read metagenomics retrieves complete single-contig bacterial genomes from canine feces.

Authors: Anna Cuscó; Daniel Pérez; Joaquim Viñes; Norma Fàbregas; Olga Francino
Journal: BMC Genomics Date: 2021-05-06 Impact factor: 3.969

5. Evaluation of methods for the extraction and purification of DNA from the human microbiome.

Authors: Sanqing Yuan; Dora B Cohen; Jacques Ravel; Zaid Abdo; Larry J Forney
Journal: PLoS One Date: 2012-03-23 Impact factor: 3.240

6. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

Authors: Sergey Koren; Brian P Walenz; Konstantin Berlin; Jason R Miller; Nicholas H Bergman; Adam M Phillippy
Journal: Genome Res Date: 2017-03-15 Impact factor: 9.043

Review 7. New approaches for metagenome assembly with short reads.

Authors: Martin Ayling; Matthew D Clark; Richard M Leggett
Journal: Brief Bioinform Date: 2020-03-23 Impact factor: 11.622

Review 8. Accurate and complete genomes from metagenomes.

Authors: Lin-Xing Chen; Karthik Anantharaman; Alon Shaiber; A Murat Eren; Jillian F Banfield
Journal: Genome Res Date: 2020-03-18 Impact factor: 9.043

9. Complete Genome Sequence of Pseudomonas putida Strain TS312, Harboring an HdtS-Type N-Acyl-Homoserine Lactone Synthase, Isolated from a Paper Mill.

Authors: Ayaka Hosoe; Toshikazu Suenaga; Takumi Sugi; Taro Iizumi; Naohiro Nagai; Akihiko Terada
Journal: Microbiol Resour Announc Date: 2020-03-26

10. The Gut Virome Database Reveals Age-Dependent Patterns of Virome Diversity in the Human Gut.

Authors: Ann C Gregory; Olivier Zablocki; Ahmed A Zayed; Allison Howell; Benjamin Bolduc; Matthew B Sullivan
Journal: Cell Host Microbe Date: 2020-08-24 Impact factor: 21.023

2 in total

1. Long-Read-Resolved, Ecosystem-Wide Exploration of Nucleotide and Structural Microdiversity of Lake Bacterioplankton Genomes.

Authors: Yusuke Okazaki; Shin-Ichi Nakano; Atsushi Toyoda; Hideyuki Tamaki
Journal: mSystems Date: 2022-08-08 Impact factor: 7.324

2. High molecular weight DNA extraction strategies for long-read sequencing of complex metagenomes.

Authors: Florian Trigodet; Karen Lolans; Emily Fogarty; Alon Shaiber; Hilary G Morrison; Luis Barreiro; Bana Jabri; A Murat Eren
Journal: Mol Ecol Resour Date: 2022-02-07 Impact factor: 8.678

2 in total