
Long-fragment targeted capture for long-read sequencing of plastomes.

Kevin Bethune¹, Cédric Mariac¹, Marie Couderc¹, Nora Scarcelli¹, Sylvain Santoni², Morgane Ardisson², Jean-François Martin³, Rommel Montúfar⁴, Valentin Klein¹, François Sabot¹, Yves Vigouroux¹, Thomas L P Couvreur¹.

Abstract

PREMISE: Third-generation sequencing methods generate significantly longer reads than those produced using alternative sequencing methods. This provides increased possibilities for the study of biodiversity, phylogeography, and population genetics. We developed a protocol for in-solution enrichment hybridization capture of long DNA fragments applicable to complete plastid genomes.
METHODS AND RESULTS: The protocol uses cost-effective in-house probes developed via long-range PCR and was used in six non-model monocot species (Poaceae: African rice, pearl millet, fonio; and three palm species). DNA was extracted from fresh and silica gel-dried leaves. Our protocol successfully captured long-read plastome fragments (3151 bp median on average), with an enrichment rate ranging from 15% to 98%. DNA extracted from silica gel-dried leaves led to low-quality plastome assemblies when compared to DNA extracted from fresh tissue.
CONCLUSIONS: Our protocol could also be generalized to capture long sequences from specific nuclear fragments.

Entities: Chemical

Keywords: DNA probes; MinION; de novo assembly; long‐range PCR; whole plastome sequencing

Year: 2019 PMID: 31139509 PMCID: PMC6526642 DOI: 10.1002/aps3.1243

Source DB: PubMed Journal: Appl Plant Sci ISSN: 2168-0450 Impact factor: 1.936

High‐throughput sequencing is revolutionizing research in plant evolutionary biology. The development of second‐generation sequencing, also known as next‐generation sequencing (NGS), led to the cost‐effective generation of a massive amount of sequence data (Straub et al., 2012). Although NGS offers many advantages, one shortcoming is that this sequencing method generates short reads (100–400 bp). This is problematic for de novo assemblies of plant genomes that prove difficult to resolve due to repetitive sequences resulting from transposable elements, polyploidy, and large genome sizes. In contrast to NGS, third‐generation sequencing (TGS) directly targets single DNA molecules without prior PCR, enabling “real‐time sequencing” (Bleidorn, 2016). The main improvement of TGS is the significant increase in read length from several to tens of thousands of bases per single read (termed “long reads”). This provides important advantages to improve de novo assemblies (Jiao and Schneeberger, 2017), gap filling (Eckert et al., 2016), or phasing (Laver et al., 2016). Technologies such as MinION, a portable real‐time sequencing device developed by Oxford Nanopore Technologies (ONT; Oxford, United Kingdom), are able to generate mean read lengths ranging from 5 to 20 kbp in standard analyses (and peak up to 2 Mbp) depending on the quality of the DNA (Lee et al., 2016). One drawback is that most TGS technologies have high error rates when compared to NGS (~10% for ONT MinION vs. 0.1% for Illumina; Goodwin et al., 2016). However, new base‐calling algorithms, associated with a posteriori corrections, allow for a significant decrease of sequence errors. With sufficient coverage and proper algorithms, TGS can lead to assemblies with consensus nucleotide accuracy of 99.90% (Lee et al., 2016).

The application of TGS using MinION to large genomes such as plants is problematic mainly because of the generally low output of data currently available from MinION (10–20 Gbp vs. 1500 Gbp for a HiSeq 4000 [Illumina, San Diego, California, USA]). Thus, efficiently sequencing specific regions will depend on genome reduction approaches, such as targeted sequencing (Cronn et al., 2012; Jones and Good, 2016). Genome reduction via sequence capture refers to DNA fragments (nuclear, ribosomal, or plastid) that are directly captured from a total genomic library using probes binding to the complementary DNA sequences. This approach has the advantage of being cost effective, optimizing read depth on the targeted region, and allowing the analysis of more samples per run. However, the capture and sequencing of long DNA fragments has yet to be properly applied in plants. Indeed, sequence capture is only routinely undertaken on short DNA fragments (Mamanova et al., 2010; Cronn et al., 2012), limiting its usefulness for long read–based TGS.

In this study, we focused on complete plastid genome or plastome sequencing for two main reasons. First, in practical terms, plastome DNA provides an ideal model to test capture protocols because it is generally easy to sequence and good a priori data on structure and composition are available (reference plastomes are available for numerous species). Thus, plastome DNA is a good starting point to validate capture protocols that can then be extended to the nuclear genome. Second, the plastome has been shown to be a cost‐effective marker for the study of plant evolution (Mariac et al., 2014; Twyford and Ness, 2017), and improved sequencing is highly desirable. Indeed, de novo assembly of genomes based on short reads can be problematic because of repeated regions, which lead to low‐quality assemblies (Sohn and Nam, 2018). In plastomes, the two large inverted repeat (IR) regions (~20 kbp long) present in most plant species are problematic for de novo assembly (Mariac et al., 2014); long reads would be particularly useful here. 
This is especially true for non‐model taxa for which high‐quality reference genomes are not available. Given the low output of data from MinION, this technology cannot be easily used to sequence plastomes directly from genomic DNA (e.g., genome skimming). In addition, long‐read sequencing will provide new insights into the structural variation of plastomes (Mower and Vickrey, 2018). The main challenge in efficiently applying TGS to the study of plant evolution will be based on our ability to capture long DNA fragments. To date, long‐read targeted capture has mainly been undertaken on organisms with small genomes such as bacteria or viruses (e.g., Eckert et al., 2016); it has rarely been attempted in organisms with large genomes such as plants. Protocols for DNA enrichment for segments in excess of 20 kbp in length have also been developed (Dapprich et al., 2016). In plants, few studies have undertaken long‐read targeted capture (Giolai et al., 2016, 2017). These protocols prove that capturing long DNA fragments is possible but has yet to be routinely developed for non‐model plants. Here, we present a protocol to capture long reads for plastome sequencing and reassembly using ONT MinION technology. We first developed our protocol for the model plant species Oryza sativa L. (Asian rice). We then applied the protocol to sequence plastomes in several wild species and non‐model but economically important crops. Finally, we tested the ability to capture and assemble plastomes from DNA extracted from silica gel–dried leaves.

MATERIAL AND METHODS

Sampling strategy and DNA extraction

For this study, we focused on seven economically important plant species from Asia, Africa, and South America. First, we developed and validated our long‐read capture protocol using the model plant species O. sativa (Asian rice). We then applied our protocol to several other plant species from the same genus (Oryza L.), family (Poaceae), and finally superorder (Lilianae or monocotyledons): African rice (O. glaberrima Steud.), pearl millet (Cenchrus americanus (L.) Morrone [previously known as Pennisetum glaucum (L.) Leeke]), fonio (Digitaria exilis Stapf), and three species of palms (Podococcus acaulis Hua, Raphia textilis Welw., and Phytelephas aequatorialis Spruce) (Table 1, Appendix 1).

Table 1

Output data obtained from MinION plastome‐enriched library sequencing

Species	DNA	Probe origin	Total no. of reads	Median read length (bp)	Longest read (bp)	% of plastome readsᵃ	X‐fold enrichment	Longest plastome read (bp)	Median plastome read length (bp)
Oryza sativa	Fresh	Oryza sativa	17,129	4627	26,128	70.8	5.3	25,828	4264
Oryza glaberrima	Fresh	Oryza sativa	81,361	3695	24,804	98.2	12.4	24,504	3398
Cenchrus americanus	Fresh	Oryza sativa	105,760	4914	25,468	97.0	156.3	25,167	4623
Digitaria exilis	Fresh	Oryza sativa	141,250	3783	19,378	94.4	13.8	19,078	3489
Podococcus acaulis	Silica gel	Podococcus barteri	202,924	2486	13,103	15.7	25.0	12,805	2129
Raphia textilis	Silica gel	Podococcus barteri	83,833	2322	10,705	87.5	94.8	10,405	1997
Phytelephas aequatorialis	Silica gel	Podococcus barteri	202,925	2437	15,132	79.0	21.6	14,832	2158

ᵃ The percentage of plastome mapped reads was calculated using Burrows–Wheeler alignment to indicated reference plastomes.

DNA was extracted from fresh leaves for O. sativa, O. glaberrima, C. americanus, and D. exilis; while silica gel–dried leaves were used for DNA extraction for P. acaulis, R. textilis, and P. aequatorialis. In both cases, DNA extraction was performed using a MATAB lysis buffer (Sigma‐Aldrich, St. Louis, Missouri, USA) and chloroform isoamyl alcohol (24 : 1) purification method following Mariac et al. (2006).

General probe design

Long‐fragment plastome sequences were captured from the total genomic DNA extracts using two different sets of biotinylated probes: one based on O. sativa and used on related Poaceae species (O. glaberrima, C. americanus, D. exilis) and one based on Podococcus barteri G. Mann & H. Wendl. and used for P. acaulis, R. textilis, and P. aequatorialis. Podococcus barteri is the sister species to P. acaulis, whereas R. textilis and P. aequatorialis are distantly related to P. barteri, being in two different subfamilies (Calamioideae and Ceroxyloideae, respectively). Probe production (Fig. 1; see Appendix 2 for a detailed protocol) was undertaken following the protocol described elsewhere (Cronn et al., 2012; Mariac et al., 2014) and led to an average probe size of 300 bp. First, an initial full‐length plastome was amplified by long‐range PCR (LR‐PCR) using 11 primer pairs taken from Scarcelli et al. (2011) for O. sativa (Appendix S1), and another set of 11 primer pairs taken from Faye et al. (2016) for P. barteri (Appendix S1). LR‐PCR was carried out using the LongAmp Taq PCR kit (#E5200S; New England BioLabs, Ipswich, Massachusetts, USA) following the manufacturer's instructions in a final volume of 50 μL and using 300 ng of DNA. For each probe set, LR‐PCR amplicons were pooled at an equimolar ratio and sheared to reach a mean size fragment of 300 bp, then ligated to adapters for PCR amplification with biotinylated primers.

Figure 1

Schematic representation of the protocol used for long‐fragment capture of plastomes (adapted from Mariac et al., 2014).

Library preparation, in‐solution hybridization, multiplexing, and sequencing

Illumina libraries were constructed following the protocol of Rohland and Reich (2012) using 6‐bp barcodes and Illumina indices, with extra steps added to allow for amplification and in‐solution hybridization (Fig. 1, Appendix S2). Briefly, each high‐molecular‐weight DNA was sheared using a g‐TUBE (Covaris, Woburn, Massachusetts, USA) to a mean target size of 10 kbp. DNA fragments less than 2000 bp were removed by a sizing step performed with 0.4× AMPure magnetic beads (Beckman Coulter, Beverly, Massachusetts, USA). DNA was then end‐repaired, ligated with adapters (allowing PCR amplifications), and then nick filled‐in before performing a pre‐hybridization PCR. Optimal cycle number (ranging from five to 12) was defined by real‐time amplification (KK2700; KAPA Biosystems, Roche Sequencing and Life Science, Bâle, Switzerland). After clean‐up and quantification using the NanoQuant Plate (Tecan Group Ltd., Männedorf, Switzerland) and QIAxcel (QIAGEN, Valencia, California, USA), library preparations were mixed with biotin‐labeled probes for hybridization of the targeted regions. DNA probe hybridization complexes were then immobilized with 100 μg of streptavidin‐coated magnetic beads. This step was performed using the Dynabeads M‐280 kilobaseBINDER Kit (#60101; Invitrogen, ThermoFisher Scientific, Waltham, Massachusetts, USA), which is designed for immobilizing double‐stranded DNA molecules longer than 2 kbp. A magnetic field was applied to the resulting solution, and the supernatant containing unbound DNA was discarded. Enriched DNA fragments were then dehybridized from the beads and amplified in 12 to 15 cycles of real‐time PCR in order to obtain the requested quantity for the Nanopore library preparation. We then prepared the DNA to create the MinION library. 
The final libraries were then constructed following the ONT MinION library preparation detailed in the 1D Amplicon by ligation (SQK‐LSK108; ONT) protocol for single samples and also in the 1D Native barcoding genomic DNA (with EXP‐NBD103 and SQK‐LSK108) protocol. Briefly, 1 μg of enriched DNA was end‐repaired, extended with a dA‐tailing, ligated with Nanopore barcodes, and then ligated with Nanopore tether‐adapter before loading and sequencing on the MinION flow cell. To benefit from multiplexing and to limit costs and workload, up to four individuals were pooled at an equimolar ratio using ONT barcodes. Prior to each run, flow cells (FLO‐MIN106, R9.4 Version; ONT) were quality tested using MinKNOW software (version 1.2.8; ONT) to ensure the presence of at least 50% (256) active channels. Flow cells were loaded with approximately 275 ± 100 fM capture‐amplified DNA libraries. All costs and product details are provided in Appendix S3.
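Equimolar pooling and loading in femtomolar amounts both rely on converting a DNA mass into a molar quantity. The Python sketch below shows one common way to do this, assuming an average mass of about 660 g/mol per double-stranded base pair. The function names, library names, concentrations, and fragment lengths are hypothetical illustrations and are not taken from the protocol or its appendices.

```python
# Approximate average molar mass of one double-stranded base pair (assumption).
AVG_BP_MASS_G_PER_MOL = 660.0


def ng_to_fmol(mass_ng, mean_length_bp):
    """Convert a dsDNA mass in ng to femtomoles for a given mean fragment length."""
    # ng -> g is 1e-9, mol -> fmol is 1e15, hence the net factor of 1e6.
    return mass_ng * 1e6 / (AVG_BP_MASS_G_PER_MOL * mean_length_bp)


def equimolar_volumes(libraries, fmol_each):
    """Return the volume (in uL) of each library that contributes `fmol_each` fmol.

    `libraries` maps a library name to (concentration in ng/uL, mean length in bp).
    """
    volumes = {}
    for name, (conc_ng_per_ul, mean_length_bp) in libraries.items():
        fmol_per_ul = ng_to_fmol(conc_ng_per_ul, mean_length_bp)
        volumes[name] = fmol_each / fmol_per_ul
    return volumes


# Hypothetical barcoded libraries: (ng/uL, mean fragment length in bp).
example = {"barcode01": (40.0, 4500), "barcode02": (25.0, 2500)}
for name, volume in equimolar_volumes(example, fmol_each=50.0).items():
    print(f"{name}: {volume:.2f} uL")
```

Because the molar amount depends on fragment length, libraries with shorter fragments contribute more molecules per nanogram, which is why pooling by mass alone would over-represent them.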

Non‐enriched MiSeq data

To estimate enrichment rate, we used single‐sample non‐enriched library data sets originating from various Illumina MiSeq sequencing runs for O. sativa, O. glaberrima, C. americanus, and P. aequatorialis. We used MiSeq data here because its higher yield (compared to MinION) allows a better estimation, at a lower cost, of the percentage of plastid reads of a non‐enriched library. For D. exilis, P. acaulis, and R. textilis, we merged 10, two, and 16 samples, respectively, of non‐enriched libraries to provide adequate read counts. Forward sequencing read outputs from each MiSeq run (i.e., R1 files) were first demultiplexed using the demultadapt script (https://github.com/Maillol/demultadapt) to sort reads according to a given barcode list. Adapters at the beginning of each read from the R2 and demultiplexed R1 files were removed using cutadapt‐1.2.1 software (Martin, 2011) with the default parameters. Reads were then filtered by length (size >35 bp) and mean quality values (Q > 30) before being paired using compare_fastq_paired_v5.pl (https://github.com/SouthGreenPlatform/arcad-hts/blob/master/scripts/arcad_hts_3_synchronized_paired_fastq.pl and https://github.com/SouthGreenPlatform/arcad-hts/blob/master/scripts/arcad_hts_2_Filter_Fastq_On_Mean_Quality.pl). A final trimming step using the fastx‐trimmer command from the FASTX‐Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/) was undertaken onto the R2 paired files to remove the last six bases of each read to ensure removal of any possible barcode present on short reads.
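The filtering and trimming criteria described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds (length >35 bp, mean quality >30, removal of the last six bases), not the cited scripts themselves; it assumes Phred+33 quality encoding and an arithmetic mean of the per-base Phred scores, and all function names are hypothetical.

```python
def mean_phred(quality, offset=33):
    """Arithmetic mean of per-base Phred scores (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def keep_read(sequence, quality, min_length=35, min_mean_quality=30.0):
    """Keep reads longer than min_length whose mean quality exceeds the threshold."""
    return len(sequence) > min_length and mean_phred(quality) > min_mean_quality


def trim_tail(sequence, quality, n_bases=6):
    """Remove the last n_bases from a read and from its quality string."""
    return sequence[:-n_bases], quality[:-n_bases]


# Hypothetical read: 40 bases, every base at Phred 40 ("I" in Phred+33).
seq, qual = "ACGT" * 10, "I" * 40
if keep_read(seq, qual):
    seq, qual = trim_tail(seq, qual)
print(len(seq))  # 34
```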

Bioinformatics

All command lines are available in Appendix S4. Using the MinION Fast5 output format, base‐calling and demultiplexing were undertaken using Albacore v2.5.11 (https://github.com/Albacore/albacore). This generated a FASTQ file from which reads were filtered out if the average quality score was lower than 7. For each barcode, a quality control using the MinionQC R script (https://github.com/roblanf/minion_qc) was performed to check for read mean length and quality scores. Reads were then trimmed using Porechop (https://github.com/rrwick/Porechop) in order to remove the sequencing adapters and barcodes. The only non‐default setting was that splitting reads containing adapters in the middle was disabled, in order to avoid issues during the polishing step using Nanopolish (see below). For each library, the percentage of plastome‐associated reads was estimated by mapping reads to a reference plastid genome using the Burrows–Wheeler alignment tool (BWA‐MEM, https://github.com/lh3/bwa) with the “‐B 1” option for non‐enriched short‐read data and the “‐x ont2d” option for long‐read data (Li and Durbin, 2009). We then calculated the X‐fold enrichment to evaluate capture efficiency (the ratio of plastome reads obtained with capture relative to plastome reads obtained without capture). Coverage and depth values were calculated using Bedtools (Quinlan and Hall, 2010) genomecov (https://github.com/arq5x/bedtools2). Mismatch percentage values between mapped reads and references were recovered using Tablet version 1.17.08.17 (Milne et al., 2010).
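To make the read-quality filter concrete, the following Python sketch computes a per-read average quality and applies the threshold of 7. It assumes Phred+33 encoding and assumes that the average is taken over per-base error probabilities before converting back to a Phred score; whether the base caller computes the average in exactly this way is an assumption here, and the function names are hypothetical.

```python
import math


def mean_read_quality(quality, offset=33):
    """Average read quality computed from per-base error probabilities."""
    # Convert each Phred score to an error probability, average, convert back.
    error_probs = [10 ** (-(ord(char) - offset) / 10) for char in quality]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def passes_filter(quality, min_quality=7.0):
    """Keep a read unless its average quality is lower than min_quality."""
    return mean_read_quality(quality) >= min_quality


# A read mixing Phred 20 ("5") and Phred 2 ("#") bases fails the filter,
# although the arithmetic mean of its Phred scores (11) would pass.
print(passes_filter("5#"))  # False
```

Averaging error probabilities penalizes reads that contain stretches of very poor bases more strongly than averaging the Phred scores directly, which is the reason this variant is shown.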

De novo assembly of plastid genomes

We used the Flye assembler version 2.3 (Kolmogorov et al., 2019) for de novo assembly of plastomes based on long MinION reads. For O. sativa, all available reads (17,129) were assembled. For the other species, the number of reads was too high, in excess of 3000× the reference coverage for some data sets, which caused memory usage issues. To alleviate this, the reads were randomly split into sets of approximately equal size. Each set was then assembled individually using the raw Nanopore reads mode. The “min_overlap” parameter (i.e., the minimum overlap between reads) in Flye was adjusted on a species‐by‐species basis ranging from 3000 bp (the default value for our genome size) to 1000 bp, depending on the median read length for each species. This was done in order to ensure that a sufficient number of overlaps were detected for the assembly. The draft assemblies were then polished using Nanopolish version 0.9.1 (https://github.com/jts/nanopolish), using minimap2 on the “map‐ont” preset for the overlapping step. Finally, the assemblies were mapped on the reference sequence of each species using the dnadiff tool of MUMmer version 4.0beta2 (Kurtz et al., 2004), which directly provides alignment coordinates and global statistics such as the mean identity percentage of alignments. Although read length is an important consideration, the uniformity of reference coverage by the reads can also affect plastome assembly. This is especially problematic for low‐molecular‐weight DNA extractions resulting in shorter read lengths on average (as happened in our study comparing extractions from silica gel–dried leaves vs. from fresh tissue). To test for the impact of the uniformity of reference coverage on the assembly, simulated reads for P. aequatorialis (using DNA extracted from silica gel–dried leaves) were generated using NanoSim v2.1.0 (Yang et al., 2017). 
A model was first trained on the raw real reads and then 40,000 simulated reads were generated, ensuring they have approximately the same length distribution and error model as the real reads (see Results). The simulated reads were then assembled using the same workflow described above.
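The random partitioning of reads into sets of approximately equal size, used above to keep assembly memory usage manageable, can be sketched as follows. This is a minimal Python illustration, not the command lines of Appendix S4; the function name, the fixed seed, and the number of sets are hypothetical choices.

```python
import random


def split_reads(read_ids, n_sets, seed=0):
    """Randomly partition read identifiers into n_sets groups of near-equal size."""
    if n_sets < 1:
        raise ValueError("n_sets must be at least 1")
    shuffled = list(read_ids)
    # A seeded generator keeps the partition reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Striding through the shuffled list deals reads out one set at a time,
    # so set sizes differ by at most one read.
    return [shuffled[i::n_sets] for i in range(n_sets)]


subsets = split_reads([f"read_{i}" for i in range(10)], n_sets=3)
print([len(subset) for subset in subsets])  # [4, 3, 3]
```

Each resulting set of identifiers would then be used to extract the corresponding reads before assembling that set on its own.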

RESULTS

Plastome enrichment protocol validation on Oryza sativa

All raw reads for the seven species are available from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (BioProject number PRJNA526996). After read filtering (Q > 7), the median length of the 12,227 mapped plastome reads was 4264 bp (Table 1, Fig. 2B). We recovered the whole plastome with an average coverage depth of 362× for the enriched MinION library, with a standard deviation increasing from 0.25 to 0.36 between non‐enriched and enriched libraries (Fig. 2B, C; Appendix 1). The average mismatch was 11.80% (Appendix 1), and 70.8% of the reads mapped to the reference plastome (Fig. 2D, Appendix 1), representing an approximately fivefold increase in plastome reads when compared to the non‐enriched MiSeq sequenced library (13.32% mapped; Appendix 1). The longest plastome read recovered was 25,828 bp long (Table 1).
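The fold increase for O. sativa follows directly from the two mapped percentages reported here. As a worked check in Python, using only the 70.80% (enriched) and 13.32% (non‐enriched) values from Appendix 1, with a hypothetical function name:

```python
def x_fold(enriched_pct, non_enriched_pct):
    """Ratio of the plastome read percentage with capture to that without capture."""
    if non_enriched_pct <= 0:
        raise ValueError("non-enriched percentage must be positive")
    return enriched_pct / non_enriched_pct


# O. sativa: 70.80% plastome reads with capture vs. 13.32% without.
print(round(x_fold(70.80, 13.32), 1))  # 5.3
```

This matches the X‐fold value of 5.3 listed for O. sativa in Table 1.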

Figure 2

Long‐fragment capture results for Oryza sativa. (A) Panicle of Oryza sativa (© IRD ‐ Jean‐Pierre Montoroi. Reprinted with permission.). (B) Number of reads per read length before mapping. (C) Plastome coverage after mapping. Black bars indicate approximate position of both inverted repeat (IR) regions. (D) Percentage of useful reads mapped to the O. sativa reference plastome (KT289404.1) between non‐enriched (N, light gray) and enriched (E, dark gray) libraries.

Plastome enrichment protocol applied to non‐model species

DNA extraction quality varied with the source of the leaf material used. DNA extracted from fresh tissue always produced single bands (not degraded) with fragments longer than 20 kbp (Appendix 1). In contrast, DNA extracted from silica gel–preserved material was of lower quality and generally degraded (smear present), with fragments less than 20 kbp long (Appendix 1). For the six non‐model species, sequencing of the non‐enriched libraries resulted in 0.63–7.94% of plastome reads (Appendix 1). In contrast, enriched libraries resulted in 15.7–98.2% plastome reads, corresponding to a 12‐fold to 156‐fold increase in plastome DNA sequences (Fig. 3, Table 1, Appendix 1). The mean length of fragments sequenced from DNA extracted from fresh tissue was 4279 bp versus 2525 bp for DNA extracted from silica gel–dried leaves (Fig. 4A). Sequences mapped to the reference plastomes ranged mainly from 2 kbp to 8 kbp, depending on the species (Fig. 4A). Average coverage depth was 1988× for enriched libraries (Fig. 4B, Appendix 1). The longest read mapped to the plastome ranged from 10,405 to 25,167 bp for R. textilis and C. americanus, respectively (Table 1).

Figure 3

Percentage of useful reads mapped to their respective reference plastome (see Table 1) between Illumina non‐enriched (N, light gray) and MinION enriched (E, dark gray) protocols for the six non‐model species in our study. Photo credits: Cenchrus americanus (CC0 Public Domain; https://pxhere.com/fr/photo/706162); Pennisetum glaucum (© IRD ‐ Cédric Mariac, reprinted with permission); Digitaria exilis (© IRD ‐ A. Barnaud, reprinted with permission); Podococcus acaulis (© IRD ‐ Thomas Couvreur, reprinted with permission); Raphia textilis (© IRD ‐ Thomas Couvreur, reprinted with permission); Phytelephas aequatorialis (© IRD ‐ Thomas Couvreur, reprinted with permission).

Figure 4

Long‐fragment capture results for six non‐model plant species. (A) Distribution of read lengths before mapping. (B) Plastome coverage results from the enriched long‐read capture protocol. Black bars indicate approximate position of both inverted repeat (IR) regions.

De novo assembly of the plastid genome

When DNA was extracted from fresh leaves, the plastome was assembled in two contigs covering most of the reference (Table 2, Appendix S5 for a visual example in C. americanus). The longest assembled contigs varied from 81,053 to 125,727 bp. However, the assembler never managed to recover the full circular plastid genome in a single contig. For DNA extracted from silica gel–dried leaves, where reads were shorter and the coverage more heterogeneous, assembly was suboptimal (Table 2, Appendix S6 for a visual example in P. aequatorialis), with more final contigs (10–17), uncovered regions, and sometimes misassemblies. The longest assembled contigs were also much shorter than those from fresh material (Table 2). In addition, the IR regions were often not differentiated.

Table 2

De novo assembly results from real and simulated data in the number of contigs, and coverage and identity percentages to the respective reference plastome genomes (see Table 1)

Species	Minimum overlap (bp)ᵃ	Plastome contigs	Coverage %	Identity %	Longest contig (bp)
Oryza glaberrima	3000	2	92.32	99.14	109,087
Cenchrus americanus	3000	2	99.91	98.52	81,053
Digitaria exilis	3000	2	99.97	99.18	125,727
Podococcus acaulis	1000	17	81.24	98.86	22,803
Raphia textilis	1000	10	83.87	98.84	21,797
Phytelephas aequatorialis	1000	13	87.60	98.31	20,700
Simulated assemblyᵇ	1000	4	99.72	99.05	107,633

ᵃ Minimum overlap between reads as defined in Flye.

ᵇ The simulated data were based on the output results of Phytelephas aequatorialis.

Using a simulated data set of reads uniformly distributed across the plastome (Appendix S7) and of the same quality as the P. aequatorialis reads markedly improved the assembly (Table 2). The assembler produced four contigs (vs. 13) covering 99.72% (vs. 87.60%) of the reference, and the longest contig was 107,633 bp (vs. 20,700 bp). However, the existence of two distinct repeated regions was still not resolved.

DISCUSSION

We show that targeted capture hybridization of long plastome DNA fragments with sufficient coverage (362× to 3318×) is possible in plants (Table 1, Appendix 1). In addition, we show a significant enrichment of our target region (the plastome) when compared to non‐enriched data (Figs. 2D, 3; Appendix 1). The different steps of our protocol (Fig. 1, Appendix 2) are not fundamentally different from previous plastome short‐read capture protocols (e.g., Mariac et al., 2014) based on in‐house probe preparation, shearing, adapter ligation, hybridization, and finally capture (Fig. 1, Appendix 2). Thus, our approach requires minimal adaptation from previous cost‐ and time‐effective protocols and should therefore be of broad interest. The main technical change focused on the beads used to capture long DNA fragments. For that, we used the kilobaseBINDER Kit (Invitrogen), which is said to capture DNA fragments longer than 2 kbp. The sizing step we performed at 0.4× using AMPure removes fragments smaller than 2 kbp and corresponds to the maximum allowed size with the AMPure beads. However, other approaches are possible to achieve sizing with higher molecular weight and could be tested (e.g., gel extraction, automated size selection system). Based on our protocol and costs provided in Appendix S3, we estimate a cost of €33.77 (US$42.00) per individual to sequence a plastome at 30× (costs estimated February 2019). We stress, however, that this value is only an estimate, and the aim of our protocol was not focused on cost effectiveness. When capturing plastomes across a range of different species, we find a difference in enrichment percentage ranging from 15.7% to 98.2% of useful reads (X‐fold: 12.4–156.3; Table 1, Fig. 4). Differences in genome versus plastome ratios between species might explain the variation of on‐target mapped reads percentage compared to non‐enriched libraries. 
In general, our results suggest that species with smaller genomes show higher mapped read percentages, although this trend should be confirmed with more data and specific tests. Alternatively, the quality of the material used for DNA extractions (e.g., the cellular type and the degradation state) can also explain such variations. The low enrichment observed for P. acaulis (Table 1, Fig. 3) could potentially be linked to a large genome size, although we do not have an estimate of its genome. A common coverage gap is observed among the plastid genomes of the three palm species because of a region that was not covered by the probes (see Faye et al., 2016). Lower coverage of other regions can be explained by biases that occur during DNA shearing, PCR amplification, and hybridization capture, considering a CG content effect. Probe bulk normalization from long‐range PCR also has to be taken into account. However, we obtained a decreased standard deviation of the whole target coverage homogeneity (Appendix 1), suggesting that our protocol did not introduce more on‐target coverage heterogeneity. Nevertheless, applying an alternative capture method such as region‐specific extraction (Dapprich et al., 2016) could help maintain overall good coverage by accessing highly complex, variable, repeat‐masked, or unknown regions that prohibit adequate probe binding. Probes were designed to hybridize across the entire targeted region (Fig. 1), as is generally done using short‐read approaches (Stull et al., 2013; Mariac et al., 2014). However, a recent study showed that probes targeting small regions are also effective in capturing long reads surrounding the targeted region. Indeed, Gasc and Peyret (2017) were able to reconstruct a 21.6‐kbp fragment using probes designed for a small 471‐bp microbial gene target. This shows that long‐read capture will also be very useful for targeted sequence capture of nuclear regions. 
We demonstrated the capacity of heterologous plastome probes to capture target DNA in other species or genera in Arecaceae and Poaceae. For example, probes designed on P. barteri hybridized well to other palm genera in different subfamilies. This underlines the good portability of probes for capturing plastomes across a broad evolutionary spectrum (Stull et al., 2013), even for long‐fragment capture.

Limits and challenges

Although we were able to successfully capture long plastome fragments using our enrichment protocol, assembling plastomes from these data remains challenging. Indeed, the best assembly resulted in two mapped contigs, and the worst assembly resulted in 17 (Table 2). Assembly of plastomes is well known to be problematic (Twyford and Ness, 2017), mainly because of the presence of near identical IR regions. Indeed, the similarity of the two IR regions is too high for assemblers to decipher between IR regions when resolving the assembly graph for the entry and exit point of those sequences. Thus, when the sequenced reads are shorter than the IR regions themselves, it becomes difficult to correctly assemble the plastome into a single contig. This can be seen, for example, in C. americanus (Appendix S5), where the resulting two contigs do not completely cross with one of the IR regions, thus failing to reach a single contig. Of course, this problem is amplified when dealing with overall shorter reads sequenced from low‐molecular‐weight DNA (see Appendix S6 for an example). In our case, DNA fragments from silica gel–dried leaves were shorter than those extracted from fresh leaves (Table 1). Moreover, we observed a decrease of the average library fragment size during the preparation steps and after PCR because of preferential amplification of shorter fragments, as observed by Giolai et al. (2016) and Eckert et al. (2016). Optimizing read length in such a way that single reads are longer than the entire IR region should significantly help in the assembly process. In this sense, DNA shearing could be removed in order to increase the average size of the reads. 
Technical limitations would, however, include (1) the ability of streptavidin beads to immobilize fragments of tens of thousands of base pairs and (2) the long‐range PCR amplification step of the enriched fragments, which is necessary to produce an input of several hundred nanograms for the construction of Nanopore libraries. The latter is probably the most limiting because it is difficult to produce amplicons more than 10 kbp long, and even if this is achieved, representation bias must be considered. Finally, we show, via simulations, that the uniformity of read coverage across the reference is important for assembly (Table 2). Indeed, uniformly distributed reads, even of lower quality, lead to better assemblies than poor coverage of the reference (Table 2). Therefore, uniform coverage of the reference by the captured reads plays a significant role in the correct and improved assembly even for suboptimal DNA extractions. A final concern, which is not restricted to long‐read sequencing, is the presence of plastid DNA in the nuclear genome (i.e., nuclear plastid DNAs [NUPTs]). The vast majority of NUPTs are of small size (<1000 bp; Yoshida et al., 2014) and thus are normally not captured and sequenced using our protocol (we sized libraries for reads >2 kbp, see Appendix 2). Longer NUPTs up to several kilobase pairs long have also been reported (Yoshida et al., 2014). However, differentiating these NUPTs from true plastid DNA is difficult because of their length and because they are generally associated with low divergence levels (p‐distance <0.01%; Yoshida et al., 2014). It is thus possible that our protocol does capture NUPTs longer than 2 kbp, influencing our enrichment efficiency (the X‐fold value). However, this did not affect our de novo assembly of the Oryza plastome (99.14% identity between the de novo assembly and the reference plastome), even though rice contains many NUPTs, including long ones (Yoshida et al., 2014).

AUTHOR CONTRIBUTIONS

C.M., Y.V., and T.L.P.C. conceived the idea; R.M., T.L.P.C., C.M., and Y.V. provided material; C.M., Y.V., J.F.M., S.S., and M.A. designed the protocol; K.B., C.M., and M.C. undertook the experiments; C.M., K.B., F.S., N.S., and V.K. analyzed the data. K.B. and T.L.P.C. led the writing, and all authors read and commented on the final version.

APPENDIX S1. Both sets of long‐range PCR primers, elongation times, and annealing temperature values used.
APPENDIX S2. Raw output data of the three MinION runs undertaken in our study.
APPENDIX S3. Detailed list of materials and costs to implement the capture protocol.
APPENDIX S4. Bioinformatic codes used for analyses of MinION data.
APPENDIX S5. Mummer visualization of assembled long reads for Cenchrus americanus.
APPENDIX S6. Mummer visualization of assembled long reads for Phytelephas aequatorialis.
APPENDIX S7. Mapping of Phytelephas aequatorialis–like simulated reads (using LAST software [http://last.cbrc.jp/]) to the reference plastome NC_029957.1.

Species	Country of origin	Voucher/accession no.	DNA quality	Type of library preparation	Probes used	No. of passed reads	Reference	% reads mapped to reference	Average coverage depth	Standard deviation	% mismatch	% reference uncovered	% reference covered at 10×	% reference covered at 50×
Oryza sativa	Japan	(no voucher) NIP – IRD Montpellier	Band, >20 kbp	Chloroplast enrichment	Oryza sativa	17,129	KT289404.1	70.80	362.69	0.36	11.80	0.000	100.000	99.996
Oryza sativa	Japan	(no voucher) NIP – IRD Montpellier	Band, >20 kbp	Shotgun	None	151,234	KT289404.1	13.32	34.15	0.25	0.20	0.000	99.898	4.146
Oryza glaberrima	Senegal	(no voucher) CG14 – IRD Montpellier	Band, >20 kbp	Chloroplast enrichment	Oryza sativa	81,361	KM088021.1	98.23	2033.68	0.43	9.60	0.019	99.934	99.295
Oryza glaberrima	Senegal	(no voucher) CG14 – IRD Montpellier	Band, >20 kbp	Shotgun	None	1,201,104	KM088021.1	7.94	99.56	0.44	1.80	0.034	99.546	88.834
Pennisetum glaucum	Senegal	(no voucher) PE01455 – IRD Montpellier	Band, >20 kbp	Chloroplast enrichment	Oryza sativa	105,757	KJ490012.1	97.03	3318.44	0.39	10.20	0.000	99.992	99.981
Pennisetum glaucum	Senegal	(no voucher) PE01455 – IRD Montpellier	Band, >20 kbp	Shotgun	None	525,428	KJ490012.1	0.62	3.19	0.70	2.20	10.150	0.950	0.000
Digitaria exilis	Mali	(no voucher) CM05784 – IRD Montpellier	Band, >20 kbp	Chloroplast enrichment	Oryza sativa	141,248	NC_024176.1	94.42	3311.87	0.29	11.40	0.000	100.000	99.700
Digitaria exilis	Mali	(no voucher) CM05784 – IRD Montpellier	Band, >20 kbp	Shotgun	None	8,665,102	NC_024176.1	6.85	593.29	0.33	0.50	0.045	99.846	99.462
Podococcus acaulis	Gabon	Couvreur TLP 556 (WAG)	Band, 15–20 kbp	Chloroplast enrichment	Podococcus barteri	249,416	NC_027276.1	15.72	497.68	0.47	10.20	0.454	96.163	93.096
Podococcus acaulis	Gabon	Couvreur TLP 556 (WAG)	Band, 15–20 kbp	Shotgun	None	308,230	NC_027276.1	0.63	1.14	2.63	4.60	56.858	0.597	0.085
Raphia textilis	Angola	Lautenschläger 1086 (K)	Slight degradation, 5–16 kbp	Chloroplast enrichment	Podococcus barteri	83,832	NC_020365.1	87.49	893.98	0.47	12.00	0.352	96.068	91.972
Raphia textilis	Angola	Lautenschläger 1086 (K)	Slight degradation, 5–16 kbp	Shotgun	None	57,007,740	NC_020365.1	0.92	381.41	3.77	6.20	0.200	99.018	98.258
Phytelephas aequatorialis	Ecuador	Couvreur 1191 (QCA) – TAGUA F1	Smear, 1–10 kbp fragments	Chloroplast enrichment	Podococcus barteri	202,924	NC_029957.1	79.04	1713.58	0.53	10.30	0.004	99.522	96.383
Phytelephas aequatorialis	Ecuador	Couvreur 1191 (QCA) – TAGUA F1	Smear, 1–10 kbp fragments	Shotgun	None	699,918	NC_029957.1	3.65	22.68	1.58	3.50	0.000	95.083	1.380
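
As a worked example of how the capture and shotgun rows above can be compared, the following Python sketch computes a fold enrichment per species. It assumes that fold enrichment is the ratio of the percentage of reads mapped to the reference in the enriched library to that in the shotgun library; this definition is an assumption made for illustration and may differ from the authors' exact calculation of the X‐fold value. The function name is hypothetical, and the input percentages are copied from the table.

```python
def fold_enrichment(pct_mapped_capture, pct_mapped_shotgun):
    """Ratio of on-target read percentages, capture vs. shotgun
    (assumed definition, for illustration only)."""
    if pct_mapped_shotgun <= 0:
        raise ValueError("shotgun percentage must be positive")
    return pct_mapped_capture / pct_mapped_shotgun


# Percentage of reads mapped to the reference, copied from the table above:
# (chloroplast enrichment, shotgun)
mapped = {
    "Oryza sativa": (70.80, 13.32),
    "Pennisetum glaucum": (97.03, 0.62),
    "Podococcus acaulis": (15.72, 0.63),
}

folds = {species: fold_enrichment(capture, shotgun)
         for species, (capture, shotgun) in mapped.items()}

for species, fold in folds.items():
    print(f"{species}: {fold:.1f}-fold")
```

Under this assumed definition, a library with few mapped reads after capture can still show a substantial fold enrichment when the shotgun baseline is very low, as for Podococcus acaulis.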
