Literature DB >> 32244605

The Treasure Vault Can be Opened: Large-Scale Genome Skimming Works Well Using Herbarium and Silica Gel Dried Material.

Inger Greve Alsos¹, Sebastien Lavergne², Marie Kristine Føreid Merkel¹, Marti Boleda², Youri Lammers¹, Adriana Alberti³, Charles Pouchon², France Denoeud³, Iva Pitelkova¹, Mihai Pușcaș⁴, Cristina Roquet^2,5, Bogdan-Iuliu Hurdu⁶, Wilfried Thuiller², Niklaus E Zimmermann⁷, Peter M Hollingsworth⁸, Eric Coissac².

Abstract

Genome skimming has the potential for generating large data sets for DNA barcoding and wider biodiversity genomic studies, particularly via the assembly and annotation of full chloroplast (cpDNA) and nuclear ribosomal DNA (nrDNA) sequences. We compare the success of genome skims of 2051 herbarium specimens from Norway/Polar regions with 4604 freshly collected, silica gel dried specimens mainly from the European Alps and the Carpathians. Overall, we were able to assemble the full chloroplast genome for 67% of the samples and the full nrDNA cluster for 86%. Average insert length, cover and full cpDNA and rDNA assembly were considerably higher for silica gel dried than herbarium-preserved material. However, complete plastid genomes were still assembled for 54% of herbarium samples compared to 70% of silica dried samples. Moreover, there was comparable recovery of coding genes from both tissue sources (121 for silica gel dried and 118 for herbarium material) and only minor differences in assembly success of standard barcodes between silica dried (89% ITS2, 96% matK and rbcL) and herbarium material (87% ITS2, 98% matK and rbcL). The success rate was > 90% for all three markers in 1034 of 1036 genera in 160 families, and only Boraginaceae worked poorly, with 7 genera failing. Our study shows that large-scale genome skims are feasible and work well across most of the land plant families and genera we tested, independently of material type. It is therefore an efficient method for increasing the availability of plant biodiversity genomic data to support a multitude of downstream applications.

Entities: CellLine Chemical Disease Gene Species

Keywords: ITS; alpine; chloroplast DNA; environmental DNA; matK; nuclear ribosomal DNA; phylogenomic; plant DNA barcode; polar; rbcL

Year: 2020 PMID： 32244605 PMCID： PMC7238428 DOI： 10.3390/plants9040432

Source DB: PubMed Journal: Plants (Basel) ISSN： 2223-7747

1. Introduction

Genetic and genomic data are of critical importance for many applications, including species delimitation [1,2,3], studies on evolution and phylogenies [4,5,6], biodiversity assessments and conservation [7,8], reconstructions of past plant communities [9,10,11], or for more applied tasks such as forensics [12,13], pollination and food web studies [14,15,16] and monitoring of invasive species [17]. While many of these tasks can be undertaken by sequencing plastid or rDNA amplicons [1,2,18,19], increasing emphasis has been given to the potential of using genomic data for DNA barcoding and wider phylogenomic studies [4,20,21,22,23,24]. One key approach for gathering large scale genomic data from plants is genome skimming which consists of shallow pass shotgun sequencing [23,25,26,27]. The major advantage of genome skimming is the large amount of genetic information it provides. Genome skims notably allow for simultaneous assembly of both nuclear ribosomal and plastid DNA. Thus, a single analysis may provide complete plastid and ribosomal assemblies including all the plastid and nuclear ribosomal markers that have been used in plant DNA barcoding (e.g., the plastid genes matK and rbcL [1], plastid spacers/introns such as trnH-psbA and trnL, as well as the nuclear ribosomal regions ITS1 and ITS2, see also further discussion of plant barcodes [2,18,20]. The use of the complete chloroplast genome as a standard barcode has been repeatedly suggested [23,28,29,30] because of its capacity to increase the resolution at lower taxonomic levels in plants [31]. It is also a useful information source for deeper level phylogenetic studies [4]. Most chloroplast genomes are 110–160 kbp, a size that, on the one hand, provides much more information than a few loci and, on the other hand, allows the chloroplast genome to more easily be sequenced and assembled than the much larger nuclear genome. Moreover, when genome skimming approaches are used, the problem of nonuniversal primer sites that have been a limitation for several of the most use markers as matK, ITS1 and ITS2 [1,2,20], is avoided. However, the structure and complexity of the chloroplast varies [32], and especially taxa with chloroplast genomes harbouring many repeats are, according to our experience, challenging to assemble and therefore to annotate. Also, in some genera, species may not be distinguishable by chloroplasts due to recent alloploid origins, chloroplast sharing, or hybrid speciation [2]. Nuclear ribosomal DNA is a good complement to the chloroplast genome, as it includes the frequently used and rapidly evolving markers ITS1 and ITS2, as well as the more conserved 18S, 5.8S and 28S [33,34]. Generating large scale data-sets involving thousands of samples is a major effort, even with standard amplicon sequencing (e.g., building DNA barcoding reference libraries for regional floras [35,36]). There is thus considerable interest in developing approaches to increase the number of loci recovered from plant samples using a method that is scalable over multiple individuals of multiple species, while remaining tractable and manageable at a scale of many thousands of samples. Two recent studies that have tackled this using genome skimming and have generated large-scale genomic data from plants, showing the potential to extend sampling coverage to the scale of regional floras, the first from China (n = 1659) [4], the second from Australia (n = 672) [37]. Further studies are required to refine protocols and assess which approaches result in efficient and cost-effective recovery of data. Of particular importance, is the development and testing of informatics pipelines across diverse sample sets, and developing robust laboratory protocols that cope with the inevitable heterogeneity of tissue type and quality that is found in large scale studies. A very important, but potentially challenging source of tissue for large scale studies are the plant collections housed in the world’s herbaria. They contain all described species of multicellular plants worldwide including their type specimens, as well as both species that are extinct or not yet described [38,39]. They represent several hundred years of global efforts in collecting, describing and identifying plants in both easily accessible and more remote areas [26,40]. Using herbarium collections for large scale genome skimming thus offers the opportunity to open the ‘treasure vault’ that these specimens represent [41,42]. However, the quality and quantity of DNA found in herbarium specimens depends on conditions during collection and storage, which is, in general, lower than for freshly collected plant material followed by immediate drying in silica gel or freezing [43]. Low quantity and quality of DNA from the outset can affect all downward steps such as sequencing success, assembly and annotation, and may therefore affect the overall success of a large scale project. However, genome skimming methods have improved and several recent studies have shown that it is possible to extract sufficient quality and quantity of herbarium material for retrieval of partial or full complete plastid genomes [41,44,45,46]. In this study we focus on the practicalities of large-scale genome skimming. We share experience gained from two large-scale projects involving several thousands of species to guide future deployment of genome skim sequencing to understand plant biodiversity. The first project (PhyloAlps including PhyloCarpates) is focused on the European Alps and the Carpathians, and is mainly based on freshly collected leaf material dried in silica gel. The second (PhyloNorway), is mainly based on herbarium material from Norway and the Arctic region. Except for a modification in the extraction protocol for the herbarium material, these two projects use the same methods and therefore also allow for herbarium material to be evaluated as a cost-efficient source for large scale genome skimming. Specifically, we evaluate (1) the quality of the DNA recovered, and the success of genome skimming of herbarium and of silica gel dried material, (2) the recovery of standard plant barcodes from the genome skim data, and (3) the effect of sample age and time of growing season on assembling the full chloroplast from herbarium material.

2. Results

2.1. The Total Dataset

Overall, 6655 specimens of 5575 taxa (species, subspecies and a few hybrids) belonging to 161 families were sequenced (Table 1). These consisted of 141 families of angiosperms (6469 specimens, 5444 taxa, 997 genera), 4 families of gymnosperms (41 specimens, 31 taxa, 10 genera), and 16 families of ferns (Polypodiopsida: 145 specimens, 100 taxa, 30 genera) (Supplementary Appendix 1). Of these, 4604 were based on fresh leaf tissue collected for the PhyloAlps (4280) and the PhyloCarpates (324) projects and, 2051 were taken from herbarium material sampled for the PhyloNorway project (Table 1). The majority of the specimens were collected in the Alps and Norway, although other arctic and alpine areas were also sampled (Figure 1, Supplementary Appendix 1).

Table 1

The data analyzed for low coverage genome skims. The table lists the number of specimens sampled and analyzed overall (All) and for the two projects, PhyloAlp (including PhyloCarpates) and PhyloNorway. The three last columns list the number of specimens in the species-rich families that contain at least 20 taxa across the studied region. Number of complete genomes, average cover, and average library insert size (base pairs) is also given.

	All	PhyloAlps + Carp	Phylo Norway	All (+ 20 fam)	PhyloAlps + Carp (+ 20 fam)	PhyloNorway (+ 20 fam)
Specimens	6655	4604	2051	5893	4057	1836
Libraries	6817	4726	2091	6018	4147	1871
Families	161	158	112	43	43	43
Genera	1037	922	576	804	705	461
Taxa	5575	4437	1899	4957	3914	1689
Complete genome cpDNA	4439	3303	1136	3944	2922	1022
Average sequencing depth cpDNA	278	318	187	265	300	188
Complete nrDNA cluster	5748	4021	1727	5092	3543	1549
Average sequencing depth nrDNA	603	674	444	579	638	450
Average library insert size	316	346	249	318	350	249

Figure 1

Collection sites for the projects PhyloAlps (blue), PhyloCarpates (yellow) and PhyloNorway (purple).

For the chloroplast genome, the average sequencing depth (the average number of reads representing a given nucleotide; also referred to as read depth or average cover) of silica gel dried material was 1.7 times higher than that of herbarium material, and this difference was highly significant (Mann-Whitney p = 1.8 × 10−88). Similarly, the average sequencing depth of silica gel dried material was 1.5 times higher than that of herbarium material for nrDNA cluster (Mann-Whitney p = 3.4 × 10−5) (Figure 2). Also, the average library insert size (the length of the DNA fragment sequenced) was on average 100 bp longer for silica dried material compared to herbarium material (Figure 3). The effect of insert size on assembly success was significant in both the chloroplast (Mann-Whitney p = 1.7 × 10−31) and the nrDNA cluster (Mann-Whitney p = 5.4 × 10−7, Table 1, Figure 3).

Figure 2

Chloroplast and nuclear ribosomal sequencing depth (the average number of reads representing a given nucleotide) of the total dataset of 4604 freshly collected and silica gel dried material from the Alps and the Carpathians (“Silica gel”) and 2051 herbarium specimens from Norway and polar regions (“Herbarium”). (a) Effect of preservation methods on sequencing depth. (b) Sequencing depth in relation to complete assembly success for herbarium and silica gel dried material combined. Note that the y-axis is on a logarithmic scale.

Figure 3

Library insert size (the length of the DNA fragment sequenced) for (a) herbarium and silica gel dried material and (b) success of complete assembly of chloroplast and nrDNA for herbarium and silica gel dried material combined.

The success rate of assembling the full chloroplast was considerably higher for silica gel dried 70%) than for herbarium material (54%) (Mann-Whitney p = 1.73 × 10−13, Figure 4). This had, however, little effect on the number of genes assembled as we were able to assemble 121 and 118 chloroplast genes for silica gel dried and herbarium material respectively, using the global assembler. This included on average 77 and 77 Coding Sequences (CDS), 6.6 and 7.1 rDNA and 35.1 and 36.3 tRNA genes for silica gel dried and for herbarium material, respectively. The success rate of assembling the full nrDNA cluster was only slightly higher for silica gel dried (85%) than for herbarium material (83%) (Mann-Whitney p = 3.36 × 10−7, Figure 4).

Figure 4

Sequencing success of the complete chloroplast and nrDNA clusters, the standard chloroplast barcodes rbcL and matK, the optional nuclear ribosomal barcode ITS2, and all three barcodes for freshly collected silica dried (n = 4604) and herbarium material (n = 2051).

For the global assembly method, the success rate for recovering matK was higher for silica gel dried (77%) than for herbarium (68%) materials (Fisher p = 3.0 × 10−14). This rate increased for both materials when we applied the targeted marker assembly leading to similar success rates of 96% and 98%, respectively, for silica gel dried and herbarium material (Fisher p = 3.7 × 10−2). Similarly, the initial success rate for rbcL was higher for silica gel dried (78%) than for herbarium (70%) material (Fisher p = 1.5 × 10−11), rising to 96% and 98%, respectively, when using the targeted assembler (Fisher p = 1.05 × 10−2) (Figure 4). For ITS2, only the global assembly was used and the success rate was marginally higher for silica gel dried (89%) than for herbarium (87%) material (Fisher p = 5.4 × 10−3). All three markers were successfully obtained in 86% for herbarium and 88% for silica gel dried material, and at least one marker was successfully sequenced in all families (n = 161) and all genera (n = 1047) except for one genus of Alismataceae (Luronium, n = 1), 7 genera of Boraginaceae (see below), and one of Onagraceae (Clarkia, n = 1. Supplementary Appendix 1).

2.2. Results For Recovery of Standard Barcode Loci From Large Families

There were 43 families that contained a minimum of 20 taxa each across the projects, thus contributing the majority of specimens across the total dataset (n = 5893 samples: 4057 based on silica gel and 1836 on herbarium material). Of these, all families had 10 or more taxa in the PhyloAlps dataset and 32 families had 10 or more taxa in the PhyloNorway dataset. The following families had less than 10 taxa in PhyloNorway: Amarylliaceae (7), Cistaceae (3), Crassulaceae (9), Euphorbiaceae (9), Hyacintaceae (6), Hypericaceae (6), Iridaceae (3), Liliaceae (6), Linaceae (2), Scrophulariaceae (2) and Solanaceae (8): which all had 100% success for rbcL and matK after targeted assembly, except for Amarylliaceae (6 out of 7 specimens for both marker) and Crassualaceae (11 out of 12 specimens for both markers). For the global assembly of the standard barcodes rbcL, matK and ITS2, the success rate for large families was higher for silica gel dried material than for herbarium material (dark colours in Figure 5). This in part reflects the issue that the assembly of the full chloroplast was problematic in some species-rich families such as Campanulaceae, Cyperaceae and Ericaceae. This was mainly due to the global assembler having difficulties dealing with the highly repeated structure of the chloroplast in these families. However, following the targeted assembly of rbcL and matK, these differences almost disappeared, and all families except for Amarylliaceae and Boraginaceae had higher than 90% success rate for both herbarium and silica gel dried material. There were close similarities in success rate between rbcL and matK, whereas the success rate of ITS2 sometimes showed deviating patterns (Figure 5).

Figure 5

Sequencing success for 43 families with a minimum of 20 taxa available across the studied specimens based on freshly collected silica gel dried (Alps and Carpathians) and herbarium (Norway) material. (a) matK, (b) rbcL and (c) ITS2.

For Boraginaceae, the success rate was reasonably high for herbarium material (83% for matK and 75% for rbcL, n = 12) whereas it was only (8% for both markers) for silica gel dried material (n = 156). Similarly, the ITS2 success rate was higher for herbarium (50%) than for silica gel dried material (8%). It is important to note that there were partly different genera represented for Boraginaceae in the silica and herbarium samples. The assembly failed for the following Boraginaceae genera: Alkanna (n = 2), Asperugo (n = 2), Borago (n = 2), Eritrichium (n = 4), Lappula (n = 4), Lithodora (n = 2), Neatostema (n = 2). Among these, only Eritrichium was represented in the herbarium material (Supplementary Appendix 1). For some genera that were represented in both herbarium and silica gel dried material, the relative success rate was higher for herbarium than silica, e.g., Cynoglossum (1 of 2 herbarium, 0 of 11 silica gel) and Myosotis (3 of 3 herbarium, 0 of 31 silica gel).

2.3. Effect of Sample Age and Time of Season

We had a documented exact age for 2022 herbarium specimens. The majority were less than 20 years old (1166 samples) and only 165 samples were older than 50 years. The maximum age was 153 years old and the full chloroplast of this specimen was assembled. There was no effect of age on chloroplast assembly success (Mann-Whitney p = 0.259, Figure 6). Similarly, there was no effect of time of season on chloroplast assembly success (Mann-Whitney p = 0.367).

Figure 6

Sequencing success for herbarium material in relation to age.

3. Discussion

The overall high success rate clearly shows that large-scale genome skims are now feasible for plant biodiversity genomic studies. This work was undertaken in relatively small laboratories with 1–2 technicians, indicating that large scale genome skimming is feasible even for small laboratories. This clearly opens many future research directions and capacities for many laboratories.

3.1. Effect of Starting Material

While the success rate for assembling the full chloroplast and nrDNA cluster was clearly higher in silica gel dried than herbarium material, this mainly affected the global assembly and had little effect on the ability to retrieve large suites of coding genes. Thus, if retrieving targeted loci is the main goal, herbarium material has about as equal a success rate as does silica gel dried material. This shows that the treasure vault may be opened [40,41], which is a great advantage compared to collecting fresh material for regional barcode projects, because field campaigns are labour- and cost-intensive and require subsequent expert identification [47]. For studies on chloroplast structure, the increased coverage and insert length seen in silica gel dried material likely increased the assembly success. The insert length of silica gel dried material was limited by the size of DNA fragments by sonication (350 bp), whereas the insert length for herbarium material was shorter (250) probably due to degradation during drying and/or storage. There is potential for increasing the success of herbarium material by using an improved extraction protocol [48] or library preparation protocol [49,50], but this is not likely to improve the insert length. For nrDNA clusters, where the average cover was higher due to the higher number of copies in the cell, the assembly success was also higher. This indicates that the sequencing depth was more important than the insert length, and increasing the sequencing effort of the problematic sequencing libraries (we usually target 6 million of reads per library) will automatically increase the sequencing depth of chloroplast, and will most likely resolve the problem related to some herbarium specimens. A sequencing depth of about 90x or higher is sufficient for most families (Figure 2), whereas families with complex chloroplast structure, such as Ericaceae, Cyperaceae and Campanulaceae, may need either deeper sequencing or may require alternative assembly methods.

3.2. Success Rate Compared to Amplicon-based DNA Barcoding

Our success rate for chloroplast loci (96–98%) and ITS2 (87–89%) is similar to what has been recorded for 672 herbarium samples from Australia [37]. In contrast, success rate based on large scale amplicon-based sequencing is generally lower. For example, the success rate for the first regional flora (Wales) was 57% for matK (n = 2419 specimens) and 77% for rbcL (n = 3304) [36]. Similarly, the success rate for 3176 specimens of herbarium material from Finland was 79% and 55% for rbcL and matK, respectively, and for both loci jointly it was only 53%. In a more recent and considerably larger study of the flora of Canada, overall success rates were 35% for matK (n = 9412 specimens), 84% for rbcL (n = 20816) and 42% for ITS2 (n = 13233). In the latter study, a combination of two markers was successful in 51–71% of specimens, whereas all three markers were only successful in 48% of the specimens (n = 2442). In comparison, on average, approximately 120 genes were assembled in our study. The cost of genome skims at the start of our project was five times that of two standard barcodes. There is a breakpoint in effort and costs, where it pays off to do genome skimming rather than adding more single barcodes. Exactly where this breakpoint is depends on individual laboratories’ facilities, methods used, the complexity of the flora, the purpose of the reference library, as well as the success rate of the different approaches. However, no single barcode region can be used to identify all plant species across genera and families, and the number and type of loci needed for the individual large-scale reference library project is usually unknown at the start of a project. Thus, especially when several loci are targeted in a study, genome skimming may be more cost efficient compared to amplicon sequencing. This is particularly true if the labour costs involved in repeated attempts to sequence difficult regions such as matK are factored into the calculations.

3.3. Success Rate Among Families

As with amplicon sequencing, we also observed some differences in success rates among families. Boraginaceae had by far the lowest success rate in our study. This family also had a low success rate for both silica gel dried and herbarium material from Canada [35]. Contrary to expectations, our herbarium material worked reasonably well in this family, with 67% success. Similarly, Nevill et al. [37] had high success for genome skims of herbarium material. Kuzmina et al. [35] suggested that low success rates in Boraginaceae could be related to the fact that most genera of this family are synthesising and storing pyrrolizidine alkaloids, compounds that may cause rapid and permanent DNA damage [51]. They further suggested that these compounds may preclude DNA preservation during the early phase of drying, immediately after collection. However, if this was the case, we would have expected better results for silica gel dried than herbarium material. The success rate of both the current study and that of Nevill et al. (2020) suggest that these compounds might inhibit DNA extraction, and that this may be broken down over time in herbarium material. Aspleniaceae worked well for silica gel dried but less so for herbarium material. Sanger sequencing is often also problematic for ferns [35,52]. In fact, eight and eleven fern families failed for matK and ITS2, respectively, in the Sanger sequencing based study of the flora of Canada [35]. In contrast, our genome skim worked for all 161 families that we tested. Also, the major problem with ferns for our data may be bioinformatic rather than sequencing success, because their sequences diverge from the Angiosperms which are mainly used in the assembling and annotating tools.

3.4. Effect of Age and Time of Season

We found no effect of age in up to 153-year-old herbarium material. Similarly, Nevill et al. [37] found no effect of age on genome skims of up to 80-year-old material. Also for capture probes, the target enrichment efficiency declines with age [53]. This is in contrast to what has been observed for Sanger sequencing [35,54]. Similarly, while we in an earlier Sanger sequencing study observed that specimens collected later in the growing season had poorer success rate (boldsystems.org, project NNOR, n = 1805), time of season caused no problem for genome skimming. Thus, genome skimming seems less effected by template quality than alternative methods.

3.5. Utilisation of the Genome Skimming Data

Our preliminarily targeted assembly shows that most genes can be assembled with the skimming approach used here. This allows us to analyse gene dropout [44] and phylogenies [4]. Also, the taxonomic resolution of difficult taxa may be enhanced with this approach [6,13,55,56] and even within-species diversity can be studied using this protocol [31]. Our genome skims allow for the design of purpose primers and/or capture probes [57,58,59]. For example, Li et al. [20] suggest a two-step barcode process, where the plastid genome is used in a second step for designing within-group markers. As the plastid genome of 8–10 taxa may be sufficient to design taxon-specific barcodes [20], even the skimming of only a small proportion of the global flora may greatly improve the discrimination power of barcodes. Thus, our dataset of 43 families with a minimum of 20 taxa, may already greatly advance the possibilities to design primers or capture probes. For studies of environmental DNA and ancient DNA, the genome skims allows for direct comparison with shotgun sequence data of sediments. The first 256 genome skims of the PhyloNorway dataset were used in an ancient DNA study from southern Sweden, and greatly enhanced the ability to taxonomically assign the sequences [60]. Furthermore, it may enable studies of population genomics based on herbarium material [61] or sediment samples, as has been done for algae [62] and humans [63]. In the near future, as the number of fully assembled genomes steadily increase, the ability to use these for assembling genome skim data will increase. This may allow for increased assembly and utilisation of the nuclear and mitochondrial DNA [64] recovered from the genome skims, increasing the power and range of applications for the data.

4. Materials and Methods

4.1. Sampling and DNA Extraction

For PhyloAlps, we collected most of the fresh leaves during the summer months in 2009, 2010, and 2011, with some additional materials collected in subsequent years to fill sampling gaps. Most PhyloCarpates samples were collected during the 2013, 2015 and 2016 fieldwork seasons, focusing on Carpathian endemics and regionally distributed Carpathian–Balkan taxa. For PhyloNorway, we sampled leaf material from herbarium specimens at Tromsø Museum (herbarium TROM, 220,000 specimens). The only treatment used for minimising insects in this herbarium is freezing at −30 °C for 4 days. The specimens in the herbarium are stored at a temperature of around 15 °C in woody cabinets with around 50% humidity. Specimens were selected using 5 criteria: (1) The species is native in boreal and/or arctic regions; (2) The specimen is healthy—every specimen was inspected under a dissecting microscope to exclude specimens with e.g., visible fungal infections.; (3) Collection date for the specimen is as early in the growing season as possible; (4) Sampling of specimens collected in the field after year 2000 was prioritised, where they met the other criteria; (5) The sample has good documentation and reliable taxonomic identification. The primary aim was to cover all species of Norway and polar regions, but common invasive plant species in this region were also included. DNA extractions were performed using Macherey-Nagel Nucleospin 96 Plant II kit with the following specifications and modifications. A minimum of 20 mg dried leaf material was collected from each specimen; a few specimens had less material due to their small size. Two tungsten carbide beads (3 mm diameter) were added to each sample before they were inserted into the TissueLyser for 4 × 1 minutes at 25 Hz. For each batch of 96 samples, a lysis buffer consisting of 50 mL Buffer PL1 and 1 mL RNase A were prepared, and 500 µL lysis buffer was dispensed to each sample. For silica dried samples (PhyloAlps and PhyloCarpates), a brief spin was performed at this step; this was skipped for herbarium material (PhyloNorway). Incubation time at 65 °C was increased to overnight for all samples, followed by a centrifugation step, silica gel dried material for 10 min 16,000× g and herbarium material for 15 min at 13,200 rpm. A filtration step was performed after step 3 in the original protocol, loading 400 μL cell lysate into NucleoSpin Flash Filter Plate stacked on top of a square-well block, and then centrifuged for 2 min at 2500× g for silica dried and 4600 rpm for herbarium material. Thereafter, 450 μL Binding Buffer PC was added to the square-well block. For step 6 in the original protocol (DNA binding to silica membrane), centrifugation was increased to 20 min at 4600 rpm for herbarium material. All wash steps for herbarium material were centrifuged at 4600 rpm. In step 7 (wash and dry silica membrane), all wash steps for herbarium material were centrifuged at 4600 rpm. For the third wash, we first centrifuged for 2 min before the square-well block was emptied and re-centrifuged without seal for 5 min, and then dried at room temperature for 5 min instead of the original 10 min centrifugation. For step 8 (DNA elution), we used 150 μL preheated Buffer PE and the flow-through was re-applied onto the filter to increase DNA yielding for herbarium material. See full extraction protocol in Supplementary Appendix 2.

4.2. Library Preparation and Sequencing

The library preparation protocol applied was chosen on the basis of the DNA extraction yields. When available, 250 ng of genomic DNA were sonicated using the E210 Covaris instrument (Covaris, Inc., USA). The NEBNext DNA Modules Products (New England Biolabs, MA, USA) were used for end-repair, 3’-adenylation and ligation of NextFlex DNA barcodes (Bio Scientific Corporation). After two consecutive 1x AMPure XP clean ups, the ligated products were amplified by 12 cycles PCR using Kapa Hifi Hotstart NGS library Amplification kit (Kapa Biosystems, Wilmington, MA), followed by a 0.6x AMPure XP purification. When the extraction yielded low DNA quantities, 10–50 ng of genomic DNA were sonicated. Fragments were end-repaired, 3’-adenylated and NEXTflex DNA barcoded adapters were added by using NEBNext Ultra II DNA Library prep kit for Illumina (New England Biolabs). After two consecutive 1x AMPure clean ups, the ligated products were PCR-amplified with NEBNext Ultra II Q5 Master Mix included in the kit, followed by 0.8x AMPure XP purification. All libraries were subjected to size profile analyses conducted by Agilent 2100 Bioanalyzer (Agilent Technologies, USA) and qPCR quantification (MxPro, Agilent Technologies, USA), then sequenced using 101 base-length read chemistry in a paired-end flow cell on the Illumina HiSeq2000 sequencer (Illumina, USA). For 155 libraries, the same extract was sequencing twice either as a quality control or because the first results were poor. An Illumina filter was applied to remove the least reliable data from the analyses. The raw data were filtered to remove any clusters with too much intensity corresponding to bases other than the called base. Adapters and primers were removed from the whole read. Nucleotides exhibiting a low Illumina sequence quality score (below 20) were trimmed from both extremities of the read. Sequences between the second unknown nucleotide (N) and the end of the read were also removed. Reads shorter than 30 nucleotides after trimming were discarded. Finally, the reads and their mates that mapped onto run quality control sequences (PhiX genome) were removed. These trimming steps were achieved using internal software based on the FastX package [65].

4.3. Global Assembly and Annotation

For each sample, the complete sequence of the nrDNA and of the chloroplast genome were first assembled using the Organelle Assembler [66], which is a De Bruijn graph based assembler specifically developed for the PhyloAlps and PhyloNorway projects and designed for the assembly of high copy genetic elements such as organelle genomes and nrDNA from genome skimming datasets. The sequence data was indexed with the “oa index” using a variable length cut-off that retains 90% of the input sequences. The chloroplast protein coding genes and nrDNA from Arabidopsis were used to find the assembly seeds in the index sequence with the “oa seed” command. For both the chloroplast and nrDNA assemblies, the assembly graphs were constructed with “oa buildgraph” allowing up to 30 iterations for filling assembly gaps. The final assembled contigs were produced with “oa unfold” and “oa unfoldrdna” for the chloroplast and nrDNA assemblies respectively. A circular contig was attempted to be generated from the chloroplast assembly graph. However, if none could be obtained, the separate contigs were produced instead. The assembled sequences were annotated with the ORG.Annot pipeline [67].

4.4. Targeted Assembly for matK and rbcL

As some chloroplast assemblies did not succeed for all specimens, we used the OrthoSkim pipeline [68] to retrieve the chloroplast genes for samples lacking complete assemblies (n = 1815). This pipeline consists of assembling all sequencing reads into genomic contigs and extracting all targeted genes from these contigs by mapping to close reference. For this, we formatted a database of chloroplast coding genes from our annotations by keeping all protein sequences. For each sample, assembly was performed in OrthoSkim using the SPAdes assembler [69] with the “SPAdes_assembly” mode. Afterwards, OrthoSkim selected the closest taxon for each gene of the database in the NCBI taxonomy and contigs were first mapped to this closed reference to extract matching contigs from the contigs set with a diamond [70]. Selected contigs were then mapped using exonerate [71] in order to identify the exonic regions for each gene, which were next extracted. This was implemented using the “extraction” mode with the “chloroplast_CDS” target.

4.5. Quality Control

For PhyloNorway, the chloroplast rbcL and nuclear ribosomal ITS2 barcode regions were extracted from the annotated database for quality controls. For each marker the data was uploaded to BOLD systems [72] and analyzed via the Taxon ID Tree option (visualization via a simple NJ tree). Samples that were misplaced in the tree were manually checked for misidentifications based on the uploaded herbarium material, corrected where possible or removed from the final dataset if the final identification was unclear. A total of 87 samples were corrected and 8 libraries were removed in this step. Additionally, the in-house quality control process that was applied to the reads that passed the Illumina quality filters included a taxonomic assignment step. For each dataset, taxonomic assignment was performed on a random sample of 20,000 reads using MegaBLAST [73], Kraken [74], or Centrifuge [75]. This allowed us to identify 35 additional PhyloNorway samples that likely corresponded to identification/sampling errors, as well as 113 PhyloAlps samples. These samples were discarded from subsequent analyses. We also tagged 42 PhyloNorway samples that were contaminated by other DNA from the environment (bacteria, fungi, birds, fish, human; contamination was 0.5–14% of total reads). These had lower success rates than the overall PhyloNorway dataset (Fisher p = 2.43 × 10−4), and we were only able to assemble the full chloroplast genome for 11 of these samples. These are kept in the overall dataset to give realistic statistics of success rate.

4.6. Statistical Analyses

All success rates are calculated based on libraries. To evaluate the significance of correlations between continuous variables (coverage, insert size, age) with assembly success or preservation methods, Wilcoxon rank-sum tests were used. To estimate the correlation between success rate and preservation method, Fisher’s exact test was used. All statistical analyses were done in R version 3.6 [76].

4.7. Data Availability

For PhyloNorway, the full dataset of matK, rbcL and ITS2 is available on BOLD [72]. A subset of 1535 samples has been included in ongoing work (Wang et al. in prep) and the raw reads and sequence assemble will be deposited at the European Nucleotide Archive [77]. The remaining data will be released after further quality control. Metadata for the majority of specimens are provided on PhyloAlps [78].

52 in total

1. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing.

Authors: Anton Bankevich; Sergey Nurk; Dmitry Antipov; Alexey A Gurevich; Mikhail Dvorkin; Alexander S Kulikov; Valery M Lesin; Sergey I Nikolenko; Son Pham; Andrey D Prjibelski; Alexey V Pyshkin; Alexander V Sirotkin; Nikolay Vyahhi; Glenn Tesler; Max A Alekseyev; Pavel A Pevzner
Journal: J Comput Biol Date: 2012-04-16 Impact factor: 1.479

Review 2. Biological collections and ecological/environmental research: a review, some observations and a look to the future.

Authors: Graham H Pyke; Paul R Ehrlich
Journal: Biol Rev Camb Philos Soc Date: 2009-11-24

3. Fifty thousand years of Arctic vegetation and megafaunal diet.

Authors: Eske Willerslev; John Davison; Mari Moora; Martin Zobel; Eric Coissac; Mary E Edwards; Eline D Lorenzen; Mette Vestergård; Galina Gussarova; James Haile; Joseph Craine; Ludovic Gielly; Sanne Boessenkool; Laura S Epp; Peter B Pearman; Rachid Cheddadi; David Murray; Kari Anne Bråthen; Nigel Yoccoz; Heather Binney; Corinne Cruaud; Patrick Wincker; Tomasz Goslar; Inger Greve Alsos; Eva Bellemain; Anne Krag Brysting; Reidar Elven; Jørn Henrik Sønstebø; Julian Murton; Andrei Sher; Morten Rasmussen; Regin Rønn; Tobias Mourier; Alan Cooper; Jeremy Austin; Per Möller; Duane Froese; Grant Zazula; François Pompanon; Delphine Rioux; Vincent Niderkorn; Alexei Tikhonov; Grigoriy Savvinov; Richard G Roberts; Ross D E MacPhee; M Thomas P Gilbert; Kurt H Kjær; Ludovic Orlando; Christian Brochmann; Pierre Taberlet
Journal: Nature Date: 2014-02-06 Impact factor: 49.962

Review 4. Telling plant species apart with DNA: from barcodes to genomes.

Authors: Peter M Hollingsworth; De-Zhu Li; Michelle van der Bank; Alex D Twyford
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2016-09-05 Impact factor: 6.237

5. Large scale genome skimming from herbarium material for accurate plant identification and phylogenomics.

Authors: Paul G Nevill; Xiao Zhong; Julian Tonti-Filippini; Margaret Byrne; Michael Hislop; Kevin Thiele; Stephen van Leeuwen; Laura M Boykin; Ian Small
Journal: Plant Methods Date: 2020-01-04 Impact factor: 4.993

6. Genomic treasure troves: complete genome sequencing of herbarium and insect museum specimens.

Authors: Martijn Staats; Roy H J Erkens; Bart van de Vossenberg; Jan J Wieringa; Ken Kraaijeveld; Benjamin Stielow; József Geml; James E Richardson; Freek T Bakker
Journal: PLoS One Date: 2013-07-29 Impact factor: 3.240

7. How to open the treasure chest? Optimising DNA extraction from herbarium specimens.

Authors: Tiina Särkinen; Martijn Staats; James E Richardson; Robyn S Cowan; Freek T Bakker
Journal: PLoS One Date: 2012-08-28 Impact factor: 3.240

8. Using herbaria to study global environmental change.

Authors: Patricia L M Lang; Franziska M Willems; J F Scheepens; Hernán A Burbano; Oliver Bossdorf
Journal: New Phytol Date: 2018-08-30 Impact factor: 10.151

9. Factors Affecting Targeted Sequencing of 353 Nuclear Genes From Herbarium Specimens Spanning the Diversity of Angiosperms.

Authors: Grace E Brewer; James J Clarkson; Olivier Maurin; Alexandre R Zuntini; Vanessa Barber; Sidonie Bellot; Nicola Biggs; Robyn S Cowan; Nina M J Davies; Steven Dodsworth; Sara L Edwards; Wolf L Eiserhardt; Niroshini Epitawalage; Sue Frisby; Aurélie Grall; Paul J Kersey; Lisa Pokorny; Ilia J Leitch; Félix Forest; William J Baker
Journal: Front Plant Sci Date: 2019-09-18 Impact factor: 5.753

10. The unexpected depths of genome-skimming data: A case study examining Goodeniaceae floral symmetry genes.

Authors: Brent A Berger; Jiahong Han; Emily B Sessa; Andrew G Gardner; Kelly A Shepherd; Vincent A Ricigliano; Rachel S Jabaily; Dianella G Howarth
Journal: Appl Plant Sci Date: 2017-10-20 Impact factor: 1.936

10 in total

1. Ancient DNA extraction methods for herbarium specimens: When is it worth the effort?

Authors: Pia Marinček; Natascha D Wagner; Salvatore Tomasello
Journal: Appl Plant Sci Date: 2022-06-15 Impact factor: 2.511

2. Increased resolution in the face of conflict: phylogenomics of the Neotropical bellflowers (Campanulaceae: Lobelioideae), a rapid plant radiation.

Authors: Laura P Lagomarsino; Lauren Frankel; Simon Uribe-Convers; Alexandre Antonelli; Nathan Muchhala
Journal: Ann Bot Date: 2022-05-12 Impact factor: 5.040

3. A Comprehensive Phylogenomic Platform for Exploring the Angiosperm Tree of Life.

Authors: William J Baker; Paul Bailey; Vanessa Barber; Abigail Barker; Sidonie Bellot; David Bishop; Laura R Botigué; Grace Brewer; Tom Carruthers; James J Clarkson; Jeffrey Cook; Robyn S Cowan; Steven Dodsworth; Niroshini Epitawalage; Elaine Françoso; Berta Gallego; Matthew G Johnson; Jan T Kim; Kevin Leempoel; Olivier Maurin; Catherine Mcginnie; Lisa Pokorny; Shyamali Roy; Malcolm Stone; Eduardo Toledo; Norman J Wickett; Alexandre R Zuntini; Wolf L Eiserhardt; Paul J Kersey; Ilia J Leitch; Félix Forest
Journal: Syst Biol Date: 2022-02-10 Impact factor: 15.683

4. Environmental palaeogenomic reconstruction of an Ice Age algal population.

Authors: Peter D Heintzman; Inger Greve Alsos; Youri Lammers
Journal: Commun Biol Date: 2021-02-16

5. Late Quaternary dynamics of Arctic biota from ancient environmental genomics.

Authors: Yucheng Wang; Mikkel Winther Pedersen; Inger Greve Alsos; Bianca De Sanctis; Fernando Racimo; Ana Prohaska; Eric Coissac; Hannah Lois Owens; Marie Kristine Føreid Merkel; Antonio Fernandez-Guerra; Alexandra Rouillard; Youri Lammers; Adriana Alberti; France Denoeud; Daniel Money; Anthony H Ruter; Hugh McColl; Nicolaj Krog Larsen; Anna A Cherezova; Mary E Edwards; Grigory B Fedorov; James Haile; Ludovic Orlando; Lasse Vinner; Thorfinn Sand Korneliussen; David W Beilman; Anders A Bjørk; Jialu Cao; Christoph Dockter; Julie Esdale; Galina Gusarova; Kristian K Kjeldsen; Jan Mangerud; Jeffrey T Rasic; Birgitte Skadhauge; John Inge Svendsen; Alexei Tikhonov; Patrick Wincker; Yingchun Xing; Yubin Zhang; Duane G Froese; Carsten Rahbek; David Nogues Bravo; Philip B Holden; Neil R Edwards; Richard Durbin; David J Meltzer; Kurt H Kjær; Per Möller; Eske Willerslev
Journal: Nature Date: 2021-10-20 Impact factor: 69.504

6. Chloroplast genome sequencing based on genome skimming for identification of Eriobotryae Folium.

Authors: Fang Li; Xuena Xie; Rong Huang; Enwei Tian; Chan Li; Zhi Chao
Journal: BMC Biotechnol Date: 2021-12-11 Impact factor: 2.563

7. Skimming for barcodes: rapid production of mitochondrial genome and nuclear ribosomal repeat reference markers through shallow shotgun sequencing.

Authors: Mykle L Hoban; Jonathan Whitney; Allen G Collins; Christopher Meyer; Katherine R Murphy; Abigail J Reft; Katherine E Bemis
Journal: PeerJ Date: 2022-08-05 Impact factor: 3.061

8. Postglacial species arrival and diversity buildup of northern ecosystems took millennia.

Authors: Inger Greve Alsos; Dilli Prasad Rijal; Dorothee Ehrich; Dirk Nikolaus Karger; Nigel Gilles Yoccoz; Peter D Heintzman; Antony G Brown; Youri Lammers; Loïc Pellissier; Torbjørn Alm; Kari Anne Bråthen; Eric Coissac; Marie Kristine Føreid Merkel; Adriana Alberti; France Denoeud; Jostein Bakke
Journal: Sci Adv Date: 2022-09-28 Impact factor: 14.957

9. Beyond DNA barcoding: The unrealized potential of genome skim data in sample identification.

Authors: Kristine Bohmann; Siavash Mirarab; Vineet Bafna; M Thomas P Gilbert
Journal: Mol Ecol Date: 2020-06-29 Impact factor: 6.185

10. Sedimentary ancient DNA shows terrestrial plant richness continuously increased over the Holocene in northern Fennoscandia.

Authors: Dilli P Rijal; Peter D Heintzman; Youri Lammers; Nigel G Yoccoz; Kelsey E Lorberau; Iva Pitelkova; Tomasz Goslar; Francisco J A Murguzur; J Sakari Salonen; Karin F Helmens; Jostein Bakke; Mary E Edwards; Torbjørn Alm; Kari Anne Bråthen; Antony G Brown; Inger G Alsos
Journal: Sci Adv Date: 2021-07-30 Impact factor: 14.136

10 in total