Kristine Bohmann, Siavash Mirarab, Vineet Bafna, M Thomas P Gilbert.
Abstract
Genetic tools are increasingly used to identify and discriminate between species. One key transition in this process was the recognition of the potential of the ca. 658 bp fragment of the organelle cytochrome c oxidase I (COI) as a barcode region, which revolutionized animal bioidentification and led, among other things, to the instigation of the Barcode of Life Database (BOLD), which currently contains barcodes from >7.9 million specimens. Following this discovery, suggestions for other organellar regions and markers, and the primers with which to amplify them, have been continuously proposed. Most recently, the field has taken the leap from PCR-based generation of DNA references into shotgun sequencing-based "genome skimming" alternatives, with the ultimate goal of assembling organellar reference genomes. Unfortunately, in genome skimming approaches, much of the nuclear genome (as much as 99% of the sequence data) is discarded, which is not only wasteful, but can also limit the power of discrimination at, or below, the species level. Here, we advocate that the full shotgun sequence data can be used to assign an identity (that we term for convenience its "DNA-mark") for both voucher and query samples, without requiring any computationally intensive pretreatment (e.g. assembly) of reads. We argue that if reference databases are populated with such "DNA-marks," it will enable future DNA-based taxonomic identification to complement, or even replace, PCR of barcodes with genome skimming, and we discuss how such methodology ultimately could enable identification to population, or even individual, level.
Keywords: DNA barcoding; DNA reference databases; K-mers; biodiversity; environmental DNA; next-generation sequencing
Year: 2020 PMID: 32542933 PMCID: PMC7496323 DOI: 10.1111/mec.15507
Source DB: PubMed Journal: Mol Ecol ISSN: 0962-1083 Impact factor: 6.185
FIGURE 1 Methods to assign a genetic identity to voucher and query samples. (a) Traditional approaches are based on PCR amplification of barcode loci. (b) Increasingly, genome skimming is used to bioinformatically mine (c) the barcode loci or whole organellar genomes from shotgun sequenced data. (d) We advocate that the remaining data could be used to assign a k‐mer profile to the specimen, (e) ultimately enhancing the resolution to which it can be identified.
FIGURE 2 Overview of the DNA‐mark pipeline. Computational steps are shown in blue boxes, and one example tool that can be used in each step is shown below each box. Each set of reads (whether representing the voucher or the query) first has to be preprocessed in several stages. First, reads are cleaned up to remove adapters, deduplicate reads and merge paired‐end reads. Then, extragenic reads need to be filtered out, typically by matching each read against a database of potential contaminants. The remaining reads need to be represented as k‐mers; the set of k‐mers needs to be hashed and sketched for efficient storage and fast processing. In addition, the coverage of the genome skim and properties of the underlying genome (e.g. its size and repeat structure) need to be estimated. Thus, the preprocessing (which needs to happen only once per sample) generates both the k‐mer set and the genomic parameters, which are sufficient for sample identification. To identify a new query sample, we first need to compute its distance to each genome skim in the reference set. The query can then be assigned to the reference with the smallest distance. Alternatively, the query can be placed on a reference phylogenetic tree (which can be computed from the genome skims or retrieved from any other source).
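The k‐mer, sketching and nearest‐reference steps described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' pipeline or any specific tool: it assumes a generic bottom‐s MinHash sketch and a Mash‐style transform from Jaccard index to distance, and it deliberately omits read cleaning, contaminant filtering, and the coverage and genome‐parameter estimation the caption calls for. The names `canonical_kmers`, `sketch`, `jaccard_estimate`, `distance` and `assign`, and the values of `K` and `SKETCH_SIZE`, are all hypothetical choices made for illustration.

```python
import hashlib
import math

# Assumed toy parameters; real genome-skim tools use much larger values.
K = 5             # k-mer length
SKETCH_SIZE = 64  # number of smallest hash values kept per sample

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(reads, k=K):
    """Return the set of canonical k-mers found in a list of reads.

    A canonical k-mer is the lexicographically smaller of a k-mer and its
    reverse complement, so both strands map to the same key.
    """
    kmers = set()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N etc.
                rev_comp = kmer.translate(_COMPLEMENT)[::-1]
                kmers.add(min(kmer, rev_comp))
    return kmers


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(kmers, size=SKETCH_SIZE):
    """Bottom-s MinHash sketch: the `size` smallest hash values, sorted."""
    return sorted(hash_kmer(km) for km in kmers)[:size]


def jaccard_estimate(sketch_a, sketch_b, size=SKETCH_SIZE):
    """Estimate the Jaccard index of two k-mer sets from their sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def distance(jaccard, k=K):
    """Mash-style distance from a Jaccard index (assumed transform)."""
    if jaccard <= 0.0:
        return 1.0
    d = -math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(0.0, min(1.0, d))


def assign(query_reads, references):
    """Assign a query to the reference sketch with the smallest distance.

    `references` maps a sample name to a precomputed sketch.
    Returns the best-matching name and the full name -> distance mapping.
    """
    query_sketch = sketch(canonical_kmers(query_reads))
    dists = {
        name: distance(jaccard_estimate(query_sketch, ref_sketch))
        for name, ref_sketch in references.items()
    }
    return min(dists, key=dists.get), dists
```

In use, each voucher would be sketched once, for example `references = {"voucher_1": sketch(canonical_kmers(voucher_reads))}`, and each query passed to `assign(query_reads, references)`. Because this sketch applies no coverage correction, low‐coverage skims would yield inflated distances; that is the role of the parameter‐estimation step in the caption.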
FIGURE 3 (a) Simplified description of the workflow process for generating different types of data that could be used for taxonomic identification. (b) Illustrative example showing that while the underlying cost of sample collection, vouchering and DNA extraction remains relatively constant over time, as it is principally constrained by the cost of human labour, the costs of generating data using different next‐generation sequencing techniques are rapidly converging. Thus while, for example, the amount of shotgun sequence data needed to generate a species‐specific k‐mer profile is considerably more than is needed to mine an organellar genome, the difference in the economic cost of generating that much more sequence data is rapidly narrowing. We argue this supports the rationale for exploiting genome skims fully as a tool to complement traditional barcoding.
Overview of sample collection, laboratory and sequence processing steps and of applications of DNA‐based sample identification methods
| | Traditional PCR‐based barcoding: Sanger sequencing | Traditional PCR‐based barcoding: Next‐generation sequencing | Genome skimming: Organelle assembly | Genome skimming: k‐mers | Earth BioGenome Project: Whole‐genome assembly |
|---|---|---|---|---|---|
| Sample collection | |||||
| Sampling efforts | Same | Same | Same | Same | Same |
| Voucher specimen | Same | Same | Same | Same | Same |
| Laboratory | |||||
| Extraction | Standard | Standard | Standard | Standard | High molecular weight |
| PCR of marker region | Yes | Yes | No | No | No |
| Library build | No | Yes | Yes | Yes | Yes (multiple types) |
| Sequence read processing | |||||
| Initial trimming of sequence reads | Yes (manual) | Yes | Yes | Yes | Yes |
| Quality check of barcode sequence | Yes (manual) | Yes | Yes | No | Yes |
| Creating k‐mer profile | No | No | No | Yes | No |
| Assembly of organellar genome | No | No | Yes | Optional | Yes |
| Assembly of whole genomes | No | No | No | No | Yes |
| Applications | |||||
| Identification at taxonomic species level | Sometimes | Sometimes | Yes | Yes | Yes |
| Taxonomic identification of simple samples | Yes | Yes | Yes | Yes | Yes |
| Taxonomic reconstruction of complex samples | Yes | Yes | Yes unless contains very closely related taxa | Perhaps—remains to be fully explored | No |
| Population‐level resolution | Rarely—requires population structure and high genetic divergence between populations | Rarely—requires population structure and high genetic divergence between populations | Sometimes—if characterized by unique organelle haplotypes | Perhaps—to be fully explored | Yes if sufficient population structure exists |
| Discerning individual‐level information | No | No | No | Perhaps | Yes |
Requires ca. 1 Gbp of shotgun sequencing (Coissac et al., 2016).
If funding can be secured, the EBP aims to generate chromosome‐level genome assemblies for all known eukaryote species (Lewin et al., 2018).