Tobias Andermann1,2, Maria Fernanda Torres Jiménez1,2, Pável Matos-Maraví1,2,3, Romina Batista2,4,5, José L Blanco-Pastor1,6, A Lovisa S Gustafsson7, Logan Kistler8, Isabel M Liberal1, Bengt Oxelman1,2, Christine D Bacon1,2, Alexandre Antonelli1,2,9.
Abstract
High-throughput DNA sequencing techniques enable time- and cost-effective sequencing of large portions of the genome. Instead of sequencing and annotating whole genomes, many phylogenetic studies focus sequencing effort on large sets of pre-selected loci, which further reduces costs and bioinformatic challenges while increasing coverage. One common approach that enriches loci before sequencing is often referred to as target sequence capture. This technique has been shown to be applicable to phylogenetic studies of greatly varying evolutionary depth. Moreover, it has proven to produce powerful, large multi-locus DNA sequence datasets suitable for phylogenetic analyses. However, target capture requires careful considerations, which may greatly affect the success of experiments. Here we provide a simple flowchart for designing phylogenomic target capture experiments. We discuss necessary decisions from the identification of target loci to the final bioinformatic processing of sequence data. We outline challenges and solutions related to the taxonomic scope, sample quality, and available genomic resources of target capture projects. We hope this review will serve as a useful roadmap for designing and carrying out successful phylogenetic target capture studies.
Keywords: Hyb-Seq; Illumina; NGS; anchored enrichment; bait; high throughput sequencing; molecular phylogenetics; probe
Year: 2020 PMID: 32153629 PMCID: PMC7047930 DOI: 10.3389/fgene.2019.01407
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Published studies deposited in Web of Science that have used target sequence capture in phylogenetic research. (A) Number of publications by year (** our search included papers available in Web of Science by December 20, 2019). (B) Normalized cumulative publications using target sequence capture in relation to other phylogenomic studies over time, sorted by year of publication. We restricted our searches to studies published from 2006 onward, the year of release of the first commercial high-throughput sequencer. We searched for Original Articles published in English in the category ‘Evolutionary Biology’. We used eight combinations of keywords in independent searches that included the terms: ‘hybrid’ OR ‘target*’ OR ‘exon’ OR ‘anchored’ AND ‘enrichment’ OR ‘capture’ AND ‘phylogenom*’. We merged the datasets and removed duplicate records by comparing unique DOIs (blue bars in panel A). These searches were contrasted with all other phylogenomic studies as specified by the keywords ‘sequencing’ AND ‘phylogenom*’ (yellow bars in panel A).
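The search procedure described in the Figure 1 caption (eight keyword combinations, followed by merging and removal of duplicates by DOI) can be sketched as follows. This is a minimal illustration, not the authors' script: the query string syntax, the function names `build_queries` and `merge_unique`, and the demo records are all assumptions made for the example.

```python
from itertools import product

# Keyword groups taken from the caption; the "AND" query syntax is an assumption.
FIRST_TERMS = ["hybrid", "target*", "exon", "anchored"]
SECOND_TERMS = ["enrichment", "capture"]


def build_queries():
    """Return the eight keyword combinations, each combined with 'phylogenom*'."""
    return [f"{a} AND {b} AND phylogenom*" for a, b in product(FIRST_TERMS, SECOND_TERMS)]


def merge_unique(result_sets):
    """Merge several search results, keeping the first record seen for each DOI."""
    seen = {}
    for records in result_sets:
        for rec in records:
            doi = rec["doi"].strip().lower()  # DOIs are compared case-insensitively
            seen.setdefault(doi, rec)
    return list(seen.values())


queries = build_queries()

# Hypothetical records standing in for two independent search exports.
demo_results = [
    [{"doi": "10.1000/a", "year": 2015}, {"doi": "10.1000/b", "year": 2016}],
    [{"doi": "10.1000/A", "year": 2015}],  # same paper, different DOI casing
]
merged = merge_unique(demo_results)
```

Four first terms times two second terms gives the eight searches mentioned in the caption; the merged list then holds one record per DOI.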
Figure 2. Decision chart and overview of the main considerations for project design in high throughput sequencing. The flow chart shows the most common groups of sequencing methodologies. Sections 1–3 summarize key components of project design, starting with the choice of sequencing method, followed by bait design, and finishing with the optimization of laboratory practices. Section 3 shows recommended (full circles), recommended in some cases (half circles), and not recommended (empty circles) practices based on input DNA quality and quantity. “Low input” refers to low input DNA extraction kits and “touch down” refers to temperature ramps at the hybridization and capture steps.
List of publicly available bait sets. This is not a complete list; it aims to highlight the taxonomic diversity of bait sets for broader organism groups. See the for the number of baits in each set.
| Name of bait set | Clade | Number of targeted loci | Reference |
|---|---|---|---|
| Arachnida 1.1Kv1 | | 1,120 | |
| Coleoptera 1.1Kv1 | | 1,172 | |
| Diptera 2.7Kv1 | | 2,711 | |
| Hemiptera 2.7Kv1 | | 2,731 | |
| Hymenoptera 1.5Kv1 (hym-v1) | | 1,510 | |
| Hymenoptera 2.5Kv2 (hym-v2) | | 2,590 | |
| BUTTERFLY1.0 | | 425 | |
| BUTTERFLY2.0 | | 13* | |
| Lepidoptera 1.3K-v1 | | 1,381 | |
| Actinopterygians 0.5Kv1 | | 500 | |
| Acanthomorphs 1Kv1 | | 1,314 | |
| - | | 8,706 | |
| - | | 1,265 | |
| FrogCap | | ~15,000 | |
| AHE | | 512 | |
| GENECODE | | 205,031 | |
| SqCL | | 5,312 | |
| Coding Regions | | 3,888 | |
| Tetrapods-UCE-2.5Kv1/Tetrapods-UCE-5Kv1 | | 2,386 | |
| Anthozoa 1.7Kv1 | | 1,791 | |
| Sphaerospira-Austrochlotitis-120-60-v2 | | 2,648 | |
| Angiosperms-353 | | 353* | |
| - | | 4,184 | |
| PhyloPalm | | 795* | |
| 40916-Tapeworm | | 3,641 | |
| PenSeq | | ~48* | |
| MetCap | | 331 sequence clusters | |
| MEGaRICH | – | 2,490 | |
| ViroCap | | Baits designed to identify viruses in human samples | |
*Complete genes, including all exons. The target phylum is indicated in bold.
Popular short read processing pipelines. Full circles stand for ‘Applies’, half circles for ‘Partly applies’, and empty circles for ‘Does not apply’ for the respective category of the pipeline.
| Pipeline | Read cleaning | Sequence engineering | Intron recovery | MSA generation | Allele phasing | SNP extraction | Ease of installation |
|---|---|---|---|---|---|---|---|
| aTRAM | ○ | ● | ● | ○ | ○ | ○ | |
| HYBPIPER | ○ | ● | ● | ○ | ○ | ○ | |
| PHYLUCE | | ○ | ○ | ● | ● | | ● |
| SECAPR | ● | | | ● | ● | | ● |
Figure 3. The most common sources of read-variation within reference-based assemblies of a given organism. (A) Sequencing errors are identifiable as single variants that are only present on an individual read and are generally not shared across several reads. (B) Paralogous reads are visible as blocks of reads with several variants shared among a low frequency of reads. Paralogous reads originate from a different part of the genome and are a result of gene or genome duplication. (C) Allelic variation can usually be identified by variants that are shared among many reads, occurring at a read frequency of approximately 1/ploidy-level, i.e. 0.5 for diploid organisms.
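The three categories in the Figure 3 caption can be turned into a rough per-site heuristic. The sketch below is only an illustration under stated assumptions: the function name `classify_variant`, the `tolerance` of 0.15 around the expected allele frequency, and the rule that a variant seen on exactly one read counts as a sequencing error are all hypothetical choices, and the per-site view ignores the "blocks of several shared variants" signal that the caption describes for paralogs.

```python
def classify_variant(alt_reads, depth, ploidy=2, tolerance=0.15):
    """Classify one variant site from its read support.

    alt_reads: number of reads carrying the variant.
    depth:     total number of reads covering the site.
    The tolerance and the single-read rule are illustrative assumptions.
    """
    if depth <= 0 or not 0 <= alt_reads <= depth:
        raise ValueError("need depth > 0 and 0 <= alt_reads <= depth")
    if alt_reads == 0:
        return "no variant"
    if alt_reads == 1:
        # Present on an individual read only, not shared across reads.
        return "sequencing error"
    freq = alt_reads / depth
    expected = 1 / ploidy  # e.g. 0.5 for diploid organisms
    if abs(freq - expected) <= tolerance:
        return "allelic variation"
    if freq < expected:
        # Shared among several reads, but at low frequency.
        return "possible paralog"
    return "unclassified"
```

For a diploid sample covered by 40 reads, a variant on 20 reads falls into the allelic category, while one on 5 reads is flagged as a possible paralog. Real pipelines would combine such frequencies with base qualities and linkage between neighbouring variants.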