| Literature DB >> 28137744 |
Ryan K Schott1, Bhawandeep Panesar2, Daren C Card3, Matthew Preston2, Todd A Castoe3, Belinda S W Chang1,2,4.
Abstract
Despite continued advances in sequencing technologies, there is a need for methods that can efficiently sequence large numbers of genes from diverse species. One approach to accomplish this is targeted capture (hybrid enrichment). While these methods are well established for genome resequencing projects, cross-species capture strategies are still being developed and generally focus on the capture of conserved regions, rather than complete coding regions from specific genes of interest. The resulting data is thus useful for phylogenetic studies, but the wealth of comparative data that could be used for evolutionary and functional studies is lost. Here, we design and implement a targeted capture method that enables recovery of complete coding regions across broad taxonomic scales. Capture probes were designed from multiple reference species and extensively tiled in order to facilitate cross-species capture. Using novel bioinformatics pipelines we were able to recover nearly all of the targeted genes with high completeness from species that were up to 200 myr divergent. Increased probe diversity and tiling for a subset of genes had a large positive effect on both recovery and completeness. The resulting data produced an accurate species tree, but importantly this same data can also be applied to studies of molecular evolution and function that will allow researchers to ask larger questions in broader phylogenetic contexts. Our method demonstrates the utility of cross-species approaches for the capture of full length coding sequences, and will substantially improve the ability for researchers to conduct large-scale comparative studies of molecular evolution and function.Entities:
Keywords: comparative studies of protein-coding genes; cross-species sequence capture; hybrid enrichment of complete coding sequences
Mesh:
Year: 2017 PMID: 28137744 PMCID: PMC5381602 DOI: 10.1093/gbe/evx005
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Comparison of Pros and Cons of Different High-Throughput Sequencing Strategies
| Method | Pros | Cons | References |
|---|---|---|---|
| WGS | Produces data that can be used for many different applications | Expensive. Assembly can be difficult and time consuming |
|
| RNA-Seq | Produces data that can be used for many different applications. Also provides information on expression levels. | Need fresh tissue for RNA. Obtaining sequences depends on expression levels and tissue/temporal-specific expression |
|
| PCR | Can sequence small numbers of loci (up to ∼100) efficiently across divergent species. | Primer design can be difficult, time consuming, and needs conserved regions. Can give biased results. |
|
| RRL/RAD-Seq | Can produce 1000s of loci for use in intraspecific and shallow phylogenetic studies. | Only useful at very shallow time scales. |
|
| Targeted Capture (Hybrid Enrichment) | Can sequence 100s to 1000s loci from species at varying levels of divergence. | Initial probe design can be expensive/time consuming and may require genomic resources. | See below |
| Whole Exome Capture | Produces data that can be used for many different applications. | Only applicable with model organisms and their very close relatives. |
|
| Conserved Region Capture | Sequence 1000s of loci that can be used for phylogenetic studies at shallow to deep timescales. | Data only useful for phylogenetic studies. |
|
| Exon Capture | Sequence 100s to 1000s of individual exons across low to moderate levels of divergence. | Data only useful for phylogenetic studies. Performance decreases sharply with divergence. |
|
| Complete CDS Capture | Produces data that can be used for many different applications. Applicable across divergent species. Can target genes with physiological relevance. | Manual curation can be time consuming. Guided assembly can require additional references sequences for divergent species. | Current study |
Note.—WGS, whole genome sequencing; RRL, reduced representations library; RAD, restriction site associated DNA; CDS, coding sequence.
FCross-species hybrid capture methods. (A), exons were extracted from the genomes of an average of three reference species (anole, turtle, and chicken). (B), probes were designed against each exon. Since probe length was constant at 120 bp, exons shorter than the probe length were padded with non-homologous sequence. Exons the same length as the probe matched exactly, while those longer were extensively tiled across the exon (10X coverage). The overall number of probes covering each base was normalized to ensure even coverage. (C), multiple reference species and tiling were designed to help facilitate cross-species capture. For example, a region of high divergence may occur in one species and not another, or could still be captured by tiling across it.
FSpecies relationships of the 16 species sequenced and the enrichment, percent of genes recovered and the average completeness of those genes that were recovered. These results represent the combined best for the different assembly methods and references used. Genes were considered recovered if they were at least 5% complete and could be properly identified based on BLAST similarity and/or phylogenetic position. Species most closely related to the reference are shown in red, snakes in green, the plated lizard in orange, and geckos in blue. Tree topology based on Pyron et al. (2013). Divergence times from Hedges et al. (2015).
Comparison of BWA Assembly with the Anolis and Snake References
| Reference Sequences | |||||||||
|---|---|---|---|---|---|---|---|---|---|
|
|
| Snake (139 seqs.) | |||||||
| Species |
|
|
|
|
|
|
|
|
|
| Anole (control) | 71.5 | 99.4 | 94.8 | 56.5 | 99.3 | 95.9 | 30.1 | 98.6 | 66.7 |
| Leopard lizard | 58.6 | 100 | 88.2 | 48.7 | 100 | 89.7 | 32.4 | 98.6 | 70.3 |
| Chameleon | 48.5 | 98.1 | 72.8 | 42.4 | 98.6 | 74.3 | 31.9 | 95.0 | 63.9 |
| Monitor | 56.7 | 100 | 77.7 | 48.6 | 100 | 79.8 | 39.6 | 97.8 | 71.5 |
| Glossy snake | 48.7 | 88.1 | 64.4 | 44.5 | 98.6 | 64.9 | 50.8 | 100 | 85.6 |
| Scarlet snake | 49.6 | 88.1 | 67.8 | 45.9 | 98.6 | 68.3 | 52.1 | 100 | 86.1 |
| King snake | 48.7 | 87.5 | 64.8 | 43.9 | 97.8 | 65.3 | 50.5 | 100 | 85.6 |
| Corn snake | 48.5 | 88.8 | 62.6 | 44.8 | 99.3 | 63.0 | 50.9 | 100 | 85.0 |
| Leaf-nosed sn. | 52.6 | 89.4 | 67.6 | 48.9 | 100 | 67.8 | 56.5 | 100 | 86.2 |
| Long-nosed sn. | 47.3 | 86.9 | 61.5 | 43.6 | 97.1 | 61.9 | 50.7 | 100 | 85.6 |
| Night snake | 49.1 | 87.5 | 64.2 | 45.8 | 97.8 | 64.7 | 52.4 | 100 | 85.0 |
| Garter snake | 49.1 | 86.3 | 63.8 | 45.8 | 96.4 | 64.2 | 51.6 | 100 | 84.7 |
| Plated lizard | 48.3 | 98.8 | 75.2 | 42.5 | 99.3 | 75.9 | 35.8 | 97.8 | 67.4 |
| Tokay gecko | 47.5 | 90.0 | 69.4 | 41.5 | 97.1 | 69.5 | 36.9 | 92.8 | 61.7 |
| Giant day gecko | 47.8 | 90.6 | 69.1 | 39.9 | 96.4 | 69.7 | 33.6 | 94.2 | 61.4 |
| Reef gecko | 53.4 | 90.6 | 65.6 | 46.7 | 97.1 | 66.1 | 38.7 | 93.5 | 58.7 |
| Average | 51.6 | 91.9 | 70.6 | 45.6 | 98.3 | 71.3 | 43.4 | 98.0 | 75.3 |
Note.––To make the comparison between the Anolis and Snake reference even, the Anolis reference was trimmed to contain only the sequences present in the Snake reference (Anolis trim Snake). Enrichment was measured as the percent of reads mapping to the reference (uncorrected for genome size). Recovery was calculated as the percent of the 165 targeted genes that had at least 5% completeness and that could be identified based on BLAST similarity and/or phylogenetic analysis. Completeness was calculated relative to the reference sequence for those genes identified as recovered using the BWA assembly method and the best reference for each gene. Abbreviations—Enrich., Enrichment; Rec., Recovery; Comp., Completeness; sn., snake.
Comparison of BWA Assembly with the Anolis and Gekko References
| Species | Reference Sequences | ||||||||
|---|---|---|---|---|---|---|---|---|---|
|
|
|
| |||||||
|
|
|
|
|
|
|
|
|
| |
| Tokay gecko | 47.5 | 90.0 | 69.4 | 34.1 | 97.3 | 70.5 | 36.9 | 100 | 88.4 |
| Giant day gecko | 47.8 | 90.6 | 69.1 | 33.1 | 97.3 | 69.7 | 37.6 | 100 | 89.6 |
| Reef gecko | 53.4 | 90.6 | 65.6 | 38.3 | 99.1 | 65.8 | 43.0 | 100 | 84.3 |
| Average | 49.6 | 90.4 | 68.0 | 35.2 | 97.9 | 68.7 | 39.2 | 100 | 87.5 |
Note.––To make the comparison between the Anolis and Gekko reference even, the Anolis reference was trimmed to contain only the sequences present in the Snake reference (Anolis trim Gekko). Enrichment was measured as the percent of reads mapping to the reference (uncorrected for genome size). Recovery was calculated as the percent of the 165 targeted genes that had at least 5% completeness and that could be identified based on BLAST similarity and/or phylogenetic analysis. Completeness was calculated relative to the reference sequence for those genes identified as recovered using the BWA assembly method and the best reference for each gene. Abbreviations—Enrich., Enrichment; Rec., Recovery; Comp., Completeness.
Comparison of Average Completeness of Recovered Coding Regions Obtained Using Different Assemblers
| Species | Average Completeness (%) | |||||
|---|---|---|---|---|---|---|
| BWA (Default) | BWA ( | NGM | STAMPY | STAMPY (Div. = 0.1) | BT2 | |
| Anole (control) | 94.1 | 94.8 | 91.9 | 89.8 | — | 74.1 |
| Leopard lizard | 84.9 | 88.2 | 89.5 | 91.1 | — | 37.4 |
| Chameleon | 63.4 | 72.8 | 81.5 | 85.7 | — | 42.0 |
| Monitor | 70.3 | 77.7 | 83.9 | 88.8 | — | 38.2 |
| Glossy snake | 54.7 | 64.4 | 74.3 | 86.5 | — | 28.3 |
| Scarlet snake | 59.2 | 67.8 | 75.2 | 87.1 | — | 32.0 |
| King snake | 54.6 | 64.8 | 73.4 | 86.7 | — | 32.0 |
| Corn snake | 53.9 | 62.6 | 72.3 | 85.5 | — | 26.5 |
| Leaf-nosed sn. | 60.3 | 67.6 | 77.0 | 87.3 | — | 21.8 |
| Long-nosed sn. | 53.2 | 61.5 | 73.0 | 86.4 | — | 31.1 |
| Night snake | 54.4 | 64.2 | 72.8 | 86.4 | — | 27.5 |
| Garter snake | 53.8 | 63.8 | 73.2 | 85.9 | — | 31.0 |
| Plated lizard | 65.7 | 75.2 | 81.8 | 86.7 | — | 37.2 |
| Tokay gecko | 59.2 | 69.4 | 79.6 | 85.4 | 85.3 | 47.6 |
| Giant day gecko | 60.0 | 69.1 | 78.2 | 84.5 | 84.2 | 43.9 |
| Reef gecko | 55.4 | 65.6 | 77.3 | 84.3 | 84.0 | 42.3 |
|
|
|
|
|
|
|
|
Note.––All assembly was done to the Anolis reference. Abbreviations—Div., divergence parameter; BT2, Bowtie 2.
FAnalyses of the effect of divergence on completeness of recovered coding regions. (A), average gene completeness for each species compared to average pairwise sequence identity to anole demonstrating the strong correlation between sequence identity and the completeness of genes recovered. Average completeness was calculated as the average across each gene recovered using the best assembly method and reference for each gene. Pairwise identity was calculated between each species and anole for a set of representative genes obtained independently for each species. The reduced major axis regression lines are shown both including and excluding Anolis in the regression. (B), completeness of each gene captured in Gekko compared to the pairwise identity between Anolis and Gekko for those genes. Completeness was calculated for genes with complete probe and reference sequences in Anolis that were also found in the de novo transcriptome assembly of Gekko and are the values for BWA assembly against the Gekko reference. Pairwise identity was calculated between the de novo assembled Gekko coding sequences and the Anolis reference sequences thus removing bias from the guided assembly approach.
Comparison of the Performance of the Assembly and Annotation Pipeline on RNA-Seq and Whole Genome Data with the Targeted Capture Approach
| Sequencing Method | Species | Number QC-Passed Reads |
| Genes Recovered | Average Completeness |
|---|---|---|---|---|---|
| Targeted capture | Average of 16 | 8,567,574 | 51.6 | 147 (92%) | 70.6% |
| RNA-Seq |
| 29,843,391 | 4.29 | 112 (70%) | 73.5% |
| WGS |
| 26,817,128 | 0.05 | 40 (26%) / 106 (68%) | 12.6% / 30.8% |
| WGS |
| 43,587,682 | 0.05 | 42 (27%) / 135 (87%) | 12.6% / 40.4% |
| WGS |
| 184,107,287 | 0.02 | 31 (20%) / 85 (55%) | 26.8% / 39.0% |
| WGS |
| 219,605,123 | 0.02 | 122 (79%) / 151 (97%) | 48.6% / 76.2% |
| WGS |
| 265,237,073 | 0.10 | 155 (100%) | 82.9% / 88.2% |
Note.––Data shown is for assembly with BWA against the Anolis (targeted capture, RNA-Seq) or Gallus reference (WGS). Data for individual genes is shown in the supplementary tables S11 and S12, Supplementary Material online. Since the whole genome sequencing (WGS) suffered from low mapping rates (due to no enrichment) we additionally relaxed the requirement in our assembly pipeline to only require the minimum (default) level of coverage of 3X, which is shown after the slash (/). A sequence was considered recovered if it had a minimum of 5% completeness and could be identified via BLAST. Completeness was calculated relative to the reference sequence for those genes identified as recovered.
Cost Comparison of Targeted Capture, RNA-Seq, and Whole Genome Sequencing Experiments
| Per Sample Costs | Extraction | Kit & Reagents | Library prep | Fraction of HiSeq Lane | Cost of Sequencing | Total | Relative Cost |
|---|---|---|---|---|---|---|---|
| Targeted capture 16 | $3.81 | $482.74 | $100.00 | 1/32 | $65.63 | $652 | 100.0% |
| Targeted capture 96 | $3.81 | $367.79 | $100.00 | 1/32 | $65.63 | $537 | 82.4% |
| RNA-Seq | $14.45 | — | $300.00 | 1/6 | $350.00 | $664 | 101.9% |
| WGS | $3.81 | — | $251.00 | 1/7 | $300.00 | $555 | 85.1% |
| WGS | $3.81 | — | $251.00 | 1/5 | $420.00 | $675 | 103.5% |
| WGS | $3.81 | — | $251.00 | 1/4 | $525.00 | $780 | 119.6% |
| WGS | $3.81 | — | $251.00 | 3/4 | $1,575.00 | $1,830 | 280.6% |
| WGS | $3.81 | — | $251.00 | 1 | $2,100.00 | $2,356 | 361.2% |
Note.––Costs were those incurred by us in CDN dollars and are likely to vary depending on country and sequencing center used, and are expected to change over time. Additional cost details are available in the supplementary table S13, Supplementary Material online. Costs assume use of an Agilent Custom SureSelect kit (either 16 or 96 samples) and of a paid service at a core facility for targeted capture, library preparation, and/or sequencing, and thus should be generally reproducible by any lab without need for specialized equipment/expertise. They additionally assume the possibility of sequencing on partial HiSeq lanes when necessary.
FBayesian multigene phylogeny of the 16 species with two outgroups. A total of 16 phylogenetic marker genes were used. The topology agrees strongly with the multigene phylogeny of (Pyron et al. 2013) differing only in the placement of the corn snake. Posterior probability support is shown at each node.