| Literature DB >> 28690805 |
Ethan B Linck1, Zachary R Hanna2,3,4, Anna Sellas5, John P Dumbacher4,5.
Abstract
Laboratory techniques for high-throughput sequencing have enhanced our ability to generate DNA sequence data from millions of natural history specimens collected prior to the molecular era, but remain poorly tested at shallower evolutionary time scales. Hybridization capture using restriction site-associated DNA probes (hyRAD) is a recently developed method for population genomics with museum specimens. The hyRAD method employs fragments produced in a restriction site-associated double digestion as the basis for probes that capture orthologous loci in samples of interest. While promising in that it does not require a reference genome, hyRAD has yet to be applied across study systems in independent laboratories. Here, we provide an independent assessment of the effectiveness of hyRAD on both fresh avian tissue and dried tissue from museum specimens up to 140 years old and investigate how variable quantities of input DNA affect sequencing, assembly, and population genetic inference. We present a modified bench protocol and bioinformatics pipeline, including three steps for detection and removal of microbial and mitochondrial DNA contaminants. We confirm that hyRAD is an effective tool for sampling thousands of orthologous SNPs from historic museum specimens to describe phylogeographic patterns. We find that modern DNA performs significantly better than historical DNA better during sequencing but that assembly performance is largely equivalent. We also find that the quantity of input DNA predicts %GC content of assembled contiguous sequences, suggesting PCR bias. We caution against sampling schemes that include taxonomic or geographic autocorrelation across modern and historic samples.Entities:
Keywords: RADseq; ancient DNA; hyRAD; museum genomics; sequence capture
Year: 2017 PMID: 28690805 PMCID: PMC5496524 DOI: 10.1002/ece3.3065
Source DB: PubMed Journal: Ecol Evol ISSN: 2045-7758 Impact factor: 2.912
Sampling information
| Specimen | Subspecies | Sample type | Locality | Date collected | Age of sample (years) | Total number of reads | Number of cleaned reads | % Duplicate reads | Average coverage (X) | Specificity | Fold enrichment | Sensitivity | Loci | Average locus length (nt) | Number of contigs over 1KB | Initial concentration (ng/μl) | GC content of on‐target contigs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KU:Birds:5215 |
| Tissue | Ivimka Camp, Gulf Province, Papua New Guinea | 2003 | 13 | 18,713,496 | 2,915,214 | 68.1 | 12.74 | 16.87 | 16.58 | 84.93 | 22,568 | 518 | 489 | 71.6 | 44.22 |
| KU:Birds:5464 |
| Tissue | Sapoa Camp, Gulf Province, Papua New Guinea | 2003 | 13 | 17,246,396 | 2,512,928 | 68.2 | 11.2 | 16.65 | 16.36 | 79.43 | 19,880 | 499 | 340 | 51.8 | 44.69 |
| KU:Birds:6927 |
| Tissue | Mt. Suckling, Oro Province, Papua New Guinea | 2011 | 5 | 25,892,086 | 3,103,912 | 73.2 | 14.46 | 17.37 | 17.07 | 85.02 | 23,162 | 510 | 494 | 37 | 44.53 |
| KU:Birds:7131 |
| Tissue | Dark End Camp, Gulf Province, Papua New Guinea | 2002 | 14 | 6,819,041 | 1,154,541 | 60.4 | 5.87 | 17.29 | 16.99 | 65.44 | 12,725 | 434 | 157 | 31.8 | 45.01 |
| CAS:ORN:626 |
| Blood | Varirata National Park, Central Province, Papua New Guinea | 2011 | 5 | 15,548,547 | 1,945,868 | 70.2 | 9.38 | 17.06 | 16.77 | 77.25 | 18,519 | 488 | 262 | 43 | 44.62 |
| KU:Birds:9145 |
| Tissue | Gahom Camp, East Sepik Province, PNG | 2003 | 13 | 18,049,058 | 2,309,716 | 70.2 | 10.75 | 16.91 | 16.62 | 80.15 | 20,057 | 452 | 312 | 10.1 | 44.35 |
| AMNH:Birds:637445 |
| Toe pad | Wasior, West Papua Province, Indonesia | 1928 | 88 | 13,112,780 | 48,257 | 78.2 | 2.59 | 11.16 | 10.97 | 33.52 | 849 | 449 | 57 | 3.66 | 46.03 |
| AMNH:Birds:329542 |
| Toe pad | Sewa Bay, Normanby Island, Papua New Guinea | 1934 | 82 | 15,875,059 | 40,402 | 82.2 | 1.8 | 8.9 | 8.75 | 29.19 | 388 | 465 | 24 | 3.84 | 47.62 |
| AMNH:Birds:637464 |
| Toe pad | Wannambai, Maluku Province, Indonesia | 1896 | 120 | 21,058,924 | 139,248 | 59.9 | 9.2 | 11.07 | 10.88 | 50.24 | 3,197 | 449 | 154 | 10.2 | 44.98 |
| AMNH:Birds:637450 |
| Toe pad | Humbolt Bay, Papua Province, Indonesia | 1928 | 88 | 3,692,636 | 16,359 | 75.1 | 0.89 | 12.06 | 11.85 | 14.62 | 153 | 543 | 15 | 0.714 | 48.74 |
| AMNH:Birds:637429 |
| Toe pad | Misool Island, West Papua Province, Indonesia | 1900 | 116 | 13,531,142 | 35,925 | 79.8 | 2.13 | 9.46 | 9.30 | 30.09 | 412 | 448 | 26 | 2.54 | 46.44 |
| AMNH:Birds:300723 |
| Toe pad | Waigeu Island, West Papua Province, Indonesia | 1900 | 116 | 14,143,466 | 34,596 | 82.2 | 1.91 | 10.03 | 9.86 | 27.78 | 508 | 401 | 21 | 0.846 | 46.88 |
| AMNH:Birds:637446 |
| Toe pad | Kepaur, West Papua Province, Indonesia | 1897 | 119 | 20,487,169 | 36,627 | 84.1 | 2.15 | 8.02 | 7.88 | 23.19 | 177 | 600 | 22 | 1.07 | 48.18 |
| NHMUK:ZOO:1911.12.20.823 |
| Toe pad | Mimika River, Papua Province, Indonesia | 1913 | 103 | 2,875,704 | 19,115 | 50.9 | 1.55 | 10.85 | 10.66 | 20.44 | 239 | 611 | 40 | 1.49 | 47.28 |
| NHMUK:ZOO:1911.12.20.822 |
| Toe pad | Satakwa River, Papua Province, Indonesia | 1911 | 105 | 5,464,605 | 46,548 | 59.8 | 2.08 | 9.44 | 9.28 | 24.39 | 296 | 579 | 36 | 0.584 | 48.32 |
| AMNH:Birds:437798 |
| Toe pad | Amberbaki, West Papua Province, Indonesia | 1877 | 139 | 1,579,610 | 4,957 | 73.4 | 0.39 | 9.1 | 8.94 | 4.58 | 52 | 582 | 4 | 1.84 | 48.88 |
| AMNH:Birds:637441 |
| Toe pad | Mt. Mori, West Papua Province, Indonesia | 1899 | 117 | 6,803,386 | 19,623 | 76.9 | 1.3 | 9.5 | 9.34 | 21.38 | 157 | 512 | 11 | 7.36 | 47.89 |
| AMNH:Birds:293741 |
| Toe pad | Ifaar, Papua Province, Indonesia | 1928 | 88 | 27,602,106 | 49,786 | 91.1 | 2.14 | 12.48 | 12.27 | 23.84 | 493 | 447 | 28 | 0.124 | 50.41 |
| AMNH:Birds:293715 |
| Toe pad | Kepaur, West Papua Province, Indonesia | 1928 | 88 | 30,751,249 | 56,972 | 90.5 | 2.15 | 11.15 | 10.96 | 24.2 | 438 | 504 | 30 | 0.15 | 50.95 |
Figure 1Bioinformatics pipeline for S. torotoro hyRAD data. We demultiplexed 100‐bp paired‐end reads from three genomic libraries and filtered for adapter contamination/quality scores (A), or adapter contamination/quality scores and E. coli contamination (A1). Reads were clustered (as consensus fasta files) (B), and repeat regions removed from probes (C). After determining which assembled clusters were orthologous with probe regions (D), we merged flanking regions from on‐target loci in modern samples with the repeat‐free probe sequence to create a pseudo‐reference genome (E). To identify which contigs represented contamination in the original probe sample library from exogenous microbes or mitochondrial DNA, we BLAST searched against both the NCBI nt database and a full mitochondrial genome from S. torotoro relative Halcyon sanctus (F). We aligned quality filtered reads to this pseudo reference (G), called SNPs to produce a raw.vcf file for historic and modern DNA libraries separately (H). After filtering SNPs for origin in contaminant contigs and then restricting our matrix to sites present in both sample types (I), we filtered SNPs by read depth, quality scores, probability of being variable sites, and minor allele frequencies (J) prior to downstream analyses
Results of microbial and mitochondrial contamination removal at three distinct steps. Raw reads were filtered for E. coli contamination; assemblied contigs were filtered for mitochondrial DNA; SNP matrices were filtered for both microbial DNA and mitochondrial DNA
| Contamination filtering step | Sample type | Total count | Count removed | Percent removed |
|---|---|---|---|---|
| Raw reads | Modern | 13,942,179 | 4 | <0.001 |
| Historic | 548,415 | 2,451 | 0.44 | |
| Assembled contigs | Modern | 296,828 | 8 | <0.001 |
| Historic | 36,109 | 53 | 0.15 | |
| SNP matrix | Modern | 6,915,902 | 6,620 | <0.001 |
| Historic | 749,091 | 3,945 | 0.53 |
Figure 2Differences in sequencing and assembly performance between historical and modern DNA extractions. We observed significantly higher specificity (t = −17.711, df = 14.015, p < .001), sensitivity (t = –12.928, df = 14.014, p < .001), fold enrichment (t = −17.711, df = 14.015, p < .001), and average coverage (t = −6.248, df = 7.555, p < .001) in modern samples. We recovered a significantly higher total number of on‐target loci in modern samples (t = −12.239, df = 5.221, p < .001), but significantly higher mean percent GC content in historic samples (t = 6.997, df = 13.368, p < .001). We observed no significant differences between modern and historic samples for mean contig size or mean percentage of duplicate reads
Figure 3Among historic samples, the number of trimmed reads was a significant predictor of the number of captured loci (R 2 = .872, p < .001) and the initial sample DNA input quantity was a significant predictor of percent GC content in on‐target assembled contigs (R 2 = .370, p = .016)
Figure 4This displays the relationship between the cutoff for the minimum number of genotyped individuals required for a given SNP to be included in the data matrix and the total number of SNPs included in the resulting matrix
Figure 5Individual membership probabilities for K = 3 ancestral populations inferred from analyzing 1,690 SNPs (100% complete data matrix) using discriminant analysis of principal components and retaining six principal component axes with two discriminant axes. Individual sampling locations are color coded accordingly
P values for simple linear regression between PC axes and sample variables
| Specificity | Sensitivity | Fold enrichment | Age | Initial concentration | |
|---|---|---|---|---|---|
| PC1 | <.001 | <.001 | <.001 | <.001 | <.001 |
| PC2 | .076 | .534 | .076 | .242 | .253 |
| PC3 | .437 | .335 | .437 | .466 | .552 |