Jonathan B Puritz, Christopher M Hollenbeck, John R Gold.
Abstract
Restriction-site associated DNA sequencing (RADseq) has become a powerful and useful approach for population genomics. Currently, no software exists that utilizes both paired-end reads from RADseq data to efficiently produce population-informative variant calls, especially for non-model organisms with large effective population sizes and high levels of genetic polymorphism. dDocent is an analysis pipeline with a user-friendly, command-line interface designed to process individually barcoded RADseq data (with double cut sites) into informative SNPs/Indels for population-level analyses. The pipeline, written in BASH, uses data reduction techniques and other stand-alone software packages to perform quality trimming and adapter removal, de novo assembly of RAD loci, read mapping, SNP and Indel calling, and baseline data filtering. Double-digest RAD data from population pairings of three different marine fishes were used to compare dDocent with Stacks, the first generally available, widely used pipeline for analysis of RADseq data. dDocent consistently identified more SNPs shared across greater numbers of individuals and with higher levels of coverage. This is because dDocent quality trims reads instead of filtering them and incorporates both forward and reverse reads (including reads with Indel polymorphisms) in assembly, mapping, and SNP calling. The pipeline and a comprehensive user guide can be found at http://dDocent.wordpress.com.
Keywords: Bioinformatics; Molecular ecology; Next-generation sequencing; Population genomics; RADseq
Year: 2014 PMID: 24949246 PMCID: PMC4060032 DOI: 10.7717/peerj.431
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Levels of coverage for each unique read in the red snapper data set.
The horizontal axis represents the minimal level of coverage, while the vertical axis represents the number of unique paired reads in thousands.
Table 1. Results from individual runs of dDocent and Stacks.
dDocent runs varied in the level of similarity used to cluster reference sequences: A (90%), B (96%), and C (99%). For Stacks, forward reads and reverse reads were separately processed with denovo_map.pl (Stacks version 1.08), using three different sets of parameters: A, minimum depth of coverage of two to create a stack, a maximum distance of two between stacks, and a maximum distance of four between stacks from different individuals; B, minimum depth of coverage of three to create a stack, a maximum distance of four between stacks, and a maximum distance of eight between stacks from different individuals; and C, minimum depth of coverage of three to create a stack, a maximum distance of four between stacks, and a maximum distance of 10 between stacks from different individuals. For dDocent, complex variants were decomposed into canonical SNP and Indel calls, and Indel calls were filtered out. SNP calls were evaluated at different individual coverage levels: (i) total number of SNPs; (ii) number of SNPs called in 75%, 90%, and 99% of individuals at 3X coverage; (iii) number of SNPs called in 75% and 90% of individuals at 5X coverage; (iv) number of SNPs called in 75% and 90% of individuals at 10X coverage; and (v) number of SNPs called in 75% and 90% of individuals at 20X coverage. Run times are in minutes. Results from forward and reverse reads of Stacks were combined for comparison with dDocent, which inherently calls SNPs on both reads.
| SNP category | dDocent A | dDocent B | dDocent C | Stacks A | Stacks B | Stacks C |
|---|---|---|---|---|---|---|
| Total 3X SNPs | 53,298 | 53,316 | 53,361 | 28,817 | 33,479 | 34,459 |
| 75% 3X SNPs | 21,195 | 20,990 | 20,724 | 4,150 | 5,735 | 5,728 |
| 90% 3X SNPs | 9,102 | 8,850 | 8,639 | 675 | 987 | 983 |
| 99% 3X SNPs | 78 | 47 | 15 | – | – | – |
| 75% 5X SNPs | 14,881 | 14,594 | 14,339 | 2,632 | 4,351 | 4,324 |
| 90% 5X SNPs | 5,021 | 4,925 | 4,785 | 179 | 579 | 574 |
| 75% 10X SNPs | 7,556 | 7,318 | 7,154 | 783 | 1,618 | 1,579 |
| 90% 10X SNPs | 1,414 | 1,340 | 1,286 | 7 | 48 | 47 |
| 90% IND 90% 5X | 10,267 | 10,026 | 9,798 | 806 | 1,807 | 1,079 |
| 90% IND 90% 10X | 4,242 | 4,093 | 3,974 | 129 | 441 | 434 |
| Run time (min) | 41 | 41 | 42 | 70 | 47 | 53 |

| SNP category | dDocent A | dDocent B | dDocent C | Stacks A | Stacks B | Stacks C |
|---|---|---|---|---|---|---|
| Total 3X SNPs | 46,378 | 46,688 | 46,832 | 45,792 | 50,821 | 52,366 |
| 75% 3X SNPs | 36,745 | 36,905 | 36,900 | 24,134 | 28,991 | 28,981 |
| 90% 3X SNPs | 32,356 | 32,424 | 32,330 | 13,439 | 17,946 | 17,874 |
| 99% 3X SNPs | 11,906 | 11,910 | 11,774 | 828 | 1,264 | 1,259 |
| 75% 5X SNPs | 34,279 | 34,393 | 34,336 | 21,021 | 26,526 | 26,464 |
| 90% 5X SNPs | 28,532 | 28,566 | 28,431 | 10,494 | 15,282 | 15,207 |
| 75% 10X SNPs | 27,523 | 27,605 | 27,488 | 12,928 | 17,018 | 16,983 |
| 90% 10X SNPs | 19,434 | 19,442 | 19,283 | 4,159 | 6,734 | 6,705 |
| 75% 20X SNPs | 15,080 | 15,111 | 14,981 | 2,276 | 3,538 | 3,516 |
| 90% 20X SNPs | 7,365 | 7,409 | 7,304 | 243 | 1,974 | 1,961 |
| Run time (min) | 43 | 45 | 45 | 58 | 55 | 65 |

| SNP category | dDocent A | dDocent B | dDocent C | Stacks A | Stacks B | Stacks C |
|---|---|---|---|---|---|---|
| Total 3X SNPs | 68,668 | 68,825 | 68,861 | 48,742 | 55,505 | 58,352 |
| 75% 3X SNPs | 30,771 | 30,391 | 30,051 | 7,596 | 9,705 | 9,696 |
| 90% 3X SNPs | 14,952 | 14,673 | 14,415 | 2,007 | 3,439 | 3,433 |
| 99% 3X SNPs | 4,294 | 4,060 | 3,952 | 132 | 527 | 523 |
| 75% 5X SNPs | 20,534 | 20,188 | 19,968 | 4,789 | 7,290 | 7,274 |
| 90% 5X SNPs | 9,103 | 8,750 | 8,533 | 1,225 | 2,573 | 2,570 |
| 75% 10X SNPs | 9,765 | 9,400 | 9,159 | 2,094 | 3,547 | 3,546 |
| 90% 10X SNPs | 3,923 | 3,691 | 3,490 | 489 | 1,224 | 1,223 |
| 75% 20X SNPs | 4,069 | 3,832 | 3,624 | 703 | 1,415 | 1,411 |
| 90% 20X SNPs | 1,431 | 1,313 | 1,228 | 136 | 417 | 418 |
| Run time (min) | 88 | 95 | 59 | 93 | 89 | 204 |
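The SNP categories in the table (for example, "75% 3X SNPs") can be read as a count of SNPs genotyped at a minimum depth in a minimum fraction of individuals. The following is a minimal Python sketch of that tally, not the authors' code. It assumes that "at 3X coverage" means a per-individual depth of at least 3 and that "75%" means at least 75% of individuals; the function name `count_shared_snps` and the toy depth values are hypothetical.

```python
from typing import Dict, List


def count_shared_snps(
    depths: Dict[str, List[int]], min_depth: int, min_fraction: float
) -> int:
    """Count SNPs covered at >= min_depth in >= min_fraction of individuals.

    `depths` maps a SNP identifier to a list of read depths, one per individual.
    Both thresholds are treated as inclusive (an assumption, see lead-in).
    """
    count = 0
    for per_individual in depths.values():
        n_called = sum(1 for d in per_individual if d >= min_depth)
        if n_called >= min_fraction * len(per_individual):
            count += 1
    return count


# Hypothetical toy data: read depth in each of 4 individuals for three SNPs.
toy_depths = {
    "snp_1": [12, 8, 5, 3],
    "snp_2": [3, 3, 0, 4],
    "snp_3": [25, 22, 30, 21],
}

snps_75_3x = count_shared_snps(toy_depths, min_depth=3, min_fraction=0.75)
snps_75_5x = count_shared_snps(toy_depths, min_depth=5, min_fraction=0.75)
snps_90_10x = count_shared_snps(toy_depths, min_depth=10, min_fraction=0.90)
print(snps_75_3x, snps_75_5x, snps_90_10x)
```

In practice the per-individual depths would come from the genotype fields of the variant call file produced by the pipeline; that parsing step is left out here.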
Figure 2. SNP results averaged across the three different run parameters for dDocent and Stacks.
(A) Red snapper, (B) Red drum, (C) Silk snapper (see Methods or Table 1 for SNP categories description). Error bars represent one standard error.
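As an illustration of the averaging behind Figure 2, the sketch below computes a mean and one standard error across three runs, using the three leading values of the "75% 3X SNPs" row in the first block of the table. It assumes the standard error is the sample standard deviation (n - 1 denominator) divided by the square root of the number of runs; the source does not state which estimator was used, and the helper name `mean_and_se` is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_se(values: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and standard error (sample SD / sqrt(n)) of the values."""
    n = len(values)
    mean = statistics.mean(values)
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se


# First three values of the "75% 3X SNPs" row in the first block of the table.
runs_75_3x = [21195, 20990, 20724]

mean_75_3x, se_75_3x = mean_and_se(runs_75_3x)
print(f"mean = {mean_75_3x:.1f}, SE = {se_75_3x:.1f}")
```

With only three runs per pipeline, the standard error is a rough indication of run-to-run spread rather than a precise estimate.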