Lindsay V Clark, Erik J Sacks.
Abstract
BACKGROUND: In genotyping-by-sequencing (GBS) and restriction site-associated DNA sequencing (RAD-seq), read depth is important for assessing the quality of genotype calls and estimating allele dosage in polyploids. However, existing pipelines for GBS and RAD-seq do not provide read counts in formats that are both accurate and easy to access. Additionally, although existing pipelines allow previously-mined SNPs to be genotyped on new samples, they do not allow the user to manually specify a subset of loci to examine. Pipelines that do not use a reference genome assign arbitrary names to SNPs, making meta-analysis across projects difficult.
Keywords: Genotyping-by-sequencing; Meta-analysis; Read depth; Restriction site-associated DNA sequencing; Single nucleotide polymorphism (SNP); Tag counts
Year: 2016 PMID: 27408618 PMCID: PMC4940913 DOI: 10.1186/s13029-016-0057-7
Source DB: PubMed Journal: Source Code Biol Med ISSN: 1751-0473
Fig. 1. Graphical representation of a sequence indexing tree generated by TagDigger. Use of the tree to match sequencing reads to known tags is illustrated. The red read does not match any known tags, and it takes two steps (looking at the first two nucleotides of the read) to make this determination. The blue read matches one of the expected tags, and it takes four steps to make the match. In comparison, if every read were compared to every tag, seven steps (one for each possible tag) would be required for every read. The maximum number of steps required to match a read will always be the length of the longest tag, which is advantageous when there are thousands of possible tags that are each 40–80 nucleotides long.
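The lookup described in the caption can be illustrated with a small prefix tree. The following is a minimal sketch, not TagDigger's actual implementation: the nested-dictionary layout, the function names `build_index_tree` and `match_read`, the `END` sentinel, and the seven example tags are all hypothetical. It also assumes that no tag is a prefix of another tag.

```python
END = "$"  # hypothetical sentinel key marking the end of a complete tag


def build_index_tree(tags):
    """Build a nested-dict tree in which each level is keyed by one nucleotide."""
    root = {}
    for name, seq in tags.items():
        node = root
        for base in seq:
            node = node.setdefault(base, {})
        node[END] = name  # store the tag name at the leaf
    return root


def match_read(tree, read):
    """Walk a read down the tree.

    Returns (tag name or None, number of nucleotides examined).
    """
    node = tree
    steps = 0
    for base in read:
        if END in node:            # a full tag has already been matched
            return node[END], steps
        if base not in node:       # dead end: no known tag starts this way
            return None, steps + 1
        node = node[base]
        steps += 1
    return node.get(END), steps    # read ended exactly at (or before) a leaf


# Seven made-up tags, mirroring the count in the figure (real tags are 40-80 nt).
example_tags = {
    "tag1": "ACGT", "tag2": "ACGA", "tag3": "AGTC", "tag4": "CATG",
    "tag5": "CAGG", "tag6": "GATC", "tag7": "GTTA",
}
tree = build_index_tree(example_tags)

print(match_read(tree, "ATGGC"))   # rejected after two nucleotides
print(match_read(tree, "CAGGTT"))  # matched to tag5 after four nucleotides
```

In this sketch the work per read is bounded by the length of the longest tag, regardless of how many tags are stored, which is the property the caption emphasizes.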
Table 1. Performance of the TagDigger search algorithm on a FASTQ file from RAD-seq with 96 barcodes
| Number of tags | Time to build indexing tree (s) | Time to process 10,000 FASTQ reads (s) | Estimated time to process 200,000,000 FASTQ reads (min) |
|---|---|---|---|
| 100 | 0.03 ± 0.01 | 0.218 ± 0.016 | 73 |
| 1000 | 0.84 ± 0.09 | 0.238 ± 0.005 | 79 |
| 10,000 | 7.79 ± 1.18 | 0.291 ± 0.007 | 97 |
For each number of tags, 1000 replications were performed with TagDigger, each with a different randomly-sampled subset of tags, and each with a different set of 10,000 reads from the FASTQ file. Means and standard deviations are provided
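The last column of the table appears to be a linear extrapolation of the measured time per 10,000 reads. The sketch below reproduces that arithmetic under this assumption; the function name `estimated_minutes` is hypothetical, and the time to build the indexing tree is not included.

```python
def estimated_minutes(seconds_per_batch, batch_size=10_000, total_reads=200_000_000):
    """Scale the time measured for one batch of reads up to a full file, in minutes."""
    n_batches = total_reads / batch_size
    return seconds_per_batch * n_batches / 60


# Mean seconds per 10,000 reads, taken from the table above.
measured = {100: 0.218, 1000: 0.238, 10_000: 0.291}

for n_tags, secs in measured.items():
    print(f"{n_tags} tags: about {round(estimated_minutes(secs))} min")
```

Because the per-read cost depends on tag length rather than tag count, a hundredfold increase in the number of tags raises the estimate only from about 73 to about 97 minutes.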
Table 2. Performance of de-novo GBS and RAD-seq pipelines when analyzing a single FASTQ file
| Software | Size of intermediate files generated (Gb) | RAM utilized by pipeline (Gb) | Total time, across all processor cores, to process 203,000,000 FASTQ reads and output genotypes (min) |
|---|---|---|---|
| UNEAK pipeline in TASSEL 3.0 | 0.5 | 1.9 | 22 |
| Stacks 1.4 | 8.2 | 4.2 | 424 |
| pyRAD 3.0 | 40.9 | 18.5 | 23,215 |
The FASTQ file analyzed is the same as that used to produce Table 1. pyRAD differs from UNEAK and Stacks in that it searches for insertions and deletions, whereas the other two search only for substitutions, which likely accounts for its substantially longer processing time.
Table 3. Performance of software for de-multiplexing a single FASTQ file
| Software | Time (min) |
|---|---|
|  | 169 |
| Stacks 1.4 | 156 |
| pyRAD 3.0 | 358 |
The FASTQ file analyzed is the same as that used to produce Tables 1 and 2