| Literature DB >> 22384329 |
Julian M Catchen, Angel Amores, Paul Hohenlohe, William Cresko, John H Postlethwait.
Abstract
Advances in sequencing technology provide special opportunities for genotyping individuals with speed and thrift, but the lack of software to automate the calling of tens of thousands of genotypes over hundreds of individuals has hindered progress. Stacks is a software system that uses short-read sequence data to identify and genotype loci in a set of individuals either de novo or by comparison to a reference genome. From reduced representation Illumina sequence data, such as RAD-tags, Stacks can recover thousands of single nucleotide polymorphism (SNP) markers useful for the genetic analysis of crosses or populations. Stacks can generate markers for ultra-dense genetic linkage maps, facilitate the examination of population phylogeography, and help in reference genome assembly. We report here the algorithms implemented in Stacks and demonstrate their efficacy by constructing loci from simulated RAD-tags taken from the stickleback reference genome and by recapitulating and improving a genetic map of the zebrafish, Danio rerio.Entities:
Keywords: Illumina; RAD-seq; RAD-tag; meiotic linkage map; zebrafish
Year: 2011 PMID: 22384329 PMCID: PMC3276136 DOI: 10.1534/g3.111.000240
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Stacks component programs
| Program | Description | Inputs | Database interaction |
|---|---|---|---|
| process_radtags.pl | Cleans raw Illumina reads, outputs FASTA/FASTQ files. | Raw Illumina reads | No |
| ustacks (unique stacks) | Builds loci | Cleaned FASTA/FASTQ files | No |
| cstacks (catalog stacks) | Merges loci from multiple individuals to form a catalog. | ustacks, tab-separated files | No |
| sstacks (search stacks) | Matches loci from an individual against a catalog. | ustacks and cstacks, tab-separated files | No |
| markers.pl | Calls mappable markers from parental loci. | None | Yes |
| index_radtags.pl | Indexes the database for use by the web interface. | None | Yes |
| denovo_map.pl | Executes ustacks on each individual, builds a catalog with cstacks, and matches individuals against the catalog with sstacks. Calls markers with markers.pl and indexes the database with index_radtags.pl. | Cleaned FASTA/FASTQ files | Yes |
| genotypes.pl | Calls genotypes in a map cross population and outputs markers for use by JoinMap or r/QTL. | None | Yes |
| pstacks (population stacks) | Takes cleaned reads aligned to a reference genome, builds stacks based on the genomic locations of the reads, and detects haplotypes in one individual. | Bowtie or SAM sequence alignments | No |
| ref_map.pl | Executes pstacks on each individual, builds a catalog with cstacks, and matches individuals against the catalog with sstacks. Calls markers with markers.pl and indexes the database with index_radtags.pl. | Cleaned FASTA/FASTQ files | Yes |
| sort_read_pairs.pl | Given a set of | ustacks output files, cleaned FASTA/FASTQ files | No |
| load_sequences.pl | Loads a set of loci-associated sequences ( | FASTA file containing sequences | Yes |
| export_catalog.pl | Exports sequences from the database, including loci and loci-related sequences. | None | Yes |
Figure 1 Stacks schematic. (A) The ustacks program forms stacks in an individual from short sequencing reads (cleaned by process_radtags.pl) that match exactly. (B) The ustacks program breaks down the sequence of each stack into k-mers and loads them into a dictionary. The ustacks program breaks down each stack again into k-mers and queries the k-mer Dictionary to create a list of potentially matching stacks, which can be visualized as nodes in a graph connected by the nucleotide distance between them. (C) ustacks merges matched stacks to form putative loci. (D) ustacks matches secondary reads that were not initially placed in a stack against putative loci to increase stack depth. An SNP model in ustacks checks each locus at each nucleotide position for polymorphisms. (E) ustacks calls a consensus sequence and records SNP and haplotype data. (F) The cstacks program loads stacks from the parents of a genetic cross into a Catalog to create a set of all possible loci in a mapping cross. (G) sstacks matches map cross progeny against the Catalog to determine the haplotypes at each locus in every individual in the cross.
Stacks marker types
| Marker type | Female | Male | Number of segregating alleles | Notes |
|---|---|---|---|---|
| ab/aa | Heterozygous | Homozygous | 2 | |
| aa/ab | Homozygous | Heterozygous | 2 | |
| ab/ab | Heterozygous | Heterozygous | 2 | |
| aa/bb | Homozygous | Homozygous | 2 | Detected by cstacks |
| ab/– | Heterozygous | Absent | 2 | Polymorphic RAD-site in male, restriction site mutated in female |
| –/ab | Absent | Heterozygous | 2 | Polymorphic RAD-site in female, restriction site mutated in male |
| ab/cc | Heterozygous | Homozygous | 3 | ab detected by ustacks, cc detected by cstacks |
| cc/ab | Homozygous | Heterozygous | 3 | ab detected by ustacks, cc detected by cstacks |
| ab/ac | Heterozygous | Heterozygous | 3 | |
| ab/cd | Heterozygous | Heterozygous | 4 |
Figure 2 Stacks web interface. (A) The interface allows a researcher to view observed haplotypes at each locus in all individuals. (B) Researchers can click each haplotype to view the stack itself. The interface provides extensive filtering facilities as well as the ability to annotate and export results in a number of formats, including Excel, JoinMap, and R/qtl.
Figure 3 Stacks simulation results. The stickleback reference genome was digested in silico by SbfI, and 60 bp reads were made from each direction from the 22,774 cut sites at several different sequencing depths with several different error rates. The left panel shows the number of (A) loci, (B) stacks, and (C) SNPs observed in the Stacks output. Loci that Stacks assembled incorrectly are displayed in a dark color, whereas loci containing repetitive sequences are shown in a crosshatch pattern. A comparison of the number of loci present in the dataset (A) vs. the number of stacks reconstructed (B) showed that ustacks collapsed repetitive loci but correctly reconstructed nearly all other loci at low and moderate error rates or at high coverage. The right panel shows the number of reads with a certain number of sequencing errors that were incorporated into correct stacks, incorrect stacks, and unused reads for 20× coverage and error rates of (D) 0.5%, (E) 1%, and (F) 3%. As errors accumulated, Stacks excluded more reads, lowering the overall depth, whereas some reads accumulated enough errors to be incorporated into stacks that appeared to be correctly assembled but, in fact, joined stacks representing loci from which they did not originate (indicated by reads with more errors than allowed by the k-mer matching algorithm, four errors in the simulation).
Figure 4 Stacks depth of coverage distribution. (A) Correctly reconstructed stacks have a depth of coverage equal to twice the mean sequencing coverage because the simulation assumes diploid individuals. With no polymorphism or error (gray line), the depth of coverage distribution nearly matched the known simulation distribution (dotted red line), with the exception of repetitive loci, which created the long tail of the distribution to the right, which was truncated at 200× but extends to 17,000×. After adding SNPs, ustacks failed to reconstruct a small number of loci (green arrow) as shown by the increase in stacks with a depth of coverage equal to the sequencing mean depth. (B–C) With the addition of sequencing error and increasing mean sequencing depth, most stacks were still properly reconstructed. Results showed a repeating pattern of improperly reconstructed stacks occurring at multiples of the mean sequencing depth corresponding to the number of loci improperly merged together. The increasing error rate caused a general loss of depth in the stacks (green vs. violet lines).
Figure 5 Danio rerio RAD-tag map compared to the doubled haploid map. We constructed a RAD-seq genetic map of zebrafish (RADmap) using DNA from 42 individuals of the doubled haploid mapping panel (HSmap) that had been previously genotyped by microsatellites or single strand conformation polymorphism (Kelly ; Woods ; Woods ). Stacks recovered the 25 zebrafish linkage groups (Figure S2) with lengths nearly identical to published values (3186 cM in the HSmap vs. 3160 cM in the RADmap). With 7861 markers, our RADmap had nearly twice as many markers as appeared in the HSmap (4073 markers). The insert shows the scale for marker density.
Figure 6 RADmap marker order is consistent with the sequenced zebrafish genome. A specific region on LG20 with no recombination in the RADmap spanned almost 10 Mb in the physical genome (inset). This recombination suppression could be due to a heterozygous inversion present in the genome of the mother of the gynogenetic HS mapping panel.