| Literature DB >> 26754002 |
Arthur T O Melo1, Radhika Bartaula2, Iago Hale3.
Abstract
BACKGROUND: With its simple library preparation and robust approach to genome reduction, genotyping-by-sequencing (GBS) is a flexible and cost-effective strategy for SNP discovery and genotyping, provided an appropriate reference genome is available. For resource-limited curation, research, and breeding programs of underutilized plant genetic resources, however, even low-depth references may not be within reach, despite declining sequencing costs. Such programs would find value in an open-source bioinformatics pipeline that can maximize GBS data usage and perform high-density SNP genotyping in the absence of a reference.Entities:
Mesh:
Year: 2016 PMID: 26754002 PMCID: PMC4709900 DOI: 10.1186/s12859-016-0879-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Outline of the GBS-SNP-CROP workflow, featuring inputs and outputs of all seven steps (scripts)
| Input file(s) | Output file(s) | Timea (hrs:mins) | |
|---|---|---|---|
| Stage 1. Process the raw GBS data | |||
|
| - CASAVA generated paired-end (R1, R2) files (.fastq.gz) | - Parsing summary information (.txt) | 2:24 |
| - Read length distribution summary (.txt) | |||
| - Barcode-ID file (.txt) | - Parsed paired-end [PE] reads (.fastq) | ||
| - Parsed, unpaired R1 reads (.fastq) | |||
|
| - Parsed PE reads (.fastq) | - High quality, parsed PE reads (.fastq) | 0:10 |
| - High quality, parsed singletons (.fastq) | |||
|
| - One pair (R1, R2) of high quality files (.fastq) per library | - One pair (R1, R2) of high quality files (.fastq) per genotype | 0:16 |
| - Barcode-ID file (.txt) | |||
| Stage 2. Build the Mock Reference | |||
|
| - Genotype-specific PE files (.fastq) | - Mock Reference [centroids] (.fasta) | 0:14b |
| - Barcode-ID file (.txt) | - Mock Reference [genome] (.fasta) | ||
| Stage 3. Map the processed reads and generate standardized alignment files | |||
|
| - Genotype-specific high quality PE files (.fastq) | - Filtered reads (.bam) | 3:36 |
| - Sorted BAM files (.sorted.bam) | |||
| - Reference or MR [genome] (.fasta) | - Indexed BAM files (.sorted.bam.bai) | ||
| - Barcode-ID file (.txt) | - Indexed reference or MR (.fasta.idx) | ||
| - One base call alignment summary file (.mpileup) per genotype | |||
|
| - One base call alignment summary file (.mpileup) per genotype | - One base call alignment summary count file (.txt) per genotype | 4:37 |
| - Barcode-ID file (.txt) | - SNP discovery master matrix (.txt) | ||
| Stage 4. Call SNPs and Genotypes | |||
|
| - SNP discovery master matrix (.txt) | - SNP genotyping matrix for the population (.txt) | 0:04 |
a The computation times presented here are specific to the particular dataset in this study
b The time to build the Mock Reference using only the single most read-abundant genotype (-MR01). Using the five most read abundant genotypes and using all 48 genotypes, the required computation time for this step increases to 0:55 and 4:30, respectively (see Table 2)
Fig. 1Schematic of the four stages of the SNP-GBS-CROP workflow
Performance of GBS-SNP-CROP under three different sampling strategies for building the Mock Reference: Using all 48 individuals in the population (MR48), using only the 5 individuals with the highest number of parsed reads (MR05), and using only the single most read-abundant genotype (MR01)
| Pipelines | Total number of centroids used to build the Mock Referencea | Total number of paired-end reads used for SNP callingb | Number of SNPs calledc | Avg. depthd | Hetero (%)e | Homo (%)f | Missing data (%)g | Time (hrs:mins)h |
|---|---|---|---|---|---|---|---|---|
| GBS-SNP-CROP-MR48 | 1,276,734 | 92,667,123 | 14,712 | 70.74 | 32.47 | 59.31 | 8.20 | 14:30 |
| GBS-SNP-CROP-MR05 | 500,795 | 132,920,383 | 20,226 | 71.02 | 34.50 | 57.18 | 8.31 | 12:06 |
| GBS-SNP-CROP-MR01 | 229,549 | 154,506,669 | 21,318 | 69.34 | 34.51 | 56.85 | 8.29 | 11:03 |
a Total number of non-redundant consensus sequences (centroids) identified via clustering to represent the GBS fragment space. This is also the number of FASTA entries in the “MockRef_Clusters.fasta” file
b Number of reads retained by the pipeline after mapping procedures and thus used for SNP calling
c Total number of SNPs called, given all SNP calling filters and genotyping criteria described in the text
d Average read depth for all SNPs across the entire population
e Percentage of heterozygous genotype calls
f Percentage of homozygous genotype calls
g Percentage of missing cells (i.e. no genotype call for a given SNP*accession combination) in the final SNP genotype matrix
h The total computation time required for all pipeline analysis when executed on a Unix workstation with 16 GB RAM and a 2.6 GHz Dual Intel processor
Fig. 2Structure of the final SNP genotyping matrix. As shown here, the GBS-SNP-CROP final genotyping matrix contains summary statistics as well as complete genotype-specific alignment data for each SNP called. The cells in red represent instances in which a genotypic state could not be assigned, either due to insufficient read depth (-|0/4) or a read depth ratio outside of the user-specified acceptable range (-|132/5)
Comparative data usage and computation times for five different analyses of 150 bp paired-end GBS data from 48 accessions of Actinidia arguta
| Pipeline | Min required read length (bp)a | Max usable read length (bp)b | Total number of usable R1 readsc | Total usable bases (Gb)d | Time (hrs:mins)e |
|---|---|---|---|---|---|
|
| |||||
| GBS-SNP-CROP-RG | 32 | NA | 128,577,030 | 16.82 | 8:30 |
| TASSEL-GBS-mxTagL32 | 50 | 32 | 120,593,880 | 3.85 | 0:35 |
| TASSEL-GBS-mxTagL64 | 75 | 64 | 105,908,174 | 6.77 | 1:10 |
|
| |||||
| GBS-SNP-CROP-MR01 | 32 | NA | 128,577,030 | 16.82 | 11:03 |
| TASSEL-UNEAK | 32 | 64 | 134,352,640 | 8.60 | 0:27 |
a GBS-SNP-CROP utilizes the entire R1 and R2 paired-end sequences of all parsed and quality trimmed reads longer than a user-specified (i.e. adjustable) minimum length, in this case 32 bp. The TASSEL-GBS pipelines utilize a uniform user-specified portion (e.g. 32 bp, 64 bp) from the beginning of acceptable R1 (single-end) reads that exceed a minimum length (e.g. 50 bp, 75 bp) before barcode and cut site trimming. TASSEL-UNEAK utilizes up to 64 bp from the beginning of acceptable R1 (single-end) reads that exceed a minimum length of 32 bp after barcode and cut site trimming
b The maximum length of sequences utilized by GBS-SNP-CROP is set by the sequencing platform (e.g. 100 bp, 150 bp, etc.). In TASSEL-GBS, the user specifies a maximum tag length, thereby effectively setting a uniform tag length. The maximum usable sequence length in TASSEL-UNEAK is 64 bp, with all shorter reads greater than 32 bp padded with poly-A’s to a uniform 64 bp tag length
c The number of R1 (i.e. single-end) reads ultimately used by each pipeline, after filtering based on quality and read length requirements. The R1 (single-end) counts are shown here to facilitate comparison across pipelines. Because GBS-SNP-CROP utilizes paired-end reads, the total number of actual reads used (R1 and R2) is twice this number
d The total number of nucleotides of sequence data used in each analysis
e The total computation time required for each analysis when executed on a Unix workstation with 16 GB RAM and a 2.6 GHz Dual Intel processor
Comparative pipeline performances before (4A) and after (4B) depth-based genotyping criteria and population-level SNP calling filters for 150 bp paired-end GBS data from 48 accessions of Actinidia arguta
| Pipeline | Number of SNPsa | Average depth [D]b | Reads with D ≥ 20 (%)c | Hetero (%)d | Homo (%)e | Missing data (%)f |
|---|---|---|---|---|---|---|
| 4A. No SNP calling or genotyping filters appliedg | ||||||
| GBS-SNP-CROP-MR01 | 56,598 | 44.47 | 52.85 | 26.19 | 57.25 | 16.55 |
| GBS-SNP-CROP-RG | 23,564 | 47.39 | 39.00 | 26.10 | 51.09 | 22.80 |
| TASSEL-UNEAK | 12,905 | 6.98 | 9.61 | 13.93 | 52.12 | 33.94 |
| TASSEL-GBS-mxTagL32 | 19,095 | 134.18 | 49.20 | 16.86 | 71.66 | 11.46 |
| TASSEL-GBS-mxTagL64 | 25,005 | 34.65 | 35.60 | 19.21 | 65.80 | 14.98 |
| 4B. Depth-based genotyping criteria and population-level SNP calling filters appliedh | ||||||
| GBS-SNP-CROP-MR01 | 21,318 | 69.34 | 99.92 | 34.51 | 56.85 | 8.64 |
| GBS-SNP-CROP-RG | 5,471 | 77.11 | 99.85 | 38.31 | 53.29 | 8.40 |
| TASSEL-UNEAK | 1,160 | 44.70 | 83.62 | 31.66 | 66.61 | 1.73 |
| TASSEL-GBS-mxTagL32 | 5,593 | 64.41 | 92.52 | 26.33 | 71.58 | 2.09 |
| TASSEL-GBS-mxTagL64 | 8,907 | 51.42 | 78.07 | 27.80 | 69.36 | 2.84 |
a Total number of SNPs called within each pipeline, under the indicated SNP calling filters and genotyping criteria
b Average read depth for all SNPs across the entire population
c Percentage of called SNPs with an average read depth of at least 20
d Percentage of heterozygous genotype calls
e Percentage of homozygous genotype calls
f Percentage of missing cells (i.e. no genotype call for a given SNP*accession combination)
g Liberal pipeline results in the absence of subsequent SNP calling or genotyping filters
h Pipeline results after culling SNPs with less than 75 % scored genotypes, with D ≤ 4 (low depth), or D ≥ 200 (over-represented sequences). Further reduction is due to imposing stringent depth-based genotyping criteria, including a minimum read depth of 11 for homozygotes when the secondary allele count is zero, a minimum depth of 48 for homozygotes when the secondary allele count is one, a minimum depth of 3 for each allele for heterozygotes, and a read-depth ratio of the lower- to higher-depth allele greater than 0.1
Fig. 3Bar plot showing the extent of marker overlap among the five evaluated pipelines. The sets of SNPs called by the five pipelines are largely orthogonal to one another, as shown by the fact that both the reference-based and reference-independent pipelines call high proportions of SNPs called by no other pipeline (grey bars). Shared SNPs among pipelines are indicated by color-coordinated bars. Whereas only 0.6 and 0.4 % of the 8,907 and 5,593 SNPs called by TASSEL-GBS-64 and TASSEL-32, respectively, were identified by TASSEL-UNEAK, 33.7 % of the SNPs called by GBS-SNP-CROP-RG were called by GBS-SNP-CROP-MR01
Comparative pipeline performances, in terms of consistency in genotyping biological replicates
| cv. ‘Opitz Male’ | cv. ‘Dumbarton Oaks’ | ||||||
|---|---|---|---|---|---|---|---|
| Pipelines | Number of SNPsa | Gower genetic similarityb | Pearson correlationc | Shared genotype calls (%)d | Gower genetic similarity | Pearson correlation | Shared genotype calls (%) |
| GBS-SNP-CROP-MR01 | 21,318 | 0.999 | 0.998 | 99.9 | 0.998 | 0.997 | 99.8 |
| GBS-SNP-CROP-RG | 5,471 | 0.999 | 0.998 | 99.9 | 0.998 | 0.997 | 99.9 |
| TASSEL-UNEAK | 1,160 | 0.935 | 0.948 | 93.6 | 0.950 | 0.961 | 94.8 |
| TASSEL-GBS-32 bp | 5,593 | 0.967 | 0.909 | 96.3 | 0.969 | 0.922 | 96.4 |
| TASSEL-GBS-64 bp | 8,907 | 0.967 | 0.920 | 96.7 | 0.966 | 0.919 | 96.6 |
a The total number of SNPs used in this analysis refers to numbers from Table 4B
b A modified Gower’s general Coefficient of Similarity [48], ranging from 0 to 1, to quantify identity-by-state based on bi-allelic SNPs
c Pearson correlation calculated using R [31]; for all correlations in the table, p-value < 0.01
d The percentage of SNPs with exact genotype matches for the two biological replicates. All loci with missing data for either replicate were discarded