Davoud Torkamaneh1,2, Jérôme Laroche2, François Belzile1,2.
Abstract
Next-generation sequencing (NGS) has revolutionized plant and animal research in many ways, including new methods of high-throughput genotyping. Genotyping-by-sequencing (GBS) has been demonstrated to be a robust and cost-effective genotyping method capable of producing thousands to millions of SNPs across a wide range of species. Undoubtedly, the greatest barrier to its broader use is the challenge of data analysis. Herein we describe a comprehensive comparison of seven GBS bioinformatics pipelines developed to process raw GBS sequence data into SNP genotypes. We compared five pipelines requiring a reference genome (TASSEL-GBS v1 and v2, Stacks, IGST, and Fast-GBS) and two de novo pipelines that do not require a reference genome (UNEAK and Stacks). Using Illumina sequence data from a set of 24 re-sequenced soybean lines, we performed SNP calling with these pipelines and compared the GBS SNP calls with the re-sequencing data to assess their accuracy. The number of SNPs called without a reference genome was lower (13k to 24k) than with a reference genome (25k to 54k SNPs), while accuracy was high (92.3 to 98.7%) for all but one pipeline (TASSEL-GBSv1, 76.1%). Among pipelines offering a high accuracy (>95%), Fast-GBS called the greatest number of polymorphisms (close to 35,000 SNPs + Indels) and yielded the highest accuracy (98.7%). Using Ion Torrent sequence data for the same 24 lines, we compared the performance of Fast-GBS with that of TASSEL-GBSv2. Fast-GBS again called more polymorphisms (25.8K vs 22.9K) and these proved more accurate (95.2 vs 91.1%). Typically, calling SNPs from the same sequencing data with different pipelines produced highly overlapping SNP catalogues (79-92% overlap). In contrast, overlap between SNP catalogues obtained using the same pipeline but different sequencing technologies was less extensive (~50-70%).
Year: 2016 PMID: 27547936 PMCID: PMC4993469 DOI: 10.1371/journal.pone.0161333
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Number of SNPs and indels detected among 24 soybean lines using seven different bioinformatics pipelines on Illumina reads.
The time and amount of memory needed to run each pipeline are also provided.
| Approach | Pipeline | SNPs | Indels | Time | Memory (Gb) |
|---|---|---|---|---|---|
| De novo | Stacks | 13,303 | ND | 3:07 | 7 |
| De novo | UNEAK | 24,743 | ND | 1:11 | 20 |
| Reference-based | TASSEL-GBSv1 | 54,412 | ND | 1:45 | 15 |
| Reference-based | Stacks | 18,941 | ND | 3:30 | 14 |
| Reference-based | IGST | 25,650 | 3,170 | 12:59 | 240 |
| Reference-based | TASSEL-GBSv2 | 28,158 | ND | 4:16 | 18 |
| Reference-based | Fast-GBS | 34,953 | 3,921 | 1:47 | 27 |
* Using a Linux system with 10 CPU and 25G of memory
Accuracy of GBS SNP data derived from the Illumina platform using different bioinformatics pipelines.
| Parameter/Pipeline | Stacks (de novo) | UNEAK (de novo) | TASSEL-GBS v1 (reference-based) | Stacks (reference-based) | IGST (reference-based) | TASSEL-GBS v2 (reference-based) | Fast-GBS (reference-based) |
|---|---|---|---|---|---|---|---|
| Number of SNPs | 13,303 | 24,743 | 54,412 | 18,941 | 25,650 | 28,158 | 34,953 |
| Number of genotypes | 319,272 | 593,832 | 1,305,888 | 454,584 | 615,600 | 675,792 | 838,872 |
| Missing data (%) | 41.3 | 39.4 | 28 | 57.3 | 44 | 35.6 | 46 |
| Heterozygotes (%) | 3.7 | 5.3 | 11.5 | 4.4 | 5.9 | 5.7 | 3.4 |
| Loci with >50% heterozygotes* | 0 | 0 | 1125 | 65 | 324 | 551 | 184 |
| Accuracy (%) | 93.6 | 93.9 | 76.1 | 93.2 | 98.4 | 92.3 | 98.7 |
*These were eliminated from the final catalogue used to estimate accuracy
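The "Number of genotypes" row appears to be the number of SNPs multiplied by the 24 soybean lines in the panel. The following is a minimal sketch of that arithmetic, assuming one genotype call per SNP per line; the pipeline labels and the names `snp_counts` and `genotype_count` are illustrative and do not come from the study.

```python
# Assumption: each SNP yields one genotype call per line, and the panel has 24 lines.
N_LINES = 24

# SNP counts copied from the "Number of SNPs" row of the accuracy table.
snp_counts = {
    "Stacks (de novo)": 13_303,
    "UNEAK": 24_743,
    "TASSEL-GBS v1": 54_412,
    "Stacks (reference)": 18_941,
    "IGST": 25_650,
    "TASSEL-GBS v2": 28_158,
    "Fast-GBS": 34_953,
}


def genotype_count(n_snps: int, n_lines: int = N_LINES) -> int:
    """Total genotype calls if every SNP is scored once in every line."""
    return n_snps * n_lines


genotype_counts = {name: genotype_count(n) for name, n in snp_counts.items()}

for name, total in genotype_counts.items():
    print(f"{name}: {total:,} genotypes")
```

Under this assumption, the computed totals match the values listed in the table for all seven pipelines.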
Degree of overlap among SNP loci called using Fast-GBS and six other bioinformatics pipelines
| Approach | Pipeline | Total SNPs | Common (in %) | Other pipeline only | Fast-GBS only |
|---|---|---|---|---|---|
| De novo | Stacks | 13,303 | 89.1 | 1,450 | 23,100 |
| De novo | UNEAK | 24,743 | 87.5 | 3,172 | 13,382 |
| Reference-based | TASSEL-GBS v1 | 54,412 | 36.7 | 34,420 | 14,961 |
| Reference-based | Stacks | 18,941 | 96.2 | 1,709 | 16,721 |
| Reference-based | IGST | 25,650 | 92.4 | 1,950 | 11,253 |
| Reference-based | TASSEL-GBS v2 | 28,158 | 88.3 | 3,295 | 10,090 |
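One plausible reading of the overlap columns is that the shared SNPs are the other pipeline's total minus its pipeline-specific SNPs, with the percentage taken relative to the other pipeline's total. The sketch below works through that reading for the IGST and TASSEL-GBS v2 rows only; the function name `overlap_with_fast_gbs` and its dictionary keys are hypothetical, and the Fast-GBS total of 34,953 SNPs is taken from the Illumina results.

```python
# Fast-GBS SNP total on Illumina reads, as reported in the tables above.
FAST_GBS_TOTAL = 34_953


def overlap_with_fast_gbs(other_total, other_only, fast_gbs_total=FAST_GBS_TOTAL):
    """Derive the shared count, shared percentage, and Fast-GBS-only count.

    Assumption: 'Common (in %)' is expressed relative to the other
    pipeline's total number of SNPs.
    """
    common = other_total - other_only
    return {
        "common": common,
        "common_pct": round(100 * common / other_total, 1),
        "fast_gbs_only": fast_gbs_total - common,
    }


igst = overlap_with_fast_gbs(other_total=25_650, other_only=1_950)
tassel_v2 = overlap_with_fast_gbs(other_total=28_158, other_only=3_295)

print("IGST:", igst)
print("TASSEL-GBS v2:", tassel_v2)
```

For these two rows the derived percentages and Fast-GBS-only counts agree with the table, which supports the assumed relationship between the columns without confirming it as the authors' method.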
Fig 1. Venn diagram representing the degree of overlap among SNP loci called using seven bioinformatics pipelines.
The percentages indicate the estimated accuracy for all groups of SNPs (unique or shared).
Fig 2. Systematic approach used to investigate the possible causes of unique inaccurate SNP calls.
Number and characteristics of unique inaccurate SNPs called by different pipelines.
| Stacks (de novo) | UNEAK | TASSEL-GBS v1 | Stacks (reference-based) | IGST | TASSEL-GBS v2 | Fast-GBS |
|---|---|---|---|---|---|---|
| 495 (3.7% of 13,303) | 533 (2.2% of 24,743) | 9,828 (18.1% of 54,412) | 103 (0.5% of 18,941) | 207 (0.8% of 25,650) | 558 (2.0% of 28,158) | 272 (0.8% of 34,953) |
| 146 (29.7%) | 72 (13.5%) | 1,126 (11.5%) | 20 (19.4%) | 46 (22.2%) | 132 (23.7%) | 35 (12.9%) |
| 349 (70.3%) | 461 (86.5%) | 8,702 (88.5%) | 83 (80.6%) | 161 (77.8%) | 426 (76.3%) | 237 (87.1%) |
| 45 (13%) | 120 (26%) | 1,828 (21%) | 9 (11%) | 15 (9%) | 60 (14%) | 17 (7%) |
| 304 (87%) | 341 (74%) | 6,875 (79%) | 74 (89%) | 146 (91%) | 366 (86%) | 220 (93%) |
Number of SNPs and indels detected among 24 soybean lines using Ion Torrent reads and two different bioinformatics pipelines
| Approach | Pipeline | SNPs | Indels | Time | Memory (Gb) |
|---|---|---|---|---|---|
| Reference-based | TASSEL-GBSv2 | 22,921 | ND | 3:29 | 17 |
| Reference-based | Fast-GBS | 23,792 | 2,054 | 1:31 | 20 |
* Using a Linux system with 10 CPU and 25G of memory
Accuracy of SNP data derived using Ion Torrent reads and two different bioinformatics pipelines
| Stat type/Pipeline | TASSEL-GBSv2 | Fast-GBS |
|---|---|---|
| Number of SNPs | 22,921 | 23,792 |
| | 37 | 33 |
| | 4,831 | 861 |
| | 6.6 | 4.5 |
| Accuracy (%) | 91.1 | 95.2 |
*These were eliminated from the final catalogue used to estimate accuracy
Fig 3. Venn diagram for overlap of the SNPs called using two different bioinformatics pipelines. (a) Overlap of SNPs called with Fast-GBS using Illumina and Ion Torrent reads. (b) Overlap of SNPs called with TASSEL-GBS v2 using Illumina and Ion Torrent reads. The percentages indicate the estimated accuracy for all groups of SNPs (unique or shared).