Abstract
SUMMARY: GBS-SNP-CROP is a bioinformatics pipeline originally developed to support the cost-effective genome-wide characterization of plant genetic resources through paired-end genotyping-by-sequencing (GBS), particularly in the absence of a reference genome. Since its 2016 release, the pipeline's functionality has greatly expanded, its computational efficiency has improved, and its applicability to a broad set of genomic studies for both plants and animals has been demonstrated. This note details the suite of improvements to date, as realized in GBS-SNP-CROP v.4.0, with specific attention paid to a new integrated metric that facilitates reliable variant identification despite the complications of homologs. Using the new de novo GBS read simulator GBS-Pacecar, also introduced in this note, results show an improvement in overall pipeline accuracy from 66% (v.1.0) to 84% (v.4.0), with a time saving of ∼70%. Both GBS-SNP-CROP versions significantly outperform TASSEL-UNEAK, and v.4.0 resolves the issue of non-overlapping variant calls observed between UNEAK and v.1.0.
Year: 2019 PMID: 30321264 PMCID: PMC6513162 DOI: 10.1093/bioinformatics/bty873
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Comparative summary of GBS-SNP-CROP v.4.0 performance, based on a set of simulated data from GBS-Pacecar
| Pipeline | MR geno | Time (min) | Variants called | Type I error | Type II error | Accuracy |
|---|---|---|---|---|---|---|
| UNEAK | NA | 8.5 | 2642 | 0.9% | 92.5% | 7.5% |
| GSC v.1.0 | 1 | 370.8 | 23 395 | 1.3% | 34.1% | 65.4% |
| GSC v.4.0 | 1 | 121.7 | 29 738 | 0.6% | 15.6% | 84.0% |
| GSC v.4.0 | 5 | 156.9 | 26 885 | 0.6% | 23.6% | 76.0% |
| GSC v.4.0 | 10 | 171.5 | 26 854 | 0.5% | 23.7% | 76.1% |
| GSC v.4.0 | 15 | 179.1 | 26 897 | 0.5% | 23.6% | 76.1% |
| GSC v.4.0 | 20 | 183.0 | 26 892 | 0.5% | 23.6% | 76.1% |
| GSC v.4.0 | 25 | 163.2 | 26 901 | 0.5% | 23.5% | 76.2% |
Note: In total, 25 000 SNPs and 10 000 indels were simulated across a genomic space of 100 000 GBS fragments. A total of 60 002 165 single-end reads were simulated for a population of 25 individuals (average of 2.4 million reads per genotype), with a sequencing error rate of 1.1%. See Supplementary Table S1 for more details.
Pipeline: UNEAK = TASSEL-UNEAK; GSC = GBS-SNP-CROP.
MR geno: The number of genotypes used for mock reference (MR) assembly.
Time (min): Computation time (minutes) required to run the full analysis on a Unix workstation with 16 GB RAM and a 2.6 GHz Dual Intel processor.
Variants called: Number of variants called by a pipeline (note: a total of 35 000 variants were simulated, consisting of 25 000 SNPs and 10 000 indels).
Type I error: Percentage of called variants that could not be validated (false positives).
Type II error: Percentage of true, simulated variants that were not detected by the pipeline.
Accuracy: Overall accuracy, computed as 100 × [number of validated variants / (total number of simulated variants + number of non-validated variants)].
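The three error definitions above can be made concrete with a short sketch. The function and class names below (`compute_metrics`, `approx_validated`, `PipelineMetrics`) are hypothetical and are not part of GBS-SNP-CROP. The table does not report raw validated counts, so the example back-calculates an approximate count from the rounded Type I error, which is an assumption. The results should therefore only land within a few tenths of a percentage point of the tabulated values.

```python
from dataclasses import dataclass

# From the table note: 25 000 SNPs + 10 000 indels were simulated.
TOTAL_SIMULATED = 35_000


@dataclass(frozen=True)
class PipelineMetrics:
    type_i_error: float   # % of called variants that could not be validated
    type_ii_error: float  # % of simulated variants that were not detected
    accuracy: float       # 100 * validated / (simulated + non-validated)


def compute_metrics(called: int, validated: int,
                    total_simulated: int = TOTAL_SIMULATED) -> PipelineMetrics:
    """Apply the three footnote definitions to raw variant counts."""
    if called <= 0 or total_simulated <= 0:
        raise ValueError("called and total_simulated must be positive")
    if not 0 <= validated <= called:
        raise ValueError("validated must lie between 0 and called")
    non_validated = called - validated
    return PipelineMetrics(
        type_i_error=100 * non_validated / called,
        type_ii_error=100 * (total_simulated - validated) / total_simulated,
        accuracy=100 * validated / (total_simulated + non_validated),
    )


def approx_validated(called: int, type_i_percent: float) -> int:
    """ASSUMPTION: estimate the validated count from the rounded Type I
    error in the table, because raw validated counts are not listed."""
    return called - round(called * type_i_percent / 100)


# GSC v.4.0 with MR geno = 1, and TASSEL-UNEAK, using the tabulated values.
v4 = compute_metrics(29_738, approx_validated(29_738, 0.6))
uneak = compute_metrics(2_642, approx_validated(2_642, 0.9))

# Relative time saving of v.4.0 (121.7 min) over v.1.0 (370.8 min), MR geno = 1.
time_saving = 100 * (370.8 - 121.7) / 370.8

print(f"GSC v.4.0: Type II = {v4.type_ii_error:.1f}%, accuracy = {v4.accuracy:.1f}%")
print(f"UNEAK:     Type II = {uneak.type_ii_error:.1f}%, accuracy = {uneak.accuracy:.1f}%")
print(f"Time saving v.1.0 -> v.4.0: {time_saving:.1f}%")
```

Because non-validated calls appear in the accuracy denominator, a pipeline is penalized for false positives as well as for missed variants, so a pipeline that calls few variants cannot score well on precision alone.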