Daniel P Wickland, Gopal Battu, Karen A Hudson, Brian W Diers, Matthew E Hudson.
Abstract
BACKGROUND: Genotyping-by-sequencing (GBS), a method to identify genetic variants and quickly genotype samples, reduces genome complexity by using restriction enzymes to divide the genome into fragments whose ends are sequenced on short-read sequencing platforms. While cost-effective, this method produces extensive missing data and requires complex bioinformatics analysis. GBS is most commonly used on crop plant genomes, and because crop plants have highly variable ploidy and repeat content, the performance of GBS analysis software can vary by target organism. Here we focus our analysis on soybean, a polyploid crop with a highly duplicated genome, relatively little public GBS data and few dedicated tools.
Keywords: Bioinformatics pipelines; Crops; GBS; Soybean; Variant calling; WGS
Year: 2017 PMID: 29281959 PMCID: PMC5745977 DOI: 10.1186/s12859-017-2000-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
GBS library data for the three populations analyzed in this study
| | Population 1 | Population 2 | Population 3 |
|---|---|---|---|
| Description | F2 from cross between Prize and mutagenized Williams 82 | F2 from cross between two breeding lines | 81 unrelated lines |
| Number of samples | 378 | 391 | 200 |
| Sequencer | Illumina HiSeq2500 | Illumina HiSeq4000 | Illumina HiSeq2500 |
| Read length | 100 bp | 100 bp | 100 bp |
| Number of reads | 234,574,472 (single-end) | 392,001,642 (single-end) | 247,063,538 (single-end) |
| Average depth per sequenced base | 1.87 reads | 3.63 reads | 4.47 reads |
| Average percent of genome covered by at least 1 read | 2.29 | 2.02 | 2.35 |
| Average percent of genome covered by at least 2 reads | 1.08 | 1.42 | 1.71 |
DNA was extracted using the CTAB method [19] except for the Prize x NMU-mutagenized Williams 82 population (Population 1), which used the E-Z 96 Plant DNA kit (Omega Bio-Tek, Norcross, GA). All libraries were sequenced at the low coverage typical of plant breeding experiments, with average depth per sequenced base ranging from 1.87× to 4.47×.
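The depth and coverage rows of the table can be illustrated with a small sketch. It assumes a per-base depth vector is already available (one integer per reference position), and it assumes "average depth per sequenced base" means the mean depth over positions hit by at least one read. The function name and the toy data are hypothetical.

```python
def coverage_summary(depths, thresholds=(1, 2)):
    """Summarise a per-base depth vector (one integer per reference position)."""
    genome_len = len(depths)
    if genome_len == 0:
        raise ValueError("depths must not be empty")

    # Assumed definition: average only over bases that received >= 1 read.
    covered = [d for d in depths if d > 0]
    summary = {
        "mean_depth_sequenced": sum(covered) / len(covered) if covered else 0.0,
    }

    # Percent of the genome covered by at least k reads, for each threshold k.
    for k in thresholds:
        n_bases = sum(1 for d in depths if d >= k)
        summary[f"pct_ge_{k}"] = 100.0 * n_bases / genome_len
    return summary


# Toy 10-bp "genome": only two positions are sequenced, as in a sparse GBS library.
toy_depths = [0, 0, 0, 0, 0, 0, 3, 1, 0, 0]
print(coverage_summary(toy_depths))
```

Because GBS samples only a small fraction of the genome, the mean depth over sequenced bases can be several reads even when only about 2% of positions are covered.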
Major steps of the 5 GBS workflows analyzed
| | TASSEL-GBS | IGST | Fast-GBS | Stacks | GB-eaSy |
|---|---|---|---|---|---|
| Demultiplex reads | GBSSeqToTagDBPlugin, TagExportToTagDBPlugin | Sabre | Sabre | process_radtags | GBSX |
| Trim adapters | cutadapt* | trimAdaptor3.py | cutadapt | process_radtags | GBSX |
| Align to reference | bwa-mem* | bwa-aln | bwa-mem | bwa-mem* | bwa-mem |
| Call SNPs | DiscoverySNPCallerPluginV2, ProductionSNPCallerPluginV2 | SAMtools/BCFtools | Platypus | pstacks, cstacks, sstacks, populations | BCFtools |
Each workflow uses a different series of tools to carry out read demultiplexing, adapter trimming, alignment to the reference genome, and SNP calling.
*Step performed manually outside the workflow.
WGS library data for six lines
| | Prize | LG12 | Magellan | Maverick | Prohio | Skylla |
|---|---|---|---|---|---|---|
| Population of origin | Population 1 | Population 2 | Population 3 | Population 3 | Population 3 | Population 3 |
| Read length | 100 bp | 150 bp | 150 bp | 150 bp | 150 bp | 150 bp |
| Number of reads | 130,404,160 (paired-end) | 43,756,742 (paired-end) | 12,880,066 (paired-end) | 19,038,600 (paired-end) | 34,177,159 (paired-end) | 23,190,927 (paired-end) |
| Coverage (LN / G) | 13.65 | 6.87 | 2.02 | 2.99 | 5.37 | 3.64 |
| Percent of genome covered by at least 1 read | 98.67 | 97.76 | 74.38 | 94.06 | 98.36 | 96.16 |
| Percent of genome covered by at least 2 reads | 98.31 | 97.04 | 73.03 | 85.18 | 97.27 | 90.36 |
Prize and LG12 were also included in GBS Populations 1 and 2, respectively. Magellan, Maverick, Prohio and Skylla were included in GBS Population 3. Coverage was computed as the product of read length (L) and number of reads (N), divided by genome size (G).
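As a worked check of the coverage formula, the sketch below recomputes the coverage row from the read lengths and read counts in the table. The genome size is not stated in this section; the 955 Mb value is an assumption back-calculated from the table, and the function name is illustrative.

```python
# Assumption: genome size is not given in the text; 955 Mb is back-calculated
# from the table's own coverage values.
GENOME_SIZE_BP = 955_000_000


def coverage(read_length_bp, n_reads, genome_size_bp=GENOME_SIZE_BP):
    """Coverage = L * N / G (read length x number of reads / genome size)."""
    return read_length_bp * n_reads / genome_size_bp


# (read length in bp, number of reads) taken from the WGS table.
wgs_libraries = {
    "Prize": (100, 130_404_160),
    "LG12": (150, 43_756_742),
    "Magellan": (150, 12_880_066),
    "Maverick": (150, 19_038_600),
    "Prohio": (150, 34_177_159),
    "Skylla": (150, 23_190_927),
}

for line, (length, reads) in wgs_libraries.items():
    print(f"{line}: {coverage(length, reads):.2f}x")
```

Under that assumed genome size, the computed values round to the coverage figures reported in the table.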
Fig. 1 Number of SNPs identified by each pipeline in 3 populations. SNPs with a minimum read depth of 2 reads are shown.
Fig. 2 SNP overlap among 5 GBS pipelines. Panel a shows overlap for the 3 populations. Panel b shows overlap for 6 lines from those populations: Prize is from GBS Population 1, LG12 is from GBS Population 2, and the four remaining lines are from GBS Population 3. SNPs with a minimum read depth of 2 reads are shown. All SNPs were called relative to the Williams 82 reference genome.
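An overlap comparison of this kind can be sketched with plain set operations. This is an illustration only: it assumes each pipeline's calls have been reduced to a set of (chromosome, position) pairs, which may differ from how the figure was produced, and the sites shown are made up.

```python
from collections import Counter


def overlap_summary(calls):
    """calls: dict mapping pipeline name -> set of (chrom, pos) SNP sites."""
    # Count how many pipelines report each site.
    support = Counter(site for sites in calls.values() for site in sites)

    shared_by_all = {site for site, n in support.items() if n == len(calls)}
    unique_to = {
        name: {site for site in sites if support[site] == 1}
        for name, sites in calls.items()
    }
    return shared_by_all, unique_to


# Hypothetical toy calls; chromosome names and positions are invented.
toy_calls = {
    "TASSEL-GBS": {("Gm01", 100), ("Gm01", 250)},
    "Fast-GBS": {("Gm01", 100), ("Gm02", 40)},
    "GB-eaSy": {("Gm01", 100), ("Gm01", 250), ("Gm03", 7)},
}

shared, unique = overlap_summary(toy_calls)
print("shared by all pipelines:", sorted(shared))
for name, sites in unique.items():
    print(f"unique to {name}:", sorted(sites))
```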
Fig. 3 Comparisons between GBS SNPs and WGS SNPs for 6 individual soybean lines. Prize is from GBS Population 1, LG12 is from GBS Population 2, and the four remaining lines are from GBS Population 3. Panel a shows the total number of SNPs identified in each line by 5 GBS pipelines. Panel b shows the percent of GBS SNP sites from panel a in agreement with WGS for each line. Panels c and d show the percent and number (respectively) of GBS SNP alleles from panel a in agreement with WGS. SNPs with a minimum read depth of 2 reads are shown. Below each soybean line is shown its average depth of sequenced GBS bases followed by its WGS coverage. All SNPs were called relative to the Williams 82 reference genome.
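The site-level and allele-level agreement measures in panels b to d can be sketched as follows. The definitions here are assumptions, since this section does not spell them out: a site "agrees" when WGS also calls a SNP at the same position, and an allele "agrees" when the alternate allele also matches. Names and data are hypothetical.

```python
def concordance(gbs, wgs):
    """gbs, wgs: dict mapping (chrom, pos) -> alternate allele called at that site."""
    if not gbs:
        raise ValueError("gbs must contain at least one SNP")

    # Site agreement: the GBS position is also a SNP in the WGS call set.
    site_match = [site for site in gbs if site in wgs]
    # Allele agreement: same position and same alternate allele.
    allele_match = [site for site in site_match if gbs[site] == wgs[site]]

    return {
        "n_gbs_snps": len(gbs),
        "pct_sites_in_agreement": 100.0 * len(site_match) / len(gbs),
        "n_alleles_in_agreement": len(allele_match),
        "pct_alleles_in_agreement": 100.0 * len(allele_match) / len(gbs),
    }


# Hypothetical calls for one line.
gbs_calls = {("Gm01", 100): "A", ("Gm01", 250): "T", ("Gm02", 40): "G", ("Gm03", 7): "C"}
wgs_calls = {("Gm01", 100): "A", ("Gm01", 250): "C", ("Gm02", 40): "G"}
print(concordance(gbs_calls, wgs_calls))
```

Both percentages use the total number of GBS SNPs as the denominator, so allele agreement can never exceed site agreement under these definitions.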
Missing data fraction generated by each GBS pipeline
| | TASSEL | IGST | Fast-GBS | Stacks | GB-eaSy |
|---|---|---|---|---|---|
| Population 1 | | | | | |
| Missing data per line | 84.5% | 85.4% | 85.0% | 89.7% | 83.4% |
| SNPs in 25% of lines | 6812 | 12,334 | 18,731 | 3576 | 23,633 |
| SNPs in 50% of lines | 1237 | 1714 | 2984 | 202 | 3558 |
| SNPs in 75% of lines | 736 | 112 | 382 | 31 | 407 |
| SNPs in 90% of lines | 335 | 25 | 75 | 2 | 119 |
| Population 2 | | | | | |
| Missing data per line | 59.4% | 70.8% | 70.0% | 66.1% | 71.5% |
| SNPs in 25% of lines | 65,119 | 68,805 | 122,801 | 142,154 | 120,437 |
| SNPs in 50% of lines | 35,107 | 39,055 | 76,485 | 52,991 | 76,717 |
| SNPs in 75% of lines | 2185 | 1548 | 4418 | 372 | 4880 |
| SNPs in 90% of lines | 973 | 26 | 219 | 21 | 187 |
| Population 3 | | | | | |
| Missing data per line | 62.4% | 69.3% | 68.4% | 67.2% | 69.6% |
| SNPs in 25% of lines | 54,960 | 65,695 | 88,904 | 69,300 | 88,025 |
| SNPs in 50% of lines | 18,859 | 22,369 | 32,077 | 19,756 | 32,698 |
| SNPs in 75% of lines | 6196 | 7813 | 12,204 | 4539 | 13,005 |
| SNPs in 90% of lines | 775 | 479 | 934 | 98 | 1352 |
The average percent of missing data per line is shown, as well as the number of SNPs detected at various proportions within each population.
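The two kinds of statistic in this table can be sketched from a genotype matrix. The sketch assumes rows are SNPs, columns are lines, and `None` marks a missing call; it also assumes "SNPs in 25% of lines" means SNPs called in at least 25% of lines. Function names and data are illustrative.

```python
def missing_pct_per_line(matrix):
    """Percent of SNPs with no call, for each line (column)."""
    n_snps = len(matrix)
    n_lines = len(matrix[0])
    return [
        100.0 * sum(1 for row in matrix if row[j] is None) / n_snps
        for j in range(n_lines)
    ]


def snps_called_in_at_least(matrix, fraction):
    """Number of SNPs (rows) genotyped in at least `fraction` of the lines."""
    n_lines = len(matrix[0])
    return sum(
        1
        for row in matrix
        if sum(g is not None for g in row) >= fraction * n_lines
    )


# Toy matrix: 4 SNPs x 4 lines, genotypes coded 0/1/2, None = missing.
toy = [
    [0, 1, None, 0],
    [None, None, None, 2],
    [0, 0, 0, 0],
    [None, 1, None, None],
]

print("missing % per line:", missing_pct_per_line(toy))
for frac in (0.25, 0.50, 0.75, 0.90):
    print(f"SNPs in >= {frac:.0%} of lines:", snps_called_in_at_least(toy, frac))
```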
Wall-clock time to completion for each GBS pipeline (h:mm)
| | TASSEL | IGST | Fast-GBS | Stacks | GB-eaSy |
|---|---|---|---|---|---|
| Population 1 | 2:08 | 12:17 | 3:20 | 8:36 | 5:21 |
| Population 2 | 4:58 | 18:46 | 8:01 | 16:34 | 6:51 |
| Population 3 | 3:38 | 11:28 | 4:06 | 10:15 | 4:23 |
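
To compare the pipelines across all three populations, the h:mm values can be converted to minutes and summed. The sketch below does only that arithmetic on the table's values; the helper names are hypothetical.

```python
def to_minutes(hmm):
    """Convert an 'h:mm' string to minutes."""
    hours, minutes = hmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hmm(total_minutes):
    """Convert minutes back to an 'h:mm' string."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


# Wall-clock times from the table, in population order 1, 2, 3.
runtimes = {
    "TASSEL": ["2:08", "4:58", "3:38"],
    "IGST": ["12:17", "18:46", "11:28"],
    "Fast-GBS": ["3:20", "8:01", "4:06"],
    "Stacks": ["8:36", "16:34", "10:15"],
    "GB-eaSy": ["5:21", "6:51", "4:23"],
}

totals = {name: sum(to_minutes(t) for t in times) for name, times in runtimes.items()}
for name, minutes in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{name}: {to_hmm(minutes)} total")
```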