Glenn T Howe, Jianbin Yu, Brian Knaus, Richard Cronn, Scott Kolpak, Peter Dolan, W Walter Lorenz, Jeffrey F D Dean.
Abstract
BACKGROUND: Douglas-fir (Pseudotsuga menziesii), one of the most economically and ecologically important tree species in the world, also has one of the largest tree breeding programs. Although the coastal and interior varieties of Douglas-fir (vars. menziesii and glauca) are native to North America, the coastal variety is also widely planted for timber production in Europe, New Zealand, Australia, and Chile. Our main goal was to develop a SNP resource large enough to facilitate genomic selection in Douglas-fir breeding programs. To accomplish this, we developed a 454-based reference transcriptome for coastal Douglas-fir, annotated and evaluated the quality of the reference, identified putative SNPs, and then validated a sample of those SNPs using the Illumina Infinium genotyping platform.
Year: 2013 PMID: 23445355 PMCID: PMC3673906 DOI: 10.1186/1471-2164-14-137
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Strategy for assembling the Douglas-fir reference transcriptome and detecting SNPs. We used one Sanger sequence dataset (MG1) and two 454 sequence datasets (MG2 and SG) to assemble the reference transcriptome. We then used these same datasets plus four Illumina short-read datasets (MG2, CB, YK, INT) to detect flanking variants. Orange boxes represent Sanger and 454 datasets, blue boxes represent Illumina short-read datasets, green boxes represent the reference transcriptome, red boxes represent SNP filtering steps, and yellow boxes represent SNP genotyping and analytical steps. The number of SNPs for which Infinium genotyping assays were successfully designed (Assay Design Tool score ≥ 0.6) depends on the probability used for filtering the target SNPs (P < 0.01, 0.001, and 0.0001) and the probability used to mask nucleotides in the flanking regions (P = 0.1, 0.01, 0.001, and 0.0001). Larger P values resulted in more flanking variants and fewer target SNPs with successful designs.
Sequence datasets used to construct the Douglas-fir reference transcriptome*
| Multi-genotype #1 (MG1) | Sanger | 12,157 (100) | 57 (0.47) | 0 (0.00) | 2 (0.02) | 2 (0.02) | 0 (0.00) | 1 (0.01) |
| Multi-genotype #2 (MG2) | GS-FLX Titanium | 1,709,211 (100) | 6649 (0.39) | 1893 (0.11) | 8570 (0.50) | 5519 (0.32) | 7264 (0.42) | 11,114 (0.65) |
| Single-genotype (SG) | GS-FLX Titanium | 1,241,260 (100) | 6582 (0.53) | 1826 (0.15) | 11,070 (0.89) | 10,463 (0.84) | 86,828 (7.00) | 21,849 (1.76) |
| All datasets | | 2,962,628 (100) | 13,288 (0.45) | 3719 (0.13) | 19,642 (0.66) | 15,984 (0.54) | 94,092 (3.18) | 32,964 (1.11) |
* For each dataset, the numbers of reads filtered using the SnoWhite pipeline (Figure 1, Step 3) are shown by sequence type.
† GS-FLX Titanium is the Roche 454 sequencing platform.
Characteristics of the Douglas-fir transcriptome assembly using Newbler v2.3
| Reads used by Newbler* | 2,764,549 | 360 | 392 | 416 | 996,614,802 |
| Reads assembled by Newbler† | 2,544,087 | 364 | 394 | 416 | 925,577,338 |
| Isotigs§ | 38,589 | 1390 | 1141 | 1883 | 53,622,767 |
| Isogroups | 25,002 | 1443 | 1181 | 1864 | 36,069,331 |
| Isogroups with 1 isotig (I1) | 18,774 | 1334 | 1053 | 1750 | 25,046,862 |
| Isogroups with >1 isotig (IM)‡ | 6228 | 1770 | 1547 | 2141 | 11,022,469 |
| Singletons | 102,623 | 356 | 384 | 413 | 36,504,221 |
| Total (isogroups + singletons) | 127,625 | 569 | 413 | 517 | 72,573,552 |
* The input number of reads is less than the total in Table 1 (2.96 × 10⁶) because reads shorter than 50 nt were excluded. Statistics were calculated using the sequences actually used in the assembly after applying a default minimum length of 40 nt for reads trimmed by Newbler.
† Includes reads that assembled as complete reads or as partial reads.
§ Isotigs ≥ 200 nt were deposited at DDBJ/EMBL/GenBank under accession GAEK01000000.
‡ Statistics for the IM isogroups were calculated using the longest isotig in each isogroup.
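The rule in the last footnote (summarize each multi-isotig isogroup by its longest isotig) can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the record layout, and the toy lengths are all hypothetical.

```python
from statistics import mean, median


def longest_isotig_lengths(isotigs):
    """Map each isogroup ID to the length of its longest isotig.

    `isotigs` is an iterable of (isogroup_id, isotig_length) pairs.
    """
    longest = {}
    for isogroup_id, length in isotigs:
        if length > longest.get(isogroup_id, 0):
            longest[isogroup_id] = length
    return longest


# Hypothetical records for illustration only; these are not data from the paper.
toy_isotigs = [
    ("isogroup00001", 1200),
    ("isogroup00001", 1850),
    ("isogroup00002", 900),
    ("isogroup00002", 640),
    ("isogroup00002", 910),
]

lengths = longest_isotig_lengths(toy_isotigs)
summary = {
    "n_isogroups": len(lengths),
    "mean_nt": mean(lengths.values()),
    "median_nt": median(lengths.values()),
    "total_nt": sum(lengths.values()),
}
print(summary)
```

Summarizing by the longest isotig keeps each isogroup from being counted once per alternative isotig when lengths are totalled.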
Figure 2. Taxonomic distributions of Douglas-fir sequences identified as bacterial or fungal contaminants. We used preliminary assemblies of the SG and MG2 datasets and BLAST searches to identify isotigs and singletons resulting from bacterial or fungal contamination (see Methods). Reads corresponding to these singletons and isotigs were removed prior to the final assembly. Numbers in parentheses are the total number of sequences (isotigs and singletons) in each category.
Comparison between Douglas-fir isotigs and white spruce unigenes [16]
| C1 | 1 | No | - | Highest | 5140 | 261 | |
| C2 | 2+ | No | - | Higher | 896 | 88 | |
| C3 | 1 | Yes | No | Higher | 1767 | 577 | |
| C4 | 2+ | Yes | No | Medium | 586 | 159 | |
| C5 | 1 | Yes | Yes | Lower | 1736 | 6974 | |
| C6 | 2+ | Yes | Yes | Lowest | 3405 | 7040 | |
| C7 | No matches | - | - | Unknown | 5244 | 4716 | |
*Douglas-fir (DF) isotigs were categorized into seven classes (C1-C7) and three levels of confidence based on their relationships to white spruce (WS) contigs using the SCARF program [68].
†Number of white spruce contigs that matched the Douglas-fir query.
§‘Yes’ indicates that at least one non-query isotig also matched the same white spruce contig.
‡‘Yes’ indicates that the query and at least one non-query isotig matched the same region of the white spruce contig (overlapped).
#Subjective level of confidence in the isotig assembly based on the information presented in columns 2–4.
@Cross-hatched bars represent white spruce contigs, black bars represent query Douglas-fir isotigs, and white bars represent non-query Douglas-fir isotigs.
Numbers and percentages of Douglas-fir sequences with matches to sequences in three protein databases*
| Database | I1 isogroups† (n) | % | IM isogroups† (n) | % | Singletons§ (n) | % |
| Uniref50 | 15,054 | 80.2 | 3446 | 55.3 | 25,757 | 25.1 |
| TAIR10 | 13,749 | 73.2 | 3260 | 52.3 | 15,907 | 15.5 |
| Annot8r | 11,733 | 62.5 | 2862 | 46.0 | 14,836 | 14.5 |
*Matches were recorded for isogroups and singletons at a tBLASTX E-value < 10⁻⁵.
†Isogroups are Newbler v2.3 isogroups. For the isogroups with more than 1 isotig (IM subset), a hit was counted only if all isotigs matched the same protein in the database.
§Singletons are 454 reads that did not assemble with any other reads.
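The counting rule for multi-isotig isogroups (a hit is recorded only when every isotig matches the same protein) can be sketched as below. This is an illustrative reading of the footnote, assuming each isotig is represented by a single best-matching protein ID or by `None` when it has no match below the E-value cutoff; the function name and the IDs are hypothetical.

```python
def isogroup_has_hit(isotig_best_hits):
    """Return True only if every isotig in the isogroup matched the same protein.

    `isotig_best_hits` maps isotig ID -> best-matching protein ID, or None
    when the isotig had no match below the E-value cutoff (assumed layout).
    """
    proteins = set(isotig_best_hits.values())
    return len(proteins) == 1 and None not in proteins


# Hypothetical isotig and protein IDs for illustration only.
concordant = {"isotig00001": "P1", "isotig00002": "P1"}
discordant = {"isotig00003": "P1", "isotig00004": "P2"}
partial = {"isotig00005": "P1", "isotig00006": None}

print(isogroup_has_hit(concordant), isogroup_has_hit(discordant), isogroup_has_hit(partial))
```

Under this rule an isogroup with a single isotig counts as a hit whenever that isotig has a match, which is consistent with the higher match percentages reported for the I1 subset than for the IM subset.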
Numbers and percentages of Douglas-fir sequences with matches to sequences in the Uniref50 protein database*
| Taxonomic group | I1 isogroups† (n) | % | IM isogroups† (n) | % | Singletons§ (n) | % |
| Conifers | 4088 | 27.16 | 1073 | 31.14 | 6486 | 25.18 |
| Other plants | 9713 | 64.52 | 2047 | 59.40 | 16,061 | 62.36 |
| Other Eukaryotes | 582 | 3.87 | 182 | 5.28 | 658 | 2.55 |
| Invertebrates | 487 | 3.24 | 120 | 3.48 | 1087 | 4.22 |
| Bacteria | 123 | 0.82 | 8 | 0.23 | 830 | 3.22 |
| Environmental | 21 | 0.14 | 6 | 0.17 | 37 | 0.14 |
| Vertebrates | 17 | 0.11 | 6 | 0.17 | 92 | 0.36 |
| Fungi | 19 | 0.13 | 4 | 0.12 | 487 | 1.89 |
| Viruses | 4 | 0.03 | 0 | 0.00 | 19 | 0.07 |
*Matches are grouped by taxonomic affiliation and percentages are relative to the total number of matches (tBLASTX E-value < 10⁻⁵). Numbers of input Douglas-fir sequences are in parentheses.
†Isogroups are Newbler v2.3 isogroups. For the isogroups with more than 1 isotig (IM subset), a hit was counted only if all isotigs matched the same protein in the database.
§Singletons are 454 reads that did not assemble with any other reads.
Figure 3. Distributions of Douglas-fir sequences and genes by GO slim terms. Distributions are shown for Arabidopsis genes (TAIR10 accessions), two types of Douglas-fir isogroups (I1 subset = isogroups with one isotig and IM subset = isogroups with more than one isotig), and Douglas-fir singletons.
Numbers of potential SNPs detected in Douglas-fir using an individual dataset probability value of 10
| Dataset | Seed source | Platform | | Unique SNPs | Shared with coastal | Shared with Yakima | Shared with interior | Total SNPs† |
|---|---|---|---|---|---|---|---|---|
| Multi-genotype #1 (MG1) | Coastal | Sanger | 2.77 | 3982 (2606) | 101,089 (85,635) | 29,922 (25,523) | 81,633 (69,109) | 107,884 (90,487) |
| Multi-genotype #2 (MG2) | Coastal | Roche 454 | | | | | | |
| Single-genotype (SG) | Coastal | Roche 454 | | | | | | |
| Multi-genotype #2 (MG2) | Coastal | Illumina | 64.00 | 18,694 (15,617) | 192,693 (162,560) | 41,952 (35,700) | 146,242 (123,503) | 192,693 (162,560) |
| Coos Bay (CB) | Coastal | Illumina | 13.41 | 1044 (895) | 66,304 (56,547) | 29,051 (24,703) | 53,275 (45,437) | 66,304 (56,547) |
| Yakima (YK) | Yakima | Illumina | 8.99 | 638 (545) | 43,066 (36,621) | - | 40,840 (34,750) | 47,573 (40,505) |
| Interior (INT) | Interior | Illumina | 80.45 | 71,241 (61,334) | 151,014 (127,403) | 40,840 (34,750) | - | 226,124 (192,076) |
*The number of unique SNPs and the number of SNPs shared in other datasets of the coastal, Yakima, and interior seed sources are presented for all isogroups (I1 + IM) and for the 1 isotig per isogroup subset (I1) (in parentheses). The total number of unique SNPs detected in all datasets was 278,979.
†SNP totals are not the sums of the values in the same row because SNPs are double-counted. For example, we detected 66,304 SNPs in the CB dataset, 29,051 of which were detected in the YK dataset and 53,275 of which were detected in the INT dataset.
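The double-counting described in the footnote follows from the fact that the shared-SNP columns are overlapping set intersections. A minimal sketch with toy SNP identifiers (hypothetical, not taken from the paper) shows why the row total is smaller than the sum of the other columns:

```python
# Toy SNP identifiers for three datasets; hypothetical values for illustration.
cb_snps = {"snp1", "snp2", "snp3", "snp4"}
yk_snps = {"snp2", "snp3", "snp5"}
int_snps = {"snp3", "snp4", "snp6"}

total_cb = len(cb_snps)                          # all SNPs detected in CB
shared_with_yk = len(cb_snps & yk_snps)          # snp2, snp3
shared_with_int = len(cb_snps & int_snps)        # snp3, snp4
unique_to_cb = len(cb_snps - yk_snps - int_snps)  # snp1 only

# snp3 occurs in both intersections, so the columns sum to more than the total.
column_sum = unique_to_cb + shared_with_yk + shared_with_int
print(total_cb, column_sum)
```

Here the columns sum to 5 although only 4 SNPs were detected in the toy CB set, because `snp3` is shared with both of the other datasets.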
Douglas-fir SNPs detected using an Illumina Infinium SNP array (n = 260 trees)
| SNP category | Number of SNPs | % of attempted (n = 8769) | % of assayed (n = 8067) |
| SNPs attempted | 8769 | 100.0 | 108.7 |
| SNPs assayed | 8067 | 92.0 | 100.0 |
| Called SNPs (call frequency > 0.85)† | 7256 | 82.7 | 89.9 |
| Called SNPs that are polymorphic (MAF > 0) | 5847 | 66.7 | 72.5 |
| Percent of called SNPs that are polymorphic (5847/7256) = 80.6 |
*The number of SNPs in each category is expressed as a percentage of the total number of SNPs attempted (n = 8769) and number of SNPs successfully assayed on the array (n = 8067).
†Successful calls are those with a GenCall score ≥ 0.15 [19].
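The percentages in this table can be reproduced directly from the four counts. The sketch below uses only the counts given in the table; the helper name `percent` and the dictionary layout are illustrative choices.

```python
# Counts copied from the table above.
counts = {
    "SNPs attempted": 8769,
    "SNPs assayed": 8067,
    "Called SNPs": 7256,
    "Called and polymorphic": 5847,
}


def percent(numerator, denominator):
    """Percentage rounded to one decimal place, as reported in the table."""
    return round(100 * numerator / denominator, 1)


attempted = counts["SNPs attempted"]
assayed = counts["SNPs assayed"]

# For each category: (count, % of attempted, % of assayed).
table = {
    name: (n, percent(n, attempted), percent(n, assayed))
    for name, n in counts.items()
}
polymorphic_among_called = percent(
    counts["Called and polymorphic"], counts["Called SNPs"]
)

for name, row in table.items():
    print(name, *row)
print("Polymorphic among called:", polymorphic_among_called)
```

The 108.7 in the first row arises because the attempted SNPs (8769) are expressed relative to the smaller number of assayed SNPs (8067).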
Characteristics of 5847 successful SNPs based on data from an Illumina Infinium SNP array
| GenTrain score | 0.81 | 0.84 | 0.35-0.98 |
| GC50 score (median GenCall score) | 0.78 | 0.87 | 0.15-0.99 |
| Call frequency† | 0.99 | 1.00 | 0.85-1.00 |
| Minor allele frequency (MAF) | 0.24 | 0.24 | 0.002-0.5 |
| Heterozygosity (observed) | 0.33 | 0.36 | 0.00-1.00 |
| Heterozygosity (expected) | 0.32 | 0.36 | 0.004-0.5 |
| Number of SNPs with a significant HWE deviation = 263 (4.5%)§ | |||
*Successful SNPs are those with a call frequency > 0.85 and MAF > 0 based on an analysis of 260 trees.
† Successful calls are those with a GenCall score ≥ 0.15 [19].
§ Tested using an exact test of HWE and a probability level of 0.9 × 10⁻⁵ (i.e., Bonferroni-corrected P-value of 0.05 based on 5847 SNPs).
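The per-SNP significance threshold in the last footnote is the familywise level divided by the number of tests. The short check below uses only numbers stated in the table and footnotes; the variable names are illustrative, and the exact test of HWE itself is not implemented here.

```python
alpha = 0.05      # familywise significance level stated in the footnote
n_tests = 5847    # number of successful SNPs tested for HWE

# Bonferroni correction: divide the familywise level by the number of tests.
bonferroni_threshold = alpha / n_tests

n_deviating = 263  # SNPs with a significant HWE deviation (from the table)
percent_deviating = round(100 * n_deviating / n_tests, 1)
n_in_hwe = n_tests - n_deviating

print(f"{bonferroni_threshold:.2e}", percent_deviating, n_in_hwe)
```

The threshold of about 8.6 × 10⁻⁶ rounds to the reported 0.9 × 10⁻⁵, and the 5584 SNPs remaining after removing the 263 deviating SNPs match the number of SNPs in HWE given in the figure captions.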
Figure 4. Distributions of minor allele frequencies for successful Douglas-fir SNPs. Open bars represent all 5847 successful SNPs. Solid bars represent 5584 successful SNPs that were in Hardy-Weinberg Equilibrium (HWE). Successful SNPs had call frequencies > 0.85 and were polymorphic. Successful calls are those with GenCall scores ≥ 0.15 [19].
Figure 5. Distributions of expected and observed heterozygosities for successful Douglas-fir SNPs. Open bars represent all 5847 successful SNPs. Solid bars represent 5584 SNPs that were in Hardy-Weinberg Equilibrium (HWE). Successful SNPs had call frequencies > 0.85 and were polymorphic. Successful calls are those with GenCall scores ≥ 0.15 [19].
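For a biallelic SNP, expected heterozygosity under HWE is commonly computed from the minor allele frequency as 2p(1 − p). The sketch below assumes this standard formula; the source does not state which estimator was used (for example, whether a small-sample correction was applied), and the function name is hypothetical.

```python
def expected_heterozygosity(maf):
    """Expected heterozygosity 2p(1 - p) for a biallelic SNP under HWE.

    Assumes `maf` is the minor allele frequency, so 0 <= maf <= 0.5.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must be between 0 and 0.5")
    return 2 * maf * (1 - maf)


# Endpoints of the MAF range reported for the successful SNPs.
he_min = expected_heterozygosity(0.002)
he_max = expected_heterozygosity(0.5)
print(round(he_min, 3), he_max)
```

Under this assumption, the reported MAF range of 0.002 to 0.5 maps onto the reported expected heterozygosity range of 0.004 to 0.5.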