Orzenil Bonfim Silva-Junior1,2, Dario Grattapaglia1,2, Evandro Novaes3, Rosane G Collevatti4.
Abstract
Targeted sequence capture coupled to high-throughput sequencing has become a powerful method for the study of genome-wide sequence variation. Following our recent development of a genome assembly for the Pink Ipê tree (Handroanthus impetiginosus), a widely distributed Neotropical timber species, we now report the development of a set of 24,751 capture probes for single-nucleotide polymorphism (SNP) characterization and genotyping across 18,216 distinct loci, sampling more than 10 Mbp of the species genome. This system identifies nearly 200,000 SNPs located inside or in close proximity to almost 14,000 annotated protein-coding genes, generating quality genotypic data in populations spanning wide geographic distances across the species' native range. To provide recommendations for future developments of similar systems for highly heterozygous plant genomes, we investigated issues such as probe design, sequencing coverage and bioinformatics, including the evaluation of the capture efficiency and a reassessment of the technical reproducibility of the assay for SNP recall and genotyping precision. Our results highlight the value of a detailed probe screening on a preliminary genome assembly to produce reliable data for downstream genetic studies. This work should inspire and assist the development of similar genomic resources for other orphan crops and forest trees with highly heterozygous genomes.
Year: 2018 PMID: 30020434 PMCID: PMC6191306 DOI: 10.1093/dnares/dsy023
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. Flowchart of the sequence of steps and corresponding input and output results in terms of sequence data, probes and SNPs obtained along the development and evaluation of the sequence capture system for Handroanthus impetiginosus.
Figure 2. Distribution of the distances of off-target SNPs to the closest probe sequence coordinate. (A) The narrow spectrum of distances indicates that most of the identified SNPs are located up to 200 bp upstream or downstream of a targeted location in the genome. (B) The wide spectrum of distances shows that off-target variants were found spread across the genome in regions up to 25 kbp from the closest targeted probe location.
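The distance plotted in Figure 2 can be illustrated with a minimal sketch. It assumes probe intervals on a single scaffold that are sorted, non-overlapping and given as inclusive coordinates; the function name and the example coordinates are hypothetical and are not taken from the study.

```python
from bisect import bisect_left

def distance_to_closest_probe(snp_pos, probes):
    """Distance (bp) from a SNP to the nearest probe interval.

    probes: list of (start, end) tuples on one scaffold, sorted by start,
    non-overlapping, inclusive coordinates (an assumption of this sketch).
    Returns 0 when the SNP lies inside a probe, None if there are no probes.
    """
    starts = [start for start, _ in probes]
    i = bisect_left(starts, snp_pos)  # first probe starting at or after the SNP
    best = None
    # Only the probe just before and the probe at index i can be the closest.
    for j in (i - 1, i):
        if 0 <= j < len(probes):
            start, end = probes[j]
            if start <= snp_pos <= end:
                return 0  # on-target SNP
            d = start - snp_pos if snp_pos < start else snp_pos - end
            best = d if best is None else min(best, d)
    return best

# Hypothetical probe coordinates, for illustration only.
example_probes = [(100, 219), (1000, 1119)]
```

A histogram of these distances over all off-target SNPs would give a distribution of the kind described in the caption.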
Summary of per sample coverage and capture efficiency over the 23,232 target loci in the genome assembly of H. impetiginosus
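| Sample | Coverage | Number of loci | Capture efficiency, transcript loci, c(a) | Capture efficiency, WGS loci, c(b) | Capture efficiency, all loci, c(c) |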
|---|---|---|---|---|---|
| POS-15-1 | 304 | 19,627 | 72% | 63% | 71% |
| POT-05-1 | 265 | 18,951 | 70% | 66% | 69% |
| CAR-05-1 | 210 | 19,077 | 70% | 67% | 69% |
| POS-10-1 | 211 | 18,671 | 68% | 63% | 68% |
| SEC-23-1 | 217 | 18,944 | 70% | 63% | 69% |
| SEC-08-1 | 245 | 18,784 | 69% | 65% | 68% |
| SEC-05-1 | 146 | 18,524 | 68% | 63% | 68% |
| CAR-02-1 | 100 | 18,047 | 66% | 63% | 66% |
| SUM-13-1 | 132 | 17,909 | 68% | 47% | 65% |
| SUM-12-1 | 114 | 17,797 | 68% | 46% | 65% |
| MOC-08-1 | 112 | 17,877 | 70% | 37% | 65% |
| MOC-11-1 | 136 | 17,788 | 69% | 44% | 65% |
| MOC-02-1 | 148 | 17,765 | 68% | 41% | 64% |
| MOC-09-1 | 129 | 17,547 | 68% | 39% | 64% |
| MOC-12-1 | 133 | 17,625 | 68% | 42% | 64% |
| MOC-13-1 | 125 | 17,705 | 68% | 43% | 64% |
| MOC-07-1 | 118 | 17,524 | 67% | 41% | 63% |
| MOC-03-1 | 112 | 17,436 | 67% | 41% | 63% |
| SUM-10-1 | 72 | 17,443 | 66% | 48% | 63% |
| MOC-06-1 | 94 | 17,226 | 66% | 42% | 63% |
| MOC-10-1 | 70 | 16,986 | 65% | 43% | 62% |
| MOC-05-1 | 25 | 16,652 | 62% | 45% | 60% |
| SEC-09-1 | 9 | 16,229 | 58% | 65% | 59% |
| MOC-04-1 | 28 | 15,942 | 61% | 44% | 58% |
Coverage is the ratio of the aligned read depth, which denotes the number of quality reads after alignment to the genome assembly, to the number of sequenced loci in the corresponding sample.
Number of loci denotes the number of target loci in the initial design of the capture system for which at least one quality read was detected after alignment to the genome assembly.
Capture efficiency was computed at sample level by the ratio between the number of loci for which at least one quality SNP was detected using the improvement procedure for SNP calling and genotyping refinement and the total number of loci determined by probes in the corresponding design: 19,962 loci from transcripts of protein-coding genes (c(a)); 3,270 loci from low-coverage WGS (c(b)); total of 23,232 loci (c(c)).
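The relationship between the three capture-efficiency values can be sketched as follows. This is an illustrative calculation, not the authors' pipeline: the function names are hypothetical, and the inputs are the rounded percentages from the table, so a combined value may differ from the published one by a rounding step.

```python
# Locus counts in the initial design, as given in the table footnote.
TRANSCRIPT_LOCI = 19_962   # loci from transcripts of protein-coding genes
WGS_LOCI = 3_270           # loci from low-coverage WGS
TOTAL_LOCI = TRANSCRIPT_LOCI + WGS_LOCI

def capture_efficiency(loci_with_snp, loci_in_design):
    """Fraction of designed loci with at least one quality SNP."""
    return loci_with_snp / loci_in_design

def combined_efficiency(eff_transcript, eff_wgs):
    """Overall efficiency as the locus-weighted mean of the two designs."""
    captured = eff_transcript * TRANSCRIPT_LOCI + eff_wgs * WGS_LOCI
    return captured / TOTAL_LOCI

# Sample POS-15-1: 72% (transcript loci) and 63% (WGS loci).
pos_15_1 = combined_efficiency(0.72, 0.63)
```

Because the transcript-derived design contributes about 86% of the loci, the overall efficiency tracks the transcript column closely even where the WGS column is much lower.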
Summary of the numbers of loci, size in base pairs (bp), numbers of probes, SNPs, gene models and intergenic regions retained following each filtering step in the variant analysis using the GATK framework
| SNP call set | Number of loci | Target loci: total size (bp) | Target loci: mean (bp) | Target loci: median (bp) | Number of probes | Number of SNPs | Number of gene models |
|---|---|---|---|---|---|---|---|
| STANDARD | 19,627 | 11,913,787 | 607 | 520 | 26,451 | 688,754 | 11,024 |
| GQ20+VQSR | 18,216 | 10,983,469 | 603 | 520 | 24,771 | 352,879 | 10,400 |
| GQ20+VQSR+MM80 | 11,748 | 7,392,636 | 630 | 520 | 16,901 | 83,476 | 7,991 |
See Results for the definitions of the SNP call sets.
Performance of the initially designed probe sequences for target enrichment and capture
| Source | Filter | Genic, exon: successes | Genic, exon: failures | Genic, exon + flanking: successes | Genic, exon + flanking: failures | Genic, intron: successes | Genic, intron: failures | Intergenic: successes | Intergenic: failures | Total: successes | Total: failures |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Probes from transcripts | STANDARD | 3,062 (88%) | 405 (12%) | 16,208 (90%) | 1,870 (10%) | 1,163 (83%) | 230 (17%) | 2,070 (58%) | 1,518 (42%) | 22,503 (85%) | 4,023 (15%) |
| Probes from transcripts | GQ20+VQSR | 2,872 (83%) | 595 (17%) | 15,197 (84%) | 2,881 (16%) | 1,051 (75%) | 342 (25%) | 1,821 (51%) | 1,767 (49%) | 20,941 (79%) | 5,585 (21%) |
| Probes from transcripts | GQ20+VQSR+MM80 | 2,447 (71%) | 1,020 (29%) | 11,529 (64%) | 6,549 (36%) | 511 (37%) | 882 (63%) | 1,101 (31%) | 2,487 (69%) | 15,588 (59%) | 10,938 (41%) |
| Probes from low-coverage WGS | STANDARD | 102 (90%) | 11 (10%) | 1,813 (94%) | 106 (6%) | 214 (94%) | 14 (6%) | 1,819 (91%) | 190 (9%) | 3,948 (92%) | 321 (8%) |
| Probes from low-coverage WGS | GQ20+VQSR | 102 (90%) | 11 (10%) | 1,799 (94%) | 120 (6%) | 209 (92%) | 19 (8%) | 1,700 (85%) | 309 (15%) | 3,810 (89%) | 459 (11%) |
| Probes from low-coverage WGS | GQ20+VQSR+MM80 | 69 (61%) | 44 (39%) | 966 (50%) | 953 (50%) | 71 (31%) | 157 (69%) | 207 (10%) | 1,802 (90%) | 1,313 (31%) | 2,956 (69%) |
Capture efficiencies across filtered call sets (see Results for the definitions of the filter criteria) were inspected by counting the number of loci and probe sequences for which at least one quality SNP was detected. The success rate is the number of successful probes divided by the corresponding number of probes in the initial design, which comprised 30,795 probe sequences in total: 26,526 designed from mRNA and 4,269 from low-coverage WGS. The failure rate is defined as (1 − success rate). Success and failure rates were also stratified according to the probe location in the genome assembly, either in a predicted gene model feature or in an intergenic region.
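As a worked check of the definition above, the following sketch recomputes the rates in the Total columns for the STANDARD call set from the probe counts in the table. The function name is hypothetical; only the counts come from the table.

```python
def success_failure_rates(successes, total_probes):
    """Return (success rate, failure rate), with failure = 1 - success."""
    success = successes / total_probes
    return success, 1 - success

# Probes in the initial design, by source.
MRNA_PROBES = 26_526
WGS_PROBES = 4_269

# STANDARD call set, counts of successful probes from the Total column.
mrna_success, mrna_failure = success_failure_rates(22_503, MRNA_PROBES)
wgs_success, wgs_failure = success_failure_rates(3_948, WGS_PROBES)

# Successful probes pooled over both sources for the STANDARD call set.
standard_successful_probes = 22_503 + 3_948
```

The pooled count of successful probes (26,451) agrees with the number of probes reported for the STANDARD call set in the preceding table.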