John C Mu, Pegah Tootoonchi Afshar, Marghoob Mohiyuddin, Xi Chen, Jian Li, Narges Bani Asadi, Mark B Gerstein, Wing H Wong, Hugo Y K Lam.
Abstract
A high-confidence, comprehensive human variant set is critical in assessing accuracy of sequencing algorithms, which are crucial in precision medicine based on high-throughput sequencing. Although recent works have attempted to provide such a resource, they still do not encompass all major types of variants including structural variants (SVs). Thus, we leveraged the massive high-quality Sanger sequences from the HuRef genome to construct by far the most comprehensive gold set of a single individual, which was cross validated with deep Illumina sequencing, population datasets, and well-established algorithms. It was a necessary effort to completely reanalyze the HuRef genome as its previously published variants were mostly reported five years ago, suffering from compatibility, organization, and accuracy issues that prevent their direct use in benchmarking. Our extensive analysis and validation resulted in a gold set with high specificity and sensitivity. In contrast to the current gold sets of the NA12878 or HS1011 genomes, our gold set is the first that includes small variants, deletion SVs and insertion SVs up to a hundred thousand base-pairs. We demonstrate the utility of our HuRef gold set to benchmark several published SV detection tools.
Year: 2015 PMID: 26412485 PMCID: PMC4585973 DOI: 10.1038/srep14493
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Workflow to construct small variant and SV gold sets.
Figure 2. Histogram of size ranges for HuRef gold set variants.
GiaB, Illumina platinum genome and Baylor Gold Set (BGS) are shown for comparison. Bin names represent the upper bound in size range.
Figure 3. Comparison of HuRef SV counts to Baylor Gold Set (BGS).
Suffix of “All” refers to the entire set. We show the variants in the 50–100 bp range separately since BGS defines SVs as >100 bp.
Figure 4. Counts of SNVs and Indels in each gold set.
Suffix of “All” refers to the complete set of small variants. The other bar shows the gold set variants.
Deletion SVs detection comparison.
| Method | TP | TPR | NSVR-FP | NSVR-FDR | F-score |
|---|---|---|---|---|---|
| LUMPY | 1,671 | 0.8512 | 387 | 0.1880 | 0.8311 |
| DELLY | 1,507 | 0.7677 | 360 | 0.1928 | 0.7869 |
| MetaSV | 1,683 | 0.8574 | 32 | 0.0186 | 0.9152 |
| Pindel | 1,638 | 0.8344 | 135 | 0.0761 | 0.8769 |
| BreakDancer | 1,741 | 0.8869 | 6,534 | 0.7896 | 0.3401 |
| CNVnator | 700 | 0.3566 | 82 | 0.1049 | 0.5100 |
| BreakSeq2 | 1,504 | 0.7662 | 23 | 0.0151 | 0.8619 |
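The columns of the table above appear to be related by simple arithmetic, which the following minimal sketch illustrates. It assumes, based only on the numbers shown and not on a definition given in this excerpt, that NSVR-FDR = FP / (TP + FP) and that the final column is the harmonic mean of TPR and precision, with precision taken as 1 − NSVR-FDR. The function names `fdr` and `f_score` are hypothetical helpers introduced for illustration. Under these assumptions the recomputed values agree with the tabulated ones to within about 0.001.

```python
# Sketch: recompute the derived columns of the deletion SV table.
# ASSUMPTIONS (inferred from the numbers, not stated in the excerpt):
#   FDR     = FP / (TP + FP)
#   F-score = harmonic mean of TPR and precision, with precision = 1 - FDR


def fdr(tp: int, fp: int) -> float:
    """False discovery rate from true-positive and false-positive counts."""
    return fp / (tp + fp)


def f_score(tpr: float, fdr_value: float) -> float:
    """Harmonic mean of recall (TPR) and precision (1 - FDR)."""
    precision = 1.0 - fdr_value
    return 2.0 * precision * tpr / (precision + tpr)


# (TP, TPR, NSVR-FP) copied from the table above.
rows = {
    "LUMPY": (1671, 0.8512, 387),
    "DELLY": (1507, 0.7677, 360),
    "MetaSV": (1683, 0.8574, 32),
    "Pindel": (1638, 0.8344, 135),
    "BreakDancer": (1741, 0.8869, 6534),
    "CNVnator": (700, 0.3566, 82),
    "BreakSeq2": (1504, 0.7662, 23),
}

for name, (tp, tpr, fp) in rows.items():
    d = fdr(tp, fp)
    print(f"{name:12s} FDR={d:.4f}  F-score={f_score(tpr, d):.4f}")
```

Read this way, BreakDancer's high TPR is offset by its large false-positive count, which is why its value in the final column is the lowest in the table.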