David E Larson, Haley J Abel, Colby Chiang, Abhijit Badve, Indraniel Das, James M Eldred, Ryan M Layer, Ira M Hall.
Abstract
SUMMARY: Large-scale human genetics studies are now employing whole genome sequencing with the goal of conducting comprehensive trait mapping analyses of all forms of genome variation. However, methods for structural variation (SV) analysis have lagged far behind those for smaller-scale variants, and there is an urgent need to develop more efficient tools that scale to the size of human populations. Here, we present a fast and highly scalable software toolkit (svtools) and cloud-based pipeline for assembling high-quality SV maps (including deletions, duplications, mobile element insertions, inversions and other rearrangements) in many thousands of human genomes. We show that this pipeline achieves similar variant detection performance to established per-sample methods (e.g. LUMPY), while providing fast and affordable joint analysis at the scale of ≥100 000 genomes. These tools will help enable the next generation of human genetics studies.
Year: 2019 PMID: 31218349 PMCID: PMC6853660 DOI: 10.1093/bioinformatics/btz492
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The svtools pipeline. SVs are detected separately in each sample using LUMPY. Breakpoint probability distributions are utilized to merge and refine the coordinates of SV breakpoints within a cohort, followed by parallelized re-genotyping and copy number annotation. Variants are merged into a single cohort-level VCF file and variant types are classified using the combined breakpoint genotype and read-depth information.
Detection sensitivity in large and small cohorts
12 sample callset

| Sample | Merge only: sensitivity (all) (%) | Reclassified (naïve Bayes): sensitivity (all) (%) | Reclassified (naïve Bayes): sensitivity (HC) (%) |
|---|---|---|---|
| HG00513 | 80.94 | 87.53 | 82.43 |
| HG00731 | 78.17 | 83.96 | 78.88 |
| HG00732 | 82.40 | 87.43 | 81.39 |
| NA12878 | 82.39 | 88.19 | 83.15 |
| NA19238 | 84.39 | 88.58 | 82.41 |
| NA19239 | 74.39 | 77.60 | 73.36 |

1000 sample callset

| Sample | Merge only: sensitivity (all) (%) | Reclassified (regression): sensitivity (all) (%) | Reclassified (regression): sensitivity (HC) (%) |
|---|---|---|---|
| HG00513 | 80.23 | 88.03 | 83.80 |
| HG00731 | 77.67 | 84.47 | 80.46 |
| HG00732 | 81.50 | 88.12 | 82.56 |
| NA12878 | 81.58 | 88.62 | 84.18 |
| NA19238 | 83.86 | 88.53 | 82.80 |
| NA19239 | 74.01 | 77.81 | 73.31 |

Note: Sensitivity is defined as percent of detectable 1000 Genomes Project variants identified in the cohort. HC stands for high confidence variants.
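
The sensitivity definition in the note above can be illustrated with a minimal Python sketch. It assumes variants have already been matched between the truth set and the callset and are represented by shared identifiers; how the original analysis matched SV breakpoints is not stated here, and the function name and toy IDs are hypothetical.

```python
def sensitivity_percent(detectable_truth_ids, called_ids):
    """Percent of detectable truth-set variants that appear in the callset.

    Assumption: matching has been done upstream, so a truth variant counts
    as found when its ID is present among the called IDs.
    """
    truth = set(detectable_truth_ids)
    if not truth:
        raise ValueError("truth set is empty; sensitivity is undefined")
    found = truth & set(called_ids)
    return 100.0 * len(found) / len(truth)


# Toy example with made-up variant IDs (not data from the paper).
truth = ["sv1", "sv2", "sv3", "sv4"]
calls = ["sv1", "sv3", "sv4", "sv9"]
print(sensitivity_percent(truth, calls))  # 75.0
```

Calls that are absent from the truth set (here `sv9`) do not affect sensitivity; they would only matter for a precision-style metric.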
ME rate in large and small cohorts
12 sample

| Family | Merge only, all: variants | Merge only, all: ME rate (%) | Reclassified (naïve Bayes), all: variants | Reclassified (naïve Bayes), all: ME rate (%) | Reclassified (naïve Bayes), high confidence: variants | Reclassified (naïve Bayes), high confidence: ME rate (%) |
|---|---|---|---|---|---|---|
| CEPH1463 | 6107 | 12.72 | 6237 | 8.11 | 3184 | 2.29 |
| PR05 | 5783 | 15.75 | 6182 | 8.57 | 3164 | 2.24 |
| SH032 | 5670 | 16.83 | 6054 | 8.77 | 3182 | 2.64 |
| Y117 | 7534 | 15.18 | 7519 | 8.83 | 3889 | 2.24 |

1000 sample

| Family | Merge only, all: variants | Merge only, all: ME rate (%) | Reclassified (regression), all: variants | Reclassified (regression), all: ME rate (%) | Reclassified (regression), high confidence: variants | Reclassified (regression), high confidence: ME rate (%) |
|---|---|---|---|---|---|---|
| CEPH1463 | 6147 | 12.93 | 10 429 | 13.13 | 3605 | 2.77 |
| PR05 | 5827 | 15.99 | 10 381 | 14.54 | 3629 | 2.98 |
| SH032 | 5708 | 16.92 | 10 123 | 14.29 | 3574 | 3.33 |
| Y117 | 7568 | 15.38 | 11 488 | 13.34 | 4208 | 2.69 |

Note: ME rate is defined as the number of MEs divided by the total number of informative variants on the autosomes.
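
As a rough illustration of the ME rate in the note above, the following Python sketch counts Mendelian errors in father/mother/child trios. It rests on several assumptions that are not stated in this record: that ME denotes Mendelian error, that genotypes are coded as alternate-allele counts (0, 1, 2) at biallelic autosomal sites, and that a site is "informative" when all three genotypes are called and at least one family member carries the variant. All names are hypothetical.

```python
# Number of alternate alleles a parent with a given genotype can transmit.
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_error(father, mother, child):
    """True when the child's genotype cannot arise from the parents' genotypes."""
    possible = {f + m for f in TRANSMITTABLE[father] for m in TRANSMITTABLE[mother]}
    return child not in possible


def me_rate_percent(trios):
    """ME rate = Mendelian errors / informative sites, as a percentage.

    Assumption: a site is informative when no genotype is missing (None)
    and at least one of the three genotypes is non-reference.
    """
    informative = [t for t in trios if None not in t and any(g > 0 for g in t)]
    if not informative:
        raise ValueError("no informative sites; ME rate is undefined")
    errors = sum(is_mendelian_error(*t) for t in informative)
    return 100.0 * errors / len(informative)


# Toy (father, mother, child) genotypes; not data from the paper.
trios = [
    (0, 1, 1),     # consistent
    (1, 1, 2),     # consistent
    (0, 0, 1),     # error: neither parent carries the variant
    (2, 2, 1),     # error: child must be homozygous alternate
    (0, 0, 0),     # not informative, skipped
    (None, 1, 1),  # missing genotype, skipped
]
print(me_rate_percent(trios))  # 50.0
```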
Computational benchmarking of svtools subcommands
| Program | 10 samples: wall (m) | 10 samples: RAM (MB) | 100 samples: wall (m) | 100 samples: RAM (MB) | 1000 samples: wall (m) | 1000 samples: RAM (MB) |
|---|---|---|---|---|---|---|
| lsort | 0.129 | 5.964 | 1.117 | 1696.008 | 16.788 | 3402.480 |
| lmerge | 2.108 | 87.791 | 18.708 | 258.402 | 193.346 | 2032.114 |
| genotype | 13.425 | 2008.828 | 31.725 | 1222.536 | 61.413 | 1255.593 |
| copynumber | 0.225 | NA | 0.333 | NA | 0.533 | NA |
| vcfpaste | 0.088 | NA | 1.379 | 75.660 | 79.083 | 181.845 |
| afreq | 0.096 | NA | 0.908 | 77.701 | 20.192 | 97.713 |
| vcftobedpe | 0.083 | NA | 0.183 | 3.277 | 0.933 | 70.851 |
| bedpesort | 0.079 | 17.363 | 0.183 | NA | 0.892 | 70.904 |
| prune | 0.179 | 18.790 | 0.404 | 61.667 | 1.388 | 171.728 |
| bedpetovcf | 0.079 | 17.368 | 0.183 | 34.794 | 0.879 | 71.003 |
| vcfsort | 0.033 | NA | 0.100 | 0.594 | 1.508 | 900.887 |
| classify | 6.938 | 530.722 | 8.250 | 526.480 | 25.621 | 680.900 |
Note: For three different size cohorts, each tool was run (n = 4; n = 3 for the 100 sample bedpetovcf) to generate mean wall clock time and RAM utilization. For the genotype and copynumber commands, benchmarking was performed on a single, representative sample within the cohort of median file size. All other commands were evaluated on the entire dataset. Some benchmarking runs finished before LSF was able to gather memory usage metrics and these are reported as NA.
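
One way to read the benchmarking table is to estimate how wall time grows with cohort size. The sketch below computes an empirical scaling exponent between the 100 and 1000 sample runs for a few whole-cohort commands, using the mean wall times from the table. The power-law model and the helper name are assumptions made for illustration, not an analysis from the paper; genotype and copynumber are left out because they were benchmarked on a single sample.

```python
import math


def scaling_exponent(n_small, t_small, n_large, t_large):
    """Slope of log(time) versus log(samples) between two benchmark points.

    A value near 1 suggests roughly linear growth in cohort size;
    a value near 2 suggests roughly quadratic growth.
    """
    return math.log(t_large / t_small) / math.log(n_large / n_small)


# Mean wall-clock minutes at 100 and 1000 samples, copied from the table.
wall_minutes = {
    "lsort": (1.117, 16.788),
    "lmerge": (18.708, 193.346),
    "vcfpaste": (1.379, 79.083),
    "classify": (8.250, 25.621),
}

for program, (t100, t1000) in wall_minutes.items():
    k = scaling_exponent(100, t100, 1000, t1000)
    print(f"{program}: wall time grows roughly as n^{k:.2f}")
```

With only two points per command and three or four runs behind each mean, these exponents are coarse indications rather than measured complexity.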