Michael R Lindberg, Ira M Hall, Aaron R Quinlan.
Abstract
Current strategies for SNP and INDEL discovery incorporate sequence alignments from multiple individuals to maximize sensitivity and specificity. It is widely accepted that this approach also improves structural variant (SV) detection. However, multisample SV analysis has been stymied by the fundamental difficulties of SV calling, e.g. library insert size variability, SV alignment signal integration and detecting long-range genomic rearrangements involving disjoint loci. Extant tools suffer from poor scalability, which limits the number of genomes that can be co-analyzed and complicates analysis workflows. We have developed an approach that enables multisample SV analysis in hundreds to thousands of human genomes using commodity hardware. Here, we describe Hydra-Multi and measure its accuracy, speed and scalability using publicly available datasets provided by The 1000 Genomes Project and by The Cancer Genome Atlas (TCGA).
Year: 2014 PMID: 25527832 PMCID: PMC4393510 DOI: 10.1093/bioinformatics/btu771
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Receiver operating characteristic curves describing deletion detection in NA12878 from three scenarios. The relative accuracy of Hydra-Multi (red) was compared with both DELLY (blue and purple) and GASVPro (green) in three analyses that each compared fragment size parameters of 5 and 8 median absolute deviations (MADs) (Supplementary Methods). Each plot displays the relationship between the number of true and false positives at varying levels of minimum alignment support (4–10 read-pairs). A true positive was defined as detection of one of the 3077 non-overlapping truth set deletions where both intervals from a predicted deletion breakpoint intersected with both of the truth set deletion breakpoint intervals. To make a fair comparison across all tools, each predicted breakpoint was represented as two 200 bp intervals that faithfully represent the region implicated by the original SV call. A list of regions to exclude based on excessively high read-depth was applied to both the truth set and the putative call sets (Supplementary Methods). The three situations used to assess the three tools are as follows: (A) The 50× NA12878 dataset was subsampled to 5× and analyzed. (B) The 50× NA12878 data were analyzed. (C) The subsampled 5× NA12878 dataset was analyzed concurrently with 64 randomly selected datasets of ∼5× coverage from 1KGP. Total support was evaluated as the total number of read-pairs across all datasets analyzed. The presence of a deletion in NA12878 by DELLY was inferred by both the reported genotype (GT) and by observing at least one high-quality variant pair (DV) in NA12878. Only GT was reported in the single dataset analyses, as GT and DV are functionally the same when requiring 4–10 read-pairs of support. In both single and joint analyses using Hydra-Multi, the contribution of at least one read-pair by NA12878 was required. Note: GASVPro does not simultaneously run on multiple datasets.
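The true-positive rule in the Fig. 1 caption can be illustrated with a minimal Python sketch. This is not code from Hydra-Multi or from the authors' evaluation scripts. The names `Interval`, `Breakpoint`, `around`, `is_match` and `count_true_positives` are hypothetical, and centering each 200 bp interval on a single breakpoint coordinate is an assumption made here for illustration; the caption states only that each predicted breakpoint was represented as two 200 bp intervals.

```python
from typing import Iterable, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


class Breakpoint(NamedTuple):
    left: Interval   # interval around the first breakend of the deletion
    right: Interval  # interval around the second breakend of the deletion


def around(chrom: str, pos: int, width: int = 200) -> Interval:
    """Build a fixed-width interval around a position (assumed centering)."""
    start = max(0, pos - width // 2)
    return Interval(chrom, start, start + width)


def intersects(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals on the same chromosome overlap."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def is_match(pred: Breakpoint, truth: Breakpoint) -> bool:
    """Both predicted intervals must intersect the corresponding truth intervals."""
    return intersects(pred.left, truth.left) and intersects(pred.right, truth.right)


def count_true_positives(predicted: Iterable[Breakpoint],
                         truth_set: Iterable[Breakpoint]) -> int:
    """Count truth-set deletions detected by at least one predicted call."""
    predicted = list(predicted)
    return sum(any(is_match(p, t) for p in predicted) for t in truth_set)
```

A short usage example with made-up coordinates: a call whose two ends both fall near the truth deletion counts once, while a call that shares only one end does not.

```python
truth = Breakpoint(around("chr1", 10_000), around("chr1", 15_000))
hit = Breakpoint(around("chr1", 10_050), around("chr1", 14_950))
miss = Breakpoint(around("chr1", 10_050), around("chr1", 20_000))
print(count_true_positives([hit, miss], [truth]))
```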
Fig. 2. Reduction in the somatic SV FDR for tumor-specific mutations by simultaneously integrating data from 128 TCGA samples. The somatic FDR is the predicted rate at which somatic SV breakpoints are false, either due to false positive SV calls or due to inherited germline SVs that have been misclassified as somatic due to false negatives. For this experiment, we identify false somatic calls by their presence in a single normal genome but not in the paired tumor genome or any of N additional tumor-normal pairs (X-axis).
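The selection rule in the Fig. 2 caption (a call present in a single normal genome but absent from its paired tumor and from N additional tumor-normal pairs) can be sketched as a set operation. The function name `normal_only_calls`, the sample identifiers and the `support` mapping are all hypothetical. The caption does not say how the FDR itself is derived from these calls, so the sketch stops at the selection step.

```python
from typing import Dict, Iterable, List, Set, Tuple


def normal_only_calls(support: Dict[str, Set[str]],
                      normal: str,
                      tumor: str,
                      extra_pairs: Iterable[Tuple[str, str]] = ()) -> List[str]:
    """Return calls supported by `normal` and by none of the other samples.

    support     -- maps a breakpoint call ID to the samples with supporting reads
    normal      -- ID of the normal genome being examined
    tumor       -- ID of the tumor genome paired with `normal`
    extra_pairs -- additional (tumor, normal) pairs analyzed jointly
    """
    others = {tumor}
    for extra_tumor, extra_normal in extra_pairs:
        others.update((extra_tumor, extra_normal))
    return sorted(call for call, samples in support.items()
                  if normal in samples and not (samples & others))
```

With toy data, adding one more tumor-normal pair removes a call that is shared with another normal genome, which is the qualitative effect the figure describes:

```python
support = {
    "bp1": {"N1"},          # private to the normal genome
    "bp2": {"N1", "T1"},    # also seen in the paired tumor
    "bp3": {"N1", "N2"},    # also seen in another normal genome
    "bp4": {"T1"},          # tumor only
}
print(normal_only_calls(support, "N1", "T1"))
print(normal_only_calls(support, "N1", "T1", [("T2", "N2")]))
```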
Memory usage and runtime performance from four scenarios
| Dataset | Hydra-Multi: maximum memory | Hydra-Multi: total runtime | DELLY: maximum memory | DELLY: total runtime | GASVPro: maximum memory | GASVPro: total runtime |
|---|---|---|---|---|---|---|
| NA12878 (5×) | 1.9 Gb | 17 min | 1.6 Gb | 37 min | 1.1 Gb | 217 min |
| NA12878 (50×) | 1.8 Gb | 145 min | 7.1 Gb | 337 min | 7.8 Gb | 2017 min |
| NA12878 (5×) + 64 Datasets (5×) | 1.9 Gb | 192 min | 41.3 Gb | 2392 min | N/A | N/A |
| 500 NA12878 (5×) | 6.9 Gb | 1817 min | 70.7 Gb | 21258 min | N/A | N/A |
The relative speed and scalability of Hydra-Multi were compared with those of the other tools by measuring the maximum memory used per process and the runtime with Runit (https://github.com/lh3/misc/tree/master/sys/runit). Hydra-Multi (8 processors) and DELLY were parallelized (32 threads). GASVPro ran as a single process/thread, never exceeding the Java Virtual Machine allocation of 20 Gb. From top, we analyzed the following datasets: a 5× NA12878 dataset obtained by subsampling the 50× NA12878 dataset; the 50× NA12878 dataset; the 5× NA12878 dataset combined with 64 additional ∼5× datasets from 1KGP; 500 copies of the 5× NA12878 dataset. Note: GASVPro cannot jointly analyze multiple datasets (indicated by ‘N/A’).
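To make the scaling comparison concrete, the following Python sketch computes DELLY-to-Hydra-Multi ratios directly from the values reported in the table above. The dictionary layout and the names `REPORTED` and `fold_difference` are illustrative choices, not part of any tool. The ratios compare the reported figures as given and do not normalize for the different levels of parallelization used.

```python
# Values copied from the table: (maximum memory in Gb, total runtime in min).
REPORTED = {
    "NA12878 (5x)": {"Hydra-Multi": (1.9, 17), "DELLY": (1.6, 37)},
    "NA12878 (50x)": {"Hydra-Multi": (1.8, 145), "DELLY": (7.1, 337)},
    "NA12878 (5x) + 64 datasets (5x)": {"Hydra-Multi": (1.9, 192), "DELLY": (41.3, 2392)},
    "500 NA12878 (5x)": {"Hydra-Multi": (6.9, 1817), "DELLY": (70.7, 21258)},
}


def fold_difference(scenario: str, other: str = "DELLY",
                    baseline: str = "Hydra-Multi") -> tuple:
    """Return (memory ratio, runtime ratio) of `other` relative to `baseline`."""
    base_mem, base_time = REPORTED[scenario][baseline]
    other_mem, other_time = REPORTED[scenario][other]
    return other_mem / base_mem, other_time / base_time


if __name__ == "__main__":
    for scenario in REPORTED:
        mem_ratio, time_ratio = fold_difference(scenario)
        print(f"{scenario}: memory x{mem_ratio:.1f}, runtime x{time_ratio:.1f}")
```

A ratio above 1 means DELLY's reported value is larger than Hydra-Multi's for that scenario; the single-sample 5× memory ratio is the only one below 1.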