Michael D. Linderman, Davin Chia, Forrest Wallace, Frank A. Nothaft.
Abstract
BACKGROUND: XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results.Entities:
Keywords: Copy-number variation; Exome sequencing; High-performance computing
Year: 2019 PMID: 31604420 PMCID: PMC6787990 DOI: 10.1186/s12859-019-3108-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 DECA parallelization and performance. a DECA parallelization (shown by dashed outline) and data flow. The normalization and discovery steps are parallelized by sample (rows of the samples (s) × targets (t) read-depth matrix). The inputs and outputs of the different components are shown with thinner arrows. b DECA and XHMM execution time starting from the read-depth matrix for s = 2535 on both the workstation and on-premises Hadoop cluster for different numbers of executor cores. Mod. XHMM is a customized XHMM implementation that partitions the discovery input files and invokes XHMM in parallel. c DECA execution time for coverage and CNV discovery for different numbers of samples using the entire workstation (16 cores) and cluster (approximately 640 executor cores dynamically allocated by Spark)
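The sample-wise partitioning in Fig. 1a works because, after PCA normalization, the per-sample z-scoring and HMM discovery depend only on that sample's row of the s × t read-depth matrix. A minimal sketch of this row-independence in Python/NumPy (DECA itself is implemented in Scala on Apache Spark; the function names here are ours):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def normalize_row(row):
    # Z-score one sample's read depths across all targets.
    return (row - row.mean()) / row.std()

def normalize_matrix(depths, workers=4):
    """Normalize each sample (row) independently, mirroring the
    sample-wise partitioning shown in Fig. 1a. Because no row depends
    on another, the rows can be distributed across workers (or, in
    DECA's case, Spark executors)."""
    with ThreadPoolExecutor(workers) as ex:
        return np.vstack(list(ex.map(normalize_row, depths)))
```

The same independence argument is what lets DECA scale the discovery step from a single workstation to hundreds of executor cores without changing the per-sample computation.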
Fig. 2 Algorithm for determining the K components to remove during PCA normalization
Fig. 3 Components to be removed in PCA normalization. K components to be removed during PCA normalization, the minimum k components needed when computing the truncated SVD to accurately determine K, and the final k used by DECA, for different numbers of initial samples at the XHMM default relative variance cutoff of 0.7 / n
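The idea behind Figs. 2 and 3 can be sketched as follows: the total variance (the sum of all squared singular values) equals the squared Frobenius norm of the matrix and so can be computed without a full SVD; DECA can therefore compute a truncated SVD of rank k, count how many components K exceed the relative-variance cutoff, and grow k only if all k computed components exceed it (meaning K might be larger than k). A hedged NumPy illustration, with our own function names and omitting XHMM's target mean-centering:

```python
import numpy as np

def choose_k_to_remove(depths, cutoff_factor=0.7, k0=8):
    """Determine K, the number of principal components to remove,
    growing the truncated-SVD rank k until K < k (cf. Fig. 2).
    Cutoff: relative variance > cutoff_factor / n, i.e. variance
    greater than cutoff_factor times the mean component variance."""
    n = min(depths.shape)
    # Total variance without any SVD: sum of squared singular values
    # equals the squared Frobenius norm.
    total_var = np.sum(depths.astype(float) ** 2)
    cutoff = (cutoff_factor / n) * total_var
    k = min(k0, n)
    while True:
        # Top-k singular values (full SVD here for simplicity; DECA
        # uses a truncated solver so only k components are computed).
        s = np.linalg.svd(depths, compute_uv=False)[:k]
        K = int(np.sum(s ** 2 > cutoff))
        if K < k or k == n:
            return K
        k = min(2 * k, n)  # all k components exceeded the cutoff; retry
```

Fig. 3 shows the values of K, the minimum sufficient k, and the k DECA actually used across cohort sizes under this cutoff.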
On-premises evaluation systems
| Workstation | 16-core workstation with two 8-core 2.1 GHz Intel Xeon E5-2620 CPUs, 256 GB RAM, and 16 TB of HDD in 2×-striped JBOD (four 4 TB 7200 RPM HDDs connected via 6 Gbps SATA). |
| Cluster | 56-node Hadoop cluster with 16-core nodes managed by YARN. Each node has two 8-core 2.6 GHz Intel Xeon E5-2670 CPUs, 256 GB RAM, and 4 TB of HDD (four 1 TB 7200 RPM HDDs connected via 6 Gbps SATA). Nodes are connected with two 1 GbE connections and one switchable 10 GbE/40 Gbps IB connection to a 40 GbE TOR switch. HDFS was configured with 128 MB blocks and a 2× replication factor. |
Fig. 4 Comparison between DECA and XHMM results. a Concordance of XHMM and DECA CNV calls for the full 1000 Genomes Project phase 3 WES dataset (s = 2535) when starting from the same read-depth matrix (t = 191,396). Exact matches have identical breakpoints and copy number, while overlap matches do not have identical breakpoints. b Range of Some Quality (SQ) scores computed by DECA compared to XHMM probability for exact matching variants
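Fig. 4a's concordance categories can be made precise: an exact match requires identical breakpoints and CNV type (e.g. DEL vs. DUP), while an overlap match only requires the two intervals to intersect with the same type. A small illustrative classifier (our own names, not DECA's code):

```python
def match_type(call_a, call_b):
    """Classify two CNV calls, each given as (chrom, start, end, cnv_type).
    Returns 'exact', 'overlap', or None, following the match definitions
    used in the Fig. 4a concordance comparison."""
    chrom_a, start_a, end_a, type_a = call_a
    chrom_b, start_b, end_b, type_b = call_b
    if chrom_a != chrom_b or type_a != type_b:
        return None
    if (start_a, end_a) == (start_b, end_b):
        return "exact"
    if start_a < end_b and start_b < end_a:  # half-open interval overlap
        return "overlap"
    return None
```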
Execution time for the XHMM PCA step (--PCA) with different LAPACK libraries. Execution time and speedup for XHMM linked to the NetLib and OpenBLAS libraries on the single-node workstation using a single core
| Samples | NetLib Time (s) | OpenBLAS Time (s) | Speedup |
|---|---|---|---|
| 50 | 9.8 | 9.5 | 1.03 |
| 500 | 208.7 | 112.4 | 1.86 |
| 1000 | 568.5 | 241.5 | 2.35 |
| 1500 | 1150.6 | 398.5 | 2.89 |
| 2000 | 2000 | 585.6 | 3.42 |
| 2535 | 3178.2 | 819.0 | 3.88 |