Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline
Jeffrey G Reid, Andrew Carroll, Narayanan Veeraraghavan, Mahmoud Dahdouli, Andreas Sundquist, Adam English, Matthew Bainbridge, Simon White, William Salerno, Christian Buhay, Fuli Yu, Donna Muzny, Richard Daly, Geoff Duyk, Richard A Gibbs, Eric Boerwinkle.
Abstract
BACKGROUND: Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results.
Year: 2014 PMID: 24475911 PMCID: PMC3922167 DOI: 10.1186/1471-2105-15-30
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Data Flow. 1) Raw data from the sequencing instrument is passed to the vendor's primary analysis software to generate sequence reads and base-call confidence values (qualities). 2) Reads and qualities are passed to a mapping tool (BWA), which compares them to a reference genome to determine the placement of each read (producing a BAM file). 3) Individual sequence-event BAMs are merged into a single sample-level BAM file, which is then processed in preparation for variant calling. 4) Atlas-SNP and Atlas-indel identify variants and produce variant files (VCF). 5) Annotation adds biological and functional information to the variant lists and formats them for delivery.
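The five numbered steps map naturally onto a chain of command-line stages. The sketch below illustrates that flow in Python under stated assumptions: the BWA and samtools invocations are standard ones for tools the paper names, while atlas_snp2 and cassandra are placeholder entry points, since the source gives no exact command lines.

```python
# A minimal sketch of the Figure 1 data flow; tool command lines are
# illustrative assumptions, not Mercury's production wrappers.
import subprocess

def run(cmd):
    """Run one pipeline stage, failing loudly on a nonzero exit code."""
    subprocess.run(cmd, shell=True, check=True)

# 1) Primary analysis (bcl -> FASTQ) is vendor software (e.g. CASAVA) and
#    is assumed to have already produced per-lane paired FASTQ files.
lanes = ["lane1", "lane2"]

# 2) Map each sequence event against the reference with BWA.
for lane in lanes:
    run(f"bwa aln ref.fa {lane}_1.fq > {lane}_1.sai")
    run(f"bwa aln ref.fa {lane}_2.fq > {lane}_2.sai")
    run(f"bwa sampe ref.fa {lane}_1.sai {lane}_2.sai {lane}_1.fq {lane}_2.fq "
        f"| samtools sort -o {lane}.bam -")

# 3) Merge event-level BAMs into a single sample-level BAM.
run("samtools merge sample.bam " + " ".join(f"{l}.bam" for l in lanes))
run("samtools index sample.bam")

# 4) Call variants. 'atlas_snp2' is a placeholder name for the Atlas2
#    suite entry point; its real flags are not given in the source.
run("atlas_snp2 -i sample.bam -r ref.fa -o sample.snp.vcf")

# 5) Annotate the variant list (Cassandra in the paper; placeholder CLI).
run("cassandra -i sample.snp.vcf -o sample.annotated.vcf")
```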
Figure 2. Workflow monitoring in DNAnexus. The GUI for applet monitoring displays progress as a Gantt chart. The left panel lists the various steps, including the parallelization steps, with each row corresponding to a compute instance. Clicking a particular step reveals its exact inputs, outputs, and execution logs. Here we show a snapshot of the webpage displaying the progress of execution for the NA12878 exome analysis.
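The execution state shown in the GUI can also be polled programmatically. Below is a minimal sketch assuming the dxpy Python bindings published by DNAnexus (DXJob and its describe() method); it illustrates the platform API rather than Mercury's own monitoring code.

```python
import time
import dxpy  # DNAnexus Python bindings (assumed available)

def wait_and_report(job_id):
    """Poll a DNAnexus job until it reaches a terminal state."""
    job = dxpy.DXJob(job_id)
    while True:
        desc = job.describe()  # description dict includes 'name' and 'state'
        print(f"{desc['name']}: {desc['state']}")
        if desc["state"] in ("done", "failed", "terminated"):
            return desc["state"]
        time.sleep(30)  # poll interval is an arbitrary choice
```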
Mercury computational resource requirements
| Nodes | 1 | 1 | 0.333 | 0.5 | 0.125 | 1 | 0.333 | 1 | 0.125 | 0.167 | 0.167 | 0.167 |
| RAM (GB) | 48 | 48 | 15 | 28 | 3 | 48 | 14 | 32 | 4 | 7 | 7 | 8 |
| Wall-clock hours | 3.62 | 1.84 | 1.38 | 3.39 | 1.30 | 0.28 | 2.25 | 3.04 | 0.75 | 9.00 | 7.51 | 1.71 |
| Node*hrs | 3.62 | 1.84 | 0.46 | 1.70 | 0.16 | 0.28 | 0.75 | 3.04 | 0.09 | 1.50 | 1.25 | 0.29 |
All estimates are approximate for a whole exome or light-skim whole genome (~10-20 Gbp of data) sequenced on an Illumina HiSeq and processed with the most recent versions of RTA and CASAVA. Nodes are 8-core, 48 GB RAM machines with ~3 GHz Intel CPUs and ~1 TB of local scratch disk. Columns correspond to the successive pipeline steps, which span all aspects of the pipeline: building reads and qualities (FASTQ) from raw data (bcl files); alignment and BAM generation using the BWA aligner; BAM finishing, including GATK post-processing, duplicate marking, capture and coverage metric calculation, and BAM file validation; variant calling with the Atlas2 suite; and annotation with our annotator, Cassandra.
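The Node*hrs row is simply the product of the node fraction and the wall-clock hours for each step, the quantity that drives compute cost. A quick check in Python, using the values from the table:

```python
# Reproduce the Node*hrs row as Nodes x Hours for each pipeline step.
nodes = [1, 1, 0.333, 0.5, 0.125, 1, 0.333, 1, 0.125, 0.167, 0.167, 0.167]
hours = [3.62, 1.84, 1.38, 3.39, 1.30, 0.28, 2.25, 3.04, 0.75, 9.00, 7.51, 1.71]

node_hours = [n * h for n, h in zip(nodes, hours)]
print([round(nh, 2) for nh in node_hours])
# -> [3.62, 1.84, 0.46, 1.7, 0.16, 0.28, 0.75, 3.04, 0.09, 1.5, 1.25, 0.29]
print(f"total: {sum(node_hours):.2f} node-hours")  # ~14.98 for the full run
```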
A feature summary of Mercury in DNAnexus and similar tools and services
| Feature | Mercury in DNAnexus | | | | |
| Mapper | BWA | No canonical (many) | BWA | BWA | BWA/Bowtie |
| Variant caller | Atlas2 | No canonical (many) | GATK | Samtools mpileup | Samtools/GATK, Varscan |
| Annotation | Cassandra | snpEff, AlleleFrequency | snpEff, ANNOVAR | Bioconductor | ANNOVAR |
| Visualization | Built-in browser | Built-in browser | Links to IGV | Built-in browser | No |
| Runs on local hardware | No* | Yes | No | Yes | Yes |
| Runs on cloud infrastructure | Yes | Yes | Yes | Jobs queued to public server | No |
| HIPAA compliant | Yes | No | No | No | Not Applicable (local only) |
| Requires setup configuration | No | Yes | No | Yes | Yes |
| Can add tools independently | Yes | Yes | By request | No | No |
| Data sharing & collaboration | Yes | Yes | No | No | No |
*Mercury can be run on local hardware via Valence, but does not include all of the security and data sharing features of Mercury in DNAnexus.
Technical replicate data for the pipeline for eight samples sequenced in duplicate (A and B)
| Sample | A only (n) | A only (%) | In both A and B (n) | In both (%) | B only (n) | B only (%) |
| HS-1011 | 133 | 0.561% | 23,320 | 98.409% | 244 | 1.029% |
| HS-1015 | 155 | 0.542% | 28,312 | 98.910% | 157 | 0.548% |
| HS-1016 | 155 | 0.644% | 23,752 | 98.732% | 150 | 0.624% |
| HS-1017 | 105 | 0.456% | 22,767 | 98.768% | 179 | 0.777% |
| HS-1018 | 165 | 0.693% | 23,531 | 98.795% | 122 | 0.512% |
| HS-1019 | 162 | 0.682% | 23,441 | 98.686% | 150 | 0.631% |
| HS-1020 | 493 | 2.041% | 23,518 | 97.355% | 146 | 0.604% |
| HS-1021 | 161 | 0.681% | 23,188 | 98.125% | 282 | 1.193% |
| Average | 191 | 0.718% | 23,979 | 98.473% | 179 | 0.613% |
Concordance of SNP array and sequencing data
| Replicate | Passing SNPs | All SNPs (passing and not passing) |
| 1 | 1897 | 1906 |
| 2 | 1900 | 1906 |
| 3 | 1898 | 1904 |
| 4 | 1895 | 1905 |
| 5 | 1899 | 1907 |
| 6 | 1904 | 1907 |
| 7 | 1902 | 1905 |
| Average | 1899.28 | 1905.71 |
| Average (% of 1,927 array sites) | 98.56 | 98.89 |
For each exome technical replicate, we report the number of passing SNPs found by Mercury and the total number of SNPs (passing and not passing) in the VCF that overlap the SNP array design. The SNP array contains 1,927 sites that overlap the exome capture region.
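The percentages in the final row follow directly from dividing the average counts by the 1,927 overlapping array sites. A worked check in Python, using the per-replicate values from the table:

```python
# Reproduce the Average and Average % rows from the per-replicate counts.
ARRAY_SITES = 1927  # array sites overlapping the exome capture region

passing = [1897, 1900, 1898, 1895, 1899, 1904, 1902]
all_snps = [1906, 1906, 1904, 1905, 1907, 1907, 1905]

avg_passing = sum(passing) / len(passing)   # 1899.29 (table reports 1899.28)
avg_all = sum(all_snps) / len(all_snps)     # 1905.71

print(f"{100 * avg_passing / ARRAY_SITES:.2f}%")  # 98.56%
print(f"{100 * avg_all / ARRAY_SITES:.2f}%")      # 98.90% (table truncates to 98.89)
```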