Jason L Causey1, Cody Ashby2, Karl Walker3, Zhiping Paul Wang4, Mary Yang5, Yuanfang Guan6, Jason H Moore4, Xiuzhen Huang7.
Abstract
Next-generation sequencing is empowering genetic disease research. However, it also brings significant challenges for efficient and effective sequencing data analysis. We built a pipeline, called DNAp, for analyzing whole exome sequencing (WES) and whole genome sequencing (WGS) data to detect mutations from disease samples. The pipeline is containerized and convenient to use, and because it runs as a fully automatic process in a Docker container, it can run on any system that supports Docker. It is also open and can be easily customized through user intervention points, for example to update reference files or to swap in different software or versions. The pipeline has been tested with both human and mouse sequencing datasets, and it has generated mutation results that are comparable to published results from these datasets and reproducible across heterogeneous hardware platforms. The pipeline DNAp, funded by the US Food and Drug Administration (FDA), was developed for analyzing FDA DNA sequencing data. Here we make DNAp open source, with the software and documentation available to the public at http://bioinformatics.astate.edu/dna-pipeline/.
Year: 2018 PMID: 29717215 PMCID: PMC5931599 DOI: 10.1038/s41598-018-25022-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
The full list of software tools available in the DNAp pipeline is shown in relative order of execution in the first column.
| Tool | WES Mode | WGS Mode |
|---|---|---|
| FastQC | * | * |
| BWA-MEM | * | * |
| Picard Tools SortSam | * | * |
| Picard Tools BuildBamIndex | * | * |
| Picard Tools MarkDuplicates | * | * |
| Picard Tools MergeSamFiles | * | * |
| GATK RealignerTargetCreator | optional | optional |
| GATK IndelRealigner | optional | optional |
| GATK BaseRecalibrator | optional | optional |
| GATK AnalyzeCovariates | optional | optional |
| GATK PrintReads | optional | optional |
| GATK DiagnoseTargets | * | |
| GATK DepthOfCoverage | * | |
| Qualimap | * | * |
| Strelka | * | * |
| MuTect2 | * | * |
| GATK CatVariants | * | * |
| Samtools view | * | |
| Samtools sort | * | |
| Lumpy | * | |
| bcftools filter | * | |
| Picard Tools SortVcf | * | |
| Breakdancer | * | |
| Pindel | * | |
| pindel2vcf | * | |
| Jacquard merge | * | * |
| GATK SelectVariants | * | * |
| GATK CombineVariants | * | |
| SnpEff | * | * |
| Oncotator | * | * |
The second and third columns indicate which tools are utilized by default in the Whole Exome and Whole Genome pipeline analysis modes, respectively. Some tools are used multiple times, but for brevity each is listed only at the relative point of first use. Tools listed as “optional” are executed only if realignment around indels is requested.
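The selection rule described above can be sketched in a few lines of Python. This is only an illustration of how the table is read, not DNAp's actual configuration format: `TOOL_TABLE` holds a short excerpt of the rows, and the names `TOOL_TABLE`, `tools_for`, and `realign_indels` are hypothetical.

```python
# Excerpt of the tool table as (tool, WES marker, WGS marker).
# "*" = run by default, "optional" = run only when indel realignment
# is requested, "" = not run in that mode.
TOOL_TABLE = [
    ("FastQC", "*", "*"),
    ("BWA-MEM", "*", "*"),
    ("GATK RealignerTargetCreator", "optional", "optional"),
    ("GATK IndelRealigner", "optional", "optional"),
    ("GATK DiagnoseTargets", "*", ""),
    ("Strelka", "*", "*"),
    ("MuTect2", "*", "*"),
]


def tools_for(mode, realign_indels=False):
    """Return the tools from TOOL_TABLE that run in `mode`, in table order."""
    column = {"WES": 1, "WGS": 2}[mode]
    selected = []
    for row in TOOL_TABLE:
        marker = row[column]
        if marker == "*" or (marker == "optional" and realign_indels):
            selected.append(row[0])
    return selected


print(tools_for("WGS"))
print(tools_for("WES", realign_indels=True))
```

With realignment left off, the optional GATK steps drop out of both modes, and the WES-only row appears only when `mode` is `"WES"`.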
Figure 1. The DNAp pipeline processes fastq-format input files into annotated VCF outputs, in either Whole Exome or Whole Genome analysis mode, with options for running specific parts of the pipeline as needed. The diagram shows end-to-end operation; major tools are named and intermediate processing is shown as composite processes (double lines). Data flow paths that are specific to the WES or WGS modes are color coded blue for WES-only and orange for WGS-only.
Figure 2. Alignment comparison of A-State Pipeline (blue bars) versus Bowtie2 (orange) and a reference BWA aligner (green) shows similar performance for all three aligners, demonstrating that the pipeline is receiving and aligning the input reads as expected.
WES Pipeline test using cell line HCC1187C/BL.
| Caller | Matched in Ref | Caller Only | Ref Only |
|---|---|---|---|
| MuTect2 | 258 | 1773 | 47 |
| Strelka | 242 | 24 | 63 |
| Consensus:Intersection | 238 | 8 | 67 |
| Consensus:Union | 259 | 1783 | 46 |
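One way to read the WES table is to turn each row into two ratios: the fraction of the reference set that a caller recovered, and the fraction of a caller's output that also appears in the reference. The sketch below does this with the counts copied from the table. The function name `summarize` and the two ratio names are illustrative assumptions, and "Caller Only" variants are not necessarily false positives, since the published reference list may be incomplete. In every row, "Matched in Ref" plus "Ref Only" sums to 305, the size of the reference set.

```python
# Counts copied from the WES table: (matched in ref, caller only, ref only).
WES_COUNTS = {
    "MuTect2": (258, 1773, 47),
    "Strelka": (242, 24, 63),
    "Consensus:Intersection": (238, 8, 67),
    "Consensus:Union": (259, 1783, 46),
}


def summarize(matched, caller_only, ref_only):
    """Return (fraction of reference recovered, fraction of calls in reference)."""
    recovered = matched / (matched + ref_only)
    in_reference = matched / (matched + caller_only)
    return recovered, in_reference


for caller, counts in WES_COUNTS.items():
    recovered, in_reference = summarize(*counts)
    print(f"{caller:24s} recovered={recovered:.3f} in_reference={in_reference:.3f}")
```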
WGS Pipeline test using cell line HCC1187C/BL.
| Caller | Matched in Ref | Caller Only | Ref Only |
|---|---|---|---|
| MuTect2 | 13558 | 116464 | 2299 |
| Strelka | 13180 | 1821 | 2677 |
| Consensus:Intersection | 12557 | 273 | 3300 |
| Consensus:Union | 13778 | 117544 | 2079 |
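The "Consensus:Intersection" and "Consensus:Union" rows in the two tables above can be understood as set operations over the two callers' outputs. The following sketch uses invented toy variants keyed by (chromosome, position, reference allele, alternate allele); it illustrates the counting only and does not reproduce how the pipeline's merge step works internally. All variant values and the helper name `compare` are hypothetical.

```python
# A variant is keyed by (chromosome, position, ref allele, alt allele).
# All calls below are invented toy data for illustration only.
mutect2 = {("chr1", 1000, "A", "T"), ("chr1", 2000, "G", "C"), ("chr2", 500, "C", "A")}
strelka = {("chr1", 1000, "A", "T"), ("chr2", 500, "C", "A"), ("chr3", 750, "T", "G")}
reference = {("chr1", 1000, "A", "T"), ("chr1", 2000, "G", "C"), ("chr4", 42, "G", "A")}


def compare(calls, ref):
    """Return (matched in ref, caller only, ref only) counts for one call set."""
    return len(calls & ref), len(calls - ref), len(ref - calls)


call_sets = {
    "MuTect2": mutect2,
    "Strelka": strelka,
    "Consensus:Intersection": mutect2 & strelka,  # called by both callers
    "Consensus:Union": mutect2 | strelka,         # called by either caller
}

for name, calls in call_sets.items():
    print(name, compare(calls, reference))
```

The toy data shows the same trade-off as the tables: the intersection has the fewest caller-only variants but misses the most reference variants, while the union recovers the most reference variants at the cost of more caller-only calls.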
WGS Pipeline variant calling test for TCRBOA6 at two heterogeneous sites.
| Caller | Filter | Both Sites Matched | % | Site 1 Only | % | Site 2 Only | % |
|---|---|---|---|---|---|---|---|
| MuTect2 | all | 184346 | 91.0 | 9204 | 4.5 | 8973 | 4.4 |
| MuTect2 | PASS | 3700 | 98.8 | 23 | 0.6 | 23 | 0.6 |
| Strelka | all | 26731 | 99.4 | 87 | 0.3 | 85 | 0.3 |
| Strelka | PASS | 1409 | 98.5 | 12 | 0.8 | 10 | 0.7 |
| Lumpy | * | 7228 | 99.7 | 12 | 0.2 | 9 | 0.1 |
Site 1 was an on-site Dell R820 server; Site 2 was a Google Compute Engine virtual machine. *Lumpy outputs do not include filter information.
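The percentages in the cross-site table are consistent with dividing each count by the total number of distinct variants seen at either site (both sites + Site 1 only + Site 2 only). The sketch below recomputes them under that assumption from the counts in the table; the function name `site_percentages` is illustrative.

```python
# Counts copied from the cross-site table: (both sites, site 1 only, site 2 only).
SITE_COUNTS = {
    ("MuTect2", "all"): (184346, 9204, 8973),
    ("MuTect2", "PASS"): (3700, 23, 23),
    ("Strelka", "all"): (26731, 87, 85),
    ("Strelka", "PASS"): (1409, 12, 10),
    ("Lumpy", "*"): (7228, 12, 9),
}


def site_percentages(both, site1_only, site2_only):
    """Return each count as a percentage of all variants seen at either site."""
    total = both + site1_only + site2_only
    return tuple(round(100 * n / total, 1) for n in (both, site1_only, site2_only))


for (caller, variant_filter), counts in SITE_COUNTS.items():
    print(caller, variant_filter, site_percentages(*counts))
```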