VCPA: genomic variant calling pipeline and data management tool for Alzheimer's Disease Sequencing Project.

Yuk Yee Leung¹, Otto Valladares¹, Yi-Fan Chou¹, Han-Jen Lin¹, Amanda B Kuzma¹, Laura Cantwell¹, Liming Qu¹, Prabhakaran Gangadharan¹, William J Salerno², Gerard D Schellenberg¹, Li-San Wang¹.

Abstract

SUMMARY: We report VCPA, our SNP/indel Variant Calling Pipeline and data management tool used for the analysis of whole-genome and whole-exome sequencing (WGS/WES) data for the Alzheimer's Disease Sequencing Project. VCPA consists of two independent but linkable components: a pipeline and a tracking database. The pipeline, implemented using the Workflow Description Language and fully optimized for the Amazon Elastic Compute Cloud environment, covers all steps from aligning raw sequence reads to variant calling using GATK. The tracking database allows users to view job running status in real time and visualize >100 quality metrics per genome. VCPA is functionally equivalent to the CCDG/TOPMed pipeline. Users can use the pipeline and the dockerized database to process large WGS/WES datasets on the Amazon cloud with minimal configuration.
AVAILABILITY AND IMPLEMENTATION: VCPA is released under the MIT license and is available for academic and nonprofit use for free. The pipeline source code and step-by-step instructions are available from the National Institute on Aging Genetics of Alzheimer's Disease Data Storage Site (http://www.niagads.org/VCPA).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2019 PMID: 30351394 PMCID: PMC6513159 DOI: 10.1093/bioinformatics/bty894

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The Alzheimer’s Disease Sequencing Project (ADSP) is an integral component of the National Alzheimer’s Project Act (NAPA) effort towards a cure for Alzheimer’s Disease (AD). ADSP will eventually analyze whole-genome sequencing (WGS) and whole-exome sequencing (WES) data from more than 20 000 late-onset AD patients and cognitively normal elderly individuals to find new genetic variants associated with disease risk. To ensure that all sequencing data are processed consistently and efficiently according to best practices, a common workflow called ‘Variant Calling Pipeline and data management tool’ (VCPA) was developed by the Genome Center for Alzheimer’s Disease (GCAD) in collaboration with ADSP.

VCPA can process any kind of germline DNA sequencing data and is available for general use. VCPA (i) is optimized for large-scale production of WGS and WES data, (ii) includes a tracking database with a web frontend for users to track the production process and review quality metrics, (iii) is implemented using the Workflow Description Language (WDL) for easier deployment and maintenance and (iv) is designed for the latest human reference genome build (GRCh38/hg38, version GRCh38DH) and follows best practices for WGS analysis with input from TOPMed (Trans-Omics for Precision Medicine) and CCDG (Centers for Common Disease Genomics).

VCPA consists of two independent but interoperable components: a tracking database (Fig. 1A) with a web frontend (Fig. 1B) and a SNP/indel calling pipeline (Fig. 1C). The pipeline is optimized for automatic processing of WGS/WES data in various file formats, from mapping sequence reads to the latest human reference genome (GRCh38/hg38) through to variant calling. The tracking database (available as a separate Amazon Machine Image) is designed for monitoring job status and recording quality metrics for each processed sample (Fig. 1B). Through the dynamic web interface of the database, researchers can easily compare, share and visualize these individual-level quality metrics.

Fig. 1.

(A) VCPA tracking database; (B) dynamic view of job status; (C) VCPA Pipeline overview

2 SNP/indel calling pipeline

The variant calling pipeline for WGS (stages 1 and 2A) was developed with input from CCDG/TOPMed for functional equivalence (Regier et al., 2018) and follows the Genome Analysis Toolkit (GATK) v3.7 best practices for germline single nucleotide polymorphism (SNP) and insertion/deletion (indel) discovery (DePristo et al., 2011). A distinctive feature of VCPA is that it accepts either WGS or WES paired-end reads in FASTQ, BAM (binary sequence alignment map format) or CRAM (compressed BAM) format, together with flow cell information and, for exome sequencing, the genomic regions of the enrichment/capture kit. The workflow is modularized and consists of four stages (Fig. 1C). Users can configure the workflow to skip individual stages to reduce time and cost.

Stage 0 includes preparation steps for read mapping. For samples that were already mapped, PICARD (http://broadinstitute.github.io/picard) is used to roll back BAM files to uBAM (unaligned BAM) files (https://gatkforums.broadinstitute.org/gatk/discussions/tagged/ubam).

Stage 1 generates BAM files. First, reads are mapped to GRCh38/hg38 using BWA-MEM (Li H., 2013) and duplicate reads are marked by BamUtil (https://genome.sph.umich.edu/wiki/BamUtil). Next, BAM files are processed by Samblaster (adding MC and MQ tags to paired-end reads) (Faust and Hall, 2014) and sorted by genomic coordinates using SAMtools (Li et al., 2009). Finally, coverage statistics are computed using Sambamba (Tarasov et al., 2015).

Stage 2A performs local realignment near known indel sites (1000 Genomes indels) and recalibration of base call quality scores using GATK v3.7 (McKenna et al., 2010). Stage 2B implements the GATK best practice steps for variant calling and annotation of SNPs and indels, and generates genotype call files in genomic Variant Call Format (gVCF) for each sample individually. Quality metrics of called variants are computed using GATK (DePristo et al., 2011).

Stage 3 combines gVCF files from multiple samples and performs joint genotype calling using GATK best practices. A project-level VCF file containing genotype information for all polymorphic sites across all samples is generated. Details of job submission are described in Supplementary Methods. The resulting directory architecture and important VCPA outputs are described in Supplementary Figure S1.
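The modular stage layout described above can be pictured with a small planning helper. This is only an illustrative sketch: the stage labels and descriptions paraphrase the text, while the function name `plan_stages`, the `skip` argument and the validation rules are hypothetical and are not part of the VCPA WDL code or its configuration format.

```python
# Illustrative sketch only: stage labels paraphrase the paper's description;
# plan_stages() and its arguments are hypothetical, not VCPA's real interface.

ACCEPTED_FORMATS = ("FASTQ", "BAM", "CRAM")

# Insertion order of the dict is the execution order of the workflow.
STAGES = {
    "0": "prepare reads for mapping (roll back mapped BAM to uBAM)",
    "1": "map to GRCh38, mark duplicates, add tags, sort, compute coverage",
    "2A": "indel realignment and base quality score recalibration",
    "2B": "per-sample variant calling to gVCF",
    "3": "joint genotyping across samples to a project-level VCF",
}


def plan_stages(input_format, skip=()):
    """Return the ordered stage labels to run, omitting any skipped stages."""
    fmt = input_format.upper()
    if fmt not in ACCEPTED_FORMATS:
        raise ValueError(f"unsupported input format: {input_format!r}")
    unknown = set(skip) - set(STAGES)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    return [label for label in STAGES if label not in skip]


if __name__ == "__main__":
    # Example: process a CRAM input but stop before joint genotyping.
    for label in plan_stages("cram", skip=("3",)):
        print(f"Stage {label}: {STAGES[label]}")
```

Keeping the stages in one ordered mapping means that skipping a stage never changes the relative order of the remaining ones, which mirrors the idea of a fixed workflow with optional steps.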

3 Tracking database

The tracking database enables the user to monitor production status (Fig. 1B) and review sequencing quality, such as mapping percentage, depth of coverage and quality of called variants. All 113 quality metrics are collected during pipeline execution and imported into the database, where they are organized by workflow stage and project and can be viewed through an interactive web user interface. The tracking database is built on a LAMP (Linux, Apache Httpd, MySQL and PHP) application stack using the SLIM-PHP framework. The application has a small memory and storage footprint, provides a RESTful API to the MySQL back-end and supports password protection to restrict access. The tracking database is dockerized and can be installed on-site (off the cloud) if preferred.
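One way to picture metrics "organized by workflow stage and project" is a single long-format table keyed by project, sample, stage and metric name. The sketch below is an assumption made for illustration: it uses Python's built-in `sqlite3` in place of MySQL, and the table name, column names and helper functions are hypothetical rather than taken from the VCPA schema. The metric names and values in the usage section are placeholders, not results.

```python
import sqlite3

# Hypothetical schema for illustration; the real VCPA database uses MySQL
# and its table layout is not described in the text.


def create_schema(conn):
    """Create one long-format table holding every quality metric."""
    conn.execute(
        """
        CREATE TABLE quality_metric (
            project TEXT NOT NULL,
            sample  TEXT NOT NULL,
            stage   TEXT NOT NULL,
            metric  TEXT NOT NULL,
            value   REAL NOT NULL,
            PRIMARY KEY (project, sample, stage, metric)
        )
        """
    )


def record_metric(conn, project, sample, stage, metric, value):
    """Store a single metric value produced by a pipeline stage."""
    conn.execute(
        "INSERT INTO quality_metric VALUES (?, ?, ?, ?, ?)",
        (project, sample, stage, metric, value),
    )


def metrics_by_stage(conn, project, sample):
    """Return {stage: {metric: value}} for one sample in one project."""
    rows = conn.execute(
        "SELECT stage, metric, value FROM quality_metric "
        "WHERE project = ? AND sample = ? ORDER BY stage, metric",
        (project, sample),
    )
    grouped = {}
    for stage, metric, value in rows:
        grouped.setdefault(stage, {})[metric] = value
    return grouped


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # Placeholder names and values, for demonstration only.
    record_metric(conn, "demo_project", "sample_1", "1", "mapping_percentage", 99.0)
    record_metric(conn, "demo_project", "sample_1", "1", "mean_depth", 30.0)
    print(metrics_by_stage(conn, "demo_project", "sample_1"))
```

A long-format table keeps the schema unchanged when metrics are added or removed, at the cost of a grouping step when a per-sample view is assembled.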

4 Using VCPA on Amazon EC2 or local Linux environment

We evaluated our pipeline using the NA12878 sample from the Genome in a Bottle project and the hg38 high-confidence call set (Supplementary Methods). The sensitivity/precision of VCPA calls was 0.999/0.994 for SNPs and 0.985/0.987 for indels, comparable to the TOPMed/CCDG workflows (Regier et al., 2018). We also ran VCPA on two replicates of NA19238 (Yoruban) in CRAM and FASTQ format and the pairwise variant discordance rate is 1.000 for both SNVs and indels (Supplementary Methods), comparable to the TOPMed/CCDG workflows (Regier et al., 2018). Benchmarks of cost and time on Amazon Elastic Compute Cloud (EC2) for running VCPA on these samples can be found in Supplementary Table S1.

VCPA is available as an Amazon Machine Image (AMI), ami-acc840d3. A dockerized version is also available for deployment in other Linux-based environments.

To conclude, VCPA is an efficient, high-quality and scalable pipeline for processing WGS/WES data in the Amazon EC2 environment. VCPA is used for ADSP production and can track information from >1000 genome analysis runs simultaneously. Future plans include incorporating other variant calling pipelines such as xAtlas (Farek et al.) and GATK4.
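For readers unfamiliar with the evaluation metrics quoted in this section, the sketch below shows the standard definitions of sensitivity and precision when a call set is compared with a truth set. It assumes that variants are matched exactly on (chromosome, position, reference allele, alternate allele); the function name and the toy variants are hypothetical, and the paper's own evaluation procedure is the one given in its Supplementary Methods.

```python
# Standard definitions, sketched under the assumption of exact matching on
# (chrom, pos, ref, alt). Not the evaluation procedure used in the paper.


def compare_to_truth(called, truth):
    """Compare two sets of variant tuples and return summary metrics."""
    tp = len(called & truth)   # called and present in the truth set
    fp = len(called - truth)   # called but absent from the truth set
    fn = len(truth - called)   # in the truth set but not called
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": sensitivity,
        "precision": precision,
    }


if __name__ == "__main__":
    # Toy variants for demonstration only.
    truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "GA")}
    called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 75, "T", "C")}
    print(compare_to_truth(called, truth))
```

Sensitivity is the fraction of truth-set variants that were recovered, and precision is the fraction of called variants that are in the truth set, so the two figures are reported together.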

References

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. Sambamba: fast processing of NGS alignment formats.

Authors: Artem Tarasov; Albert J Vilella; Edwin Cuppen; Isaac J Nijman; Pjotr Prins
Journal: Bioinformatics Date: 2015-02-19 Impact factor: 6.937

3. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

4. A framework for variation discovery and genotyping using next-generation DNA sequencing data.

Authors: Mark A DePristo; Eric Banks; Ryan Poplin; Kiran V Garimella; Jared R Maguire; Christopher Hartl; Anthony A Philippakis; Guillermo del Angel; Manuel A Rivas; Matt Hanna; Aaron McKenna; Tim J Fennell; Andrew M Kernytsky; Andrey Y Sivachenko; Kristian Cibulskis; Stacey B Gabriel; David Altshuler; Mark J Daly
Journal: Nat Genet Date: 2011-04-10 Impact factor: 38.330

5. SAMBLASTER: fast duplicate marking and structural variant read extraction.

Authors: Gregory G Faust; Ira M Hall
Journal: Bioinformatics Date: 2014-05-07 Impact factor: 6.937

6. Functional equivalence of genome sequencing analysis pipelines enables harmonized variant calling across human genetics projects.

Authors: Allison A Regier; Yossi Farjoun; David E Larson; Olga Krasheninina; Hyun Min Kang; Daniel P Howrigan; Bo-Juen Chen; Manisha Kher; Eric Banks; Darren C Ames; Adam C English; Heng Li; Jinchuan Xing; Yeting Zhang; Tara Matise; Goncalo R Abecasis; Will Salerno; Michael C Zody; Benjamin M Neale; Ira M Hall
Journal: Nat Commun Date: 2018-10-02 Impact factor: 14.919
