Ellen A Tsai, Rimma Shakbatyan, Jason Evans, Peter Rossetti, Chet Graham, Himanshu Sharma, Chiao-Feng Lin, Matthew S Lebo.
Abstract
Effective implementation of precision medicine will be enhanced by a thorough understanding of each patient's genetic composition to better treat his or her presenting symptoms or mitigate the onset of disease. This ideally includes the sequence information of a complete genome for each individual. At Partners HealthCare Personalized Medicine, we have developed a clinical process for whole genome sequencing (WGS) with application in both healthy individuals and those with disease. In this manuscript, we will describe our bioinformatics strategy to efficiently process and deliver genomic data to geneticists for clinical interpretation. We describe the handling of data from FASTQ to the final variant list for clinical review for the final report. We will also discuss our methodology for validating this workflow and the cost implications of running WGS.
Keywords: NGS; WGS; bioinformatics; clinical sequencing; next generation sequencing; precision medicine; validation
Year: 2016 PMID: 26927186 PMCID: PMC4810391 DOI: 10.3390/jpm6010012
Source DB: PubMed Journal: J Pers Med ISSN: 2075-4426
Top 20 Poorly Covered Genes with Clinical Relevance. Clinical relevance is defined as having at least five Pathogenic or Likely pathogenic variants in ClinVar reported in the gene by submitting laboratories or working groups.
| Gene | # Clinically Significant Variants | % Callable | Disease | Disease Prevalence |
|---|---|---|---|---|
| STRC | 8 | 20 | Sensorineural hearing loss | Common |
| ADAMTSL2 | 5 | 32 | Geleophysic dysplasia | Rare |
| CYP21A2 | 13 | 44 | Congenital adrenal hyperplasia | Common |
| ARX | 19 | 45 | X-linked infantile spasm syndrome | Rare |
| MECP2 | 250 | 53 | Rett syndrome | Common |
| GJB1 | 16 | 53 | Charcot-Marie-Tooth disease | Common |
| ABCD1 | 33 | 57 | X-linked adrenoleukodystrophy | Moderate |
| EMD | 11 | 57 | Emery-Dreifuss muscular dystrophy | Moderate |
| G6PD | 16 | 58 | Glucose-6-phosphate dehydrogenase deficiency | Common |
| GATA1 | 12 | 60 | Dyserythropoietic anemia and thrombocytopenia | Rare |
| AVPR2 | 15 | 62 | Nephrogenic diabetes insipidus | Rare |
| EDA | 37 | 63 | Hypohidrotic ectodermal dysplasia | Moderate |
| SLC16A2 | 11 | 63 | Allan-Herndon-Dudley syndrome | Rare |
| FLNA | 42 | 64 | Otopalatodigital syndrome | Rare |
| EBP | 24 | 64 | X-linked chondrodysplasia punctata | Rare |
| RPGR | 17 | 64 | Retinitis pigmentosa | Common |
| TAZ | 17 | 64 | Barth syndrome | Rare |
| IDS | 16 | 64 | Hunter syndrome | Moderate |
| FGD1 | 8 | 64 | Aarskog-Scott syndrome | Rare |
| GPR143 | 6 | 65 | Ocular albinism | Moderate |
Cost Analysis for Storage of WGS data. Primary storage assumes unreplicated, active storage with high input/output (I/O) capacity. Secondary storage assumes replicated, deep storage with low I/O capacity. The cost of processing a genome and data retention on the primary and secondary storage for one year is ~$245.
| Storage Type | Genome/Month ($) | Genome/Year ($) | Genome/5 Years ($) |
|---|---|---|---|
| Primary | 4.42 | 53.04 | 265.20 |
| Secondary | 3.48 | 41.76 | 208.80 |
| Total | 7.90 | 94.80 | 474.00 |
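
The yearly and five-year figures in the table follow directly from the monthly per-genome rates. A minimal sketch of that arithmetic, assuming simple linear scaling with no discounting or price changes (the function name `storage_cost` is hypothetical, not from the paper):

```python
# Monthly per-genome storage costs in USD, taken from the table above.
MONTHLY_COST = {"Primary": 4.42, "Secondary": 3.48}


def storage_cost(months: int) -> dict:
    """Return per-tier and total storage cost for one genome over `months`.

    Assumes cost scales linearly with retention time (an assumption of this
    sketch; the source only reports the three time horizons).
    """
    costs = {tier: round(rate * months, 2) for tier, rate in MONTHLY_COST.items()}
    costs["Total"] = round(sum(costs.values()), 2)
    return costs


if __name__ == "__main__":
    for label, months in [("month", 1), ("year", 12), ("5 years", 60)]:
        print(label, storage_cost(months))
```

Note that the ~$245 figure in the caption also includes processing, so it is not derivable from the storage rates alone.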
Figure 1. WGS Alignment and Variant Calling Pipeline. Our pipeline has multiple entry points, so it can be re-triggered after a system failure or started from an outside dataset. Standard processing of genome data from the Illumina Clinical Services Laboratory starts at the top entry point. From there, FASTQ sequences are aligned to the hg19 reference genome using the Burrows-Wheeler Aligner (BWA). Because alignment is computationally intensive, we divide the sequence files into smaller files. The alignments, known as “raw” BAM files in our pipeline, are processed through a series of steps prior to variant calling. The “final” BAM file is the result of removing PCR artifacts, local indel realignment, and base quality recalibration. It is the input to the variant-calling portion of the pipeline. Variant calling happens in two phases: variants are first identified, and their quality scores are then recalibrated in the final VCF file.
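
The caption says the sequence files are divided into smaller files before alignment but does not say how. One way to do this, sketched under the assumption of the standard four-line FASTQ record layout, is to cut the input on record boundaries so each chunk can be aligned independently. The function name `split_fastq` and the chunk size are hypothetical:

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding at most `reads_per_chunk` reads.

    Assumes every record is exactly 4 lines: header, sequence, '+', qualities.
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be >= 1")
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4 != 0:
            raise ValueError("truncated FASTQ record")
        yield chunk


if __name__ == "__main__":
    # Three made-up reads, split into chunks of at most two reads each.
    demo = []
    for i in range(3):
        demo += [f"@read{i}", "ACGT", "+", "IIII"]
    for n, chunk in enumerate(split_fastq(demo, reads_per_chunk=2)):
        print(f"chunk {n}: {len(chunk) // 4} reads")
```

In practice each chunk would be written to its own file and passed to the aligner as a separate job; that orchestration is omitted here.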
Example Filtration Methods. These filters are applied using Boolean logic to produce the final list of filtered variants in each individual.
| Filter Name | Parameter | Description |
|---|---|---|
| Frequency | X (e.g., 0.01 or 0.05) | Keep variants that have frequencies in ESP or 1000 Genomes ≤ X |
| Loss-of-Function | | Keep variants that may implicate loss of gene function, including those annotated with the following Sequence Ontology keywords: frameshift_variant, stop_gained, stop_lost, splice_acceptor_variant, initiator_codon_variant, splice_donor_variant. |
| Gene List | Gene list (in HGNC nomenclature) | Gene filtration is based on selecting variants that are within particular genes. We check if a variant is annotated with a gene symbol of interest within a clinical region of interest. |
| Reported Pathogenic | | Select variants that are classified as Pathogenic or Likely pathogenic in variant databases, including ClinVar. |
| GeneInsight | | Select variants that are classified as Pathogenic or Likely pathogenic in our internal GeneInsight database. |
| Compound Heterozygous | | Select LOF and missense variants if there are at least two alterations in the gene that may impact function of both alleles. |
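
To illustrate how such filters might be combined with Boolean logic, here is a minimal sketch. The variant record layout (`gene`, `consequence`, `pop_freq`, `classification`), the function names, and the particular AND/OR combination are all assumptions for illustration; the source does not specify how the filters are chained. The compound heterozygous and GeneInsight filters are omitted.

```python
# Sequence Ontology terms listed under the Loss-of-Function filter above.
LOF_TERMS = {
    "frameshift_variant", "stop_gained", "stop_lost",
    "splice_acceptor_variant", "initiator_codon_variant", "splice_donor_variant",
}


def is_rare(variant: dict, max_freq: float = 0.01) -> bool:
    """Frequency filter: population frequency <= X."""
    return variant["pop_freq"] <= max_freq


def is_lof(variant: dict) -> bool:
    """Loss-of-Function filter: consequence is one of the listed SO terms."""
    return variant["consequence"] in LOF_TERMS


def in_gene_list(variant: dict, genes: set) -> bool:
    """Gene List filter: variant is annotated with a gene symbol of interest."""
    return variant["gene"] in genes


def is_reported_pathogenic(variant: dict) -> bool:
    """Reported Pathogenic filter: Pathogenic or Likely pathogenic."""
    return variant["classification"] in {"Pathogenic", "Likely pathogenic"}


def keep(variant: dict, genes: set, max_freq: float = 0.01) -> bool:
    """One hypothetical combination: reported pathogenic variants are always
    kept; otherwise a variant must be rare AND loss-of-function AND in the
    gene list."""
    return is_reported_pathogenic(variant) or (
        is_rare(variant, max_freq) and is_lof(variant) and in_gene_list(variant, genes)
    )


if __name__ == "__main__":
    # Made-up example records, not data from the paper.
    variants = [
        {"gene": "MECP2", "consequence": "stop_gained", "pop_freq": 0.0, "classification": None},
        {"gene": "G6PD", "consequence": "missense_variant", "pop_freq": 0.2, "classification": "Pathogenic"},
        {"gene": "TTN", "consequence": "frameshift_variant", "pop_freq": 0.001, "classification": None},
    ]
    for v in variants:
        print(v["gene"], keep(v, genes={"MECP2"}))
```

Keeping each filter as a small predicate makes it easy to recombine them per indication, for example by swapping the gene list or the frequency threshold.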
Figure 2. Bioinformatics Workflow. Our process is divided into four major phases. During this process, there are three trigger points that require manual hands-on time: (1) Alignment and Variant Calling; (2) Annotation and Upload to Oracle SQL; and (3) Variant Filtration. Segmenting these processes offers the ability to check data integrity throughout the process and the flexibility to use parts of these scripts for processing a non-standard clinical or research sample.
(a) Specificity
| Variant Type | FP (before Thresholds) | FP (after Thresholds) |
|---|---|---|
| SNVs | 20 | 1 |
| Indels | 1 | 0 |
(b) Sensitivity
| Variant Type | # | FN | Sensitivity | 95% CI |
|---|---|---|---|---|
| SNVs | 410 | 0 | 100% | 99.1%–100% |
| Indels | 15 | 0 | 100% | 79.6%–100% |
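
The table does not name the method used for the 95% confidence intervals. As a sketch, the Wilson score interval reproduces both reported lower bounds (99.1% for 410 of 410 SNVs and 79.6% for 15 of 15 indels), so it is a plausible candidate, though that is an inference and not something the source states. The function name `wilson_interval` is hypothetical:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple:
    """Two-sided Wilson score interval for a binomial proportion.

    `z` defaults to the standard normal quantile for 95% confidence.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Detected variants out of total, from the sensitivity table (FN = 0).
    for label, detected, total in [("SNVs", 410, 410), ("Indels", 15, 15)]:
        low, high = wilson_interval(detected, total)
        print(f"{label}: {low:.1%} to {high:.1%}")
```

The much wider interval for indels reflects the small number of indels in the validation set, not a lower observed sensitivity.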
(c) Concordance with 1000 Genomes data
| Variant Type | 1K Genomes Variants | Present in NGS Calls | % Present in NGS Calls | Present in NGS Calls with Matched Genotypes | % Present in NGS Calls with Matched Genotypes |
|---|---|---|---|---|---|
| SNVs | 2,762,933 | 2,735,592 | 99.01% | 2,730,826 | 98.84% |
| Indels | 327,474 | 299,300 | 91.39% | 285,401 | 87.15% |
| Total | 3,090,407 | 3,034,892 | 98.20% | 3,016,227 | 97.60% |