A scalable, aggregated genotypic-phenotypic database for human disease variation.

Ryan Barrett, Cynthia L Neben, Anjali D Zimmer, Gilad Mishne, Wendy McKennon, Alicia Y Zhou, Jeremy Ginsberg.

Abstract

Next generation sequencing multi-gene panels have greatly improved the diagnostic yield and cost effectiveness of genetic testing and are rapidly being integrated into the clinic for hereditary cancer risk. With this technology comes a dramatic increase in the volume, type and complexity of data. This invaluable data, though, is too often buried or inaccessible to researchers, especially those without strong analytical or programming skills. To effectively share comprehensive, integrated genotypic-phenotypic data, we built Color Data, a publicly available, cloud-based database that supports broad access and data literacy. The database contains data from 50 000 individuals who were sequenced for 30 genes associated with hereditary cancer risk and provides information on allele frequency and variant classification, as well as associated phenotypic information such as demographics and personal and family history. Our user-friendly interface allows researchers to easily execute their own queries with filtering, and the results of queries can be shared and/or downloaded. The rapid and broad dissemination of these research results will help increase the value of, and reduce the waste in, scientific resources and data. Furthermore, the database is able to quickly scale and support integration of additional genes and human hereditary conditions. We hope that this database will help researchers and scientists explore genotype-phenotype correlations in hereditary cancer, identify novel variants for functional analysis and enable data-driven drug discovery and development.

Year:  2019        PMID: 30759220      PMCID: PMC6372842          DOI: 10.1093/database/baz013

Source DB:  PubMed          Journal:  Database (Oxford)        ISSN: 1758-0463            Impact factor:   3.451


Introduction

Next generation sequencing (NGS) technologies continue to revolutionize the field of genomics as low-cost, high-throughput platforms with high sensitivity. Over the past few years, NGS multi-gene panels have been increasingly used in both the clinic and research laboratories for genetic screening, diagnosis and assessment of hereditary conditions, including cancer (1–3). About 10–15% of common cancers have been associated with inherited pathogenic or likely pathogenic variants that have well-established clinical presentations (4); an additional 5–15% are thought to be inherited (i.e. familial), but the underlying genetic etiologies have yet to be identified (5, 6). The study of genomic data in these cases can help reveal genotype–phenotype correlations in hereditary cancer, identify novel variants for functional analysis and enable data-driven drug discovery and development. However, the expanding volume, type and complexity of such data pose several bioinformatics challenges in storage, analysis and interpretation (7).
Several public population databases, as well as public and commercial cancer-specific databases, have been developed for genomic data and provide useful information on gene annotation, allele frequency and known or predicted functional consequences of variants (8–11). The sharing and pooling of this data is critical in interpreting the clinical significance of variants and in the delivery of genomic medicine (11, 12). However, associated specific clinical information, such as demographics and personal and family history, is not always available, and independently linking large sets of genotypic and phenotypic information often requires knowledge of programming languages and database intelligence or expensive local software.
To effectively share comprehensive, integrated genotypic–phenotypic data, we built Color Data, a cloud-based database that supports broad access and data literacy. Our user-friendly interface allows researchers to easily execute their own queries with filtering. The results of queries are visualized as text and graphic features and can be downloaded in tabular format directly through the database to conduct further data analysis. At the time of publication, the database contains gene variants and phenotypes from 50 000 affected and unaffected individuals who were sequenced for 30 genes associated with hereditary cancer risk. Importantly, we have designed the web interface and underlying implementation to quickly scale and support samples and information from millions of individuals, as well as whole-genome sequencing data.

Materials and methods

The high-level workflow and technical overview of the database are depicted in Figure 1 and described in detail below.
Figure 1

High-level workflow of the database. The workflow is divided into four subprocesses, ‘Data Collection’, ‘Bioinformatics’, ‘Architecture’ and ‘User’, each enclosed in a differently colored rounded rectangle.

Data collection

Each individual's Color test was ordered by a healthcare provider. All phenotypic information was reported by the individual through an interactive, online health history tool in their Color account. The phenotypic questions are available upon request. Individuals who reported more than one ancestry were counted as ‘Multiple ethnicities’ with the following exceptions: (i) any individuals who reported ‘Ashkenazi Jewish’ in addition to any other ancestry were counted as ‘Ashkenazi Jewish’; (ii) any individuals who reported ‘Hawaiian’ were counted as ‘Pacific Islander’; and (iii) any individuals who reported any combination of ‘Chinese’, ‘Japanese’, ‘Indian’, ‘Filipino’, ‘Hawaiian’, ‘Other Pacific Islander’ or ‘Other Asian’ and no other ancestry were counted as ‘Asian, not specified’. All individuals consented to have their information appear in Color’s research database. Individuals were not recruited for this database and can opt out of participating in it. The population was not selected for any particular metric, including gender, age, ethnicity or history of cancer, and individuals were included in consecutive order.
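The ancestry grouping rules above can be sketched in code. This is a minimal illustration, not the authors' implementation: the function name `ancestry_category` is hypothetical, the exceptions are assumed to apply in the order they are listed (so any report containing ‘Hawaiian’ resolves under rule (ii) before rule (iii) is considered), a single reported ancestry is assumed to be passed through unchanged, and ‘European’ is used only as a placeholder label for an ancestry outside the listed group.

```python
# Ancestries named in exception (iii) of the Data collection section.
ASIAN_PI_GROUP = frozenset({
    "Chinese", "Japanese", "Indian", "Filipino",
    "Hawaiian", "Other Pacific Islander", "Other Asian",
})


def ancestry_category(reported):
    """Map a collection of self-reported ancestries to one category.

    Assumption: exceptions are evaluated in the order listed in the text.
    """
    reported = frozenset(reported)
    if not reported:
        raise ValueError("at least one ancestry is required")

    # (i) 'Ashkenazi Jewish' plus anything else counts as 'Ashkenazi Jewish'.
    if "Ashkenazi Jewish" in reported:
        return "Ashkenazi Jewish"

    # (ii) Any report of 'Hawaiian' counts as 'Pacific Islander'.
    if "Hawaiian" in reported:
        return "Pacific Islander"

    # (iii) A combination drawn only from the listed group counts as
    # 'Asian, not specified'.
    if len(reported) > 1 and reported <= ASIAN_PI_GROUP:
        return "Asian, not specified"

    # Assumption: a single reported ancestry is kept as reported.
    if len(reported) == 1:
        return next(iter(reported))

    # Default for more than one ancestry.
    return "Multiple ethnicities"


if __name__ == "__main__":
    for example in (["Chinese", "Japanese"], ["Chinese", "European"]):
        print(example, "->", ancestry_category(example))
```

Evaluating the rules in listed order is only one reading of the text; a pipeline that applied rule (iii) before rule (ii) would group a ‘Hawaiian’ plus ‘Chinese’ report differently.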

Bioinformatics pipeline

Laboratory procedures were performed at the Color laboratory (Burlingame, CA) under Clinical Laboratory Improvement Amendments (#05D2081492) and College of American Pathologists (#8975161) compliance as previously described (3). Briefly, genomic DNA was extracted from blood or saliva (Perkin Elmer Chemagic DNA Extraction Kit), enriched for select regions using SureSelect XT probes and then sequenced using NextSeq 500/550 or NovaSeq 6000 instruments (Illumina). Sequence reads were aligned against human genome reference GRCh37.p12 with the Burrows–Wheeler Aligner (13), and duplicate and low-quality reads were removed. Single nucleotide variants and small insertions and deletions (indels, 2–50 bp) were called by the HaplotypeCaller module of GATK 3.4 (14). Variants in homopolymer regions were called by an internally developed algorithm using SAMtools. Large structural variants (>50 bp) were detected using dedicated algorithms based on read depth (CNVkit) (15), paired reads and split reads [LUMPY (16), in-house developed algorithms]. Variants were classified according to the standards and guidelines for sequence variant interpretation of the American College of Medical Genetics and Genomics (17), and all variant classifications were signed out by a board-certified medical geneticist or pathologist. Variant classification categories are pathogenic (P), likely pathogenic (LP), variant of uncertain significance (VUS), likely benign (LB) and benign (B). The genes in Color Data were selected based on (i) published evidence of their association with hereditary cancer risk and (ii) the technical feasibility of sequencing them using the NGS methods described above. These genes are APC, ATM, BAP1, BARD1, BMPR1A, BRCA1, BRCA2, BRIP1, CDH1, CDK4, CDKN2A (p14ARF and p16INK4a), CHEK2, EPCAM, GREM1, MITF, MLH1, MSH2, MSH6, MUTYH, NBN, PALB2, PMS2, POLD1, POLE, PTEN, RAD51C, RAD51D, SMAD4, STK11 and TP53.
Analysis, variant calling and reporting focused on the complete coding sequence and adjacent intronic sequence of the primary transcript(s) (Table S1), unless otherwise indicated. In PMS2, exons 12–15 were not analyzed. In several genes, only specific positions known to impact cancer risk were analyzed (genomic coordinates in GRCh37): CDK4—only chr12:g.58145429-58145431 (codon 24), MITF—only chr3:g.70014091 (including c.952G>A), POLD1—only chr19:g.50909713 (including c.1433G>A), POLE—only chr12:g.133250250 (including c.1270C>G), EPCAM—only large deletions and duplications including the 3′ end of the gene and GREM1—only duplications in the upstream regulatory region.
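As an illustration of the position-restricted genes listed above, the following sketch checks whether a GRCh37 coordinate falls inside the analyzed position(s) for a gene. It is a hypothetical helper, not part of the published pipeline: the names `Region`, `RESTRICTED_POSITIONS` and `in_analyzed_position` are assumptions, and only the four genes with explicit coordinates in the text are encoded. EPCAM, GREM1 and PMS2 are restricted by variant type or exon rather than by a single coordinate range, so they are deliberately left out.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Closed genomic interval (1-based, inclusive) in GRCh37."""
    chrom: str
    start: int
    end: int


# Coordinates copied from the text; nothing else is assumed about these genes.
RESTRICTED_POSITIONS = {
    "CDK4": Region("chr12", 58145429, 58145431),   # codon 24
    "MITF": Region("chr3", 70014091, 70014091),    # including c.952G>A
    "POLD1": Region("chr19", 50909713, 50909713),  # including c.1433G>A
    "POLE": Region("chr12", 133250250, 133250250), # including c.1270C>G
}


def in_analyzed_position(gene: str, chrom: str, pos: int) -> Optional[bool]:
    """Return True/False for position-restricted genes, or None when this
    sketch holds no coordinate restriction for the gene."""
    region = RESTRICTED_POSITIONS.get(gene)
    if region is None:
        return None
    return chrom == region.chrom and region.start <= pos <= region.end


if __name__ == "__main__":
    print(in_analyzed_position("CDK4", "chr12", 58145430))
    print(in_analyzed_position("CDK4", "chr12", 58145432))
```

Returning `None` for genes outside the table keeps "not restricted by position in this sketch" distinct from "outside the analyzed position", which a plain boolean would conflate.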

Architecture and implementation

Color Data is static HTML and CSS with Metabase embedded in