Shilpa Nadimpalli Kobren, Dustin Baldridge, Matt Velinder, Joel B Krier, Kimberly LeBlanc, Cecilia Esteves, Barbara N Pusey, Stephan Züchner, Elizabeth Blue, Hane Lee, Alden Huang, Lisa Bastarache, Anna Bican, Joy Cogan, Shruti Marwaha, Anna Alkelai, David R Murdock, Pengfei Liu, Daniel J Wegner, Alexander J Paul, Shamil R Sunyaev, Isaac S Kohane.
Abstract
PURPOSE: Genomic sequencing has become an increasingly powerful tool for discovering the genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard-practice tools and to highlight exploratory analyses where technical and theoretical method improvements would be most impactful.
Year: 2021 PMID: 33580225 PMCID: PMC8187147 DOI: 10.1038/s41436-020-01084-8
Source DB: PubMed Journal: Genet Med ISSN: 1098-3600 Impact factor: 8.822
Fig. 1 Representative clinical workflow to uncover disease-causing genetic variants in undiagnosed patients.
Upon acceptance to the Undiagnosed Diseases Network (UDN), (a) an affected patient has an in-person clinical evaluation where extensive phenotyping and additional tests are performed as needed. (b) Before or during the clinical evaluation, samples of relevant affected and unaffected individuals in a family are sent for genomic sequencing. (c,d) Sequencing data provided by the sequencing center are analyzed in conjunction with other information in a back-and-forth process between bioinformaticians, clinicians, and genetic counselors to highlight variants that are likely to explain the patient’s disease. (e) Matches to the strong candidate explanatory variants identified in (c) are searched for in databases containing human genetic variant and corresponding symptom information (e.g., Matchmaker Exchange) or in databases containing animal genetic variants and corresponding phenotype information (e.g., MARRVEL). Strong candidate variants are also introduced into model organisms or cell lines where possible to assess in vivo phenotypic impact. (f) Once a candidate variant has been confirmed as disease causal, a molecular diagnosis is provided that can subsequently be used to tailor clinical management and molecular therapeutics. (g–j) Recurring steps in computational workflows to process genomic sequencing data to call, filter, and prioritize genetic variants that explain the affected individual’s disease symptoms.
Structural variant (SV) callers in use at clinical sites.
| | BaylorSeq | BCM | Duke/Columbia | Harvard | Miami | NIH | PacificNW | Stanford | UCLA | Utah | Vanderbilt | WUSTL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Manta<sup>a</sup> | ■ | ■ | □ | □ | □ | □ | □ | □ | ■ | ■ | □ | ■ |
| ExpansionHunter | ■ | ■ | ■ | ■ | ■ | |||||||
| GATK<sup>b</sup> | ■ | □ | □ | ■ | ||||||||
| LUMPY | □ | ■ | □ | □ | ||||||||
| CNVnator | □ | ■ | ■ | |||||||||
| RUFUS | ■ | ■ | ||||||||||
| CNVkit | ■ | ■ | ||||||||||
| BreakDancer | □ | ■ | ||||||||||
| Illumina DRAGEN depth-based CNV caller | ■ | |||||||||||
| SvABA: SV/indel Analysis by Assembly | ■ | |||||||||||
| CoNIFER<sup>c</sup> | ■ | |||||||||||
| ERDS: estimation by reads depth w/ SNVs | ■ | |||||||||||
| BreakSeq2 | □ | |||||||||||
| DELLY2 | □ | |||||||||||
| smoove | ■ | ■ | ■ | |||||||||
| SVTyper | □ | □ | □ | |||||||||
| AnnotSV | ■ | ■ | ■ | ■ | ■ | ■ | ||||||
| gnomAD-SV | ■ | ■ | ||||||||||
| duphold | □ | □ | ||||||||||
| XHMM | ■ | ■ | ■ | |||||||||
| SURVIVOR | □ | ■ | ||||||||||
| Parliament2 | ■ | |||||||||||
■ Tool called directly. □ Tool called indirectly (e.g., by a wrapper).
Each SV calling tool identifies subsets of SVs by type or other factors, and so in practice, the output of multiple methods must typically be combined and considered together. Wrapper tools that automatically call and combine results from multiple other SV detection methods improve the efficiency of this process. Duke/Columbia, NIH, Stanford, and Vanderbilt only use SV calling tools in specific cases or contexts rather than as part of their regular pipelines. Tool citations are listed in Extended Data Table 1.
CNV, copy-number variant; SNV, single-nucleotide variant.
<sup>a</sup> Manta is used by BaylorSeq to generate putative SV calls, which are then shared with the clinical sites.
<sup>b</sup> The two GATK functions used are GermlineCNVCaller and DepthOfCoverage (DoC); the latter is used to detect exonic deletions or duplications.
<sup>c</sup> In contrast to other tools, CoNIFER runs on exome sequencing (ES) data rather than genome sequencing (GS) data.
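The table note states that each caller reports only a subset of SVs, so outputs from several callers are typically combined. The following Python sketch illustrates one simple way to do that: calls of the same type on the same chromosome are grouped when both breakpoints lie close together. The `SVCall` and `MergedSV` classes, the `merge_sv_calls` function, and the 500 bp default tolerance are all hypothetical choices for illustration; this is not the algorithm of any tool listed in the table.

```python
from dataclasses import dataclass, field


@dataclass
class SVCall:
    chrom: str
    start: int
    end: int
    svtype: str   # e.g. "DEL", "DUP", "INV"
    caller: str   # name of the tool that produced the call


@dataclass
class MergedSV:
    chrom: str
    start: int
    end: int
    svtype: str
    callers: set = field(default_factory=set)


def merge_sv_calls(calls, max_breakpoint_dist=500):
    """Greedily group calls that share chromosome and SV type and whose start
    and end breakpoints both lie within max_breakpoint_dist (an assumed
    tolerance) of the first call placed in the group."""
    merged = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start, c.end)):
        for group in merged:
            if (group.chrom == call.chrom
                    and group.svtype == call.svtype
                    and abs(group.start - call.start) <= max_breakpoint_dist
                    and abs(group.end - call.end) <= max_breakpoint_dist):
                group.callers.add(call.caller)
                break
        else:
            # No compatible group found: this call starts a new one.
            merged.append(
                MergedSV(call.chrom, call.start, call.end, call.svtype, {call.caller})
            )
    return merged
```

A merged record supported by more than one caller could then be ranked above single-caller records, although how much weight to give caller agreement is a site-specific decision.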
Quality control (QC) checks of variants for rare disease diagnosis.
QC checks of variant data fall into three main categories, listed in bold above. Although some tools can be used for many of these steps, we illustrate here which QC steps they are actually used for in practice. Note the clarifications for some of the QC tools and steps listed in footnotes a–e. Tool citations are listed in Extended Data Table 1.
ES, exome sequencing; GS, genome sequencing; SNV, single-nucleotide variant.
<sup>a</sup> BCFtools refers to the Wellcome Trust Sanger Institute’s suite of tools: BCFtools, VCFtools, SAMtools, and HTSlib.
<sup>b</sup> These tools either call de novo variants from sequencing reads to reduce false positive calls or provide de novo frequencies where a high frequency indicates a likely false positive.
<sup>c</sup> The expected transition (Ts) to transversion (Tv) ratios assume variants are called with respect to the human reference sequence; if variants are called with respect to computed ancestral alleles, the expected Ts/Tv ratio for ES should be ~1.
<sup>d</sup> Expected relatedness between family members is estimated using a “kinship coefficient”; unexpectedly low kinship implies a family member is not as related as was originally assumed, unexpectedly high kinship suggests consanguinity, and maximal kinship implies an accidental sample duplication.
<sup>e</sup> Mosaicism (where an individual contains a mix of genetically distinct cells) may be relevant for disease rather than only indicative of sequencing errors.
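Two of the checks in the footnotes above, the Ts/Tv ratio and the kinship comparison, can be illustrated with a short Python sketch. The function names are hypothetical. The kinship cutoffs are those given in the KING relationship-inference documentation; whether the surveyed sites apply these exact cutoffs is an assumption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def ts_tv_ratio(snvs):
    """snvs: iterable of (ref, alt) allele pairs. Non-SNV records are skipped."""
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        # Transition: purine<->purine or pyrimidine<->pyrimidine.
        if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; Ts/Tv is undefined")
    return ts / tv


# Lower bounds on the kinship coefficient per relationship class
# (KING-style cutoffs; assumed here, not taken from the source).
KINSHIP_CUTOFFS = [
    (0.354, "duplicate_or_mz_twin"),
    (0.177, "first_degree"),
    (0.0884, "second_degree"),
    (0.0442, "third_degree"),
]
RANK = ["unrelated", "third_degree", "second_degree",
        "first_degree", "duplicate_or_mz_twin"]


def classify_kinship(phi):
    for cutoff, label in KINSHIP_CUTOFFS:
        if phi > cutoff:
            return label
    return "unrelated"


def check_relatedness(phi, expected):
    """Compare the observed relationship class with the pedigree's expected one."""
    observed = classify_kinship(phi)
    if observed == expected:
        return "as expected"
    if observed == "duplicate_or_mz_twin":
        return "possible sample duplication"
    if RANK.index(observed) < RANK.index(expected):
        return "less related than expected"
    return "more related than expected (possible consanguinity)"
```

For example, `check_relatedness(0.01, "first_degree")` flags a reported parent or sibling whose sample looks unrelated, which would prompt a review of the pedigree or of sample handling.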
Human genetic variation data sets and derived tools.
| | BaylorSeq | BCM | Duke/Columbia | Harvard | Miami | NIH | PacificNW | Stanford | UCLA | Utah | Vanderbilt | WUSTL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ClinVar | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● |
| OMIM | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● |
| HGMD: Human Gene Mutation Database | ● | ● | ● | ● | ● | ● | ● | ● | ||||
| dbSNP | ● | ● | ● | ● | ● | |||||||
| CGD: Clinical Genomic Database | ● | ● | ||||||||||
| Orphanet | ● | ● | ||||||||||
| gnomAD: Genome Aggregation Database | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● |
| ExAC: Exome Aggregation Consortium | ● | ● | ● | ● | ● | ● | ● | ● | ⚬ | ● | ||
| 1000 Genomes Project | ● | ● | ● | ● | ● | ● | ● | ● | ||||
| Institution: internal controls<sup>a</sup> | ● | ● | ● | ● | ● | ● | ● | ● | ||||
| EVS: Exome Variant Server | ● | ● | ● | ● | ● | |||||||
| TOPMed: Trans-Omics for Precision Medicine | ⚬ | ● | ● | ⚬ | ⚬ | |||||||
| UK10K | ● | ● | ● | |||||||||
| Greater Middle East (GME) Variome Project | ⚬ | ⚬ | ||||||||||
| xKJPN: 1000+ Japanese | ⚬ | |||||||||||
| GenomeAsia 100K Project | ⚬ | |||||||||||
| Iranome | ⚬ | |||||||||||
| gnomAD-SV: Genome Aggregation Database SVs | ● | ⚬ | ● | ● | ● | ● | ● | ● | ● | |||
| DGV: Database of Genomic Variants | ● | ⚬ | ● | ● | ● | ● | ● | ● | ||||
| dbVar: Database of Genomic Structural Variation | ● | ● | ● | ● | ||||||||
| ClinGen: Clinical Genome Resource | ● | ⚬ | ● | ● | ● | |||||||
| DECIPHER | ● | ⚬ | ● | ● | ||||||||
| Institution: internal controls<sup>a</sup> | ● | ● | ● | |||||||||
| pLI: probability of loss-of-function (LoF) intolerance | ● | ● | ● | ● | ● | ⚬ | ● | ● | ● | ● | ||
| Missense (constraint) | ● | ● | ● | ● | ● | ● | ● | |||||
| pREC: probability of homozygote LoF intolerance | ● | ⚬ | ● | |||||||||
| (sub)RVIS: Residual Variation Intolerance Score | ● | ● | ||||||||||
| L-o/e-UF: LoF observed/expected upper-bound fraction | ● | ● | ||||||||||
| CCR: constrained coding regions | ● | ● | ||||||||||
| LIMBR: Localized Intolerance Model w/ Bayesian Regression | ● | |||||||||||
| MTR: missense tolerance ratio | ● | |||||||||||
| s_het: selective effect of heterozygous LoF | ● | |||||||||||
| M-o/e-UF: missense observed/expected upper-bound fraction | ● | |||||||||||
| LoFtool | ● | |||||||||||
● Tool used by default. ⚬ Tool used in specific cases or contexts only.<sup>b</sup>
Knowledge of variation within human populations with and without disease can be effectively used to assess the likelihood of a variant to cause the genetic condition under investigation. Tool and data set citations are listed in Extended Data Table 1.
<sup>a</sup> Human sequence variation data sets that are internal to particular institutions and used by clinical sites surveyed here include variants present in patients from Baylor College of Medicine (BCM), the Institute for Genomic Medicine (Duke/Columbia), Brigham Genomic Medicine (Harvard), the NIH Undiagnosed Diseases Program (NIH), Centers for Mendelian Genomics (PacificNW), University of California–Los Angeles (UCLA), the Centre d’Etude du Polymorphisme Humain (Utah), and BioVu (Vanderbilt), and a curated set of copy-number variants (CNVs) detected via genome sequencing (GS) and confirmed via chromosomal microarray analysis (Washington University School of Medicine [WUSTL]).
<sup>b</sup> The contexts in which specific human population variant data sets are used include historical reasons (ExAC), when a variant’s gnomAD-derived MAF is 0 or close to 0 (TOPMed), when patients’ inferred ancestry is non-European (TOPMed), Middle Eastern (GME), Japanese (xKJPN), Asian (GenomeAsia), and/or Iranian (Iranome), and when a predicted structural variant impacts a clinically relevant gene (gnomAD-SV, DGV, ClinGen, DECIPHER).
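The context rules in footnote b can be read as a small decision procedure: start from gnomAD, then add further population data sets depending on the gnomAD allele frequency and the patient's inferred ancestry. The Python sketch below encodes that reading. The function names, the ancestry labels, the `near_zero` cutoff, and the requirement that the caller supply `max_af` are all assumptions for illustration; the source gives no numeric thresholds.

```python
# Ancestry-specific data sets named in footnote b (labels are assumed spellings).
ANCESTRY_SPECIFIC = {
    "Middle Eastern": "GME",
    "Japanese": "xKJPN",
    "Asian": "GenomeAsia",
    "Iranian": "Iranome",
}


def select_frequency_sources(gnomad_af, ancestries, near_zero=1e-5):
    """Return the population data sets to consult for one variant.

    gnomad_af:  gnomAD-derived minor allele frequency, or None if absent.
    ancestries: set of inferred ancestry labels for the patient.
    near_zero:  hypothetical cutoff for "0 or close to 0".
    """
    sources = ["gnomAD"]
    non_european = any(label != "European" for label in ancestries)
    if gnomad_af is None or gnomad_af <= near_zero or non_european:
        sources.append("TOPMed")
    for ancestry, dataset in ANCESTRY_SPECIFIC.items():
        if ancestry in ancestries:
            sources.append(dataset)
    return sources


def passes_rarity_filter(allele_freqs, sources, max_af):
    """Keep a variant only if its highest frequency across the selected
    sources does not exceed max_af (a threshold supplied by the caller)."""
    observed = [allele_freqs[s] for s in sources if allele_freqs.get(s) is not None]
    return max(observed, default=0.0) <= max_af
```

Taking the maximum frequency across sources is a conservative design choice: a variant that is common in any consulted population is unlikely to explain a rare Mendelian condition.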
Tools for assigning the pathogenic likelihood or functional impact of variants.
| | BaylorSeq | BCM | Duke/Columbia | Harvard | Miami | NIH | PacificNW | Stanford | UCLA | Utah | Vanderbilt | WUSTL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GERP++: Genomic Evolutionary Rate Profiling | ● | ● | ● | ● | ● | |||||||
| PhastCons | ● | ● | ● | |||||||||
| PolyPhen-2 | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ||
| SIFT | ● | ● | ● | ● | ● | ● | ● | ● | ● | |||
| MutationTaster | ● | ● | ● | ● | ||||||||
| MVP: missense variant pathogenicity | ● | |||||||||||
| ReMM: regulatory Mendelian mutation | ● | |||||||||||
| CADD: Combined Annotation Dependent Depletion | ● | ● | ● | ● | ● | ● | ● | ● | ||||
| REVEL: Rare Exome Variant Ensemble Learner | ● | ● | ● | ● | ● | ● | ● | ● | ||||
| DANN: Deep Neural Net version of CADD | ● | ● | ||||||||||
| M-CAP: Mendelian Clinically Applicable Pathogenicity | ● | ● | ||||||||||
| DOMINO: Dominant Disorder Associated Genes<sup>a</sup> | ● | |||||||||||
| Eigen | ● | |||||||||||
| SpliceAI | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ||
| GTEx: Genotype-Tissue Expression | ● | ● | ● | ● | ||||||||
| SpliceRegion annotations from VEP | ● | ● | ● | ● | ||||||||
| dbscSNV (splicing consensus SNVs) | ● | ● | ● | |||||||||
| Human Splicing Factor | ● | ● | ||||||||||
| MMSplice: Modular modeling of splicing | ● | ● | ||||||||||
| MaxEntScan | ● | ● | ||||||||||
| TraP: Transcript-inferred Pathogenicity | ● | |||||||||||
Variants of uncertain significance (i.e., that are not already known to be associated with disease) can be evaluated for functional or pathogenic impact using predictive models. Tool citations are listed in Extended Data Table 1.
<sup>a</sup> Unlike other tools, DOMINO provides scores per gene rather than per variant.
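Because most predictors in the table score individual variants while DOMINO scores genes, annotating a variant list means joining on two different keys. A minimal Python sketch of that join follows; the `annotate_variants` function, the dictionary layout, and the field names are hypothetical, and no score values or thresholds are implied.

```python
def annotate_variants(variants, variant_scores, gene_scores):
    """Attach predictor scores to each variant record.

    variants:       list of dicts, each with at least "id" and "gene".
    variant_scores: {variant_id: {predictor_name: score}}  (per-variant tools)
    gene_scores:    {gene_symbol: {predictor_name: score}} (per-gene tools)
    """
    annotated = []
    for variant in variants:
        row = dict(variant)  # copy so the input records are not modified
        row.update(variant_scores.get(variant["id"], {}))
        # Per-gene scores are shared by every variant in the same gene.
        row.update(gene_scores.get(variant["gene"], {}))
        annotated.append(row)
    return annotated
```

Variants in the same gene receive identical per-gene values, so a per-gene score can help rank genes but cannot distinguish between candidate variants within one gene.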