David L Masica, Rachel Karchin.
Year: 2016 PMID: 27171182 PMCID: PMC4865359 DOI: 10.1371/journal.pcbi.1004725
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Five years of independent testing of cSNV variant classifiers.
| Thusberg et al. 2011 | Olatubosun et al. 2012 | Shihab et al. 2013 | Rapakoulia et al. 2014 | Dong et al. 2015 | |
|---|---|---|---|---|---|
| + 902 PMD | + ~23,000 SwissVar | + 8,871 SwissProt | + 120 Curated by author | | |
| + 3,594 LSDBs | Subset reliable at 0.95 | + HumVar 22,196 | + 6,279 VariBench II | | |
| 68%, 62% (65%) | 83%, 41% (62%) | 67%, 82% (74%) | 38%, 81% (69%) | 68%, 75% (71%) | |
| 73%, 70% (71%) | 84%, 40% (62%) | 86%, 61% (73%) | 85%, 69% (73%) | 92%, 59% (78%) | |
| 77%, 76% (76%) | NA | 62%, 75% (68%) | 31%, 90% (74%) | 38%, 94% (66%) | |
| 63%, 79% (71%) | 67%, 61% (64%) | 66%, 74% (70%) | NA | 70%, 83% (77%) | |
| 88%, 56% (72%) | 73%, 54% (63%) | NA | NA | 53%, 70% (61%) | |
| 71%, 92% (82%) | NA | 76%, 89% (82%) | NA | 55%, 94% (75%) | |
| 85%, 78% (81%) | NA | 90%, 90% (90%) | NA | 74%, 81% (77%) | |
| NA | NA | NA | 73%, 88% (84%) | 70%, 80% (74%) | |
| 61%, 58% (60%) | NA | NA | NA | NA | |
| NA | NA | NA | NA | 94%, 74% (86%) | |
| NA | NA | NA | NA | 55%, 91% (75%) | |
| NA | NA | NA | NA | 71%, 73% (72%) | |
| NA | NA | NA | NA | 79%, 74% (77%) |
Selected studies, one from each of the past five years, comparing multiple bioinformatic variant classifiers on large datasets of putatively pathogenic and benign variants. To avoid any potential bias, tests of an author’s own method, if present, were excluded. For each entry, results are given as sensitivity, specificity (accuracy). Variant datasets used as the pathogenic or disease-causing class are indicated by a plus sign (+) and those used as the neutral or benign class by a minus sign (-). Entries with multiple lines indicate that two unique datasets were used for benchmarking. Both datasets used in Olatubosun et al. 2012 were the same, except that variants in the second set were filtered to meet a 95% prediction confidence using their own PON-P method. PMD is the Protein Mutant Database; NA indicates that the method was not tested in the corresponding benchmark.
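As a reminder of how the three figures in each table entry relate, the following minimal sketch computes sensitivity, specificity, and accuracy from confusion-matrix counts. The counts are hypothetical, chosen only so that the output has the same form as a table entry; they are not taken from any of the cited benchmarks, and the function name `classifier_metrics` is an illustrative assumption.

```python
def classifier_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts.

    tp/fn: pathogenic variants called pathogenic / benign.
    tn/fp: benign variants called benign / pathogenic.
    """
    sensitivity = tp / (tp + fn)                 # fraction of pathogenic variants detected
    specificity = tn / (tn + fp)                 # fraction of benign variants detected
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # fraction of all variants classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical counts for a balanced test set of 100 pathogenic and 100 benign variants.
sens, spec, acc = classifier_metrics(tp=68, fn=32, tn=62, fp=38)
print(f"{sens:.0%}, {spec:.0%} ({acc:.0%})")
```

Accuracy depends on the ratio of pathogenic to benign variants in the test set, so two benchmarks with similar sensitivity and specificity can still report different accuracies.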
Fig 1. How context can influence the impact and inferred impact of LDLR variation on different experimental and clinical parameters.
LDLR variation is likely to have the largest observable and reproducible impact on parameters most directly influenced by the protein (A). For instance, if a variant produces effects downstream from the protein, then protein structure and function are likely perturbed. Further downstream, cellular LDL uptake could be modulated, which can increase risk for familial hypercholesterolemia, which might appreciably increase risk for heart disease or heart attack. Liability for complex cardiovascular diseases is influenced by many endogenous and exogenous factors other than LDLR mutation (B). Furthermore, diagnoses can result from a varied combination of clinical and laboratory diagnostics, which can result in differential or conflicting diagnoses (B). In C, cellular studies, pedigree studies, a disease mutation database, and popular bioinformatics methods are used to classify LDLR variants as disease causing or benign. On the heatmap, black and white indicate a classification of disease causing and benign, respectively, for different classification methods (gray indicates an intermediate or unclear classification). Mean patient LDL-c concentration from pedigree studies (purple) and cellular LDL uptake (red) are shown with darker colors indicating more severe impact (numbers indicate published values).
Fig 2. Some advantages of considering endophenotypes, relative to phenotypes, illustrated using three CFTR variants.
Mean sweat chloride from individuals harboring the three variants (S1235R, D614G, and G551D), and results from two distinct in vivo experiments performed in cells expressing the variants. Increasing sweat chloride is associated with increasing disease severity, whereas in the two in vivo assays decreasing values correspond to decreasing protein function or abundance. Endophenotypes were scaled for purposes of presenting on a single chart, such that three sweat chlorides could be compared with one another, the three chloride conductance measurements could be compared with one another, etc.
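The caption does not state how the endophenotypes were scaled. One simple possibility, shown here purely as an assumption, is to divide each endophenotype's values by the largest value observed for that endophenotype, so that variants remain comparable within a measurement type while all three types fit on one axis. The numbers below are hypothetical placeholders, not the published measurements.

```python
def scale_to_max(values):
    """Scale one endophenotype's values so the largest becomes 1.0 (assumed scheme)."""
    peak = max(values)
    return [v / peak for v in values]


# Hypothetical values for three variants, one list per endophenotype.
endophenotypes = {
    "sweat chloride": [30.0, 60.0, 120.0],
    "chloride conductance": [90.0, 45.0, 5.0],
}

# Each endophenotype is scaled independently, so values are only
# comparable within a row, never across rows.
for name, values in endophenotypes.items():
    print(name, [round(v, 2) for v in scale_to_max(values)])
```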
Fig 3. Hypothetical visualization of a multidimensional endophenotypic landscape for cystic fibrosis.
Each cSNV can be represented as a point in a three-dimensional space of three endophenotypic scores relevant to cystic fibrosis disease: post-translational processing (glycosylation) and trafficking of the CFTR protein to the epithelial cell plasma membrane, an in vivo cellular assay of chloride conductance that measures channel gating, and chloride concentration in a diagnostic sweat test. Each point on the landscape can be interpreted with respect to disease severity, shown in the color bar to the right of the landscape.
Fig 4. Correlation of ePOSE score with three individual endophenotypes.
Measured endophenotype versus predicted impact (ePOSE Score) for 20 CFTR variants using classifiers trained with (A) sweat chloride, (B) chloride conductance, or (C) fraction of correctly processed CFTR protein. Each plot is the result of 20 leave-one-out cross-validation calculations (i.e., one data point for each of the 20 variants). Blue circles, green squares, and red diamonds denote benign, indeterminate, and disease-causing annotated phenotype, respectively, for each of the 20 variants. Note: increasing sweat chloride is associated with increasing disease severity, whereas for the two in vivo assays, decreasing values correspond to decreasing protein function or processing.
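The leave-one-out procedure described above can be sketched as follows. This is not the ePOSE method: a one-feature least-squares line stands in for the actual classifier, and the feature values and measurements are hypothetical. The sketch only illustrates the evaluation loop, in which each variant is predicted by a model trained on all the other variants.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_predictions(xs, ys):
    """Predict each sample with a model trained on all remaining samples."""
    predictions = []
    for i in range(len(xs)):
        # Hold out sample i and train on everything else.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[i] + intercept)
    return predictions


# Hypothetical per-variant feature and measured endophenotype values.
features = [0.1, 0.3, 0.4, 0.7, 0.9]
measured = [25.0, 40.0, 55.0, 80.0, 105.0]

# One held-out prediction per variant, analogous to one point per variant in each panel.
for x, y, p in zip(features, measured, leave_one_out_predictions(features, measured)):
    print(f"feature={x:.1f} measured={y:.1f} predicted={p:.1f}")
```

Because each prediction comes from a model that never saw the held-out variant, plotting measured against predicted values gives an estimate of how the model would perform on unseen variants.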
Fig 5. Interpolation plot of predicted endophenotypes resulting from the separate leave-one-out cross-validation calculations shown in Fig 4.
ePOSE score for the 20 CFTR variants from Fig 4 plotted and interpolated (color shows ePOSE scores resulting from training with sweat chloride data). Using the resulting classifiers, each endophenotype was predicted for three additional variants (G551S, A561E, and G1349D) and subsequently validated. A561E was accurately predicted to affect disease via drastically reduced CFTR processing and channel gating. G551S was accurately predicted to affect cystic fibrosis primarily via channel gating.
Six disease-associated genes with sources of variant-specific endophenotypic data.
| Protein or gene (number of amino acids) | Endophenotype (number of unique mutations) | Disease phenotype(s) | References and database URLs |
|---|---|---|---|
| Sweat [Cl-] (>250) | Cystic fibrosis | [ | |
| FEV1% (>250) | |||
| P.a. Infection (>250) | |||
| Pancreas function (>250) | |||
| C/ (C + B) (59) | |||
| [Cl-] Conductance (59) | |||
| Li-Fraumeni and nonhereditary cancers | [ | ||
| Catalytic activity % wild type (80) | Phenylketonuria | [ | |
| PAH protein % wild type (80) | |||
| Serum Phe and BH4 response (>35) | |||
| Cellular LDL uptake (79) | Cardiovascular disease | [ | |
| Expression (79) | Hypercholesterolemia | ucl.ac.uk/ldlr | |
| Plasma LDL-c (~79) | |||
| Homologous recombination (140) | Hereditary breast cancer | [ | |
| Cell growth (204) | Hyperhomocysteinemia | [ |
Both locus-specific databases (LSDBs) and published manuscripts contain data that enable development of new in silico bioinformatics methods to predict variant impact on endophenotypes. For each gene, the available endophenotypes, the number of unique mutations with endophenotypic values, the disease phenotype(s), and links or references to data sources are provided.