Ruchir Rastogi, Peter D Stenson, David N Cooper, Gill Bejerano.
Abstract
Stopgain substitutions are the third-largest class of monogenic human disease mutations and are often examined first in patient exomes. Existing computational stopgain pathogenicity predictors, however, exhibit poor performance at the high sensitivity required for clinical use. Here, we introduce a new classifier, termed X-CAP, which uses a novel training methodology and unique feature set to improve the AUROC by 18% and decrease the false-positive rate 4-fold on large variant databases. In patient exomes, X-CAP prioritizes causal stopgains better than existing methods do, further illustrating its clinical utility. X-CAP is available at https://github.com/bejerano-lab/X-CAP.
Keywords: Machine learning; Nonsense; Pathogenicity prediction; Stopgain
Year: 2022 PMID: 35906703 PMCID: PMC9338606 DOI: 10.1186/s13073-022-01078-y
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 15.266
Fig. 1 Stopgains are a sizable variant class. a The number of variants of each mutation type as a proportion of all DM (disease-causing) variants in HGMD 2020.1. Single base-pair stopgains are the third-largest class, trailing only missense variants and frameshift indels. b The prevalence of stopgains from Phase 3 of the 1000 Genomes Project (N=2504) as a function of their allele frequencies within the same dataset. The average individual in the dataset harbors 12.5 stopgains with an allele frequency of less than 1%.
X-CAP features. Italicized features are novel and have not been used in previous stopgain pathogenicity predictors. Specifically, no features related to zygosity, stop codon read-through, or alternative translation reinitiation are present in earlier classifiers.
| Feature type | Feature name | Description |
|---|---|---|
| Zygosity | | Binary variable distinguishing homozygous (and hemizygous) variants from heterozygous variants, input when known or predicted as a function of benign stopgain alleles at the same position in the training set when unknown |
| Gene/exon essentiality | | Number of benign stopgains in the training set along the gene divided by gnomAD’s expected number of loss-of-function variants |
| | RVIS | Measure of gene intolerance to functional variation |
| | OMIM gene map | Two non-exclusive, binary features indicating whether a recessive or dominant disease listed in the OMIM Gene Map is caused by mutations in this gene |
| | | Transcript or exon contains no benign variants and at least one pathogenic variant within the training set |
| | | Variant is skipped in at least one isoform of the gene |
| Variant location | distance from CDS start/end | Number of coding nucleotides from CDS start and end |
| | relative CDS location | Distance from CDS start divided by CDS length |
| | | Number of coding nucleotides from exon start and end |
| | | Distance from exon start divided by exon length |
| | | Number of nucleotides in overlapped exon |
| | | Index of the exon that the variant overlaps |
| | | Number of exons in overlapped transcript |
| | chromosome | Ternary variable indicating whether the variant is located on an autosomal, X, or Y chromosome |
| NMD | distance from last exon-exon junction | Number of coding nucleotides upstream from last exon-exon junction (negative if downstream of junction) |
| | | Percentage of overlapped transcripts in which the variant is >50 bp upstream of the last exon-exon junction |
| Stop codon read-through | | One-hot encoding of the new stop codon introduced by the stopgain |
| Alternative translation reinitiation | | Number of base pairs between the variant and the next potential downstream start codon within the mRNA |
| Cross-species conservation | phyloP | Base-pair conservation across vertebrates of upstream, downstream, and overlapped exon regions |
| | phastCons | Regional conservation across vertebrates of upstream, downstream, and overlapped exon regions |
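Several of the location, NMD, and read-through features in the table reduce to simple arithmetic on transcript coordinates. The following is a minimal sketch of that arithmetic, not the X-CAP implementation: the `StopgainContext` class, its field names, and the output keys are hypothetical, and coordinates are assumed to be 0-based coding positions.

```python
from dataclasses import dataclass

STOP_CODONS = ("TAA", "TAG", "TGA")


@dataclass
class StopgainContext:
    """Hypothetical per-transcript description of one stopgain variant."""
    cds_pos: int            # coding position of the variant (0-based, assumed)
    cds_length: int         # total number of coding nucleotides
    last_junction_pos: int  # coding position of the last exon-exon junction
    new_stop_codon: str     # stop codon created by the variant


def location_features(v: StopgainContext) -> dict:
    # Positive when the variant is upstream of the last junction,
    # negative when it is downstream, as described in the table.
    dist_to_junction = v.last_junction_pos - v.cds_pos
    features = {
        "dist_from_cds_start": v.cds_pos,
        "dist_from_cds_end": v.cds_length - v.cds_pos,
        "relative_cds_location": v.cds_pos / v.cds_length,
        "dist_from_last_junction": dist_to_junction,
    }
    # One-hot encoding of the newly introduced stop codon.
    for codon in STOP_CODONS:
        features[f"stop_{codon}"] = int(v.new_stop_codon == codon)
    return features


def pct_transcripts_upstream(contexts, min_upstream=50) -> float:
    """Percentage of transcripts where the variant lies more than
    `min_upstream` coding nucleotides upstream of the last junction."""
    hits = sum(c.last_junction_pos - c.cds_pos > min_upstream for c in contexts)
    return 100.0 * hits / len(contexts)
```

Each `StopgainContext` here describes the variant in a single transcript, so the per-transcript percentage is computed over a list of contexts for the same variant.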
Fig. 2 X-CAP features show predictive power. Comparison of feature values for benign and pathogenic stopgains in the training set of . a The Residual Variation Intolerance Score (RVIS) decile of genes, weighted by the number of variants they contain. Genes without RVIS values were excluded. Pathogenic variants are more prevalent in low RVIS genes, namely those generally intolerant to variation. b Kernel Density Estimation (KDE) plot of the relative variant location, defined as the distance in the coding domain sequence (CDS) from the translation start site divided by the total CDS length. On average, benign stopgains are located later in transcripts than pathogenic stopgains. c KDE plot of the number of exons in the mutated gene. The maximum number of exons is clipped to 100 for clarity. Genes containing benign stopgains tend to have fewer exons than genes containing pathogenic stopgains. d Odds ratios (pathogenic/benign) comparing variants that introduce a given stop codon to those that do not. The TGA stop codon, molecularly shown to be the most amenable to read-through of the three [36], is depleted in pathogenic variants. e Odds ratios comparing 5’ proximal stopgains (those within the first 100 bp of the sequence) that have a potential alternative downstream start codon a given distance away against those that do not. Pathogenic variants tend to be located further from the next downstream start codon than benign variants. f KDE plot of the mean phyloP of the downstream region, the portion of the CDS truncated by the stopgain. Regions downstream of pathogenic variants are more conserved than regions downstream of benign variants. In b, c, and f, Scott’s Rule [52] was used to calculate the bandwidth of the Gaussian kernel. In d and e, error bars denote 95% confidence intervals for the odds ratio.
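The odds ratios in panels d and e compare the odds of pathogenicity among variants that have a property with the odds among variants that lack it. The caption does not state how the 95% confidence intervals were derived; the sketch below assumes the standard log-odds (Woolf) approximation, and the function and argument names are illustrative only.

```python
import math


def odds_ratio_ci(path_with, path_without, benign_with, benign_without, z=1.96):
    """Odds ratio (pathogenic/benign) for a binary property, with an
    approximate confidence interval from the log-odds standard error.

    Assumes all four counts are positive; z=1.96 gives a 95% interval.
    """
    odds_ratio = (path_with * benign_without) / (path_without * benign_with)
    # Standard error of log(odds ratio) under the Woolf approximation.
    se = math.sqrt(
        1 / path_with + 1 / path_without + 1 / benign_with + 1 / benign_without
    )
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper
```

An interval that lies entirely below 1 indicates depletion of the property among pathogenic variants, which is the pattern the caption reports for the TGA stop codon.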
Fig. 3 X-CAP outperforms competitors. a For each model, we plot the ROC curve and associated AUROC metric on the test set of . X-CAP has the highest AUROC, improving upon the previous state-of-the-art by 0.14 absolute points. The orange and green dotted lines display X-CAP’s performance when trained only on variants present in the databases used by MutPred-LoF and ALoFT, respectively. To ensure a fair comparison, we randomly subsampled these datasets to the size used in the original papers (n indicates the size of the training set). b We enlarge the portion of the plot above the dashed line in panel a to show performance in the clinically relevant, high-sensitivity region (TPR ≥0.95). We also display the hsr-AUROC, which is the normalized area under the curve in the high-sensitivity region. We optimized X-CAP to excel in this region, rather than over the full ROC. At 95% sensitivity, X-CAP correctly classifies 80.0% of benign stopgain variants, over four times more than any other classifier.
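The hsr-AUROC can be illustrated with a short calculation. The caption only says that it is the normalized area under the curve in the region TPR ≥0.95; the sketch below assumes this means the area between the ROC curve and the line TPR = 0.95, divided by the largest possible such area (1 − 0.95), so that a perfect classifier scores 1. The function names are hypothetical, and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """(FPR, TPR) points of the ROC curve; labels are 1 (pathogenic) or 0 (benign)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process tied scores together so they form a single ROC point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def hsr_auroc(labels, scores, min_tpr=0.95):
    """Area between the ROC curve and TPR = min_tpr, normalized to [0, 1]."""
    points = roc_points(labels, scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 <= min_tpr:
            continue  # segment lies entirely below the high-sensitivity region
        if y0 >= min_tpr:
            # Whole segment is inside the region: trapezoid above min_tpr.
            area += (x1 - x0) * ((y0 + y1) / 2 - min_tpr)
        else:
            # Segment crosses min_tpr: keep only the triangle above the line.
            x_cross = x0 + (min_tpr - y0) / (y1 - y0) * (x1 - x0)
            area += (x1 - x_cross) * (y1 - min_tpr) / 2
    return area / (1 - min_tpr)
```

Under this normalization, a random classifier scores close to 0.025 rather than 0.5, which makes differences in the clinically relevant region easier to see than they are in the full AUROC.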
Fig. 4 X-CAP eliminates the most benign stopgain VUS in control exomes. We plot the fraction of rare benign stopgain variants that were assigned scores below the 95%-sensitivity threshold for each classifier. These variants were taken from exomes from a control population (N=480) in an Inflammatory Bowel Disease (IBD) study. The performance of all classifiers on exomes closely matches their performance on aggregated variant sets in Fig. 3b and Additional file 1: Fig. S3b. X-CAP increases the percentage of benign VUS eliminated by 4.4-fold.
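The quantity plotted here depends on a score cutoff chosen so that 95% of known pathogenic variants score at or above it. A minimal sketch of that calculation follows, assuming that higher scores mean more pathogenic and that the threshold is taken from a set of labeled pathogenic scores; the function names are illustrative and not taken from the X-CAP code.

```python
import math


def sensitivity_threshold(pathogenic_scores, sensitivity=0.95):
    """Highest cutoff at which at least `sensitivity` of the pathogenic
    variants score at or above the cutoff."""
    ranked = sorted(pathogenic_scores, reverse=True)
    # Round before ceil to guard against floating-point noise in the product.
    k = math.ceil(round(sensitivity * len(ranked), 9))
    return ranked[k - 1]


def fraction_eliminated(benign_scores, threshold):
    """Fraction of benign variants scoring strictly below the cutoff."""
    return sum(s < threshold for s in benign_scores) / len(benign_scores)
```

For example, the threshold could be derived from pathogenic variants in a labeled test set and then applied to benign variants from control exomes to obtain the fraction eliminated.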
X-CAP prioritizes causal stopgains in patient exomes. Each row in the table describes a single patient, the causative gene and variant, the genotype of the variant, and the percentile-normalized score provided by each classifier. For each method, raw scores were percentile-normalized in comparison to the scores output by the classifier on the test set of . All ten patients carry one rare stopgain and no other rare mutations in the causal gene. Bolded entries have the highest percentile for a given variant. Italicized entries would have been misclassified on the basis of the original authors’ recommendations (CADD, DANN, and Eigen do not provide a decision rule). X-CAP assigns the highest percentile six out of the ten times and mischaracterizes only one variant. No other tool assigns the highest percentile-normalized score more than once, and MutPred-LoF and ALoFT mischaracterize variants five and three times, respectively.
| Patient ID | Gene | HGVS | GT | X-CAP | MutPred-LoF | ALoFT | CADD | DANN | Eigen |
|---|---|---|---|---|---|---|---|---|---|
| DDDP108441 | c.C1366T:p.Q456X | 0/1 | 89.4 | 87.7 | 89.4 | 89.2 | 81.3 | ||
| DDDP108556 | c.C5916A:p.Y1972X | 0/1 | 90.8 | red | 23.7 | 52.0 | 8.9 | ||
| DDDP108105 | c.C1375T:p.R459X | 0/1 | 77.3 | 96.4 | 57.2 | 64.1 | 70.0 | ||
| DDDP109873 | c.C5581T:p.Q1861X | 0/1 | 90.8 | 98.8 | 57.2 | 44.1 | 46.5 | ||
| DDDP111266 | c.C613T:p.R205X | 1/1 | 97.5 | 46.0 | 81.3 | 6.8 | |||
| DDDP107416 | c.C976T:p.Q326X | 0/1 | 78.6 | 17.9 | 45.3 | ||||
| DDDP108492 | c.C691T:p.R231X | 0/1 | 90.7 | 57.2 | 81.3 | 70.7 | |||
| DDDP100091 | c.C3047A:p.S1016X | 0/1 | 93.1 | 66.0 | 64.1 | 11.2 | |||
| DDDP110976 | c.T2579A:p.L860X | 0/1 | 12.3 | 14.4 | 25.2 | ||||
| DDDP110748 | c.C1801T:p.R601X | 0/1 | 83.2 | 10.0 | 21.0 | 15.2 |
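Percentile normalization places scores from classifiers with different output ranges on a common 0 to 100 scale. The sketch below assumes the percentile is the share of reference scores (for example, a classifier's scores on a test set) that are less than or equal to the raw score; the exact tie-handling convention used for the table is not stated, and the function name is hypothetical.

```python
from bisect import bisect_right


def percentile_normalize(raw_score, reference_scores):
    """Percentile (0-100) of raw_score within reference_scores."""
    ranked = sorted(reference_scores)
    # bisect_right counts the reference scores <= raw_score.
    return 100.0 * bisect_right(ranked, raw_score) / len(ranked)
```

Because each classifier is normalized against its own score distribution, the resulting percentiles can be compared across columns even though the raw scores cannot.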