| Literature DB >> 30744685 |
Barthélémy Caron1, Yufei Luo1, Antonio Rausell2,3.
Abstract
State-of-the-art methods assessing pathogenic non-coding variants have mostly been characterized on common disease-associated polymorphisms, yet with modest accuracy and strong positional biases. In this study, we curated 737 high-confidence pathogenic non-coding variants associated with monogenic Mendelian diseases. In addition to interspecies conservation, a comprehensive set of recent and ongoing purifying selection signals in humans is explored, accounting for lineage-specific regulatory elements. Supervised learning using gradient tree boosting on such features achieves a high predictive performance and overcomes positional bias. NCBoost performs consistently across diverse learning and independent testing data sets and outperforms other existing reference methods.Entities:
Keywords: Mendelian diseases; Non-coding genetic variants; Pathogenicity score; Rare variant analysis; Whole genome sequencing
Mesh:
Substances:
Year: 2019 PMID: 30744685 PMCID: PMC6371618 DOI: 10.1186/s13059-019-1634-2
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1Curation of high-confidence pathogenic non-coding SNVs associated with monogenic Mendelian disease genes. a Number of high-confidence pathogenic non-coding SNVs obtained from the Human Gene Mutation Database [37] (HGMD-DM), ClinVar [38], and Smedley’2016 [20], after filtering out SNVs overlapping exonic and splice sites of protein-coding genes and SNVs associated with non-coding RNAs (“Methods” section). Only high-confidence pathogenic non-coding variants associated with the same protein-coding gene by both the original resource and the annotation process done in this work (depicted in orange) were retained for downstream analysis. b Retained variants in a were further classified according the OMIM category of the associated gene, i.e., non-Mendelian disease gene, Mendelian disease gene associated with a disease phenotype differing from the one reported in the original resource (i.e., presenting a conflicting disease description), complex Mendelian disease genes, and monogenic Mendelian disease genes. Only high-confidence pathogenic non-coding SNVs associated with monogenic Mendelian diseases with no homozygous individuals in GnomAD database [35] (depicted in green) were finally retained for downstream analyses. c Distribution of the high-confidence pathogenic non-coding SNVs associated with monogenic Mendelian disease genes according to the type of gene region they overlap: intronic, 5′UTR, 3′UTR, upstream, downstream, and intergenic regions. d Distribution of the high-confidence pathogenic non-coding SNVs associated with monogenic Mendelian disease genes according to the original annotation source, i.e., HGMD-DM, ClinVar, and Smedley’2016. e Corresponding number of monogenic Mendelian disease genes collectively involved by SNVs in d. The number of SNVs in each category is indicated inside the barplots and Venn diagrams together with the number of genes collectively involved (in parenthesis in a–c; totals are reported above each barplot)
Natural selection features associated with non-coding single-nucleotide variants mined in this work. Features are classified under different categories depending on the sequence context (i.e., position level, window level, and gene level) and evolutionary scale: interspecies (vertebrates, mammals, and primates, excluding humans), or recent and ongoing natural selection in humans. We note here that the query variant was excluded from the calculations involving mean allele frequencies and mean heterozygosity of a given region and that the variant allele frequency itself was not used as a feature in any training or pathogenicity prediction throughout the study
| Category | Sequence context | Evolutionary scale | Model | Feature abbreviation used in this work | Definition | References |
|---|---|---|---|---|---|---|
| Position-level and window-level features | Position-level | Interspecies (mammals) | A | GerpS | Base-wise Rejected Substitution (RS) score defined by Genomic Evolutionary Rate Profiling (GERP++ scores) from mammalian alignments, excluding humans | [ |
| Interspecies (mammals) | A | GerpN | Neutral evolution score defined by GERP++, excluding humans | [ | ||
| Recent and ongoing in humans | B | bStatistic | b Statistic: background selection score indicating the expected fraction of neutral diversity that is present at a site, with values close to 0 representing near complete removal of diversity as a result of selection and values near 1 indicating little effect. B-statistic was based on human single-nucleotide polymorphism data from Perlegen Sciences, HapMap phase II, the SeattleSNPs NHLBI Program for Genomic Applications, and the NIEHS Environmental Genome Project | [ | ||
| Interspecies (primates) | A | priPhCons | Primate PhastCons conservation score, excluding humans | [ | ||
| Interspecies (mammals) | A | mamPhCons | Mammalian PhastCons conservation scores, excluding humans | [ | ||
| Interspecies (vertebrates) | A | verPhCons | Vertebrate PhastCons conservation score, excluding humans | [ | ||
| Interspecies (primates) | A | priPhyloP | Primate PhyloP conservation score, excluding humans | [ | ||
| Interspecies (mammals) | A | mamPhyloP | Mammalian PhyloP conservation score, excluding humans | [ | ||
| Interspecies (vertebrates) | A | verPhyloP | Vertebrate PhyloP conservation score, excluding humans | [ | ||
| 1000-bp window | Recent and ongoing in humans | B | meanDaf1000G | Mean derived allele frequency of variants in 1-kb window region calculated from the 1000 Genomes Project (excluding the query variant) | [ | |
| Recent and ongoing in humans | B | meanHet1000G | Mean heterozygosity of 1-kb window region calculated from the 1000 Genomes Project (excluding the query variant) | [ | ||
| Recent and ongoing in humans | B | meanMAF1000G | Mean minor allele frequency of variants in 1-kb flanking region calculated from 1000 Genomes Project (excluding the query variant) | [ | ||
| Recent and ongoing in humans | B | meanMAFGnomAD | Mean minor allele frequency of variants in 1-kb window region calculated from GnomAD genome data (excluding the query variant from the calculation). Mean MAF was assessed for the global population and for population-specific frequencies: Africans and African Americans (AFR), Admixed Americans (AMR), East Asians (EAS), Finnish (FIN), non-Finnish Europeans (NFE), Ashkenazi Jewish (ASJ), and other populations (OTH) | [ | ||
| 30-kb window | Recent and ongoing in humans | B | TajimasD_CHB_pvalue | Tajima’s D | [ | |
| 30-kb window | Recent and ongoing in humans | B | FuLisD_CEU_pvalue | Fu and Li’s F* | [ | |
| 10-bp window | Recent and ongoing in humans | B | CDTS | The Context-Dependent Tolerance Score (CDTS) represents the difference between observed and expected variations in Humans. The expected variation is computed for each nucleotide genome-wide as the probability of variation of each nucleotide depending on its heptanucleotide context. CDTS was computed on 11,257 unrelated individuals. | [ | |
| 75-bp flanking region |
| D | GC | Percent GC in a window of ± 75 bp | [ | |
|
| D | CpG | Percent CpG in a window of ± 75 bp | [ | ||
| Gene-level features | Non-coding region of the closest gene (a) | Recent and ongoing in humans | C | ncRVIS | Non-coding RVIS is a measure of the departure from the genome-wide average number of common variants found in the non-coding sequence of genes with a similar amount of non-coding mutational burden in humans. ncRVIS was computed on an in-house collection of whole genome sequencing of 690 individuals. | [ |
| Interspecies (mammals) | C | ncGERP | Average GERP++ score across a gene’s non-coding sequence | [ | ||
| Coding region of the closest gene | Interspecies (primates) | C | dN/dS | Primate dN/dS ratio, providing a measure of the coding-sequence conservation across primates | [ | |
| Recent and ongoing in humans | C | pLI | Probability of being loss-of-function intolerant (intolerant of heterozygous and homozygous loss-of-function variants), assessed from the ExAC database. | [ | ||
| Recent and ongoing in humans | C | RVIS percentile | Residual Variation Intolerance Score (RVIS) percentile, a measure of the departure from the average number of common functional mutations in genes with a similar amount of mutational burden in humans. RVIS was assessed on sequence data from 6503 whole exome from the NHLBI Exome Sequencing Project (ESP) | [ | ||
| Recent and ongoing in humans | C | GDI | Gene Damage Index, a gene-level metric of the mutational damage that has accumulated in the general population, based on CADD scores and on the 1000 Genomes Project data | [ | ||
| Phylo-genetic gene features |
| C | familyMemberCount | Number of human paralogs of the gene: Family member count (FMC) in OGEE database | [ | |
|
| C | gene_age | The gene age is estimating the origination time of genes from the presence or absence of orthologs in the vertebrate phylogeny. | [ |
bp base pairs, GERP Genomic Evolutionary Rate Profiling, RS Rejected Substitution, N/A not applicable
(a) Non-coding region of the closes gene defined in the original publication of ncRVIS and ncGERP as the collection of 5′UTR, 3′UTR, and an additional non-exonic 250 bp upstream of transcription start site (TSS)
Fig. 2Performance of individual features mined in this work to classify high-confident set of n = 737 pathogenic non-coding SNVs associated with monogenic Mendelian disease genes from a negative set of n = 7370 randomly sampled common SNVs without clinical assertions matched by region. The area under the receiver operating characteristic curve (AUROC; left panel) and the area under the precision-recall curve (AUPRC; right panel) obtained for each feature is represented. Features are gathered according to five categories (A–E; “Methods” section) and ranked within category by decreasing AUROC and AUPRC. AUROC values < 0.5 (anti-classifiers) were transformed in 1-AUROC values for the purpose of this figure and are indicated with an asterisk (*). Accordingly, AUPRC values for anti-classifiers were assessed on the basis of the − 1 product transformation. Of note, population-specific GnomAD MAFs (“Methods” section) are not shown for simplicity. One hot-encoded SNV region features (i.e., “intronic,” “UTR5,” “UTR3,” “upstream,” “downstream,” and “intergenic”) are gathered as a single feature labeled as “region”
Fig. 3Comparative performance of NCBoost models trained upon different sets of features. The figure represents the area under the receiver operating characteristic curve (AUROC; a) and the area under the precision-recall curve (AUPRC; b) obtained for each of the six feature configurations evaluated (feature categories A, B, A+B, A+B+C, A+B+C+D, and A+B+C+D+E) when tested mimicking a tenfold cross-validation on n = 283 high-confidence pathogenic non-coding SNVs and n = 2830 common variants without clinical assertions
Fig. 4Comparative performance of NCBoost against state-of-the-art methods. The figure shows the area under AUROC (a) and the AUPRC (b) obtained for NCBoost (configuration of features ABCD) together with the six state-of-the-art methods (CADD, DeepSEA, Eigen, Eigen-PC, FunSeq2, and ReMM; “Methods” section) when tested on the same set of positive and negative variants described for Fig. 3
AUROC and AUPRC values obtained by NCBoost upon different configurations of the training and independent testing sets. The figures obtained by the six state-of-the-art methods evaluated on the same testing sets are shown together with NCBoost
| Positive set used for NCBoost training | Positive set used for testing | AUROC | AUPRC | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source | # SNVs | Source | # SNVs | CADD | DeepSEA | Eigen | Eigen PC | FunSeq2 | ReMM | NCBoost | CADD | DeepSEA | Eigen | Eigen PC | FunSeq2 | ReMM | NCBoost |
| HGMD-DM | 186 | !HGMD-DM | 97 | 0.67 | 0.76 | 0.78 | 0.70 | 0.67 | 0.74 | 0.82 | 0.16 | 0.24 | 0.23 | 0.13 | 0.15 | 0.26 | 0.36 |
| CV | 107 | !CV | 176 | 0.72 | 0.74 | 0.73 | 0.63 | 0.68 | 0.77 | 0.78 | 0.27 | 0.31 | 0.24 | 0.11 | 0.15 | 0.37 | 0.38 |
| Smedley | 78 | !Smedley | 205 | 0.69 | 0.72 | 0.73 | 0.65 | 0.66 | 0.75 | 0.78 | 0.24 | 0.25 | 0.21 | 0.11 | 0.14 | 0.31 | 0.36 |
| ≥ 2 sources | 73 | 1 source | 210 | 0.68 | 0.72 | 0.73 | 0.64 | 0.66 | 0.75 | 0.78 | 0.23 | 0.24 | 0.21 | 0.11 | 0.14 | 0.31 | 0.31 |
| 1 source | 210 | ≥ 2 sources | 73 | 0.8 | 0.81 | 0.82 | 0.69 | 0.77 | 0.82 | 0.85 | 0.31 | 0.43 | 0.29 | 0.16 | 0.24 | 0.48 | 0.52 |
Fig. 5Testing of NCBoost against a fully independent set of recently reported pathogenic variants. The figure shows the area under AUROC (a) and the AUPRC (b) obtained for NCBoost (configuration of features ABCD) together with the eight reference methods (CADD, DeepSEA, Eigen, Eigen-PC, FunSeq2, ReMM, GWAVA, and ncER; “Methods” section) tested on a fully independent set of 70 positive and 700 negative variants matched per genomic region (“Methods” section). Only the GWAVA Region version is depicted for the sake of visualization. GWAVA Unmatched and TSS versions led to AUROC of 0.59 and 0.58 and to AUPRC of 0.02 and 0.02, respectively. Notice that GWAVA and ncEM methods were considered here assuming no overlap of their training sets with the recently reported pathogenic non-coding variants evaluated in these figures. There was no re-training of NCBoost, but the same NCBoost ABCD bundle used in Figs. 3 and 4 was applied
Fig. 6Prioritization of non-coding pathogenic variants within individual genomes. The median across the 100 simulated disease genomes of the within-individual rank percentile of a variant (y-axis) is shown for 37 recently reported pathogenic variants evaluated (red dots) and their corresponding 37 randomly sampled negative common SNVs mapping within the same non-coding region of the associated genes (blue dots). Boxplots represent the distributions for the different evaluated scores (x-axis). NCBoost scores offered the highest within-individual rank percentiles of pathogenic variants (median 97.04%), with a statistical significant difference (one-sided paired Wilcoxon test p value) as compared to all evaluated reference methods: ReMM (median 96.34%; p value = 3.75e−2), Eigen (median 96.00%, p value = 9.55e−2), DeepSEA (95.3%; p value = 1.96e−3), CADD (median 93.50; p value = 4.03e−3), FunSeq2 (median 89.92; p value = 5.88e−4). All scores showed a statistically significant difference between the median rank percentile distribution of pathogenic and their internal control negative variants (one-sided paired Wilcoxon test): NCBoost p value = 8.04E-07, ReMM p value = 9.71E−05, Eigen p value = 2.28E−05, DeepSEA p value = 7.51E−06, CADD p value = 3.88E−04, and FunSeq2 p value = 6.13E−03. Complete details are provided in Additional file 6: Table S5. An alternative graphical representation pairing each pathogenic variant with is provided in Additional file 2: Figure S12
Fig. 7The figure represents the number of monogenic Mendelian disease genes (MMDGs, in blue, y-axis) bearing at least one (solid curves) or ten (dashed curves) potentially pathogenic non-coding variants as a function of the top NCBoost scoring positions considered (x-axis), ranked from left (more pathogenic) to right (less pathogenic). A total of 857,825,085 positions overlapping intronic, 5′UTR or 3′UTR, and upstream and downstream regions collectively associated with 18,404 protein-coding genes was used as a reference background. For the sake of visualization, the x-axis was cut at 10 Million top-scoring genomic positions. Vertical bars represent the thresholds at which left positions display NCBoost scores higher than the corresponding top percentage of the high-confidence pathogenic variants curated in this work. Top 5%, 10%, 15%, and 20% thresholds are represented. The horizontal dotted lines represent the total of 3223 MMDGs (in blue) and 18,404 protein-coding genes (in black) for which NCBoost scores were obtained. Genes bearing potentially pathogenic non-coding variant per gene within the top 5% and the top 10% are highly enriched in MMDGs as compared to the background of protein-coding genes used (see text)