E Mossotto, J J Ashton, L O'Gorman, R J Pengelly, R M Beattie, B D MacArthur, S Ennis.
Abstract
BACKGROUND: Next-generation sequencing (NGS) is revolutionising diagnosis and treatment of rare diseases; however, its application to understanding common disease aetiology is limited. Rare disease applications attribute genetic change(s) at a single locus to a specific phenotype in a binary fashion. In common diseases, where multiple genetic variants within and across genes contribute to disease, binary modelling cannot capture the burden of pathogenicity harboured by an individual across a given gene/pathway. We present GenePy, a novel gene-level scoring system for integration and analysis of next-generation sequencing data on a per-individual basis that transforms NGS data interpretation from variant-level to gene-level. This simple and flexible scoring system is intuitive and amenable to integration with machine learning, network and topological approaches, facilitating the investigation of complex phenotypes.
Keywords: Gene score; Genome analysis; Mathematical modelling; Next-generation sequencing; Pathogenicity score
Year: 2019 PMID: 31096927 PMCID: PMC6524327 DOI: 10.1186/s12859-019-2877-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Pathogenicity scores for SNVs and their reported ranges in the dbNSFP database
| Metric | Type | Implementation | Actual range | Imposed range for transformation |
|---|---|---|---|---|
| CADD | Composite | Score | −7.53 to 35.79 | |
| DANN | Composite | Score | 0 to 1 | – |
| FATHMM [a] | Functionality | 1-Score | −16.13 to 10.64 | |
| fathmm-MKL | Composite | Score | 0 to 1 | – |
| GERP++_RS | Conservation | Score | −12.3 to 6.17 | |
| M-CAP | Composite | Score | 0 to 1 | – |
| MetaLR | Composite | Score | 0 to 1 | – |
| MetaSVM | Composite | Score | −2 to 3 | |
| MutationTaster [a] | Functionality | 1-Score if N/P; Score if A/D | 0 to 1 | – |
| phastCons | Conservation | Score | 0 to 1 | – |
| phyloP | Conservation | Score | −13.28 to 1.2 | |
| Polyphen2_HDIV | Functionality | Score | 0 to 1 | – |
| Polyphen2_HVAR | Functionality | Score | 0 to 1 | – |
| PROVEAN [a] | Functionality | 1-Score | −14 to 14 | – |
| SIFT [a] | Functionality | 1-Score | 0 to 1 | – |
| VEST3 | Functionality | Score | 0 to 1 | – |
[a] To maintain uniform directionality, the complement (1 – score) of a value was taken so that, across scores, a value of 0 consistently indicated benign variation and a value of 1 indicated maximal pathogenicity.
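The directionality rule in the footnote can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the linear min-max rescaling onto [0, 1] is an assumption, since the table names an imposed range for transformation without stating how the transformation is applied.

```python
def to_unit_interval(score, low, high):
    """Rescale a score onto [0, 1] by min-max scaling (an ASSUMED transformation).

    Values outside [low, high] are clipped to the range first.
    """
    clipped = min(max(score, low), high)
    return (clipped - low) / (high - low)


def sift_deleteriousness(score):
    """SIFT is listed as '1-Score': take the complement so that 1 is most pathogenic."""
    return 1.0 - score


def mutationtaster_deleteriousness(score, prediction):
    """MutationTaster is listed as '1-Score if N/P; Score if A/D'."""
    if prediction in ("A", "D"):
        return score
    if prediction in ("N", "P"):
        return 1.0 - score
    raise ValueError(f"unexpected prediction code: {prediction!r}")
```

For example, a MetaSVM score would be rescaled with `to_unit_interval(score, -2, 3)` if its reported range were used as the bounds, which is again an assumption made only for illustration.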
Fig. 1 Single-variant GenePy score distribution under fixed deleteriousness values, showing the impact of varying zygosity and minor allele frequency (MAF).
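The scoring formula itself is not restated in this record. As a rough sketch of how zygosity and MAF could drive a single-variant score, the example below assumes a contribution of the form −D · log10(f1 · f2), where f1 and f2 are the population frequencies of the two alleles an individual carries. The function name and the `"het"`/`"hom"` encoding are hypothetical.

```python
import math


def variant_score(d, maf, zygosity):
    """Single-variant contribution under the ASSUMED form -D * log10(f1 * f2).

    d        : deleteriousness on a 0 (benign) to 1 (pathogenic) scale
    maf      : minor allele frequency, strictly between 0 and 1
    zygosity : "het" (one minor allele) or "hom" (two minor alleles)
    """
    if not 0.0 < maf < 1.0:
        raise ValueError("maf must lie strictly between 0 and 1")
    if zygosity == "het":
        f1, f2 = maf, 1.0 - maf  # one minor and one major allele
    elif zygosity == "hom":
        f1 = f2 = maf            # two copies of the minor allele
    else:
        raise ValueError("zygosity must be 'het' or 'hom'")
    return -d * math.log10(f1 * f2)
```

Under this assumption, a homozygous variant with D = 1 and MAF = 0.01 scores about 4.0, whereas the heterozygous equivalent scores about 2.0, so rarer alleles and homozygosity both raise the score.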
Statistical attributes of whole-gene GenePy scores computed for sixteen deleteriousness metrics. Number of genes for which GenePy scores were calculated, median number of non-variant genes (GenePy = 0), maximum and mean GenePy scores across our cohort (n = 508), coefficient of variation (CV, defined as σ/μ, the standard deviation divided by the mean) and the median number of genes with a GenePy score < 0.01 as a percentage of the total number of genes. The same information is reported for GenePycgl.
| Metric | Gene scores calculated | Median no. of genes with GenePy = 0 within individuals (%) [a] | Max GenePy | Mean GenePy | CV (uncorrected) | Median no. of genes with GenePy < 0.01 within individuals (%) | Max GenePycgl | Mean GenePycgl | CVcgl (corrected) | Median no. of genes with GenePycgl < 0.01 (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| CADD | 14,184 | 9917 (69.92%) | 32.15 | 0.10 | 3.81 | 10,231 (72.13%) | 74.19 | 0.08 | 8.09 | 10,304 (72.64%) |
| DANN | 14,184 | 9917 (69.92%) | 110.48 | 0.33 | 3.37 | 10,153 (71.58%) | 304.15 | 0.25 | 6.96 | 10,196 (71.88%) |
| FATHMM | 13,143 | 9981 (75.94%) | 72.73 | 0.16 | 4.15 | 10,923 (83.11%) | 269.62 | 0.11 | 6.42 | 11,092 (84.40%) |
| fathmm-MKL | 14,178 | 9039 (63.75%) | 50.10 | 0.16 | 3.29 | 9282 (65.48%) | 131.34 | 0.12 | 7.55 | 9332 (65.84%) |
| GERP++_RS | 14,197 | 9910 (69.80%) | 100.44 | 0.32 | 3.35 | 10,116 (71.25%) | 283.69 | 0.24 | 6.47 | 10,143 (71.44%) |
| M-CAP | 12,921 | 12,577 (97.34%) | 24.52 | 0.02 | 12.65 | 12,596 (97.48%) | 59.88 | 0.02 | 19.05 | 12,630 (97.74%) |
| MetaLR | 14,063 | 12,752 (90.68%) | 38.14 | 0.04 | 8.77 | 13,146 (93.48%) | 87.80 | 0.04 | 16.14 | 13,253 (94.24%) |
| MetaSVM | 14,076 | 9845 (69.94%) | 36.76 | 0.10 | 3.95 | 10,141 (72.04%) | 99.44 | 0.08 | 8.94 | 10,207 (72.51%) |
| MutationTaster | 14,039 | 12,161 (86.62%) | 90.86 | 0.13 | 5.24 | 12,521 (89.19%) | 332.05 | 0.09 | 9.02 | 12,579 (89.60%) |
| phastCons | 14,197 | 10,217 (71.97%) | 100.64 | 0.21 | 3.79 | 11,018 (77.60%) | 324.41 | 0.14 | 5.76 | 11,116 (78.29%) |
| phyloP | 14,202 | 9910 (69.78%) | 118.81 | 0.40 | 3.31 | 10,107 (71.17%) | 332.05 | 0.31 | 7.15 | 10,131 (71.34%) |
| Polyphen2_HDIV | 14,745 | 11,824 (80.19%) | 65.48 | 0.14 | 4.89 | 12,558 (85.16%) | 257.00 | 0.12 | 12.08 | 12,658 (85.84%) |
| Polyphen2_HVAR | 14,741 | 11,470 (77.81%) | 59.67 | 0.11 | 5.47 | 12,621 (85.62%) | 239.71 | 0.09 | 14.03 | 12,778 (86.69%) |
| PROVEAN | 13,888 | 9733 (70.08%) | 74.16 | 0.23 | 3.37 | 9958 (71.70%) | 219.39 | 0.17 | 7.93 | 10,003 (72.02%) |
| SIFT | 14,561 | 11,088 (76.15%) | 99.69 | 0.25 | 3.69 | 11,224 (77.08%) | 265.64 | 0.20 | 7.04 | 11,257 (77.31%) |
| VEST3 | 14,170 | 9919 (70.00%) | 53.36 | 0.09 | 5.69 | 10,528 (74.29%) | 136.56 | 0.08 | 12.56 | 10,821 (76.36%) |
[a] Across the cohort of 508 individuals assessed, individual samples have a very high median number of invariant genes, resulting in GenePy scores of zero.
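The two summary statistics used in this table, the coefficient of variation (σ/μ) and the share of genes scoring below 0.01, can be computed as in the sketch below. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the record does not say which estimator was used.

```python
import statistics


def coefficient_of_variation(scores):
    """CV = sigma / mu, using the population standard deviation (an assumption)."""
    mu = statistics.fmean(scores)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(scores) / mu


def fraction_below(scores, threshold=0.01):
    """Fraction of gene scores strictly below the threshold."""
    return sum(s < threshold for s in scores) / len(scores)
```

A CV well above 1, as reported for every metric here, indicates a distribution dominated by many near-zero scores with a long upper tail.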
Fig. 2 GenePy profiles observed for all genes across the whole cohort for all sixteen deleteriousness metrics. Uncorrected GenePy scores (upper panel) exhibit characteristic spikes reflecting gene scores strongly influenced by either single highly deleterious (D = 1) common homozygous variants (red) or single highly deleterious very rare/novel variants (MAF = 0.00001) (blue). GenePycgl score profiles (lower panel) do not display these spikes. Invariant genes, which confer a GenePy score < 0.01, are overrepresented and are not shown; the x-axis therefore starts at the 0.01–0.02 bin. All sixteen versions of the GenePy score exhibit long tails in the score distribution, truncated here at a score of six.
Fig. 3 GenePy score profiles for seven independent patients diagnosed with IBD across selected genes from the NOD2 and TLR pathways. GenePy scores shown were implemented using the M-CAP deleteriousness (D) metric. To facilitate plotting, raw GenePy scores were transformed to Z-scores for each gene. Different colours depict individual patient profiles. Despite being diagnosed with the same disease, all individuals exhibit distinctive profiles across key genes implicated in key immune pathways. Where individuals show evidence of gene pathogenicity within the same pathway (e.g. IBD5 and IBD6), it is conferred through accumulated mutation in different genes: IBD6 has elevated gene-level scores for TAB1, CARD6 and MAPK3, while IBD5 may have impaired function in this pathway due to combined mutation in MAPK13, BP1 and NFKB1. Similarly, IBD1, IBD3 and IBD4 exhibit pathogenic profiles in TLR pathway genes only. These individual-level data can be combined with disease phenotype, severity and treatment outcome data in machine learning models to better stratify patient cohorts and realise the promise of personalised medicine.
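The per-gene Z-score transformation mentioned in the caption can be sketched as follows. The function name and the dictionary layout (gene name mapped to a list of per-individual scores) are hypothetical, and the use of the population standard deviation is an assumption.

```python
import statistics


def zscores_per_gene(scores_by_gene):
    """Standardise raw scores within each gene: z = (score - mean) / sd.

    scores_by_gene maps a gene name to the list of raw scores, one per individual.
    Genes with no variation across individuals are mapped to all zeros.
    """
    result = {}
    for gene, scores in scores_by_gene.items():
        mu = statistics.fmean(scores)
        sd = statistics.pstdev(scores)  # population SD (an assumption)
        result[gene] = [0.0 if sd == 0 else (s - mu) / sd for s in scores]
    return result
```

Standardising within each gene puts genes with very different raw score ranges on a common axis, which is what allows one patient's profile to be compared across genes in a single plot.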
NOD2 GenePy score statistics (maxima and means) and Mann-Whitney U tests across groups for all sixteen deleteriousness metrics. p-values smaller than 1 × 10⁻² or smaller than 5 × 10⁻² are highlighted by two (**) or one (*) asterisks, respectively. SKAT-O gene association results comparing patient groups against controls are provided in the last two rows.
| Metric | Controls max | Controls mean | IBD max | IBD mean | IBD p | UC max | UC mean | UC p | CD max | CD mean | CD p |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CADD | 2.71 | 0.28 | 3.52 | 0.40 | 1.04 × 10⁻¹ | 2.66 | 0.20 | 1.38 × 10⁻¹ | 3.52 | 0.54 | 4.62 × 10⁻⁴ ** |
| DANN | 5.92 | 0.84 | 7.62 | 1.06 | 1.36 × 10⁻¹ | 5.62 | 0.57 | 1.22 × 10⁻¹ | 7.62 | 1.38 | 8.16 × 10⁻⁴ ** |
| FATHMM | 3.33 | 0.49 | 4.34 | 0.66 | 1.04 × 10⁻¹ | 3.14 | 0.38 | 1.47 × 10⁻¹ | 4.34 | 0.84 | 4.84 × 10⁻⁴ ** |
| fathmm-MKL | 4.53 | 0.37 | 6.24 | 0.55 | 4.54 × 10⁻² * | 3.78 | 0.25 | 3.15 × 10⁻¹ | 6.24 | 0.76 | 1.79 × 10⁻⁴ ** |
| GERP++_RS | 5.30 | 0.64 | 7.00 | 0.87 | 1.26 × 10⁻¹ | 4.95 | 0.42 | 1.27 × 10⁻¹ | 7.00 | 1.17 | 6.95 × 10⁻⁴ ** |
| M-CAP | 1.87 | 0.12 | 3.39 | 0.22 | 1.58 × 10⁻² * | 1.73 | 0.08 | 4.62 × 10⁻¹ | 3.39 | 0.32 | 1.37 × 10⁻⁴ ** |
| MetaLR | 2.42 | 0.16 | 3.39 | 0.29 | 2.71 × 10⁻¹ | 1.81 | 0.10 | 2.34 × 10⁻² * | 3.39 | 0.42 | 1.63 × 10⁻³ ** |
| MetaSVM | 2.67 | 0.30 | 3.61 | 0.43 | 9.88 × 10⁻² | 2.50 | 0.22 | 1.50 × 10⁻¹ | 3.61 | 0.57 | 4.39 × 10⁻⁴ ** |
| MutationTaster | 4.38 | 0.26 | 5.10 | 0.39 | 4.48 × 10⁻² * | 2.65 | 0.13 | 4.37 × 10⁻¹ | 5.10 | 0.57 | 7.47 × 10⁻⁴ ** |
| phastCons | 4.66 | 0.35 | 5.24 | 0.56 | 2.86 × 10⁻¹ | 3.54 | 0.24 | 2.70 × 10⁻² * | 5.24 | 0.77 | 2.16 × 10⁻³ ** |
| phyloP | 6.32 | 1.02 | 7.93 | 1.27 | 1.23 × 10⁻¹ | 5.92 | 0.75 | 1.38 × 10⁻¹ | 7.93 | 1.62 | 7.09 × 10⁻⁴ ** |
| Polyphen2_HDIV | 5.32 | 0.68 | 7.03 | 0.82 | 2.02 × 10⁻¹ | 2.30 | 0.33 | 6.22 × 10⁻² | 7.03 | 1.13 | 1.20 × 10⁻³ ** |
| Polyphen2_HVAR | 4.86 | 0.46 | 5.31 | 0.64 | 1.65 × 10⁻¹ | 2.07 | 0.21 | 7.22 × 10⁻² | 5.31 | 0.92 | 7.90 × 10⁻⁴ ** |
| PROVEAN | 4.33 | 0.66 | 5.23 | 0.86 | 1.04 × 10⁻¹ | 4.08 | 0.49 | 1.45 × 10⁻¹ | 5.23 | 1.10 | 4.84 × 10⁻⁴ ** |
| SIFT | 5.91 | 0.95 | 7.61 | 1.14 | 1.47 × 10⁻¹ | 5.43 | 0.64 | 1.16 × 10⁻¹ | 7.61 | 1.47 | 9.64 × 10⁻⁴ ** |
| VEST3 | 3.28 | 0.30 | 4.21 | 0.44 | 1.36 × 10⁻¹ | 2.24 | 0.17 | 1.13 × 10⁻¹ | 4.21 | 0.62 | 7.48 × 10⁻⁴ ** |
| SKAT-O (all variants) | – | – | | | 5.41 × 10⁻¹ | | | 9.76 × 10⁻² | | | 3.46 × 10⁻² * |
| SKAT-O (MAF < 0.05) | – | – | | | 4.63 × 10⁻¹ | | | 1.37 × 10⁻¹ | | | 5.02 × 10⁻² |
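For readers unfamiliar with the group comparison used in this table, the following is a minimal pure-Python sketch of a two-sided Mann-Whitney U test. It relies on the normal approximation without tie or continuity correction, which is a simplification and not necessarily the procedure used to produce the p-values above. The function name is hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, approximate p-value). Ties receive average
    ranks, but no tie or continuity correction is applied to the variance.
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u_x, p_value
```

In practice, the first argument would hold the NOD2 GenePy scores of a patient group and the second those of the controls. For small samples or heavily tied data, such as the many zero scores seen here, an exact or tie-corrected implementation would be preferable.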
Comparison of PD versus non-PD individuals. Significant results are shown in bold type. For each gene, only the most significant result of all SNV association tests is shown, and for each of these the rs ID is provided. Additionally, the number of SNV association tests conducted within each gene is indicated in brackets. No correction is made for testing six genes or for testing multiple SNVs within any given gene.
| Test (PD vs non-affected samples) | | | | | | |
|---|---|---|---|---|---|---|
| GenePy | 0.178 | 0.445 | | 0.983 | 0.828 | 0.206 |
| SKAT-O | 1 | 0.557 | 0.157 | 0.427 | 0.712 | 0.741 |
| Top 5% comparison | | 0.107 | | | 0.347 | |
| Most significant SNV | | 0.081 | | 0.051 | 0.433 | 0.433 |