Simone Marini1,2, Ivan Limongelli3,4,5, Ettore Rizzo4,5, Alberto Malovini6, Edoardo Errichiello7, Annalisa Vetro7,3, Tan Da2, Orsetta Zuffardi7,3, Riccardo Bellazzi2,5,6.
Abstract
Among the scientific challenges posed by complex diseases with a strong genetic component, two stand out. One is unveiling the role of rare and common genetic variants; the other is designing classification models to improve clinical diagnosis and predictive models for prognosis and personalized therapies. In this paper, we present a data fusion framework merging gene, domain, pathway and protein-protein interaction data related to a next-generation sequencing epilepsy gene panel. Our method integrates association information from multiple genomic sources and aims to highlight the set of common and rare variants capable of triggering a complex disease. Compared with other approaches, our method shows better performance in classifying patients affected by epilepsy.
Year: 2016 PMID: 27984588 PMCID: PMC5161322 DOI: 10.1371/journal.pone.0164940
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Data fusion framework.
Sequencing data are collapsed to calculate their mutational loads using four ROI types: genes, pathways, domains and PPIs. This allows ROI-phenotype associations to be studied along the four corresponding axes. Each element tested for association then becomes a feature for a prediction model. Single ROI types are combined to create data sets. Each data set is split into a training set and a test set. The training set is used to tune the learning parameters of a random forest (RF) model and then to select the best set of features, while the test set is used to measure prediction performance.
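The collapsing step and the combination of ROI types can be sketched as follows. This is only a minimal illustration, not the authors' implementation: it assumes that a mutational load is a plain count of variants per sample and ROI (the caption does not give the actual scoring), and the names `roi_loads`, `roi_data_sets`, `geneA`, `geneB` and `pathway_X` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def roi_loads(variants, roi_members):
    """Count, per sample and ROI, the variants that fall on the ROI's genes.

    variants: iterable of (sample_id, gene) pairs, one per observed variant.
    roi_members: dict mapping an ROI name to the set of genes it contains.
    Assumption: the load is an unweighted variant count.
    """
    # Invert the membership map so each variant is looked up only once.
    gene_to_rois = defaultdict(set)
    for roi, genes in roi_members.items():
        for gene in genes:
            gene_to_rois[gene].add(roi)

    loads = defaultdict(lambda: defaultdict(int))
    for sample, gene in variants:
        for roi in gene_to_rois.get(gene, ()):
            loads[sample][roi] += 1
    return {sample: dict(per_roi) for sample, per_roi in loads.items()}


def roi_data_sets(roi_types=("genes", "domains", "pathways", "PPIs")):
    """List every non-empty combination of ROI types."""
    return [
        combo
        for size in range(1, len(roi_types) + 1)
        for combo in combinations(roi_types, size)
    ]


# Toy input with hypothetical gene and pathway names.
toy_variants = [("s1", "geneA"), ("s1", "geneB"), ("s2", "geneB")]
toy_rois = {"geneA": {"geneA"}, "pathway_X": {"geneA", "geneB"}}
toy_loads = roi_loads(toy_variants, toy_rois)
data_sets = roi_data_sets()
```

With four ROI types, `roi_data_sets()` yields 15 combinations, which matches the number of data sets reported in the classification table.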
Fig 2. Examples of variants influencing more than one gene.
The protein encoded by gene A interacts with the proteins encoded by gene B and gene C. (a) The variant on gene A, V, contributes to the score of both A and B because of their interaction. In the same way, the variants on gene B, V, and on gene C, (V), both contribute to the scores of genes A, B and C. (b) Resulting variant contributions to the final PPI scores.
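One way to read this figure is that a variant counts towards the PPI score of the gene it falls on and of every gene whose protein interacts with it. The sketch below follows that reading under stated assumptions: the PPI ROI of a gene is taken to be the gene plus its direct interaction partners, scores are unweighted variant counts, and `ppi_roi_scores` is a hypothetical helper name. The paper's actual scoring may differ.

```python
def ppi_roi_scores(variant_counts, interactions):
    """Score each gene's PPI ROI as the variants on the gene and its partners.

    variant_counts: dict mapping a gene to the number of variants on it.
    interactions: iterable of (gene, gene) pairs, treated as undirected.
    Assumption: every variant contributes one unit to each ROI it touches.
    """
    # Build an undirected adjacency map from the interaction pairs.
    neighbours = {}
    for a, b in interactions:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    scores = {}
    for gene, partners in neighbours.items():
        members = partners | {gene}
        scores[gene] = sum(variant_counts.get(m, 0) for m in members)
    return scores


# Gene A interacts with B and C, as in the figure; one variant per gene.
fig2_scores = ppi_roi_scores(
    {"A": 1, "B": 1, "C": 1},
    [("A", "B"), ("A", "C")],
)
```

Here the variant on A contributes to all three scores, while the variants on B and C each contribute to their own gene's score and to that of A.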
Fig 3. Interconnected, statistically significant ROIs.
Graphical representation of statistically significant ROIs and their overlaps. An arrow indicates that one element is included in another. Gene ROIs (light blue) can be part of pathway (green) or PPI (grey) ROIs, while domain ROIs (purple) can be part of gene ROIs.
Genes associated with the phenotype.
| ROI | Element | p-value |
|---|---|---|
| genes | SCN1A | 5.83 × 10⁻⁶ |
| genes | CACNA1G | 6.56 × 10⁻⁵ |
| genes | CHRNB2 | 3.04 × 10⁻⁵ |
p-values and associated genes of gene ROIs.
Domains, pathways and PPIs associated with the phenotype.
| ROI | Element | p-value | Related genes |
|---|---|---|---|
| Domains | IPR001098 | 1.63 × 10⁻⁴ | POLG |
| Domains | IPR001696 | 3.56 × 10⁻⁵ | |
| Domains | IPR010526 | 2.99 × 10⁻⁵ | |
| Domains | IPR005821 | 9.59 × 10⁻⁶ | CACNA1A, |
| Pathways | hsa04930 | 1.2 × 10⁻⁶ | CACNA1A, |
| Pathways | hsa04725 | 7.99 × 10⁻⁵ | CACNA1A, CHRNA4, |
| Pathways | hsa04728 | 1.18 × 10⁻⁶ | CACNA1A, GRIN2A, GRIN2B, PLCB1, PPP2R2C, |
| Pathways | hsa05033 | 1.42 × 10⁻⁵ | CACNA1A, CHRNA4, |
| Pathways | hsa04020 | 4.22 × 10⁻⁶ | CACNA1A, |
| Pathways | hsa04911 | 1.55 × 10⁻⁵ | ATP1A2, KCNMA1, KCNN3, PLCB1, SLC2A1, SLC2A2 |
| Pathways | hsa04919 | 2.71 × 10⁻⁴ | ATP1A2, NOTCH3, PLCB1, SLC2A1 |
| Pathways | hsa04976 | 1.75 × 10⁻⁴ | ATP1A2, SLC2A1 |
| PPIs | UBC | 1.44 × 10⁻⁵ | FLNA, TUBA1A, UBE3A, SLC9A6, SLC1A3, PAFAH1B1, STXBP1, GPR56, NEDD4L, SLC2A1, ASPM, FANCI, TIMM17B, KCTD7, PLCB1, PDCD10, OPA1, NOTCH3, SCARB2, SLC25A22, HTT, DYRK1A, ATP6AP2, ALDH7A1, DMD, PQBP1, ARAF, TUBB2B, CSTB, TAP1, PTK2B, MECP2, TBC1D24, CNTNAP2, GRIN2A, CLCN2, PPP2R2C, POLG, STRADA, GABRD, CACNA1A, ATP1A2, MEF2C, KCNN3, SLC4A10, GABRB3, RBFOX1, JRK, PRICKLE1, TCF4, ARHGEF9, GRIN2B, SLC6A3, CASR, NHLRC1, EPM2A, OPRM1, GABBR1, CLN8 |
| PPIs | SNTA1 | 4.24 × 10⁻⁶ | DMD, |
| PPIs | PSEN1 | 5.83 × 10⁻⁶ | |
p-values and related genes of domain, pathway and PPI ROIs. Related genes associated with the phenotype are in bold font.
Classification performance for each data set.
| data set | AUC (feat. selection) | AUC (unselected feat.) | MCC (feat. selection) | MCC (unselected feat.) |
|---|---|---|---|---|
| genes | 0.86 | 0.83 | 0.55 | 0.48 |
| domains | 0.76 | 0.82 | 0.55 | 0.54 |
| pathways | 0.77 | 0.76 | 0.37 | 0.43 |
| PPIs | 0.81 | 0.79 | 0.44 | 0.38 |
| 0.57 | ||||
| genes + pathways | 0.84 | 0.83 | 0.5 | 0.52 |
| genes + PPIs | 0.84 | 0.79 | 0.49 | 0.42 |
| 0.88 | 0.86 | 0.54 | ||
| domains + PPIs | 0.87 | 0.81 | 0.59 | 0.45 |
| pathways + PPIs | 0.82 | 0.8 | 0.47 | 0.45 |
| 0.86 | 0.6 | 0.56 | ||
| genes + pathways + PPIs | 0.85 | 0.8 | 0.53 | 0.43 |
| genes + domains + PPIs | 0.87 | 0.83 | 0.56 | 0.52 |
| domains + pathways + PPIs | 0.88 | 0.82 | 0.6 | 0.48 |
| genes + domains + pathways + PPIs | 0.86 | 0.83 | 0.53 | 0.52 |
Performance of the prediction models on the test sets, averaged over ten repetitions. The highest value in each column is in bold. Feature selection increases AUC in 14 out of 15 data sets, by up to +6%, and MCC in 12 out of 15, by up to +12%.
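For readers unfamiliar with the second metric, the Matthews correlation coefficient (MCC) can be computed from the four cells of a binary confusion matrix. The sketch below uses the standard definition; the function name `mcc`, the example counts, and the convention of returning 0 when the denominator vanishes are illustrative choices, not taken from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when any marginal total is zero,
    a common convention for the undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, for illustration only.
perfect = mcc(tp=50, tn=50, fp=0, fn=0)
chance = mcc(tp=25, tn=25, fp=25, fn=25)
inverted = mcc(tp=0, tn=0, fp=50, fn=50)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are of unequal size.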