Marinka Žitnik, Vuk Janjić, Chris Larminie, Blaž Zupan, Nataša Pržulj.
Abstract
The advent of genome-scale genetic and genomic studies allows new insight into disease classification. Recently, a shift was made from linking diseases simply based on their shared genes towards systems-level integration of molecular data. Here, we aim to find relationships between diseases based on evidence from fusing all available molecular interaction and ontology data. We propose a multi-level hierarchy of disease classes that significantly overlaps with existing disease classification. In it, we find 14 disease-disease associations currently not present in Disease Ontology and provide evidence for their relationships through comorbidity data and literature curation. Interestingly, even though the number of known human genetic interactions is currently very small, we find they are the most important predictor of a link between diseases. Finally, we show that omission of any one of the included data sources reduces prediction quality, further highlighting the importance of the paradigm shift towards systems-level data fusion.
Year: 2013 PMID: 24232732 PMCID: PMC3828568 DOI: 10.1038/srep03202
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Data fusion.
Panel A is a graphical representation of our approach to discovering disease-disease associations through data fusion by matrix factorisation. The block-based matrix representation shown corresponds exactly to the data fusion schema in Figure 3-A. We combine 11 data sources on four different types of objects (see Methods): drugs, genes, Disease Ontology (DO) terms and Gene Ontology (GO) terms. These data are encoded in two types of matrices: constraint matrices, which relate objects of the same type (such as drugs if they have common adverse effects) and are placed on the main diagonal (illustrated by matrices with blue entries); and relation matrices, which relate objects of different types and are placed off the main diagonal (illustrated by matrices with grey entries). Our data fusion approach involves three main steps. First, we construct a block-based matrix representation of all data sources used in our study (panel A, left). The molecular data encoded in these matrices are sparse, incomplete and noisy (depicted by different shades of blue and grey), and some matrices are completely missing because the associated data sources are not available (e.g. no link between GO terms and drugs). In the second step, we simultaneously decompose all relation matrices as products of low-rank matrix factors and use constraint matrices to regularise the low-rank approximations of relation matrices. The key idea of our data fusion approach is sharing low-rank matrix factors between relation matrices that describe objects of a common type. The resulting factorised system (panel A, middle) contains matrix factors that are specific to each object type (four matrices in the left part; e.g. G_Drug), and matrix factors that are specific to each data source (six matrix factors in the right part; e.g. S_Gene,DO Term). Thus, low-rank matrix factors capture source-specific and object-type-specific patterns. 
Finally, we use matrix factors to reconstruct relation matrices and complete their unobserved entries (panel A, right). Panel B shows the algorithm for assigning diseases to classes and obtaining disease-disease association predictions.
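The factor sharing described in this caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy sizes, the ranks, the tri-factor form G_i S_ij G_j^T and the trace-based constraint penalty are assumptions chosen to match the caption's description of type-specific factors, source-specific factors and constraint-based regularisation. No fitting is performed; the sketch only shows how one factor per object type is reused across relation matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes and ranks for three of the four object types.
n = {"gene": 6, "do": 5, "drug": 4}
k = {"gene": 3, "do": 2, "drug": 2}

# One factor G_i per object type, shared by every relation involving type i.
G = {t: rng.random((n[t], k[t])) for t in n}

# One small factor S_ij per relation matrix (i.e. per data source).
S = {
    ("gene", "do"): rng.random((k["gene"], k["do"])),
    ("gene", "drug"): rng.random((k["gene"], k["drug"])),
}

def reconstruct(i, j):
    """Low-rank estimate of relation matrix R_ij as G_i S_ij G_j^T."""
    return G[i] @ S[(i, j)] @ G[j].T

def objective(R, theta):
    """Squared reconstruction error over the given relation matrices plus an
    assumed penalty tr(G_i^T Theta_i G_i) for each constrained object type."""
    fit = sum(np.sum((R[ij] - reconstruct(*ij)) ** 2) for ij in R)
    penalty = sum(np.trace(G[t].T @ theta[t] @ G[t]) for t in theta)
    return float(fit + penalty)

# Both reconstructions reuse G["gene"], which is what couples the two sources.
print(reconstruct("gene", "do").shape, reconstruct("gene", "drug").shape)
```

Because the gene factor appears in both products, any update to it during optimisation would be informed by both data sources at once, which is the coupling the caption refers to.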
Figure 2. Multi-layered hierarchical decomposition of disease classes.
Our analysis yields 108 disease classes using the most stringent threshold for predicting disease-disease associations. The identified classes are rather small: each contains at most 17 diseases, with the exception of the largest class, which consists of 146 diseases (root layer). We further decompose the largest class by re-running the data fusion process on the set of diseases it contains in order to identify its fine-grained structure (level one). We repeat the data fusion analysis using this top-down strategy two more times (levels two and three), which results in a hierarchical decomposition of the most reliable disease classes (see Methods).
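The top-down strategy in this caption can be sketched as a short recursion. The function `cluster_fn` is a hypothetical stand-in for one full run of data fusion followed by class assignment, and `halve` is a toy replacement used only to make the sketch runnable; neither comes from the source.

```python
def decompose(diseases, cluster_fn, max_depth=3, depth=0):
    """Cluster `diseases`, then re-run clustering on the largest class,
    descending at most `max_depth` levels below the root layer."""
    classes = cluster_fn(diseases)
    node = {"level": depth, "classes": classes, "child": None}
    if depth < max_depth and classes:
        largest = max(classes, key=len)
        # Stop early if the largest class could not be split any further.
        if len(largest) < len(diseases):
            node["child"] = decompose(largest, cluster_fn, max_depth, depth + 1)
    return node

def halve(items):
    """Toy clustering: split the input list into two halves."""
    mid = len(items) // 2
    return [items[:mid], items[mid:]]

tree = decompose(list(range(16)), halve)
node = tree
while node is not None:
    print(node["level"], [len(c) for c in node["classes"]])
    node = node["child"]
```

With `max_depth=3` the recursion produces a root layer plus three further levels, mirroring the layout described in the caption.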
Figure 3. System-level data fusion approach to disease re-classification.
Panel A shows the relationships between data sources: nodes represent four types of objects, i.e. genes, GO terms, DO terms and drugs; arcs denote data sources that relate objects of different types (relation matrices, Rij, i ≠ j), or objects of the same type (constraints, Θi). Panel B shows a disease class predicted by data fusion overlaid with a DO graph. Members of the disease class are outlined. The example, which presents diseases associated with crescentic glomerulonephritis, illustrates the ability of data fusion to capture real disease classes.
Data sources. All data sources used in this disease association study, their size, and edge density. Relation matrices Rij relate objects of two different types, so their node counts are reported separately for each type (delimited by a forward slash).
| Matrix | Data description | # Nodes | # Edges | Density | Reference |
|---|---|---|---|---|---|
| Θ1(1) | Protein-protein interactions | 10,360 | 55,787 | 0.00104 | BioGRID v3.1.94 |
| Θ1(2) | Gene co-expression | 539 | 869 | 0.006 | Prieto et al. |
| Θ1(3) | Cell signalling data | 1,217 | 7,517 | 0.01016 | KEGG |
| Θ1(4) | Genetic interactions | 542 | 511 | 0.00349 | BioGRID v3.1.94 |
| Θ1(5) | Metabolic network | 5,908 | 1,505,831 | 0.0863 | KEGG |
| Θ4 | Drug interaction data | 4,477 | 21,821 | 0.00218 | DrugBank v3.0 |
| Θ3 | GO semantic structure | 11,853 | 43,924 | 0.00063 | Gene Ontology |
| Θ2 | DO semantic structure | 1,536 | 1,098 | 0.00093 | Disease Ontology |
| R13 | Gene annotations | 17,428/11,853 | 100,685 | 0.00049 | Gene Ontology |
| R14 | Drug-target relationships | 1,978/4,477 | 7,977 | 0.00009 | DrugBank v3.0 |
| R12 | Gene-disease relationships | 5,267/1,536 | 22,084 | 0.00273 | Mapped GeneRIF |
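The source does not state how the Density column is computed. As a reading aid, the following sketch assumes the usual definitions: edges divided by the number of possible unordered pairs for same-type (constraint) matrices, and edges divided by rows times columns for relation matrices. Under that assumption it reproduces the reported values for the two rows used below.

```python
def density_same_type(nodes: int, edges: int) -> float:
    """Fraction of possible undirected pairs among `nodes` objects that are linked."""
    return 2 * edges / (nodes * (nodes - 1))

def density_relation(rows: int, cols: int, edges: int) -> float:
    """Fraction of possible row-column pairs that are linked."""
    return edges / (rows * cols)

# Protein-protein interactions: 10,360 nodes and 55,787 edges.
print(round(density_same_type(10_360, 55_787), 5))          # 0.00104

# Gene annotations: 17,428 genes, 11,853 GO terms, 100,685 edges.
print(round(density_relation(17_428, 11_853, 100_685), 5))  # 0.00049
```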
14 predicted disease-disease associations currently not captured by the semantic structure of Disease Ontology. Literature support for them is listed under the column denoted by “References”. Reported p-values measure how likely it would be for a disease association to emerge if the gene-disease relation matrix were permuted, as described in Methods.
| Disease pair | Literature evidence (quoted verbatim from the referenced source) | P-value | References |
|---|---|---|---|
| vitamin B deficiency (DOID:8449) endogenous depression (DOID:1595) | “Vitamin B complex deficiency causes the psychiatric symptoms of atypical endogenous depression. Dementia and depression have been association with this deficiency possibly from under production of methionine.” | <0.001 | |
| gastric lymphoma (DOID:10540) crescentic glomerulonephritis (DOID:13139) | “Mixed cryoglobulinemia-associated membranoproliferative glomerulonephritis disclosed gastric MALT lymphoma. Glomerulonephritis and lymphoma tend to co-exist in the same patients (relative risk 34.0; | <0.001 | |
| thyroid medullary carcinoma (DOID:3973) cholestasis (DOID:13580) | “Paraneoplastic cholestasis and hypercoagulability associated with medullary thyroid carcinoma. Cholestasis is likely a paraneoplastic effect of thyroid medullary carcinoma.” | 0.001 | |
| crescentic glomerulonephritis (DOID:13139) miliary tuberculosis (DOID:9861) | “Complex-mediated diffuse proliferative glomerulonephritis with crescentic formation is associated with miliary tuberculosis. Antituberculous agents successfully treat miliary tuberculosis and recovered renal function.” | 0.001 | |
| thyroid adenoma (DOID:2891) thymoma (DOID:3275) | “Coexistence of bilateral paraganglioma of the A. carotis, thymoma and thyroid adenoma. A common neuroectodermal origin is proposed as an explanation for the coexistence of the carotid body tumor and multiple endocrine tumors.” | 0.001 | |
| early myoclonic encephalopathy (DOID:308) Angelman syndrome (DOID:1932) | “Angelman syndromes share a range of clinical characteristics, including intellectual disability with or without regression and infantile encephalopathy. It presents in infancy with nonspecific features, such as psychomotor delay and seizures. This can lead to the descriptive labels of cerebral palsy or static encephalopathy.” | <0.001 | |
| autoimmune polyendocrine syndrome (DOID:14040) myositis (DOID:633) | “Autoimmune polyendocrine syndrome type 2 (known as Schmidt's syndrome) can be associated with interstitial myositis, an inflammatory myopathy which can be pathologically distinguished from idiopathic polymyositis and inclusion body myositis.” | <0.001 | |
| primary hyperparathyroidism (DOID:11202) sarcoidosis (DOID:11335) | “Primary hyperparathyroidism simulates sarcoidosis. Coexisting primary hyperparathyroidism and sarcoidosis cause increased Angiotensin-converting enzyme and decreased parathyroid hormone and phosphate levels.” | <0.001 | |
| cerebrotendinous xanthomatosis (DOID:4810) viral hepatitis (DOID:1884) | “Mutations in the sterol 27-hydroxylase gene (CYP27A) cause hepatitis of infancy as well as cerebrotendinous xanthomatosis. Accumulation of cholesterol and cholestanol can lead to the xanthomata, neurodegeneration, cataracts and atherosclerosis that are typical of cerebrotendinous xanthomatosis.” | <0.001 | |
| lepromatous leprosy (DOID:10887) mental depression (DOID:1596) | “The precipitating causes of relapse in leprosy include mental depression which downgrades immunity. The prevalence of dementia and depression in older leprosy patients is high.” | 0.001 | |
| male infertility (DOID:12336) DiGeorge syndrome (DOID:11198) | “Complex chromosome rearrangements (CCR) are rare structural chromosome aberrations that can be found in patients with phenotypic abnormalities or in phenotypically normal patients presenting infertility. The malsegregation of CCR can lead to partial 10p12.3 to 10p14 deletion, associated with the DiGeorge like phenotype.” | 0.001 | |
| Cushing's syndrome (DOID:12252) Hodgkin's lymphoma (DOID:8543) | “Hodgkin's lymphoma is highly responsive to steroids and Cushing's syndrome results from over exposure to corticosteroids, so it could be considered a treatment side effect. However, the co-existence in one patient of Cushing's disease (caused by a tumour in the pituitary) that suppressed the Hodgkin's lymphoma has been reported.” | <0.001 | |
| crescentic glomerulonephritis (DOID:13139) prostate cancer (DOID:10283) | “There can be two potential causes for the association: 1) that the drugs and treatment regimen that cancer patients are on causes the glomerulonephritis, or 2) that features of the cancer may cause the glomerulonephritis with ANCA being associated in both cases.” | <0.001 | |
| allergic bronchopulmonary aspergillosis (DOID:13166) myopathy (DOID:423) | “Allergic Bronchopulmonary aspergillosis is caused by a fungal disease. Fungal diseases are often treated with triazoles. Drug-induced myopathies are well recognised with triazole class of drugs. The association between these two may therefore be based on the treatment and risk it carries, rather than a common mechanism.” | <0.001 | |
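The permutation-based p-values reported in this table can be illustrated with a toy sketch. The actual procedure is described in the paper's Methods, which are not reproduced here, so everything below is an assumption: the association score is simply the number of shared genes (`shared_genes` is hypothetical), the permutation shuffles each disease column of the gene-disease matrix independently, and the p-value is the fraction of permutations that score at least as high as the observed data.

```python
import numpy as np

def shared_genes(R, a, b):
    """Toy association score: number of genes linked to both disease columns."""
    return int(np.sum((R[:, a] > 0) & (R[:, b] > 0)))

def permutation_p_value(R, a, b, n_perm=1000, seed=0):
    """Fraction of permuted matrices whose score for (a, b) reaches the observed score."""
    rng = np.random.default_rng(seed)
    observed = shared_genes(R, a, b)
    hits = 0
    for _ in range(n_perm):
        # Shuffle every disease column independently, keeping its gene count.
        shuffled = rng.permuted(R, axis=0)
        if shared_genes(shuffled, a, b) >= observed:
            hits += 1
    return hits / n_perm

# Toy gene-disease matrix: 40 genes, 3 diseases.
R = np.zeros((40, 3), dtype=int)
R[:8, 0] = 1     # disease 0 and disease 1 share the same eight genes
R[:8, 1] = 1
R[20:28, 2] = 1  # disease 2 shares no genes with disease 0

print(permutation_p_value(R, 0, 1))
print(permutation_p_value(R, 0, 2))
```

With 1,000 permutations, the smallest non-zero value this estimator can return is 0.001, so a result of zero would be reported as below 0.001.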
Relative contribution of each data source to the fused model. Starting from the configuration given in Figure 3-A, we remove individual data sources, re-run the data fusion algorithm, and compute the changes in residual sum of squares (RSS) and explained variance (Evar) for the resulting model. For example, if we remove protein-protein interaction data (column labelled “Θ1(1)”), the quality of the resulting fused model drops by 2.0% (i.e. RSS increases by 2.0% and Evar decreases by 2.0%). The column labelled “Θ4 + R14” corresponds to the configuration in which we remove all drug-related information from the system, while the one labelled “Θ4” indicates that only drug side-effects information was removed.
| Data source | Θ4 | Θ4 + R14 | Θ3 | Θ3 + R13 | |||||
|---|---|---|---|---|---|---|---|---|---|
| 13.3% | 6.3% | 2.0% | 2.0% | 2.0% | 2.2% | 3.8% | 1.0% | 1.9% | |
| 9.5% | 4.5% | 2.5% | 2.0% | 2.0% | 1.3% | 4.6% | 1.8% | 3.2% |
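The two quality measures named in the caption can be sketched as follows. The source does not give their formulas, so this sketch assumes RSS is the sum of squared differences between a relation matrix and its reconstruction, and that Evar is 1 − RSS / Σ R²; `relative_change` is a hypothetical helper for expressing how a measure shifts once a data source is removed.

```python
import numpy as np

def rss(R, R_hat):
    """Residual sum of squares between a matrix and its reconstruction."""
    return float(np.sum((R - R_hat) ** 2))

def evar(R, R_hat):
    """Explained variance, assumed here to be 1 - RSS / sum(R ** 2)."""
    return 1.0 - rss(R, R_hat) / float(np.sum(R ** 2))

def relative_change(full, ablated):
    """Signed change of a measure after removing a data source, relative to the full model."""
    return (ablated - full) / abs(full)

# Toy example: a perfect reconstruction versus an all-zero one.
R = np.eye(2)
print(rss(R, R), evar(R, R))
print(rss(R, np.zeros((2, 2))), evar(R, np.zeros((2, 2))))

# An RSS moving from 100 to 102 is an increase of about 2%.
print(relative_change(100.0, 102.0))
```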