Lucía Prieto Santamaría, Eduardo P García Del Valle, Massimiliano Zanin, Gandhi Samuel Hernández Chan, Yuliana Pérez Gallardo, Alejandro Rodríguez-González.
Abstract
Established nosological models have so far given physicians an adequate classification of diseases. Such systems are important for identifying diseases correctly and treating them successfully. However, these taxonomies tend to be based on phenotypical observations and lack a molecular or biological foundation. There is therefore an urgent need to modernize them so that they incorporate the heterogeneous information now being produced, such as genomic, proteomic, transcriptomic and metabolic data, leading to more comprehensive and robust structures. To that end, we have developed an extensive methodology for analysing how new nosological models can be generated from biological features. Different disease datasets were considered, and distinct disease-related features, namely genes, proteins, metabolic pathways and genetic variants, were represented as binary and numerical vectors. From those vectors, distances between diseases were computed using several metrics. Clustering algorithms were then applied to group diseases, generating different models, each corresponding to a distinct combination of the previous parameters. The models were evaluated by means of intrinsic metrics, which showed that some of them are highly suitable as a basis for new nosologies. One of the clustering configurations was analysed in depth, demonstrating its quality and validity in the research context, and further biological interpretations were made. That model was generated by the OPTICS clustering algorithm, studying the distance between diseases based on gene sharedness and using the cosine metric. 729 clusters were formed in this model, which obtained a Silhouette coefficient of 0.43.
Year: 2021 PMID: 34702888 PMCID: PMC8548311 DOI: 10.1038/s41598-021-00554-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Best results obtained with DBSCAN on the different datasets, for each type of vector and with disease distances measured according to different metrics.
| Algorithm | Dataset | Vector | Distance | Feature | Clusters | Noise | Silhouette | Calinski–Harabasz | Davies–Bouldin |
|---|---|---|---|---|---|---|---|---|---|
| DBSCAN | Complete | Binary | Dice | Pathway | 833 | 2236 | 0.4 | 14.49 | 1 |
| DBSCAN | Complete | Binary | Hamming | Pathway | 22 | 420 | 0.56 | 538.49 | 2.42 |
| DBSCAN | Complete | Binary | Jaccard | Pathway | 832 | 2108 | 0.44 | 17.34 | 1.03 |
| DBSCAN | Complete | Binary | Sokalsneath | Pathway | 786 | 1847 | 0.48 | 28.82 | 1.04 |
| DBSCAN | Complete | Numeric | Correlation | Gene | 1760 | 4152 | 0.4 | 10.82 | 1.03 |
| DBSCAN | Complete | Numeric | Cosine | Gene | 1760 | 4148 | 0.4 | 10.78 | 1.03 |
| DBSCAN | Complete | Numeric | Euclidean | Gene | 92 | 552 | 0.34 | 140.27 | 5.69 |
| DBSCAN | Inner | Binary | Dice | Pathway | 462 | 1052 | 0.31 | 22.24 | 1.18 |
| DBSCAN | Inner | Binary | Hamming | Pathway | 11 | 413 | 0.59 | 843.7 | 2.75 |
| DBSCAN | Inner | Binary | Jaccard | Pathway | 513 | 1314 | 0.37 | 16.86 | 1.12 |
| DBSCAN | Inner | Binary | Sokalsneath | Pathway | 515 | 1486 | 0.4 | 19.81 | 1.03 |
| DBSCAN | Inner | Numeric | Correlation | Gene | 683 | 1343 | 0.38 | 8.04 | 1.32 |
| DBSCAN | Inner | Numeric | Cosine | Gene | 684 | 1337 | 0.38 | 8.03 | 1.32 |
| DBSCAN | Inner | Numeric | Euclidean | Gene | 33 | 268 | 0.33 | 146.57 | 3.47 |
| DBSCAN | Inner | Numeric | Minkowski | Gene | 33 | 268 | 0.33 | 146.57 | 3.47 |
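To make the Clusters and Noise columns concrete, here is a minimal pure-Python sketch of density-based clustering on a precomputed distance matrix. It is an illustration, not the authors' implementation: the toy matrix and the `eps` and `min_samples` values are hypothetical, since this record does not report the parameters used in the study.

```python
def dbscan(dist, eps, min_samples):
    """Label points from a precomputed distance matrix; -1 marks noise."""
    n = len(dist)
    labels = [None] * n            # None = not yet visited
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        # The neighbourhood includes the point itself.
        neighbours = [q for q in range(n) if dist[p][q] <= eps]
        if len(neighbours) < min_samples:
            labels[p] = -1         # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[p] = cluster
        queue = [q for q in neighbours if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster    # noise reachable from a core point joins as border
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbours = [r for r in range(n) if dist[q][r] <= eps]
            if len(q_neighbours) >= min_samples:
                queue.extend(q_neighbours)   # q is a core point, so keep expanding
    return labels


# Hypothetical 5-disease distance matrix (assumed values, not study data).
toy_dist = [
    [0.0, 0.2, 0.3, 0.9, 1.0],
    [0.2, 0.0, 0.2, 0.8, 1.0],
    [0.3, 0.2, 0.0, 0.9, 1.0],
    [0.9, 0.8, 0.9, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 0.0],
]
toy_labels = dbscan(toy_dist, eps=0.25, min_samples=2)
n_clusters = len(set(toy_labels) - {-1})
n_noise = toy_labels.count(-1)
```

In this toy case the first three diseases form one cluster and the last two are left as noise, which is how a row of the table can report many noise diseases alongside its cluster count.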
Best results obtained with OPTICS on the different datasets, for each type of vector and with disease distances measured according to different metrics.
| Algorithm | Dataset | Vector | Distance | Feature | Clusters | Noise | Silhouette | Calinski–Harabasz | Davies–Bouldin |
|---|---|---|---|---|---|---|---|---|---|
| OPTICS | Complete | Binary | Dice | Pathway | 1111 | 1645 | 0.47 | 14.22 | 1.21 |
| OPTICS | Complete | Binary | Hamming | Pathway | 1001 | 2048 | 0.39 | 6.47 | 1.56 |
| OPTICS | Complete | Binary | Jaccard | Pathway | 1101 | 1677 | 0.49 | 14.47 | 1.12 |
| OPTICS | Complete | Binary | Sokalsneath | Pathway | 1087 | 1722 | 0.51 | 17.16 | 1.11 |
| OPTICS | Complete | Numeric | Correlation | Gene | 2213 | 3095 | 0.45 | 10.4 | 1.14 |
| OPTICS | Complete | Numeric | Cosine | Gene | 2199 | 3026 | 0.46 | 10.9 | 1.14 |
| OPTICS | Inner | Binary | Dice | Pathway | 749 | 1157 | 0.39 | 10.64 | 1.25 |
| OPTICS | Inner | Binary | Jaccard | Pathway | 741 | 1187 | 0.41 | 10.98 | 1.14 |
| OPTICS | Inner | Binary | Sokalsneath | Pathway | 729 | 1228 | 0.43 | 13.19 | 1.12 |
| OPTICS | Inner | Numeric | Correlation | Gene | 892 | 1195 | 0.45 | 8.94 | 1.23 |
| OPTICS | Inner | Numeric | Cosine | Gene | 887 | 1175 | 0.46 | 9.2 | 1.22 |
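The Silhouette column can be read against the following sketch, which computes the mean silhouette coefficient from a precomputed distance matrix. Two choices here are assumptions rather than details from this record: noise points (label `-1`) are excluded from the average, and singleton clusters score 0. The toy matrix and labels are hypothetical.

```python
def silhouette_score(dist, labels, noise_label=-1):
    """Mean silhouette over non-noise points, from a precomputed distance matrix."""
    members = {}
    for idx, lab in enumerate(labels):
        if lab != noise_label:
            members.setdefault(lab, []).append(idx)
    if len(members) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for lab, own in members.items():
        for i in own:
            if len(own) == 1:
                scores.append(0.0)     # convention: singleton clusters score 0
                continue
            # a: mean distance to the other members of the point's own cluster
            a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
            # b: smallest mean distance to the members of any other cluster
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for other_lab, other in members.items()
                if other_lab != lab
            )
            denom = max(a, b)
            scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Hypothetical example: two tight clusters plus one noise point (assumed values).
example_dist = [
    [0.0, 0.2, 0.8, 0.8, 1.0],
    [0.2, 0.0, 0.8, 0.8, 1.0],
    [0.8, 0.8, 0.0, 0.2, 1.0],
    [0.8, 0.8, 0.2, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0, 0.0],
]
example_labels = [0, 0, 1, 1, -1]
example_score = silhouette_score(example_dist, example_labels)
```

Values close to 1 indicate compact, well-separated clusters; each clustered point in this toy example scores (0.8 - 0.2) / 0.8.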
Best results obtained with KMeans on the different datasets, for each type of vector and with disease distances measured according to different metrics.
| Algorithm | Dataset | Vector | Distance | Feature | Clusters | Silhouette | Sum of Square Errors | Calinski–Harabasz | Davies–Bouldin |
|---|---|---|---|---|---|---|---|---|---|
| KMeans | Complete | Binary | Jaccard | Pathway | 280 | 0.3 | 26,716.4 | 237.83 | 1.4 |
| KMeans | Complete | Binary | Sokalsneath | Pathway | 280 | 0.31 | 15,093.87 | 215.77 | 1.35 |
| KMeans | Inner | Binary | Dice | Protein | 800 | 0.31 | 3702.47 | 25.23 | 1.2 |
| KMeans | Inner | Binary | Jaccard | Protein | 800 | 0.34 | 2633.78 | 21.37 | 1.23 |
| KMeans | Inner | Binary | Sokalsneath | Protein | 800 | 0.35 | 2296.09 | 16.82 | 1.37 |
| KMeans | Inner | Numeric | Correlation | Gene | 800 | 0.39 | 4146.08 | 22.42 | 1.08 |
| KMeans | Inner | Numeric | Cosine | Gene | 800 | 0.38 | 4143.6 | 22.59 | 1.09 |
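For the Sum of Square Errors column, the sketch below assumes the standard k-means definition: the sum, over all points, of the squared Euclidean distance to the centroid of the assigned cluster. This record does not state how the study combined KMeans with the non-Euclidean metrics listed, so this should be read as the usual textbook quantity rather than the authors' exact computation. The points and labels are hypothetical.

```python
def sum_of_squared_errors(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    groups = {}
    for point, lab in zip(points, labels):
        groups.setdefault(lab, []).append(point)
    sse = 0.0
    for pts in groups.values():
        dim = len(pts[0])
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(p[k] for p in pts) / len(pts) for k in range(dim)]
        sse += sum((p[k] - centroid[k]) ** 2 for p in pts for k in range(dim))
    return sse


# Hypothetical 2-D points in two clusters (assumed values, not study data).
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
toy_assignments = [0, 0, 1, 1]
toy_sse = sum_of_squared_errors(toy_points, toy_assignments)
```

Lower values mean tighter clusters, but the quantity shrinks as the number of clusters grows, so it is only comparable between rows with the same cluster count.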
Figure 1. Distribution of the number of diseases per cluster in the first analysed model, obtained with OPTICS. The model was generated from the complete dataset using genes as the studied features, with numeric vectors and distances computed with the cosine metric. The histogram was filtered so that clusters with fewer than 5 diseases are not displayed. Overall, this model yielded 2199 clusters, 3032 diseases labelled as noise (out of a total of 10,300), a Silhouette score of 0.46, a CH score of 10.9 and a DB score of 1.14.
Figure 2. Distribution of the number of diseases per cluster in the second analysed model, obtained with OPTICS. The model was generated from the inner dataset using pathways as the studied features, with binary vectors and distances computed with the sokalsneath metric. The histogram was filtered so that clusters with fewer than 5 diseases are not displayed. Overall, this model yielded 729 clusters, 1228 diseases labelled as noise (out of a total of 4130), a Silhouette score of 0.43, a CH score of 13.19 and a DB score of 1.12.
Figure 3. Visualization of the first analysed model, obtained with OPTICS. Each point represents a disease, plotted in the two-dimensional space obtained by applying PCA and t-SNE to the gene feature matrix for the complete set of diseases. Different colours denote different clusters. Point size scales with cluster size. For clarity, only diseases in clusters containing more than 10 diseases are shown (468 diseases in total).
Figure 4. Visualization of the second analysed model, obtained with OPTICS. Each point represents a disease, plotted in the two-dimensional space obtained by applying PCA and t-SNE to the pathway feature matrix for the complete set of diseases. Different colours denote different clusters. Point size scales with cluster size. For clarity, only clusters containing more than 15 diseases are shown (409 diseases in total).
Figure 5. Distribution of the Silhouette coefficient within the clusters of the first analysed model, obtained with OPTICS. Only the 9 largest clusters are shown, sorted by number of diseases (cluster 165 has 29 diseases while cluster 1043 has 16 diseases). The specific diseases grouped in each cluster can be found in the public repository.
Relevant information on the largest clusters formed in the first analysed model, obtained with OPTICS.
| Cluster number | Number of diseases in the cluster | Diseases in the cluster | Most important gene(s) associated with all diseases in the cluster | Other gene(s) associated with some diseases in the cluster |
|---|---|---|---|---|
| 155 | 29 | ACTH Syndrome, Ectopic Adrenal Cortex Diseases Adrenal Gland Hyperfunction Arthritis, Gouty Facial paralysis Hypernatremia Diplegic Infantile Cerebral Palsy Cerebral Palsy, Quadriplegic, Infantile Monoplegic Infantile Cerebral Palsy Calcium Pyrophosphate Dihydrate Deposition Athetoid cerebral palsy Monoplegic Cerebral Palsy Hypocortisolism secondary to another disorder Spastic cerebral palsy Subaortic stenosis ACTH-dependent Cushing's syndrome Adrenocortical hyperplasia Opsoclonus-Myoclonus Syndrome Cerebral Palsy, Dystonic-Rigid Cerebral Palsy, Atonic Congenital Cerebral Palsy Sacroiliitis Cerebral Palsy, Mixed Cerebral Palsy, Rolandic Type Kinsbourne Syndrome Paraneoplastic Opsoclonus-Myoclonus Ataxia Proopiomelanocortin Deficiency Pyogenic Sacroiliitis Septic Sacroiliitis | POMC (proopiomelanocortin) | PRKAR1A, NR3C1, FGFR1 |
| 701 | 24 | Carpal Tunnel Syndrome Familial Amyloid Polyneuropathy, Type V Trigger Finger Disorder Amyloid Neuropathies, Familial Amyloid Neuropathies Autonomic neuropathy Systemic amyloidosis Familial amyloid polyneuropathy, type VI Familial Amyloid Neuropathy, Portuguese Type Familial Amyloid Polyneuropathy, Jewish Type Amyloid Polyneuropathy, Swiss Type Amyloid of vitreous Amyloid Polyneuropathy, British Type (disorder) Danish type familial amyloid cardiomyopathy Senile systemic amyloidosis Familial Amyloid Polyneuropathy, Appalachian Type Hereditary cardiac amyloidosis Protein Misfolding Disorders Dystransthyretinemic Euthyroidal Hyperthyroxinemia AMYLOIDOSIS, HEREDITARY, TRANSTHYRETIN-RELATED AMYLOIDOSIS, LEPTOMENINGEAL, TRANSTHYRETIN-RELATED AMYLOID CARDIOMYOPATHY, TRANSTHYRETIN-RELATED CARPAL TUNNEL SYNDROME, FAMILIAL Transthyretin related familial amyloid cardiomyopathy | TTR (transthyretin) | APOA1, GSN, LYZ |
| 761 | 20 | Torsades de Pointes Left posterior fascicular block Paroxysmal familial ventricular fibrillation Ventricular tachycardia, monomorphic Lenegre's disease Congenital long QT syndrome CARDIOMYOPATHY, DILATED, 1E SICK SINUS SYNDROME 1, AUTOSOMAL RECESSIVE LONG QT SYNDROME 3 Heart Block, Nonprogressive Cardiac Conduction Defect, Nonprogressive Hereditary bundle branch system defect CARDIAC CONDUCTION DEFECT, NONSPECIFIC (disorder) Ventricular Fibrillation, Paroxysmal Familial, 1 Long QT syndrome type 3 ATRIAL FIBRILLATION, FAMILIAL, 10 LONG QT SYNDROME 2/3, DIGENIC LONG QT SYNDROME 3/6, DIGENIC Disorder Cardiac channelopathy Complete heart block with broad QRS complexes | SCN5A (sodium voltage-gated channel alpha subunit 5) | KCNH2, KCNQ1, KCNE2, DPP6, CALM2, KCNE1, SCN1B, CALM3, CAML1, KCNA3, CACNA1C |
| 571 | 20 | Dissociated Nystagmus Rotary Nystagmus Periodic Alternating Nystagmus Symptomatic Nystagmus Spontaneous Ocular Nystagmus Vertical Nystagmus Rebound Nystagmus Jerk Nystagmus See-Saw Nystagmus Retraction Nystagmus Temporary Nystagmus Permanent Nystagmus Unidirectional Nystagmus Multidirectional Nystagmus Conjugate Nystagmus Convergence Nystagmus Fatigable Positional Nystagmus Non-Fatigable Positional Nystagmus LEBER CONGENITAL AMAUROSIS 6 (disorder) Cone-Rod Dystrophy 13 | RPGRIP1 (RPGR interacting protein 1) | - |
| 62 | 19 | Herpes Labialis Hyperlipoproteinemia Type III Sea-Blue Histiocyte Syndrome Internal Carotid Artery Stenosis Dementia in Parkinson's disease Multiple Sclerosis, Acute Relapsing cortex bone disorders Common Carotid Artery Stenosis External Carotid Artery Stenosis Multiple Sclerosis, Relapsing–Remitting Apolipoprotein E, Deficiency or Defect of Dysbetalipoproteinemia due to Defect in Apolipoprotein E-d Familial Hyperbeta- and Prebetalipoproteinemia Hyperlipemia with Familial Hypercholesterolemic Xanthomatosis Broad-Betalipoproteinemia Floating-Betalipoproteinemia ALZHEIMER DISEASE 2 LIPOPROTEIN GLOMERULOPATHY Obstructive sleep apnea hypopnea | APOE (apolipoprotein E) | – |
| 217 | 18 | MENTAL RETARDATION, X-LINKED 2 (disorder) MENTAL RETARDATION, X-LINKED 14 MENTAL RETARDATION, X-LINKED 20 MENTAL RETARDATION, X-LINKED 23 Mental Retardation, X-Linked 92 MENTAL RETARDATION, X-LINKED 82 MENTAL RETARDATION, X-LINKED 84 MENTAL RETARDATION, X-LINKED 77 MENTAL RETARDATION, X-LINKED 81 MENTAL RETARDATION, X-LINKED 42 MENTAL RETARDATION, X-LINKED 73 MENTAL RETARDATION, X-LINKED 53 MENTAL RETARDATION, X-LINKED 72 MENTAL RETARDATION, X-LINKED 50 MENTAL RETARDATION, X-LINKED 95 MENTAL RETARDATION, X-LINKED 90 (disorder) MENTAL RETARDATION, X-LINKED 88 (disorder) MENTAL RETARDATION, X-LINKED 41 | DLG3 (discs large MAGUK scaffold protein 3) GDI1 (GDP dissociation inhibitor 1) | – |
| 130 | 18 | Akinetic Mutism Gerstmann-Straussler-Scheinker Disease Kuru Prion Diseases Fatal Familial Insomnia Human Transmissible Spongiform Encephalopathies, Inherited Wasting Disease, Chronic SPONGIFORM ENCEPHALOPATHY WITH NEUROPSYCHIATRIC FEATURES Creutzfeldt-Jakob Disease, Sporadic HUNTINGTON DISEASE-LIKE 1 Creutzfeldt-Jakob Disease, Heidenhain Variant Iatrogenic Jakob-Creutzfeldt disease Other Creutzfeldt-Jakob disease Amyloidosis, Cerebral, with Spongiform Encephalopathy Acquired CJD CEREBRAL AMYLOID ANGIOPATHY, PRNP-RELATED Familial Creutzfeldt-Jakob Familial Alzheimer-like prion disease | PRNP (prion protein) | CSF2, LAMC2, CTSD, PRDX2, GH1, C4BPA, CARD14, MAPT, ABCB6, APOE |
| 562 | 17 | Myxedema Subacute thyroiditis Thyrotoxicosis Subclinical hypothyroidism Severe hypothyroidism Silent thyroiditis Toxic thyroid adenoma Diffuse goiter Toxic diffuse goiter Acquired hypothyroidism Neonatal hyperthyroidism Autoimmune thyroiditis Congenital hyperthyroidism Hyperthyroidism, Nonautoimmune Hyperthyroidism, Familial Gestational HYPOTHYROIDISM, CONGENITAL, NONGOITROUS, 3 HYPOTHYROIDISM, CONGENITAL, NONGOITROUS, 1 | TSHR (thyroid stimulating hormone receptor) | TG |
| 1039 | 16 | Epilepsies, Partial Epilepsy, Simple Partial Simple Partial Seizures Gelastic Epilepsy Benign Focal Epilepsy, Childhood Childhood Benign Occipital Epilepsy Amygdalo-Hippocampal Epilepsy Rhinencephalic Epilepsy Occipital Lobe Epilepsy Subclinical Seizure Uncinate Seizures Digestive Epilepsy Benign Occipital Epilepsy Migrating partial seizures in infancy EPILEPTIC ENCEPHALOPATHY, EARLY INFANTILE, 14 EPILEPSY, NOCTURNAL FRONTAL LOBE, 5 | KCNT1 (potassium sodium-activated channel subfamily T member 1) | LGI1, CDKL5 |
Only the 9 largest clusters are shown. The table lists the number of diseases in each cluster, the names of those diseases and the genes related to them. The most important gene(s) column gives the gene(s) associated with all the diseases in the cluster. The last column lists other genes that are related to multiple diseases in the cluster.
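The two gene columns described above can be derived with simple set operations. The sketch below is one plausible reading, not the authors' code: "most important" genes are taken as the intersection across all diseases in a cluster, and "other" genes as those linked to at least `min_diseases` diseases (a hypothetical threshold standing in for "multiple") but not to all of them. The disease-to-gene mapping is invented for illustration.

```python
from collections import Counter


def gene_summary(cluster_genes, min_diseases=2):
    """Split a cluster's genes into those shared by all diseases and those shared by several."""
    gene_sets = list(cluster_genes.values())
    # Genes associated with every disease in the cluster.
    core = set.intersection(*gene_sets) if gene_sets else set()
    # Count in how many diseases each gene appears.
    counts = Counter(gene for genes in gene_sets for gene in genes)
    others = {g for g, c in counts.items() if c >= min_diseases and g not in core}
    return core, others


# Hypothetical cluster: placeholder disease names mapped to gene symbols (not study data).
toy_cluster = {
    "disease_1": {"TSHR", "TG"},
    "disease_2": {"TSHR", "TG"},
    "disease_3": {"TSHR"},
}
core_genes, other_genes = gene_summary(toy_cluster)
```

Here `TSHR` is shared by all three placeholder diseases and `TG` by two of them, mirroring the layout of the table's last two columns.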
Figure 6. Schematic representation of the factors considered in the analysis methodology. Each phase of the analysis involved different variables, so that different combinations of the possible inputs led to different outcomes. The figure illustrates the options for the datasets, features, vector types, distance metrics and clustering algorithms used.
Figure 7. Workflow followed to perform the clustering analysis. The main steps were structuring the datasets into feature matrices, computing the distance matrices, and implementing and evaluating the clustering.
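The first two workflow steps (feature matrices, then distance matrices) can be sketched as follows for binary vectors with the Jaccard and Dice metrics named in the results tables. This is a minimal illustration under stated assumptions: the disease-to-gene associations are hypothetical, and the function names are made up for this example rather than taken from the study's repository.

```python
from itertools import combinations

# Hypothetical associations (disease -> set of gene symbols); not taken from the study data.
associations = {
    "disease_a": {"TTR", "APOA1"},
    "disease_b": {"TTR", "GSN"},
    "disease_c": {"SCN5A"},
}


def binary_matrix(assoc):
    """Return (diseases, features, rows); rows[i][j] is 1 if disease i has feature j."""
    diseases = sorted(assoc)
    features = sorted(set().union(*assoc.values()))
    rows = [[1 if f in assoc[d] else 0 for f in features] for d in diseases]
    return diseases, features, rows


def jaccard_distance(u, v):
    """1 - |intersection| / |union| of the features present in each vector."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 0.0 if either == 0 else 1.0 - both / either


def dice_distance(u, v):
    """1 - 2 * |intersection| / (|u| + |v|)."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    total = sum(u) + sum(v)
    return 0.0 if total == 0 else 1.0 - 2.0 * both / total


def distance_matrix(rows, metric):
    """Symmetric pairwise distance matrix with zeros on the diagonal."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = metric(rows[i], rows[j])
    return dist


diseases, features, rows = binary_matrix(associations)
jaccard = distance_matrix(rows, jaccard_distance)
dice = distance_matrix(rows, dice_distance)
```

The resulting matrices are what a clustering algorithm that accepts precomputed distances would consume in the third step; swapping the metric function changes the model without touching the rest of the pipeline.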
Number of diseases in each of the considered datasets.
| Dataset | Feature | Number of diseases |
|---|---|---|
| Complete | Genes | 10,300 |
| Complete | Proteins | 10,246 |
| Complete | Pathways | 6708 |
| Complete | Variants | 6942 |
| Inner | All | 4130 |