Christian Lopez (1), Scott Tucker (2), Tarik Salameh (3), Conrad Tucker (4)
1. Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA.
2. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA; Engineering Science and Mechanics, The Pennsylvania State University, University Park, PA 16802, USA.
3. Hershey College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA.
4. Industrial and Manufacturing Engineering, The Pennsylvania State University, University Park, PA 16802, USA; Engineering Design Technology and Professional Programs, The Pennsylvania State University, University Park, PA 16802, USA; Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA. Electronic address: ctucker4@psu.edu.
Abstract
INTRODUCTION: Many chronic disorders have genomic etiology, disease progression, clinical presentation, and response to treatment that vary from patient to patient. This variability creates a need to identify characteristics within patient populations that have clinically relevant predictive value, in order to advance personalized medicine. Unsupervised machine learning methods are well suited to this type of problem, in which no a priori class labels are available to guide the search. However, existing methods struggle to identify cluster memberships that are not merely a result of natural sampling variation. Moreover, most current methods require researchers to provide specific input parameters a priori.

METHOD: This work presents an unsupervised machine learning method that clusters patients based on their genomic makeup without requiring input parameters a priori. The method uses internal validity metrics to identify the number of clusters algorithmically, and statistical analyses to test the significance of the results. Furthermore, the method takes advantage of the high degree of linkage disequilibrium between single nucleotide polymorphisms. Finally, a gene pathway analysis is performed to identify potential relationships between the clusters in the context of existing biological knowledge.

DATASETS AND RESULTS: The method is tested on a cluster validation dataset and a genomic dataset previously used in the literature. Benchmark results indicate that the proposed method performs best among the methods tested. Furthermore, the method is applied to a sample genome-wide study dataset of 191 multiple sclerosis patients. The results indicate that the method was able to identify genetically distinct patient clusters without the need to select parameters a priori. Additionally, Gene Ontology term analysis shows that variants identified as significantly different between clusters are enriched for protein-protein interactions, especially in immune processes and cell adhesion pathways.

CONCLUSION: Once links are drawn between clusters and clinically relevant outcomes, Immunochip data can be used to classify high-risk and newly diagnosed chronic disease patients into known clusters for predictive value. Further investigation can extend beyond pathway analysis to evaluate these clusters for the clinical significance of genetically related characteristics such as age of onset, disease course, heritability, and response to treatment.
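
The idea of choosing the number of clusters from an internal validity metric, rather than supplying it a priori, can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm or which metric the authors use, so everything below is an assumption made for illustration: average-linkage agglomerative clustering, Manhattan distance on minor-allele counts (0, 1, 2), the silhouette score as the validity metric, and a hypothetical toy genotype matrix. The statistical significance tests and the linkage disequilibrium step are not covered.

```python
from itertools import combinations


def distance_matrix(rows):
    """Pairwise Manhattan distances between genotype vectors (assumed metric)."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist = float(sum(abs(x - y) for x, y in zip(rows[i], rows[j])))
        d[i][j] = d[j][i] = dist
    return d


def agglomerative(d, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(d))]
    while len(clusters) > k:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = len(clusters[a]) * len(clusters[b])
            link = sum(d[i][j] for i in clusters[a] for j in clusters[b]) / pairs
            if best is None or link < best[0]:
                best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(d)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def silhouette(d, labels):
    """Mean silhouette score; points in singleton clusters contribute 0."""
    groups = {}
    for i, c in enumerate(labels):
        groups.setdefault(c, []).append(i)
    total = 0.0
    for i, c in enumerate(labels):
        own = groups[c]
        if len(own) == 1:
            continue
        # a: mean distance to the point's own cluster; b: nearest other cluster
        a = sum(d[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(d[i][j] for j in g) / len(g)
                for other, g in groups.items() if other != c)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def select_k(rows, k_max):
    """Return (k, labels, score) maximizing the silhouette for k in 2..k_max."""
    d = distance_matrix(rows)
    best = None
    for k in range(2, k_max + 1):
        labels = agglomerative(d, k)
        score = silhouette(d, labels)
        if best is None or score > best[2]:
            best = (k, labels, score)
    return best


# Hypothetical toy data: 6 patients x 4 SNPs, coded as minor-allele counts.
genotypes = [
    [0, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0],
    [2, 2, 2, 2], [2, 2, 1, 2], [2, 1, 2, 2],
]
best_k, best_labels, best_score = select_k(genotypes, k_max=4)
print(best_k, best_labels, round(best_score, 3))
```

On this toy matrix the silhouette is highest at two clusters, which matches the two genotype groups built into the data. A real analysis would also need the significance testing described in the abstract to rule out clusters that arise from sampling variation alone.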