Zhenlong Zheng, Xianglan Zhang, Bong-Kyeong Oh, Ki-Yeol Kim.
Abstract
Osteoporosis is a severe chronic skeletal disorder that affects older individuals, especially postmenopausal women. However, molecular biomarkers for predicting the risk of osteoporosis are not well characterized. The aim of this study was to identify combined biomarkers for predicting the risk of osteoporosis using machine learning methods. We merged three publicly available gene expression datasets (GSE56815, GSE13850, and GSE2208) to obtain expression data for 6354 unique genes in postmenopausal women (45 with high bone mineral density and 45 with low bone mineral density). All machine learning methods were implemented in R; the GEOquery and limma packages were used for dataset download and differentially expressed gene identification, respectively. A nomogram for predicting the risk of osteoporosis was also constructed. We detected 378 significant differentially expressed genes using the limma package, representing 15 major biological pathways. The performance of the predictive models based on combined biomarkers (two or three genes) was superior to that of models based on a single gene. The best predictive gene set among two-gene sets included PLA2G2A and WRAP73. The best predictive gene set among three-gene sets included LPIN1, PFDN6, and DOHH. Overall, we demonstrated the advantages of using combined versus single biomarkers for predicting the risk of osteoporosis. Further, the predictive nomogram constructed using combined biomarkers could be used by clinicians to identify high-risk individuals and in the design of efficient clinical trials to reduce the incidence of osteoporosis.
Keywords: combined biomarker; gene expression; machine learning; osteoporosis; risk prediction
Year: 2022 PMID: 35580864 PMCID: PMC9186773 DOI: 10.18632/aging.204084
Source DB: PubMed Journal: Aging (Albany NY) ISSN: 1945-4589 Impact factor: 5.955
Figure 1. Study design. Data for duplicated genes in each gene expression dataset were averaged. The datasets were then merged based on gene name. Finally, osteoporosis-predictive genes were identified, as indicated. BMD: bone mineral density; GO: Gene Ontology; KEGG: Kyoto Encyclopedia of Genes and Genomes; ML: machine learning; HTML: Hypertext Markup Language.
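The first two steps of the study design (averaging duplicated genes, then merging by gene name) can be illustrated with a minimal sketch. It is not the authors' R pipeline: the function names `average_duplicates` and `merge_by_gene` are hypothetical, the expression values are toy numbers, and the merge is assumed to keep only genes present in every dataset, which the caption does not state explicitly.

```python
from collections import defaultdict
from statistics import mean


def average_duplicates(rows):
    """Collapse repeated gene names by averaging their expression vectors.

    rows: iterable of (gene_name, [value_sample1, value_sample2, ...]).
    """
    grouped = defaultdict(list)
    for gene, values in rows:
        grouped[gene].append(values)
    # zip(*vectors) walks the samples column by column.
    return {gene: [mean(col) for col in zip(*vectors)]
            for gene, vectors in grouped.items()}


def merge_by_gene(datasets):
    """Assumption: keep genes found in every dataset and concatenate samples."""
    shared = set(datasets[0])
    for d in datasets[1:]:
        shared &= set(d)
    return {gene: [v for d in datasets for v in d[gene]]
            for gene in sorted(shared)}


# Toy values, not data from the study.
ds_a = average_duplicates([
    ("PLA2G2A", [1.0, 2.0]),
    ("PLA2G2A", [3.0, 4.0]),   # duplicate probe for the same gene
    ("DOHH", [5.0, 6.0]),
])
ds_b = average_duplicates([("PLA2G2A", [7.0]), ("PFDN6", [8.0])])
merged = merge_by_gene([ds_a, ds_b])
print(merged)
```

Only the gene shared by both toy datasets survives the merge, with its samples from each dataset placed side by side.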
Figure 2. Gene expression patterns in the three datasets analyzed. (A) Gene expression pattern in the merged microarray dataset, which includes 6354 genes and data from 90 experiments. (B) Gene expression pattern of significant differentially expressed genes (n = 378) in high-BMD and low-BMD groups. The genes were identified using the limma package in R; among them, 191 genes were down-regulated and 187 genes were up-regulated.
Summary of GO terms identified using the DAVID annotation database.
| Category | Term | Count | p-value (a) | Benjamini (b) |
|---|---|---|---|---|
| UP_KEYWORDS | Phosphoprotein | 239 | 7.3E-22 | 2.5E-19 |
| GOTERM_MF_DIRECT | Protein binding | 247 | 2.2E-13 | 1.2E-10 |
| UP_KEYWORDS | Acetylation | 113 | 3.4E-11 | 5.8E-9 |
| UP_KEYWORDS | Nucleus | 151 | 9.1E-11 | 1.0E-8 |
| GOTERM_CC_DIRECT | Nucleoplasm | 101 | 2.0E-10 | 7.5E-8 |
| GOTERM_CC_DIRECT | Cytoplasm | 155 | 1.0E-9 | 2.0E-7 |
| UP_KEYWORDS | Alternative splicing | 240 | 1.2E-7 | 1.1E-5 |
| UP_KEYWORDS | Ubl conjugation | 60 | 7.4E-7 | 5.0E-5 |
| GOTERM_CC_DIRECT | Nucleus | 147 | 1.7E-6 | 2.2E-4 |
| UP_KEYWORDS | DNA damage | 21 | 6.4E-6 | 3.6E-4 |
| UP_KEYWORDS | Methylation | 38 | 2.7E-5 | 1.3E-3 |
| UP_KEYWORDS | Coiled coil | 87 | 4.2E-5 | 1.6E-3 |
| UP_KEYWORDS | ATP binding | 47 | 4.3E-5 | 1.6E-3 |
| UP_KEYWORDS | Isopeptide bond | 40 | 7.4E-5 | 2.5E-3 |
| UP_KEYWORDS | DNA repair | 17 | 8.5E-5 | 2.6E-3 |
(a) p-value: modified Fisher’s exact test p-value.
(b) Benjamini: Benjamini–Hochberg false discovery rate (FDR)-adjusted p-value.
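For readers unfamiliar with the Benjamini–Hochberg adjustment named in the footnote, the sketch below shows the standard step-up procedure. It is a generic illustration, not DAVID's implementation; the function name `benjamini_hochberg` and the input p-values are made up, and the table values cannot be reproduced from it because the total number of terms tested is not given here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative p-values only.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adj)
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps a smaller p-value from receiving a larger adjusted value than a bigger one.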
Figure 3. Comparison of prediction accuracies for combinations of different numbers of genes. Gene sets of each size were selected from the 378 significant differentially expressed genes identified in the merged microarray dataset using the limma package. The vertical and horizontal axes represent the prediction accuracy and the number of genes considered in combination, respectively.
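To give a sense of the search space behind the comparison in Figure 3, the sketch below counts the possible two-gene and three-gene sets among 378 genes and shows how such sets can be enumerated. Whether the authors searched these sets exhaustively is not stated here, and the short `candidates` list is a hypothetical stand-in for the full gene list.

```python
from itertools import combinations
from math import comb

n_genes = 378  # significant differentially expressed genes reported above

pairs = comb(n_genes, 2)    # number of distinct two-gene sets
triples = comb(n_genes, 3)  # number of distinct three-gene sets
print(pairs, triples)

# Hypothetical short list, used only to show the enumeration itself.
candidates = ["PLA2G2A", "WRAP73", "PFDN6", "DOHH"]
two_gene_sets = list(combinations(candidates, 2))
for gene_set in two_gene_sets:
    print(gene_set)
```

The count grows quickly with set size, which is one reason to restrict the search to a pre-filtered gene list.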
Overview of the 10 sets of combined genes (two or three genes) tested.
| Gene set | Gene description |
|---|---|
| 1 | Phospholipase A2, membrane associated |
|  | Human WD repeat containing, antisense to TP73 |
| 2 | Deoxyhypusine hydroxylase |
|  | Solute carrier family 22, member 14 |
| 3 | Oxytocin receptor |
|  | Furin, paired basic amino acid cleaving enzyme |
| 4 | Solute carrier family 41, member 3 |
|  | BBSome interacting protein 1 |
| 5 | TATA-binding protein |
|  | Toll-like receptor adaptor molecule 1 |
| 6 | Mahogunin ring finger 1 |
|  | Platelet-derived growth factor subunit B |
|  | Zinc finger protein 764 |
| 7 | Paraspeckle component 1 |
|  | Mannose phosphate isomerase |
|  | Eukaryotic translation initiation factor 5 |
| 8 | WD repeat-containing protein 6 |
|  | Prefoldin subunit 6 |
|  | Paraspeckle component 1 |
| 9 | Adrenomedullin 2 |
|  | Major facilitator superfamily domain containing 10 |
|  | Platelet-activating factor acetylhydrolase 1b regulatory subunit 1 |
| 10 | Lipin-1 |
|  | Prefoldin subunit 6 |
|  | Deoxyhypusine hydroxylase |
Comparison of predictive accuracies of models with training and testing datasets.
| Statistic | Training LDA | Training KNN | Training SVM | Training RF | Testing LDA | Testing KNN | Testing SVM | Testing RF |
|---|---|---|---|---|---|---|---|---|
| Mean | 0.667 | 0.731 | 0.740 | 0.999 | 0.662 | 0.650 | 0.641 | 0.603 |
| SD | 0.057 | 0.049 | 0.055 | 0.003 | 0.093 | 0.095 | 0.092 | 0.102 |

| Gene set | Training LDA | Training KNN | Training SVM | Training RF | Testing LDA | Testing KNN | Testing SVM | Testing RF |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.893 | 0.888 | 0.899 | 1.000 | 0.873 | 0.859 | 0.864 | 0.841 |
| 2 | 0.802 | 0.879 | 0.882 | 1.000 | 0.800 | 0.852 | 0.826 | 0.829 |
| 3 | 0.860 | 0.840 | 0.879 | 1.000 | 0.851 | 0.783 | 0.808 | 0.789 |
| 4 | 0.887 | 0.889 | 0.935 | 1.000 | 0.881 | 0.838 | 0.852 | 0.799 |
| 5 | 0.854 | 0.867 | 0.881 | 1.000 | 0.827 | 0.815 | 0.803 | 0.782 |
| 6 | 0.863 | 0.880 | 0.894 | 1.000 | 0.834 | 0.827 | 0.819 | 0.842 |
| 7 | 0.832 | 0.866 | 0.889 | 1.000 | 0.810 | 0.854 | 0.853 | 0.829 |
| 8 | 0.853 | 0.856 | 0.869 | 1.000 | 0.843 | 0.786 | 0.805 | 0.798 |
| 9 | 0.834 | 0.817 | 0.858 | 1.000 | 0.799 | 0.734 | 0.771 | 0.789 |
| 10 | 0.869 | 0.885 | 0.927 | 1.000 | 0.858 | 0.860 | 0.873 | 0.920 |
LDA, Linear discriminant analysis; KNN, k-Nearest neighbors; SVM, Support vector machine; RF, Random forest.
ML methods are indicated, and values are the mean (top) and standard deviation (bottom) calculated from 100 reiterations.
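The mean and standard deviation over 100 repetitions can be obtained along the lines of the sketch below, which repeats a random training/testing split and records the testing accuracy each time. It is only an illustration under stated assumptions: the data are synthetic, the classifier is a hand-written k-nearest neighbors with k = 3, and the 70/30 split is a guess, since the split ratio and tuning used in the study are not given here.

```python
import random
from statistics import mean, stdev


def knn_predict(train, point, k=3):
    """Majority vote among the k nearest training samples (squared Euclidean)."""
    nearest = sorted(
        train,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], point)),
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, test, k=3):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


def repeated_holdout(samples, n_repeats=100, train_fraction=0.7, seed=0):
    """Mean and SD of testing accuracy over repeated random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        scores.append(accuracy(shuffled[:cut], shuffled[cut:]))
    return mean(scores), stdev(scores)


# Synthetic two-gene data: 45 samples labeled 1 and 45 labeled 0.
rng = random.Random(42)
data = [((rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)), 1) for _ in range(45)]
data += [((rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)), 0) for _ in range(45)]

avg, sd = repeated_holdout(data)
print(f"mean accuracy {avg:.3f}, SD {sd:.3f}")
```

Reporting the spread alongside the mean, as the table does, shows how sensitive each method is to the particular split.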
Figure 4. Nomogram for predicting the probability of osteoporosis risk. (A) Identification of the probability of osteoporosis risk for an individual patient. (B) Practical use of the nomogram, available in Hypertext Markup Language (HTML) format.
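A nomogram of this kind is commonly a graphical form of a regression model: each predictor value is converted to points, and the total points are mapped to a probability. The sketch below assumes a logistic regression model, which the caption does not confirm; the intercept, coefficients, expression ranges, and gene names (`gene_a`, `gene_b`) are all hypothetical and are not taken from the study.

```python
from math import exp

# Hypothetical logistic model; every number here is illustrative only.
INTERCEPT = -0.5
GENES = {
    # name: (coefficient, minimum expression, maximum expression)
    "gene_a": (1.2, 4.0, 9.0),
    "gene_b": (-0.8, 5.0, 10.0),
}

# The predictor with the widest effect range defines the 0-100 point scale.
MAX_EFFECT = max(abs(c) * (hi - lo) for c, lo, hi in GENES.values())


def baseline(coef, lo, hi):
    """Expression value that contributes the fewest points (zero)."""
    return lo if coef > 0 else hi


def points(name, value):
    """Points for one gene, scaled so the widest axis tops out at 100."""
    coef, lo, hi = GENES[name]
    return coef * (value - baseline(coef, lo, hi)) / MAX_EFFECT * 100


def probability(values):
    """Map total points back to the linear predictor, then to a probability."""
    total = sum(points(gene, v) for gene, v in values.items())
    # Linear predictor when every gene sits at its zero-point value.
    base_lp = INTERCEPT + sum(c * baseline(c, lo, hi) for c, lo, hi in GENES.values())
    lp = base_lp + total * MAX_EFFECT / 100
    return 1 / (1 + exp(-lp))


patient = {"gene_a": 6.0, "gene_b": 7.0}
print(points("gene_a", 6.0), points("gene_b", 7.0), probability(patient))
```

Because the point scale is a linear rescaling of the model terms, the probability read from total points matches the one computed directly from the regression equation.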