Integrated Analysis of Tissue-Specific Gene Expression in Diabetes by Tensor Decomposition Can Identify Possible Associated Diseases.

Y-H Taguchi, Turki Turki.

Abstract

In the field of gene expression analysis, methods of integrating multiple gene expression profiles are still being developed and the existing methods have scope for improvement. The previously proposed tensor decomposition-based unsupervised feature extraction method was improved by introducing standard deviation optimization. The improved method was applied to perform an integrated analysis of three tissue-specific gene expression profiles (namely, adipose, muscle, and liver) for diabetes mellitus, and the results showed that it can detect diseases that are associated with diabetes (e.g., neurodegenerative diseases) but that cannot be predicted by individual tissue expression analyses using state-of-the-art methods. Although the selected genes differed from those identified by the individual tissue analyses, the selected genes are known to be expressed in all three tissues. Thus, compared with individual tissue analyses, an integrated analysis can provide more in-depth data and identify additional factors, namely, the association with other diseases.

Keywords:  diabetes mellitus; gene expression; neurodegenerative diseases; tensor decomposition

Year:  2022        PMID: 35741859      PMCID: PMC9222230          DOI: 10.3390/genes13061097

Source DB:  PubMed          Journal:  Genes (Basel)        ISSN: 2073-4425            Impact factor:   4.141


1. Introduction

Gene expression analysis is an important step in investigating diseases and in identifying genes that can serve as therapeutic targets or biomarkers, or that cause disease. Although the development of high-throughput sequencing technology (HST) has led to continuous increases in the amount of gene expression profile data, methods of integrating multiple gene expression profiles are still being developed. Tensor decomposition (TD) is a promising candidate method for integrating multiple gene expression profiles. Using this method, gene expression profiles from multiple tissues of individuals can be stored as a tensor $x_{ijk} \in \mathbb{R}^{N \times M \times K}$, which represents the gene expression of the $i$th gene in the $j$th individual of the $k$th tissue. TD decomposes a tensor into a series expansion of products of singular value vectors, each of which is attributed to genes, individuals, or tissues. For example, by applying the higher-order singular value decomposition (HOSVD) method to $x_{ijk}$, we can obtain the following:
$$x_{ijk} = \sum_{\ell_1=1}^{N} \sum_{\ell_2=1}^{M} \sum_{\ell_3=1}^{K} G(\ell_1 \ell_2 \ell_3) \, u_{\ell_1 i} \, u_{\ell_2 j} \, u_{\ell_3 k},$$
where $G \in \mathbb{R}^{N \times M \times K}$ is a core tensor, and $u_{\ell_1 i} \in \mathbb{R}^{N \times N}$, $u_{\ell_2 j} \in \mathbb{R}^{M \times M}$, and $u_{\ell_3 k} \in \mathbb{R}^{K \times K}$ are singular value matrices and orthogonal matrices. We previously proposed a TD-based unsupervised feature extraction (FE) method [1] and applied it to a wide range of genomic sciences. Recently, this method was improved by the introduction of standard deviation (SD) optimization and applied to gene expression [2], DNA methylation [3], and histone modification analyses [4]. Nevertheless, because the updated method has so far been applied only to gene expression measured by HST, whether it is also applicable to gene expression profiles obtained by microarray technology remains to be clarified. In this paper, an integrated analysis was performed by applying the recently proposed TD-based unsupervised FE method with SD optimization to microarray-measured gene expression data for diabetes mellitus from multiple tissues.
We found that applying the TD-based unsupervised FE with SD optimization to gene expression profiles from individual tissues can identify diseases associated with diabetes that cannot be identified by other state-of-the-art methods.
There are multiple benefits to using TD to identify differentially expressed genes (DEGs). First, since it is not a supervised method, it can select DEGs that are biologically more plausible than those selected using supervised methods. Consider the case in which the aim is to identify DEGs that distinguish two classes, e.g., patients and healthy controls. Supervised methods favor DEGs with small divergence within each class, whereas TD tolerates some within-class divergence, since it tries to identify the representative state of the distinction between the two classes. If the representative state is associated with within-class divergence that has a biological origin, e.g., age or sex, this divergence should not be penalized. Supervised methods often penalize it, whereas the unsupervised method allows biological within-class divergence. Second, TD can select more stable DEGs, i.e., DEGs that do not depend on the specific set of samples considered in the analysis. This is because TD attempts to identify DEGs coincident with the representative state, which should be robust. Since sub-sampling does not change the representative state drastically, the gene set selected by TD is not altered drastically either. Third, TD can deal with multiple conditions. For example, if gene expression is measured in various tissues of several people, it is natural to format the data as gene × person × tissue, which results in a tensor. We have listed only a few important advantages here; readers interested in other advantages of TD can refer to our recent book [1].
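
The HOSVD described in this section can be illustrated with a minimal NumPy sketch. It assumes a small random gene × individual × tissue tensor in place of real expression data, and the helper names `unfold`, `hosvd`, and `reconstruct` are hypothetical names chosen for this illustration, not part of the authors' software. Each factor matrix holds the singular value vectors of one mode as columns, and the core tensor is obtained by projecting the data onto them.

```python
import numpy as np

def unfold(x, mode):
    """Matricize tensor x so that axis `mode` indexes the rows."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def hosvd(x):
    """Return (core, factors) for a full HOSVD of tensor x.

    factors[n] holds the left singular vectors of the mode-n unfolding
    as columns; core is x projected onto those vectors in every mode.
    """
    factors = []
    for mode in range(x.ndim):
        u, _, _ = np.linalg.svd(unfold(x, mode), full_matrices=False)
        factors.append(u)
    core = x
    for mode, u in enumerate(factors):
        # Contract axis `mode` of the running core with u^T.
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Rebuild the tensor as the sum over core entries times factor vectors."""
    x = core
    for mode, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, x, axes=(1, mode)), 0, mode)
    return x

# Toy data (assumption): 50 genes x 4 individuals x 3 tissues.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4, 3))
core, factors = hosvd(x)
x_rebuilt = reconstruct(core, factors)
print(core.shape, np.allclose(x, x_rebuilt))
```

Because the gene-mode unfolding has only 4 × 3 = 12 columns in this toy case, the economy-size SVD keeps 12 gene singular value vectors, which is still enough to rebuild the tensor exactly.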

2. Materials and Methods

2.1. Gene Expression

Gene expression profiles (GSE13268, GSE13269, and GSE13270 [5]) were retrieved from the Gene Expression Omnibus (GEO), and they were obtained from a study of the progression of diabetes biomarker diseases in the rat liver, gastrocnemius muscle, and adipose tissue. Each of these profiles is composed of gene expression profiles from five individuals from each of two strains, Goto-Kakizaki and Wistar-Kyoto, and they include data for three tissues (adipose, muscle, and liver) obtained at five time points after treatment. Three files named GSE13268_series_matrix.txt.gz, GSE13269_series_matrix.txt.gz, and GSE13270_series_matrix.txt.gz were downloaded from the Supplementary Files in GEO. Gene expression profiles were formatted as a tensor, $x_{ijkmst}$, representing the expression of the $i$th probe in the $t$th tissue ($t=1$: adipose, $t=2$: muscle, $t=3$: liver) at the $j$th time point for the $k$th replicate and $m$th treatment at the $s$th strain. These values are normalized as follows:
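
A minimal sketch of arranging such data as a six-mode tensor and standardizing it is given below. Everything specific in it is an assumption made for illustration: random numbers stand in for the GEO series matrices, the number of probes is a toy value, the axis sizes (5 time points, 5 replicates, 2 treatments, 2 strains, 3 tissues) are only one reading of the description above, and the normalization shown (zero mean and unit variance over probes within each sample) is a commonly used convention, not a formula confirmed by this text. The function name `standardize_over_probes` is hypothetical.

```python
import numpy as np

# Toy stand-in for the expression data; all sizes are illustrative assumptions.
# Axes: (probe i, time point j, replicate k, treatment m, strain s, tissue t)
n_probes = 200
shape = (n_probes, 5, 5, 2, 2, 3)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=5.0, sigma=1.0, size=shape)

def standardize_over_probes(x):
    """Give every sample (fixed j, k, m, s, t) zero mean and unit variance
    across the probe axis (axis 0). Assumed convention, see the lead-in."""
    centered = x - x.mean(axis=0, keepdims=True)
    return centered / centered.std(axis=0, keepdims=True)

x_norm = standardize_over_probes(x)

# After this step, each sample sums to 0 over probes and its squared
# values sum to the number of probes.
print(np.abs(x_norm.sum(axis=0)).max())
print((x_norm ** 2).sum(axis=0).min(), (x_norm ** 2).sum(axis=0).max())
```

Standardizing along the probe axis puts all samples on a common scale, so that no single tissue or time point dominates a subsequent decomposition merely because of its overall intensity.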

2.2. Methods

Figure 1 shows the analysis pipeline. Methodological details can be found in the Supplementary Materials.
Figure 1. Overall flowchart of the analysis pipeline.

3. Results

To validate the selected genes, 2281 gene symbols were uploaded to Enrichr [6] (the full lists of selected probes, genes, and enrichment analyses are provided in the Supplementary Materials). Table 1 shows the results for the “KEGG 2021 Human” category in Enrichr. Since none of the terms except the top one, “diabetic cardiomyopathy”, is directly related to diabetes, the analysis initially appears to be a failure. Nevertheless, a number of the identified diseases are closely related to diabetes mellitus. For example, many neurodegenerative diseases are listed, and diabetes mellitus is widely known to be a risk factor for neurodegenerative diseases [7,8,9,10,11]. Moreover, diabetes mellitus is known to be associated with thermogenesis [12], oxidative phosphorylation [13], and the PPAR signaling pathway [14]. Thus, contrary to the first impression, the proposed method successfully identifies many diseases and pathways associated with diabetes mellitus.
Table 1. Top 10 “KEGG 2021 Human” category terms in Enrichr.

| Term | Overlap | p-Value | Adjusted p-Value |
| --- | --- | --- | --- |
| Diabetic cardiomyopathy | 83/203 | 1.89 × 10^-31 | 5.80 × 10^-29 |
| Prion disease | 93/273 | 7.40 × 10^-28 | 1.13 × 10^-25 |
| Parkinson disease | 86/249 | 2.50 × 10^-26 | 2.55 × 10^-24 |
| Oxidative phosphorylation | 60/133 | 7.92 × 10^-26 | 6.06 × 10^-24 |
| Nonalcoholic fatty liver disease | 65/155 | 1.19 × 10^-25 | 7.30 × 10^-24 |
| Thermogenesis | 76/232 | 8.28 × 10^-22 | 4.22 × 10^-20 |
| Complement and coagulation cascades | 42/85 | 2.26 × 10^-20 | 9.85 × 10^-19 |
| PPAR signaling pathway | 39/74 | 2.58 × 10^-20 | 9.85 × 10^-19 |
| Alzheimer disease | 94/369 | 3.99 × 10^-18 | 1.36 × 10^-16 |
| Huntington disease | 83/306 | 6.48 × 10^-18 | 1.98 × 10^-16 |

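
The relation between the Overlap, p-Value, and Adjusted p-Value columns can be sketched with a simple over-representation test. The following is a minimal illustration, assuming a one-sided hypergeometric (Fisher's exact) test followed by Benjamini-Hochberg adjustment. The background size of 20,000 genes and the function names are assumptions made for this example, so the output is not expected to reproduce the values in Table 1.

```python
from math import comb

def hypergeom_upper_tail(k, n_selected, n_term, n_background):
    """P(overlap >= k) when n_selected genes are drawn without replacement
    from n_background genes, n_term of which belong to the term."""
    total = comb(n_background, n_selected)
    upper = min(n_selected, n_term)
    tail = sum(
        comb(n_term, j) * comb(n_background - n_term, n_selected - j)
        for j in range(k, upper + 1)
    )
    return tail / total

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Example: 83 of the 203 genes in a term found among 2281 selected genes,
# with an ASSUMED background of 20,000 genes.
p_example = hypergeom_upper_tail(83, 2281, 203, 20000)
print(p_example)

# Adjusting a small, made-up list of p-values.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Exact integer arithmetic via `math.comb` keeps very small tail probabilities accurate, at the cost of speed for large gene sets. The adjusted value of a term depends on how many terms are tested in the category, which is why the adjusted p-values in the table are larger than the raw ones.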
Table 2 shows the top 10 terms in the “ARCHS4 Tissues” category in Enrichr. Remarkably, three of the top four tissues (liver, skeletal muscle, and adipose) are those in which gene expression was measured. Similar results are found for the “Mouse Gene Atlas” category in Enrichr (Table 3). In conclusion, the proposed method successfully selects genes expressed in the tissues analyzed.
Table 2. Top 10 terms in the “ARCHS4 Tissues” category in Enrichr.

| Term | Overlap | p-Value | Adjusted p-Value |
| --- | --- | --- | --- |
| LIVER (BULK TISSUE) | 481/2316 | 3.49 × 10^-63 | 3.77 × 10^-61 |
| VENTRICLE | 449/2316 | 1.67 × 10^-49 | 9.04 × 10^-48 |
| SKELETAL MUSCLE (BULK TISSUE) | 428/2316 | 2.34 × 10^-41 | 8.42 × 10^-40 |
| ADIPOSE (BULK TISSUE) | 410/2316 | 6.46 × 10^-35 | 1.75 × 10^-33 |
| MYOBLAST | 409/2316 | 1.42 × 10^-34 | 3.08 × 10^-33 |
| SUBCUTANEOUS ADIPOSE TISSUE | 401/2316 | 6.92 × 10^-32 | 1.25 × 10^-30 |
| ATRIUM | 366/2316 | 2.38 × 10^-21 | 3.67 × 10^-20 |
| HEART (BULK TISSUE) | 363/2316 | 1.53 × 10^-20 | 2.07 × 10^-19 |
| HEPATOCYTE | 362/2316 | 2.82 × 10^-20 | 3.39 × 10^-19 |
| OMENTUM | 350/2316 | 3.25 × 10^-17 | 3.51 × 10^-16 |

Table 3. Top 10 terms in the “Mouse Gene Atlas” category in Enrichr.

| Term | Overlap | p-Value | Adjusted p-Value |
| --- | --- | --- | --- |
| mammary gland non-lactating | 116/201 | 7.92 × 10^-64 | 7.61 × 10^-62 |
| skeletal muscle | 229/710 | 5.23 × 10^-63 | 2.51 × 10^-61 |
| liver | 243/928 | 3.58 × 10^-48 | 1.14 × 10^-46 |
| adipose brown | 148/456 | 5.78 × 10^-41 | 1.39 × 10^-39 |
| heart | 154/568 | 2.53 × 10^-32 | 4.86 × 10^-31 |
| kidney | 80/554 | 3.98 × 10^-4 | 5.90 × 10^-3 |
| osteoblast day 21 | 44/264 | 4.30 × 10^-4 | 5.90 × 10^-3 |
| bladder | 33/195 | 1.63 × 10^-3 | 1.96 × 10^-2 |
| adipose white | 33/199 | 2.29 × 10^-3 | 2.44 × 10^-2 |
| MEF | 45/300 | 3.33 × 10^-3 | 3.20 × 10^-2 |

4. Discussion

Although the proposed method successfully integrated gene expression data measured in three tissues and identified diseases associated with diabetes mellitus, the identified genes also included genes expressed in all three tissues. If other methods that do not require an integrated analysis can perform similarly, then complicated methods, such as the proposed method, will not be required. To determine whether methods without integration can achieve similar performance, we tested three methods: t test, SAM [15], and limma [16]. Since the t test and SAM cannot simultaneously consider the distinction between the control and treatment and the dependence on time, we attempted to identify genes whose expression differed between the control and treatment, without considering time dependence. For details on how these three methods were run, see the sample R source code in the Supplementary Materials. Table 4 shows the number of probes selected by the other methods. In each individual tissue, these methods select fewer probes than the proposed method (2542 probes), and the number selected in muscle is particularly low. With limma, only two probes were selected for muscle; thus, the method was not successful. The integrated analysis likely helped identify more probes, which resulted in more significant enrichment.
Table 4. Number of probes selected by other methods.

| Tissue | t Test | SAM | Limma |
| --- | --- | --- | --- |
| Adipose | 556 | 773 | 116 |
| Muscle | 100 | 119 | 2 |
| Liver | 947 | 1090 | 211 |
| ComBat | 4009 | 180 | 0 |

To further validate the genes selected by the other methods, we converted probe IDs to gene symbols and uploaded them to Enrichr. Table 5 presents the results for the other methods in the “Mouse Gene Atlas” category in Enrichr. For muscle, neither SAM nor the t test ranked muscle among the top tissues, and limma selected only two probes (see Table 4). Thus, the other methods are not better than the proposed method, which correctly identified muscle specificity (Table 3). Figure 2 shows the Venn diagrams of the selected genes. Since the proposed method selects genes different from those selected in the individual tissue analyses, an integrated analysis is valuable.
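Table 5. Top three terms by other methods in the “Mouse Gene Atlas” category in Enrichr.

| Method | Probe set | Term | Overlap | p-Value | Adjusted p-Value |
| --- | --- | --- | --- | --- | --- |
| t test | Adipose | adipose brown | 38/456 | 4.17 × 10^-12 | 3.92 × 10^-10 |
| t test | Adipose | mammary gland lact | 12/104 | 3.87 × 10^-6 | 1.82 × 10^-4 |
| t test | Adipose | macrophage peri LPS thio 0 h | 18/353 | 1.14 × 10^-3 | 3.59 × 10^-2 |
| t test | Muscle | adipose brown | 29/456 | 4.34 × 10^-26 | 2.26 × 10^-24 |
| t test | Muscle | heart | 21/568 | 3.88 × 10^-14 | 1.01 × 10^-12 |
| t test | Muscle | mammary gland lact | 4/104 | 1.15 × 10^-3 | 1.99 × 10^-2 |
| t test | Liver | liver | 90/928 | 2.53 × 10^-16 | 2.38 × 10^-14 |
| t test | Liver | adipose brown | 40/456 | 9.53 × 10^-7 | 4.48 × 10^-5 |
| t test | Liver | kidney | 40/554 | 9.11 × 10^-5 | 2.86 × 10^-3 |
| t test | ComBat | bone marrow | 107/413 | 1.04 × 10^-10 | 9.98 × 10^-9 |
| t test | ComBat | osteoblast day 21 | 75/264 | 8.31 × 10^-10 | 3.99 × 10^-8 |
| t test | ComBat | embryonic stem line V26 2 p16 | 149/728 | 9.44 × 10^-7 | 3.02 × 10^-5 |
| SAM | Adipose | adipose brown | 51/456 | 2.61 × 10^-16 | 2.48 × 10^-14 |
| SAM | Adipose | mammary gland lact | 12/104 | 5.67 × 10^-5 | 2.69 × 10^-3 |
| SAM | Adipose | macrophage peri LPS thio 0 h | 23/353 | 3.54 × 10^-4 | 1.12 × 10^-2 |
| SAM | Muscle | adipose brown | 33/456 | 2.16 × 10^-29 | 1.21 × 10^-27 |
| SAM | Muscle | heart | 23/568 | 7.01 × 10^-15 | 1.96 × 10^-13 |
| SAM | Muscle | mammary gland lact | 4/104 | 1.91 × 10^-3 | 3.47 × 10^-2 |
| SAM | Liver | liver | 93/928 | 3.10 × 10^-14 | 2.91 × 10^-12 |
| SAM | Liver | adipose brown | 43/456 | 1.63 × 10^-6 | 7.66 × 10^-5 |
| SAM | Liver | kidney | 43/554 | 1.76 × 10^-4 | 5.51 × 10^-3 |
| SAM | | Cell cycle | 11/124 | 2.95 × 10^-9 | 4.71 × 10^-7 |
| SAM | | Oocyte meiosis | 9/129 | 6.65 × 10^-7 | 5.32 × 10^-5 |
| SAM | | Progesterone-mediated oocyte maturation | 8/100 | 1.00 × 10^-6 | 5.34 × 10^-5 |
| limma | Adipose | adipose brown | 14/456 | 4.19 × 10^-8 | 2.85 × 10^-6 |
| limma | Adipose | adipose white | 4/199 | 1.61 × 10^-2 | 5.46 × 10^-1 |
| limma | Adipose | intestine small | 6/466 | 2.59 × 10^-2 | 5.87 × 10^-1 |
| limma | Liver | liver | 33/928 | 2.39 × 10^-11 | 1.60 × 10^-9 |
| limma | Liver | adipose brown | 7/456 | 1.31 × 10^-1 | 1.00 × 10^0 |
| limma | Liver | heart | 8/568 | 1.57 × 10^-1 | 1.00 × 10^0 |
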
Figure 2. Venn diagrams between genes selected by various methods. Upper: t test, lower: SAM.

Finally, for the genes associated with the probes selected in the individual tissues (Table 4), the enriched terms in the “KEGG 2021 Human” category in Enrichr do not include neurodegenerative diseases (see the Supplementary Materials). Thus, the association between neurodegenerative diseases and diabetes mellitus can be found only when an integrated analysis, such as the proposed method, is employed. In this sense, an integrated analysis is more than a simple union of individual analyses and can identify factors that cannot be identified by individual analyses, such as potentially associated diseases. An integrated analysis of gene expression profiles in individual tissues therefore provides more in-depth information than individual analyses, at least in certain cases, and should be encouraged.
Other integrated methods might perform similarly. If so, the advanced method proposed here would not be required. To rule out this possibility, we applied ComBat [17] to remove the batch effect between the three tissue types, since we selected genes whose expression is independent of tissue, as can be seen in Figure S1; Table 4 shows the results. This approach was hardly successful. Limma failed to select any DEGs, and the numbers of genes selected by the t test and SAM differ markedly from each other, in contrast to the tissue-specific DEGs, whose numbers are more consistent across the three methods (Table 4). Biological validation is also worse; Table 5 shows the results for the “Mouse Gene Atlas” category. None of the tissues used in the experiments is listed, whereas they are listed for the proposed method (Table 3). In addition, for the genes associated with the ComBat-processed probes in Table 4, the “KEGG 2021 Human” category in Enrichr does not include the neurodegenerative diseases (see the Supplementary Materials) that were detected using the proposed method (Table 1). In conclusion, integrated analysis using ComBat is inferior to the proposed method.
One might wonder why an integrated analysis of three tissues from patients with diabetes mellitus can identify associations with neurodegenerative diseases. The PCA and TD-based unsupervised FE methods are frequently able to detect disease associations. We previously identified an association between cancer and amyotrophic lateral sclerosis [18] without investigating cancer gene expression and an association between heart diseases and posttraumatic stress disorder [19] without investigating brain gene expression. Therefore, we were not surprised that the integrated analysis using the proposed method was able to identify disease associations. To our knowledge, few studies have attempted to predict associations between diseases using gene expression, although many studies have focused on associations between genes and diseases [20,21,22] and between drugs and diseases [23,24,25]. Our proposed strategy would be useful for such studies.

5. Conclusions

In this study, we applied the TD-based unsupervised FE with SD optimization to perform an integrated analysis of gene expression measured in three distinct tissues by microarray, a type of data to which the method had not been applied in previous studies. The results show that the proposed method can identify more genes than individual analyses. The selected genes are known to be expressed in all three tissues, and they are also enriched in many neurodegenerative diseases that have a known association with diabetes mellitus but cannot be identified by individual analyses. In this sense, integrated analyses might have the ability to identify additional factors relative to individual analyses.
  19 in total

1.  Significance analysis of microarrays applied to the ionizing radiation response.

Authors:  V G Tusher; R Tibshirani; G Chu
Journal:  Proc Natl Acad Sci U S A       Date:  2001-04-17       Impact factor: 11.205

2.  Predicting drug-disease associations through layer attention graph convolutional network.

Authors:  Zhouxin Yu; Feng Huang; Xiaohan Zhao; Wenjie Xiao; Wen Zhang
Journal:  Brief Bioinform       Date:  2021-07-20       Impact factor: 11.622

3.  A new branch connecting thermogenesis and diabetes.

Authors:  Haipeng Sun; Yibin Wang
Journal:  Nat Metab       Date:  2019-09

4.  Gene Set Knowledge Discovery with Enrichr.

Authors:  Zhuorui Xie; Allison Bailey; Maxim V Kuleshov; Daniel J B Clarke; John E Evangelista; Sherry L Jenkins; Alexander Lachmann; Megan L Wojciechowicz; Eryk Kropiwnicki; Kathleen M Jagodnik; Minji Jeon; Avi Ma'ayan
Journal:  Curr Protoc       Date:  2021-03

5.  Principal component analysis-based unsupervised feature extraction applied to in silico drug discovery for posttraumatic stress disorder-mediated heart disease.

Authors:  Y-h Taguchi; Mitsuo Iwadate; Hideaki Umeyama
Journal:  BMC Bioinformatics       Date:  2015-04-30       Impact factor: 3.169

6.  eDGAR: a database of Disease-Gene Associations with annotated Relationships among genes.

Authors:  Giulia Babbi; Pier Luigi Martelli; Giuseppe Profiti; Samuele Bovo; Castrense Savojardo; Rita Casadio
Journal:  BMC Genomics       Date:  2017-08-11       Impact factor: 3.969

7.  Identifying Disease-Gene Associations With Graph-Regularized Manifold Learning.

Authors:  Ping Luo; Qianghua Xiao; Pi-Jing Wei; Bo Liao; Fang-Xiang Wu
Journal:  Front Genet       Date:  2019-04-02       Impact factor: 4.599

8.  An effective drug-disease associations prediction model based on graphic representation learning over multi-biomolecular network.

Authors:  Hanjing Jiang; Yabing Huang
Journal:  BMC Bioinformatics       Date:  2022-01-04       Impact factor: 3.169

Review 9.  Quantification of Mitochondrial Oxidative Phosphorylation in Metabolic Disease: Application to Type 2 Diabetes.

Authors:  Matthew T Lewis; Jonathan D Kasper; Jason N Bazil; Jefferson C Frisbee; Robert W Wiseman
Journal:  Int J Mol Sci       Date:  2019-10-24       Impact factor: 5.923

Review 10.  PPARs and the Development of Type 1 Diabetes.

Authors:  Laurits J Holm; Mia Øgaard Mønsted; Martin Haupt-Jorgensen; Karsten Buschard
Journal:  PPAR Res       Date:  2020-01-09       Impact factor: 4.964
