Insights from the clustering of microarray data associated with the heart disease.

Venkatesan Perumal, Vasantha Mahalingam.

Abstract

Heart failure (HF) is a major cause of mortality and morbidity in the developed world. Gene expression profiles from animal models of heart failure have been used in a number of studies to understand human cardiac disease. In this study, statistical methods for analysing microarray data on cardiac tissue from dogs with pacing-induced HF were used to identify genes differentially expressed between normal tissue and two abnormal tissues. The unsupervised techniques of principal component analysis (PCA) and cluster analysis were explored to distinguish between three groups comprising 12 arrays and to separate the genes that are up-regulated under different conditions among the 23,912 genes in the canine heart failure microarray data. Using one-way analysis of variance (ANOVA), 1,802 of the 23,912 genes were found to be differentially expressed across the three groups at the 5% level of significance and 496 at the 1% level. The genes clustered by PCA and cluster analysis were examined to understand HF, and a small number of differentially expressed genes related to HF were identified.

Keywords: Cluster analysis; Heart failure; Microarray data; Principal component analysis

Year: 2013 PMID: 24023417 PMCID: PMC3766307 DOI: 10.6026/97320630009759

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Heart failure remains a major public health problem in the industrialised world, despite significant improvements in diagnosis and medical therapeutics. Globally, the current prevalence of heart failure is over 23 million [1]. In cardiovascular research, microarray technologies are used to test hypotheses about the molecular mechanisms underlying different pathological conditions and phenotypes and to identify new therapeutic targets. Human samples are subject to considerable biological variation arising from concomitant etiologies, medications, age, sex and clinical stage, so reproducibility is strongly affected when human samples are used. A number of studies have used chronically instrumented dogs with high-frequency cardiac pacing to study the pathophysiological and molecular mechanisms related to dilated cardiomyopathy [2]. In this paper, a microarray data set from a canine model of pacing-induced heart failure was considered [2]. Analysis of microarray data is a search for genes that have similar or correlated patterns of expression. Analysis of variance (ANOVA), principal component analysis (PCA) and cluster analysis were used to analyse the canine microarray data. The main objectives of the paper are to identify genes differentially expressed across three groups (classes) using the one-way ANOVA test, to separate the genes that are up-regulated under different conditions using principal component analysis, and to identify clusters of samples, clusters of genes, and relationships between samples and genes using cluster analysis.

Methodology

Principal Component Analysis and Cluster Analysis:

Principal Component Analysis (PCA) is a variable reduction procedure introduced by Karl Pearson in 1901. It is a classical tool for reducing the dimension of expression data, visualising the similarities between biological samples, and filtering noise. PCA is often used as a pre-processing step before clustering [3]. The basic idea of PCA is to find a few linearly transformed, uncorrelated components that explain the maximum amount of variance in the original variables. Figure 1 shows a schematic diagram of three-dimensional data represented by two principal components [4].

Figure 1

Schematic graph showing three-dimensional data represented by two principal components, where the data matrix contains n rows and p columns.
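The idea of representing three-dimensional data by two principal components can be sketched in a few lines. The example below is only an illustration in Python (the study itself used R): it builds the covariance matrix of a small made-up n x p matrix, extracts the two leading eigenvectors by power iteration with deflation, and reports the proportion of variance each component explains. The function names and the toy values are hypothetical; standardising each column first would give the correlation-matrix variant.

```python
import math


def column_means(rows):
    """Mean of each column of an n x p data matrix (list of rows)."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(len(rows[0]))]


def covariance_matrix(rows):
    """Sample covariance matrix (p x p) of an n x p data matrix."""
    n, p = len(rows), len(rows[0])
    means = column_means(rows)
    centred = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
            for i in range(p)]


def multiply(matrix, vec):
    """Matrix-vector product for a square matrix stored as nested lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vec)) for row in matrix]


def leading_eigenpair(matrix, iterations=1000):
    """Power iteration: largest eigenvalue and unit eigenvector of a
    symmetric positive semi-definite matrix."""
    p = len(matrix)
    start = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-zero start
    scale = math.sqrt(sum(x * x for x in start))
    vec = [x / scale for x in start]
    for _ in range(iterations):
        product = multiply(matrix, vec)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vec = [x / norm for x in product]
    value = sum(v * w for v, w in zip(vec, multiply(matrix, vec)))
    return value, vec


def pca(rows, n_components=2):
    """Return (components, explained variance ratios, scores)."""
    cov = covariance_matrix(rows)
    p = len(cov)
    total_variance = sum(cov[i][i] for i in range(p))
    means = column_means(rows)
    components, explained = [], []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        components.append(vec)
        explained.append(value / total_variance)
        # Deflation: subtract the component just found before the next pass.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum((row[j] - means[j]) * comp[j] for j in range(p))
               for comp in components] for row in rows]
    return components, explained, scores


# Made-up 8 x 3 matrix (n = 8 rows, p = 3 columns), for illustration only.
toy_data = [
    [2.5, 2.4, 0.5], [0.5, 0.7, 0.1], [2.2, 2.9, 0.6], [1.9, 2.2, 0.4],
    [3.1, 3.0, 0.7], [2.3, 2.7, 0.5], [2.0, 1.6, 0.3], [1.0, 1.1, 0.2],
]
components, explained, scores = pca(toy_data, n_components=2)
print("proportion of variance explained:", explained)
```

Each row of `scores` gives the coordinates of one observation in the two-dimensional component space, which is the kind of information a biplot displays.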

Clustering is a widely used method in the first step of gene expression data analysis. The aim of cluster analysis is to form groups (clusters) of objects on the basis of the similarity (or distance) between them [5]. It is used to find correlated and functionally related groups. The most frequently used clustering techniques are hierarchical clustering and k-means clustering. There are various ways to define the distance between clusters; the most widely used is the average linkage method, which works well with standardised microarray data. Average linkage defines the distance between two clusters as the average distance over all pairs of items in which one member of the pair belongs to cluster 1 and the other belongs to cluster 2 [6]. The k-means clustering algorithm starts with a predefined number of cluster centres (k) specified by the user [7] (Figure 2).

Figure 2

Flow chart explaining steps involved in different types of clustering.
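The average linkage definition given above translates directly into code. The following Python sketch is a minimal illustration, assuming Euclidean distance between items (the text does not state which distance was used): it computes the average linkage distance between two clusters and uses it in a naive agglomerative loop that merges the two closest clusters until a requested number remains. The function names and sample points are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b):
    """Average distance over all pairs with one item from each cluster."""
    total = sum(euclidean(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters):
    """Naive agglomerative clustering: start with one cluster per point and
    repeatedly merge the pair of clusters with the smallest average linkage."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters


# Made-up two-dimensional points, for illustration only.
sample_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0)]
print(agglomerate(sample_points, 3))
```

The loop recomputes every pairwise linkage at each step, which is fine for a few dozen items such as the 40 selected genes but would be slow for all 23,912 genes.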

Application to heart failure data:

The data for the current study were obtained from the Gene Expression Omnibus database at the National Center for Biotechnology Information (GEO: http://www.ncbi.nlm.nih.gov/geo/, accession GSE5247). The study involved sixteen male mongrel dogs divided into three groups: the first group consisted of 6 dogs subjected to left ventricular pacing at 210 beats/min for 3 weeks; the second group consisted of 6 other dogs paced at 210 beats/min for 3 weeks and at 240 beats/min thereafter; and the remaining four dogs served as normal controls. Total cardiac RNA was extracted from control (n = 4), 3-week paced (n = 4), and decompensated heart failure dogs (n = 4) [2]. Affymetrix canine oligonucleotide microarrays were used to determine the changes in gene expression profile from compensated dysfunction to decompensated heart failure in pacing-induced dilated cardiomyopathy. The data set consists of 23,912 genes and 12 samples (arrays) (Figure 3). The open source software R version 2.10.0 was used for the microarray data analysis.

Figure 3

Flow chart describing the microarray experiment conducted on the heart failure model.

Results & Discussion

Normalisation of Heart failure data:

Before statistical analysis is applied, the normality of the gene expression data should be checked. Figure 4 shows the step-by-step statistical procedure used to analyse the heart failure microarray data. The raw heart failure microarray data were extremely positively skewed (Figure 5a). The raw data were log-transformed (base 2) and quantile normalised across the 12 arrays in the three groups (control, T3 and T4) to reduce variation across arrays (Figure 5b).
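The two pre-processing steps described in this paragraph, log transformation with base 2 followed by quantile normalisation, can be sketched as follows. This is a simplified Python illustration, not the R code used in the study: the matrix is stored as genes (rows) by arrays (columns), ties are broken by row order rather than averaged, and the intensity values are made up.

```python
import math


def log2_transform(matrix):
    """Apply log base 2 to every intensity (all values must be positive)."""
    return [[math.log2(value) for value in row] for row in matrix]


def quantile_normalise(matrix):
    """Quantile-normalise the columns (arrays) of a genes x arrays matrix.

    Every array is forced to share the same distribution: the value of
    rank r in each array is replaced by the mean, across arrays, of the
    rank-r values. Ties are broken by row order (a simplification).
    """
    n_genes, n_arrays = len(matrix), len(matrix[0])
    columns = [[matrix[g][a] for g in range(n_genes)] for a in range(n_arrays)]
    # For each array, the gene indices ordered from lowest to highest value.
    orders = [sorted(range(n_genes), key=col.__getitem__) for col in columns]
    # Reference distribution: mean across arrays at each rank.
    reference = [
        sum(columns[a][orders[a][rank]] for a in range(n_arrays)) / n_arrays
        for rank in range(n_genes)
    ]
    result = [[0.0] * n_arrays for _ in range(n_genes)]
    for a in range(n_arrays):
        for rank, gene in enumerate(orders[a]):
            result[gene][a] = reference[rank]
    return result


# Made-up raw intensities: 4 genes (rows) x 3 arrays (columns).
raw_intensities = [
    [16.0, 64.0, 8.0],
    [2.0, 4.0, 32.0],
    [128.0, 8.0, 4.0],
    [4.0, 256.0, 2.0],
]
normalised = quantile_normalise(log2_transform(raw_intensities))
for row in normalised:
    print(row)
```

After this step every array has identical sorted values, so box plots of the arrays line up, while the ranking of genes within each array is unchanged.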

Figure 4

Flow chart describing statistical procedures for microarray data analysis.

Figure 5

a) Box plots of all groups for raw data; b) Box plots of all groups for normalised data; c) Scree plot with arrays as variables; d) Scree plot with genes as variables; e) Biplot with arrays as variables; f) Biplot with genes as variables; g) Average linkage hierarchical clustering; h) k-means clustering.

Analysis of Variance (ANOVA):

Exploratory microarray data analysis was carried out to shortlist the genes differentially expressed across two or more known groups (classes). One-way ANOVA was used, and the test was carried out in parallel for all genes. Of the 23,912 genes, 1,802 were differentially expressed across the three groups at the 5% level of significance and 496 at the 1% level. The 40 most differentially expressed genes across the three groups were selected, and the corresponding p values are given in Table 1 (see supplementary material).
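A gene-by-gene one-way ANOVA of this kind can be sketched as below. The Python example is a hedged illustration rather than the study's R code: it computes the F statistic for one gene from its expression values in each group and, because there are exactly three groups (numerator degrees of freedom equal to 2), uses the closed-form upper tail of the F distribution, P(F > f) = (1 + 2f/d)^(-d/2) with d the within-group degrees of freedom. That shortcut does not apply to other numbers of groups. The gene names and expression values are hypothetical.

```python
def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA.

    `groups` is a list of lists, one list of values per group.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    if ss_within == 0.0:
        return float("inf"), df_between, df_within
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def p_value_three_groups(f_stat, df_within):
    """Upper-tail probability of F(2, df_within).

    Closed form valid only when the numerator degrees of freedom equal 2,
    that is, when exactly three groups are compared.
    """
    return (1.0 + 2.0 * f_stat / df_within) ** (-df_within / 2.0)


def select_genes(expression, alpha):
    """Return {gene: p value} for genes significant at level alpha.

    `expression` maps a gene name to its three groups of values.
    """
    selected = {}
    for gene, groups in expression.items():
        f_stat, df_between, df_within = one_way_anova_f(groups)
        assert df_between == 2, "closed-form p value needs exactly three groups"
        p_value = p_value_three_groups(f_stat, df_within)
        if p_value < alpha:
            selected[gene] = p_value
    return selected


# Made-up log2 expression values: control, T3 and T4, four arrays each.
toy_expression = {
    "gene_a": [[7.1, 7.3, 6.9, 7.2], [8.4, 8.6, 8.5, 8.3], [8.9, 9.1, 9.0, 8.8]],
    "gene_b": [[5.0, 5.4, 4.8, 5.1], [5.2, 4.9, 5.3, 5.0], [5.1, 5.0, 4.7, 5.3]],
}
print(select_genes(toy_expression, alpha=0.05))
```

With four arrays per group the within-group degrees of freedom are 12 - 3 = 9, so each gene is tested against the F(2, 9) distribution.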

Principal Component Analysis (PCA):

For the selected 40 genes, PCA based on the correlation matrix with samples as variables was performed. The first two principal components (Table 2) explained more than 96% of the total variation, and the scree plot confirmed that the first two components are sufficient (Figure 5c). In the biplot, all control samples (C), labelled Gr1, were grouped together, all T2 samples, labelled Gr2, were grouped together, and all T3 samples, labelled Gr3, were grouped together. The angle between the arrows of groups 2 and 3 was small, and group 2 lay between group 1 and group 3, as expected from the definitions of the groups (Figure 5e). The genes ranked 14 and 1 are likely to be up-regulated in groups 2 and 3 and down-regulated in group 1 (control), whereas the gene ranked 29 is likely to be up-regulated in group 1 and down-regulated in groups 2 and 3. The gene ranked 14 codes for HSP40 (DNAJB6), which acts as a chaperone and plays a pivotal role during stress conditions and heart failure [8]. The gene ranked 1 codes for an uncharacterised protein. The gene ranked 29 codes for ubiquitination factor E4A (UBE4A), which is involved in the degradation of excess, unwanted and misfolded proteins, an important process for protecting cells during heart failure [9].

Similarly, for the 40 selected genes, PCA based on the covariance matrix with genes as variables was performed. The number of variables was large, and the first two principal components (Table 2, see supplementary material) explained 76% of the total variation. From the scree plot, the first two components are sufficient (Figure 5d). In the biplot (Figure 5f), only two groups were identified among the genes; the genes that are up-regulated and down-regulated in groups 2 and 3, and the outlying genes in each group, are given in Table 2. Genes such as PGRMC1, CYBB and AGPAT9 play a major role in heart disease. PGRMC1 and its homologues regulate cholesterol synthesis by activating the P450 protein Cyp51, which is an important target in cardiovascular disease. CYBB regulates NADPH oxidase activity and thereby protects against severe ischemia/reperfusion injury during HF. AGPAT9 is involved in phospholipid biosynthesis, and its up-regulation increases insulin resistance, which is strongly linked with cardiovascular disease. At the molecular level, it was not possible to distinguish between heart failure induced by 3 weeks of pacing and heart failure induced by 4 weeks of pacing; however, these two groups were different from the control group. The genes that are up-regulated in groups 2 and 3 can be used as biomarkers in heart failure models.

Cluster Analysis:

For the selected 40 genes, three clusters were identified and correctly classified (Figure 5g and Table 2, see supplementary material) for the three groups (control, T3 and T4) using average linkage cluster analysis with genes as variables. In cluster one, the genes ranked 35, 15 and 18 were outlying genes; in cluster two, the genes ranked 1, 38 and 14; and in cluster three, the genes ranked 25, 37 and 33. The gene ranked 35 codes for kininogen, which is involved in the blood coagulation system. Kinins, peptide products of kininogens, may be involved in hypertensive and diabetic diseases [10]. As mentioned in the PCA results, the genes ranked 1 and 14 play a major role in heart failure models. The other outlying genes need to be studied further. Figure 5h shows the scatter plot for k-means clustering, in which the two elbows indicate two possibilities, namely a 2-cluster solution and a 3-cluster solution.
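The elbow reasoning behind choosing between a 2-cluster and a 3-cluster solution can be illustrated with a small k-means sketch. The Python code below is an assumption-laden example, not the procedure used for Figure 5h: it runs Lloyd's algorithm with a deterministic initialisation (centres taken at evenly spaced positions in the input, where a real analysis would normally use several random starts) and prints the within-cluster sum of squares for k = 1 to 4 on made-up points. An elbow is the value of k after which this quantity stops dropping sharply.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100):
    """Lloyd's algorithm; returns (centres, within-cluster sum of squares)."""
    # Deterministic start for illustration: k points spread over the input.
    centres = [tuple(points[(i * len(points)) // k]) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centres[c]))
            clusters[nearest].append(p)
        # New centre = mean of the cluster; an empty cluster keeps its centre.
        new_centres = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster))
            if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    wss = sum(min(squared_distance(p, c) for c in centres) for p in points)
    return centres, wss


# Made-up points forming three visible groups, for illustration only.
toy_points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]
for k in range(1, 5):
    _, wss = k_means(toy_points, k)
    print(k, round(wss, 3))
```

On these toy points the within-cluster sum of squares falls steeply up to k = 3 and changes little afterwards, which is the pattern an elbow plot is meant to reveal.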

Conclusion

Oligonucleotide and cDNA microarrays have also been used to study gene changes related to cardiovascular disease [11]. Microarray experiments using two-colour comparisons have potential pitfalls for data analysis. Gene expression is not measured directly; rather, fluorescence intensity is measured by a scanner. Many factors influence the intensity and produce multicentric effects, creating a need for bias correction or normalisation between the two colour systems. In this work, methods for selecting differentially expressed genes in cDNA microarray data were proposed and studied. This offers a basis for advanced data mining approaches. There are limited statistical methods for dealing with multidimensional data. The same microarray data set can lead to very different conclusions when different data analysis techniques and different clustering algorithms are used [12]. The groups of samples, gene clusters, outlying genes and the relationships between samples and genes were analysed using PCA and cluster analysis. Ojaimi et al. found that a number of processes, including normalisation of gene regulation during decompensation, the appearance of new up-regulated genes and the maintenance of gene expression, all contribute to the transition to overt heart failure, with an unexpectedly small number of genes differentially regulated [2]. In this paper, the three groups were compared simultaneously. Genes such as PGRMC1, CYBB, AGPAT9, DNAJB6, UBE4A and KNG1 are differentially expressed in the canine heart failure model. These genes may also be critically involved in human heart failure. Hence, further studies should be done to understand the role of these genes in the etiology of heart failure.

References

1. Principal component analysis for clustering gene expression data. K Y Yeung; W L Ruzzo. Bioinformatics, 2001-09.

2. The human kininogen gene (KNG) mapped to chromosome 3q26-qter by analysis of somatic cell hybrids using the polymerase chain reaction. D Fong; D I Smith; W T Hsieh. Hum Genet, 1991-06.

3. A crucial role of mitochondrial Hsp40 in preventing dilated cardiomyopathy.

4. Paternal age related schizophrenia (PARS): Latent subgroups detected by k-means clustering analysis.

5. Mammalian E4 is required for cardiac development and maintenance of the nervous system.

6. Epidemiology and risk profile of heart failure. (Review)

7. A preliminary application of principal components and cluster analysis to internal tongue deformation patterns.

8. Microarray gene expression profiles in dilated and hypertrophic cardiomyopathic end-stage heart failure.

9. Microarray data analysis and mining tools.

10. Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets. Fangzhou Yao; Jeff Coquery; Kim-Anh Lê Cao. BMC Bioinformatics, 2012-02-03.