Clustering gene expression data based on predicted differential effects of GV interaction.

Abstract

Microarrays have become a popular technology in biological and medical research. However, systematic and stochastic variability in microarray data is expected and unavoidable, so the raw measurements carry inherent "noise". Currently, logarithmic ratios are usually analyzed directly by various clustering methods, which may bias the identification of groups of genes or samples. In this paper, a statistical method based on mixed model approaches is proposed for cluster analysis of microarray data. The underlying rationale is to partition the observed total gene expression level into components of variation caused by different factors using an ANOVA model, and to predict the differential effects of GV (gene by variety) interaction using the adjusted unbiased prediction (AUP) method. The predicted GV interaction effects can then be used as the inputs of cluster analysis. We illustrate the application of our method with a gene expression dataset and assess the utility of our approach using an external validation.

Year: 2005 PMID: 16144520 PMCID: PMC5172465 DOI: 10.1016/s1672-0229(05)03005-6

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

Monitoring tens of thousands of genes in parallel under different experimental environments or across different tissue types provides a systematic genome-wide approach to a wide range of problems, such as gene functions in various cellular processes, gene regulation in different cellular signaling pathways, the diagnosis of disease conditions, and the effects of medical treatments. A key step in the analysis of gene expression data is the identification of biologically relevant groups of genes or tissue samples that have similar expression patterns. When genes are clustered across many samples, unknown gene function may be inferred from clusters of similarly expressed genes 1., 2., 3.. By clustering samples over the expression levels of multiple genes, novel disease subgroups may be identified 4., 5., 6.. Gene expression data can be analyzed by various clustering methods, including hierarchical clustering 3., 7., self-organizing maps, K-means 9., 10., and the graph-theoretic approaches CAST, HCS, and CLICK.

However, systematic and stochastic fluctuations are usually involved in microarray experiments. The raw measurements therefore carry inherent "noise", which may bias the identification of groups of genes or tissue samples and lead to false interpretation of expression patterns. An approach is needed to minimize or eliminate this inherent noise and to make the inputs of cluster analysis more biologically meaningful. Mixed model approaches are widely used to partition the sources of variation of observed phenotypes. They are flexible enough to handle a wide variety of experimental designs and data shapes (including balanced and unbalanced data), and can easily be extended to more complicated biological models. Mixed model approaches have been applied to detect significantly differentially expressed genes 15., 16..

Hierarchical clustering methods

Hierarchical clustering methods are widely used by biologists to produce a hierarchical tree of clusters 3., 7.. The dendrogram provides potentially useful information about the relationships among clusters and can be broken into the desired number of clusters by cutting across the tree at a desired height. According to how the clusters are produced, hierarchical clustering algorithms can be further divided into agglomerative algorithms and divisive algorithms.

Agglomerative clustering starts with the points as individual clusters and then iteratively merges the two closest clusters. This merging procedure continues until only one cluster remains. Different criteria for measuring the similarity between a pair of clusters yield different clustering algorithms. All algorithms are based on distance metrics that measure the similarity of a pair of points. Euclidean distance and one minus the Pearson correlation coefficient are two commonly used distance metrics for the proximity of a pair of points when clustering expression profiles. For two profiles X = (x_1, …, x_n) and Y = (y_1, …, y_n) with means x̄ and ȳ, the Pearson correlation coefficient R(X,Y) and the Euclidean distance D(X,Y) are defined in the standard way as

$$R(X,Y) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}, \qquad D(X,Y) = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2}$$

Using these two distance metrics, the distance between a pair of clusters can be computed in two ways. Under the complete-linkage criterion, the distance between two clusters is the maximum distance between a point in one cluster and a point in the other. Under the UPGMA-linkage criterion, the distance between two clusters is the average of the pairwise distances between the points in one cluster and the points in the other.

Divisive clustering starts with one all-inclusive cluster and, at each step, splits the biggest group into two smaller groups until each cluster contains only a single sample. In this case, we need to decide at each step which cluster to split and how to split it into two smaller ones. Different decisions and split criteria generate different divisive clustering algorithms such as DIANA (DIvisive ANAlysis). The detailed implementation of DIANA is described by Kaufman and Rousseeuw.
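The agglomerative procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the ClusterProject implementation: the function names, the `linkage` argument values, and the toy `profiles` are assumptions made for the example, and the naive search over all cluster pairs is only practical for small inputs.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """One minus the Pearson correlation coefficient of two profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


def euclidean_distance(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster_distance(c1, c2, points, metric, linkage):
    """Distance between two clusters given as lists of point indices."""
    pairwise = [metric(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "complete":   # maximum pairwise distance
        return max(pairwise)
    if linkage == "average":    # UPGMA: mean pairwise distance
        return sum(pairwise) / len(pairwise)
    raise ValueError("linkage must be 'complete' or 'average'")


def agglomerative(points, n_clusters, metric=euclidean_distance,
                  linkage="complete"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(
                clusters[ab[0]], clusters[ab[1]], points, metric, linkage),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index b is still valid
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: two rising and two falling expression profiles.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [4.2, 2.9, 2.1, 0.9],
]
groups = agglomerative(profiles, 2, metric=pearson_distance, linkage="average")
```

Cutting the merge process at `n_clusters` corresponds to cutting the dendrogram at the height where that many branches remain.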

Assessment of clustering results

A key question in the design and analysis of clustering techniques is how to evaluate the clustering results. Different measures are applicable in different situations, depending on the information available, such as whether a partial true solution is known. Jain and Dubes divided cluster evaluation indices into two main categories: internal and external criteria. An internal criterion measures the quality of clusters based only on the data, whereas an external criterion measures the agreement between the derived clusters and some external gold standard. An external criterion can therefore provide an independent, hopefully unbiased assessment of cluster quality. Because the predicted GV (gene by variety) effects and the raw log2(Ratios) are different inputs, internal criteria such as the figure of merit (FOM) or silhouette width, which are computed from the input data themselves, are not suitable for comparing the quality of the cluster results. Since putative cluster labels are available for the gene expression dataset used, the Jaccard coefficient, an external index, was adopted to evaluate the quality of the clustering results. The Jaccard coefficient is defined as the proportion of correctly identified mates in the derived solution to the sum of correctly identified mates plus the total number of disagreements between the derived solution and the putative solution (a disagreement is a pair that are mates in one solution and non-mates in the other). The higher the score, the better the solution, and a score of 1.0 indicates a perfect solution. Sharan et al. applied this index to evaluate clustering results.

We present a statistical method based on mixed model approaches for cluster analysis of microarray data. The objective of this method is to partition the observed total gene expression level into components of variation caused by different factors using an ANOVA model, and to predict the differential effects of GV interaction using the adjusted unbiased prediction (AUP) method 21., 22.. We then applied three hierarchical clustering methods (complete-linkage, UPGMA-linkage, and DIANA) to the phenotypic values of log2(Ratios) and to the predicted differential effects of GV interaction, respectively. The utility of our method for the task of clustering genes was judged by the Jaccard coefficient. We developed a Windows software package (ClusterProject) for the analysis and visualization of gene expression data. This software has a graphical user interface and contains various clustering methods, similarity metrics, and evaluation metrics, as well as multivariate analysis including PCA (principal component analysis) and the mixed model approach. It can visualize the raw expression data and the cluster results in several ways. The software is available at http://ibi.zju.edu.cn/software/clusterproject/.
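The Jaccard coefficient defined in this section counts pairs of items rather than items. A minimal sketch follows, assuming each solution is given as a list of cluster labels in the same item order; the function name, the label values, and the convention of returning 1.0 when neither solution contains any mates are assumptions made for illustration.

```python
from itertools import combinations


def jaccard_coefficient(derived, putative):
    """Pair-counting Jaccard index between two clusterings.

    derived, putative: sequences of cluster labels, one per item.
    """
    if len(derived) != len(putative):
        raise ValueError("both solutions must label the same items")
    agree = 0      # pairs that are mates in both solutions
    disagree = 0   # pairs that are mates in exactly one solution
    for i, j in combinations(range(len(derived)), 2):
        mates_derived = derived[i] == derived[j]
        mates_putative = putative[i] == putative[j]
        if mates_derived and mates_putative:
            agree += 1
        elif mates_derived != mates_putative:
            disagree += 1
    if agree + disagree == 0:
        return 1.0  # assumed convention: no mates anywhere, nothing to disagree on
    return agree / (agree + disagree)


# Hypothetical example with five genes and two putative classes.
putative = ["a", "a", "a", "b", "b"]
derived = [0, 0, 1, 1, 1]
score = jaccard_coefficient(derived, putative)  # 2 agreeing pairs, 4 disagreements
```

In this example the pairs (0,1) and (3,4) are mates in both solutions, while four pairs are mates in only one of them, so the score is 2 / (2 + 4). Pairs that are non-mates in both solutions do not enter the index.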

Model

Prediction of GV interaction effects

A microarray experiment is a multi-step process, and each step may introduce a potential source of variation. The variation of the measured gene expression data can generally be classified into three categories: biological variation, technical variation, and residual variation 24., 25.. Biological variation in measured gene expression accounts for the variation from different mRNA sources, such as different animals, cell lines, or tissues. Technical variation refers to the variation arising from the use of the microarray system, such as the sample preparation procedures, the hybridization and washing procedures, the detection method for gene expression, and laboratory environmental conditions. Residual variation accounts for sampling or experimental error and other unexplained factors. The variation in a measured gene expression is the sum of these three variations.

Our approach is centered on the ANOVA model of Kerr et al. for the analysis of microarray data. The ANOVA model is a popular statistical approach to account for different sources of variation. It can consider all possible sources of variation in a microarray experiment and summarize them in one equation. The exact form of the ANOVA model depends on the particular experiment. That is, one should determine which sources of variation are present in each experiment individually and construct the model accordingly. Let $y_{ijkl}$ be the observed gene expression measurement from gene i, dye j, array k, and variety l; then an overall ANOVA model is

$$y_{ijkl} = \mu + G_i + D_j + A_k + V_l + GD_{ij} + GA_{ik} + GV_{il} + \varepsilon_{ijkl} \tag{1}$$

where μ is the average of overall expression levels, a fixed effect, and $G_i$ is the fixed effect of the i-th gene. The effects of dye D, array A, variety V, gene by dye interaction GD, gene by array interaction GA, gene by variety interaction GV, and residual ε are all random variables with zero means and their respective variance components. The generic term "variety" refers to the effect of treatments, tissue types, or time points in a biological process.

Among these effects, the gene by variety (GV) interaction effects are the biologically interesting ones. These terms reflect differences in the expression of genes in particular varieties that are not explained by the marginal effects of genes and varieties. Variance components of the aforementioned model can be estimated using restricted maximum likelihood estimation (REML) or minimum norm quadratic unbiased estimation (MINQUE). The random effects of GV can be predicted by the best linear unbiased prediction (BLUP) method or the AUP method 21., 22.. We used MINQUE (1) to estimate the variance components and the AUP method to predict the random effects. MINQUE (1) is a MINQUE method with all prior values set to 1.0. The predicted differential effects of GV interaction were used as the inputs for further cluster analysis.
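To give a feel for what an interaction effect removes, the sketch below uses a deliberately simpler stand-in, not the MINQUE (1) and AUP procedure of the paper. It assumes a complete genes × varieties table with one value per cell and computes the classical fixed-effects interaction residual, y_il minus the gene mean, minus the variety mean, plus the grand mean. It ignores dye and array effects and applies no prediction-style adjustment; the function name and the toy table are hypothetical.

```python
def double_center(table):
    """Fixed-effects interaction residuals of a complete two-way table.

    table: list of rows (genes), each a list of values (varieties).
    Returns y[i][l] - row_mean[i] - col_mean[l] + grand_mean for every cell.
    This is a simplified stand-in, not the AUP predictor used in the paper.
    """
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][l] for i in range(n_rows)) / n_rows
                 for l in range(n_cols)]
    return [
        [table[i][l] - row_means[i] - col_means[l] + grand
         for l in range(n_cols)]
        for i in range(n_rows)
    ]


# Hypothetical log-ratios for two genes at three time points.
toy = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 7.0],
]
interaction = double_center(toy)
# Gene 2 rises more steeply at the last time point than the shared trend
# explains, so interaction[1][2] is positive and interaction[0][2] is negative.
```

A table that is purely additive in gene and variety yields residuals of zero everywhere, which is the sense in which interaction terms capture only what the marginal effects leave unexplained.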

Application to yeast sporulation data

We applied our method to the gene expression data on the transcriptional program of sporulation in budding yeast collected and analyzed by Chu et al. The data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation. In this experiment, cDNA microarrays containing 97% of the 6,118 known and predicted genes of yeast were used to study gene expression during meiosis and spore formation. The mRNA samples were taken at seven time points: 0, 0.5, 2, 5, 7, 9, and 11.5 h. For each time point, the researchers prepared a "red"-labeled cDNA pool. The time-0 sample also served as a reference pool for the samples taken at all seven time points and was labeled with "green" fluorescent dye. Seven microarrays were used in the study, and each array was probed with the green-labeled sample mixed with one of the seven red-labeled samples. Each spot yielded four measurements: red signal, red background, green signal, and green background. The background-corrected ratio (red signal − red background)/(green signal − green background) was used as the expression level of a gene at each time point.

In addition, Chu et al. described a small set of hand-picked representative genes from each of the seven temporal classes that were expressed during sporulation. Two genes (MRD1 and NAB4) for profile 3 and two genes (KNR4 and EXO1) for profile 4 could not be found in the publicly available data file. The remaining 36 genes were used for subsequent analysis. We modified the preceding full mixed linear model to fit this specific data set. The modification of model (1) is

$$y_{ijkl} = \mu + G_i + D_j + A_k + V_l + GA_{ik} + GV_{il} + \varepsilon_{ijkl} \tag{2}$$

where $y_{ijkl}$ is the base-2 logarithm of the background-corrected individual intensity measurement, and i = 1, …, 36 genes; j = 1, 2 dyes; k = 1, …, 7 arrays; and l = 1, …, 7 varieties (time points). It is not possible to fit the full model (1), which includes GD interaction effects, to this experimental design because no residual degrees of freedom would remain. We therefore excluded the GD effect from the full model. Model (2) is similar to the model that Kerr and Churchill used for this gene expression data, which included an AD effect instead of the variety effect.
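The statement about residual degrees of freedom can be checked with simple bookkeeping. The sketch below counts degrees of freedom as if every term were a fixed effect and as if the terms were not further confounded with one another in this design; both are simplifying assumptions made for illustration, and the variable names are hypothetical.

```python
# Dimensions of the yeast sporulation subset described above.
n_genes, n_dyes, n_arrays, n_varieties = 36, 2, 7, 7
n_obs = n_genes * n_dyes * n_arrays  # 36 genes x 2 channels x 7 arrays = 504


def model_df(include_gd):
    """Degrees of freedom per term, counted as for a fixed-effects ANOVA."""
    df = {
        "mu": 1,
        "G": n_genes - 1,                          # 35
        "D": n_dyes - 1,                           # 1
        "A": n_arrays - 1,                         # 6
        "V": n_varieties - 1,                      # 6
        "GA": (n_genes - 1) * (n_arrays - 1),      # 210
        "GV": (n_genes - 1) * (n_varieties - 1),   # 210
    }
    if include_gd:
        df["GD"] = (n_genes - 1) * (n_dyes - 1)    # 35
    return df


residual_full = n_obs - sum(model_df(include_gd=True).values())      # 0
residual_reduced = n_obs - sum(model_df(include_gd=False).values())  # 35
```

Under these assumptions the full model uses all 504 degrees of freedom, while dropping the GD term frees 35 of them, one per gene contrast, for the residual.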

Results

The proportions of the variance components to the total variance in model (2) are summarized in Table 1. Variety effects and gene by variety effects contributed most of the variation in gene expression (45.2% and 36.4%, respectively). This indicates that the variation of gene expression was mainly determined by variety (time point) and gene by variety interaction, and gives strong evidence that the expression levels of genes vary across time points. The proportion of residual variation to the total variance was small (3.9%).

Table 1

Variance Component Estimates and Their Proportions to Total Variance for Yeast Sporulation Data

Variance component	Estimate	Proportion (σ²/σ²_T)
σ²_D (dye)	0.002	0.001
σ²_A (array)	0.119	0.045
σ²_V (variety)	1.194	0.452
σ²_GA (gene by array)	0.262	0.099
σ²_GV (gene by variety)	0.962	0.364
σ²_ε (residual)	0.102	0.039
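The proportions in Table 1 follow directly from the estimates: each variance component is divided by the total variance, taken here as the sum of the six components. A short check, with dictionary keys chosen only for this illustration:

```python
# Variance component estimates from Table 1.
estimates = {
    "D": 0.002,
    "A": 0.119,
    "V": 1.194,
    "GA": 0.262,
    "GV": 0.962,
    "residual": 0.102,
}

# Total variance, assumed to be the sum of all components.
total = sum(estimates.values())

# Proportion of each component, rounded to three decimals as in the table.
proportions = {name: round(value / total, 3)
               for name, value in estimates.items()}
# V and GV together account for about 0.816 of the total variance.
```

The rounded values reproduce the Proportion column of Table 1.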

Furthermore, the GV effects in model (2) were predicted by the AUP method. Each of the three clustering algorithms, with either the one-minus-Pearson-correlation metric or the Euclidean metric, was applied to the phenotypic values of log2(Ratios) and to the predicted GV effects. The Jaccard coefficient was computed for each run to assess the quality of the resulting clusters. The comparisons for the yeast sporulation data are shown in Table 2. Using the predicted GV effects instead of the phenotypic values of log2(Ratios) as inputs improved the results of all three clustering methods, except for UPGMA-linkage with the Euclidean metric. With the Euclidean metric, all three clustering methods correctly discovered the genes of the Metabolic class for both log2(Ratios) and GV effects. With the Pearson correlation metric, the methods other than DIANA also correctly discovered the metabolic genes for GV effects. However, when log2(Ratios) were clustered with this metric, the Metabolic class was partitioned into two subclasses (one group includes SIP4, CAT2, YOR100C, CAR1, AGA2, and YPR192W; the other includes ACS1 and PYC1). DIANA with the Euclidean metric gave the best performance using GV effects, and it accurately discovered three classes (Metabolic, Early I, and Late).

Table 2

Comparisons of Three Clustering Methods (Jaccard Coefficients) with log2(Ratios) and GV Effects for Yeast Sporulation Data under Model (2)

Method	Pearson, log₂(Ratios)	Pearson, GV effects	Euclidean, log₂(Ratios)	Euclidean, GV effects
Complete-linkage	0.369	0.420	0.390	0.487
UPGMA-linkage	0.291	0.395	0.338	0.315
DIANA	0.301	0.311	0.391	0.500
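The comparison in Table 2 can be tabulated programmatically. The snippet below only restates the Jaccard coefficients from the table and lists which method and metric combinations did not improve with GV effects; the dictionary layout and variable names are assumptions made for this illustration.

```python
# Jaccard coefficients from Table 2: (log2(Ratios), GV effects).
jaccard = {
    ("Complete-linkage", "Pearson"): (0.369, 0.420),
    ("Complete-linkage", "Euclidean"): (0.390, 0.487),
    ("UPGMA-linkage", "Pearson"): (0.291, 0.395),
    ("UPGMA-linkage", "Euclidean"): (0.338, 0.315),
    ("DIANA", "Pearson"): (0.301, 0.311),
    ("DIANA", "Euclidean"): (0.391, 0.500),
}

# Combinations where GV effects did not score higher than log2(Ratios).
not_improved = [key for key, (raw, gv) in jaccard.items() if gv <= raw]

# Combination with the highest Jaccard coefficient on GV effects.
best = max(jaccard, key=lambda key: jaccard[key][1])
```

Here `not_improved` contains only UPGMA-linkage with the Euclidean metric, and `best` is DIANA with the Euclidean metric, matching the description of the results.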

Discussion

Microarray technologies provide an overall, simultaneous view of the expression levels of tens of thousands of genes under different conditions or processes. Large numbers of valuable datasets have been produced to serve biological and biomedical research. Finding structure in a large dataset is a venerable, well-studied problem that is routinely addressed as a first step of data mining. Finding groups of similarly expressed genes or tumors in microarray data is very valuable for understanding gene functions and gene regulation, and for assisting clinical treatment. However, since many different variations are introduced at different stages of microarray experiments, the identification and estimation of the different sources of variation are fundamental to the design of cost-efficient microarray experiments. Genes, dyes, arrays, varieties (treatments, time points, or disease types), and their interactions are well known as sources of effects contributing to variation in microarray experiments 24., 26..

In the present study, a statistical method based on mixed model approaches was proposed to assist the cluster analysis of gene expression data. The underlying principle of this method is to use the constructed model to partition the total gene expression variation into components caused by different factors and then to predict the differential effects of GV interaction in the model by the AUP method. The mixed model method provides an automatic correction for the nuisance effects when estimating the relative expression of genes across experimental samples. GV interaction effects capture the departure from the overall averages and so reflect the biologically relevant relative expression for the specific combination of gene and variety. These effects exclude the contributions of genes, dyes, arrays, their interaction effects, and random error to the gene expression, so they are more biologically meaningful than the raw expression measurements. Using predicted GV interaction effects as the inputs of cluster analysis can reduce the influence of noise on the cluster result. The result for the yeast sporulation data illustrates the utility of using GV effects as inputs.

Replication is an important aspect of microarray design. Replications allow the variability of expression data to be assessed (for example, in RNA isolation, labeling efficiency, or chip quality), so that formal statistical analysis methods can be applied. Two basic types of replication can be incorporated within or between arrays: 1) biological replication, in which mRNA samples taken from multiple populations are used on multiple arrays; and 2) technical replication, in which the same mRNA samples are repeated on multiple arrays, or multiple clones or probes of the same gene are spotted multiple times on the array. Replication can thus minimize technical artifacts and assess biological variability, and it is the key to the accuracy and reliability of the data. Whether biological replication, technical replication, or both are used in a microarray experiment depends on the relative magnitude of biological and technical variability in the sample. Replication of the same genes on an array can reduce array effects due to the quality of robot-fabricated immobilized cDNA probes within the same array. However, replicated spots should be well spaced so that the true variability within an array can be estimated. The only replication in the yeast sporulation experiment was the self-comparison of the time-0 sample. Although this was adequate for providing error degrees of freedom, it was not an ideal situation. All of the nonzero residuals from the ANOVA model come from the self-comparison array, and all other data points are fitted exactly because they are not replicated.

Good experimental design will likely provide the greatest amount of satisfaction and the least amount of frustration in executing a microarray project. The reference design and the loop design are two common experimental designs for microarray experiments. Kerr and Churchill suggested a more effective loop design for the yeast sporulation experiment. When model (1) is fitted with this design, residuals are obtained from every array and GV effects can be estimated more precisely. Dye swap is another common design. Furthermore, techniques with more than two dyes have been proposed to reduce experimental expenses. Our method can easily be applied to these designs and their modifications with replications.

A common problem in microarray experiments is missing data. Each array may contain a number of genes with fluorescence intensity measurements that are flagged by the experimenter and recorded as missing. Because of noise and missing values in data sets, many statistical methods may produce estimates quite different from the real values. The mixed model approach has the advantages of handling unbalanced data and of predicting the random effects.

22 in total

1. Validating clustering for gene expression data.

Authors: K Y Yeung; D R Haynor; W L Ruzzo
Journal: Bioinformatics Date: 2001-04 Impact factor: 6.937

2. Normalization strategies for cDNA microarrays.

Authors: J Schuchhardt; D Beule; A Malik; E Wolski; H Eickhoff; H Lehrach; H Herzel
Journal: Nucleic Acids Res Date: 2000-05-15 Impact factor: 16.971

3. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments.

Authors: M K Kerr; G A Churchill
Journal: Proc Natl Acad Sci U S A Date: 2001-07-24 Impact factor: 11.205

4. Assessing gene significance from cDNA microarray expression data via mixed models.

Authors: R D Wolfinger; G Gibson; E D Wolfinger; L Bennett; H Hamadeh; P Bushel; C Afshari; R S Paules
Journal: J Comput Biol Date: 2001 Impact factor: 1.479

5. Analysis of variance for gene expression microarray data.

Authors: M K Kerr; M Martin; G A Churchill
Journal: J Comput Biol Date: 2000 Impact factor: 1.479

Review 6. Gene expression data analysis.

Authors: A Brazma; J Vilo
Journal: FEBS Lett Date: 2000-08-25 Impact factor: 4.124

Review 7. Fundamentals of experimental design for cDNA microarrays.

Authors: Gary A Churchill
Journal: Nat Genet Date: 2002-12 Impact factor: 38.330

8. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays.

Authors: U Alon; N Barkai; D A Notterman; K Gish; S Ybarra; D Mack; A J Levine
Journal: Proc Natl Acad Sci U S A Date: 1999-06-08 Impact factor: 11.205

9. Cluster analysis and display of genome-wide expression patterns.

Authors: M B Eisen; P T Spellman; P O Brown; D Botstein
Journal: Proc Natl Acad Sci U S A Date: 1998-12-08 Impact factor: 11.205

10. Triple-target microarray experiments: a novel experimental strategy.

Authors: Thorsten Forster; Yael Costa; Douglas Roy; Howard J Cooke; Klio Maratou
Journal: BMC Genomics Date: 2004-02-10 Impact factor: 3.969