
Exploratory methods for checking quality of microarray data.

Abstract

In microarray experiments many undesirable systematic variations are commonly observed. Often investigators analyzing microarray data need to make subjective decisions about the quality of the experiment, by examining its chip image and a simple scatter plot. Thus, a more rigorous but simple method is desirable to determine the quality of microarray data. We propose two exploratory methods to investigate the quality of microarray experiments with replicated chips. The first method is based on correlations among chips and the second on the actual intensity values for each gene. The proposed methods are illustrated using a real microarray data set. The methods provide an initial estimation for determining the quality of microarray experiments.


Year: 2007 PMID: 17597933 PMCID: PMC1896057 DOI: 10.6026/97320630001423

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

In microarray experiments, different sources of systematic and random error can arise, and these may significantly affect inference on the measured gene expression patterns. A normalization procedure is regularly employed to remove (or minimize) the artifacts due to such errors. While normalization approaches are useful for adjusting the bias of each individual chip, they do not provide a rigorous statistical criterion for detecting chips of poor quality. At an earlier stage of analysis, each microarray slide is often examined graphically, using scatter plots between chips to look for large variability (or low reproducibility) and any unusual patterns. However, such examinations rely on subjective human pattern recognition, and chips of poor quality can frequently enter the subsequent analysis, resulting in unreliable inference for the whole microarray study. Therefore, in this study we are concerned with checking the overall quality of microarray experiments and identifying outlying chips that have much lower reproducibility than the other chips.

There have been several approaches to checking reproducibility in microarray experiments. For example, Parmigiani et al. [1] defined an integrative correlation between two experiments conducted separately to answer the same biological question. This integrative correlation is calculated for each gene and is called the gene's reproducibility score. King et al. [2] used correlations, the rate of two-fold changes, and principal component analysis to check the reproducibility of gene expression measurements. Park et al. [3] proposed diagnostic plots for identifying outlying slides. In this paper, we propose an exploratory method to check the quality of microarray data using two different approaches.

Methodology

We first describe the approach based on the correlations between chips and then describe the other approach based on the actual intensity values.

Correlation Based Approach

Given in the supplementary material linked below

Example

In this section, the proposed methods are applied to murine B-cell data. To study gene expression profiles in murine B-cell development, total cellular RNA was extracted from five consecutive B-lymphocyte lineage sub-populations (pre-BI cells, large pre-BII cells, small pre-BII cells, immature B-cells, and mature B-cells), and gene expression profiles for these five consecutive stages of mouse B-cell development were generated with at least five replicates each [8]. The murine B-cell data show relatively low sensitivity (0.66) and specificity (0.02). For further exploratory analysis, we apply the proposed methods. In the chip-wise correlation plot (Figure 1), the treatments other than small pre-BII cells (chips 23 to 27) show high chip-wise correlations. The chip-wise correlations of the small pre-BII treatment have a highly skewed distribution, and the third replicate (chip 25) has very small correlations compared with the other chips in the same group. We therefore conclude that this third replicate is problematic and should be checked or treated before further analysis. In the summary correlation plot (Figure 2), the murine B-cell data show one outlier, chip 25. All chips except those in the small pre-BII group lie in the upper triangle, and chip 25 is far from the other chips. This supports the result from the chip-wise correlation plot (Figure 1).
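
The idea behind the chip-wise correlations in Figure 1 can be illustrated with a small sketch. The exact definition is given in the supplementary material, so the sketch assumes that a chip's chip-wise correlations are its Pearson correlations with the other replicates in the same treatment group. The data are simulated, and `pearson` and `chipwise_correlations` are hypothetical helper names, not functions from arrayQCplot.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chipwise_correlations(chips):
    """For each chip, list its correlations with every other replicate."""
    return [
        [pearson(chip, other) for j, other in enumerate(chips) if j != i]
        for i, chip in enumerate(chips)
    ]


# Simulated log intensities (assumption): 5 replicates sharing one signal,
# with the third replicate made much noisier than the rest.
rng = random.Random(0)
signal = [rng.gauss(8.0, 2.0) for _ in range(500)]
chips = [[s + rng.gauss(0.0, 0.3) for s in signal] for _ in range(5)]
chips[2] = [s + rng.gauss(0.0, 3.0) for s in signal]

corrs = chipwise_correlations(chips)
mean_corr = [sum(row) / len(row) for row in corrs]
worst = min(range(len(chips)), key=lambda i: mean_corr[i])
print("replicate with the lowest mean correlation:", worst + 1)
```

In this toy setting the noisy replicate stands out because all of its correlations are low, while each of the other replicates has only one low correlation (the one with the noisy chip).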

Figure 1

Chip-wise correlation plot: Murine B-cell data. The plots are for the five treatments: Immature B (1, 2, 3, 4, 5), Large Pre-BII (6, 7, 8, 9, 10), Mature B (11, 12, 13, 14, 15, 16), Pre-BI (17, 18, 19, 20, 21, 22), and Small Pre-BII (23, 24, 25, 26, 27)

Figure 2

The summary correlation plot. The solid line across the plot is the reference line for specificity. Chips below this line have low specificity, and chips above it have high specificity.

In Table 1, the last columns of the P_KS and P_W matrices show lower p-values than the other columns. Therefore, we conclude that the distribution of within-group correlations in the small pre-BII group differs from those of the other groups. In addition, the mean within-group correlation in the small pre-BII group is lower than in the other groups.

Table 1

P_KS and P_W matrices of the murine B-cell data

P_KS	Imm. B	Large BII	Mat. B	Pre BI	Small BII
Immature B	1.00	0.41	0.34	0.52	0.20
Large Pre BII	0.90	1.00	0.62	0.62	0.20
Mature B	0.89	0.81	1.00	0.77	0.15
Pre BI	0.89	0.81	0.94	1.00	0.15
Small Pre BII	1.00	0.90	0.89	0.72	1.00

P_W	Imm. B	Large BII	Mat. B	Pre BI	Small BII
Immature B	1.00	0.37	0.30	0.34	0.11
Large Pre BII	0.66	1.00	0.45	0.42	0.24
Mature B	0.72	0.58	1.00	0.47	0.17
Pre BI	0.68	0.60	0.55	1.00	0.18
Small Pre BII	0.90	0.78	0.84	0.83	1.00
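
Table 1 compares the within-group correlation distributions of pairs of treatments. The tests themselves are defined in the supplementary material. Assuming that P_KS refers to a two-sample Kolmogorov-Smirnov comparison, the statistic underlying such a test is the largest gap between the two empirical distribution functions, sketched below. The correlation values are made up for illustration, `ks_statistic` is a hypothetical helper name, and the p-value calculation is omitted.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical
    cumulative distribution functions of samples a and b.
    """
    a, b = sorted(a), sorted(b)
    d = 0.0
    for v in a + b:  # the gap can only change at observed values
        fa = sum(x <= v for x in a) / len(a)
        fb = sum(x <= v for x in b) / len(b)
        d = max(d, abs(fa - fb))
    return d


# Hypothetical within-group correlations for two treatments (not from the paper).
group_a = [0.97, 0.96, 0.98, 0.95, 0.97, 0.96]
group_b = [0.90, 0.62, 0.88, 0.60, 0.91, 0.58]
print("KS statistic:", ks_statistic(group_a, group_b))
```

A statistic near 0 means the two sets of correlations are distributed similarly; a statistic near 1 means they barely overlap.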

Next, we apply the within-treatment test based on intensities, setting the FDR at 5%. Table 2 shows the results of the intensity-based tests. The murine B-cell data show quite different patterns. In particular, Γ for the small pre-BII treatment is the lowest among the five treatments. We therefore conclude that the murine B-cell data set is less reproducible.

Table 2

Summary table for the within test based on intensities

Treatment	Conc/disc	Γ
Murine B-cell (27)
Immature B(5)	1086/5509	0.6707
Large Pre BII (5)	1079/5516	0.6728
Mature B(6)	1145/5450	0.6528
Pre BI (6)	1095/5500	0.6679
Small Pre BII (5)	1320/5275	0.5997
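
The Γ statistic is defined in the supplementary material rather than in this section. As a consistency check, the tabulated values are reproduced by taking the absolute difference of the two counts in each row and dividing by their sum, which has the same form as the Goodman-Kruskal gamma. Reading Γ this way is an inference from the numbers in Table 2, not a definition taken from the paper, and `gamma` is a hypothetical helper name.

```python
# Rows of Table 2: (first count, second count, reported gamma).
table2 = {
    "Immature B": (1086, 5509, 0.6707),
    "Large Pre BII": (1079, 5516, 0.6728),
    "Mature B": (1145, 5450, 0.6528),
    "Pre BI": (1095, 5500, 0.6679),
    "Small Pre BII": (1320, 5275, 0.5997),
}


def gamma(count_a, count_b):
    """Assumed form: |a - b| / (a + b), inferred from the tabulated values."""
    return abs(count_a - count_b) / (count_a + count_b)


for treatment, (a, b, reported) in table2.items():
    computed = round(gamma(a, b), 4)
    print(f"{treatment}: computed {computed:.4f}, reported {reported:.4f}")
```

Under this reading, every row covers the same 6595 genes, and the small pre-BII treatment has the lowest Γ because its smaller count (1320) is the largest of the five.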

We conclude that the murine B-cell data show relatively low reproducibility, sensitivity, and specificity. It is therefore unclear whether a subsequent statistical test procedure can successfully detect true differences among the five consecutive stages, especially for small pre-BII cells. This is mainly due to one outlying chip (chip 25), as shown in Figure 3. The analyst should therefore check the experimental procedure and the tissues used for this chip before further statistical analysis.

Discussion

At the initial stage of microarray data analysis, exploratory data analysis (EDA) provides the first contact with the data. The techniques of EDA consist of a number of informal steps, such as checking the quality of the data, calculating simple summary statistics, and constructing appropriate graphs. The proposed method is a more formal way of checking quality than simple EDA plots. Thus, at an initial stage of microarray data analysis, it provides useful information about the quality of microarray experiments. The correlation-based approaches check treatment-wise quality, while the test based on the actual intensity values checks gene-wise quality. The proposed method is quite effective in detecting outlying chips, and it is much easier to apply than traditional ways of checking for outlying chips, such as principal component analysis or the quality control plot [3].

There are some statistical issues to be taken into consideration, however. First, the log intensities may not be approximately normally distributed. For simplicity, we have assumed the normal distribution for testing all hypotheses. However, extensions to other distributional assumptions are certainly possible; for example, log-normal and gamma distributions can be easily handled. Second, we did not use a stringent criterion for identifying the concordant and discordant genes. All these genes should be checked using an analysis such as SAM [9] or a t-test [10] at a later stage of analysis. Third, the correlation coefficients derived from all possible pairs of chips may not be independent. We did not take this dependence into account in the current analysis. A more sophisticated approach based on bootstrapping, which considers possible correlations among the correlation coefficients, is under development.

We would like to emphasize that the proposed method is an exploratory analysis. We believe it is practically useful, simple, and easy to implement, and that it provides a more rigorous approach to a preliminary overview of the quality of microarray experiments. Most of the proposed methods are implemented in the software arrayQCplot [11], which can be downloaded from Bioconductor (www.bioconductor.org).

References (11 in total)

1. Significance analysis of microarrays applied to the ionizing radiation response.

Authors: V G Tusher; R Tibshirani; G Chu
Journal: Proc Natl Acad Sci U S A Date: 2001-04-17 Impact factor: 11.205

2. Statistical significance for genomewide studies.

Authors: John D Storey; Robert Tibshirani
Journal: Proc Natl Acad Sci U S A Date: 2003-07-25 Impact factor: 11.205

3. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays.

Authors: Nitin Jain; Jayant Thatte; Thomas Braciale; Klaus Ley; Michael O'Connell; Jae K Lee
Journal: Bioinformatics Date: 2003-10-12 Impact factor: 6.937

4. Diagnostic plots for detecting outlying slides in a cDNA microarray experiment.

Authors: Taesung Park; Sung-Gon Yi; SeungYeoun Lee; Jae K Lee
Journal: Biotechniques Date: 2005-03 Impact factor: 1.993

5. Reliability and reproducibility of gene expression measurements using amplified RNA from laser-microdissected primary breast tissue with oligonucleotide arrays.

Authors: Chialin King; Ning Guo; Garrett M Frampton; Norman P Gerry; Marc E Lenburg; Carol L Rosenberg
Journal: J Mol Diagn Date: 2005-02 Impact factor: 5.568

Review 6. Estimation and control of multiple testing error rates for microarray studies.

Authors: Stanley B Pounds
Journal: Brief Bioinform Date: 2006-03 Impact factor: 11.622

7. arrayQCplot: software for checking the quality of microarray data.

Authors: Eun-Kyung Lee; Sung-Gon Yi; Taesung Park
Journal: Bioinformatics Date: 2006-07-24 Impact factor: 6.937

8. Changes in gene expression profiles in developing B cells of murine bone marrow.

Authors: Reinhard Hoffmann; Thomas Seidl; Martin Neeb; Antonius Rolink; Fritz Melchers
Journal: Genome Res Date: 2002-01 Impact factor: 9.043

9. Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset.

Authors: Sung E Choe; Michael Boutros; Alan M Michelson; George M Church; Marc S Halfon
Journal: Genome Biol Date: 2005-01-28 Impact factor: 13.583

10. Analysis of strain and regional variation in gene expression in mouse brain.

Authors: P Pavlidis; W S Noble
Journal: Genome Biol Date: 2001-09-27 Impact factor: 13.583
