EMA - A R package for Easy Microarray data analysis.

Nicolas Servant, Eleonore Gravier, Pierre Gestraud, Cecile Laurent, Caroline Paccard, Anne Biton, Isabel Brito, Jonas Mandel, Bernard Asselain, Emmanuel Barillot, Philippe Hupé.

Abstract

BACKGROUND: The increasing number of methodologies and tools currently available to analyse gene expression microarray data can be confusing for non-specialist users.
FINDINGS: Based on the experience of biostatisticians of Institut Curie, we propose both a clear analysis strategy and a selection of tools to investigate microarray gene expression data. The most usual and relevant existing R functions were discussed, validated and gathered in an easy-to-use R package (EMA) devoted to gene expression microarray analysis. These functions were improved for ease of use, enhanced visualisation and better interpretation of results.
CONCLUSIONS: The strategy and tools proposed in the EMA R package could provide a useful starting point for many microarray users. EMA is part of the Comprehensive R Archive Network and is freely available at http://bioinfo.curie.fr/projects/ema/.

Year: 2010 PMID: 21047405 PMCID: PMC2987873 DOI: 10.1186/1756-0500-3-277

Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500

Findings

Numerous analysis methods and tools have been developed to study microarray data, many of them implemented as free R [1] and/or Bioconductor [2] packages. This abundance of methods makes choosing the best approach difficult for newcomers and non-specialist users. Based on the experience of the biostatisticians of Institut Curie, we propose a clear analysis strategy combining a large variety of standard methodologies. The most usual and relevant R functions needed to perform these analyses were selected and gathered in the R package EMA (Easy Microarray data Analysis). EMA covers an entire analysis process including quality control, normalisation, exploratory analysis, unsupervised and supervised classification, functional analysis and censored data exploration. The package can be used for both one- and two-colour gene expression microarrays and for exon expression experiments.

Analysis strategy

First, the quality of the data must be assessed in order to detect problematic raw probe-level data, such as spatial artifacts on the chip or poor-quality hybridisation. Gene expression experiments suffer from many sources of technical and experimental variation, so noise and systematic biases are removed in order both to improve the biological signal and to make all the arrays comparable. This is the so-called normalisation step. Second, we propose to discard the probesets with very low signal across the samples (i.e. genes that are unexpressed or below the detection threshold). This filtering step both reduces noise in the data and increases the statistical power of the subsequent analysis. Then, exploratory approaches are classically used to find clusters of genes (or samples) with similar profiles. Note that here the biological interpretation depends on the choice of the similarity metric. These approaches can highlight outliers and/or non-relevant effects (a batch effect, for example), which can subsequently be estimated and/or removed from the data using appropriate methods. Finally, supervised approaches aim to identify differentially expressed genes (DEG) or deregulated pathways while taking multiple testing into account. The differential analysis results can then be interpreted biologically through functional and gene set enrichment analyses. Sample class prediction (e.g. good vs poor clinical outcome) based on supervised classification methods can also be performed to highlight gene signatures.
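
To illustrate why the choice of similarity metric matters for the exploratory step, here is a minimal sketch in plain Python (not the EMA package's own R code). It compares a signed Pearson correlation distance with an absolute one. The function names `pearson` and `correlation_distance` and the toy profiles are hypothetical and chosen for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_distance(x, y, absolute=False):
    """Distance derived from Pearson's r.

    absolute=False: 1 - r   (anti-correlated profiles are far apart)
    absolute=True:  1 - |r| (anti-correlated profiles are close together)
    """
    r = pearson(x, y)
    return 1 - abs(r) if absolute else 1 - r

# Toy data (hypothetical): gene_b decreases exactly as gene_a increases.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [8.0, 6.0, 4.0, 2.0]

d_signed = correlation_distance(gene_a, gene_b)
d_abs = correlation_distance(gene_a, gene_b, absolute=True)
print(d_signed, d_abs)
```

With the signed distance the two anti-correlated genes are maximally distant, whereas with the absolute distance they would fall into the same cluster, so the same data can support different biological readings depending on the metric.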

Selected tools

For the data quality assessment, we recommend using the arrayQualityMetrics package [3], which performs a powerful, easy-to-use and comprehensive data quality estimation and produces an automatic HTML report. The EMA package offers the most widely used techniques for Affymetrix GeneChip normalisation: MAS5.0 [4], RMA [5] and GCRMA [6]. We recommend using GCRMA because it outperforms the other approaches (by ignoring the mismatch intensities and taking into account the probe sequence information) and allows an efficient filtering of irrelevant probesets thanks to the bimodal distribution of the probeset expression values it produces (Figure 1a). Other packages such as limma [7], vsn [8] or lumi [9] can be used to normalise non-Affymetrix data. After this first step, the main EMA functions can be used for any type of expression data, using a simple expression data matrix as input.
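
The filtering of low-signal probesets can be sketched as follows. This is a plain Python illustration under stated assumptions, not the EMA implementation: the function name `filter_low_expression`, the `min_samples` parameter and the rule "keep a probeset if at least `min_samples` samples exceed the threshold" are hypothetical. The default threshold of 3.5 on the log2 scale is the cut-off reported in the Figure 1a caption.

```python
def filter_low_expression(expr, threshold=3.5, min_samples=1):
    """Keep probesets expressed above `threshold` in enough samples.

    expr: dict mapping probeset id -> list of log2 expression values,
          one value per sample.
    A probeset is kept when at least `min_samples` of its values are
    strictly greater than `threshold` (an assumed rule, for illustration).
    """
    kept = {}
    for probeset, values in expr.items():
        n_expressed = sum(1 for v in values if v > threshold)
        if n_expressed >= min_samples:
            kept[probeset] = values
    return kept

# Toy expression matrix (hypothetical values, log2 scale, 4 samples).
expr = {
    "probeset_1": [2.1, 2.4, 2.0, 2.3],  # never above threshold: discarded
    "probeset_2": [2.2, 7.9, 8.3, 2.5],  # expressed in two samples: kept
    "probeset_3": [9.1, 9.4, 8.8, 9.0],  # expressed everywhere: kept
}

filtered = filter_low_expression(expr)
print(sorted(filtered))
```

Raising `min_samples` makes the filter stricter; which setting is appropriate depends on the smallest group size one wants to be able to detect.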

Figure 1

Graphical outputs provided by the EMA package for the class comparison study of [18]. (a) Histogram of probeset expression values across the 23 samples after GCRMA normalisation and log2 transformation. Probesets with an expression value below 3.5 (red vertical line) are discarded. (b) Individuals factor map produced by the PCA performed on the 23 filtered gene expression profiles. (c) Heatmap of the 23 gene expression profiles based on the 100 genes with the highest interquartile range (IQR) values. Sample clustering was performed using Pearson's correlation coefficient and the Ward criterion. Gene clustering was performed using the absolute Pearson's correlation coefficient and the Ward criterion. (d) Q-Q plot produced by the SAM analysis on the two groups of tumours. Probesets in green are considered to be differentially expressed between the two conditions.

The EMA package provides functions to perform exploratory analyses such as Principal Component Analysis (PCA, Figure 1b), hierarchical clustering (Figure 1c) or Multiple Factor Analysis. They are based on R packages such as FactoMineR [10], cluster [11], or mosclust [12]. A linear model is proposed to estimate and remove any non-relevant effects detected. Various methods are proposed to perform differential analysis, and the choice among them depends on the sample size. The multtest package provides standard approaches such as Student's t-test or the Mann-Whitney test, associated with multiple testing correction methods. The Significance Analysis of Microarrays (SAM) approach [13] (siggenes package) is also very interesting because it both estimates the null distribution and takes into account the correlation between probesets (Figure 1d). The rank product method [14] (RankProd package), dedicated to datasets with small sample sizes, is also offered, as well as some linear model (ANOVA) functions. Alternatively, the user can apply the limma package, which is a very powerful tool to assess differential expression by linear models.
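
As an illustration of what a multiple testing correction does to a list of per-probeset p-values, the following plain Python sketch implements the Benjamini-Hochberg false discovery rate adjustment. The text above does not say which correction was used, so the choice of Benjamini-Hochberg is an assumption made for the example; the function name and the toy p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    For the p-value of rank r (1 = smallest) among m tests, the raw
    adjustment is p * m / r; a running minimum taken from the largest
    p-value downwards enforces monotonicity.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values (hypothetical), one per probeset.
pvals = [0.001, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(pvals)
print(adj)
```

Probesets whose adjusted p-value falls below the chosen false discovery rate would then be reported as differentially expressed.
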
The functional enrichment of the DEG list is assessed based on the Gene Ontology [15] and KEGG [16] pathway annotation terms. The hypergeometric test of the GOstats package is used to test the over-representation of the functional terms in the gene list. For sample class prediction, we suggest using the CMA package [17], which includes the most popular machine learning and gene selection algorithms. In the context of censored data, the EMA package supports Kaplan-Meier and log-rank analyses using the survival package.
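
The over-representation test mentioned above can be written out directly. The sketch below is plain Python rather than the GOstats code, and it computes the one-sided hypergeometric p-value, i.e. the probability of observing at least k annotated genes in the list by chance. The function name, argument names and the toy counts are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, annotated, list_size, universe):
    """P(X >= k) for a hypergeometric variable X.

    universe:  number of genes in the background set
    annotated: number of background genes carrying the functional term
    list_size: number of genes in the DEG list
    k:         number of DEG-list genes carrying the term
    """
    total = comb(universe, list_size)
    upper = min(annotated, list_size)
    favourable = sum(
        comb(annotated, i) * comb(universe - annotated, list_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / total

# Toy example (hypothetical counts): 10 genes in the universe, 5 annotated
# to the term, a DEG list of 2 genes, both of which carry the term.
p_value = hypergeom_upper_tail(k=2, annotated=5, list_size=2, universe=10)
print(p_value)
```

In practice one such test is run per functional term, so the resulting p-values themselves call for a multiple testing correction.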

Example

The proposed analysis strategy was applied to the breast cancer gene expression dataset [18] comparing 12 basal-like carcinomas (BLCs) and 11 HER2-positive carcinomas (HER2+). Some graphical outputs for the data preprocessing, exploratory analysis and differential analysis steps are displayed in Figure 1. The RNA profiles were analysed using the U133 Plus 2.0 Affymetrix GeneChip. Three genes (P-cadherin, v-kit, FOXC1) were reported by the authors to be associated with a gene cluster over-expressed in the basal-like carcinomas and three genes (PTEN, Her2 and GRB7) with a gene cluster over-expressed in the Her2+ carcinomas. All these genes but one (v-kit) were found to be differentially expressed using the EMA package. This discrepancy is easily explained: although v-kit belongs to a basal-like expression cluster, no change in v-kit expression can be observed between the two groups in this clustering analysis. This is because the hierarchical clustering was performed on genes (such as v-kit) that are not necessarily differentially expressed between the two populations. The R scripts used to analyse this gene expression dataset can be found in [Additional file 1]. The transcriptomic data used in this application are publicly available at Gene Expression Omnibus (Accession number: [GSE13787]) and are part of the package.

Conclusions

EMA is a freely available R package which implements a complete strategy for expression microarray analysis. The package includes a vignette [Additional file 2] which describes the detailed biological/clinical analysis strategy used at Institut Curie. Most of the functions were improved for ease of use (fewer command lines, default parameters tested and chosen to be optimal). Relevant, enhanced and easy-to-interpret text and graphic outputs are offered. The package is available on The Comprehensive R Archive Network repository [19].

Availability and requirements

• Project Name: EMA
• Project home page: http://bioinfo.curie.fr/projects/ema/ http://cran.r-project.org/
• Operating systems: Linux, Windows
• Programming language: R
• Other requirements: R version ≥ 2.10. R packages: cluster, Hmisc, heatmap.plus, FactoMineR, GOstats, survival, multtest, affy, gcrma, rgl, GSA, RankProd, siggenes, MASS, hgu133plus2.db, xtable, biomaRt.
• License: GNU GPL
• Any restrictions to use by non-academics: none

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

NS and EG discussed the choice of the strategy and tools, participated in the development of the EMA package and wrote the paper. PG, CL, CP, AB, IB and JM discussed the choice of the strategy and tools and participated in the development of the EMA package. BA, EB and PH discussed the choice of the strategy and tools and supervised the work group. All authors read and approved the final manuscript.

Additional file 1

R scripts applied to the breast cancer gene expression dataset [18].

Additional file 2

EMA vignette. The vignette discusses the detailed biological/clinical analysis strategy used at Institut Curie and presents an application to a gene expression dataset.

13 in total

1. Significance analysis of microarrays applied to the ionizing radiation response.

Authors: V G Tusher; R Tibshirani; G Chu
Journal: Proc Natl Acad Sci U S A Date: 2001-04-17 Impact factor: 11.205

2. The Gene Ontology (GO) database and informatics resource.

Authors: M A Harris; J Clark; A Ireland; J Lomax; M Ashburner; R Foulger; K Eilbeck; S Lewis; B Marshall; C Mungall; J Richter; G M Rubin; J A Blake; C Bult; M Dolan; H Drabkin; J T Eppig; D P Hill; L Ni; M Ringwald; R Balakrishnan; J M Cherry; K R Christie; M C Costanzo; S S Dwight; S Engel; D G Fisk; J E Hirschman; E L Hong; R S Nash; A Sethuraman; C L Theesfeld; D Botstein; K Dolinski; B Feierbach; T Berardini; S Mundodi; S Y Rhee; R Apweiler; D Barrell; E Camon; E Dimmer; V Lee; R Chisholm; P Gaudet; W Kibbe; R Kishore; E M Schwarz; P Sternberg; M Gwinn; L Hannick; J Wortman; M Berriman; V Wood; N de la Cruz; P Tonellato; P Jaiswal; T Seigfried; R White
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. Normalization of cDNA microarray data.

Authors: Gordon K Smyth; Terry Speed
Journal: Methods Date: 2003-12 Impact factor: 3.608

4. RankProd: a bioconductor package for detecting differentially expressed genes in meta-analysis.

Authors: Fangxin Hong; Rainer Breitling; Connor W McEntee; Ben S Wittner; Jennifer L Nemhauser; Joanne Chory
Journal: Bioinformatics Date: 2006-09-18 Impact factor: 6.937

5. lumi: a pipeline for processing Illumina microarray.

Authors: Pan Du; Warren A Kibbe; Simon M Lin
Journal: Bioinformatics Date: 2008-05-08 Impact factor: 6.937

6. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

7. arrayQualityMetrics--a bioconductor package for quality assessment of microarray data.

Authors: Audrey Kauffmann; Robert Gentleman; Wolfgang Huber
Journal: Bioinformatics Date: 2008-12-23 Impact factor: 6.937

8. Frequent PTEN genomic alterations and activated phosphatidylinositol 3-kinase pathway in basal-like breast cancer cells.

Authors: Bérengère Marty; Virginie Maire; Eléonore Gravier; Guillem Rigaill; Anne Vincent-Salomon; Marion Kappler; Ingrid Lebigot; Fathia Djelti; Audrey Tourdès; Pierre Gestraud; Philippe Hupé; Emmanuel Barillot; Francisco Cruzalegui; Gordon C Tucker; Marc-Henri Stern; Jean-Paul Thiery; John A Hickman; Thierry Dubois
Journal: Breast Cancer Res Date: 2008-12-03 Impact factor: 6.466

9. CMA: a comprehensive Bioconductor package for supervised classification with high dimensional data.

Authors: M Slawski; M Daumer; A-L Boulesteix
Journal: BMC Bioinformatics Date: 2008-10-16 Impact factor: 3.169

10. Model order selection for bio-molecular data clustering.

Authors: Alberto Bertoni; Giorgio Valentini
Journal: BMC Bioinformatics Date: 2007-05-03 Impact factor: 3.169
