Lars M T Eijssen, Magali Jaillard, Michiel E Adriaens, Stan Gaj, Philip J de Groot, Michael Müller, Chris T Evelo.
Abstract
Quality control (QC) is crucial for any scientific method producing data. Applying adequate QC introduces new challenges in the genomics field where large amounts of data are produced with complex technologies. For DNA microarrays, specific algorithms for QC and pre-processing including normalization have been developed by the scientific community, especially for expression chips of the Affymetrix platform. Many of these have been implemented in the statistical scripting language R and are available from the Bioconductor repository. However, application is hampered by lack of integrative tools that can be used by users of any experience level. To fill this gap, we developed a freely available tool for QC and pre-processing of Affymetrix gene expression results, extending, integrating and harmonizing functionality of Bioconductor packages. The tool can be easily accessed through a wizard-like web portal at http://www.arrayanalysis.org or downloaded for local use in R. The portal provides extensive documentation, including user guides, interpretation help with real output illustrations and detailed technical documentation. It assists newcomers to the field in performing state-of-the-art QC and pre-processing while offering data analysts an integral open-source package. Providing the scientific community with this easily accessible tool will allow improving data quality and reuse and adoption of standards.
Year: 2013 PMID: 23620278 PMCID: PMC3692049 DOI: 10.1093/nar/gkt293
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Overview of the four categories of QC results produced by ArrayAnalysis.org
| Category | Graphs and tables computed on raw data | R/Bioconductor packages |
|---|---|---|
| Sample quality | Sample prep controls<sup>a</sup> | simpleaffy, yaqcaffy |
| | 3′/5′ for β-actin and GAPDH<sup>a</sup> | simpleaffy |
| | RNA degradation plot | affy |
| Hybridization and overall signal quality | Spike-in hybrid. controls<sup>a</sup> | simpleaffy |
| | Background intensity<sup>a</sup> | simpleaffy |
| | Percentage present<sup>a</sup> | simpleaffy |
| | Present/Marg./Absent calls | simpleaffy |
| | Pos/Neg control distribution | affyQCReport |
| | All Affymetrix controls | affy, ArrayTools |
| Signal comparability and bias diagnostic | Scale factors<sup>a</sup> | simpleaffy |
| | Boxplot of log-intensity<sup>b</sup> | affy |
| | Density histogram<sup>b</sup> | affy |
| | MA plot<sup>b</sup> | affy |
| | Array reference layout | affy |
| | Pos/Neg controls COI plot | affyQCReport |
| | 2D images | affy, affyPLM |
| | NUSE plot | affyPLM |
| | RLE plot | affyPLM |
| Array correlation | Correlation plot<sup>b</sup> | affy, gplots |
| | Hierarchical clustering dendrogram<sup>b</sup> | affy, bioDist |
| | PCA plot<sup>b</sup> | affy |
| Summary | Summary table | simpleaffy, yaqcaffy |
Twenty plots and one table, classified into four main categories, are generated to assess the quality of the microarray data set. A summary table is composed to give an overview of the quality indicators marked by ‘a’. Six plots, marked by ‘b’, are recomputed after pre-processing the data to evaluate the correction of present artifacts by the normalization. Functionalities from the following Bioconductor libraries are adapted, extended and integrated within the tool: affy, affycomp, affypdnn, affyPLM, affyQCReport, ArrayTools, bioDist, biomaRt, gcrma, gdata, gplots, plier, RColorBrewer, simpleaffy, yaqcaffy. Note that the calculations using the gcrma, plier, simpleaffy and yaqcaffy packages support only the chip types supported by these packages; in case a requested image cannot be constructed, e.g. because of the chip type, the plot is omitted, and a warning is produced.
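The fallback behavior described in the note above (a plot that cannot be constructed is omitted, and a warning is produced) can be illustrated with a minimal Python sketch. This is not the tool's R code. The function names, the chip type identifiers and the returned file names are all hypothetical and serve only to show the pattern.

```python
import warnings

# Hypothetical set of chip types that a plotting routine supports (illustrative only).
SUPPORTED_CHIPS = {"hgu133plus2", "mouse4302"}

def scale_factor_plot(chip_type):
    """Stand-in for a plot routine that only works for some chip types."""
    if chip_type not in SUPPORTED_CHIPS:
        raise ValueError(f"chip type {chip_type!r} is not supported")
    return "scale_factors.png"

def boxplot(chip_type):
    """Stand-in for a plot routine that works for every chip type."""
    return "boxplot.png"

def run_requested_plots(requested, chip_type):
    """Run each requested plot; omit any that fails and emit a warning instead."""
    produced = {}
    for name, plot_fn in requested.items():
        try:
            produced[name] = plot_fn(chip_type)
        except Exception as err:
            warnings.warn(f"Plot '{name}' omitted: {err}")
    return produced

requested = {"scale factors": scale_factor_plot, "boxplot": boxplot}

# Record warnings so the example can show that exactly one plot was skipped.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    produced = run_requested_plots(requested, "unknown_chip")
```

Here `produced` contains only the boxplot, and `caught` holds a single warning naming the omitted plot, so the rest of the run still completes.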
Figure 1. Schematic representation of the three input forms on the web portal composing the input wizard of ArrayAnalysis.org. (a) Upload of a ZIP file with CEL (or zipped CEL) files. (b) Definition of custom sample names and experimental grouping by either uploading a description file or completing the form, in which the CEL file names are prefilled based on the data set uploaded. (c) Selection of plots and computations to be returned: (c1) shows the detected array type, species and number of arrays in the data set and asks for an optional email address, (c2) selects elements in the four categories of plots and indicators applied to the raw data and (c3) defines the pre-processing steps and the plots evaluating these steps. Default settings will depend on the chip type of the uploaded CEL files and may be changed by the user.
Figure 2. Sample of output images provided by the ArrayAnalysis.org QC tool. (a) Summary table of quality indicators; the indicator value is colored blue when within and red when out of the recommended cut-offs. (b) 2D image of the probe level model (PLM) residuals; this plot helps in the visualization of deviating regions on the chips. (c) Background intensity plot; a gray rectangle represents the maximal allowed spread. (d) Boxplot of raw data; this plot is also computed after normalization. (e) Array correlation plot after normalization; a color code of experimental groups eases the interpretation of the plot. (f) PCA analysis; 2D projections of the samples on the three principal components and histogram of explained variances by all components, ordered by decreasing percentage of total variance explained. The online documentation at ArrayAnalysis.org discusses all images produced and their interpretation in detail.
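The blue/red coloring of the summary table in panel (a) amounts to checking each indicator against a cut-off. The Python sketch below shows one way such a check could look. The indicator names, the cut-off values and the sample data are illustrative assumptions, not the values used by ArrayAnalysis.org.

```python
# Hypothetical cut-offs as (lower, upper) bounds; None means unbounded on that side.
# These numbers are placeholders for illustration, not the tool's settings.
CUTOFFS = {
    "beta-actin 3'/5' ratio": (None, 3.0),
    "GAPDH 3'/5' ratio": (None, 1.25),
}

def flag(indicator, value):
    """Return 'blue' when the value lies within the cut-offs, 'red' otherwise."""
    low, high = CUTOFFS[indicator]
    within = (low is None or value >= low) and (high is None or value <= high)
    return "blue" if within else "red"

def summary_table(samples):
    """Attach a color flag to every indicator value of every sample."""
    return {
        sample: {ind: (val, flag(ind, val)) for ind, val in indicators.items()}
        for sample, indicators in samples.items()
    }

# Made-up indicator values for two arrays.
samples = {
    "array1": {"beta-actin 3'/5' ratio": 1.8, "GAPDH 3'/5' ratio": 1.1},
    "array2": {"beta-actin 3'/5' ratio": 4.2, "GAPDH 3'/5' ratio": 1.0},
}
table = summary_table(samples)
```

With these made-up inputs, only the β-actin ratio of `array2` falls outside its bound and is flagged red.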
Figure 3. Structure of the ArrayAnalysis.org QC tool. (a) The ArrayAnalysis.org web server manages the input files and parameters and sends them to a remote calculation server using ssh2 and scp protocols. The R script runs on the calculation server and generates output images and tables. These output files are copied back to the web server, and a QC report and an archive are created from these files and displayed on the screen, together with a link to the normalized data and a log file of the run. (b) The package has a modular setup, which allows both calls via the web portal and local calls, while still using the same core functional code (represented as purple files), facilitating the implementation of updates. Once the settings and input files are correctly registered (this procedure differs for the web and local calls), the core script starts by loading the raw data and assigning sample names and experimental groups when provided. Then, the main script calls custom functions, which are stored in separate R scripts, to compute QC plots and pre-process the data.
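The modular setup in panel (b) can be illustrated by a small Python sketch in which two entry points register their settings differently but hand over to the same core routine. This is a structural illustration only, not the tool's R code. All function names, setting keys and the two toy "QC steps" are hypothetical.

```python
def run_core(settings):
    """Core routine shared by both entry points (hypothetical name)."""
    files = settings["cel_files"]
    # Fall back to file names and a single group when none are provided.
    names = settings.get("sample_names") or files
    groups = settings.get("groups") or ["ungrouped"] * len(files)
    data = [{"file": f, "name": n, "group": g}
            for f, n, g in zip(files, names, groups)]
    # Each step is a separate function, mirroring functions kept in separate scripts.
    results = {step.__name__: step(data) for step in settings["steps"]}
    return {"samples": data, "results": results}

def count_arrays(data):
    """Toy step: number of arrays in the data set."""
    return len(data)

def group_sizes(data):
    """Toy step: number of arrays per experimental group."""
    sizes = {}
    for sample in data:
        sizes[sample["group"]] = sizes.get(sample["group"], 0) + 1
    return sizes

STEPS = [count_arrays, group_sizes]

def from_web_form(form):
    """Web entry point: translate (hypothetical) form fields into core settings."""
    return run_core({
        "cel_files": form["uploaded"],
        "sample_names": form.get("names"),
        "groups": form.get("groups"),
        "steps": STEPS,
    })

def from_local_call(cel_files, sample_names=None, groups=None):
    """Local entry point: pass arguments straight through to the core."""
    return run_core({
        "cel_files": cel_files,
        "sample_names": sample_names,
        "groups": groups,
        "steps": STEPS,
    })

web = from_web_form({"uploaded": ["a.CEL", "b.CEL"], "groups": ["ctrl", "treated"]})
local = from_local_call(["a.CEL", "b.CEL"], groups=["ctrl", "treated"])
```

Because both wrappers only prepare settings, `web` and `local` are identical here, and a change to `run_core` or to a step reaches both call paths at once.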