aRrayLasso: a network-based approach to microarray interconversion.

Abstract

Robust conversion between microarray platforms is needed to leverage the wide variety of microarray expression studies that have been conducted to date. Currently available conversion methods rely on manufacturer annotations, which are often incomplete, or on direct alignment of probes from different platforms, which often fails to yield acceptable genewise correlation. Here, we describe aRrayLasso, which uses the Lasso-penalized generalized linear model to model the relationships between individual probes in different probe sets. We have implemented aRrayLasso in a set of five open-source R functions that allow the user to acquire data from public sources such as Gene Expression Omnibus, train a set of Lasso models on that data and directly map one microarray platform to another. aRrayLasso predicts expression levels with fidelity similar to that of technical replicates of the same RNA pool, demonstrating its utility in the integration of datasets from different platforms.
AVAILABILITY AND IMPLEMENTATION: All functions are available, along with descriptions, at https://github.com/adam-sam-brown/aRrayLasso.
CONTACT: chirag_patel@hms.harvard.edu.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances:
Oligonucleotide Probes

Year: 2015 PMID: 26283699 PMCID: PMC4653393 DOI: 10.1093/bioinformatics/btv469

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

A pressing issue in translational biology is the ability to reference and utilize historical microarray datasets for large-scale discovery programs (Tsiliki et al., 2014). The appeal of historical datasets lies in capturing previous investment to construct larger cohorts. Despite interest in both industry and academia (Tsiliki et al., 2014; Yengi, 2005), few groups have attempted to tackle the problem of platform integration. Current approaches primarily rely upon passing different microarray platforms through a common identifier system, such as EntrezGene IDs, using specially designed packages (Alibes et al., 2007; Mohammad et al., 2012) or online tools (Huang et al.). While these systems work well in cases where manufacturers have maintained annotations of their microarray databases, ID-based conversion methods fail for deprecated and undermaintained microarray platforms. Another approach to converting between platforms is sequence-based, wherein each sequence tag is aligned to the genome or transcriptome and annotated (Fumagalli et al., 2014; Liu et al., 2007). Unfortunately, de novo annotations often do not capture the complexity of the transcriptome (e.g. for genes with alternative splice variants; Gambino et al., 2015). To address the shortcomings of both annotation- and sequence-based conversion methods, we have developed aRrayLasso, a Lasso-regression based network model. Our method directly predicts the probe expression levels of the target platform. To demonstrate the accuracy of our method, we show that predictions made using aRrayLasso are of similar accuracy to technical replicates from the same mRNA pool. Our methodology allows users to apply currently available methodologies for integrating cross-experiment microarray datasets (Tsiliki et al., 2014) and allows for the construction of large-cohort retrospective studies.

2 Methods

To convert from a source to a target microarray platform, we chose to model each individual sequence tag in the target platform as a linear combination of all sequence tags from the source platform (see Fig. 1 and Supplementary Methods). Because microarrays have more than 10 000 individual probes, we chose to use the Lasso algorithm for generalized linear regression (Friedman et al., 2010). The Lasso algorithm allows the resulting linear model to be ‘sparse’ in that only the most relevant and robust (by cross-validation) predictors are assigned non-zero coefficients. This optimization allows the model to outperform similar models that require all predictors to be assigned non-zero coefficients (Tibshirani et al., 2012). Lasso is implemented in the R package ‘glmnet’, allowing for ease of use (Friedman et al., 2010).
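The per-probe modelling described above can be sketched in a few lines. The following is an illustrative stand-in written in plain Python, not the aRrayLasso or glmnet code: it fits each target probe with a simple coordinate-descent Lasso at a fixed penalty, whereas the paper's implementation relies on glmnet and selects predictors by cross-validation. All function names (`fit_lasso`, `train_platform_models`, `predict`), the toy matrices and the penalty value are assumptions made for this example.

```python
# Illustrative sketch only: a plain coordinate-descent Lasso with a fixed penalty.
# It is NOT the glmnet code path used by aRrayLasso; all names here are hypothetical.

def soft_threshold(value, penalty):
    """Shrink value toward zero; anything within the penalty becomes exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def fit_lasso(source, target, penalty, n_sweeps=200):
    """Fit one target probe as a sparse linear combination of source probes.

    source: M rows (samples), each holding P source-probe values.
    target: M values of a single target probe.
    Minimises (1 / 2M) * sum(residual^2) + penalty * sum(|coefficient|).
    Returns (intercept, coefficients).
    """
    m, p = len(source), len(source[0])
    col_means = [sum(row[j] for row in source) / m for j in range(p)]
    y_mean = sum(target) / m
    x = [[row[j] - col_means[j] for j in range(p)] for row in source]
    residual = [t - y_mean for t in target]
    coefs = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            scale = sum(x[i][j] ** 2 for i in range(m)) / m
            if scale == 0.0:
                continue  # a constant probe carries no information
            # Covariance of probe j with the residual, with probe j's own
            # current contribution added back in.
            rho = sum(x[i][j] * (residual[i] + x[i][j] * coefs[j]) for i in range(m)) / m
            new_coef = soft_threshold(rho, penalty) / scale
            delta = new_coef - coefs[j]
            if delta != 0.0:
                for i in range(m):
                    residual[i] -= x[i][j] * delta
                coefs[j] = new_coef
    intercept = y_mean - sum(c * mu for c, mu in zip(coefs, col_means))
    return intercept, coefs


def train_platform_models(source, target_matrix, penalty):
    """Fit one Lasso model per target probe (one per column of the M x N target matrix)."""
    n_probes = len(target_matrix[0])
    return [
        fit_lasso(source, [row[k] for row in target_matrix], penalty)
        for k in range(n_probes)
    ]


def predict(models, source_row):
    """Map one source-platform sample onto every target probe."""
    return [b0 + sum(c * v for c, v in zip(coefs, source_row)) for b0, coefs in models]


# Toy matched samples: 6 samples, 3 source probes, 2 target probes.
source = [[1, 5, 2], [2, 3, 7], [3, 8, 1], [4, 1, 6], [5, 9, 3], [6, 4, 8]]
# Target probe 0 tracks source probe 0; target probe 1 tracks source probe 1 (inverted).
target = [[2 * row[0] + 1, 10 - row[1]] for row in source]

models = train_platform_models(source, target, penalty=0.05)
predicted = predict(models, source[2])
```

In this toy case the model for the first target probe keeps a non-zero coefficient only on the source probe that actually drives it, which is the sparsity property the paragraph above refers to. In practice the penalty would be chosen by cross-validation rather than fixed by hand.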

Fig. 1.

Schematic of the aRrayLasso algorithm. aRrayLasso takes in an M×N target matrix containing M samples and N probes. A Lasso model, f, is then constructed for each target probe using all probes in the M×P source matrix (M samples, P probes).

We first generate a list of Lasso models, one for each sequence tag in the target microarray platform. Our implementation can take as input a variety of data formats, including expression matrices, R expressionSet objects and Gene Expression Omnibus accession numbers (Edgar et al., 2002). Once the full list of models has been computed, we provide functions that allow either the straightforward prediction of sequence tag values or the validation of the model list by calculation of Pearson product-moment correlation coefficients. To demonstrate the utility of our methodology, we utilized three datasets: (i) GSE6313, containing C57/B6 adult mouse retina cDNA profiles (Liu et al., 2007), (ii) GSE7785, containing PANC-1 derived cDNA profiles (Tan et al.) and (iii) GSE4854, containing mouse cortex expression profiles (Kuo et al., 2006). Each dataset is composed of multiple technical replicates for several distinct microarray platforms (see Supplementary Table S1). For both datasets, we used aRrayLasso to first train models to interconvert between each pair of platforms and then predicted interconversions between each pair of platforms for all technical replicates. To assess the accuracy of our conversions, we calculated the average Pearson’s r between the predicted values and actual experimental values for each platform and replicate. We also calculated the average inter-replicate Pearson’s r for each platform (see Supplementary Table S2).
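The two accuracy measures described above can be sketched as follows. This is a minimal illustration, not the package's validation function: it assumes that Pearson's r is computed across probes for each replicate and then averaged, and that the inter-replicate value is the mean r over all pairs of replicates on one platform. The function names and the toy numbers are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length profiles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_prediction_r(predicted, actual):
    """Average r between predicted and measured profiles, one r per replicate.

    Assumption: each row is one replicate and each column is one probe.
    """
    rs = [pearson_r(p, a) for p, a in zip(predicted, actual)]
    return sum(rs) / len(rs)


def mean_inter_replicate_r(replicates):
    """Average r over every pair of technical replicates from one platform."""
    rs = [pearson_r(a, b) for a, b in combinations(replicates, 2)]
    return sum(rs) / len(rs)


# Toy example: three technical replicates measured on the target platform,
# and hypothetical converted profiles for the same three samples.
measured = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 1.9, 3.2, 3.9],
    [0.9, 2.1, 2.8, 4.1],
]
converted = [
    [1.2, 2.1, 2.9, 4.2],
    [1.0, 2.0, 3.1, 3.8],
    [1.0, 2.2, 2.7, 4.0],
]

prediction_r = mean_prediction_r(converted, measured)
replicate_r = mean_inter_replicate_r(measured)
```

Comparing `prediction_r` with `replicate_r` mirrors the comparison made in the paper: a conversion is considered successful when its correlation with the measured values is on par with the agreement between technical replicates.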

3 Results

To explore the performance of aRrayLasso, we began by comparing our method’s ability to predict expression to the variation between technical replicates on the same platform. We assessed the degree to which aRrayLasso could accurately predict platform interconversions in three datasets, representative of different experimental systems, organisms and platforms. For the five platforms tested, aRrayLasso predictions are within the technical variation of each microarray platform when compared with technical replicates from the same cDNA pool, even when subjected to multiple sequential conversions (Supplementary Table S2). In addition, once built, aRrayLasso models can be used between experimental conditions: using the models built on GSE6313, we predicted expression levels in GSE4854 with no significant loss of signal (Pearson’s product-moment correlation, P < 0.38). While the results presented here do not guarantee similar results for all training and testing datasets, these analyses serve as a promising proof of concept. Furthermore, our success with a relatively small dataset suggests that aRrayLasso may reach even higher levels of performance as the size of the datasets involved increases.

4 Discussion

Implementation: In this investigation, we propose a data-driven method for integrating across high-throughput genomic measurement modalities that avoids the use of annotation- or sequence alignment-based tools. We have implemented a Lasso regression-based modeling approach to model the expression level of each sequence tag in a target microarray as a linear combination of all sequence tags in a source microarray. Our implementation represents a straightforward, easy-to-use and open-source methodology for conversion between microarray platforms.

Limitations: One drawback of our method is the need for extant or newly generated matched samples in the source and target platforms. In our experience, however, a large number of available datasets have matched samples with replicates for a number of popular microarray platforms. A second limitation of our method arises in conversions between platforms that lack overlap in gene coverage. In these cases, as with currently available methodologies, our method will fail to provide meaningful conversions. Lastly, while we have shown in one case that interexperiment conversions are feasible, we caution that systematic technical error in a single experiment may lead to the creation of a biased model. In general, however, when coupled with one of several cross-experiment dataset integration tools, aRrayLasso will enable mining of the remarkable and untapped historical pool of microarray datasets for large-scale, well-powered metastudies.

13 in total

1. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository.

Authors: Ron Edgar; Michael Domrachev; Alex E Lash
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

2. A sequence-oriented comparison of gene expression measurements across different hybridization-based technologies.

Authors: Winston Patrick Kuo; Fang Liu; Jeff Trimarchi; Claudio Punzo; Michael Lombardi; Jasjit Sarang; Mark E Whipple; Malini Maysuria; Kyle Serikawa; Sun Young Lee; Donald McCrann; Jason Kang; Jeffrey R Shearstone; Jocelyn Burke; Daniel J Park; Xiaowei Wang; Trent L Rector; Paola Ricciardi-Castagnoli; Steven Perrin; Sangdun Choi; Roger Bumgarner; Ju Han Kim; Glenn F Short; Mason W Freeman; Brian Seed; Roderick Jensen; George M Church; Eivind Hovig; Connie L Cepko; Peter Park; Lucila Ohno-Machado; Tor-Kristian Jenssen
Journal: Nat Biotechnol Date: 2006-07-02 Impact factor: 54.908

3. On integrating multi-experiment microarray data.

Authors: Georgia Tsiliki; Dimitrios Vlachakis; Sophia Kossida
Journal: Philos Trans A Math Phys Eng Sci Date: 2014-04-21 Impact factor: 4.226

4. Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors: Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal: J Stat Softw Date: 2010 Impact factor: 6.440

5. Strong rules for discarding predictors in lasso-type problems.

Authors: Robert Tibshirani; Jacob Bien; Jerome Friedman; Trevor Hastie; Noah Simon; Jonathan Taylor; Ryan J Tibshirani
Journal: J R Stat Soc Series B Stat Methodol Date: 2012-03 Impact factor: 4.488

6. Comparison of hybridization-based and sequencing-based gene expression technologies on biological replicates.

Authors: Fang Liu; Tor-Kristian Jenssen; Jeff Trimarchi; Claudio Punzo; Connie L Cepko; Lucila Ohno-Machado; Eivind Hovig; Winston Patrick Kuo
Journal: BMC Genomics Date: 2007-06-07 Impact factor: 3.969

7. IDconverter and IDClight: conversion and annotation of gene and protein IDs.

Authors: Andreu Alibés; Patricio Yankilevich; Andrés Cañada; Ramón Díaz-Uriarte
Journal: BMC Bioinformatics Date: 2007-01-10 Impact factor: 3.169

8. Characterization of three alternative transcripts of the BRCA1 gene in patients with breast cancer and a family history of breast and/or ovarian cancer who tested negative for pathogenic mutations.

Authors: Gaetana Gambino; Mariella Tancredi; Elisabetta Falaschi; Paolo Aretini; Maria Adelaide Caligo
Journal: Int J Mol Med Date: 2015-02-16 Impact factor: 4.101

9. Transfer of clinically relevant gene expression signatures in breast cancer: from Affymetrix microarray to Illumina RNA-Sequencing technology.

Authors: Debora Fumagalli; Alexis Blanchet-Cohen; David Brown; Christine Desmedt; David Gacquer; Stefan Michiels; Françoise Rothé; Samira Majjaj; Roberto Salgado; Denis Larsimont; Michail Ignatiadis; Marion Maetens; Martine Piccart; Vincent Detours; Christos Sotiriou; Benjamin Haibe-Kains
Journal: BMC Genomics Date: 2014-11-21 Impact factor: 3.969

10. AbsIDconvert: an absolute approach for converting genetic identifiers at different granularities.

Authors: Fahim Mohammad; Robert M Flight; Benjamin J Harrison; Jeffrey C Petruska; Eric C Rouchka
Journal: BMC Bioinformatics Date: 2012-09-12 Impact factor: 3.169