ARACNe-AP: gene network reverse engineering through adaptive partitioning inference of mutual information.

Alexander Lachmann, Federico M Giorgi, Gonzalo Lopez, Andrea Califano.

Abstract

The accurate reconstruction of gene regulatory networks from large-scale molecular profile datasets represents one of the grand challenges of Systems Biology. The Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNe) represents one of the most effective tools to accomplish this goal. However, the initial Fixed Bandwidth (FB) implementation is both inefficient and unable to deal with sample sets providing largely uneven coverage of the probability density space. Here, we present a completely new implementation of the algorithm, based on an Adaptive Partitioning (AP) strategy for estimating the Mutual Information. The new AP implementation (ARACNe-AP) achieves a dramatic improvement in computational performance (200× on average) over the previous methodology, while preserving the Mutual Information estimator and the network inference accuracy of the original algorithm. Given that the previous version of ARACNe is extremely computationally demanding, the new version of the algorithm will allow even researchers with modest computational resources to build complex regulatory networks from hundreds of gene expression profiles.
AVAILABILITY AND IMPLEMENTATION: A Java cross-platform command-line executable of ARACNe, together with all source code and a detailed usage guide, is freely available on Sourceforge (http://sourceforge.net/projects/aracne-ap). Java version 8 or higher is required.
CONTACT: califano@c2b2.columbia.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press.

Year:  2016        PMID: 27153652      PMCID: PMC4937200          DOI: 10.1093/bioinformatics/btw216

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

We and others have shown that the accurate and systematic dissection of tissue gene regulatory networks (reverse engineering) represents a crucial step in the elucidation of drivers and mechanisms presiding over both physiologic and pathologic phenotypes. Many computational approaches have been proposed for the reverse engineering of gene regulatory networks from large-scale gene expression profile data. Most of these require estimating pairwise gene functions, such as Pearson/Spearman correlation (Mutwil et al., 2011), Mutual Information (MI; Steuer et al., 2002) and linear/LASSO regression (Licausi et al., 2011), amongst others. ARACNe (Margolin et al.) is one of the reverse engineering algorithms most widely used by the scientific community and has been extensively validated experimentally. ARACNe uses an information theoretic framework, based on the data processing inequality theorem, to infer direct regulatory relationships between transcriptional regulator proteins and target genes. ARACNe has been shown to be useful in the reconstruction of context-specific transcriptional networks in multiple tissue types (Lefebvre et al.). Several additional algorithms have emerged that rely on the interrogation of ARACNe networks to successfully predict novel driver genes and mechanisms (Aytes et al., 2014; Della Gatta et al., 2012), as well as drug mechanism of action (Woo et al.). Thanks to the Next-Generation Sequencing revolution, however, ever-expanding RNA-Seq datasets create the need for algorithm improvements to support more computationally efficient inference of genome-wide gene regulatory networks. We introduce a complete redesign of ARACNe that leverages efficient Adaptive Partitioning (AP) Mutual Information estimators (Liang and Wang, 2008). We benchmark the performance improvements of the new implementation on a large breast carcinoma dataset (TCGA, 2012), compared to the previous version of ARACNe, based on the Fixed Bandwidth (FB) algorithm (Margolin et al.).

2 Methods

2.1 The ARACNe pipeline

We replaced the core FB MI-estimator with a new AP-based version and wrote an optimized implementation that uses a series of cached binning operations and 16-bit short integers to store rank-transformed gene expression data. All performance-sensitive parts of the algorithm, including the Data Processing Inequality (DPI) implementation, now support multi-threading, thus taking advantage of available multi-core computer architectures. Additionally, whereas the original ARACNe implementation relied on multiple Matlab scripts for pre- and post-processing and on a core algorithm implemented in C++, the new version is streamlined and entirely implemented in a single Java executable, removing the need for proprietary software and allowing for platform-independent use.

ARACNe requires Gene Expression Profile (GEP) data and a predefined list of gene regulators (e.g. Transcription Factors, TFs) as input. Running ARACNe involves three key steps:

1. MI threshold estimation. This preprocessing step identifies a significance threshold for MI values from the GEPs provided. The threshold depends on the number of samples provided in the input.

2. Bootstrapping/MI network reconstruction. In this phase, MI networks are reconstructed from randomly sampled GEPs; N bootstraps of the data yield N MI networks. The calculation of each network involves three steps: (a) computation of the MI for every TF/target pair after rank-transformation of the GEPs; (b) removal of non-statistically significant connections using the MI threshold; (c) removal of indirect interactions by applying a Data Processing Inequality tolerance filter (DPI, Margolin et al.).

3. Building the consensus network. A consensus network is computed by estimating the statistical significance of the number of times a specific edge is detected across all bootstrap runs, based on a Poisson distribution. Only significant pairs are kept (P < 0.05, Bonferroni corrected).
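The DPI filter (step c) and the consensus step described above can be illustrated with a minimal Python sketch. It is not the ARACNe-AP source code: the function names, the `tolerance` parameter, the choice of the Poisson rate (mean number of detections per candidate edge) and the use of the number of candidate edges as the Bonferroni factor are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def dpi_prune(mi, tolerance=0.0):
    """Drop the weakest edge of every fully connected gene triplet.

    mi maps frozenset({gene_a, gene_b}) -> MI value. Edges are only marked
    while scanning and removed at the end, so the result does not depend on
    the order in which triplets are visited. `tolerance` is hypothetical.
    """
    nodes = sorted({gene for edge in mi for gene in edge})
    doomed = set()
    for a, b, c in combinations(nodes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(edge in mi for edge in triangle):
            weakest = min(triangle, key=mi.get)
            others = [mi[e] for e in triangle if e != weakest]
            if mi[weakest] < min(others) * (1.0 - tolerance):
                doomed.add(weakest)
    return {edge: value for edge, value in mi.items() if edge not in doomed}


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam); adequate for a sketch, not for tiny tails."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def consensus_edges(bootstrap_networks, n_candidate_edges, alpha=0.05):
    """Keep edges detected more often across bootstraps than expected by chance."""
    counts = Counter(edge for net in bootstrap_networks for edge in net)
    lam = sum(counts.values()) / n_candidate_edges  # assumed null rate
    return {
        edge
        for edge, k in counts.items()
        if poisson_upper_tail(k, lam) * n_candidate_edges < alpha  # Bonferroni
    }


ab, bc, ac = frozenset(("A", "B")), frozenset(("B", "C")), frozenset(("A", "C"))
print(dpi_prune({ab: 0.9, bc: 0.8, ac: 0.3}))
print(consensus_edges([{ab, bc}, {ab}, {ab, ac}], n_candidate_edges=100))
```

In the toy triplet the A/C edge carries the lowest MI and is treated as indirect; in the toy bootstraps only the A/B edge recurs often enough to pass the corrected threshold.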

2.2 Adaptive partitioning

The Mutual Information between two variables is a probabilistic measure of their statistical dependence (Steuer et al., 2002). Computing the MI from gene expression profiles usually requires estimating joint and marginal gene expression probability densities. The original ARACNe implementation was based on the FB MI-estimator, which achieved this by dividing the gene expression space into discrete, equisized bins (Margolin et al.). The number of bins depended on the number of samples and had to be chosen in a preprocessing step. To address these limitations, we introduced an alternative AP-estimator. The two-dimensional space is still divided into discrete bins but, in contrast to the FB algorithm, there is no preset partition size. Rather, the space is divided adaptively, following the local data distribution. The space is split recursively into quadrants at the means (Fig. 1A). The recursion stops when the data points are uniformly distributed among the newly created quadrants (as assessed by a χ² test) or when fewer than three data points fall into the quadrant to be split.
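The recursive partitioning described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the ARACNe-AP implementation: it assumes that each cell is split at its midpoint on the rank scale, that uniformity is tested against the χ² critical value for 3 degrees of freedom at the 0.05 level (7.815), and that each final cell contributes p_xy · log(p_xy / (p_x · p_y)) in nats, with the marginals given by the cell widths on the rank scale.

```python
import math

CHI2_CRIT = 7.815  # chi-square, 3 degrees of freedom, alpha = 0.05 (assumed level)


def rank_transform(values):
    """Replace each value by its 0-based rank (ties broken by input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


def _cell_mi(points, x0, x1, y0, y1, n):
    """MI contribution of the points inside the cell [x0, x1) x [y0, y1)."""
    k = len(points)
    if k == 0:
        return 0.0
    if k >= 3:  # cells with fewer than three points are never split
        xm, ym = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        quads = [[], [], [], []]
        for px, py in points:
            quads[2 * (px >= xm) + (py >= ym)].append((px, py))
        expected = k / 4.0
        chi2 = sum((len(q) - expected) ** 2 / expected for q in quads)
        if chi2 > CHI2_CRIT:  # non-uniform: recurse into the four quadrants
            bounds = [(x0, xm, y0, ym), (x0, xm, ym, y1),
                      (xm, x1, y0, ym), (xm, x1, ym, y1)]
            return sum(_cell_mi(q, *b, n) for q, b in zip(quads, bounds))
    p_joint = k / n
    p_x, p_y = (x1 - x0) / n, (y1 - y0) / n
    return p_joint * math.log(p_joint / (p_x * p_y))


def ap_mutual_information(x, y):
    """Adaptive-partitioning MI estimate (in nats) for two equal-length profiles."""
    n = len(x)
    points = list(zip(rank_transform(x), rank_transform(y)))
    return _cell_mi(points, 0.0, float(n), 0.0, float(n), n)


xs = list(range(64))
print(ap_mutual_information(xs, xs))                   # strongly dependent pair
print(ap_mutual_information(xs, [63 - v for v in xs])) # inverse relationship
```

Because the estimate is computed on ranks, a monotonic increasing and a monotonic decreasing relationship receive the same score in this sketch, and a pair whose points fill the four top-level quadrants evenly is never split and scores zero.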
Fig. 1. (A) Expression values of E2F1 and CCND1 in the TCGA breast carcinoma dataset. Shown are the binning steps of the Adaptive Partitioning to infer pairwise Mutual Information. (B) Comparison between FB-inferred (x-axis) and AP-inferred (y-axis) MI values for all TF/gene pairs in the breast cancer dataset.

2.3 Datasets/hardware

To test the performance of ARACNe-AP in terms of speed and qualitative MI assessment, we ran multiple benchmarks. We compared computational speed and tested the impact of the AP estimation of the joint density distribution compared to FB using 533 TCGA Breast invasive carcinoma samples (TCGA, 2012). The raw transcript counts were RPKM-transformed and genes with zero counts were filtered out, leaving 13 812 genes. As regulators, we used 1331 genes annotated as 'regulators of transcription' and 'DNA-binding' in Gene Ontology (GO, 2013). We calculated all pairwise MIs between 20 318 genes and expressed the speed as MIs per second. All the tests shown were performed on a multi-core Intel Xeon E5-2630 CPU.
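As a rough illustration of the preprocessing mentioned above, the following Python sketch applies the standard RPKM definition (reads × 10⁹ / (gene length in bp × total mapped reads in the sample)) and a zero-count filter. The function names and toy numbers are hypothetical, and the filter assumes that a gene is dropped when its count is zero in any sample, which the text does not specify.

```python
def rpkm(counts, lengths_bp):
    """RPKM for one sample: counts and gene lengths (bp) are parallel lists."""
    total_reads = sum(counts)
    return [c * 1e9 / (total_reads * length)
            for c, length in zip(counts, lengths_bp)]


def drop_zero_count_genes(count_matrix):
    """Keep genes with no zero count in any sample (assumed filter rule).

    count_matrix maps gene name -> list of raw counts, one per sample.
    """
    return {gene: row for gene, row in count_matrix.items() if min(row) > 0}


# Toy sample with three genes of 1, 2 and 3 kb and 1000 mapped reads in total.
print(rpkm([100, 300, 600], [1000, 2000, 3000]))
print(drop_zero_count_genes({"G1": [5, 3], "G2": [0, 7], "G3": [2, 9]}))
```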

3 Results and discussion

We ran ARACNe-AP on the TCGA Breast invasive carcinoma dataset, obtaining a network with 1331 regulators, 13 546 targets and 100 580 interactions. ARACNe-AP maintains the capability to identify regulator-target relationships that would otherwise be missed by simple correlation techniques or other linear similarity measures (Supplementary Fig. S1), e.g. that between E2F1 and CCND1 (Cyclin D1) (Fig. 1A), a ChIP-Seq validated interaction (Lachmann et al., 2010) that controls cell cycle progression (Sherr, 1994). The example of E2F1 regulating CCND1 highlights the advantage of non-linear measures such as MI in identifying complex gene interactions. Indeed, the Pearson correlation between E2F1 and CCND1 is close to 0 and not statistically significant (P = 0.4), while the MI is highly significant (P = 10⁻⁸). The data show two sets of statistically independent relationships between these two genes. A subset of the samples supports a positive correlation, consistent with E2F1 promoting Cyclin D1 transcription indirectly through Ras pathway activation (Berkovich et al., 2003), which in turn up-regulates Cyclin D1 mRNA synthesis (Croft and Olson, 2006). The negatively correlated samples support the fact that E2F1 can directly inhibit the transcription of Cyclin D1 (Watanabe et al.). Both the FB and the AP estimators achieve similar accuracy in the estimation of gene-pair MI. While the absolute MI values of the two methods are not directly comparable, the AP algorithm ranks gene pairs (based on their MI) similarly to the FB algorithm (Fig. 1B). Networks obtained with ARACNe-AP can be used to calculate regulator activity on a sample-by-sample basis using the ssMARINA algorithm (Aytes et al., 2014). The networks obtained via the FB and AP methods produce nearly identical inferences of regulatory protein activity, based on the differential expression of their regulons (Supplementary Fig. S4). However, the AP version of ARACNe provides a 200× gain in computational efficiency, thus greatly reducing execution time.
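The contrast between a near-zero Pearson correlation and a clearly positive MI, as reported above for E2F1 and CCND1, can be reproduced on synthetic data. The Python sketch below uses hypothetical toy values (not expression data) in which half of the points follow y = x and half follow y = −x, and a simple equal-frequency, four-bin plug-in MI estimate rather than the AP or FB estimators.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins rank-based bins of equal size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def binned_mi(x, y, n_bins=4):
    """Plug-in MI estimate (nats) from an n_bins x n_bins contingency table."""
    n = len(x)
    bx, by = equal_frequency_bins(x, n_bins), equal_frequency_bins(y, n_bins)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


levels = [-4, -3, -2, -1, 1, 2, 3, 4]
x = levels + levels                      # two sub-populations of samples
y = levels + [-v for v in levels]        # positive branch, then negative branch
print(pearson(x, y), binned_mi(x, y))
```

The two branches cancel exactly in the covariance, so the linear measure sees no relationship, whereas the binned joint distribution is far from the product of its marginals.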
Specifically, the optimized AP implementation (ARACNe-AP) can process an average of 31 610 MI/s, compared to only 160 MI/s with the original ARACNe implementation (ARACNe-FB) (Supplementary Fig. S2). Furthermore, ARACNe-AP is fully multi-threaded, yielding an additional speed increase on typical CPUs proportional to the number of available cores. A mainstream multithreaded CPU can process almost 200 000 gene-pair MI/s (Supplementary Fig. S5). ARACNe-AP is also more efficient (2× on average) in terms of memory usage than ARACNe-FB, owing to optimization and the use of 16-bit short integers to store rank-transformed gene expression values, which allows the processing of datasets of up to 65 536 samples (Supplementary Fig. S3). In conclusion, ARACNe-AP is more than two orders of magnitude faster than the previous ARACNe-FB implementation, while requiring only 50% of the memory. Among others, the ARACNe-AP implementation has been successfully applied to reverse engineering a T-ALL context-specific transcriptional network, which resulted in the elucidation of RUNX1 as a tumor suppressor gene in this cancer (Della Gatta et al., 2012), and to reverse engineering a prostate cancer-specific network, leading to the identification of FOXM1 and CENPF as synergistic Master Regulators of aggressive disease (Aytes et al., 2014). Networks inferred by the improved algorithm are virtually identical to those inferred by the original one, both in terms of pairwise MI inference and overall network topology. Yet, the improvements provided by the new implementation have critical repercussions in the field of gene network analysis, as they allow the reconstruction of gene networks from datasets with up to 500 samples in less than one hour. Thus, a 100-bootstrap ARACNe analysis can be run on a standard desktop computer without the need for specialized supercomputers.
In contrast, a 100-bootstrap run of ARACNe-FB would require a minimum of 100 supercomputing cores to be completed in a comparable amount of time, thus requiring expensive computational infrastructure not available to the majority of researchers. Finally, removal of proprietary Matlab code and consolidation of the algorithm into a single, platform-independent Java executable significantly increase ease of use and deployment.
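The link between 16-bit storage of rank-transformed values and the limit of 65 536 samples noted above can be made concrete with a short Python sketch. It uses the standard-library `array` type with unsigned 16-bit items as a stand-in for the Java `short` values used by ARACNe-AP; the function name is hypothetical, and how the Java code maps ranks onto signed shorts is not described in the text.

```python
from array import array


def to_rank_shorts(values):
    """Rank-transform one expression profile and pack the ranks in 16 bits each."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = array("H", [0] * len(values))  # "H": unsigned short
    for rank, i in enumerate(order):
        ranks[i] = rank  # raises OverflowError for ranks above 65535
    return ranks


profile = to_rank_shorts([5.2, 0.1, 3.3])
print(list(profile), profile.itemsize)

# 2 ** 16 distinct ranks (0..65535) fit, hence at most 65 536 samples per gene.
print(2 ** 16)

# Rank matrix for the benchmark dimensions quoted in the text, in bytes.
print(13812 * 533 * profile.itemsize)
```

With two bytes per value, the rank matrix for 13 812 genes and 533 samples occupies roughly 14 MiB on platforms where an unsigned short is two bytes wide.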
References (10 of 16 listed)

1. The mutual information: detecting and evaluating dependencies between variables.
Authors: R Steuer; J Kurths; C O Daub; J Weise; J Selbig
Journal: Bioinformatics. Date: 2002

2. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments.
Authors: Alexander Lachmann; Huilei Xu; Jayanth Krishnan; Seth I Berger; Amin R Mazloom; Avi Ma'ayan
Journal: Bioinformatics. Date: 2010-08-13

3. HRE-type genes are regulated by growth-related changes in internal oxygen concentrations during the normal development of potato (Solanum tuberosum) tubers.
Authors: Francesco Licausi; Federico Manuel Giorgi; Elmar Schmälzlin; Björn Usadel; Pierdomenico Perata; Joost Thomas van Dongen; Peter Geigenberger
Journal: Plant Cell Physiol. Date: 2011-09-27

4. Gene regulatory network reconstruction using conditional mutual information.
Authors: Kuo-Ching Liang; Xiaodong Wang
Journal: EURASIP J Bioinform Syst Biol. Date: 2008

5. The Rho GTPase effector ROCK regulates cyclin A, cyclin D1, and p27Kip1 levels by distinct mechanisms.
Authors: Daniel R Croft; Michael F Olson
Journal: Mol Cell Biol. Date: 2006-06

6. Cross-species regulatory network analysis identifies a synergistic interaction between FOXM1 and CENPF that drives prostate cancer malignancy.
Authors: Alvaro Aytes; Antonina Mitrofanova; Celine Lefebvre; Mariano J Alvarez; Mireia Castillo-Martin; Tian Zheng; James A Eastham; Anuradha Gopalan; Kenneth J Pienta; Michael M Shen; Andrea Califano; Cory Abate-Shen
Journal: Cancer Cell. Date: 2014-05-12

7. PlaNet: combined sequence and expression comparisons across plant networks derived from seven species.
Authors: Marek Mutwil; Sebastian Klie; Takayuki Tohge; Federico M Giorgi; Olivia Wilkins; Malcolm M Campbell; Alisdair R Fernie; Björn Usadel; Zoran Nikoloski; Staffan Persson
Journal: Plant Cell. Date: 2011-03-25

8. G1 phase progression: cycling on cue. (Review)
Authors: C J Sherr
Journal: Cell. Date: 1994-11-18

9. E2F and Ras synergize in transcriptionally activating p14ARF expression.
Authors: Eli Berkovich; Yocheved Lamed; Doron Ginsberg
Journal: Cell Cycle. Date: 2003 Mar-Apr

10. Reverse engineering of TLX oncogenic transcriptional networks identifies RUNX1 as tumor suppressor in T-ALL.
Authors: Giusy Della Gatta; Teresa Palomero; Arianne Perez-Garcia; Alberto Ambesi-Impiombato; Mukesh Bansal; Zachary W Carpenter; Kim De Keersmaecker; Xavier Sole; Luyao Xu; Elisabeth Paietta; Janis Racevskis; Peter H Wiernik; Jacob M Rowe; Jules P Meijerink; Andrea Califano; Adolfo A Ferrando
Journal: Nat Med. Date: 2012-02-26
