Integrative 'omic' approach towards understanding the nature of human diseases.

Abstract

Improving technologies for interrogating global molecular alterations in human diseases, together with increases in computational capacity, have enabled unprecedented insight into disease etiology and pathogenesis and have opened new possibilities for biomarker development. A large body of data has accumulated over recent years, with the most prominent increase in information originating from genomic, transcriptomic and proteomic profiling. However, the complexity of these data has made it difficult to discover high-order disease mechanisms that involve several biological layers, and new approaches have therefore been required to integrate such data into a complete representation of the molecular events occurring at the cellular level. For this reason, we developed a new mode of integrating results from heterogeneous origins, using rank statistics of the results from each profiling level. Because the increased use of next-generation sequencing technology ties experimental information ever more closely to sequence information, we decided to synthesize the heterogeneous results using their genomic position. We therefore propose a novel positional integratomic approach toward studying 'omic' information in human disease.

Keywords: Data integration; Genomics; High-throughput technologies; Transcriptomics

Year: 2012 PMID: 24052743 PMCID: PMC3776674 DOI: 10.2478/v10034-012-0018-7

Source DB: PubMed Journal: Balkan J Med Genet ISSN: 1311-0160 Impact factor: 0.519

INTRODUCTION

The development of microarray technology in the last decade and the upsurge of next-generation sequencing in the last few years have provided an abundance of data originating from various biological levels, most prominently from the genomic and transcriptomic levels [1,2]. Such novel approaches have contributed greatly to our understanding of physiological cellular processes, as well as of the molecular changes that occur in human disease. The high-dimensional nature of the data originating from these studies has also opened an array of new theoretical and statistical challenges that had to be faced in order to attain acceptable reproducibility and consistency of scientific results [3]. In particular, the large number of hypotheses tested in a single experiment produces a substantial amount of statistical noise, causing large numbers of false-positive detections and undue omission of many true-positive results. Although statistical methods have been developed to address these issues, difficulties in increasing the specificity and sensitivity of highly parallel approaches still exist, and they are most notorious for the group of common, complex human disorders. In an attempt to alleviate these drawbacks, we developed a method that harnesses the biological relations between data originating from studies investigating human disease on various biological levels. For example, genomic alterations associated with a human disease such as multiple sclerosis (MS) are usually investigated and interpreted separately from the transcriptomic alterations occurring in MS. The biological relation between these two layers may thus be utilized to prioritize genes that were detected on both layers, thereby reducing noisy results and facilitating detection of true biological alterations. We expect that, as more biological layers and more studies are included in the database used for integration, the comprehensiveness and biological validity of the prioritized genes will increase progressively.

MATERIALS AND METHODS

The pathway towards constructing the initial database used for subsequent integration is highly dependent on the disease of interest. While some common disorders have been investigated in several ‘omic’ studies covering several biological levels, source data for other diseases may be scarcer. The search for data sources should begin with an overview of the literature published to date. Once the investigator is familiar with the studies performed, the published reports and the tables in their supplemental materials may be used to extract lists of genes or other genomic features with significant alterations. A crucial step in obtaining high-quality data sources is inspection of the data sets stored in public data repositories. These tend to be highly specialized for the biological layer under investigation. Genomic data from genome-wide association studies may be extracted from dbGaP (http://www.ncbi.nlm.nih.gov/gap); epigenomic, transcriptomic and methylomic data from ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) or Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/); and next-generation sequencing data from the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/) and the Sequence Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra) [4-7]. After all the sources have been investigated, a database of features [genes, mRNAs, microRNAs (miRNAs), CpG islands, proteins and others] with significant alterations in the chosen disease should be prepared for each included study. We also advise collecting information, such as significance values and fold-change values, on which the prioritization of features within each biological layer will be based in later steps. If this information is not available, all significant alterations in a given study will carry the same importance in the integration.
In the following section, significant results from various study types are collectively referred to as “signals” for clarity.

Data Integration. Before data can be integrated, they have to be reduced to a common denominator. Due to the increasing heterogeneity of genetic information, tying biological information to gene-level annotation is becoming increasingly difficult. Genomic variation and methylation patterns are two examples of information that is prohibitively difficult to associate with genes in any straightforward manner, as such alterations occur within genes, between genes or across several genes. For this reason, we opted for an integration based on the genomic position of features originating from the various data sources. This requires the signals from all data sources to be converted to their genomic positions and projected onto the genome assembly backbone. This step makes it possible to omit the difficult annotation conversion steps otherwise required before final integration, greatly simplifying the synthesis of heterogeneous data. After the signals are positioned on the genomic backbone, the complete assembly is divided into bins of equal size. For each study, each bin is given a score that depends on the scores of the alterations residing in that segment of the genome. The bins are then prioritized by score and their ranks calculated. Integration is achieved by calculating the non-parametric rank product for each bin from the ranks that the bin attained in each data source, as we have previously described [8]. A lower final rank product signifies that the bin attained higher ranks on several separate biological layers [9]. These bins therefore represent genomic regions where signals accumulate on various biological levels, and thus regions of interest for further investigation. Finally, a permutation test may be employed to determine the significance of signal accumulation in each bin [8]. An overview of the process is shown in Figure 1.
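The binning, ranking, rank product and permutation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the published implementation: the bin width, the summing of signal scores within a bin, the average-rank handling of ties, the add-one permutation p value, and all coordinates and scores in the toy data are hypothetical choices made for the example.

```python
import math
import random
from collections import defaultdict

BIN_SIZE = 1_000_000  # assumed bin width in base pairs (hypothetical choice)


def bin_scores(signals, bin_size=BIN_SIZE):
    """Sum signal scores per (chromosome, bin index) for one data source."""
    scores = defaultdict(float)
    for chrom, position, score in signals:
        scores[(chrom, position // bin_size)] += score
    return dict(scores)


def rank_bins(scores, all_bins):
    """Rank bins by descending score (1 = best); tied bins share the average rank."""
    ordered = sorted(all_bins, key=lambda b: -scores.get(b, 0.0))
    ranks = {}
    start = 0
    while start < len(ordered):
        end = start
        value = scores.get(ordered[start], 0.0)
        while end + 1 < len(ordered) and scores.get(ordered[end + 1], 0.0) == value:
            end += 1
        average_rank = (start + end) / 2 + 1
        for b in ordered[start:end + 1]:
            ranks[b] = average_rank
        start = end + 1
    return ranks


def rank_products(rank_tables, all_bins):
    """Product of each bin's ranks across data sources (lower = stronger)."""
    return {b: math.prod(table[b] for table in rank_tables) for b in all_bins}


def permutation_p_values(rank_tables, all_bins, n_perm=200, seed=0):
    """Share of permutations whose rank product is at least as low as observed."""
    rng = random.Random(seed)
    observed = rank_products(rank_tables, all_bins)
    hits = dict.fromkeys(all_bins, 0)
    for _ in range(n_perm):
        shuffled = []
        for table in rank_tables:
            # Shuffle ranks among bins independently within each data source.
            values = [table[b] for b in all_bins]
            rng.shuffle(values)
            shuffled.append(dict(zip(all_bins, values)))
        null = rank_products(shuffled, all_bins)
        for b in all_bins:
            if null[b] <= observed[b]:
                hits[b] += 1
    return {b: (hits[b] + 1) / (n_perm + 1) for b in all_bins}


# Toy signals as (chromosome, position, score); all values are invented.
gwas = [("chr6", 32_500_000, 9.0), ("chr5", 35_900_000, 6.0), ("chr1", 1_200_000, 1.0)]
expression = [("chr6", 32_100_000, 4.0), ("chr5", 35_800_000, 3.5), ("chr2", 500_000, 2.0)]

per_source = [bin_scores(gwas), bin_scores(expression)]
all_bins = sorted(set().union(*per_source))
rank_tables = [rank_bins(scores, all_bins) for scores in per_source]
rp = rank_products(rank_tables, all_bins)
p_values = permutation_p_values(rank_tables, all_bins)
best_bin = min(rp, key=rp.get)
```

In this toy input, the bin that ranks first in both sources obtains the lowest rank product, which mirrors the idea that regions supported by several layers are prioritized. With real data, the number of permutations would need to be far larger to resolve small p values.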

Figure 1.

Process of integration of numerous heterogeneous data sources. First, data on significant alterations on a certain biological layer are obtained from selected studies (data from the various layers are coded by letters a–n and differing colors). These alterations, or signals, are then positioned into genomic bins of fixed size, and a bin score is estimated for each bin. For each of the layers a–n, bins are then prioritized on the basis of this score and the rank of each bin is recorded. The final integration step is then performed by calculating rank products for each of the genomic bins, based on their rank in each of the data sources.

RESULTS AND DISCUSSION

The positional integratomic approach yields a prioritized list of genomic regions, in which the regions containing the greatest accumulation of heterogeneous biological alterations in the investigated disease rank highest and are characterized by the lowest permutation test p values. As the integrative approach is performed for regions (bins) across the whole genome, the resulting genome-wide distribution of integration results in a human disease may be inspected. The genome-wide distribution of integration results for MS, as an example of a complex autoimmune human disorder, is shown in Figure 2. Here, the greatest accumulation of signals is observed on chromosome 6, specifically in the well-known human leukocyte antigen (HLA) region, suggesting that heterogeneous sources of ‘omic’ data consistently indicate a role of this region in MS. Moreover, other regions also attained high integration scores, suggesting the importance of non-HLA regions in MS. Specifically, a region containing the interleukin-7 receptor gene (IL7R) attained very high integrative scores, not only on the basis of detections from genome-wide association studies, but also on the basis of evidence from expression profiling studies in blood and brain tissues. Additionally, the same region ranked high owing to information obtained from various bioinformatic sources of data, such as KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways and co-expression information [10,11]. Such a heterogeneous body of evidence is highly relevant to true biological disease alterations and thus supports the selection of plausible candidates for further studies.

Figure 2.

The genome-wide distribution of significance values, based on the permutation test of integration scores. Each region or genomic bin is represented by a dot whose height represents its significance in −log10 P form, with regions characterized by high accumulation of heterogeneous data attaining higher −log10 P values. The HLA region on chromosome 6 attained the highest score in these analyses, with p values below 1 × 10^−9. Notably, non-HLA regions score high as well, offering a landscape of new genomic regions for further downstream investigations.

The positional approach offers great flexibility and control over the parameters on which the final prioritization of genomic regions is based. Depending on the scientific question, a researcher may be interested in the contribution of only selected biological layers to the final integration score. For this reason, we have implemented means to allow custom weighting of different sources of data. For example, if one is interested in the relation between genomic variation and differential methylation, one may attribute greater weights to those two sources, and regions where signals from genome-wide association studies (GWAS) and global methylation studies aggregate will be obtained. Additional control may be obtained by customizing the size of genomic bins, allowing for detection of interactions that spread across larger genomic regions. There has been great interest in deciphering genetic factors with medium-to-low effect sizes as an explanation for the phenomenon of missing heritability in MS and other complex disorders [12,13]. Here, an integrative approach may assist in detecting genomic variants with an actual role in such complex disorders and in distinguishing them from the statistical noise generated in genome-wide association studies. As large-scale studies that attempt to detect low-effect susceptibility factors in human disease have to be performed on large sample sizes, requiring great resources and effort [14], this approach may serve as a comprehensive, evidence-based way of selecting molecular determinants to investigate in such downstream validation studies. With the continuing development of high-throughput technologies, the amount of data in large databases is expected to continue to rise. Novel approaches for interpreting and understanding these data will therefore have to be prepared to face these challenges. As it is difficult for a single researcher or research group to maintain a comprehensive overview of such a vast information landscape, new means of presenting and accessing these results will have to be envisaged. A position-based, integrative approach not only provides quick insight into heterogeneous evidence from several large-scale studies, but is also a basis for the preparation of interactive, genome browser-like solutions for fast and easy access to this body of information.
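The custom weighting of data sources mentioned earlier in this section is not specified in detail here. One plausible way to realize it, shown purely as an assumption, is a weighted geometric mean of a bin's ranks, so that sources with larger weights influence the integrated score more strongly. The function name, the weighting scheme, the source labels and all ranks and weights below are hypothetical.

```python
import math


def weighted_rank_product(ranks_by_source, weights):
    """Weighted geometric mean of one bin's ranks (lower = stronger).

    Assumed scheme: exp(sum(w_k * log(r_k)) / sum(w_k)); not taken from the source.
    """
    total_weight = sum(weights[source] for source in ranks_by_source)
    log_sum = sum(weights[source] * math.log(rank)
                  for source, rank in ranks_by_source.items())
    return math.exp(log_sum / total_weight)


# Hypothetical ranks (1 = strongest signal) of two bins in three data sources.
bin_a = {"gwas": 2, "methylation": 3, "expression": 400}
bin_b = {"gwas": 30, "methylation": 40, "expression": 1}

equal = {"gwas": 1.0, "methylation": 1.0, "expression": 1.0}
focus = {"gwas": 3.0, "methylation": 3.0, "expression": 0.5}  # emphasize two layers

for label, weights in (("equal", equal), ("focus", focus)):
    score_a = weighted_rank_product(bin_a, weights)
    score_b = weighted_rank_product(bin_b, weights)
    print(label, "bin_a" if score_a < score_b else "bin_b", "is prioritized")
```

With equal weights the expression-driven bin scores better, whereas emphasizing the GWAS and methylation layers moves the bin supported by those two layers to the top. This is the kind of question-driven reprioritization that custom weighting is meant to allow.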

14 in total

1. The KEGG databases and tools facilitating omics analysis: latest developments involving human diseases and pharmaceuticals.

Authors: Masaaki Kotera; Mika Hirakawa; Toshiaki Tokimatsu; Susumu Goto; Minoru Kanehisa
Journal: Methods Mol Biol Date: 2012

2. Genome-wide strategies for detecting multiple loci that influence complex diseases.

Authors: Jonathan Marchini; Peter Donnelly; Lon R Cardon
Journal: Nat Genet Date: 2005-03-27 Impact factor: 38.330

Review 3. Microarray data analysis: from disarray to consolidation and consensus.

Authors: David B Allison; Xiangqin Cui; Grier P Page; Mahyar Sabripour
Journal: Nat Rev Genet Date: 2006-01 Impact factor: 53.242

Review 4. Gene expression omnibus: microarray data storage, submission, retrieval, and analysis.

Authors: Tanya Barrett; Ron Edgar
Journal: Methods Enzymol Date: 2006 Impact factor: 1.600

5. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements.

Authors: Leming Shi; Laura H Reid; Wendell D Jones; Richard Shippy; Janet A Warrington; Shawn C Baker; Patrick J Collins; Francoise de Longueville; Ernest S Kawasaki; Kathleen Y Lee; Yuling Luo; Yongming Andrew Sun; James C Willey; Robert A Setterquist; Gavin M Fischer; Weida Tong; Yvonne P Dragan; David J Dix; Felix W Frueh; Frederico M Goodsaid; Damir Herman; Roderick V Jensen; Charles D Johnson; Edward K Lobenhofer; Raj K Puri; Uwe Schrf; Jean Thierry-Mieg; Charles Wang; Mike Wilson; Paul K Wolber; Lu Zhang; Shashi Amur; Wenjun Bao; Catalin C Barbacioru; Anne Bergstrom Lucas; Vincent Bertholet; Cecilie Boysen; Bud Bromley; Donna Brown; Alan Brunner; Roger Canales; Xiaoxi Megan Cao; Thomas A Cebula; James J Chen; Jing Cheng; Tzu-Ming Chu; Eugene Chudin; John Corson; J Christopher Corton; Lisa J Croner; Christopher Davies; Timothy S Davison; Glenda Delenstarr; Xutao Deng; David Dorris; Aron C Eklund; Xiao-hui Fan; Hong Fang; Stephanie Fulmer-Smentek; James C Fuscoe; Kathryn Gallagher; Weigong Ge; Lei Guo; Xu Guo; Janet Hager; Paul K Haje; Jing Han; Tao Han; Heather C Harbottle; Stephen C Harris; Eli Hatchwell; Craig A Hauser; Susan Hester; Huixiao Hong; Patrick Hurban; Scott A Jackson; Hanlee Ji; Charles R Knight; Winston P Kuo; J Eugene LeClerc; Shawn Levy; Quan-Zhen Li; Chunmei Liu; Ying Liu; Michael J Lombardi; Yunqing Ma; Scott R Magnuson; Botoul Maqsodi; Tim McDaniel; Nan Mei; Ola Myklebost; Baitang Ning; Natalia Novoradovskaya; Michael S Orr; Terry W Osborn; Adam Papallo; Tucker A Patterson; Roger G Perkins; Elizabeth H Peters; Ron Peterson; Kenneth L Philips; P Scott Pine; Lajos Pusztai; Feng Qian; Hongzu Ren; Mitch Rosen; Barry A Rosenzweig; Raymond R Samaha; Mark Schena; Gary P Schroth; Svetlana Shchegrova; Dave D Smith; Frank Staedtler; Zhenqiang Su; Hongmei Sun; Zoltan Szallasi; Zivana Tezak; Danielle Thierry-Mieg; Karol L Thompson; Irina Tikhonova; Yaron Turpaz; Beena Vallanat; Christophe Van; Stephen J Walker; Sue Jane Wang; Yonghong Wang; Russ 
Wolfinger; Alex Wong; Jie Wu; Chunlin Xiao; Qian Xie; Jun Xu; Wen Yang; Liang Zhang; Sheng Zhong; Yaping Zong; William Slikker
Journal: Nat Biotechnol Date: 2006-09 Impact factor: 54.908

Review 6. Next-generation sequencing transforms today's biology.

Authors: Stephan C Schuster
Journal: Nat Methods Date: 2007-12-19 Impact factor: 28.547

7. Positional integratomic approach in identification of genomic candidate regions for Parkinson's disease.

Authors: Ales Maver; Borut Peterlin
Journal: Bioinformatics Date: 2011-05-19 Impact factor: 6.937

8. COXPRESdb: a database to compare gene coexpression in seven model animals.

Authors: Takeshi Obayashi; Kengo Kinoshita
Journal: Nucleic Acids Res Date: 2010-11-16 Impact factor: 16.971

9. The sequence read archive.

Authors: Rasko Leinonen; Hideaki Sugawara; Martin Shumway
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

10. Genetic risk and a primary role for cell-mediated immune mechanisms in multiple sclerosis.

Authors: Stephen Sawcer; Garrett Hellenthal; Matti Pirinen; Chris C A Spencer; Nikolaos A Patsopoulos; Loukas Moutsianas; Alexander Dilthey; Zhan Su; Colin Freeman; Sarah E Hunt; Sarah Edkins; Emma Gray; David R Booth; Simon C Potter; An Goris; Gavin Band; Annette Bang Oturai; Amy Strange; Janna Saarela; Céline Bellenguez; Bertrand Fontaine; Matthew Gillman; Bernhard Hemmer; Rhian Gwilliam; Frauke Zipp; Alagurevathi Jayakumar; Roland Martin; Stephen Leslie; Stanley Hawkins; Eleni Giannoulatou; Sandra D'alfonso; Hannah Blackburn; Filippo Martinelli Boneschi; Jennifer Liddle; Hanne F Harbo; Marc L Perez; Anne Spurkland; Matthew J Waller; Marcin P Mycko; Michelle Ricketts; Manuel Comabella; Naomi Hammond; Ingrid Kockum; Owen T McCann; Maria Ban; Pamela Whittaker; Anu Kemppinen; Paul Weston; Clive Hawkins; Sara Widaa; John Zajicek; Serge Dronov; Neil Robertson; Suzannah J Bumpstead; Lisa F Barcellos; Rathi Ravindrarajah; Roby Abraham; Lars Alfredsson; Kristin Ardlie; Cristin Aubin; Amie Baker; Katharine Baker; Sergio E Baranzini; Laura Bergamaschi; Roberto Bergamaschi; Allan Bernstein; Achim Berthele; Mike Boggild; Jonathan P Bradfield; David Brassat; Simon A Broadley; Dorothea Buck; Helmut Butzkueven; Ruggero Capra; William M Carroll; Paola Cavalla; Elisabeth G Celius; Sabine Cepok; Rosetta Chiavacci; Françoise Clerget-Darpoux; Katleen Clysters; Giancarlo Comi; Mark Cossburn; Isabelle Cournu-Rebeix; Mathew B Cox; Wendy Cozen; Bruce A C Cree; Anne H Cross; Daniele Cusi; Mark J Daly; Emma Davis; Paul I W de Bakker; Marc Debouverie; Marie Beatrice D'hooghe; Katherine Dixon; Rita Dobosi; Bénédicte Dubois; David Ellinghaus; Irina Elovaara; Federica Esposito; Claire Fontenille; Simon Foote; Andre Franke; Daniela Galimberti; Angelo Ghezzi; Joseph Glessner; Refujia Gomez; Olivier Gout; Colin Graham; Struan F A Grant; Franca Rosa Guerini; Hakon Hakonarson; Per Hall; Anders Hamsten; Hans-Peter Hartung; Rob N Heard; Simon Heath; Jeremy Hobart; Muna Hoshi; Carmen Infante-Duarte; 
Gillian Ingram; Wendy Ingram; Talat Islam; Maja Jagodic; Michael Kabesch; Allan G Kermode; Trevor J Kilpatrick; Cecilia Kim; Norman Klopp; Keijo Koivisto; Malin Larsson; Mark Lathrop; Jeannette S Lechner-Scott; Maurizio A Leone; Virpi Leppä; Ulrika Liljedahl; Izaura Lima Bomfim; Robin R Lincoln; Jenny Link; Jianjun Liu; Aslaug R Lorentzen; Sara Lupoli; Fabio Macciardi; Thomas Mack; Mark Marriott; Vittorio Martinelli; Deborah Mason; Jacob L McCauley; Frank Mentch; Inger-Lise Mero; Tania Mihalova; Xavier Montalban; John Mottershead; Kjell-Morten Myhr; Paola Naldi; William Ollier; Alison Page; Aarno Palotie; Jean Pelletier; Laura Piccio; Trevor Pickersgill; Fredrik Piehl; Susan Pobywajlo; Hong L Quach; Patricia P Ramsay; Mauri Reunanen; Richard Reynolds; John D Rioux; Mariaemma Rodegher; Sabine Roesner; Justin P Rubio; Ina-Maria Rückert; Marco Salvetti; Erika Salvi; Adam Santaniello; Catherine A Schaefer; Stefan Schreiber; Christian Schulze; Rodney J Scott; Finn Sellebjerg; Krzysztof W Selmaj; David Sexton; Ling Shen; Brigid Simms-Acuna; Sheila Skidmore; Patrick M A Sleiman; Cathrine Smestad; Per Soelberg Sørensen; Helle Bach Søndergaard; Jim Stankovich; Richard C Strange; Anna-Maija Sulonen; Emilie Sundqvist; Ann-Christine Syvänen; Francesca Taddeo; Bruce Taylor; Jenefer M Blackwell; Pentti Tienari; Elvira Bramon; Ayman Tourbah; Matthew A Brown; Ewa Tronczynska; Juan P Casas; Niall Tubridy; Aiden Corvin; Jane Vickery; Janusz Jankowski; Pablo Villoslada; Hugh S Markus; Kai Wang; Christopher G Mathew; James Wason; Colin N A Palmer; H-Erich Wichmann; Robert Plomin; Ernest Willoughby; Anna Rautanen; Juliane Winkelmann; Michael Wittig; Richard C Trembath; Jacqueline Yaouanq; Ananth C Viswanathan; Haitao Zhang; Nicholas W Wood; Rebecca Zuvich; Panos Deloukas; Cordelia Langford; Audrey Duncanson; Jorge R Oksenberg; Margaret A Pericak-Vance; Jonathan L Haines; Tomas Olsson; Jan Hillert; Adrian J Ivinson; Philip L De Jager; Leena Peltonen; Graeme J Stewart; David A Hafler; 
Stephen L Hauser; Gil McVean; Peter Donnelly; Alastair Compston
Journal: Nature Date: 2011-08-10 Impact factor: 49.962
