DistiLD Database: diseases and traits in linkage disequilibrium blocks.

Albert Pallejà, Heiko Horn, Sabrina Eliasson, Lars Juhl Jensen.

Abstract

Genome-wide association studies (GWAS) have identified thousands of single nucleotide polymorphisms (SNPs) associated with the risk of hundreds of diseases. However, there is currently no database that enables non-specialists to answer the following simple questions: which SNPs associated with diseases are in linkage disequilibrium (LD) with a gene of interest? Which chromosomal regions have been associated with a given disease, and which are the potentially causal genes in each region? To answer these questions, we use data from the HapMap Project to partition each chromosome into so-called LD blocks, so that SNPs in LD with each other are preferentially in the same block, whereas SNPs not in LD are in different blocks. By projecting SNPs and genes onto LD blocks, the DistiLD database aims to increase usage of existing GWAS results by making it easy to query and visualize disease-associated SNPs and genes in their chromosomal context. The database is available at http://distild.jensenlab.org/.

Year: 2011 PMID: 22058129 PMCID: PMC3245128 DOI: 10.1093/nar/gkr899

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Genome-wide association studies (GWAS) have been extensively used to associate single nucleotide polymorphisms (SNPs) with diverse phenotypes and have substantially increased our knowledge of the genetics and molecular pathways underlying human traits and diseases (1,2). These studies rely on genotyping large cohorts of cases and controls, which is expensive despite the cost efficiency of the microarray technology used (3). Over the last few years, GWAS have resulted in hundreds of publications in high-profile journals, and several databases gather the results of the many studies. The main repositories for GWAS data are the database of Genotypes and Phenotypes [dbGaP (4)], the European Genotype Archive (EGA), the GWAS Database of Japan and GWAS Central [formerly known as HGVbaseG2P (5)]. Unfortunately, none of these resources allows systematic download and redistribution of the data. The National Human Genome Research Institute maintains a public, daily updated Catalog of Published Genome Wide Association Studies (GWAS Catalog) from which the most statistically significant SNPs associated with each phenotype can be retrieved (6). These data can be queried and visualized in the UCSC genome browser through GWAS Integrator (7).

Despite the existence of these resources, GWAS data are far from easy to work with for non-experts. This is because GWAS identify marker SNPs, which are not necessarily the causal SNPs but are assumed to be in linkage disequilibrium (LD) with them (1,2,8,9). LD is defined as the non-random association of variants at two or more loci, and it has long been the basis for genetically mapping genes associated with traits or diseases (9,10). To identify candidate disease genes based on the SNPs found by GWAS, it is thus necessary to take into account LD among SNPs. We simplify this task by cutting the chromosomes into so-called LD blocks, within which SNPs are mostly in strong LD with each other, whereas those from different blocks are not.

The aim of the DistiLD database is to increase the usage of existing GWAS results. To this end, the DistiLD database performs three important tasks: (i) published GWAS are collected from several sources and linked to standardized, international disease codes; (ii) data from the International HapMap Project (11) are analyzed to define LD blocks onto which SNPs and genes are mapped; (iii) a web interface makes it easy to query and visualize disease-associated SNPs and genes within LD blocks.

COLLECTION AND ANNOTATION OF GWAS RESULTS

The GWAS results were collected from three different sources. The first is the manually inspected collection of SNP–phenotype associations of Johnson and O'Donnell (12), which covers GWAS data from studies published before 1 March 2008. The second source is our own collection of GWAS data manually retrieved from all the studies published between 1 March 2008 and 1 July 2010. The PubMed search terms were ‘genome wide association studies’, ‘genome wide association study’ and ‘GWAS’. The third source is the data collected by the GWAS Catalog up to 8 July 2011. These data sets were merged in an inclusive manner: we stored all SNPs listed by any of the sources and, when the same study was imported from multiple sources, assigned each SNP the lowest P-value reported. The DistiLD database currently contains 820 GWAS and 86 627 SNP–study associations, which makes it the publicly accessible database with the most associations. We plan to update the database with additional GWAS data on a weekly basis. A physician manually assigned 717 of the studies to one or more diseases, represented in the database by International Classification of Diseases version 10 (ICD10) codes. Consequently, users can query the database for diseases using ICD10 codes, which are commonly used by physicians.
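The inclusive merge described above can be illustrated with a minimal Python sketch. The record layout (study identifier, rs number, P-value), the function name and the toy values are assumptions made for illustration; the text does not describe the actual DistiLD import code or schema.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (study_id, rs_id, p_value).
Record = Tuple[str, str, float]


def merge_sources(*sources: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Keep every SNP-study pair seen in any source, with its lowest P-value."""
    merged: Dict[Tuple[str, str], float] = {}
    for source in sources:
        for study_id, rs_id, p_value in source:
            key = (study_id, rs_id)
            # Inclusive merge: add unseen pairs, otherwise keep the smaller P-value.
            if key not in merged or p_value < merged[key]:
                merged[key] = p_value
    return merged


# Toy records (invented for illustration, not real GWAS results).
source_a = [("study_1", "rs_a", 1e-6), ("study_1", "rs_b", 3e-5)]
source_b = [("study_1", "rs_a", 2e-8), ("study_2", "rs_c", 5e-9)]
merged = merge_sources(source_a, source_b)
```

Here the pair (`study_1`, `rs_a`) appears in both toy sources and keeps the lower of its two P-values, while pairs found in only one source are retained unchanged.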

IDENTIFICATION OF LD BLOCKS

The International HapMap Project represents a major effort to map the LD among SNPs in the human genome (11). It provides two commonly used measures of LD, namely D′ and r², both of which can vary from 0 to 1, with higher values implying stronger LD. We used D′ as the basis for partitioning the chromosomes into LD blocks, because this measure is normalized for allele frequencies, making it better suited than r² for estimating the overall LD across pairs of multiallelic loci (13). Our algorithm for identifying LD blocks is based on sliding windows along the chromosomes (Figure 1A). We use two different window sizes to calculate the LD across each chromosomal position: a ±60 kb window to capture coarse-grained LD and a ±5 kb window to capture fine-grained LD. We chose the window size of ±60 kb because D′ on average drops below 0.5 beyond that distance in Caucasians of central European ancestry (14). We picked the size of ±5 kb to have a window that is small yet large enough to typically contain several SNPs given the HapMap SNP density (11). For both window sizes, we calculate the average D′ between the left and right halves of the window; pairs of SNPs for which HapMap does not specify LD values are considered to have D′ = 0. SNPs within the ±5 kb window are not counted as part of the ±60 kb window.
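The window averages can be sketched in Python as follows. This is one possible reading of the description, not the authors' implementation: the function name, the dictionary keyed by position pairs, the treatment of a SNP lying exactly at the evaluated position (assigned to neither half) and the value returned for an empty half (0) are all assumptions.

```python
from typing import Dict, List, Tuple


def window_avg_dprime(
    pos: int,
    snp_positions: List[int],
    dprime: Dict[Tuple[int, int], float],
    half_width: int,
    inner: int = 0,
) -> float:
    """Average D' over all (left SNP, right SNP) pairs around `pos`.

    SNPs closer than or equal to `inner` bp are skipped, so the +/-60 kb
    window can exclude SNPs already covered by the +/-5 kb window.
    """
    left = [p for p in snp_positions if inner < pos - p <= half_width]
    right = [p for p in snp_positions if inner < p - pos <= half_width]
    if not left or not right:
        return 0.0  # assumption: no cross-window pairs means no measurable LD
    # Pairs without a reported LD value count as D' = 0, as stated in the text.
    total = sum(dprime.get((l, r), 0.0) for l in left for r in right)
    return total / (len(left) * len(right))


# Toy data: positions in bp and D' values keyed by (left, right) position.
snps = [2000, 7000, 9000, 11000, 13000, 20000]
ld = {(9000, 11000): 1.0, (7000, 11000): 0.8, (9000, 13000): 0.6, (2000, 20000): 0.4}

fine = window_avg_dprime(10000, snps, ld, half_width=5000)
coarse = window_avg_dprime(10000, snps, ld, half_width=60000, inner=5000)
```

In this toy example the pair (7000, 13000) has no reported value and therefore contributes 0 to the fine-grained average.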

Figure 1.

Dividing chromosomes into LD blocks. The figure shows the results for a region of chromosome 19. (A) We first segment the chromosome into three classes based on the average D′ within a ±60 kb window: high-LD (black diamonds, D′ ≥ 0.6), moderate-LD (green squares, 0.4 ≤ D′ < 0.6) and low-LD segments (blue triangles, D′ < 0.4). The heatmap below the graph shows the D′ values between pairs of SNPs. (B) We subsequently determine the boundaries of LD blocks within moderate- and low-LD segments based on where the average D′ within a ±5 kb window drops to 0.5 or lower.

We next divide each chromosome into segments of high (D′ ≥ 0.6), moderate (0.4 ≤ D′ < 0.6) and low (D′ < 0.4) LD based on the average D′ for the ±60 kb window (Figure 1B). Starting from these segments, we cut the chromosome into LD blocks based on the following rules: (i) we never cut within high-LD segments; (ii) within moderate- and low-LD segments, we cut wherever the average D′ for the ±5 kb window is <0.5; (iii) if the ±5 kb D′ average does not fall <0.5 within a low-LD segment, we cut where the lowest ±5 kb D′ average is found. These rules ensure that a high-LD segment will always belong to only a single LD block, whereas segments separated by a low-LD segment will never be part of the same LD block. To assess the robustness of the results, we varied the parameters to see if any of them dramatically affect the average size of the LD blocks. This is not the case; the average size of the LD blocks changed <8% when varying the large window size, the small window size and the average D′ thresholds to define the LD segments (Table 1).
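The three cutting rules can be sketched in Python as follows. This is a literal, simplified reading and not the authors' code: the input layout (one tuple of position, ±60 kb average and ±5 kb average per evaluated position), the function names and the choice to report every position below the cut-off as a cut are assumptions.

```python
from itertools import groupby
from typing import List, Tuple

Point = Tuple[int, float, float]  # (position, coarse D' average, fine D' average)


def classify(coarse: float, low: float = 0.4, high: float = 0.6) -> str:
    """Assign a position to a high-, moderate- or low-LD segment."""
    if coarse >= high:
        return "high"
    if coarse >= low:
        return "moderate"
    return "low"


def find_cuts(points: List[Point], cut_below: float = 0.5) -> List[int]:
    """Return cut positions for points sorted by chromosomal position."""
    cuts: List[int] = []
    # A segment is a maximal run of consecutive positions with the same class.
    for label, run in groupby(points, key=lambda pt: classify(pt[1])):
        segment = list(run)
        if label == "high":
            continue  # rule (i): never cut within high-LD segments
        # rule (ii): cut wherever the fine-grained average is below the cut-off
        below = [pos for pos, _, fine in segment if fine < cut_below]
        if not below and label == "low":
            # rule (iii): a low-LD segment always gets a cut, at its minimum
            below = [min(segment, key=lambda pt: pt[2])[0]]
        cuts.extend(below)
    return cuts


# Toy values chosen to exercise each rule.
points = [
    (100, 0.8, 0.90),  # high-LD: no cut
    (200, 0.5, 0.45),  # moderate-LD, fine average below 0.5: cut
    (300, 0.5, 0.70),  # moderate-LD: no cut
    (400, 0.3, 0.60),  # low-LD
    (500, 0.2, 0.55),  # low-LD, lowest fine average in its segment: cut
    (600, 0.7, 0.30),  # high-LD: no cut even though the fine average is low
]
cuts = find_cuts(points)
```

With these toy values, the sketch places cuts at positions 200 and 500 and leaves both high-LD segments intact.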

Table 1.

Robustness analysis of the algorithm

Large window (kb)	Small window (kb)	Average D′ thresholds	Number of LD blocks	Average size of LD blocks (kb)
±50	±5	±0.10	37 856	80
±60	±4	±0.10	35 097	86
±60	±5	±0.05	38 532	79
±60	±5	±0.10	37 991	80
±60	±5	±0.20	35 752	85
±60	±6	±0.10	41 332	73
±70	±5	±0.10	38 296	79

The table shows the number of LD blocks and the average size of the blocks after running our algorithm using different window sizes and average D′ thresholds symmetric around D′ = 0.5. We set the thresholds by adding to or subtracting from 0.5 the quantity in the column Average D′ thresholds. The average size of the LD blocks changed <8% when varying the window sizes and the average D′ thresholds. The windows and thresholds finally selected for running the algorithm are ±60 kb, ±5 kb and ±0.10 (D′ thresholds of 0.4 and 0.6).

THE DistiLD WEB INTERFACE

Users can query DistiLD in three different ways, starting from either a disease, a list of SNPs or a list of genes:

Disease-focused query

Users can query the database for a disease by typing its entire name or ICD10 code. The autocompletion system helps the user to easily select a disease from the ICD10 classification. Diseases and traits can also be retrieved by free-text search within the paper abstracts. All the LD blocks associated with a given disease through GWAS are shown, including the SNPs associated with the disease and the genes that fall within those blocks.

Mutation-focused query

It is also possible for users to query the database for SNPs by inputting a list of rs numbers, irrespective of whether the SNPs were identified through GWAS or through other methodologies. To this end, we map all SNPs in dbSNP (15) to the LD blocks. This enables users to find out which other diseases may be related to their disease of interest, because the LD blocks show both the SNPs entered by the user (highlighted in red) and other SNPs in LD, which are associated with other diseases.
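Mapping SNPs onto LD blocks amounts to an interval lookup, which can be sketched in Python with a binary search. The block coordinates, SNP names, disease labels and function names below are invented for illustration, and the sketch assumes non-overlapping blocks on a single chromosome sorted by start coordinate.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Block = Tuple[int, int]  # (start, end) on one chromosome, inclusive


def find_block(blocks: List[Block], position: int) -> Optional[Block]:
    """Return the block containing `position`, or None if no block does."""
    starts = [start for start, _ in blocks]
    i = bisect_right(starts, position) - 1  # last block starting at or before position
    if i >= 0 and blocks[i][0] <= position <= blocks[i][1]:
        return blocks[i]
    return None


def co_located_gwas_hits(
    blocks: List[Block],
    query_snps: Dict[str, int],
    gwas_snps: Dict[str, Tuple[int, str]],
) -> Dict[str, List[Tuple[str, str]]]:
    """For each query SNP, list the GWAS SNPs (and diseases) in the same block."""
    hits: Dict[str, List[Tuple[str, str]]] = {}
    for rs, pos in query_snps.items():
        block = find_block(blocks, pos)
        hits[rs] = [
            (g_rs, disease)
            for g_rs, (g_pos, disease) in gwas_snps.items()
            if block is not None and find_block(blocks, g_pos) == block
        ]
    return hits


# Toy data (hypothetical coordinates and labels).
blocks = [(1, 1000), (1001, 5000), (5001, 9000)]
gwas = {"rs_gwas_1": (1500, "disease X"), "rs_gwas_2": (7000, "disease Y")}
hits = co_located_gwas_hits(blocks, {"rs_query": 4200}, gwas)
```

The query SNP at position 4200 falls in the block 1001 to 5000, so the lookup returns the toy GWAS SNP at position 1500 and its disease label, but not the one in the neighbouring block.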

Gene-focused query

Users can also query the database by entering a gene name or list of gene names of interest. The LD blocks that contain those genes are shown with the query genes highlighted in red; the blocks will also show any disease-associated SNPs contained within them. This way, users can use the DistiLD database to identify diseases linked to their genes of interest even if the GWAS in question did not explicitly report those genes.

No matter which of the three query modes was used, an intermediate page will be shown listing all the studies that matched the search with a link to the corresponding publication (Figure 2A). The user can select either all studies related to a certain disease or one specific study for which to view the related LD blocks (Figure 2B). We rank the blocks by the P-value of the most statistically significant SNP within each block. LD blocks are represented as boxes in which the chromosome is a thin bar in the middle showing the position and orientation of the genes. The genes and intergenic regions are not shown to scale. This schematic view enables users to visualize large chromosomal regions in a much more compact way than traditional genome browsers. Each SNP points to its chromosomal position, and its P-value and PMID are shown (Figure 2B and C). It is also possible to retrieve gene information (Figure 2D).
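The ranking of blocks by their most significant SNP can be expressed in a few lines of Python. The block names, SNP names and P-values are toy values, and the function name is hypothetical; the sketch assumes that every block holds at least one SNP.

```python
from typing import Dict, List, Tuple


def rank_blocks(block_snps: Dict[str, List[Tuple[str, float]]]) -> List[str]:
    """Order block IDs by the smallest P-value among their SNPs."""
    # Assumes every block has at least one SNP; min() would fail on an empty list.
    return sorted(block_snps, key=lambda block: min(p for _, p in block_snps[block]))


# Toy data (invented for illustration).
block_snps = {
    "block_1": [("rs_a", 1e-5), ("rs_b", 3e-9)],
    "block_2": [("rs_c", 2e-12)],
    "block_3": [("rs_d", 4e-6)],
}
ranked = rank_blocks(block_snps)
```

A block is ranked by its single best SNP, so `block_1` is placed ahead of `block_3` on the strength of `rs_b` alone.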

Figure 2.

The DistiLD web interface. The figure shows different steps when querying the database with the three genes IKZF1, ARID5B and CEBPE. (A) An intermediate page is shown where the user selects a disease or GWAS of interest. (B) The result page shows LD blocks containing SNPs associated with the selected disease or GWAS. If the query is a list of SNPs or genes, they will be highlighted in red. (C) A popup with further details on SNPs can be obtained by clicking on them. (D) Similarly, selecting a gene yields an information popup provided by the Reflect web resource (16).

LARGE-SCALE DATA ACCESS

The DistiLD database integrates information on: (i) associations between SNPs and diseases from GWAS and (ii) links between SNPs and genes based on LD data from the HapMap project. All these data can be accessed freely through the website and downloaded as tab-delimited files to allow for large-scale analyses. Users can download the following two files: one containing the GWAS SNPs mapped to LD blocks and diseases (ICD10 codes and descriptions), and one containing all the SNPs in the Database of Single Nucleotide Polymorphisms (dbSNP) build 132 (15) and all the genes from Ensembl database version 57 (17) mapped to the LD blocks. The LD blocks cover the entire human genome and have self-explanatory identifiers that consist of the chromosome, the start and the stop coordinates. These files are available under the Creative Commons Attribution 3.0 License.

Despite the great number of disease-related chromosomal loci reported by GWAS, the causal genes remain extremely difficult to identify, particularly in complex diseases. To deal with this issue, several approaches based on network, pathway, protein–protein interaction, gene ontology or gene expression analyses (18–23) try to make more meaningful use of the associations reported by GWAS by incorporating prior functional knowledge into the analysis of the genetic variants associated with a disease. We believe that DistiLD could be the starting point for such studies by providing the LD blocks associated with a given disease, which contain the set of genes in LD with the SNPs associated with the disease.
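As a rough illustration of how the two tab-delimited files could feed such downstream analyses, the Python sketch below joins them on the LD block identifier to obtain candidate genes per disease. The column order, the identifier format (`chr19:100-200`), the function name and all rows are hypothetical; the text only states that the identifiers consist of the chromosome, the start and the stop coordinates, so the real files should be inspected before adapting this.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Set, TextIO

# Toy stand-ins for the two downloads (invented rows, not real DistiLD data).
# Assumed columns: rs number, LD block ID, ICD10 code / gene name, LD block ID.
GWAS_TSV = "rs_a\tchr19:100-200\tE11\nrs_b\tchr2:50-90\tE11\n"
GENES_TSV = "GENE_A\tchr19:100-200\nGENE_B\tchr19:100-200\nGENE_C\tchr7:1-10\n"


def candidate_genes(gwas_file: TextIO, gene_file: TextIO) -> Dict[str, Set[str]]:
    """Map each ICD10 code to the genes sharing an LD block with its GWAS SNPs."""
    genes_by_block: Dict[str, Set[str]] = defaultdict(set)
    for gene, block in csv.reader(gene_file, delimiter="\t"):
        genes_by_block[block].add(gene)

    genes_by_disease: Dict[str, Set[str]] = defaultdict(set)
    for _rs, block, icd10 in csv.reader(gwas_file, delimiter="\t"):
        # Blocks without any mapped gene contribute nothing.
        genes_by_disease[icd10] |= genes_by_block.get(block, set())
    return dict(genes_by_disease)


candidates = candidate_genes(io.StringIO(GWAS_TSV), io.StringIO(GENES_TSV))
```

The resulting gene sets per disease code are the kind of input that the network- and pathway-based approaches mentioned above could take as a starting point.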

FUNDING

Novo Nordisk Foundation Center for Protein Research. Funding for open access charge: Reinholdt W. Jorck og Hustrus Fond.

Conflict of interest statement. None declared.

23 in total

1. The NCBI dbGaP database of genotypes and phenotypes.

Authors: Matthew D Mailman; Michael Feolo; Yumi Jin; Masato Kimura; Kimberly Tryka; Rinat Bagoutdinov; Luning Hao; Anne Kiang; Justin Paschall; Lon Phan; Natalia Popova; Stephanie Pretel; Lora Ziyabari; Moira Lee; Yu Shao; Zhen Y Wang; Karl Sirotkin; Minghong Ward; Michael Kholodov; Kerry Zbicz; Jeffrey Beck; Michael Kimelman; Sergey Shevelev; Don Preuss; Eugene Yaschenko; Alan Graeff; James Ostell; Stephen T Sherry
Journal: Nat Genet Date: 2007-10 Impact factor: 38.330

2. Diverse genome-wide association studies associate the IL12/IL23 pathway with Crohn Disease.

Authors: Kai Wang; Haitao Zhang; Subra Kugathasan; Vito Annese; Jonathan P Bradfield; Richard K Russell; Patrick M A Sleiman; Marcin Imielinski; Joseph Glessner; Cuiping Hou; David C Wilson; Thomas Walters; Cecilia Kim; Edward C Frackelton; Paolo Lionetti; Arrigo Barabino; Johan Van Limbergen; Stephen Guthery; Lee Denson; David Piccoli; Mingyao Li; Marla Dubinsky; Mark Silverberg; Anne Griffiths; Struan F A Grant; Jack Satsangi; Robert Baldassano; Hakon Hakonarson
Journal: Am J Hum Genet Date: 2009-02-26 Impact factor: 11.025

Review 3. Human genetic variation and its contribution to complex traits.

Authors: Kelly A Frazer; Sarah S Murray; Nicholas J Schork; Eric J Topol
Journal: Nat Rev Genet Date: 2009-04 Impact factor: 53.242

4. GSEA-SNP: applying gene set enrichment analysis to SNP data from genome-wide association studies.

Authors: Marit Holden; Shiwei Deng; Leszek Wojnowski; Bettina Kulle
Journal: Bioinformatics Date: 2008-10-14 Impact factor: 6.937

Review 5. Genetic mapping in human disease.

Authors: David Altshuler; Mark J Daly; Eric S Lander
Journal: Science Date: 2008-11-07 Impact factor: 47.728

Review 6. Linkage disequilibrium--understanding the evolutionary past and mapping the medical future.

Authors: Montgomery Slatkin
Journal: Nat Rev Genet Date: 2008-06 Impact factor: 53.242

7. A second generation human haplotype map of over 3.1 million SNPs.

Authors: Kelly A Frazer; Dennis G Ballinger; David R Cox; David A Hinds; Laura L Stuve; Richard A Gibbs; John W Belmont; Andrew Boudreau; Paul Hardenbol; Suzanne M Leal; Shiran Pasternak; David A Wheeler; Thomas D Willis; Fuli Yu; Huanming Yang; Changqing Zeng; Yang Gao; Haoran Hu; Weitao Hu; Chaohua Li; Wei Lin; Siqi Liu; Hao Pan; Xiaoli Tang; Jian Wang; Wei Wang; Jun Yu; Bo Zhang; Qingrun Zhang; Hongbin Zhao; Hui Zhao; Jun Zhou; Stacey B Gabriel; Rachel Barry; Brendan Blumenstiel; Amy Camargo; Matthew Defelice; Maura Faggart; Mary Goyette; Supriya Gupta; Jamie Moore; Huy Nguyen; Robert C Onofrio; Melissa Parkin; Jessica Roy; Erich Stahl; Ellen Winchester; Liuda Ziaugra; David Altshuler; Yan Shen; Zhijian Yao; Wei Huang; Xun Chu; Yungang He; Li Jin; Yangfan Liu; Yayun Shen; Weiwei Sun; Haifeng Wang; Yi Wang; Ying Wang; Xiaoyan Xiong; Liang Xu; Mary M Y Waye; Stephen K W Tsui; Hong Xue; J Tze-Fei Wong; Luana M Galver; Jian-Bing Fan; Kevin Gunderson; Sarah S Murray; Arnold R Oliphant; Mark S Chee; Alexandre Montpetit; Fanny Chagnon; Vincent Ferretti; Martin Leboeuf; Jean-François Olivier; Michael S Phillips; Stéphanie Roumy; Clémentine Sallée; Andrei Verner; Thomas J Hudson; Pui-Yan Kwok; Dongmei Cai; Daniel C Koboldt; Raymond D Miller; Ludmila Pawlikowska; Patricia Taillon-Miller; Ming Xiao; Lap-Chee Tsui; William Mak; You Qiang Song; Paul K H Tam; Yusuke Nakamura; Takahisa Kawaguchi; Takuya Kitamoto; Takashi Morizono; Atsushi Nagashima; Yozo Ohnishi; Akihiro Sekine; Toshihiro Tanaka; Tatsuhiko Tsunoda; Panos Deloukas; Christine P Bird; Marcos Delgado; Emmanouil T Dermitzakis; Rhian Gwilliam; Sarah Hunt; Jonathan Morrison; Don Powell; Barbara E Stranger; Pamela Whittaker; David R Bentley; Mark J Daly; Paul I W de Bakker; Jeff Barrett; Yves R Chretien; Julian Maller; Steve McCarroll; Nick Patterson; Itsik Pe'er; Alkes Price; Shaun Purcell; Daniel J Richter; Pardis Sabeti; Richa Saxena; Stephen F Schaffner; Pak C Sham; Patrick Varilly; David Altshuler; Lincoln D 
Stein; Lalitha Krishnan; Albert Vernon Smith; Marcela K Tello-Ruiz; Gudmundur A Thorisson; Aravinda Chakravarti; Peter E Chen; David J Cutler; Carl S Kashuk; Shin Lin; Gonçalo R Abecasis; Weihua Guan; Yun Li; Heather M Munro; Zhaohui Steve Qin; Daryl J Thomas; Gilean McVean; Adam Auton; Leonardo Bottolo; Niall Cardin; Susana Eyheramendy; Colin Freeman; Jonathan Marchini; Simon Myers; Chris Spencer; Matthew Stephens; Peter Donnelly; Lon R Cardon; Geraldine Clarke; David M Evans; Andrew P Morris; Bruce S Weir; Tatsuhiko Tsunoda; James C Mullikin; Stephen T Sherry; Michael Feolo; Andrew Skol; Houcan Zhang; Changqing Zeng; Hui Zhao; Ichiro Matsuda; Yoshimitsu Fukushima; Darryl R Macer; Eiko Suda; Charles N Rotimi; Clement A Adebamowo; Ike Ajayi; Toyin Aniagwu; Patricia A Marshall; Chibuzor Nkwodimmah; Charmaine D M Royal; Mark F Leppert; Missy Dixon; Andy Peiffer; Renzong Qiu; Alastair Kent; Kazuto Kato; Norio Niikawa; Isaac F Adewole; Bartha M Knoppers; Morris W Foster; Ellen Wright Clayton; Jessica Watkin; Richard A Gibbs; John W Belmont; Donna Muzny; Lynne Nazareth; Erica Sodergren; George M Weinstock; David A Wheeler; Imtaz Yakub; Stacey B Gabriel; Robert C Onofrio; Daniel J Richter; Liuda Ziaugra; Bruce W Birren; Mark J Daly; David Altshuler; Richard K Wilson; Lucinda L Fulton; Jane Rogers; John Burton; Nigel P Carter; Christopher M Clee; Mark Griffiths; Matthew C Jones; Kirsten McLay; Robert W Plumb; Mark T Ross; Sarah K Sims; David L Willey; Zhu Chen; Hua Han; Le Kang; Martin Godbout; John C Wallenburg; Paul L'Archevêque; Guy Bellemare; Koji Saeki; Hongguang Wang; Daochang An; Hongbo Fu; Qing Li; Zhen Wang; Renwu Wang; Arthur L Holden; Lisa D Brooks; Jean E McEwen; Mark S Guyer; Vivian Ota Wang; Jane L Peterson; Michael Shi; Jack Spiegel; Lawrence M Sung; Lynn F Zacharia; Francis S Collins; Karen Kennedy; Ruth Jamieson; John Stewart
Journal: Nature Date: 2007-10-18 Impact factor: 49.962

8. HGVbaseG2P: a central genetic association database.

Authors: Gudmundur A Thorisson; Owen Lancaster; Robert C Free; Robert K Hastings; Pallavi Sarmah; Debasis Dash; Samir K Brahmachari; Anthony J Brookes
Journal: Nucleic Acids Res Date: 2008-10-23 Impact factor: 16.971

9. An open access database of genome-wide association results.

Authors: Andrew D Johnson; Christopher J O'Donnell
Journal: BMC Med Genet Date: 2009-01-22 Impact factor: 2.103

10. Pathway and network-based analysis of genome-wide association studies in multiple sclerosis.

Authors: Sergio E Baranzini; Nicholas W Galwey; Joanne Wang; Pouya Khankhanian; Raija Lindberg; Daniel Pelletier; Wen Wu; Bernard M J Uitdehaag; Ludwig Kappos; Chris H Polman; Paul M Matthews; Stephen L Hauser; Rachel A Gibson; Jorge R Oksenberg; Michael R Barnes
Journal: Hum Mol Genet Date: 2009-03-13 Impact factor: 6.150
