Catarina Correia, Yoan Diekmann, Astrid M Vicente, José B Pereira-Leal.
Abstract
Hundreds of genetic variants have been associated with common diseases through genome-wide association studies (GWAS), yet current approaches are limited in their ability to detect true small-effect risk variants against a background of false-positive findings. Here we addressed the missing heritability problem, aiming to test whether there are indeed risk variants within GWAS statistical noise and to develop a systematic strategy to retrieve these hidden variants. Employing an integrative approach that combines protein-protein interactions with association data from GWAS for 6 common diseases, we found that genes associated with any of these diseases at less stringent significance levels (p < 0.1) are functionally connected beyond noise expectation. This functional coherence was used to identify disease-relevant subnetworks, which were shown to be enriched in known genes, outperforming the selection of top GWAS genes. As a proof of principle, we applied this approach to breast cancer, supporting well-known breast cancer genes while pinpointing novel susceptibility genes for experimental validation. This study reinforces the idea that GWAS are under-analyzed and that missing heritability is instead hidden. It extends the use of protein networks to reveal this missing heritability, thus leveraging the large investment in GWAS, which has so far produced little tangible gain.
Year: 2014 PMID: 25268625 PMCID: PMC4227180 DOI: 10.3390/ijms151017601
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Proteins encoded by genes selected at gene-wise p-values < 0.1 (−Log10 p > 1) are functionally related in a protein-protein interaction (PPI) network. Red circles represent the real value obtained for each genome-wide association study (GWAS) dataset analyzed. Box plots show the percentage of direct interactions (A), the percentage of isolated nodes (B), and the size of the largest connected component (LCC) (C) in 1000 random samples of proteins, by disease. Empirical p-values are shown.
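The comparison in Figure 1C, an observed LCC size set against 1000 random protein samples, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the toy network, and the `(hits + 1) / (n_samples + 1)` convention for the empirical p-value are assumptions, since the caption does not state how the empirical p-values were computed or how the random samples were drawn.

```python
import random
from collections import deque


def largest_component_size(nodes, edges):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        # Keep only interactions whose two endpoints were both selected.
        if a in nodes and b in nodes and a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search over one component.
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            current = queue.popleft()
            size += 1
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


def empirical_p(selected, all_proteins, edges, n_samples=1000, seed=0):
    """Observed LCC size and the fraction of random samples reaching it.

    Assumption: random sets are drawn uniformly from `all_proteins` and
    have the same size as `selected`.
    """
    rng = random.Random(seed)
    observed = largest_component_size(selected, edges)
    pool = sorted(all_proteins)
    hits = sum(
        largest_component_size(rng.sample(pool, len(selected)), edges) >= observed
        for _ in range(n_samples)
    )
    # Add-one correction (an assumed convention) avoids reporting p = 0.
    return observed, (hits + 1) / (n_samples + 1)


# Hypothetical toy network; the protein names are placeholders.
toy_edges = [("A", "B"), ("B", "C"), ("D", "E")]
toy_proteins = set("ABCDEFGH")
observed_size, p_value = empirical_p({"A", "B", "C"}, toy_proteins, toy_edges, n_samples=200)
```

The same loop could track the percentage of direct interactions or of isolated nodes (panels A and B) by swapping the statistic computed on each sample.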
Figure 2. Largest connected components contain true biological insight into diseases. (A) Precision, by disease, of five sets of genes against a list of known disease candidates (n = 50, 64, 44 and 21 for breast cancer, multiple sclerosis (MS), Parkinson’s and type 1 diabetes, respectively). The sets of genes evaluated for precision against the lists of known candidates were: the set of genes selected at a gene-wise p-value cutoff of 0.1 (white bar) (n = 1934, 1894, 1035 and 1907 in breast cancer, MS, Parkinson’s and type 1 diabetes, respectively), the set of genes included in the LCC obtained from the previous selection (red bar) (n = 395, 410, 146 and 337 in breast cancer, MS, Parkinson’s and type 1 diabetes, respectively), the same number of top GWAS genes as are included in the LCC (grey bar), and the sets of genes surviving Bonferroni correction over SNPs or genes (grey and black dots, respectively). Numbers above the bars are the number of known candidates included in each gene selection set; (B) Recall, by disease, of the same sets of genes against the lists of known candidates. Numbers above the bars are the number of known candidates included in each gene selection set; and (C) Venn diagrams showing, for each disease, the overlap between known candidate genes retrieved by LCC genes (dark grey circle) and by the same number of top GWAS genes (light grey circle).
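Precision and recall as used in Figure 2 follow the standard definitions: precision is the share of a selected gene set that appears in the known-candidate list, and recall is the share of the known-candidate list that the selection retrieves. The sketch below assumes gene sets are plain collections of identifiers; the function name and the placeholder gene labels are hypothetical.

```python
def precision_recall(selected, known):
    """Return (precision, recall, overlap) of a gene selection against known candidates."""
    selected, known = set(selected), set(known)
    overlap = selected & known
    # Guard against empty sets rather than dividing by zero.
    precision = len(overlap) / len(selected) if selected else 0.0
    recall = len(overlap) / len(known) if known else 0.0
    return precision, recall, overlap


# Placeholder identifiers, not real gene symbols or results.
lcc_genes = ["G1", "G2", "G3", "G4"]
known_candidates = ["G2", "G4", "G9"]
precision, recall, overlap = precision_recall(lcc_genes, known_candidates)
```

Because the LCC and the matched set of top GWAS genes contain the same number of genes, their precisions can be compared directly through the overlap counts printed above the bars.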
Figure 3. LCC performs better than GWAS top genes in retrieving known breast cancer genes. Venn diagrams showing the overlap between genes reported to be differentially expressed (retrieved from the TCGA data portal using the default parameters: −0.5 < Log2 < 0.5; frequency = 40%), to harbor copy number abnormalities (retrieved from the TCGA data portal using the default parameters: −0.5 < Log2 < 0.5; frequency = 20%), or to carry somatic mutations (genes with at least five cases reported in the COSMIC database) in breast cancer, and each of the sets of genes selected from the breast cancer GWAS dataset by the previous gene selection approaches (all genes selected at a gene-wise p-value < 0.1, LCC genes, and top GWAS genes, represented by light yellow, red and orange circles, respectively).
Figure 4. Breast cancer network. This network illustrates the 19 known breast cancer genes included in the breast cancer LCC and their first neighbors. Nodes are colored based on a score reflecting their presence in the LCC of an additional cancer dataset (neuroblastoma) and in the LCCs of the four unrelated diseases. A darker color represents a higher score, which means a higher specificity for cancer. The shape of the node reflects the presence of each gene in breast cancer gene lists (genes associated with breast cancer in NextBio, and genes with somatic mutations, copy number abnormalities or differential expression obtained from the COSMIC database and the TCGA data portal). Circular nodes are proteins absent from the four lists, triangular nodes are proteins present in one list, and diamond nodes are proteins present in two. A thicker border indicates that the gene was reported to be differentially expressed in breast cancer.