Stefanie Friedrichs, Juliane Manitz, Patricia Burger, Christopher I Amos, Angela Risch, Jenny Chang-Claude, Heinz-Erich Wichmann, Thomas Kneib, Heike Bickeböller, Benjamin Hofner.
Abstract
The analysis of genome-wide association studies (GWAS) benefits from the investigation of biologically meaningful gene sets, such as gene-interaction networks (pathways). We propose an extension to a successful kernel-based pathway analysis approach by integrating kernel functions into a powerful algorithmic framework for variable selection, to enable investigation of multiple pathways simultaneously. We employ genetic similarity kernels from the logistic kernel machine test (LKMT) as base-learners in a boosting algorithm. A model to explain case-control status is created iteratively by selecting pathways that improve its prediction ability. We evaluated our method in simulation studies adopting 50 pathways for different sample sizes and genetic effect strengths. Additionally, we included an exemplary application of kernel boosting to a rheumatoid arthritis and a lung cancer dataset. Simulations indicate that kernel boosting outperforms the LKMT in certain genetic scenarios. Applications to GWAS data on rheumatoid arthritis and lung cancer resulted in sparse models which were based on pathways interpretable in a clinical sense. Kernel boosting is highly flexible in terms of considered variables and overcomes the problem of multiple testing. Additionally, it enables the prediction of clinical outcomes. Thus, kernel boosting constitutes a new, powerful tool in the analysis of GWAS data and towards the understanding of biological processes involved in disease susceptibility.
Year: 2017 PMID: 28785300 PMCID: PMC5530424 DOI: 10.1155/2017/6742763
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1: Graphical representation of the rewiring step in data preparation. Nodes represent genes in the pathway, and edges indicate interactions between the corresponding genes. Assume the gene depicted in grey is not represented by any genetic markers in the considered study and thus cannot be analyzed. To retain information about the (indirect) interaction of the two genes previously linked to the omitted gene, a new direct link is established between them. Its interaction type is determined by multiplying the weights of the two dropped links.
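The rewiring step described in the caption of Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `rewire`, the edge representation, and the coding of interaction types as +1 (activation) and −1 (inhibition) are assumptions made for illustration, as is the choice to keep an already existing direct link unchanged. The sketch also assumes an undirected network without self-loops.

```python
from itertools import combinations

def rewire(edges, missing):
    """Drop the gene `missing` and link its former neighbours directly.

    edges: dict mapping frozenset({gene_a, gene_b}) to a weight,
           assumed here to be +1 (activation) or -1 (inhibition).
    """
    kept = {}
    dropped = {}  # neighbour of the missing gene -> weight of the dropped link
    for pair, weight in edges.items():
        if missing in pair:
            (other,) = pair - {missing}
            dropped[other] = weight
        else:
            kept[pair] = weight
    # New link weight = product of the two dropped link weights.
    for a, b in combinations(sorted(dropped), 2):
        # Assumption: an existing direct link between a and b is kept as is.
        kept.setdefault(frozenset((a, b)), dropped[a] * dropped[b])
    return kept

# Hypothetical toy pathway: gene X has no markers in the study.
toy = {
    frozenset(("A", "X")): 1,
    frozenset(("X", "B")): -1,
    frozenset(("B", "C")): 1,
}
rewired = rewire(toy, "X")
```

In the toy pathway, the activating link A–X and the inhibiting link X–B are replaced by a single inhibiting link A–B, while B–C is untouched.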
Figure 2: Graphical representation of the main features of the kernel boosting algorithm.
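As a rough illustration of the idea summarized in the abstract (kernels as base-learners, with one pathway selected per boosting iteration), the following Python sketch implements a generic component-wise gradient boosting loop for a binary outcome. It is a sketch under stated assumptions, not the published algorithm: it uses a plain linear kernel instead of the LKMT kernels, a kernel ridge smoother as base-learner, and illustrative values for the step length `nu`, the penalty `lam`, and the number of iterations. The simulated genotypes and all function names are hypothetical.

```python
import numpy as np

def linear_kernel(G):
    """Linear similarity kernel from a column-centred genotype matrix (n x SNPs)."""
    Gc = G - G.mean(axis=0)
    return Gc @ Gc.T

def kernel_boost(kernels, y, n_iter=10, nu=0.1, lam=1.0):
    """Component-wise boosting with one kernel ridge base-learner per pathway."""
    n = len(y)
    # Smoother matrix (K + lam*I)^-1 K of each base-learner.
    hats = [np.linalg.solve(K + lam * np.eye(n), K) for K in kernels]
    p0 = y.mean()
    f = np.full(n, np.log(p0 / (1.0 - p0)))  # offset: marginal log-odds
    selected = []
    for _ in range(n_iter):
        u = y - 1.0 / (1.0 + np.exp(-f))     # negative gradient of the binomial loss
        fits = [H @ u for H in hats]
        rss = [float(np.sum((u - fit) ** 2)) for fit in fits]
        best = int(np.argmin(rss))           # pathway that fits the gradient best
        f = f + nu * fits[best]              # small step towards its fit
        selected.append(best)
    return selected, f

# Hypothetical toy data: 3 pathways with 5 SNPs each; only pathway 0 affects status.
rng = np.random.default_rng(0)
n = 200
pathways = [rng.binomial(2, 0.3, size=(n, 5)).astype(float) for _ in range(3)]
y = (pathways[0][:, 0] + pathways[0][:, 1] >= 2).astype(float)
selected, f = kernel_boost([linear_kernel(G) for G in pathways], y)
```

Here `selected` records which pathway was chosen in each iteration. Pathways that are never chosen do not enter the model, which is how a procedure of this kind yields sparse models without a separate test per pathway.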
Figure 3: Relative frequency of datasets in which a pathway was selected for 50 pathways in the noninformative simulation scenario.
Network properties of the pathways used in simulations, compared to the properties of the two effect pathways hsa04020 and hsa04022. Nodes is the number of included genes; links is the number of interactions; inhibition links is the number of interactions of inhibiting type; the average degree of a node is the mean number of adjacent edges; density is the ratio of existing links to possible links; diameter is the longest shortest-path distance between any two nodes in the graph; transitivity (also called cluster coefficient) is the probability that two neighbors of a vertex are themselves connected; and signed transitivity considers the type of interaction in this calculation.
| | Min | Mean | Median | Max | hsa04020 | hsa04022 |
|---|---|---|---|---|---|---|
| Nodes | 29.00 | 103.60 | 86.5 | 398.00 | 180.00 | 167.00 |
| Links | 1.00 | 197.81 | 87.5 | 1493.00 | 297.00 | 372.00 |
| Inhibition links | 0.00 | 27.08 | 10.50 | 148.00 | 7.00 | 67.00 |
| Average degree | 0.07 | 3.18 | 2.36 | 15.62 | 3.30 | 4.46 |
| Density | 0.00 | 0.03 | 0.03 | 0.16 | 0.02 | 0.03 |
| Inhibition degree | 0.00 | 0.52 | 0.24 | 2.62 | 0.08 | 0.80 |
| Diameter | 1.00 | 7.36 | 7.00 | 18.00 | 6.00 | 7.00 |
| Transitivity | 0.00 | 0.02 | 0.00 | 0.14 | 0.00 | 0.03 |
| Signed transitivity | −0.02 | 0.01 | 0.00 | 0.10 | 0.00 | 0.03 |
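The degree and density rows of the table can be reproduced from the node and link counts alone. The short Python sketch below assumes undirected networks without self-loops. The definition of "inhibition degree" as the mean number of adjacent inhibition links per node is an assumption (the caption does not define it), chosen because it matches the tabulated values. The function name `network_summary` is hypothetical.

```python
def network_summary(n_nodes, n_links, n_inhibition):
    """Degree-based summaries of an undirected network without self-loops."""
    possible_links = n_nodes * (n_nodes - 1) / 2
    return {
        # Each link contributes to the degree of two nodes.
        "average_degree": 2 * n_links / n_nodes,
        "density": n_links / possible_links,
        # Assumed definition: mean number of inhibition links per node.
        "inhibition_degree": 2 * n_inhibition / n_nodes,
    }

# Node, link, and inhibition-link counts taken from the table above.
hsa04020 = network_summary(n_nodes=180, n_links=297, n_inhibition=7)
hsa04022 = network_summary(n_nodes=167, n_links=372, n_inhibition=67)
```

Rounded to two decimals, these values agree with the hsa04020 and hsa04022 columns of the table.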
Counts of included influential genes within pathways used for simulation purposes. Pathways without simulated causal genes are not displayed.
| KEGG id | Name of pathway | Effect genes included |
|---|---|---|
| hsa04020 | Calcium signaling pathway | 4 |
| hsa04022 | cGMP-PKG signaling pathway | 5 |
| hsa04024 | cAMP signaling pathway | 1 |
| hsa04080 | Neuroactive ligand-receptor interaction | 2 |
| hsa04270 | Vascular smooth muscle contraction | 2 |
| hsa04540 | Gap junction | 2 |
| hsa04610 | Complement and coagulation cascades | 1 |
| hsa05200 | Pathways in cancer | 2 |
Characteristics of analyzed GWAS datasets. Numbers of case and control individuals after quality control and SNP numbers for several analysis stages are displayed. Preprocessing of SNPs included quality control of genotype data, as well as updating genomic SNP positions according to the latest information (genomic build 38). The last column indicates the total number of all SNPs annotated to a pathway under investigation.
| Study | Cases/controls | SNPs genotyped | SNPs after preprocessing | SNPs in analysis |
|---|---|---|---|---|
| Lung cancer | 467/468 | 561,466 | 533,062 | 148,938 |
| Rheumatoid arthritis | 866/1189 | 545,080 | 491,695 | 137,839 |
KEGG pathways in the human diseases class as downloaded in April 2016. Pathways are sorted in ascending order of p value, derived from applying the LKMT to the rheumatoid arthritis dataset. p values are listed for pathways significantly associated after Bonferroni correction. Pathways selected by kernel boosting on the same dataset are marked in italics. Pathways containing one or several genes belonging to the HLA complex are marked with an asterisk after the id number.
| KEGG id | Name of pathway | p value |
|---|---|---|
| hsa05133 | Pertussis | 1.562 × 10^−32 |
| | | 1.029 × 10^−30 |
| hsa04933 | AGE-RAGE signaling pathway in diabetic complications | 3.877 × 10^−17 |
| | | 2.651 × 10^−16 |
| | | 3.087 × 10^−15 |
| | | 3.969 × 10^−15 |
| | | 4.131 × 10^−12 |
| | | 7.695 × 10^−11 |
| | | 1.344 × 10^−11 |
| hsa05030 | Cocaine addiction | 1.353 × 10^−11 |
| | | 1.466 × 10^−11 |
| hsa05310 | Asthma | 2.268 × 10^−11 |
| hsa05134 | Legionellosis | 1.699 × 10^−05 |
| | | 3.591 × 10^−10 |
| hsa05031 | Amphetamine addiction | 3.735 × 10^−10 |
| hsa05145 | Toxoplasmosis | 4.555 × 10^−10 |
| | | 1.814 × 10^−09 |
| hsa05332 | Graft-versus-host disease | 5.940 × 10^−09 |
| | | 1.530 × 10^−07 |
| hsa05143 | African trypanosomiasis | 2.114 × 10^−07 |
| hsa05222 | Small-cell lung cancer | 3.782 × 10^−07 |
| hsa05205 | Proteoglycans in cancer | 1.236 × 10^−06 |
| | | 1.702 × 10^−06 |
| | | 1.757 × 10^−06 |
| | | 1.980 × 10^−06 |
| hsa05010 | Alzheimer's disease | 7.234 × 10^−06 |
| hsa05142 | Chagas disease (American trypanosomiasis) | 1.048 × 10^−05 |
| | | 1.109 × 10^−05 |
| | | 1.368 × 10^−05 |
| hsa04932 | Nonalcoholic fatty liver disease (NAFLD) | 1.823 × 10^−05 |
| hsa05321 | Inflammatory bowel disease (IBD) | 2.124 × 10^−05 |
| | | 3.625 × 10^−05 |
| | | 4.133 × 10^−05 |
| | | 4.220 × 10^−05 |
| hsa05202 | Transcriptional misregulation in cancer | 7.697 × 10^−05 |
| hsa05220 | Chronic myeloid leukemia | 8.464 × 10^−05 |
| hsa05146 | Amoebiasis | 1.003 × 10^−04 |
| hsa05414 | Dilated cardiomyopathy | 1.014 × 10^−04 |
| hsa05231 | Choline metabolism in cancer | 1.504 × 10^−04 |
| | | 1.672 × 10^−04 |
| | | 2.390 × 10^−04 |
| hsa05214 | Glioma | 2.506 × 10^−04 |
| hsa05164 | Influenza A | 2.720 × 10^−04 |
| | | 3.384 × 10^−04 |
| | | 5.147 × 10^−04 |
| hsa05014 | Amyotrophic lateral sclerosis (ALS) | 5.568 × 10^−04 |
| hsa04930 | Type II diabetes mellitus | Not significant |
| hsa05218 | Melanoma | Not significant |
| hsa05140 | Leishmaniasis | Not significant |
| | | Not significant |
| | | Not significant |
| | | Not significant |
| | | Not significant |
| hsa05212 | Pancreatic cancer | Not significant |
| hsa05016 | Huntington's disease | Not significant |
| hsa05221 | Acute myeloid leukemia | Not significant |
| | | Not significant |
| | | Not significant |
| hsa05223 | Non-small-cell lung cancer | Not significant |
| hsa05034 | Alcoholism | Not significant |
| hsa05130 | Pathogenic Escherichia coli infection | Not significant |
| hsa05120 | Epithelial cell signaling in Helicobacter pylori infection | Not significant |
| | | Not significant |
| | | Not significant |
| hsa05100 | Bacterial invasion of epithelial cells | Not significant |
| hsa05216 | Thyroid cancer | Not significant |
| hsa05152 | Tuberculosis | Not significant |
| hsa05210 | Colorectal cancer | Not significant |
| hsa05230 | Central carbon metabolism in cancer | Not significant |
| | | Not significant |
| hsa05320 | Autoimmune thyroid disease | Not significant |
| hsa05033 | Nicotine addiction | Not significant |
| hsa05110 | Vibrio cholerae infection | Not significant |
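The Bonferroni correction mentioned in the table caption divides the family-wise significance level by the number of tests. The Python sketch below assumes a level of 0.05, which is not stated in this record, and takes the number of tests to be 73, the number of rows in the table. The function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

# Assumptions: family-wise level 0.05, one test per table row (73 rows).
threshold = bonferroni_threshold(alpha=0.05, n_tests=73)

# Largest p value listed as significant in the table (hsa05014, ALS).
largest_listed_p = 5.568e-04
is_significant = largest_listed_p <= threshold
```

Under these assumptions the threshold is roughly 6.85 × 10^−4, so the largest p value listed in the table still falls below it.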
Figure 4: Relative frequency of datasets in which a pathway was selected using (a) kernel boosting and (b) LKMT for a sample size of 2000 individuals. Pathways including effect genes are labeled in bold; numbers in brackets denote the count of included influential genes within the pathway. All effects were simulated with a relative risk of 1.5 per allele.
Figure 5: Relative frequency of datasets in which a pathway was selected using (a) kernel boosting and (b) LKMT for a sample size of 500 individuals. Pathways including effect genes are labeled in bold; numbers in brackets denote the count of included influential genes within the pathway. All effects were simulated with a relative risk of 1.5 per allele.
Figure 6: Relative frequency of datasets in which a pathway was selected using (a) kernel boosting and (b) LKMT for a sample size of 1000 individuals. Effect strength was set to a relative risk of 1.5 per allele. Pathways including effect genes are labeled in bold; numbers in brackets denote the count of included influential genes within the pathway.
Figure 7: Relative frequency of datasets in which a pathway was selected using (a) kernel boosting and (b) LKMT for a sample size of 1000 individuals. Effect strength was set to a relative risk of 1.1 per allele. Pathways including effect genes are labeled in bold; numbers in brackets denote the count of included influential genes within the pathway.
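The quantity plotted in the simulation figures, the relative frequency of datasets in which a pathway was selected, can be computed as in the following Python sketch. The input format (one set of selected pathway ids per simulated dataset), the function name, and the toy selections are assumptions for illustration and are not simulation results from the study.

```python
from collections import Counter

def selection_frequency(selections, pathway_ids):
    """Share of datasets in which each pathway was selected at least once."""
    counts = Counter(p for chosen in selections for p in set(chosen))
    n_datasets = len(selections)
    return {p: counts[p] / n_datasets for p in pathway_ids}

# Hypothetical selections from four simulated datasets.
toy_selections = [
    {"hsa04020", "hsa04022"},
    {"hsa04020"},
    set(),
    {"hsa04020", "hsa05200"},
]
freq = selection_frequency(
    toy_selections, ["hsa04020", "hsa04022", "hsa04024", "hsa05200"]
)
```

A pathway that is never selected, such as hsa04024 in the toy input, gets a frequency of 0 rather than being dropped, so all pathways under investigation appear on the same axis.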