Fotios Drenos, Enzo Grossi, Massimo Buscema, Steve E Humphries.
Abstract
We present the use of innovative machine learning techniques in the understanding of Coronary Heart Disease (CHD) through intermediate traits, as an example of the use of this class of methods as a first step towards a systems epidemiology approach to the genetics of complex diseases. Using a sample of 252 middle-aged men, of whom 102 had a CHD event during 10 years of follow-up, we applied machine learning algorithms for the selection of CHD intermediate phenotypes, established markers, risk factors, and their previously associated genetic polymorphisms, and constructed a map of relationships between the selected variables. Of the 52 variables considered, 42 were retained after selection of the most informative variables for CHD. The constructed map suggests that most selected variables were related to CHD in a context-dependent manner, while only a small number of variables were related to a specific outcome. We also observed that loss of complexity in the network was linked to a future CHD event. We propose that novel, non-linear, and integrative epidemiological approaches are required to combine all available information, in order to truly translate the new advances in medical sciences into gains in preventive measures and patient care.
Year: 2015 PMID: 25951190 PMCID: PMC4423836 DOI: 10.1371/journal.pone.0125876
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Names and abbreviations of phenotypes and polymorphisms included in the analysis.
| Variable | SNP | Abbr. used | Indicator variables selected (of 1, 2, 3) | | |
|---|---|---|---|---|---|
| Smoking (yes/no) | | smoking | yes | | |
| Age | | age | X | X | |
| Body Mass Index | | bmi | X | X | X |
| Systolic Blood pressure | | sysbp | X | X | |
| Diastolic Blood pressure | | diabp | X | X | X |
| Triglycerides | | tg | X | | |
| Total Cholesterol | | chol | X | X | |
| Low Density Lipoprotein | | ldl | X | X | |
| High Density Lipoprotein | | hdl | X | X | |
| Apolipoprotein B | | apob | X | X | |
| Apolipoprotein A | | apoa | X | X | |
| Lipoprotein-associated phospholipase | | lppla2 | | | |
| C-reactive protein | | crp | X | X | |
| Factor VII coagulant activity | | viic | X | X | |
| Fibrinogen | | fib | | | |
| ALX homeobox 4 gene | rs729287 | ALX4 | X | | |
| Angiopoietin-like 4 gene | E40K | ANGPTL4 | X | | |
| Apolipoprotein B gene | rs585967 | APOB | | | |
| Apolipoprotein E gene | | APOE | X | X | |
| Apolipoprotein-A5-A4-C3-A1 gene cluster | rs6589566 | ApoA5-A4-C3-A1 | X | | |
| Arachidonate 5-lipoxygenase-activating protein gene | rs3885907 | ALOX5AP | X | | |
| Calpain 10 gene | rs4676411 | CAPN10 | | | |
| Cathepsin S gene | rs11576175 | CTSS | X | | |
| Cholesteryl ester transfer protein gene | rs708272 | CETP | X | | |
| Coagulation factor VII gene | rs6046 | F7 | X | | |
| Complement component 2 gene | rs7746553 | C2 | | | |
| Complement component 3 gene | rs344550 | C3 | | | |
| C-reactive protein gene | rs3093077 | CRP | X | X | X |
| Cyclin-dependent kinase inhibitor 2A/B (Chr9p21) | rs10811661 | CDKN | X | X | |
| Exostosin 2 gene | rs3740878 | EXT2 | X | X | |
| Fibrinogen alpha chain gene | rs4508864 | FGA | X | X | |
| Glucokinase (hexokinase 4) regulator gene | rs780094 | GCKR | X | X | |
| Glutathione S-transferase mu 3 gene | rs3814309 | GSTM3 | X | X | |
| Glutathione S-transferase mu 4 gene | rs1537236 | GSTM4 | X | X | |
| Hepatic lipase gene | rs1800588 | LIPC | X | X | X |
| Insulin gene | rs689 | INS | | | |
| Insulin-like growth factor 2 gene | 1252T/C AluI | IGF2 | X | X | |
| Interleukin 1 receptor antagonist gene | rs397211 | ILRN1 | X | X | |
| Interleukin 18 receptor accessory protein gene | rs11465699 | IL18RAP | X | | |
| Interleukin 6 receptor gene | rs4075015 | IL6R | | | |
| Lipoprotein lipase gene | rs301 | LPL | X | X | |
| Low density lipoprotein receptor gene | rs6511720 | LDLR | | | |
| Low density lipoprotein receptor-related protein 5 gene | rs11602256 | LRP5 | X | | |
| Nitric oxide synthase 3 gene | rs3918232S3 | NOS3 | X | | |
| Phospholipase A2, group VII gene | rs1051931 | PLA2G7 | X | X | |
| Platelet/endothelial cell adhesion molecule gene | rs1131012 | PECAM1 | X | X | X |
| Proprotein convertase subtilisin/kexin type 9 gene | rs11591147 | PCSK9 | X | | |
| Protein C receptor gene | rs867186 | PROCR | X | | |
| Toll-like receptor 4 gene | rs11536857 | TLR4 | X | X | |
| Transforming growth factor, beta 1 gene | rs4803455 | TGFB1 | X | X | |
| Uncoupling protein 2 gene | rs11602906 | UCP2 | | | |
| Uncoupling protein 3 gene | rs1685354 | UCP3 | X | X | |
The phenotypes were selected as established risk factors or markers of CHD, together with their associated polymorphisms [7]. Only the single top SNP was included for each gene considered. Before analysis, each SNP was recoded as three indicator variables, one per genotype class. To maintain the same three-variable coding, the continuous phenotypes were transformed to tertiles (S1 Table). The full list of phenotypic tertiles and indicator variables used can be found in S2 Table. The last three columns show the generated indicator variables selected by the TWIST procedure as predictive of CHD; out of the original 150 variables, 75 were retained. Continuous traits are coded in tertiles and genotypes as three genotype classes. In the abbreviations, index numbers after the gene names refer to the common homozygote (1), heterozygote (2) and rare homozygote (3). The APOE gene polymorphism was coded so that 1 denotes E2 carriers, 2 the E3E3 category and 3 E4 carriers. Smoking is a dichotomous variable (yes or no).
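The coding described in this caption can be illustrated with a minimal sketch. It assumes rank-based tertiles (the cut-points actually used are those in S1 Table, which is not reproduced here), and the function names and sample values are hypothetical.

```python
def tertile_codes(values):
    """Assign each value to tertile 1 (lowest), 2 or 3 by rank.

    Assumption: equal-sized rank-based groups; ties are split by input order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    codes = [0] * len(values)
    for rank, idx in enumerate(order):
        codes[idx] = 1 + (3 * rank) // len(values)
    return codes


def indicators(name, codes):
    """Expand a variable coded 1/2/3 into three 0/1 indicator columns."""
    return {f"{name}{k}": [int(c == k) for c in codes] for k in (1, 2, 3)}


# Hypothetical cholesterol values for six participants.
chol = [5.1, 6.3, 4.2, 7.0, 5.8, 6.9]
chol_codes = tertile_codes(chol)
chol_indicators = indicators("chol", chol_codes)

# Hypothetical genotypes already coded 1 (common homozygote),
# 2 (heterozygote), 3 (rare homozygote).
lipc_indicators = indicators("LIPC", [1, 2, 3, 1])
```

Each continuous trait and each SNP then contributes three binary columns, so the selection step can retain or drop, for example, `chol1` independently of `chol3`.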
Fig 1. Minimum Spanning Tree (MST) for the TWIST selected variables.
Minimum Spanning Tree (MST) for variables selected as informative for CHD by TWIST. Presence or absence of a CHD event during follow-up is included as two separate nodes in the tree. Only positive associations between the nodes, optimized for all other available connections, are represented in the graph. The numbers on the edges are a measure of similarity between the variables. Most risk factors considered are situated between the two event nodes, with paths able to reach either. A smaller number of parameters are characteristic of the Event or No_Event category, with paths that cannot reach one event node without passing through the other. Genotypes are coded as 1 for common homozygotes, 2 for heterozygotes and 3 for rare homozygotes. Phenotypes are in tertiles, with 1 the tertile of lowest values.
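As a rough illustration of how a tree like this can be derived from pairwise similarities, the sketch below applies Kruskal's algorithm. This section does not describe how the similarity values were computed, so the sketch assumes a precomputed symmetric similarity in [0, 1]; the node names and values are hypothetical.

```python
def minimum_spanning_tree(nodes, similarity):
    """Kruskal's algorithm, treating distance as 1 - similarity.

    similarity: dict mapping (a, b) node pairs to a value in [0, 1].
    Returns a list of (a, b, similarity) edges forming the tree.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the component root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    # Descending similarity is the same order as ascending 1 - similarity.
    for (a, b), s in sorted(similarity.items(), key=lambda kv: -kv[1]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # the edge joins two components, so no cycle
            parent[root_a] = root_b
            tree.append((a, b, s))
    return tree


# Hypothetical nodes and similarity values, for illustration only.
nodes = ["Event", "No_Event", "LIPC1", "hdl1"]
similarity = {
    ("Event", "hdl1"): 0.9,
    ("No_Event", "LIPC1"): 0.8,
    ("hdl1", "LIPC1"): 0.7,
    ("Event", "LIPC1"): 0.3,
    ("Event", "No_Event"): 0.1,
}
tree = minimum_spanning_tree(nodes, similarity)
```

With these values the two event nodes end up at opposite ends of a single path, so any route from `Event` to `No_Event` must pass through the intermediate variables.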
Fig 2. Meta-MST graph showing the connections represented in at least 9 of the 10 MSTs constructed after randomly excluding 10% of the records.
In contrast to statistical testing, where small sample sizes lead to loss of statistical power for the identification of associations, the method used here is instead affected in the stability of the proposed solution. Although the Meta-MST graph is smaller than the full MST, the main core of the graph remains unchanged: the two event nodes sit at opposite sides of the LIPC gene and fibrinogen locus nodes, with most variables between them.
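The stability check in this caption amounts to resampling followed by an edge vote. The sketch below shows only those two steps; how each MST is built from a subsample is left out, and the function names, edge lists and seed are hypothetical.

```python
import random
from collections import Counter


def subsample(records, exclude_fraction=0.10, rng=random):
    """Return the records left after randomly excluding a fraction of them."""
    keep = len(records) - round(len(records) * exclude_fraction)
    return rng.sample(records, keep)


def meta_mst(trees, min_count=9):
    """Keep the undirected edges that appear in at least min_count trees."""
    counts = Counter()
    for tree in trees:
        # frozenset ignores edge direction; the set counts an edge once per tree.
        counts.update({frozenset(edge) for edge in tree})
    return {edge for edge, n in counts.items() if n >= min_count}


# With 252 records, excluding 10% leaves 227 per resampled data set.
kept = subsample(list(range(252)), rng=random.Random(0))

# Ten hypothetical MST edge lists: one edge in all 10, one in 9, one in 8.
trees = [[("Event", "hdl1")] for _ in range(10)]
for i in range(9):
    trees[i].append(("LIPC1", "No_Event") if i else ("No_Event", "LIPC1"))
for i in range(8):
    trees[i].append(("age3", "Event"))
stable_edges = meta_mst(trees)
```

Only the first two edges survive the 9-of-10 threshold, which is the sense in which the Meta-MST is smaller than the full MST.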
Fig 3. Maximally Regular Graph (MRG) for those who developed CHD and those who remained CHD free during follow-up.
Separate Maximally Regular Graphs (MRG) for those who a) developed or b) remained free from CHD during the study's 10-year follow-up. The Minimum Spanning Tree (MST) represents the energy minimization state of all the correlations in the graph, mapping a single link between the variables and dropping all other links that would lead to cyclic structures, whereas the MRG shows the maximum intrinsic complexity of the map by including the highest number of cyclic regular microstructures between the nodes. The graph for those who went on to have a CHD event has a smaller number of complex structures than the graph for those who remained healthy.