Liwan Fu1,2, Yuquan Wang2, Tingting Li2, Siqian Yang2, Yue-Qing Hu2,3. 1. Center for Non-communicable Disease Management, National Center for Children's Health, Beijing Children's Hospital, Capital Medical University, Beijing, China. 2. State Key Laboratory of Genetic Engineering, Human Phenome Institute, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China. 3. Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, China.
Abstract
Genome-wide association studies (GWASs) have successfully discovered numerous variants underlying various diseases. However, the one-phenotype, one-variant association analysis typical of GWASs is inefficient at identifying variants with weak effects, which suggests that many signals remain undiscovered. Jointly analyzing multiple phenotypes is now recognized as an important way to increase the statistical power to identify genetic variants with weak effects on complex diseases, and it can shed new light on potential biological mechanisms. We therefore develop hierarchical clustering based on different methods for calculating correlation coefficients (HCDC) to analyze multiple phenotypes simultaneously in association studies. HCDC involves two steps. First, a clustering approach based on the similarity matrix between two groups of phenotypes is applied to choose a representative phenotype in each cluster. Then, existing methods are used to estimate the genetic associations with the representative phenotypes rather than with the individual phenotypes in every cluster. A variety of simulations are conducted to demonstrate the capacity of HCDC to boost power. In most scenarios, existing methods that embed HCDC are either more powerful than or comparable with the same methods without HCDC. Additionally, applying existing methods with HCDC to obesity-related phenotypes from the Atherosclerosis Risk in Communities study uncovered several associated variants. Among these, UQCC1-rs1570004 is reported as a significant obesity signal for the first time; its differential expression in subcutaneous fat, visceral fat, and muscle tissue is worthy of further functional studies.
Genome-wide association studies (GWASs) have established a large number of genetic variants associated with numerous complex diseases (Lutz et al., 2017), contributing to the understanding of the mechanisms of complex diseases such as obesity (Locke et al., 2015; Shungin et al., 2015). GWASs usually apply univariate analysis to examine the association between genetic variants and a single phenotype, although multiple disease-related phenotypes are typically collected together to better understand the physiological process of a disease (Yang et al., 2010). For example, information about an individual's obesity, insulin resistance, hypertension, and atherosclerotic dyslipidemia is required jointly to explore metabolic syndrome (Sattar et al., 2008), and a study of hypertension inevitably takes account of both systolic blood pressure (SBP) and diastolic blood pressure (DBP) (Yang and Wang, 2012). Pleiotropy, whereby some genes simultaneously affect multiple related phenotypes, further emphasizes the importance of multiple-phenotype analyses. Univariate analysis means analyzing each phenotype separately and reporting the outcomes for each phenotype (O'Reilly et al., 2012). However, analyzing one phenotype at a time requires multiple testing correction, which results in a loss of power in GWASs (Yang et al., 2010).
Recently, jointly analyzing multiple phenotypes has become popular because, compared with analyzing each phenotype separately, it increases the statistical power to identify genetic variants, explains more of the biological process of the relevant diseases, and makes the results more credible (Yang et al., 2010; Aschard et al., 2014; Fu et al., 2021).

In the past decade, methods for the joint analysis of multiple phenotypes have developed rapidly and may roughly be classified into three categories: regression approaches, approaches that integrate test statistics from univariate analyses, and variable reduction approaches (Yang and Wang, 2012). Tests in the first category, regression approaches, mainly encompass three methods for analyzing the association of multiple phenotypes with a genetic variant: mixed effect models (Bates and DebRoy, 2004), frailty models (Therneau et al., 2003), and generalized estimating equations (Zeger and Liang, 1986). The second category, as the name suggests, integrates test statistics or p-values from univariate association analyses via various strategies (Schaid et al., 2016; Yang et al., 2016). Various approaches of this kind have been established to investigate the association between genetic variants and multiple phenotypes while accounting for the correlation structure among phenotypes (van der Sluis et al., 2013; Kwak and Pan, 2016; Liang et al., 2016; Yang et al., 2016). In the last category, tests based on variable reduction roughly adopt three dimension reduction techniques. The first is principal component analysis (PCA) (Aschard et al., 2014). In PCA, the first few principal components (PCs), which account for the majority of the total phenotypic variance, are selected and tested for association with a genetic variant.
The second is canonical correlation analysis (CCA) (Tang and Ferreira, 2012). By searching for linear combinations that maximize the association between two sets of multidimensional variables, CCA supplies an efficient and powerful method for both univariate and multivariate analyses without the need for permutation tests in association studies. The last is the principal component of heritability (PCH) (Ott and Rabinowitz, 1999; Klei et al., 2008; Wang et al., 2016). To reduce multiple phenotypes, PCH adopts the linear combination of phenotypes that has the highest heritability among all linear combinations of phenotypes.

In this study, we develop a novel variable reduction approach called hierarchical clustering based on different methods for calculating correlation coefficients (HCDC) for jointly analyzing multiple phenotypes. Using a dimension reduction technique, HCDC constructs a representative phenotype from each cluster of phenotypes and then applies existing approaches for jointly analyzing multiple phenotypes to estimate the genetic associations with the representative phenotypes instead of the individual phenotypes. The rationale for this dimension reduction is that when a cluster is composed of highly positively correlated phenotypes, every linear combination of the phenotypes is a reasonable representative of the cluster (Bien and Wegkamp, 2013; Bühlmann et al., 2013). One specific advantage of HCDC is that it does not need individual-level phenotypes; it requires only a similarity matrix of the phenotypes. In real data analysis, the similarity matrix of phenotypes can be estimated from GWAS summary statistics at independent single nucleotide polymorphisms (SNPs) (Zhu et al., 2015). The previously proposed hierarchical clustering method (HCM) is also a clustering approach (Liang et al., 2018).
However, when calculating the correlation coefficients between distinct clusters, HCM adopts a uniform expression for the correlation coefficient, regardless of the number of phenotypes in each cluster. As a result, HCM has lower statistical power in some scenarios. In contrast, we propose HCDC and use extensive simulations to assess the validity of this improved two-step approach and to explore its power. Specifically, the performance of three existing approaches, namely, multivariate analysis of variance (MANOVA) (Cole and MaxwellScott, 1994), the joint model of multiple phenotypes (MultiPhen) (O'Reilly et al., 2012), and the trait-based association test that uses extended Simes procedure (TATES) (van der Sluis et al., 2013), is compared with and without HCDC or HCM. In this way, two questions can be addressed: whether clustering offers an advantage (MANOVA, MultiPhen, and TATES using HCDC or HCM are compared with these approaches without clustering) and which clustering approach performs better (MANOVA, MultiPhen, and TATES using HCDC are compared with these approaches using HCM). Our simulations reveal that MANOVA, MultiPhen, and TATES employing HCDC have correct type Ⅰ error rates and possess more power than MANOVA, MultiPhen, and TATES without HCDC in most simulation scenarios. Finally, we explore the performance of the HCDC approach on obesity-related phenotypes from a real dataset, the Atherosclerosis Risk in Communities (ARIC) Study (Author Anonymous, 1989), obtained from dbGaP. A total of eight significant SNPs are detected, and subsequent bioinformatics analysis is carried out to better understand the results. These results indicate that HCDC performs effectively in real data applications.
Methods
Proposed HCDC
Assume a sample with N individuals and M phenotypes $Y_1, \dots, Y_M$. Meanwhile, let $X = (x_1, \dots, x_N)^T$ denote the genotypic scores of the N individuals at a genetic variant of interest, where $x_i \in \{0, 1, 2\}$ represents the number of minor alleles that the ith subject carries at that variant.

Note that the key issue in hierarchical clustering is to specify a measure of similarity between disjoint groups of phenotypes. Take two disjoint clusters $G_1$ and $G_2$ of phenotypes as an example to demonstrate the calculation of similarity between two groups. Denote $m_1$ and $m_2$ as the numbers of phenotypes in $G_1$ and $G_2$, respectively.

1. If $m_1 = m_2 = 1$, the Pearson correlation coefficient (Jin and Lin, 2019) between the two phenotypes is calculated to represent the similarity between $G_1$ and $G_2$.
2. If $m_1 = 1$ and $m_2 > 1$, or $m_1 > 1$ and $m_2 = 1$, the multiple correlation coefficient (Cohen and Cohen, 1983; Jin and Lin, 2019) is employed based on the phenotypes involved in $G_1$ and $G_2$, respectively, to reveal the similarity between the pair of clusters.
3. If $m_1 > 1$ and $m_2 > 1$, the canonical correlation coefficient (Ferreira and Purcell, 2009) is applied according to the phenotypes involved in $G_1$ and $G_2$, respectively, to show the similarity between the two clusters.

Once we have the similarity measure between two clusters of phenotypes, we apply a hierarchical clustering approach to cluster the phenotypes. Specifically, following the agglomerative (bottom–up) procedure, we start at the bottom (i.e., the lowest level), where each phenotype is a cluster, and then recursively merge the pair of clusters with the biggest intergroup similarity at the next lower level into a single cluster. Each merge produces a grouping at the next higher level with one fewer cluster, until all phenotypes are grouped into one cluster at the highest level. Finally, there are M − 1 levels in the hierarchy.

For any b,
, let
denote the height at the level b in the dendrogram, which is the biggest intergroup similarity at the level b − 1. Similar to a proposed principle (Bühlmann et al., 2013), a stopping criterion is adopted to determine the optimal number K of clusters,
Without loss of generality, the corresponding K clusters are denoted as
$G_1, \dots, G_K$.

The established HCDC encompasses the following two steps. First, the M phenotypes are grouped into K clusters as described above, and a representative phenotype is singled out for each of the K clusters. Second, existing approaches are applied to the K representative phenotypes instead of the original M phenotypes to evaluate the genetic association of multiple phenotypes with a genetic variant.

Notice that each phenotype should be scaled before constructing the representative phenotype for each cluster. We define the representative phenotype for the kth cluster as the mean of the phenotype values in the cluster, namely

$$\bar{Y}_k = \frac{1}{m_k} \sum_{j \in G_k} Y_j,$$

where $Y_j$ is the scaled jth phenotype and $m_k$ is the number of phenotypes in the cluster $G_k$. Denote $\bar{Y} = (\bar{Y}_1, \dots, \bar{Y}_K)$ as the $N \times K$ design matrix whose kth column is given by $\bar{Y}_k$. Then, existing approaches are employed to evaluate the association between $\bar{Y}$ and X.

The source code for the HCDC approach can be found at https://github.com/YQHuFD/HCDC.
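The two-step procedure described in this subsection can be sketched as follows. This is a minimal illustration and not the authors' implementation (see the repository above for that). The function names are hypothetical, the number of clusters K is supplied by the caller instead of being chosen by the stopping criterion, and the canonical correlation is computed with a QR and singular value decomposition, which assumes each cluster's phenotype block has full column rank.

```python
import numpy as np


def cluster_similarity(Y, g1, g2):
    """Similarity between disjoint clusters g1 and g2 (lists of column indices of Y)."""
    A, B = Y[:, g1], Y[:, g2]
    if len(g1) == 1 and len(g2) == 1:
        # Case 1: Pearson correlation between two single phenotypes.
        return np.corrcoef(A[:, 0], B[:, 0])[0, 1]
    if len(g1) == 1 or len(g2) == 1:
        # Case 2: multiple correlation, i.e. the correlation between the single
        # phenotype and its least-squares fit on the phenotypes of the other cluster.
        y, Z = (A[:, 0], B) if len(g1) == 1 else (B[:, 0], A)
        Z1 = np.column_stack([np.ones(len(y)), Z])
        fitted = Z1 @ np.linalg.lstsq(Z1, y, rcond=None)[0]
        return np.corrcoef(y, fitted)[0, 1]
    # Case 3: first canonical correlation, obtained as the largest singular
    # value of Qa^T Qb, where Qa and Qb are orthonormal bases of the centered blocks.
    Qa, _ = np.linalg.qr(A - A.mean(axis=0))
    Qb, _ = np.linalg.qr(B - B.mean(axis=0))
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)[0]


def hcdc_clusters(Y, K):
    """Bottom-up merging of the most similar pair until K clusters remain.

    Assumption: K is given; the paper selects it with a stopping criterion.
    """
    clusters = [[j] for j in range(Y.shape[1])]
    while len(clusters) > K:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = max(pairs, key=lambda ab: cluster_similarity(
            Y, clusters[ab[0]], clusters[ab[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def representative_phenotypes(Y, clusters):
    """Scale each phenotype, then average the scaled phenotypes within each cluster."""
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    return np.column_stack([Z[:, g].mean(axis=1) for g in clusters])


# Toy data (illustrative only): two independent factors, three phenotypes each.
rng = np.random.default_rng(0)
factors = rng.standard_normal((500, 2))
Y = np.repeat(factors, 3, axis=1) + 0.3 * rng.standard_normal((500, 6))
clusters = hcdc_clusters(Y, K=2)
reps = representative_phenotypes(Y, clusters)  # N x K matrix passed to MANOVA, etc.
```

In this toy example the two recovered clusters should coincide with the two simulated factors, and `reps` plays the role of the N × K design matrix of representative phenotypes.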
Comparison of Methods
For convenience, let $\mathbf{1}_n$ denote the ones vector of length n and $\mathbf{0}_n$ represent the all-zeros vector of length n, where n is a positive integer. First, we introduce one of the potential competitors, HCM (Liang et al., 2018). Like HCDC, HCM adopts the bottom–up hierarchical clustering method on the basis of similarity. But unlike HCDC, HCM defines the similarity matrix with
, where
is the i jth entry of the sample correlation matrix of M phenotypes
. The average linkage is employed as the similarity between two clusters in HCM. To be precise, the similarity between clusters
and
(which are two disjoint subsets of {1, 2, … , M}) is given by
where
and
are the numbers of phenotypes in the respective clusters
and
,
.

Except for the different definition of similarity between pairs of clusters, the remaining steps of HCM are exactly the same as those of HCDC. Second, the performance of MANOVA (Cole and MaxwellScott, 1994), MultiPhen (O'Reilly et al., 2012), and TATES (van der Sluis et al., 2013) using HCDC is compared with their performance using HCM and with their performance without either clustering approach. The versions employing HCDC are referred to as HCDCMANOVA, HCDCMultiPhen, and HCDCTATES, and the versions employing HCM are referred to as HCMANOVA, HCMultiPhen, and HCTATES. In the following, we briefly review the existing approaches for easy reference.

MANOVA (multivariate analysis of variance) (Cole and MaxwellScott, 1994): A total of M phenotypes are involved in the standard MANOVA, and the background variance–covariance matrix $\Sigma$, which includes $M \times M$ symmetrical elements, is unconstrained. There are $M(M+1)/2$ freely estimated elements among the variances and covariances. Standard MANOVA tests the null hypothesis that the M regression coefficients are all zero, using a test statistic that asymptotically follows an F distribution.

MultiPhen (joint model of multiple phenotypes) (O'Reilly et al., 2012): In the MultiPhen model, the genotype is treated as the ordinal response and the phenotypes are treated as predictors. A likelihood ratio test is performed to test the null hypothesis in the proportional odds logistic regression.

TATES (trait-based association test that uses extended Simes procedure) (van der Sluis et al., 2013): The p-values from the univariate analyses are integrated into a single overall p-value, with an adjustment for the correlation between phenotypes. Denote

$$P_T = \min_{1 \le j \le M} \frac{m_e \, p_{(j)}}{m_{e(j)}}$$

as the p-value of TATES, where $p_{(j)}$ represents the jth p-value in ascending order, and $m_e$ and $m_{e(j)}$ are the effective numbers of independent p-values among all M phenotypes and among the top j phenotypes, respectively. The correlation matrix of the p-values is derived to obtain the effective numbers.
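The extended Simes combination used by TATES can be sketched as below. This is a hedged illustration, not the TATES software: the function names are hypothetical, the correlation matrix of the p-values is taken as a given input `R`, and the effective number of independent p-values is computed with an eigenvalue-based estimator (the number of p-values minus the summed excess of eigenvalues above 1), which is an assumption about the estimator rather than something stated in this text.

```python
import numpy as np


def effective_number(R):
    """Eigenvalue-based effective number of independent p-values (assumed estimator)."""
    eigvals = np.linalg.eigvalsh(R)
    return R.shape[0] - np.sum(eigvals[eigvals > 1] - 1)


def tates_pvalue(pvals, R):
    """Extended Simes combination: min over j of m_e * p_(j) / m_e(j).

    pvals: univariate p-values, one per phenotype.
    R: correlation matrix of the p-values (same ordering as pvals).
    """
    p = np.asarray(pvals, dtype=float)
    R = np.asarray(R, dtype=float)
    order = np.argsort(p)                 # indices of p-values in ascending order
    m_e = effective_number(R)             # effective number among all phenotypes
    best = np.inf
    for j in range(1, len(p) + 1):
        top = order[:j]                   # the j phenotypes with the smallest p-values
        m_ej = effective_number(R[np.ix_(top, top)])
        best = min(best, m_e * p[order[j - 1]] / m_ej)
    return best


# Independent p-values: the formula reduces to the ordinary Simes procedure.
p_independent = tates_pvalue([0.01, 0.04, 0.03], np.eye(3))
# Perfectly correlated p-values: only one effective test, so the minimum p-value is returned.
p_correlated = tates_pvalue([0.01, 0.04, 0.03], np.ones((3, 3)))
```

The two calls at the end illustrate the limiting cases: with an identity correlation matrix the effective numbers equal the raw counts, and with a matrix of ones every effective number collapses to 1.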
Results
Simulation Studies
Suppose that a population is in Hardy–Weinberg equilibrium (HWE), and we generate the genotypes of the genetic variants following the binomial distribution with parameter two and the minor allele frequency (MAF). This simulation study sets MAF = 0.3 in most scenarios. We generate multiple phenotypes by means of the following factor model (van der Sluis et al., 2013):
where
denotes the M phenotypes; x is the genotype;
represents the vector of values suggesting the effects of genetic variant on the M phenotypes; f shows the vector of factors;
,
; I is the identity matrix; R represents the number of factors, and
is the correlation between factors;
is an
matrix; d is a diagonal matrix for correcting the variance of phenotypes; c denotes a constant;
represents a vector of random errors, and
are mutually independent and follow the standard normal distributions. Consider the following four models with different numbers of factors affected by genotypes.
and
where
denotes the vectors of components
.Model 1: There is only one factor, and the genotype has an effect on all phenotypes with the same effect size. That is,
,
,
, and
.Model 2: There are two factors and the genotype impacts on one factor with the same effect. Namely,
,
,
, and
, which is the block diagonal matrix of
and
.Model 3: There are four factors, and the genotype has an effect on the last two factors with varied effect directions. That is,
,
,Model 4: There are four factors, and the genotype has an influence on the last three factors with different sizes. Namely,
, and

For all models, the within-factor correlation is $c^2 = 0.5$, and the between-factor correlation is $\rho c^2 = 0.1$. To evaluate type Ⅰ error rates and power, this study sets N = 2,000 unrelated individuals and the number of phenotypes M = 16 or 32. According to
, all phenotypes independent of genotypes are generated to estimate the type Ⅰ error rates of all investigated approaches, encompassing MANOVA, MultiPhen, TATES, HCMANOVA, HCMultiPhen, HCTATES, HCDCMANOVA, HCDCMultiPhen, and HCDCTATES. The corresponding Q–Q plots of type Ⅰ error rates in varied approaches are shown in Supplementary Figures S1–8. Notably, for assessing powers, we do not only alter the values of
(meanwhile, the within-factor correlation
and between-factor correlation
) but also vary the values of within-factor correlation
, and 0.9 (meanwhile, the between-factor correlation
).
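As a rough illustration of the data-generating step, the sketch below simulates genotypes under HWE and phenotypes for a one-factor setting in the spirit of Model 1. The phenotype equation used here, $y = \beta x + c f + \sqrt{1 - c^2}\,\varepsilon$ with a single standard normal factor f, is an assumption chosen so that the within-factor correlation equals $c^2$; it is not a transcription of the factor model above, and the function name and defaults are hypothetical.

```python
import numpy as np


def simulate_one_factor(n=2000, m=16, maf=0.3, beta=0.0, c2=0.5, seed=0):
    """Simulate genotypes and m phenotypes sharing one factor (assumed model form)."""
    rng = np.random.default_rng(seed)
    # Under HWE, the minor allele count is Binomial(2, MAF).
    x = rng.binomial(2, maf, size=n)
    f = rng.standard_normal(n)            # single shared factor
    eps = rng.standard_normal((n, m))     # independent standard normal errors
    # Same genetic effect beta on every phenotype; beta = 0 gives the null setting.
    Y = beta * x[:, None] + np.sqrt(c2) * f[:, None] + np.sqrt(1 - c2) * eps
    return x, Y


x, Y = simulate_one_factor()
corr = np.corrcoef(Y, rowvar=False)
mean_offdiag = (corr.sum() - np.trace(corr)) / (corr.size - corr.shape[0])
```

With `beta=0.0` the phenotypes are independent of the genotype, which corresponds to the type Ⅰ error setting; `mean_offdiag` should be close to the chosen within-factor correlation.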
Simulation Results
We consider different nominal significance levels, numbers of phenotypes, and numbers of factors to assess the type Ⅰ error rates of all nine methods. In each simulation model, the p-values of all evaluated methods are obtained from their asymptotic distributions. The type Ⅰ error rates of MANOVA, MultiPhen, TATES, HCMANOVA, HCMultiPhen, HCTATES, HCDCMANOVA, HCDCMultiPhen, and HCDCTATES are evaluated using 10,000 replicated samples. With 10,000 replicates, the 95% confidence intervals (CIs) for the type Ⅰ error rates at the nominal levels of 0.01 and 0.05 are about (0.008, 0.012) and (0.0457, 0.0543), respectively. The estimated type Ⅰ error rates of all tested methods are shown in Table 1 (M = 16) and Table 2 (M = 32). The majority of the type Ⅰ error rates of HCDCMANOVA, HCDCMultiPhen, and HCDCTATES fall within the 95% CIs, which supports the validity of HCDC applied to existing methods. Additionally, the type Ⅰ error rates of MANOVA, MultiPhen, TATES, HCMANOVA, HCMultiPhen, and HCTATES do not deviate markedly from the nominal levels. For more information, please see the Q–Q plots in Supplementary Figures S1–8.
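The quoted intervals are consistent with a normal approximation to the binomial proportion, $\alpha \pm 1.96\sqrt{\alpha(1-\alpha)/n}$ with n = 10,000 replicates. The short check below assumes that this is how the intervals were obtained; the function name is hypothetical.

```python
import math


def type1_error_ci(alpha, n_rep, z=1.96):
    """Normal-approximation 95% interval for an empirical rejection rate."""
    half_width = z * math.sqrt(alpha * (1 - alpha) / n_rep)
    return alpha - half_width, alpha + half_width


for alpha in (0.01, 0.05):
    lower, upper = type1_error_ci(alpha, 10_000)
    print(f"alpha = {alpha}: ({lower:.4f}, {upper:.4f})")
```

Under this assumption the computed bounds round to (0.0080, 0.0120) and (0.0457, 0.0543), matching the intervals stated above.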
TABLE 1
Evaluations of type Ⅰ error rates of the nine methods in four simulation models.

| Methods | Model 1, α = 0.01 | Model 1, α = 0.05 | Model 2, α = 0.01 | Model 2, α = 0.05 | Model 3, α = 0.01 | Model 3, α = 0.05 | Model 4, α = 0.01 | Model 4, α = 0.05 |
|---|---|---|---|---|---|---|---|---|
| HCDCMANOVA | 0.0102 | 0.0523 | 0.0113 | 0.0522 | 0.0086 | 0.0532 | 0.0095 | 0.0500 |
| HCMANOVA | 0.0100 | 0.0517 | 0.0113 | 0.0524 | 0.0094 | 0.0478 | 0.0101 | 0.0509 |
| MANOVA | 0.0108 | 0.0505 | 0.0112 | 0.0547 | 0.0089 | 0.0514 | 0.0103 | 0.0519 |
| HCDCMultiPhen | 0.0101 | 0.0538 | 0.0120 | 0.0527 | 0.0089 | 0.0532 | 0.0102 | 0.0483 |
| HCMultiPhen | 0.0091 | 0.0528 | 0.0121 | 0.0526 | 0.0102 | 0.0519 | 0.0101 | 0.0494 |
| MultiPhen | 0.0107 | 0.0523 | 0.0116 | 0.0520 | 0.0094 | 0.0517 | 0.0110 | 0.0537 |
| HCDCTATES | 0.0108 | 0.0502 | 0.0112 | 0.0511 | 0.0099 | 0.0466 | 0.0112 | 0.0506 |
| HCTATES | 0.0122 | 0.0510 | 0.0114 | 0.0512 | 0.0109 | 0.0488 | 0.0103 | 0.0500 |
| TATES | 0.0111 | 0.0473 | 0.0119 | 0.0512 | 0.0112 | 0.0514 | 0.0121 | 0.0535 |

Sample size N = 2,000, the number of phenotypes M = 16, $c^2$ = 0.5, $\rho c^2$ = 0.1, and minor allele frequency (MAF) = 0.3. The type Ⅰ error rates of all nine methods are evaluated by 10,000 replicated samples at the significance level α. The values in bold indicate that the type Ⅰ error rates are outside the 95% CI of the nominal significance level.
TABLE 2
Evaluations of type Ⅰ error rates of the nine methods in four simulation models.

| Methods | Model 1, α = 0.01 | Model 1, α = 0.05 | Model 2, α = 0.01 | Model 2, α = 0.05 | Model 3, α = 0.01 | Model 3, α = 0.05 | Model 4, α = 0.01 | Model 4, α = 0.05 |
|---|---|---|---|---|---|---|---|---|
| HCDCMANOVA | 0.0100 | 0.0515 | 0.0118 | 0.0543 | 0.0100 | 0.0498 | 0.0099 | 0.0480 |
| HCMANOVA | 0.0111 | 0.0502 | 0.0118 | 0.0544 | 0.0111 | 0.0503 | 0.0102 | 0.0506 |
| MANOVA | 0.0101 | 0.0510 | 0.0106 | 0.0582 | 0.0115 | 0.0545 | 0.0102 | 0.0515 |
| HCDCMultiPhen | 0.0099 | 0.0502 | 0.0117 | 0.0545 | 0.0098 | 0.0500 | 0.0091 | 0.0497 |
| HCMultiPhen | 0.0110 | 0.0516 | 0.0119 | 0.0543 | 0.0102 | 0.0503 | 0.0099 | 0.0512 |
| MultiPhen | 0.0102 | 0.0495 | 0.0110 | 0.0589 | 0.0115 | 0.0573 | 0.0106 | 0.0511 |
| HCDCTATES | 0.0112 | 0.0514 | 0.0119 | 0.0539 | 0.0097 | 0.0483 | 0.0086 | 0.0463 |
| HCTATES | 0.0093 | 0.0450 | 0.0120 | 0.0538 | 0.0111 | 0.0546 | 0.0106 | 0.0516 |
| TATES | 0.0078 | 0.0410 | 0.0105 | 0.0465 | 0.0128 | 0.0524 | 0.0101 | 0.0496 |

Sample size N = 2,000, the number of phenotypes M = 32, $c^2$ = 0.5, $\rho c^2$ = 0.1, and minor allele frequency (MAF) = 0.3. The type Ⅰ error rates of all nine methods are evaluated by 10,000 replicated samples at the significance level α. The values in bold indicate that the type Ⅰ error rates are outside the 95% CI of the nominal significance level.
To compare the power of these nine methods, we vary the number of phenotypes and the simulation model. The power of each test is estimated on the basis of 1,000 replications and 2,000 subjects at a significance level of 0.05. From the plots of power against the genetic effect β (Figure 1), the following can be observed:
FIGURE 1
Power comparisons of the nine methods as a function of β in the four models. Sample size N = 2,000, the number of phenotypes M = 16 (A–D) and M = 32 (E–H), $c^2$ = 0.5, $\rho c^2$ = 0.1, and MAF = 0.3. The power of all the methods is evaluated by 1,000 replicated samples at a significance level of 0.05.
1. When the genetic variant has the same effect on all the phenotypes (Model 1), HCDCMANOVA, HCDCMultiPhen, and HCDCTATES are more powerful than HCMANOVA, HCMultiPhen, and HCTATES, respectively. Meanwhile, HCMANOVA, HCMultiPhen, and HCTATES are more powerful than MANOVA, MultiPhen, and TATES, respectively. In most replications, HCDC and HCM cluster the phenotypes into one or several categories, which reduces the number of phenotypes to be analyzed and enhances the power of the test. HCDC is slightly more powerful than HCM in this scenario.
2. When the genetic effects on phenotypes fall into groups and share the same direction (Model 2), the power of HCDCMANOVA, HCDCMultiPhen, and HCDCTATES is equal to that of HCMANOVA, HCMultiPhen, and HCTATES, respectively. However, MANOVA, MultiPhen, and TATES with HCDC or HCM are much more powerful than MANOVA, MultiPhen, and TATES, respectively. These results indicate that clustering can increase the power of the test in this setting.
3. When the genetic effects on phenotypes fall into groups and show different directions (Models 3 and 4), MANOVA, MultiPhen, and TATES are more powerful than MANOVA, MultiPhen, and TATES with HCDC or HCM, respectively.
4. Regardless of changes in the genetic effect β or in the correlation coefficients between phenotypes, HCDCMANOVA and HCDCMultiPhen, HCMANOVA and HCMultiPhen, and MANOVA and MultiPhen have similar performance in all four models, respectively.
5. When the genetic effects on phenotypes clearly share the same direction within a group (Models 1 and 2), HCDCTATES, HCTATES, and TATES perform better than the other approaches.

From the plots of power against the within-factor correlation $c^2$ (Supplementary Figures S9 and S10), we can observe the following:

6. When the genetic variant has the same effect on the phenotypes within a group, and the phenotypes within this group have the same variance, the power of all evaluated methods decreases as the within-factor correlation $c^2$ increases (Models 1 and 2). However, MANOVA, MultiPhen, and TATES with HCDC have an obvious advantage over MANOVA, MultiPhen, and TATES without HCDC, respectively.
7. When the genetic variant has distinct effects on the phenotypes within a group, and the phenotypes within this group have different variances (Models 3 and 4), MANOVA, MultiPhen, and TATES with HCDC have more power than MANOVA, MultiPhen, and TATES without HCDC when $c^2$ < 0.5, but MANOVA and MultiPhen gain the advantage when $c^2$ > 0.5, which reveals that MANOVA and MultiPhen take heteroscedasticity between different phenotypes into account when calculating genetic associations.

In summary, the existing approaches employing HCDC have controlled type Ⅰ error rates and, in most scenarios, are more powerful than or comparable with those without HCDC. We therefore conclude that HCDC can give more power than HCM or the original approaches without HCDC, and in some scenarios the advantage is more obvious. In other scenarios, the existing methods using HCDC are comparable with the most powerful test.
Real Data Analysis
We apply our established approach, HCDC, together with other existing methods to real data from the ARIC study (Author Anonymous, 1989). Briefly, ARIC is a prospective cohort study supported by the National Heart, Lung, and Blood Institute (NHLBI) that aims to assess atherosclerosis risk in communities. It tracks changes in the occurrence of atherosclerosis-related diseases and cardiovascular risk factors across regions, races, genders, and periods of time in order to explore the natural course of atherosclerosis (Morrison et al., 2013). We acquired the clinical phenotype and genotype data of ARIC from the dbGaP server of the United States National Center for Biotechnology Information (accession number: phs000090.v4.p1).

To evaluate the performance of HCDC together with other existing methods on real data, we apply the nine approaches to obesity-related phenotypes in ARIC. We choose nine continuous obesity-related phenotypes, comprising body weight, body mass index (BMI), mean skinfold thickness of the triceps brachii, average subscapular skinfold thickness, hip girth, waist, waist-to-hip ratio (WHR), calf girth, and wrist breadth, and three covariates: age, gender, and race. These variables are described in detail in Table 3, and the correlation matrix of the obesity-related phenotypes is displayed in Supplementary Figure S11. After removing subjects with missing data for any of these 12 variables and genetic variants with a missing rate above 0.2 or an HWE test p-value below $10^{-4}$, a total of 12,701 individuals and 272,027 SNPs remain for subsequent analysis. Each phenotype is adjusted for the three covariates using a linear regression model.
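The covariate adjustment in the last step can be sketched as ordinary least-squares residualization. This is a minimal sketch under stated assumptions: the function name is hypothetical, gender and race are assumed to be coded as 0/1 indicators, and the residuals are rescaled to unit variance because scaled phenotypes are used downstream. The toy data are synthetic and unrelated to ARIC.

```python
import numpy as np


def adjust_for_covariates(Y, C):
    """Regress each phenotype (column of Y) on the covariates C and return scaled residuals."""
    C1 = np.column_stack([np.ones(C.shape[0]), C])      # add an intercept column
    coef, *_ = np.linalg.lstsq(C1, Y, rcond=None)       # one regression per phenotype
    resid = Y - C1 @ coef
    return resid / resid.std(axis=0, ddof=1)            # unit-variance residuals


# Synthetic example: 200 subjects, covariates (age, gender, race), 9 phenotypes.
rng = np.random.default_rng(1)
C = np.column_stack([
    rng.normal(54, 6, size=200),        # age in years
    rng.integers(0, 2, size=200),       # gender indicator (assumed 0/1 coding)
    rng.integers(0, 2, size=200),       # race indicator (assumed 0/1 coding)
])
Y = C @ rng.normal(size=(3, 9)) + rng.standard_normal((200, 9))
Y_adj = adjust_for_covariates(Y, C)
```

By construction, the adjusted phenotypes in `Y_adj` are uncorrelated with each covariate in the sample, so any remaining association with a SNP is not attributable to a linear effect of age, gender, or race.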
TABLE 3
The descriptions of involved obesity-related phenotypes and covariates in ARIC.

| Index | All | Male | Female | p value (gender) | White | Black | p value (race) |
|---|---|---|---|---|---|---|---|
| N | 12,771 | 5,704 | 7,067 | - | 9,633 | 3,138 | - |
| Male, % | 44.66 | - | - | - | 47.02 | 37.44 | 9.11 × 10^-21 |
| Age, years | 54.09 ± 5.73 | 54.450 ± 5.75 | 53.76 ± 5.69 | 6.76 × 10^-13 | 54.34 ± 5.68 | 53.34 ± 5.80 | 5.51 × 10^-17 |
| Weight, lb | 173.13 ± 36.85 | 188.27 ± 31.46 | 160.92 ± 36.36 | <2.2 × 10^-16 | 169.61 ± 35.69 | 183.99 ± 38.25 | 1.90 × 10^-74 |
| Weight missing, % | 0.149 | 0.158 | 0.142 | 0.995 | 0.083 | 0.351 | 0.002 |
| BMI, kg/m² | 27.66 ± 5.30 | 27.54 ± 4.18 | 27.75 ± 6.05 | 0.020 | 27.01 ± 4.86 | 29.65 ± 6.05 | 9.98 × 10^-104 |
| BMI missing, % | 0.149 | 0.158 | 0.142 | 0.995 | 0.083 | 0.351 | 0.002 |
| Triceps, mm | 25.26 ± 10.02 | 19.34 ± 7.87 | 30.04 ± 8.97 | <2.2 × 10^-16 | 24.54 ± 9.08 | 27.48 ± 12.23 | 1.73 × 10^-34 |
| Triceps missing, % | 0.157 | 0.175 | 0.142 | 0.798 | 0.093 | 0.351 | 0.004 |
| Scapular, mm | 24.48 ± 11.59 | 22.22 ± 9.19 | 26.31 ± 12.92 | 1.13 × 10^-94 | 21.85 ± 9.33 | 32.59 ± 13.89 | 1.60 × 10^-299 |
| Scapular missing, % | 0.446 | 0.561 | 0.354 | 0.107 | 0.353 | 0.733 | 0.009 |
| WC, cm | 96.94 ± 13.83 | 99.23 ± 10.93 | 95.09 ± 15.54 | 1.25 × 10^-68 | 96.19 ± 13.33 | 99.25 ± 15.02 | 5.34 × 10^-24 |
| WC missing, % | 0.141 | 0.123 | 0.156 | 0.798 | 0.104 | 0.255 | 0.092 |
| HC, cm | 104.55 ± 10.31 | 102.85 ± 8.09 | 105.93 ± 11.63 | 2.81 × 10^-68 | 103.50 ± 9.478 | 107.79 ± 11.98 | 7.52 × 10^-72 |
| HC missing, % | 0.141 | 0.140 | 0.142 | 0.999 | 0.104 | 0.255 | 0.092 |
| WHtR | 0.926 ± 0.078 | 0.963 ± 0.054 | 0.895 ± 0.081 | <2.2 × 10^-16 | 0.928 ± 0.079 | 0.920 ± 0.076 | 4.66 × 10^-8 |
| WHtR missing, % | 0.149 | 0.140 | 0.156 | 0.999 | 0.114 | 0.255 | 0.131 |
| Calf, cm | 37.44 ± 3.67 | 38.06 ± 3.17 | 36.95 ± 3.95 | 1.48 × 10^-68 | 37.39 ± 3.58 | 37.60 ± 3.93 | 0.006 |
| Calf missing, % | 0.157 | 0.210 | 0.113 | 0.248 | 0.114 | 0.287 | 0.062 |
| Wrist, mm | 53.62 ± 5.18 | 57.78 ± 3.66 | 50.27 ± 3.53 | <2.2 × 10^-16 | 53.59 ± 5.26 | 53.74 ± 4.91 | 0.137 |
| Wrist missing, % | 0.117 | 0.123 | 0.113 | 0.999 | 0.073 | 0.255 | 0.022 |

N is the number of subjects; BMI is body mass index; Triceps is average skinfold thickness of triceps brachii; Scapular is mean subscapular skinfold thickness; WC is waist; HC is hip girth; WHtR is waist-to-hip ratio; Calf is calf girth; and Wrist is wrist breadth. The distributions of normal indicators are described by mean ± standard deviation; the distributions of non-normal indicators are described by means (25% quantile, 75% quantile). For normally distributed indicators, the differences between groups are estimated using the t-test (the variances of two groups are homogeneous) or the approximate t-test (the variances of two groups are heterogeneous). For non-normal indicators, the Wilcoxon signed-rank test is used to test the differences between groups and obtain the p-values. For discrete indicators, the chi-square test is used for hypothesis testing and for deriving p-values. Bold number indicates p < 0.05. ARIC, atherosclerosis risk in communities.
Using the scaled obesity-related phenotypes, we apply the nine methods to identify associated genetic variants. To correct for multiple testing, we apply the genome-wide significance threshold of 5 × 10⁻⁸. In this real data analysis, HCDC clusters the nine phenotypes into two groups: one contains only wrist breadth, and the other contains the rest. For comparison, HCM yields three groups: one contains only wrist breadth, another contains WHR, and the third contains the remaining phenotypes. The dendrograms of the clustering process for HCM and HCDC in the ARIC data are presented in Figure 2. These graphs show clear differences between HCM and the proposed HCDC in the clustering process.
Specifically, as clustering proceeds, the correlation coefficients computed between clusters increase in HCM (h gradually increases), whereas in HCDC they may either increase or decrease relative to the previous clustering step. These differences arise from the distinct ways of calculating the correlation coefficients. HCM uses a uniform formula to evaluate the similarity between any pair of clusters. However, pairs of clusters fall into different situations: single phenotype versus single phenotype, single phenotype versus multiple phenotypes, and multiple phenotypes versus multiple phenotypes. HCDC takes these situations fully into account, so its clustering result may be more convincing in most circumstances.
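To make the notion of a single uniform cluster-to-cluster formula concrete, the following is a minimal sketch of generic agglomerative clustering on a phenotype correlation matrix. It is an illustration under stated assumptions, not the HCM or HCDC procedure itself: the average-absolute-correlation rule, the stopping threshold, and the toy matrix are all hypothetical, and the paper's own definitions of h and of between-cluster correlation differ from this generic rule.

```python
from itertools import combinations


def average_similarity(corr, a, b):
    """Mean absolute correlation over all cross-cluster phenotype pairs.

    This one formula is applied to every pair of clusters, whatever their
    sizes (an assumed, generic linkage rule).
    """
    return sum(abs(corr[i][j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(corr, stop_below):
    """Merge the most similar pair of clusters until similarity < stop_below.

    Returns the final clusters and the similarity recorded at each merge.
    """
    clusters = [(i,) for i in range(len(corr))]
    merge_heights = []
    while len(clusters) > 1:
        h, a, b = max(
            (average_similarity(corr, a, b), a, b)
            for a, b in combinations(clusters, 2)
        )
        if h < stop_below:
            break
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merge_heights.append(h)
    return clusters, merge_heights


# Hypothetical correlation matrix for four phenotypes (not ARIC data).
toy_corr = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.5, 0.2],
    [0.6, 0.5, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
clusters, heights = agglomerate(toy_corr, stop_below=0.3)
print(clusters, heights)
```

In this toy example the weakly correlated fourth phenotype stays in its own cluster, which is the same qualitative pattern as wrist breadth being separated from the other phenotypes in the ARIC analysis.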
FIGURE 2
The dendrograms of the nine phenotypes in the ARIC study via HCM (A) and HCDC (B). h represents the maximum value of the correlation coefficient at each clustering step, which is taken as the “branch length” of the clustering tree. K denotes the final number of clustering steps according to the stopping criteria. BMI is body mass index; Triceps is average skinfold thickness of triceps brachii; Scapular is mean subscapular skinfold thickness; WC is waist; HC is hip girth; WHR is waist-to-hip ratio; Calf is calf girth; and Wrist is wrist breadth.
A total of eight SNPs are identified as significant by at least one method (Table 4). Many previous studies (Frayling et al., 2007; Heard-Costa et al., 2009; Lindgren et al., 2009; Meyre et al., 2009; Thorleifsson et al., 2009; Willer et al., 2009; Heid et al., 2010; Speliotes et al., 2010; Bradfield et al., 2012; Wen et al., 2012; Berndt et al., 2013; Monda et al., 2013; Locke et al., 2015; Shungin et al., 2015) have reported that FTO contributes to the risk of obesity, based on population-based studies and on experiments elucidating specific mechanisms. Among the eight associated SNPs, rs9939609 and rs8050136 are located in the FTO gene. In addition, the UQCC region has been reported to be associated with height (Sanna et al., 2008). Few of the other SNPs have been examined for association with obesity or obesity-related phenotypes. Table 4 shows that HCDCMANOVA identified three SNPs, HCMANOVA two, and MANOVA four; HCDCMultiPhen identified three SNPs, more than HCMultiPhen (two SNPs) and MultiPhen (one SNP); HCDCTATES identified three SNPs and TATES four, while HCTATES identified none. Overall, the results of the real data analysis are highly consistent with the simulation performance.
The number of SNPs identified by existing methods with HCDC is comparable with the largest number of SNPs identified by existing methods without HCDC. To make the overall performance in the real data clearer, we draw Q–Q plots and Manhattan plots for the nine methods applied to the ARIC data (Supplementary Figures S12–18). These plots show the SNPs identified by each method together with their p-values, so that the magnitudes can be compared directly.
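As a check on the tallies quoted above, the following minimal sketch counts, for each method, the Table 4 p-values that fall below the genome-wide threshold of 5 × 10⁻⁸. The p-values are transcribed from Table 4; the names `TABLE4`, `METHODS`, and `significant_counts` are illustrative and are not part of the original analysis.

```python
THRESHOLD = 5e-8  # genome-wide significance level used in the text

METHODS = [
    "HCDCMANOVA", "HCMANOVA", "MANOVA",
    "HCDCMultiPhen", "HCMultiPhen", "MultiPhen",
    "HCDCTATES", "HCTATES", "TATES",
]

# p-values transcribed from Table 4, in METHODS order; None marks "NA" cells.
TABLE4 = {
    "rs17017947": [0.873, 0.184, 1.02e-11, None, None, None, 0.803, 0.690, 0.314],
    "rs41470552": [0.102, 0.004, 6.25e-9, None, None, None, 0.285, 0.748, 0.0358],
    "rs7927943": [1.72e-7, 1.88e-7, 5.57e-6, 1.88e-7, 1.21e-7, 3.33e-6, 9.18e-8, 0.513, 1.16e-8],
    "rs1945647": [5.83e-7, 2.49e-7, 1.19e-5, 4.26e-7, 1.12e-7, 6.27e-6, 2.31e-7, 0.554, 1.77e-8],
    "rs9939609": [1.67e-11, 9.53e-11, 1.85e-8, 2.98e-11, 1.67e-10, 3.39e-8, 1.68e-11, 0.331, 2.97e-10],
    "rs8050136": [3.83e-11, 2.10e-10, 4.29e-8, 8.07e-11, 4.33e-10, 8.66e-8, 1.11e-10, 0.277, 2.86e-9],
    "rs201561": [1.06e-8, 5.18e-8, 2.48e-6, 1.11e-8, 5.45e-8, 2.91e-6, 2.57e-7, 0.861, 7.99e-7],
    "rs1570004": [1.07e-7, 4.86e-7, 5.28e-5, 1.54e-7, 7.06e-7, 7.77e-5, 1.97e-8, 0.864, 6.12e-8],
}


def significant_counts(table, methods, threshold):
    """Count, per method, the SNPs whose p-value is below the threshold."""
    counts = {}
    for col, method in enumerate(methods):
        counts[method] = sum(
            1
            for pvals in table.values()
            if pvals[col] is not None and pvals[col] < threshold
        )
    return counts


counts = significant_counts(TABLE4, METHODS, THRESHOLD)

# SNPs that pass the threshold for at least one method.
hits_any = [
    snp
    for snp, pvals in TABLE4.items()
    if any(p is not None and p < THRESHOLD for p in pvals)
]
print(counts)
print(len(hits_any))
```

With these transcribed values, the counts agree with the numbers stated in the text for all nine methods, and all eight SNPs pass the threshold for at least one method.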
TABLE 4
Display of significant SNPs and the corresponding p-values in the analysis of ARIC.
| Chr | SNP | HCDCMANOVA | HCMANOVA | MANOVA | HCDCMultiPhen | HCMultiPhen | MultiPhen | HCDCTATES | HCTATES | TATES |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | rs17017947 | 0.873 | 0.184 | 1.02 × 10⁻¹¹ | NA | NA | NA | 0.803 | 0.690 | 0.314 |
| 10 | rs41470552 | 0.102 | 0.004 | 6.25 × 10⁻⁹ | NA | NA | NA | 0.285 | 0.748 | 0.0358 |
| 11 | rs7927943 | 1.72 × 10⁻⁷ | 1.88 × 10⁻⁷ | 5.57 × 10⁻⁶ | 1.88 × 10⁻⁷ | 1.21 × 10⁻⁷ | 3.33 × 10⁻⁶ | 9.18 × 10⁻⁸ | 0.513 | 1.16 × 10⁻⁸ |
| 11 | rs1945647 | 5.83 × 10⁻⁷ | 2.49 × 10⁻⁷ | 1.19 × 10⁻⁵ | 4.26 × 10⁻⁷ | 1.12 × 10⁻⁷ | 6.27 × 10⁻⁶ | 2.31 × 10⁻⁷ | 0.554 | 1.77 × 10⁻⁸ |
| 16 | rs9939609 | 1.67 × 10⁻¹¹ | 9.53 × 10⁻¹¹ | 1.85 × 10⁻⁸ | 2.98 × 10⁻¹¹ | 1.67 × 10⁻¹⁰ | 3.39 × 10⁻⁸ | 1.68 × 10⁻¹¹ | 0.331 | 2.97 × 10⁻¹⁰ |
| 16 | rs8050136 | 3.83 × 10⁻¹¹ | 2.10 × 10⁻¹⁰ | 4.29 × 10⁻⁸ | 8.07 × 10⁻¹¹ | 4.33 × 10⁻¹⁰ | 8.66 × 10⁻⁸ | 1.11 × 10⁻¹⁰ | 0.277 | 2.86 × 10⁻⁹ |
| 20 | rs201561 | 1.06 × 10⁻⁸ | 5.18 × 10⁻⁸ | 2.48 × 10⁻⁶ | 1.11 × 10⁻⁸ | 5.45 × 10⁻⁸ | 2.91 × 10⁻⁶ | 2.57 × 10⁻⁷ | 0.861 | 7.99 × 10⁻⁷ |
| 20 | rs1570004 | 1.07 × 10⁻⁷ | 4.86 × 10⁻⁷ | 5.28 × 10⁻⁵ | 1.54 × 10⁻⁷ | 7.06 × 10⁻⁷ | 7.77 × 10⁻⁵ | 1.97 × 10⁻⁸ | 0.864 | 6.12 × 10⁻⁸ |
The p-values of the nine methods are calculated from asymptotic distributions. p-Values < 5 × 10⁻⁸ are in bold. “NA” indicates that MultiPhen is not available because the genotype at the specified SNP does not take all three values of 0, 1, and 2 in these data. SNP, single-nucleotide polymorphism; ARIC, atherosclerosis risk in communities.
Characteristics of the Significant Variants
We retrieved the annotations of the associated SNPs from the Ensembl website (https://asia.ensembl.org) and the SCAN website (http://scandb.org); they are shown in Table 5. Table 5 shows that these significant SNPs are located in intergenic or intronic regions, and some of them have been reported to be associated with BMI, type 2 diabetes, or height. In general, some of these associated signals were reported by the first or by large-scale GWASs. The PubMed IDs can be used to retrieve the relevant literature on these SNPs. Although most of these SNPs lie in intronic or intergenic regions, this does not prevent us from exploring the expression of the genes with which they are associated. Moreover, most of these significant SNPs show possible effects on the expression of the corresponding genes in HapMap CEU and YRI cell lines (Table 5).
TABLE 5
The annotations of the significant identified SNPs.
Annotations are from the Ensembl website (https://asia.ensembl.org) and the SCAN website (http://scandb.org); intron denotes that the SNP is located between exons; intergenic denotes that the SNP is located between genes. Expression genes denote annotations added after analysis of eQTL transcriptional levels in cell lines from HapMap CEU and YRI samples using the Affymetrix human exon 1.0 ST array; GWAS references are given as PubMed identifiers. SNP, single-nucleotide polymorphism; GWAS, genome-wide association study; eQTL, expression quantitative trait locus.
To investigate whether the significant SNPs identified in ARIC are in linkage disequilibrium (LD) with nearby loci, that is, to assess the correlations between the eight significant SNPs in this study and the surrounding loci that were not detected, we produced the regional plots presented in Figure 3. Figure 3 shows that rs7927943 is physically close to rs1945647 and that their LD is strong (r² > 0.8). Both SNPs are located near the LOC101928989 gene and regulate the expression of certain genes (LIMK1, GNAI2, etc.). Since both rs7927943 and rs1945647 affect the expression of the corresponding genes, the relationship between these SNPs and obesity can subsequently be studied from the perspective of gene expression. Both rs9939609 and rs8050136 are located in the FTO gene on chromosome 16; they are physically close to each other and highly correlated (r² > 0.8; Figure 3). rs9939609 is a well-known obesity-associated variant (Frayling et al., 2007). Because of the strong LD between rs9939609 and rs8050136, it is reasonable to speculate that rs8050136 is also associated with obesity-related phenotypes. Three SNPs, namely, rs17017947, rs1570004, and rs41470552, are located in intronic regions of the genes CHL1, UQCC, and NOLC1, respectively. None of them is in relatively strong LD with the surrounding loci, so these SNPs probably affect the corresponding phenotypic characteristics independently. rs201561, near LOC100270679, is in strong LD with the surrounding loci (Figure 3). Because its p-value is the smallest among all the nearby variants, the apparent impact of the surrounding loci on the phenotypes is likely due to their high LD with rs201561.
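The r² > 0.8 criterion used above can be illustrated with a small calculation. The sketch below computes the squared Pearson correlation between two genotype dosage vectors (coded 0, 1, 2), which is a common approximation to LD r² from unphased genotypes. It is an assumption-laden illustration: the regional plots in Figure 3 rely on a reference panel rather than on this calculation, the function name `genotype_r2` is hypothetical, and the dosages are made up.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(g1)
    if n != len(g2) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 samples")
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        raise ValueError("monomorphic SNP: r^2 is undefined")
    return cov * cov / (var1 * var2)


# Hypothetical dosages for 10 individuals (illustration only, not ARIC data).
snp_a = [0, 1, 2, 1, 0, 2, 1, 0, 1, 2]
snp_b = [0, 1, 2, 1, 0, 2, 1, 0, 2, 2]  # differs from snp_a in one individual
snp_c = [1, 0, 1, 2, 0, 1, 0, 2, 1, 0]  # largely unrelated pattern

r2_ab = genotype_r2(snp_a, snp_b)
r2_ac = genotype_r2(snp_a, snp_c)
print(round(r2_ab, 3), round(r2_ac, 3))
```

Here the first pair exceeds the 0.8 threshold and would be treated as one signal, as for rs9939609 and rs8050136, while the second pair would be treated as independent.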
FIGURE 3
The regional association plots of the significant SNPs identified in ARIC. The p-values of rs17017947 and rs41470552 are evaluated by the MANOVA method. The p-values of rs7927943 and rs1945647 are estimated by the TATES method. The p-values of rs9939609, rs8050136, and rs201561 are assessed by the HCDCMANOVA method. The p-value of rs1570004 is evaluated by HCDCTATES. LD is constructed using the hg19 version of the 1000 Genome (American). The plots where rs7927943 and rs1945647 are located show the 1,000-kb range around these most significant SNPs. The plots where the remaining SNPs (rs7927943, rs1945647, rs9939609, rs8050136, rs201561, and rs1570004) are located show the 400-kb range around these significant SNPs. SNP, single-nucleotide polymorphism.
To explore whether the SNPs identified by the methods employed in this study affect gene expression in different adipose tissues, we queried the GTEx website (https://www.gtexportal.org/home/). The significant SNPs not identified by existing methods with HCDC (rs17017947, rs41470552, rs7927943, and rs1945647) show no detectable expression effects in the relevant tissues in the GTEx query, whereas the genotypes of the significant SNPs identified by existing methods with HCDC (rs9939609, rs8050136, rs201561, and rs1570004) show differential expression in adipose tissue or muscle tissue (Figure 4). In other words, the proposed HCDC is informative for biological research from the perspective of gene expression. Furthermore, it is noteworthy that the different genotypes of UQCC1-rs1570004 are differentially expressed in subcutaneous adipose, visceral adipose, or muscle tissue (p < 1.59 × 10⁻¹⁹). Moreover, the phenotypes used in the real data analysis are various measurements of obesity, so the differentially expressed tissues are highly consistent with the phenotypes adopted in this study.
Thus, UQCC1-rs1570004, an SNP that has not so far been reported to be associated with obesity-related phenotypes in other studies, is worthy of further functional experiments to confirm its value.
FIGURE 4
The relationship between the genotypes of the significant SNPs discovered by HCDC method and eQTL in subcutaneous adipose tissue, visceral adipose tissue, and muscle tissue (data are from GTEx website).
Discussion
In this article, HCDC is proposed to jointly analyze multiple phenotypes in association analyses. The approach uses a similarity measure to cluster multiple phenotypes. With HCDC, we apply existing methods to detect genetic associations with the combined phenotypes rather than with the individual phenotypes. HCDC has several advantages over other dimension-reduction approaches. First, HCDC produces a dendrogram of the multiple phenotypes (see Figure 2), which supplies more information about the structure of the phenotypes. Second, the hierarchical clustering procedure is not limited to correlation coefficients; any proper measure of distance can be used, although the specific effects are worth further consideration. Third, HCDC is computationally fast and easy to implement. Notably, HCDC does not require the individual-level phenotypes; it requires only the similarity matrix of the phenotypes. This similarity matrix can be estimated from the test statistics of summary data at independent SNPs in a GWAS (Zhu et al., 2015). This is a major advantage of clustering with correlation coefficients between phenotypes in HCDC.
We performed extensive simulations together with the real data analysis to assess the performance of MANOVA, MultiPhen, and TATES combined with HCDC and compared them with their original versions. The simulation results show that these three methods with HCDC not only have correct type I error rates but also have an advantage over the versions without HCDC under a series of simulation scenarios. For more realistic simulation settings, GCTA software (Yang et al., 2011) is the first choice, so further evaluations should be carried out in the future.
More importantly, the real data analysis shows that HCDC has great potential in the multiple-phenotype GWAS analysis of obesity in ARIC, and the bioinformatics analysis of these results also supports them. In addition, we use another clustering method, HCM, as the major competitor and compare its performance with that of HCDC. We suggest that the most important improvement for HCM is that, when calculating the correlation coefficient between two clusters, it should take the imbalanced numbers of phenotypes in the two clusters into account; a unified formula for the correlation coefficient may not be appropriate. In the real data analysis, the better performance of HCDC compared with HCM supports this view. At present, HCDC is more suitable for continuous phenotypes. After transformation of the phenotypes, it can also be applied to dichotomous or mixed traits. However, its performance with dichotomous or mixed traits still needs further investigation.
We then used HCDC to analyze the ARIC data and discovered that UQCC1-rs1570004 is significantly associated with multiple obesity-related phenotypes. Bioinformatics exploration shows that the different genotypes of UQCC1-rs1570004 are differentially expressed in subcutaneous fat, visceral fat, and muscle tissue (p < 1.59 × 10⁻¹⁹). The differentially expressed tissues are consistent with the phenotypes studied in this work. Therefore, UQCC1-rs1570004, an SNP that has not been reported to be associated with obesity-related phenotypes in the literature, is worthy of further functional experiments to confirm its potential value. From the perspective of application to real data, HCDC is of value for further association studies.
In summary, HCDC is an effective approach for association studies between multiple phenotypes and genetic variants in varied research fields. In medical research, many disciplines intersect strongly, yet different disciplines generally carry out association studies between phenotypes and genetic variants separately. Interdisciplinary research on multiple phenotypes, such as phenotypes across multiple tissues that include behavioral, morphological, and physiological indicators, is likely to be extended to phenome research (Houle et al., 2010), which would be very meaningful. Because HCDC makes no assumption about the genetic effect model, clustering multiple phenotypes into categories according to the similarity measure between phenotypes is very useful for phenome research. Moreover, with a large number of phenotypes, HCDC does not need to know the specific data-generating model; it needs only the correlation matrix between phenotypes, which is another attractive feature. In reality, the genetic structure among different phenotypes is commonly complex and usually unknown. As shown in the simulations, HCDC provides an effective and novel research strategy for exploring high-dimensional phenotypic data in the coming era of the phenome.
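The point made in the Discussion that only a phenotype similarity matrix is needed, and that it can be estimated from summary statistics, can be sketched as follows. This is a minimal illustration of the general idea attributed above to Zhu et al. (2015): the correlation between two traits is approximated by the Pearson correlation of their GWAS Z-scores across independent SNPs. The function names, the trait names, and the Z-scores are hypothetical, and a real analysis would use a large set of independent SNPs rather than six.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def similarity_from_zscores(z_by_trait):
    """Trait-by-trait similarity from Z-scores at the same independent SNPs."""
    traits = list(z_by_trait)
    return {
        (s, t): 1.0 if s == t else pearson(z_by_trait[s], z_by_trait[t])
        for s in traits
        for t in traits
    }


# Hypothetical Z-scores at six independent SNPs (illustration only).
z_scores = {
    "trait_A": [0.5, -1.2, 0.3, 1.1, -0.4, -0.3],
    "trait_B": [0.6, -1.0, 0.1, 1.3, -0.5, -0.5],
    "trait_C": [-0.9, 0.2, 1.0, -0.1, 0.4, -0.6],
}
sim = similarity_from_zscores(z_scores)
print(round(sim[("trait_A", "trait_B")], 3), round(sim[("trait_A", "trait_C")], 3))
```

A matrix built this way could then be passed to a clustering step in place of a correlation matrix computed from individual-level phenotypes.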
Naveed Sattar, Alex McConnachie, A Gerald Shaper, et al. Lancet. 2008-05-22.
Iris M Heid, Anne U Jackson, Joshua C Randall, et al. Nat Genet. 2010-10-10.
Dmitry Shungin, Thomas W Winkler, Damien C Croteau-Chonka, et al. Nature. 2015-02-12.
Keri L Monda, Gary K Chen, Kira C Taylor, et al. Nat Genet. 2013-04-14.
Authors: Alanna C Morrison; Arend Voorman; Andrew D Johnson; Xiaoming Liu; Jin Yu; Alexander Li; Donna Muzny; Fuli Yu; Kenneth Rice; Chengsong Zhu; Joshua Bis; Gerardo Heiss; Christopher J O'Donnell; Bruce M Psaty; L Adrienne Cupples; Richard Gibbs; Eric Boerwinkle Journal: Nat Genet Date: 2013-06-16 Impact factor: 38.330
Authors: Paul F O'Reilly; Clive J Hoggart; Yotsawat Pomyen; Federico C F Calboli; Paul Elliott; Marjo-Riitta Jarvelin; Lachlan J M Coin Journal: PLoS One Date: 2012-05-02 Impact factor: 3.240