Haiquan Li, Younghee Lee, James L Chen, Ellen Rebman, Jianrong Li, Yves A Lussier.
Abstract
OBJECTIVE: Thousands of complex-disease single-nucleotide polymorphisms (SNPs) have been discovered in genome-wide association studies (GWAS). However, these intragenic SNPs have not been collectively mined to unveil the genetic architecture between complex clinical traits. The authors hypothesize that biological annotations of host genes of trait-associated SNPs may reveal the biomolecular modularity across complex-disease traits and offer insights for drug repositioning.
Year: 2012 PMID: 22278381 PMCID: PMC3277620 DOI: 10.1136/amiajnl-2011-000482
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Figure 1. Workflow of our methodology. This study demonstrates the non-trivial modularity of biological mechanisms shared by some complex diseases. Specifically, complex diseases are connected using biological similarity computations between single-nucleotide polymorphisms (SNPs) associated by genome-wide association studies (GWAS) with complex traits. First, gene–gene similarity in Gene Ontology (GO) was calculated at a genome scale using a previously validated information theory similarity (ITS) measure. For pairs of genes in the STRING database, the shortest protein interaction distance is shown to correlate with the biological similarity obtained by Gene_ITS using GO annotations (figure 2). Second, host genes were mapped to complex traits using the NHGRI GWAS Catalog and the dbSNP database, and the trait–trait ITS of their associated traits was derived from the results calculated in step 1. A disease similarity network was constructed by retaining the significant trait–trait similarities (figure 3). Metatrait modules were then extended with the shortest protein interaction paths between host genes with significant information similarities, justified by the validated inverse correlation between similarity and shortest distance (figure 4).
Definitions and abbreviations
| Concept | Definition |
| Genome-wide association studies (GWAS) | Investigation of genes in the whole genome for a large number of individuals that tests the genetic variations differentially found between two contrasted groups (case vs control) with respect to a specific trait, such as a disease |
| Gene Ontology (GO) | Controlled vocabulary of annotations to gene and gene product attributes |
| Protein–protein interaction network (PPIN) | Graphic representation of protein–protein interactions on a large scale constructed in order to appreciate the network structure |
| Single-nucleotide polymorphism (SNP) | Single-nucleotide variation in the genomic sequence found to be different between individuals or between two chromosomes of the same individual |
| Intragenic SNP | A SNP located within a gene region |
| Intergenic SNP | A SNP located outside any gene region |
| Host gene of an intragenic SNP | The gene that physically contains the intragenic SNP in its genomic sequence |
| Trait | A characteristic phenotype or disease state of an individual, such as hair color or type II diabetes |
| Metatrait | A class of disorders clinically related in time or sharing common molecular mechanisms (eg, ‘metabolic syndrome’ is a metatrait for the traits ‘essential hypertension,’ ‘adult-onset diabetes mellitus’ and others) |
| Intertrait | Relationship found between two traits |
| Intra-metatrait | Connections between traits that belong to the same metatrait |
| Network modules | A subnetwork possessing some biological or medical implications whose nodes are densely connected inside the subnetwork but are sparsely connected with nodes outside of the subnetwork |
| Gene Ontology term | Standardized description of a biological concept, such as the molecular function, the biological processes or the subcellular localizations of a gene |
| Minimal ancestor of two GO terms | The most specific GO term that could summarize or contain the characteristics shared between a pair of GO terms |
| NHGRI GWAS Catalog | The National Human Genome Research Institute Catalog of Published Genome-Wide Association Studies ( |
| STRING | Search tool for the retrieval of interacting genes/proteins: the most comprehensive database of protein–protein interactions and associations |
| Fusion | A reliable protein–protein interaction prediction method based on the hypothesis that two proteins are more likely to interact with each other if they have been incorporated into a third protein as two domains during evolution |
| Shortest distance of two proteins | The minimum number of distinct edges found among all possible routes connecting two proteins in a protein–protein interaction network |
| Shortest path(s) of two proteins | All routes possessing the minimum number of distinct edges found among all possible routes connecting two proteins in a protein–protein interaction network |
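The last two definitions above (shortest distance and shortest paths) can be illustrated with a breadth-first search over an unweighted, undirected interaction graph. This is a minimal sketch, not the authors' implementation: the function name `shortest_distance_and_paths`, the adjacency-dictionary representation, and the toy proteins `A` to `E` are all assumptions made for illustration.

```python
from collections import deque


def shortest_distance_and_paths(graph, source, target):
    """Return (distance, paths) between two nodes of an unweighted graph.

    `graph` maps each node to an iterable of neighbours (hypothetical format).
    `distance` is the minimum number of edges; `paths` lists every route that
    achieves it. Returns (None, []) when the nodes are not connected.
    """
    if source == target:
        return 0, [[source]]
    dist = {source: 0}
    parents = {source: []}  # predecessors lying on some shortest path
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break  # every predecessor of the target is already recorded
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                parents[neighbour] = [node]
                queue.append(neighbour)
            elif dist[neighbour] == dist[node] + 1:
                parents[neighbour].append(node)
    if target not in dist:
        return None, []

    def build(node):
        # Walk back from the target to the source along recorded predecessors.
        if node == source:
            return [[source]]
        return [path + [node] for parent in parents[node] for path in build(parent)]

    return dist[target], build(target)


# Toy interaction network (hypothetical proteins, undirected edges listed both ways).
toy_ppin = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "Z": [],
}

distance, paths = shortest_distance_and_paths(toy_ppin, "A", "D")
```

In the toy network, `A` and `D` are two edges apart and are joined by two distinct shortest paths, one through `B` and one through `C`; the isolated node `Z` has no path to any other protein.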
Figure 2. Higher Gene Ontology (GO) similarity between proteins is associated with smaller shortest distance in protein interaction networks. Relationships are seen between average Gene_ITS values and the shortest distance between pairs of proteins in the protein–protein interaction network. An average information theory similarity value was calculated for groups of protein–protein pairs in STRING v8.0 with the same shortest distances (length value of shortest paths). As hypothesized, higher biological similarity in GO was associated with shorter distances in the protein interaction network (p<10^-16, Spearman correlation, using the entire set of protein combinations with GO annotations and protein interactions, which were 5753 and 5955 for GO biological processes (GO:BP) and GO molecular functions (GO:MF), respectively). These results are reproducible in the subset of genes hosting the single-nucleotide polymorphisms associated with disease traits in the National Human Genome Research Institute Catalog of Published Genome-Wide Association Studies (data not shown).
Figure 3. Network of biological similarity between complex-disease traits calculated from genome-wide association study single-nucleotide polymorphisms (SNPs). The disease similarity network was calculated using Gene Ontology biological process similarity of the host genes of trait-associated intragenic SNPs, retaining similarities ≥0.2 with an empirical p value <0.05. As shown, among 280 intertrait similarity connections (blue lines), 186 (66.4%) cannot be explained simply by shared host genes between the traits (gray lines) and are reported here for the first time. This figure therefore illustrates that our information theoretic similarity (ITS) method has found non-trivial relationships that would not have been found by conventional methods. As hypothesized, metabolic syndrome traits (green) and cancer traits (purple) are significantly enriched in connections with other traits in the same metatrait (p=3.5×10^-7 and 0.001, respectively, Fisher's exact test). A significant module of cancer traits (circled in orange) is shown in greater detail in figure 4. Circles represent diseases or traits; circle sizes are proportional to the number of associated intragenic SNP host genes. Green circles represent metabolic syndrome-related traits curated a priori (dark green for metabolic syndrome traits and light green for their risk factors), purple circles represent cancer traits, and gray circles represent other traits. Blue lines represent biological process similarities that are ≥0.2 and have a p value <0.05. Gray lines represent shared SNP host genes between diseases if their Trait_ITS is ≥0.2 (in other words, connections found both by our information theoretic method and by the conventional gene-overlap method). Line thicknesses are proportional to Trait_ITS similarity values or number of shared genes. Solid lines have been validated as clinically meaningful by clinicians, while dotted lines have not.
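The edge-selection rule in this caption (similarity ≥0.2 and empirical p value <0.05, then checking which connections are not explained by shared host genes) can be sketched as follows. The function names, the dictionary layout, and the trait and gene labels are hypothetical; only the two cutoffs and the 186-of-280 count come from the caption.

```python
def build_similarity_network(trait_its, p_values, shared_genes,
                             min_its=0.2, alpha=0.05):
    """Keep trait pairs whose similarity and empirical p value pass both cutoffs."""
    edges = {}
    for pair, its in trait_its.items():
        if its >= min_its and p_values[pair] < alpha:
            edges[pair] = {
                "its": its,
                # True when the two traits have at least one host gene in common.
                "shares_host_gene": len(shared_genes.get(pair, ())) > 0,
            }
    return edges


def percent_without_shared_genes(edges):
    """Percentage of retained connections not explained by a shared host gene."""
    if not edges:
        return 0.0
    novel = sum(1 for edge in edges.values() if not edge["shares_host_gene"])
    return 100.0 * novel / len(edges)


# Toy inputs (hypothetical traits, scores and genes; not data from the study).
trait_its = {
    ("trait_A", "trait_B"): 0.35,
    ("trait_A", "trait_C"): 0.25,
    ("trait_B", "trait_C"): 0.10,  # fails the similarity cutoff
    ("trait_C", "trait_D"): 0.40,  # fails the p value cutoff below
}
p_values = {
    ("trait_A", "trait_B"): 0.01,
    ("trait_A", "trait_C"): 0.03,
    ("trait_B", "trait_C"): 0.01,
    ("trait_C", "trait_D"): 0.20,
}
shared_genes = {("trait_A", "trait_B"): {"GENE1"}}

network = build_similarity_network(trait_its, p_values, shared_genes)
novel_percent = percent_without_shared_genes(network)

# The proportion reported in the caption: 186 of 280 connections.
reported_percent = round(100 * 186 / 280, 1)
```

In the toy data, two connections survive both cutoffs and one of them involves no shared host gene. The final line reproduces the caption's percentage from its own counts.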
Figure 4. Biological similarity between cancer traits annotated by shortest protein interaction path between host genes is enriched in oncogenes. The subset of figure 3 circled in orange corresponds to a biomodule of cancer traits. (A) provides the detailed view of their genome-wide association study-associated single-nucleotide polymorphisms (SNPs), their corresponding host genes, and their dense biological similarities (Gene_ITS ≥0.2; Trait_ITS ≥0.2; green lines; based on Gene Ontology biological processes). (B) provides an additional annotation of shortest protein interaction paths (red dotted lines) between host genes. Oncogenes (gold color) are statistically enriched in the shortest protein-interaction paths among pairs of SNP host genes associated with distinct traits that were paired by similarity measures (p=0.0001, Fisher's exact test). In addition, five out of six host genes associated with these cancer traits are either oncogenes or directly interact with oncogenes in the cancer-related modules in the disease network. Taken together, our metric produces multi-scale connections, as it contains protein interactions as well as biological similarities that both underlie the disease network connections and utilize different knowledge bases, thus validating one another.
Fisher's exact test contingency table for enrichment of a specific type of intra-metatrait connections in disease networks
| | Intra-metatrait connections | Non-intra-metatrait connections | Subtotal |
| Observed in the network | N_MM | N_TT − N_MM | N_TT |
| Not observed in the network | C_MM − N_MM | C_TT − C_MM − N_TT + N_MM | C_TT − N_TT |
| Subtotal | C_MM | C_TT − C_MM | C_TT |
M, a metatrait containing a set of related traits; T, the set of all traits in the disease network; N_MM, the number of observed connections between pairs of traits within the metatrait M; N_TT, the number of connections in the disease network; C_MM, the number of all possible pairwise combinations of traits in M; C_TT, the total number of possible pairwise combinations of all traits in the network.
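The four cells of this table follow directly from the four counts defined in the footnote, and an enrichment p value can then be obtained from the hypergeometric distribution. The sketch below is written under stated assumptions: the source does not say whether the test was one-sided or two-sided, so a one-sided (enrichment) test is assumed here, and the function names and example counts are hypothetical.

```python
from math import comb


def intra_metatrait_table(n_mm, n_tt, c_mm, c_tt):
    """Return the 2x2 cells (a, b, c, d) laid out as in the table above."""
    a = n_mm                        # observed, intra-metatrait
    b = n_tt - n_mm                 # observed, not intra-metatrait
    c = c_mm - n_mm                 # not observed, intra-metatrait
    d = c_tt - c_mm - n_tt + n_mm   # not observed, not intra-metatrait
    return a, b, c, d


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test: P(X >= a) with all margins fixed.

    Assumption: one-sided enrichment; the source does not state the sidedness.
    """
    row1 = a + b           # connections observed in the network
    col1 = a + c           # possible intra-metatrait pairs
    total = a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, row1)


# Hypothetical example: 20 traits in the network, 5 of them in metatrait M.
c_tt = comb(20, 2)   # all possible trait pairs in the network
c_mm = comb(5, 2)    # all possible trait pairs inside M
cells = intra_metatrait_table(n_mm=6, n_tt=30, c_mm=c_mm, c_tt=c_tt)
p_value = fisher_enrichment_p(*cells)
```

The cells always sum to C_TT, which is a quick sanity check on the table layout. For real analyses a library routine such as `scipy.stats.fisher_exact` would be the usual choice; the standard-library version here only keeps the sketch self-contained.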
Direct protein–protein interactions are enriched in gene–gene pairs with high biological similarity
| | Biological similarity between pairs of genes (genome-wide) | | | |
| | GO biological process | | GO molecular function | |
| | ITS ≥0.7 | ITS <0.7 | ITS ≥0.7 | ITS <0.7 |
| Direct protein interaction between pairs of genes | 7314 | 41 710 | 5413 | 47 112 |
| No direct protein interactions | 53 174 | 16 442 890 | 138 125 | 17 537 385 |
| OR | 53.7 | | 14.6 | |
| Fisher's exact test | p<10^-16 | | p<10^-16 | |
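As a rough check on the odds ratios, the simple cross-product estimate can be computed from the four counts in each half of the table. This is only a sketch: the source does not state which odds ratio estimator was used, so the cross-product formula and the function name `odds_ratio` are assumptions.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for a 2x2 table.

    a: direct interaction, ITS >= 0.7    b: direct interaction, ITS < 0.7
    c: no direct interaction, ITS >= 0.7 d: no direct interaction, ITS < 0.7
    """
    return (a * d) / (b * c)


# Counts copied from the table above.
or_biological_process = odds_ratio(7314, 41710, 53174, 16442890)
or_molecular_function = odds_ratio(5413, 47112, 138125, 17537385)
```

The cross-product estimates come out at about 54.2 for GO biological process and about 14.6 for GO molecular function. The small gap from the tabulated 53.7 may reflect a different estimator (for example, the conditional maximum-likelihood estimate that some Fisher's exact test implementations report), though the source does not say.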