Modeling and characterization of disease associated subnetworks in the human interactome using machine learning.

Lee T Sam, George Michailidis.

Abstract

The availability of large-scale, genome-wide data about the molecular interactome of entire organisms has made possible new types of integrative studies, making use of rapidly accumulating knowledge of gene-disease associations. Previous studies have established the presence of functional biomodules in the molecular interaction network of living organisms, a number of which have been associated with the pathogenesis and progression of human disease. While a number of studies have examined the networks and biomodules associated with disease, the properties that contribute to the particular susceptibility of these subnetworks to disruptions leading to disease phenotypes have not been extensively studied. We take a machine learning approach to the characterization of these disease subnetworks associated with complex and single-gene diseases, taking into account both the biological roles of their constituent genes and topological properties of the networks they form.

Year:  2009        PMID: 21347156      PMCID: PMC3041579     

Source DB:  PubMed          Journal:  Summit Transl Bioinform        ISSN: 2153-6430


Introduction

Recent advances in gene-disease association and large-scale protein interaction studies have made an unprecedented amount of data available for researchers to study the systems biology of human disease. In particular, this interest has taken the form of analyses that combine protein interaction data with gene-disease annotations to elucidate the molecular mechanisms underlying human diseases and disorders. These studies suggest the presence of disease-related subnetworks within the larger human protein interaction network. This is consistent with the belief that diseases significantly dysregulate functional biomodules within the interactome. As a result, analysis of these subnetworks may provide insights into the functional modules within the interactome that are responsible for the pathogenesis and progression of human disease. In this paper, we present a model-driven technique for constructing disease-associated subnetworks based on gene-disease associations and protein interactions, and we characterize them using both the topological and biological properties of the constituent genes and the subnetworks they form. Three sets of subnetworks are generated from this process: a group of subnetworks involved in well-defined biological processes, and two groups of subnetworks associated with complex and single-gene diseases. We apply unsupervised methods to demonstrate that these three subnetwork sets are poorly separable. We then train a random forest classifier to distinguish subnetworks specifically associated with disease from those built from a priori knowledge in the Gene Ontology. Through this classification, we aim to better understand the structural and biological characteristics of the biological processes associated with single-gene diseases and how they differ from those associated with complex disease. Supplementary Methods and Materials for this study are available at http://www.stat.lsa.umich.edu/~gmichail/subnetworks_study/

Background

The advent of high-throughput techniques for determining molecular interactions, and the quickly growing pool of data they produce, has opened the door to genome-scale evaluation of the molecular interactome of many species. A number of databases, such as DIP, BIND, HPRD, and several others, have been developed to integrate protein interaction data from high-throughput experiments. Studies looking at these data across a number of organisms have indicated that these networks are organized into functional biomodules that function at multiple scales (1–3). Analyses of disease gene knowledge coupled with data from large-scale protein interaction networks to form a phenome-interactome network have revealed that a significant portion of disease-associated genes form small subnetworks. The networks formed by the interactions of known disease genes have been used to relate phenotypically similar inherited diseases (4). Similarly, subnetworks that represent protein complexes have been used to relate diseases with similar phenotypes and to provide novel disease gene candidates when melded with association data (5). The disease-associated genes themselves also seem to possess a number of characteristic properties within the interactome. Compared to the mean degree of all proteins, many disease-related proteins display relatively elevated degree and tend to interact with other disease-related proteins (6, 7). This property has been used to propose likely candidate genes for disease association (8). Taken together, these findings suggest that the intermediate nodes in the interactome play a contributory role. In addition to the importance of highly interconnected “hub” proteins (9, 10), certain topological features were found to be associated with essentiality/lethality (11). Additional research has suggested that genes expressing proteins of similar importance also share topological characteristics in the interaction network (12). 
These topological characteristics have been used to explain variable disease outcome (13), making an argument for their role in the progression of disease. In this study, node count, radius, and diameter are used to measure the size and spread of the networks. In graph-theoretic terms where eccentricity is defined as the greatest distance between a vertex and any other, the diameter and radius are defined as the maximum eccentricity and the minimum eccentricity in a network, respectively. The two degree measurements, clustering coefficient, and observed edge fraction, characterize the density and interconnectivity of the graphs, where degree is defined as the number of connections a vertex has to other vertices. Clustering coefficient analyzes the links in a graph to quantify how close it is to being completely connected with all vertices connected to all other vertices. The observed edge fraction is similar in counting the fraction of edges observed in the subnetwork compared to all possible edges. Cyclicity, defined as the existence of looping paths in the graph, and biconnectivity, defined as the presence of vertices which connect segments of the subgraph, are used to characterize the structure of the graphs. A number of biological properties characterize the biomodules associated with biological processes and diseases. Genes involved with the same biological process or functional subunit often co-localize on the genome (14) and are often under the control of identical regulatory factors. In consideration of these positional factors, we take into account mean gene start location, mean gene end location, mean length, and mean genomic strand. Mean G-C content fraction is calculated as it affects thermostability of the genetic material and its transcriptional propensity. Similarly, sets of genes with interacting protein products contain motifs for known interacting domains. 
With this in mind, mean PFAM domain annotation count, mean ProSite annotation count, mean number of signal domains, and mean number of transmembrane domains are considered. For classification, we applied the random forest ensemble learning method described by Breiman (15). The random forest is composed of a defined number of unpruned decision trees, each trained on a subset of the training data selected with replacement. At each node, a tree chooses a random subset of variables with which to split the data; the size of this subset is defined as a parameter. These properties make the classifier robust to overfitting.
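
The size, spread, and density measures defined in this section can be illustrated on a toy graph. The following is a minimal sketch, not the scripts used in the study: it assumes a connected, undirected graph stored as a dictionary of neighbour sets, and it takes the clustering coefficient to be the mean of the local clustering coefficients, which is one common definition and an assumption here. The function names and the toy graph are hypothetical. Biconnectivity is omitted for brevity.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Return hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def topology_summary(adj):
    """Summarize a connected, undirected graph given as {vertex: set(neighbours)}."""
    n = len(adj)
    # Eccentricity: greatest distance from a vertex to any other vertex.
    ecc = {v: max(bfs_distances(adj, v).values()) for v in adj}
    degrees = [len(neigh) for neigh in adj.values()]
    edge_count = sum(degrees) // 2
    # Local clustering: share of a vertex's neighbour pairs that are themselves linked.
    local = []
    for v, neigh in adj.items():
        k = len(neigh)
        if k < 2:
            local.append(0.0)
        else:
            linked = sum(1 for a, b in combinations(neigh, 2) if b in adj[a])
            local.append(linked / (k * (k - 1) / 2))
    return {
        "node_count": n,
        "radius": min(ecc.values()),      # minimum eccentricity
        "diameter": max(ecc.values()),    # maximum eccentricity
        "average_degree": sum(degrees) / n,
        "max_degree": max(degrees),
        "observed_edge_fraction": edge_count / (n * (n - 1) / 2),
        "clustering_coefficient": sum(local) / n,
        # A connected graph contains a cycle exactly when it has at least n edges.
        "cyclic": edge_count >= n,
    }


# Hypothetical toy graph: a triangle A-B-C with a pendant vertex D attached to C.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
summary = topology_summary(toy)
print(summary)
```

In this toy graph the vertex C has eccentricity 1 and every other vertex has eccentricity 2, so the radius is 1 and the diameter is 2; four of the six possible edges are present.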

Methods

Data Extraction

Protein interaction data were retrieved from the Michigan Molecular Interaction Index (MiMi) (16), which integrates interaction and annotation data from BIND, the Gene Ontology, HPRD, DIP, the BioGRID, IntAct, InterPro, IPI, the Max-Delbrueck Center for Molecular Medicine protein interaction database, Pfam, ProtoNet, SwissProt, and RefSeq. This process yielded 12,318 unique protein-protein interactions involving 6,199 unique Entrez Gene identifiers. Gene-disease relationships were derived from two sources: the Online Mendelian Inheritance in Man (OMIM) (17) and the PhenoGO database (18). Gene-disease associations in PhenoGO not using Entrez Gene identifiers were translated using mappings from HUGO (19). Diseases in these two resources were defined in terms of coded Medical Subject Heading (MeSH) (20) and Unified Medical Language System (UMLS) (21) identifiers. The unfiltered, translated data set resulted in 3,469 Entrez identifiers associated with 2,325 phenotype codes. OMIM mappings found in the mim2gene file supplied by NCBI already employ Entrez Gene identifiers, so no translation was necessary for the OMIM data. Entries in the OMIM database were filtered to include only gene-disease references, resulting in 1,846 distinct Entrez-identified genes annotated to OMIM-defined diseases. Of the identifiers found in the OMIM mappings, 708 are also present in the MiMi interaction data set. Gene Ontology (22) data and biological annotations were extracted from BioMart (23) using data from Ensembl version 47 built from the NCBI36 release of the human genome. MeSH and UMLS term descriptors were retrieved directly from the NLM.

Subnetwork Generation

The subnetworks associated with human diseases and biological processes were built by determining the shortest paths between all pairs of distinct associated genes found in the protein interaction network. Shortest paths in the interaction network are determined using Dijkstra’s shortest paths algorithm (24). For example, consider a hypothetical disease of interest with UMLS concept ‘UMLS:000000’, associated with genes A, B, C, D, and E. The shortest paths between pairs {A,B}, {A,C}, {A,D}, {A,E}, {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, and {D,E} would be analyzed, noting the identities of the original nodes, the original nodes also found in the protein interaction network (as many nodes are not represented within the network), the intermediate connecting nodes, and the respective counts of each class. This process discovers intermediate nodes X, Y, and Z while deriving the subnetwork and associates these nodes with it. The generated results were split into three distinct classes. A “background” set was generated from a priori knowledge from the Gene Ontology, consisting of the subnetworks formed by the classes represented in the “Biological Process” and “Molecular Function” trees of the Gene Ontology. This process resulted in the generation of 6,606 GO-associated subnetworks. A “single gene disease” (SGD) subnetwork set was generated from the contents of OMIM, producing 2,079 subnetworks. A “complex disease” (CD) set was built from the PhenoGO annotations, composed of 2,317 subnetworks in total.
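
The construction described above can be sketched on a toy network. This is an illustrative sketch only, not the code used in the study: it assumes unweighted interactions (every edge has weight 1), keeps a single shortest path per gene pair when several exist, and uses hypothetical function names and a made-up edge list in which gene E is absent from the interaction network.

```python
import heapq
from itertools import combinations


def make_graph(edge_list):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path(adj, source, target):
    """Dijkstra's algorithm with unit edge weights; returns one path or None."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj.get(u, ()):
            nd = d + 1
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def build_subnetwork(adj, associated_genes):
    """Join every pair of associated genes present in the network by a shortest path."""
    seeds = sorted(g for g in associated_genes if g in adj)
    nodes, edges = set(seeds), set()
    for a, b in combinations(seeds, 2):
        path = shortest_path(adj, a, b)
        if path is None:
            continue  # the pair lies in different connected components
        nodes.update(path)
        edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return {
        "seeds": set(seeds),
        "intermediates": nodes - set(seeds),
        "edges": edges,
    }


# Hypothetical interaction network; W and V form a longer detour from A to B.
network = make_graph([
    ("A", "X"), ("X", "B"), ("B", "Y"), ("Y", "C"), ("C", "Z"), ("Z", "D"),
    ("A", "W"), ("W", "V"), ("V", "B"),
])
result = build_subnetwork(network, ["A", "B", "C", "D", "E"])
print(sorted(result["intermediates"]))
```

Here the detour through W and V is longer than the route through X, so only X, Y, and Z are recorded as intermediate nodes, and E is dropped because it has no entry in the network.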

Data Characterization and Filtering

The resulting subnetworks in each of the three data sets were topologically characterized using a set of Perl scripts employing the Boost Graph Library interface. Subnetworks are topologically characterized based on node count, clustering coefficient, observed edge fraction, average degree, maximum degree, radius, diameter, cyclicity, and biconnectivity. Biological characteristics noted for each subgraph include mean gene start location, mean gene end location, mean length, strand, mean PFAM domain annotation count, mean ProSite annotation count, mean number of signal domains, mean number of transmembrane domains, and mean G-C content fraction. The networks are filtered for size, imposing a minimum of three nodes found in the interaction network. From the SGD and CD sets, 79 and 278 subnetworks passed this filter, respectively, as did 2,590 of the subnetworks generated from the Gene Ontology. This final filtered set was used to train and test the classifier.

Machine Learning and Classification

The Waikato Environment for Knowledge Analysis (Weka), version 3.4.12 (25), was used to train and test a random forest classifier with a stratified 10-fold cross-validation methodology. The cross-validation approach was chosen due to the relative paucity of data from the disease subsets. Each random forest was composed of 100 trees, each considering four randomly selected parameters from the data. In all, nine classifications were done in an attempt to discriminate among the three sets of subnetworks using varying parameter sets and amalgamations of the two disease sets. Because the Weka random forest classifier did not provide variable importance measures, the analysis was repeated using the randomForest package in R 2.7.1, which provided nearly identical results. Principal components analysis of the data was done using PAST (26).
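
The idea behind stratification can be sketched as follows. This is a simplified illustration and not Weka's own procedure: each class is shuffled and dealt round-robin into folds so that every fold keeps roughly the overall class proportions, which matters when one class has only 79 members. The function name and the random seed are hypothetical; the class sizes are the filtered counts reported in the Methods.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign instance indices to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled members of this class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Class sizes after filtering: 79 SGD, 278 CD, and 2590 GO subnetworks.
labels = ["SGD"] * 79 + ["CD"] * 278 + ["GO"] * 2590
folds = stratified_folds(labels, k=10)

for number, fold in enumerate(folds):
    counts = {c: sum(1 for i in fold if labels[i] == c) for c in ("SGD", "CD", "GO")}
    print(number, counts)
```

Each fold then serves once as the test set while the remaining nine are used for training, so every fold contains 7 or 8 SGD subnetworks rather than possibly none.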

Results

Subnetwork Characteristics

As expected, the subnetworks derived from OMIM, the SGD set, demonstrated a smaller range in size in terms of total gene count from 3 to 32 genes with a median of five genes, while the PhenoGO derived complex disease set was composed of networks of size ranging from 3 to 127 genes, with a median of eight genes. The Gene Ontology derived background set had the largest range from 3 to 968 genes. As shown in Sup. , most subnetworks tended to remain small, generally involving between three and nine genes. The GO background set exhibits a long-tailed distribution with most networks remaining under seventeen genes in size.

Classification Accuracy

Unsupervised Principal Components Analysis (PCA) and k-means clustering methods were first attempted in order to assess the separability of the three classes of subnetworks. As shown in and Sup , clustering mirrored the results of the PCA with high misclassification levels (misclassifying ~55% of the data), further demonstrating the poor separability of the data. As a result, supervised machine learning techniques must be applied to capture the subtle differences between the CD, SGD, and GO sets. As shown in Sup. , the overall misclassification error rate remains relatively low across several subsets of the subnetwork parameter data, never exceeding 5%. Other measures (precision, recall, and F-measure) exhibit very satisfactory performance. However, a close inspection of the results for the three-class problems (SGD, GO, CD) reveals that the results for the SGD class are not satisfactory. Confusion matrices from these analyses show that the classifier tends to assign those subnetworks to the GO class, an issue addressed in the discussion section. Breaking down the features into biological and topological characteristics further revealed the similarities between the SGD and GO sets, which are analyzed in more detail in the Supplementary Methods and Materials. The separability of the SGD and CD sets as shown in Sup. demonstrates the differences in subnetwork characteristics between those primarily involved with single-gene disorders and those associated with multigenic, complex disorders. A reclassification of all the study data was also done using a GO dataset that included only the “Biological Process” entries, with similar results. The complete results of the classifications as well as additional methods and analyses are available in the Supplementary Methods and Materials. 
The most important variables in the classification of subnetworks to their individual classes are illustrated in as derived using the reduction in Gini index, a measure of the reduction in misclassification when a particular variable is used.
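
To make the Gini-based ranking concrete, the decrease in Gini impurity produced by a single split can be computed as below. This is a minimal sketch under stated assumptions: the feature values, labels, and thresholds are invented for illustration, and the function names are hypothetical. As we understand it, the importance reported by random forest implementations accumulates such decreases over every split that uses a variable and averages them over the trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Invented values of one feature for six subnetworks and their classes.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
classes = ["GO", "GO", "GO", "CD", "CD", "CD"]

good_split = gini_decrease(values, classes, threshold=0.5)   # separates the classes
poor_split = gini_decrease(values, classes, threshold=0.15)  # isolates one instance
print(good_split, poor_split)
```

The threshold that separates the two classes removes all of the parent node's impurity, whereas the threshold that isolates a single instance removes only a small part of it, so a variable offering the first kind of split would rank higher.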

Discussion and Conclusion

The relative paucity of data describing disease-associated subnetworks continues to present a serious challenge in the analysis of the functional biomodules underlying human disease. While the classification of complex disease-associated subnetworks appears to achieve reasonable results, the underlying heterogeneity of human disease, as evidenced by the SGD set, will continue to present a problem in classification. It is notable that the variables with the highest influence are a mix of both topological and biological factors, confirming previous findings that characteristics from both categories play an important role in the susceptibility to biological disruption and resulting disease. The relative importance of clustering coefficients confirms recent results examining the differences between disease-associated genes and essential genes (27). The inclusion of mean gene start locus and GC content confirms the relative importance of genomic localization and transcriptional propensity (28). While the examination of individual factors increases confidence in the findings through recapitulation of established study results, the random forest is also able to capture the interactions between these variables. These inter-variable interactions are a prime target for continued study. It is not completely surprising that the SGD subnetworks appear to bear a strong resemblance to the GO background considering the pathogenesis of diseases that arise from anomalies in a single gene. In many cases, the GO-derived subnetworks can be considered functional biomodules of the interactome. The disruption of certain genes in these functional biomodules is likely to manifest in the form of disease phenotypes if the disruptions are not serious enough to result in lethality. 
This can result in failures of protein complex assembly and complementation such as in Xeroderma Pigmentosum, a single gene disease that can arise from any one of the seven known genes in the XPA-XPG complementation group associated with nucleotide excision repair. As such, these two classes are relatively poorly separable even in a supervised machine learning context. As we expected, the differences between the networks formed by sets of genes associated with biological processes and those associated with human disease are subtle and not easily derived as they are, by definition, intimately linked. The similarity between the single gene disease-associated subnetworks and those derived from the Gene Ontology demonstrates the multiscale behavior of a single disruption in a functional biomodule, and its ability to cause debilitating effects. The need for additional data and high specificity data is made abundantly clear in this study, as demonstrated by the propensity for misclassification of complex disease-associated subnetworks as well as the limited number of subnetworks derived from the data due to lack of representation in the interaction network. The limited availability of interaction propensity or data quality measures associated with individual interactions in the particular version of the interaction database we employed led us to treat all interactions as equally probable and equally correct. This may be a source of error in the process that may be ameliorated in the future with additional data and quantitative measures associated with the interactions. As more gene-disease association data becomes available, the effectiveness of this method should be re-evaluated.

Principal Components Analysis demonstrates the poor separability of the data
Figure 1a. Principal components analysis of all sets using all parameters. 95% of data points fall within the ellipse.
Figure 1b. Principal components analysis of SGD and CD sets using all parameters. 95% of data points fall within the ellipse.

Supplementary Tables and Figures

A principal components analysis of the combined sets using all the parameters suggests that the differences between disease-related subnetworks and the GO baseline subnetworks are subtle and not easily derived. When the PCA is done over just the CD and SGD sets, we see a similar pattern where there is no clear separation. However, the non-continuous nature of the features may be a confounding factor when applying the PCA approach. With that in mind, a simple k-means clustering approach was taken where k = 3 to represent the three source types.

Complete results of unsupervised k-means clustering of the data

Size distribution of subnetworks in each category
Figure 2a: Size Distribution of SGD Subnetworks
Figure 2b: Size Distribution of CD Subnetworks
Figure 2c: Size Distribution of GO Subnetworks

Classification results from each of nine classification attempts using complete GO set
Biological Parameters Only
Biological parameters only: dataset split into “disease” and “normal” classes
Biological parameters only: dataset split into CD, SGD, and GO classes
Biological parameters only: SGD and GO classes
Topological Parameters Only
Topological Parameters Only: dataset split into “disease” and “normal” classes
Topological Parameters Only: dataset split into CD, SGD, and GO classes
Topological Parameters Only: SGD and GO classes
Combined Parameterization
All parameters: dataset split into “disease” and “normal” classes
All parameters: dataset split into CD, SGD, and GO classes
All parameters: SGD and GO classes
All parameters: SGD and CD classes

The first classification was done on a set combining all SGD and CD subnetworks into a single larger disease class in comparison to the GO-derived background set. The second classification used only the SGD subset of the data in comparison to the GO data. The third classification used each subset of data in its own discrete class. 
These subsets were further separated into three groups depending on the underlying parameters available to the classifier. These groups used parameters exclusively from the topological and biological parameter sets, as well as the combined parameterization. It can be seen that overall the biological characteristics prove more informative than the topological ones and achieve a lower misclassification error rate, ranging between 2.89 and 3.70%. On the other hand, for the topological characteristics the misclassification error rate was around 10% for the three-class problem. However, when the CD class was excluded, the topological characteristics matched the performance of the biological ones. Further, an inspection of Sup. suggests that the presence of the SGD class is the source of the significantly higher misclassification error rate with respect to the topological features. In most cases, the presence of the large number of representative GO subnetworks leads to a high classification accuracy. However, it is useful to examine the true positive (TP) rate of classification between the combined “disease” set, a combination of the SGD and CD sets, and the GO background. In the combined parameterization and biological parameter only cases, the TP rate of this combined set is relatively good, at 61% and 72%, respectively. Examination of the TP rates for classifying into the three distinct classes reveals that the subnetworks in the SGD set appear to be poorly distinguishable from the background GO set. However, the CD set appears to have predictive power setting it apart from the GO background. This similarity between the GO and SGD sets likely leads to the poor classification accuracy seen between the two sets as reflected in the poor TP values for the SGD set in Sup. . 
Ranked features by parameter type
Biological Parameters Only
Topological Parameters Only
Combined Parameterization

Classification results from each of nine classification attempts using “Biological Process” only GO set
Biological Parameters Only
Biological parameters only: dataset split into “disease” and “normal” classes
Biological parameters only: dataset split into CD, SGD, and GO classes
Biological parameters only: SGD and GO classes
Topological Parameters Only
Topological Parameters Only: dataset split into “disease” and “normal” classes
Topological Parameters Only: dataset split into CD, SGD, and GO classes
Topological Parameters Only: SGD and GO classes
Combined Parameterization
All parameters: dataset split into “disease” and “normal” classes
All parameters: dataset split into CD, SGD, and GO classes
All parameters: SGD and GO classes
All parameters: SGD and CD classes

Data were extracted from MiMi using SQL queries for human-specific interactions from the National Center for Integrative Biomedical Informatics SQL server using SQL Server Management Studio Express. The disease and biological process associated subnetworks are built from two fundamental components. First, a protein interaction network is used to define the relationships and interactions between the proteins considered in the study. We separate the OMIM and PhenoGO sets for two reasons. The primary factor for the separation is the drastically different underlying focus of these two resources, although they do share some commonly annotated diseases. PhenoGO contains data describing both single gene and multi-gene complex disease, whereas OMIM is primarily focused on single gene diseases. The secondary factor is curation: the OMIM data is manually curated, while PhenoGO is a computationally derived data source. 
Derivation of the subnetworks was done using the Boost Library version 1.43.1 (http://www.boost.org/) and version 0.9 of the Boost Graph Library bindings to Python (http://osl.iu.edu/~dgregor/bgl-python/) using ActiveState ActivePython version 2.4.3 (http://www.activestate.com/). Subnetworks that resulted in errors in the software were removed from the set, as the memory requirements for processing a number of large, dense networks were beyond the memory capacity of our workstation. Because the data in the PhenoGO resource span drugs, cell types, and other biological contexts not directly associated with disease, the subnetworks formed by this resource were filtered using the UMLS metathesaurus. Therefore, only genes associated with MeSH and UMLS terms are used to create the subnetworks. To restrict the set, a list of UMLS and MeSH codes was derived using a Perl script containing a total of unique terms. Of the 423,550 terms in the UMLS and MeSH that met these rules, the UMLS composed 419,087 terms and MeSH composed 5,563 terms. This process of restricting the set yielded a dramatic reduction in the number of subnetworks in the disease set. The data from the biological and topological characterization for each of the classes were then filtered for size using a Perl script, constraining the set to networks of size between 3 and 9999 nodes. From the OMIM and PhenoGO sets, 79 and 278 subnetworks passed this filter, respectively, as did 2,590 of the subnetworks generated from the Gene Ontology. To characterize subnetworks structurally, we chose a number of well-defined metrics to measure their size, density, and connectivity. Subnetworks are characterized based on node count, clustering coefficient, average degree, maximum degree, radius, diameter, cyclicity, and biconnectivity. Cyclicity and biconnectivity are handled as Boolean variables with values of either 1 (true) or 0 (false). 
To account for the biological characteristics of these subnetworks, we use annotations for their constituent genes extracted from BioMart. These factors account for positional and orientation effects, the biological role of the protein product, and physical stability. Factors include mean gene start location, mean gene end location, mean length, strand, mean PFAM domain annotation count, mean ProSite annotation count, mean number of signal domains, mean number of transmembrane domains, and mean G-C content fraction. Parameterization of subnetworks was done using a series of Perl scripts using the Perl-Graph library version 0.84 (http://search.cpan.org/dist/Graph/) as well as the Boost Graph Library Bindings for Perl version 1.4 (http://search.cpan.org/~dburdick/Boost-Graph-1.4/). These libraries were used to determine the topological characteristics of each of the subnetworks. Factors include the average degree, maximum degree, node count, radius, and diameter for each subnetwork. Each subnetwork was also tested for cyclicity and biconnectivity. During the parameterization process, a number of entries were removed from the set as the subnetworks they formed were not computable within the memory limits of our workstation.

GO:0007218 - neuropeptide signaling pathway
GO:0045893 - positive regulation of transcription, DNA-dependent
GO:0006937 - regulation of muscle contraction

Classification was done with Weka using the built-in weka.classifiers.trees.RandomForest package. The parameterized data was split into 3 sets for the biological and topological groups. The first set was composed of all three data sources comprising three distinct classes. The second set assigned “normal” and “disease” flags to the subnetworks derived from the Gene Ontology, and OMIM and PhenoGO, respectively. The third subset was composed of only disease subnetworks derived from OMIM while maintaining the GO background set. 
A factor analysis was done using the randomForest package in R 2.7.1 in each of the biological parameter only, topological parameter only, and combined parameter groups to determine the relative influence of each of the parameters in determining class membership in each of the classification sets. The random forest was set to use 4 variables per tree and 100 total trees for the classification task.
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 3 -S 10
Relation: combined_data
Instances: 2944
Attributes: 20
  average gene start
  average gene end
  average length
  average gene strand
  average pfam count
  average prosite count
  average # of singnal domains
  average # transmembrane domains
  average GC content
  observed edges/total possible edges
  average node degree
  max node degree
  radius
  diameter
  node count
  cyclicity
  biconnectivity
  clustering coefficent
Ignored:
  source
  phenotype code
Test mode: Classes to clusters evaluation on training data
=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 6
Within cluster sum of squared errors: 1660.859140812153
Table 2a.

Biological parameters only: dataset split into “disease” and “normal” classes

Out of bag error: 0.0309
Correctly Classified Instances: 2836 (96.2988 %)
Incorrectly Classified Instances: 109 (3.7012 %)
Kappa statistic: 0.8064
Mean absolute error: 0.1287
Root mean squared error: 0.216
Relative absolute error: 60.339 %
Root relative squared error: 66.1667 %
Total Number of Instances: 2945

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.995    0.272    0.964      0.995   0.979      GO
0.728    0.005    0.956      0.728   0.827      Disease
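
The summary figures in Table 2a can be cross-checked with a few lines of arithmetic. The sketch below uses the standard definition of the F-measure as the harmonic mean of precision and recall and recomputes the percentage of correctly classified instances; the function names are hypothetical, and the inputs are the rounded values printed in the table.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def percent_correct(correct, total):
    """Share of correctly classified instances, as a percentage."""
    return 100.0 * correct / total


# Values taken from Table 2a (biological parameters, "disease" vs "normal").
go_f = f_measure(precision=0.964, recall=0.995)
disease_f = f_measure(precision=0.956, recall=0.728)
accuracy = percent_correct(correct=2836, total=2945)

print(round(go_f, 3), round(disease_f, 3), round(accuracy, 4))
```

The recomputed values agree with the reported F-measures and accuracy to the printed precision. Note that recall equals the TP rate for each class, which is why the two columns are identical in these tables.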
Table 2b.

Biological parameters only: dataset split into CD, SGD, and GO classes

Out of bag error: 0.0309
Correctly Classified Instances: 2832 (96.163 %)
Incorrectly Classified Instances: 113 (3.837 %)
Kappa statistic: 0.8008
Mean absolute error: 0.0893
Root mean squared error: 0.1801
Relative absolute error: 61.2569 %
Root relative squared error: 66.7931 %
Total Number of Instances: 2945

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.165    0.003    0.565      0.165   0.255      SGD
0.867    0.001    0.992      0.867   0.925      CD
0.996    0.283    0.962      0.996   0.979      GO
Table 2c.

Biological parameters only: SGD and GO classes

Out of bag error: 0.0274
Correctly Classified Instances: 2590 (97.1129 %)
Incorrectly Classified Instances: 77 (2.8871 %)
Kappa statistic: 0.1974
Mean absolute error: 0.0527
Root mean squared error: 0.1661
Relative absolute error: 91.1176 %
Root relative squared error: 97.9961 %
Total Number of Instances: 2667

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.127    0.003    0.556      0.127   0.206      SGD
0.997    0.873    0.974      0.997   0.985      GO
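
Table 2c pairs a high accuracy (97.1 %) with a low kappa statistic (0.1974), and a short calculation shows why. The sketch below computes Cohen's kappa from a confusion matrix. The matrix itself is not given in the source: the counts used here (10 of 79 SGD and 2580 of 2588 GO subnetworks classified correctly) are an inference that is consistent with the reported totals, TP rates, and precision, and should be treated as an assumption. The function name is hypothetical.

```python
def cohen_kappa(matrix):
    """Cohen's kappa; matrix[i][j] counts class-i instances predicted as class j."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of actual and predicted class totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


# Assumed confusion matrix for Table 2c (rows: actual SGD, GO; columns: predicted SGD, GO).
assumed_matrix = [
    [10, 69],
    [8, 2580],
]
kappa = cohen_kappa(assumed_matrix)
print(round(kappa, 4))
```

Because the GO class makes up about 97 % of the instances, a classifier that assigns nearly everything to GO already agrees with the true labels about 96 % of the time by chance, so kappa credits only the small margin above that level.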
Table 2d.

Topological Parameters Only: dataset split into “disease” and “normal” classes

Out of bag error: 0.0853
Correctly Classified Instances: 2675 (90.8628 %)
Incorrectly Classified Instances: 269 (9.1372 %)
Kappa statistic: 0.4646
Mean absolute error: 0.1475
Root mean squared error: 0.2732
Relative absolute error: 69.1481 %
Root relative squared error: 83.7012 %
Total Number of Instances: 2944

TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0.392    0.02     0.729      0.392   0.51       Disease
0.98     0.608    0.921      0.98    0.95       GO
Table 2e.

Topological Parameters Only: dataset split into CD, SGD, and GO classes

Out of bag error: 0.0832
Correctly Classified Instances    2688    91.30 %
Incorrectly Classified Instances  256     8.70 %
Kappa statistic                   0.4863
Mean absolute error               0.1016
Root mean squared error           0.2241
Relative absolute error           69.7015 %
Root relative squared error       83.1102 %
Total Number of Instances         2944

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.038     0.004     0.214       0.038    0.065       SGD
0.493     0.011     0.83        0.493    0.619       CD
0.985     0.608     0.922       0.985    0.952       GO
Table 2f.

Topological Parameters Only: SGD and GO classes

Out of bag error: 0.0315
Correctly Classified Instances    2581    96.81 %
Incorrectly Classified Instances  85      3.19 %
Kappa statistic                   0.0586
Mean absolute error               0.0543
Root mean squared error           0.1716
Relative absolute error           93.8315 %
Root relative squared error       101.201 %
Total Number of Instances         2666

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.038     0.003     0.25        0.038    0.066       SGD
0.997     0.962     0.971       0.997    0.984       GO
Table 2g.

All parameters: dataset split into “disease” and “normal” classes

Out of bag error: 0.0452
Correctly Classified Instances    2791    94.803 %
Incorrectly Classified Instances  153     5.197 %
Kappa statistic                   0.7128
Mean absolute error               0.1269
Root mean squared error           0.2191
Relative absolute error           59.5021 %
Root relative squared error       67.1287 %
Total Number of Instances         2944

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.611     0.005     0.94        0.611    0.74        Disease
0.995     0.389     0.949       0.995    0.971       GO
Table 2h.

All parameters: dataset split into CD, SGD, and GO classes

Out of bag error: 0.0438
Correctly Classified Instances    2795    94.9389 %
Incorrectly Classified Instances  149     5.0611 %
Kappa statistic                   0.7225
Mean absolute error               0.0886
Root mean squared error           0.1815
Relative absolute error           60.7398 %
Root relative squared error       67.2984 %
Total Number of Instances         2944

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.101     0.003     0.5         0.101    0.168       SGD
0.997     0.387     0.949       0.997    0.972       GO
0.752     0.001     0.986       0.752    0.853       CD
Table 2i.

All parameters: SGD and GO classes

Out of bag error: 0.0281
Correctly Classified Instances    2591    97.1868 %
Incorrectly Classified Instances  75      2.8132 %
Kappa statistic                   0.2332
Mean absolute error               0.0498
Root mean squared error           0.1594
Relative absolute error           86.0831 %
Root relative squared error       93.9883 %
Total Number of Instances         2666

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.152     0.003     0.6         0.152    0.242       SGD
0.997     0.848     0.975       0.997    0.986       GO
Table 2j.

All parameters: SGD and CD classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Relation: OMIM-PhenoGO-weka.filters.unsupervised.attribute.Remove-R2
Instances: 357
Attributes: 19
  source
  average gene start
  average gene end
  average length
  average gene strand
  average pfam count
  average prosite count
  average # of signal domains
  average # transmembrane domains
  average GC content
  observed edges/total possible edges
  average node degree
  max node degree
  radius
  diameter
  node count
  cyclicity
  biconnectivity
  clustering coefficient
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
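The classifier configuration is encoded in the `Scheme:` line of the run information. As a rough sketch, the snippet below splits such a line into the classifier name and its options. The flag meanings (`-I` number of trees, `-K` number of randomly chosen features per split, `-S` random seed) follow Weka's RandomForest options; the helper `parse_scheme` and the key names in `FLAG_NAMES` are hypothetical and only serve the illustration.

```python
import shlex

# Readable names for the RandomForest flags (illustrative key names).
FLAG_NAMES = {
    "-I": "num_trees",
    "-K": "num_features_per_split",
    "-S": "random_seed",
}


def parse_scheme(scheme: str) -> dict:
    """Split a Weka 'Scheme:' value into classifier name and integer options."""
    classifier, *tokens = shlex.split(scheme)
    options = {}
    # Flags and values alternate: -I 100 -K 4 -S 1
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        options[FLAG_NAMES.get(flag, flag)] = int(value)
    return {"classifier": classifier, **options}


config = parse_scheme("weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1")
print(config)
```

Read this way, the run above used a forest of 100 trees, 4 candidate features per split, and seed 1, evaluated by 10-fold cross-validation.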
Table 3a:

Biological Parameters Only

                                  GO          SGD/OMIM    CD/PhenoGO   MeanDecreaseAccuracy   MeanDecreaseGini
averageGeneStart                  0.2783482   1.0280783   0.9059960    0.2757494              84.56684
averageGeneEnd                    0.2768157   0.9394527   0.8925733    0.2747467              82.32455
averageLength                     0.2644807   1.2301754   0.9510359    0.2876197              89.97404
averageGeneStrand                 0.1758904   0.1357294   0.9539724    0.2776031              63.51283
averagePfamCount                  0.2730130   0.5254745   0.8856997    0.2717815              68.71366
averagePrositeCount               0.2732054   0.7780531   0.8667791    0.2729219              71.44485
averageSingnalDomainCount         0.2126032   1.1321489   0.9215645    0.2744301              46.04487
averageTransmembraneDomainsCount  0.2369126   0.7511460   0.9107138    0.2746473              41.26618
averageGCContent                  0.2527932   1.1863229   0.9633071    0.2872784              90.52120
Table 3b:

Topological Parameters Only

                                  GO           SGD/OMIM     CD/PhenoGO   MeanDecreaseAccuracy   MeanDecreaseGini
observedEdgeFraction              0.23001163   0.5940764    0.90312482   0.24675347             93.847995
averageNodeDegree                 0.18907358   -0.1896722   0.92494854   0.25118579             73.325193
maxNodeDegree                     0.23248537   -0.0195584   0.75146118   0.23964507             45.595834
radius                            0.14363009   0.3341730    0.73797260   0.17620126             10.558500
diameter                          0.16504637   0.3258433    0.89950612   0.21990106             24.283709
nodeCount                         0.24716779   0.1174077    0.62814917   0.24756213             47.349672
cyclicity                         0.07668406   0.1599157    0.05666838   0.08233893             2.229017
biconnectivity                    0.05281318   0.2182699    0.47637630   0.10961336             3.538654
clusteringCoefficent              0.28966769   0.9925351    0.96101890   0.28810431             97.553541
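To make the relative weight of the topological variables easier to read, the MeanDecreaseGini column of Table 3b can be sorted and normalized. This is only an illustrative sketch: the values are copied from the table, the feature names are kept as printed there, and expressing each value as a share of the column total is a convenience added here, not a statistic reported in the study.

```python
# MeanDecreaseGini values copied from Table 3b (feature names as printed there).
mean_decrease_gini = {
    "observedEdgeFraction": 93.847995,
    "averageNodeDegree": 73.325193,
    "maxNodeDegree": 45.595834,
    "radius": 10.558500,
    "diameter": 24.283709,
    "nodeCount": 47.349672,
    "cyclicity": 2.229017,
    "biconnectivity": 3.538654,
    "clusteringCoefficent": 97.553541,
}

# Sort from most to least important by Gini decrease.
ranked = sorted(mean_decrease_gini.items(), key=lambda kv: kv[1], reverse=True)
total = sum(mean_decrease_gini.values())

for name, gini in ranked:
    # Share of the summed Gini decrease (illustrative normalization).
    print(f"{name:<22} {gini:>10.3f} {gini / total:6.1%}")
```

On this ordering the clustering coefficient and the observed edge fraction rank highest, while cyclicity and biconnectivity contribute least.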
Table 3c:

Combined Parameterization

                                  GO           SGD/OMIM    CD/PhenoGO   MeanDecreaseAccuracy   MeanDecreaseGini
averageGeneStart                  0.25577147   0.6187922   0.8782965    0.2631096              58.025555
averageGeneEnd                    0.24189366   0.8649050   0.8823725    0.2517155              54.866536
averageLength                     0.21860181   1.0476172   0.9157395    0.2702029              53.928221
averageGeneStrand                 0.21222727   0.4779712   0.8899448    0.2613027              37.971447
averagePfamCount                  0.24589871   0.7138733   0.8139401    0.2557329              51.837923
averagePrositeCount               0.24653767   0.8026352   0.8288924    0.2553449              51.873560
averageSingnalDomainCount         0.17608440   0.8725259   0.8494207    0.2504462              28.695867
averageTransmembraneDomainsCount  0.17643006   0.8016404   0.8587388    0.2398903              25.758543
averageGCContent                  0.20630777   1.0249891   0.9042456    0.2621500              57.568889
observedEdgeFraction              0.22721854   0.9567640   0.8553682    0.2424491              39.992423
averageNodeDegree                 0.24357245   0.6044357   0.8350696    0.2586311              33.451690
maxNodeDegree                     0.23311884   0.5687222   0.7704013    0.2418791              23.089282
radius                            0.19372018   0.5507024   0.5725879    0.1942285              9.303571
diameter                          0.22263432   0.7683270   0.7232573    0.2295851              16.473967
nodeCount                         0.23954530   0.7925986   0.8041791    0.2430081              25.501844
cyclicity                         0.11050759   0.2201386   0.5157050    0.1559355              3.013125
biconnectivity                    0.07642597   0.1890160   0.2993280    0.1074229              1.420956
clusteringCoefficent              0.26042896   1.4008805   0.8991804    0.2705517              61.914586
Table 4a.

Biological parameters only: dataset split into “disease” and “normal” classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Instances: 1706
Attributes: 10
  source
  average gene start
  average gene end
  average length
  average gene strand
  average pfam count
  average prosite count
  average # of signal domains
  average # transmembrane domains
  average GC content
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 4b.

Biological parameters only: dataset split into CD, SGD, and GO classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Instances: 1706
Attributes: 10
  source
  average gene start
  average gene end
  average length
  average gene strand
  average pfam count
  average prosite count
  average # of signal domains
  average # transmembrane domains
  average GC content
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 4c.

Biological parameters only: SGD and GO classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 6 -S 1
Relation: filtered_biological_2class_OMIM_omly-weka.filters.unsupervised.attribute.Remove-R2
Instances: 1428
Attributes: 10
  source
  average gene start
  average gene end
  average length
  average gene strand
  average pfam count
  average prosite count
  average # of signal domains
  average # transmembrane domains
  average GC content
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 2d.

Topological Parameters Only: dataset split into “disease” and “normal” classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Relation: filtered_2class_topological_data-weka.filters.unsupervised.attribute.Remove-R2
Instances: 1705
Attributes: 10
  state
  observed edges/total possible edges
  average node degree
  max node degree
  radius
  diameter
  node count
  cyclicity
  biconnectivity
  clustering coefficient
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 2e.

Topological Parameters Only: dataset split into CD, SGD, and GO classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Relation: filtered_3class_topological_data-weka.filters.unsupervised.attribute.Remove-R2
Instances: 1705
Attributes: 10
  source
  observed edges/total possible edges
  average node degree
  max node degree
  radius
  diameter
  node count
  cyclicity
  biconnectivity
  clustering coefficient
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 2f.

Topological Parameters Only: SGD and GO classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Relation: filtered_2class_omimonly_topological_data-weka.filters.unsupervised.attribute.Remove-R2
Instances: 1427
Attributes: 10
  source
  observed edges/total possible edges
  average node degree
  max node degree
  radius
  diameter
  node count
  cyclicity
  biconnectivity
  clustering coefficient
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 2g.

All parameters: dataset split into “disease” and “normal” classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Relation: filtered_combined_data_2class-weka.filters.unsupervised.attribute.Remove-R2
Instances: 1705
Attributes: 19
  source
  average gene start
  average gene end
  average length
  average gene strand
  average pfam count
  average prosite count
  average # of signal domains
  average # transmembrane domains
  average GC content
  observed edges/total possible edges
  average node degree
  max node degree
  radius
  diameter
  node count
  cyclicity
  biconnectivity
  clustering coefficient
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 2h.

All parameters: dataset split into CD, SGD, and GO classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Relation: filtered_combined_data-weka.filters.unsupervised.attribute.Remove-R2
Instances: 1705
Attributes: 19
  source
  average gene start
  average gene end
  average length
  average gene strand
  average pfam count
  average prosite count
  average # of signal domains
  average # transmembrane domains
  average GC content
  observed edges/total possible edges
  average node degree
  max node degree
  radius
  diameter
  node count
  cyclicity
  biconnectivity
  clustering coefficient
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 2i.

All parameters: SGD and GO classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Relation: filtered_combined_data_2class_omim_only-weka.filters.unsupervised.attribute.Remove-R2
Instances: 1427
Attributes: 19
  source
  average gene start
  average gene end
  average length
  average gene strand
  average pfam count
  average prosite count
  average # of signal domains
  average # transmembrane domains
  average GC content
  observed edges/total possible edges
  average node degree
  max node degree
  radius
  diameter
  node count
  cyclicity
  biconnectivity
  clustering coefficient
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 2j.

All parameters: SGD and CD classes

=== Run information ===
Scheme: weka.classifiers.trees.RandomForest -I 100 -K 4 -S 1
Relation: filtered_OMIM-PhenoGO-weka.filters.unsupervised.attribute.Remove-R2
Instances: 357
Attributes: 19
  source
  average gene start
  average gene end
  average length
  average gene strand
  average pfam count
  average prosite count
  average # of signal domains
  average # transmembrane domains
  average GC content
  observed edges/total possible edges
  average node degree
  max node degree
  radius
  diameter
  node count
  cyclicity
  biconnectivity
  clustering coefficient
Test mode: 10-fold cross-validation
=== Classifier model (full training set) ===
Table 1:

Unsupervised k-means clustering illustrates the poor separability of the data, with 1631 (55.4%) instances incorrectly clustered

            Assigned to Cluster
Source      GO      SGD     CD
GO          59      4       16
SGD         1220    435     932
CD          158     31      89
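The misclustering figure in the caption can be reproduced from the cell counts of Table 1. The sketch below assumes a classes-to-clusters style evaluation, in which each cluster is mapped to exactly one class and the mapping with the most agreements is kept; the source does not state how the 1631 figure was computed, so this scoring rule is an assumption. The counts are read row by row in the order printed.

```python
from itertools import permutations

# Rows are source classes, columns are assigned clusters, in the order printed in Table 1.
counts = [
    [59, 4, 16],
    [1220, 435, 932],
    [158, 31, 89],
]

total = sum(sum(row) for row in counts)

# Try every one-to-one mapping of rows to clusters and keep the best agreement count.
best_correct = max(
    sum(counts[row][col] for row, col in enumerate(mapping))
    for mapping in permutations(range(len(counts)))
)

incorrect = total - best_correct
print(total, incorrect, f"{incorrect / total:.1%}")
```

Under that assumption the counts give 2944 instances in total, of which 1631 (55.4%) are incorrectly clustered, matching the caption.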
Table 2.

Classification of CD, SGD, and GO classes using all variables

Correctly Classified Instances    2795    94.94 %
Incorrectly Classified Instances  149     5.06 %

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
0.101     0.003     0.5         0.101    0.168       SGD
0.997     0.387     0.949       0.997    0.972       GO
0.752     0.001     0.986       0.752    0.853       CD