Ashis Saha, Aik Choon Tan, Jaewoo Kang.
Abstract
Genes act in concert via specific networks to drive various biological processes, including progression of diseases such as cancer. Under different phenotypes, different subsets of the gene members of a network participate in a biological process. Single gene analyses are less effective than gene set/network-based analyses in identifying such core gene members (subnetworks) within a gene set/network. Hence, it is useful to identify a discriminative classifier by focusing on the subnetworks that correspond to different phenotypes. Here we present a novel algorithm to automatically discover the important subnetworks of closely interacting molecules to differentiate between two phenotypes (context) using gene expression profiles. We name it COSSY (COntext-Specific Subnetwork discoverY). It is a non-greedy algorithm and thus unlikely to have local optima problems. COSSY works for any interaction network regardless of the network topology. One added benefit of COSSY is that it can also be used as a highly accurate classification platform which can produce a set of interpretable features.
Year: 2014 PMID: 24392115 PMCID: PMC3877685 DOI: 10.1371/journal.pone.0084227
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Overview of COSSY.
Communities are extracted from the molecular interaction network to generate Molecular Interaction Subnetworks (MISs)(A–B). Each MIS is mapped to microarray probes and all the samples are clustered according to the expression pattern of a certain number of highly differentially expressed probes (3 probes in this example figure). MISs are ranked by the entropy score which is lowest when every cluster contains only one type (phenotype) of samples (C–D). Finally, the top MISs cast votes to predict the context (phenotype) of a new sample. The voting depends on the proportion of different types of samples in the cluster closest to the new sample (E–G).
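The voting step (E–G) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each top MIS votes with the positive-class proportion of the cluster nearest to the new sample, that "nearest" means Euclidean distance to the cluster centroid, and that votes are combined by an unweighted average. The function names, the 0.5 threshold, and all toy values are hypothetical.

```python
import math

def nearest_cluster(sample, centroids):
    """Index of the centroid closest to the sample (Euclidean distance, an assumption)."""
    return min(range(len(centroids)), key=lambda i: math.dist(sample, centroids[i]))

def mis_vote(sample, centroids, positive_fraction):
    """Vote of one MIS: the positive-class proportion of the closest cluster."""
    return positive_fraction[nearest_cluster(sample, centroids)]

def predict_context(votes, threshold=0.5):
    """Combine MIS votes by an unweighted average (an assumption) and threshold it."""
    score = sum(votes) / len(votes)
    label = "positive" if score > threshold else "negative"
    return label, score

# Hypothetical toy data: two top MISs, each with two clusters over three probes.
mis_a = {"centroids": [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)],
         "positive_fraction": [0.9, 0.1]}
mis_b = {"centroids": [(2.0, 2.0, 2.0), (8.0, 8.0, 8.0)],
         "positive_fraction": [0.8, 0.2]}

# Expression of the new sample on each MIS's own representative probes.
sample_a = (1.0, 0.0, 1.0)
sample_b = (3.0, 2.0, 2.0)

votes = [mis_vote(sample_a, **mis_a), mis_vote(sample_b, **mis_b)]
label, score = predict_context(votes)
print(label, round(score, 2))
```

Because each vote is a proportion rather than a hard label, a cluster with mixed phenotypes contributes a weaker vote than a pure one.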
Figure 2. The thyroid cancer pathway in KEGG (ID: hsa05216).
The top rectangle marked by C1 shows that RET/PTC and TRK have an indirect effect on the activation of Ras. Ras then activates BRAF, and BRAF phosphorylates MEK, which in turn phosphorylates ERK. The result of this path is proliferation and survival. This pathway has five connected components (C1–C5). Among them, C2 is actually a subset of C1, and the others are fully disconnected, i.e., there is no significant interaction between any pair of the components.
Figure 3. Ranking of MIS.
Let three probes (i, j, and k) constitute the representative probeset of an MIS. We plot all the samples with the expression values of these probes in separate dimensions, and then we cluster the samples. If the samples in a cluster are mostly of one kind, we can say that the cluster’s expression pattern represents the corresponding class (positive or negative). An MIS producing such clusters should be ranked high.
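One way to turn this idea into a score is a size-weighted Shannon entropy over the clusters, which is zero when every cluster contains a single phenotype. The sketch below assumes that formula; the exact entropy score used by COSSY is not given in this caption, and the function names and toy clusterings are hypothetical.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Shannon entropy (bits) of the class labels inside one cluster."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def mis_entropy_score(clusters):
    """Size-weighted mean of the cluster entropies (assumed scoring rule)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Hypothetical clusterings of eight samples produced by two MISs.
pure_mis = [["+", "+", "+", "+"], ["-", "-", "-", "-"]]
mixed_mis = [["+", "-", "+", "-"], ["+", "-", "+", "-"]]

scores = {"pure_mis": mis_entropy_score(pure_mis),
          "mixed_mis": mis_entropy_score(mixed_mis)}

# Lower entropy means purer clusters, so rank in ascending order of score.
ranking = sorted(scores, key=scores.get)
print(ranking)
```

Weighting by cluster size keeps a tiny pure cluster from masking a large mixed one.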
Microarray Datasets.
| Dataset Name | #Probes | Positive Class (#samples) | Negative Class (#samples) | Reference |
| Leukemia | 7129 | AML (25) | ALL (47) | |
| CNS | 7129 | Desmoplastic (9) | Classic (25) | |
| DLBCL | 7129 | DLBCL (58) | FL (19) | |
| Prostate1 | 12600 | Tumor (52) | Normal (50) | |
| Prostate3 | 12626 | Tumor (24) | Normal (9) | |
| Lung | 12533 | MPM (31) | ADCA (150) | |
| GCM | 16063 | Tumor (190) | Normal (90) | |
The first column, ‘Dataset Name’, indicates the name of the microarray dataset used in the manuscript. ‘#Probes’ shows the number of probes present in the dataset. The third and fourth columns contain the name of the positive and negative class, respectively, followed by the number of samples of that class. The last column shows the reference the dataset was collected from.
Network Properties of KEGG and STRING.
| Network | Total Nodes | Gene (Protein) Nodes | Total Edges | Connected Comp. | Avg. Node Degree | Max Node Degree | Clustering Coefficient |
| KEGG | 19568 | 10691 | 10728 | 4494 | 1.84 | 43 | 0.19 |
| STRING | 14250 | 14250 | 215800 | 182 | 30.30 | 1110 | 0.61 |
The ‘Total Nodes’ column contains the total number of nodes available in the network, while the ‘Gene (Protein) Nodes’ column shows the number of nodes with at least one gene in KEGG (or one protein in STRING). The fourth and fifth columns contain the total number of edges and the number of connected components having at least one gene (or protein), respectively. ‘Avg. Node Degree’ represents the number of edges a node has on average. ‘Max Node Degree’ denotes the maximum number of edges a node has in the network. ‘Clustering Coefficient’ is the ratio of the triangles to the connected triples in a graph.
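The degree and clustering statistics described above can be computed on a toy graph as follows. This is a sketch under assumptions: the graph is undirected and has no self-loops, and the clustering coefficient follows the usual transitivity convention, in which each triangle is counted once per centre node (i.e., 3 × triangles / connected triples) so that a complete graph scores 1. Whether the table used exactly this convention is not stated; the function names and the example edges are hypothetical.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def average_degree(adj):
    """Mean number of edges per node."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def max_degree(adj):
    """Largest number of edges attached to any node."""
    return max(len(nbrs) for nbrs in adj.values())

def clustering_coefficient(adj):
    """Closed connected triples divided by all connected triples."""
    triples = 0  # paths of length two centred on a node
    closed = 0   # triples whose two end points are also linked
    for nbrs in adj.values():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

# Hypothetical toy network: a triangle (a, b, c) with one pendant node d.
adj = build_adjacency([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(average_degree(adj), max_degree(adj), clustering_coefficient(adj))
```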
Molecular Interaction Subnetwork Size.
| Network | Appropriate Range | MIS below the range | MIS within the range | MIS above the range |
| KEGG | 5–15 | 1925 | 629 | 9 |
| STRING | 5–25 | 170 | 847 | 23 |
The table shows the number of MISs with a total of nodes below, within, and above the appropriate range.
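The counts in this table amount to binning MISs by node count. A minimal sketch, assuming the range bounds are inclusive (the table does not say) and using hypothetical MIS sizes:

```python
def bin_by_size(sizes, low, high):
    """Count MISs whose node count is below, within, or above [low, high]."""
    counts = {"below": 0, "within": 0, "above": 0}
    for size in sizes:
        if size < low:
            counts["below"] += 1
        elif size > high:
            counts["above"] += 1
        else:
            counts["within"] += 1
    return counts

# Hypothetical MIS sizes, binned with the KEGG range 5–15 from the table.
example_counts = bin_by_size([3, 5, 10, 15, 16, 40], low=5, high=15)
print(example_counts)
```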
Figure 4. The top ranked KEGG MIS in the Leukemia dataset.
(A–C) Three overlapping subnetworks from three different pathways constitute the MIS. (D) The merged MIS. (E) The expression heatmap of the representative probeset of the MIS.
LOOCV accuracy (%) of classifiers.
| Method | Leukemia | CNS | DLBCL | Prostate1 | Prostate3 | Lung | GCM | Average |
| COSSY [KEGG] | 98.6 | 85.3 | 93.5 | 90.2 | 100.0 | 99.5 | 85.0 | 93.2 |
| COSSY [STRING] | 95.8 | 88.2 | 94.8 | 90.2 | 97.0 | 98.3 | 84.6 | 92.7 |
| DIRAC | 94.8 | 72.3 | 73.4 | 62.9 | 100.0 | 98.8 | 75.2 | 82.5 |
| k-TSP | 95.8 | 97.1 | 97.4 | 91.2 | 97.0 | 98.9 | 85.4 | 94.7 |
| TSP | 93.8 | 77.9 | 98.1 | 95.1 | 97.0 | 98.3 | 75.4 | 90.8 |
| SVM | 98.6 | 82.4 | 97.4 | 91.2 | 100.0 | 99.5 | 93.2 | 94.6 |
| Doublet [Sign-DT] | 93.1 | 82.4 | 97.4 | 86.3 | 97.0 | 98.3 | 85.0 | 91.3 |
| Doublet [Sumdiff-DT] | 91.7 | 70.6 | 97.4 | 82.4 | 87.9 | 95.0 | 81.4 | 86.6 |
| Doublet [Mul-DT] | 84.7 | 55.9 | 97.4 | 86.3 | 90.9 | 92.3 | 83.2 | 84.4 |
| Decision Tree (DT) | 73.6 | 67.7 | 80.5 | 87.3 | 84.9 | 96.1 | 77.9 | 81.1 |
| Naïve Bayes | 100.0 | 82.4 | 80.5 | 62.8 | 90.9 | 97.8 | 84.3 | 85.5 |
| k Nearest Neighbor | 84.7 | 76.5 | 84.4 | 76.5 | 87.9 | 98.3 | 82.9 | 84.5 |
| PAM | 97.2 | 82.4 | 85.7 | 91.2 | 100.0 | 99.5 | 79.3 | 90.7 |
The leftmost column contains the names of the methods; the rightmost column shows the average accuracy of each method over the seven datasets, and the other columns show the accuracy (%) for individual datasets. ‘COSSY [KEGG]’ and ‘COSSY [STRING]’ represent COSSY using KEGG and STRING, respectively. ‘DIRAC’ is the algorithm proposed in [17], whose LOOCV accuracies have been calculated using the MATLAB code published with the paper. k-TSP and TSP denote the classification algorithms described in [5] and [4], respectively. SVM stands for Support Vector Machine. ‘Doublet [Sign-DT]’, ‘Doublet [Sumdiff-DT]’, and ‘Doublet [Mul-DT]’ denote the classification methods using Sign-Doublet, Sumdiff-Doublet, and Mul-Doublet, respectively, with decision trees as described in [6]. The last three rows contain the LOOCV accuracies using the Naïve Bayes, k Nearest Neighbor, and PAM classifiers, respectively.
Results obtained from [5].
Results obtained from [6].
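For readers unfamiliar with the protocol behind these numbers, leave-one-out cross-validation (LOOCV) holds out each sample in turn, trains on the rest, and reports the percentage of held-out samples predicted correctly. The sketch below illustrates the loop with a nearest-centroid classifier as a stand-in; that classifier, the function names, and the toy data are assumptions for illustration and are not any of the methods in the table.

```python
import math

def fit_centroids(X, y):
    """Stand-in classifier: one mean expression vector per class."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}

def predict_centroid(model, row):
    """Assign the class whose centroid is closest to the row."""
    return min(model, key=lambda label: math.dist(row, model[label]))

def loocv_accuracy(X, y, fit, predict):
    """Percentage of samples classified correctly when each is held out once."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        model = fit(train_X, train_y)
        if predict(model, X[i]) == y[i]:
            correct += 1
    return 100.0 * correct / len(X)

# Hypothetical, well-separated two-probe expression data.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = ["neg", "neg", "neg", "pos", "pos", "pos"]
print(loocv_accuracy(X, y, fit_centroids, predict_centroid))
```

Passing `fit` and `predict` as arguments keeps the loop independent of the classifier, so any feature selection has to happen inside `fit` and therefore never sees the held-out sample.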
Figure 5. LOOCV accuracy of five notable classifiers.
COSSY [KEGG] and COSSY [STRING] stand for COSSY using KEGG and STRING, respectively. k-TSP and DIRAC are the classification algorithms described in [5] and [17], respectively. SVM stands for the Support Vector Machine algorithm.
Area Under Curve (AUC) values of COSSY and DIRAC for different datasets.
| Dataset Name | AUC of COSSY using KEGG | AUC of COSSY using STRING | AUC of DIRAC |
| Leukemia | | 0.985 | 0.948 |
| CNS | 0.862 | | 0.726 |
| DLBCL | | 0.976 | 0.636 |
| Prostate1 | 0.909 | | 0.635 |
| Prostate3 | | 0.972 | |
| Lung | | | 0.990 |
| GCM | | 0.889 | 0.757 |
| Average | | 0.945 | 0.813 |
AUC has been calculated using the ROCR package in R [35]. The best AUC for each dataset is highlighted in bold face.
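The AUC values here were computed with ROCR in R; the sketch below is not that package's code. It shows the equivalent rank-based definition in Python: AUC is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy scores are hypothetical.

```python
def auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs ordered correctly."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; label 1 is the positive class.
toy_scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(round(auc(toy_scores, toy_labels), 3))
```

The pairwise form is quadratic in the number of samples, which is acceptable for datasets of the size listed above.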
Figure 6. Stitching of the MISs found from the Leukemia dataset.
The number at the end of the name of an MIS indicates its rank.