J Keith Vass, Desmond J Higham, Manikhandan A V Mudaliar, Xuerong Mao, Daniel J Crowther.
Abstract
Biomarker identification, using network methods, depends on finding regular co-expression patterns; the overall connectivity is of greater importance than any single relationship. A second requirement is a simple algorithm for ranking patients on how relevant a gene-set is. For both of these requirements discretized data helps to first identify gene cliques, and then to stratify patients. We explore a biologically intuitive discretization technique which codes genes as up- or down-regulated, with values close to the mean set as unchanged; this allows a richer description of relationships between genes than can be achieved by positive and negative correlation. We find a close agreement between our results and the template gene-interactions used to build synthetic microarray-like data by SynTReN, which synthesizes "microarray" data using known relationships which are successfully identified by our method. We are able to split positive co-regulation into up-together and down-together and negative co-regulation is considered as directed up-down relationships. In some cases these exist in only one direction, with real data, but not with the synthetic data. We illustrate our approach using two studies on white blood cells and derived immortalized cell lines and compare the approach with standard correlation-based computations. No attempt is made to distinguish possible causal links as the search for biomarkers would be crippled by losing highly significant co-expression relationships. This contrasts with approaches like ARACNE and IRIS. The method is illustrated with an analysis of gene-expression for energy metabolism pathways.
For each discovered relationship we are able to identify the samples on which this is based in the discretized sample-gene matrix, along with a simplified view of the patterns of gene expression; this helps to dissect the gene-sample relevant to a research topic, identifying sets of co-regulated and anti-regulated genes and the samples or patients in which this relationship occurs.
Year: 2011 PMID: 21533165 PMCID: PMC3078920 DOI: 10.1371/journal.pone.0018634
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Estimation of consistent identification of the E coli transcriptional classification.
| 100 samples | 80 (82%), 89 (92%), 0 | 454 | 2329 (71%) | 575 (69%), 620 (74%), 0 | 71575038 | 2320316 |
| 200 samples | 88 (90%), 92 (95%), 0 | 556 | 2233 (89%) | 638 (76%), 653 (78%), 10 (1%) | 84183740 | 2523345 |
| 300 samples | 93 (96%), 92 (95%), 0 | 556 | 3336 (87%) | 652 (78%), 667 (80%), 12 (1%) | 87186443 | 2925361 |
| 400 samples | 93 (96%), 93 (96%), 0 | 556 | 3336 (87%) | 662 (79%), 671 (80%), 15 (2%) | 88387344 | 2826368 |
| 2×200 | 91 (95%), 87 (90%), 0 | 556 | 3332 (78%) | 630 (75%), 640 (76%), 6 (1%) | 85780141 | 2821337 |
We assess the correctness of our identified gene-pairs with the E coli activation (ac) and repression (re) relationships used by SynTReN to build the networks. This is equivalent to a check on specificity. We additionally wished to identify gene-pairs which were highly likely to occur, based on the definitions, but including transitive relations, that is, all the genes that are connected by a network path-length of 2. This 2-path network is not a full prediction of all observed relations in the data-file as it does not include the and pairs. We calculated the sum of the E coli definition adjacency matrices for ac (+1) and re (−1) for path-length 1, 2 and 3 and again compared this network with our identified pairs. The results with correlation analysis are almost the same as those found by discretization.
Figure 1. Predicted transitive relations in a SynTReN model network.
The definitions used by SynTReN to model synthetic data, ac (positive-regulation) and re (repression), are illustrated with the effector on the left. The targets with transitive relations, either positive or negative, are shown connected with a dotted edge. Five simple motifs are illustrated, but scope for more complexity exists when these relationships overlap. Positive co-expression is predicted by either ac or re definitions, but the two targets have to be connected to the same effector by the same relationship for this to be true (a & b). Negative co-expression needs some form of asymmetry, as shown in c–e. The success of our predictions depends on how the simulation is set up; we used 100 genes with known relations and 100 background genes in the comparisons shown, but decreasing the number of background genes increases the complexity of the expected transitive relationships.
Figure 2. Comparison of pp identified gene-pairs with transitive path-length 2 pairs from E coli transcriptional definitions.
An adjacency matrix was constructed, where the E coli ac definitions were set to 1 and the re definitions set to −1; du relationships were set to 0 and are therefore ignored in this analysis. This adjacency matrix, A, was squared (A.A), which reveals paths of length 2; in this qualitative analysis no allowance is made for loss of relationships due to positive and negative values summing to zero. This E coli definition-derived matrix is the upper triangle in the diagram; the gray squares are positive and the black are negative. The lower triangle is the matrix calculated from the SynTReN simulated data for 100 samples.
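The squaring step can be illustrated on a toy network. This is a minimal sketch rather than the authors' code: it assumes the signed adjacency matrix is symmetric (edges treated as undirected), so that two targets of one effector are linked through it, and the four-gene network and function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def two_path_pairs(adj):
    """Return {(i, j): sign} for i < j where (A.A)[i][j] is non-zero."""
    squared = matmul(adj, adj)
    pairs = {}
    for i in range(len(adj)):
        for j in range(i + 1, len(adj)):
            value = squared[i][j]
            if value != 0:
                pairs[(i, j)] = 1 if value > 0 else -1
    return pairs


# Hypothetical toy network: gene 0 is the effector.
# ac edges are coded +1, re edges -1, everything else 0.
# Assumption: the matrix is symmetric (edges treated as undirected).
adj = [
    [0, 1, 1, -1],   # gene 0 activates genes 1 and 2, represses gene 3
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [-1, 0, 0, 0],
]

pairs = two_path_pairs(adj)
print(pairs)
```

In this toy case genes 1 and 2 share the same relationship to the effector and are predicted to be positively co-expressed, while gene 3 pairs negatively with both, which corresponds to the symmetric and asymmetric motifs described for Figure 1.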
The use of independent studies to increase specificity in network determination.
|
|
|
| Low-variance pairs | |||
| Subset 1 | 154713432083 | 92900 | 556 | 3333 | 14128 |
| Subset 2 | 162113912099 | 92890 | 556 | 3334 | 9105 |
| Subset 1 AND 2 | 125311141364 | 91870 | 556 | 3332 | 450 |
SynTReN was used to build a synthetic dataset of 400 samples, which were randomly subdivided into two subsets of 200 each. The discretization-based co-expression networks were calculated for each subset and the shared edges used to give a third network. The 10% of the genes with the lowest variance were selected and the possible gene-pairs for those determined; none of these genes were defined by ac, du or re relationships. The low-variance gene-pairs detected are preferentially discarded by this procedure, suggesting that it is one reasonable technique for discarding false relationships.
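The "Subset 1 AND 2" network is a set intersection of edges. The sketch below assumes each edge is stored as a (gene, gene, relation) tuple; the gene names, the edge lists and the helper names are hypothetical and only illustrate the bookkeeping.

```python
def shared_edges(network_a, network_b):
    """Edges (gene, gene, relation) that are present in both networks."""
    return network_a & network_b


def count_low_variance(network, low_variance_genes):
    """Count edges whose two genes both belong to the low-variance set."""
    return sum(1 for g1, g2, _ in network
               if g1 in low_variance_genes and g2 in low_variance_genes)


# Hypothetical edge sets from two independent sample subsets.
subset1 = {("geneA", "geneB", "pp"), ("geneC", "geneD", "pm"),
           ("bgr_1", "bgr_2", "pp"), ("bgr_3", "bgr_4", "mm")}
subset2 = {("geneA", "geneB", "pp"), ("geneC", "geneD", "pm"),
           ("bgr_1", "bgr_5", "pp")}
low_variance = {"bgr_1", "bgr_2", "bgr_3", "bgr_4", "bgr_5"}

consensus = shared_edges(subset1, subset2)
print(len(consensus), count_low_variance(consensus, low_variance))
```

In this toy data the edges that recur in both subsets are kept, while the low-variance pairs, which differ between subsets, drop out of the shared network.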
Effectiveness of correlation network as a filter.
| Bio-noise |
| Network |
|
|
|
|
|
|
| 4 | 36 | 5707 |
|
| 94 | 7 | 3095 | ||
|
| 94 | 8 | 1758 | ||
|
|
| 1 | 36 | 1240 | |
|
| 90 | 7 | 632 | ||
|
| 91 | 8 | 410 | ||
|
|
| 1 | 36 | 20 | |
|
| 90 | 4 | 16 | ||
|
| 91 | 4 | 11 | ||
|
|
|
| 12 | 37 | 15101 |
|
| 91 | 13 | 7213 | ||
|
| 91 | 13 | 7148 | ||
|
|
| 1 | 35 | 3086 | |
|
| 89 | 3 | 1394 | ||
|
| 90 | 3 | 1466 | ||
|
|
| 0 | 27 | 141 | |
|
| 84 | 3 | 87 | ||
|
| 84 | 3 | 86 |
The discretization analysis was performed at two levels of “bio-noise”, 0.1 and 0.5. Positive correlation was used as a filter to remove edges not supported by correlation from the positive co-regulation networks. Negative correlation at the three levels was required for edges to be retained. With 0.1 noise, correlation removes almost no TRUE edges while removing most of the FALSE (bgr_) pairs.
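A correlation filter of this kind might be sketched as follows. The assumptions are mine: pp and mm edges are kept only when the Pearson correlation exceeds a positive cut-off, pm edges only when it falls below the negative cut-off, and the cut-off of 0.5 and all expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def filter_edges(edges, expression, threshold):
    """Keep pp/mm edges with r > threshold and pm edges with r < -threshold."""
    kept = []
    for g1, g2, relation in edges:
        r = pearson(expression[g1], expression[g2])
        if relation in ("pp", "mm") and r > threshold:
            kept.append((g1, g2, relation))
        elif relation == "pm" and r < -threshold:
            kept.append((g1, g2, relation))
    return kept


# Hypothetical expression values over four samples.
expression = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [4, 3, 2, 1],
    "bgr_1": [2, 1, 2, 1],
}
edges = [("g1", "g2", "pp"), ("g1", "g3", "pm"), ("g1", "bgr_1", "pp")]

kept = filter_edges(edges, expression, threshold=0.5)
print(kept)
```

Here the background pair is removed because its correlation is weakly negative, while the two consistent edges are retained.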
Assessment of predicted pm relationships from European versus Chinese and Japanese data.
|
|
|
| +ve corr | −ve corr | |
| Cheung | 5 053 (2%) | 4 911 (2%) | 155 326 (60%) | 5 880 (2%) | 101 862 (39%) |
| SAFHS | 36 342 (14%) | 40 268 (16%) | 45 439 (18%) | 40 248 (14%) | 42 054 (16%) |
| Decode (all) | 33 921 (13%) | 34 831 (14%) | 47 699 (18%) | 46 272 (18%) | 48 574 (19%) |
| Decode (male) | 3 601 (1%) | 3 595 (1%) | 5 397 (2%) | 48 950 (19%) | 51 212 (20%) |
| Decode (female) | 5 980 (2%) | 5 952 (2%) | 9 010 (3%) | 38 034 (15%) | 39 730 (15%) |
Genes with significantly different expression between Asian and European subjects were identified by Spielman et al [20] and we divided these into two groups, European-up (Eu) and European-down (Ed), using the average expression for Europeans minus the average expression for Asians (Chinese and Japanese). These two probe-lists were used to make a pair-list of all possible combinations of Eu : Ed, filtered to contain only the probes which appear in our final discretized data (Z = 0.4). For comparisons with the non-Affymetrix data (SAFHS and Decode) this Affymetrix probe pair-list was converted into a gene symbol pair-list. The comparisons show the number of common unique pairs between the networks and the Eu : Ed pair-list.
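The construction of the Eu : Ed pair-list can be sketched as below. Probe names, mean expression values and the example pm network are hypothetical; the sketch only assumes that a probe counts as Eu when the European mean minus the Asian mean is positive and as Ed when it is negative.

```python
from itertools import product


def split_by_direction(mean_european, mean_asian):
    """Split probes into European-up (Eu) and European-down (Ed) lists."""
    eu, ed = [], []
    for probe in mean_european:
        diff = mean_european[probe] - mean_asian[probe]
        if diff > 0:
            eu.append(probe)
        elif diff < 0:
            ed.append(probe)
    return eu, ed


def eu_ed_pairs(eu, ed, probes_present):
    """All Eu : Ed combinations restricted to probes in the discretized data."""
    eu_kept = [p for p in eu if p in probes_present]
    ed_kept = [p for p in ed if p in probes_present]
    return set(product(eu_kept, ed_kept))


# Hypothetical mean expression per probe in each population.
mean_european = {"probe_1": 8.2, "probe_2": 6.1, "probe_3": 7.5, "probe_4": 5.0}
mean_asian = {"probe_1": 7.0, "probe_2": 6.9, "probe_3": 7.1, "probe_4": 5.8}
probes_present = {"probe_1", "probe_2", "probe_4"}

eu, ed = split_by_direction(mean_european, mean_asian)
pair_list = eu_ed_pairs(eu, ed, probes_present)

# Hypothetical pm network; the overlap is what the table counts.
pm_network = {("probe_1", "probe_2"), ("probe_3", "probe_4")}
common = pair_list & pm_network
print(len(common))
```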
Comparison of discretized networks from 2 subsets of SAFHS subjects.
| Comparison of 2 randomly selected independent subsets of SAFHS (620 and 619 subjects) (edges ×10³) | |||||
| mmB | pmA | pmB | ppA | ppB | |
|
|
| 0.1 | 1.4 |
|
|
|
| 1.2 | 0.01 |
|
| |
|
|
| 0.045 | 1.0 | ||
|
| 0.8 | 0.003 | |||
|
|
| ||||
|
| |||||
“Duplicate” information is discarded in these comparisons; reasons for duplication include multiple probesets for single genes and, in the networks, relationships going in both directions. Networks were constructed by the discretization (Z = 0.4) or correlation methods from two randomly selected sample subsets of the SAFHS dataset. The number of edges in each of the networks is given in brackets (×10³).
Discretized networks carry consistent information.
| Effect of randomization on specific information in networks (Cheung and Spielman, Z = 0.4) (edges ×10³) | |||||
| pm | pp | Randomized mm (80) | Randomized pm (79) | Randomized pp (159) | |
|
| 18 | 1466 | 15 | 30 | 15 |
|
| 18 | 21 | 40 | 21 | |
|
| 15 | 30 | 15 | ||
Networks were constructed from discretized (Z = 0.4) data for all the Cheung and Spielman subjects, with the total number of edges shown in brackets. The left-hand 2 columns show the number of shared edges for un-shuffled discretized gene-sample data, while the right-hand 3 columns give the result of the comparison between the un-shuffled and shuffled gene-sample networks. Randomization was carried out for each row of the gene-sample discretized table using the R function “sample”.
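Row-wise randomization of this kind permutes the values within each row independently, so each row keeps its counts of up, down and unchanged calls while the alignment between rows is destroyed. A minimal Python analogue of applying R's `sample` to each row is sketched here; the discretized table and the seed are hypothetical.

```python
import random


def shuffle_rows(matrix, seed=0):
    """Return a copy of the table with each row independently permuted."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [rng.sample(row, len(row)) for row in matrix]


# Hypothetical discretized table: +1 up, -1 down, 0 unchanged.
table = [
    [1, 1, 0, -1, -1, 0],
    [1, 1, 0, -1, -1, 0],
    [-1, -1, 0, 1, 1, 0],
]

shuffled = shuffle_rows(table)
for original, permuted in zip(table, shuffled):
    print(original, permuted)
```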
Discretized networks carry consistent information.
| Comparison of networks from Cheung and Spielman (C) and SAFHS (S) (×10³) | |||||
|
|
|
|
|
| |
|
| 18 | 1466 |
|
|
|
|
| 18 |
|
|
| |
|
|
|
|
| ||
|
| 872 | 16697 | |||
|
| 848 | ||||
|
| |||||
The networks were derived from discretized data (Z = 0.4) for both the SAFHS (S) and the Cheung and Spielman (C). For comparison purposes the platform specific identifiers were converted to gene-names and any resulting probe-set redundancy eliminated. Only the gene-names represented on both the Illumina and Affymetrix chips were used in this comparison. The numbers for comparisons between the different datasets are shown in bold.
Discretized and correlation networks share many relationships.
| Comparison of discretization and correlation networks (edges ×10³) | ||||||
| Correlation > 0.1032 (12900) | Correlation < −0.1032 (20000) | |||||
| Discretization networks | Only in discretized | Both | Only in correlation | Only in discretized | Both | Only in correlation |
|
| 2600 | 7600 | 5300 | 1000 | 300 | 19700 |
|
| 2500 | 8800 | 4100 | 1000 | 300 | 19700 |
|
| 12500 | 350 | 12500 | 4500 | 8300 | 4500 |
Tabular Venn-diagrams show the shared information between networks constructed using discretization and correlation methods; both methods were applied to the two subsets of the SAFHS. The networks from each subset, for each method, were compared and only the gene-pairs found in both subsets were used for the comparison. The comparison between discretized and correlation networks is described in Methods. All duplicate gene-pairs, resulting from multiple probes, were eliminated – leaving only one gene-pair for each relationship; here the direction of the relations is ignored. The size of each resulting network is included, in brackets.
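A tabular Venn comparison reduces to three set sizes once each gene-pair has a single, direction-free representation. The sketch below uses hypothetical gene-pairs and function names; sorting the two gene names is one simple way to ignore direction and merge duplicates.

```python
def canonical(pair):
    """Direction-free form of a gene-pair: the two names in sorted order."""
    return tuple(sorted(pair))


def tabular_venn(first, second):
    """Counts of pairs only in the first network, in both, and only in the second."""
    a = {canonical(p) for p in first}
    b = {canonical(p) for p in second}
    return {"only_first": len(a - b), "both": len(a & b), "only_second": len(b - a)}


# Hypothetical networks; ("FH", "SDHB") and ("SDHB", "FH") count as one pair.
discretized = [("SDHB", "FH"), ("FH", "SDHB"), ("SDHC", "MDH1"), ("ACO2", "OGDH")]
correlation = [("FH", "SDHB"), ("MDH1", "SDHC"), ("IDH3A", "ACO1")]

venn = tabular_venn(discretized, correlation)
print(venn)
```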
Figure 3. Co-expression networks for fatty acid, tri-carboxylic acid cycle, glycolysis and related genes in peripheral blood cells.
The patterns of co-regulation of TCA-cycle genes by correlation and discretization are summarised (a). The correlation cut-off was set at ±0.1032, which gives approximately equal probability of accepting a gene-pair (P = 0.005) as the discretization method (quantile = 0.995). The top row shows positive co-regulation and the next row negative co-regulation. For illustrative purposes the graph is simplified by removing directionality from the edges. Although some of the details are different, both methods show strong co-regulation of SDH(B,C,D), FH and MDH1 and a weaker co-regulation of ACO(1,2), IDH3(A,B) and OGDH. With both methods this second group is more clearly delineated by its negative relationships to the first group. The networks (b, c) were produced using the discretization method and the genes were selected using genes for three areas of metabolism using KEGG pathways [27]. Analyses were carried out, in data from GSE7965, separately for male (b) and female (c) subjects. The network was analysed using the “eigen” function from the R-package, the first eigen-vector was used to reorder the nodes. The rank of the genes from the first eigen-vector for each sex was compared (c) and over 80% of the genes lie within 10 positions of their order in the opposite sex. The genes showing the largest difference between male and female are ACADL (beta-oxidation of fatty acids), CPT2 (transport of long chain fatty acids into mitochondria), PPARA (transcription control of fatty acid and carbohydrate metabolism), CPT1A (transport of long chain fatty acids into mitochondria) and ACACA (fatty acid synthesis). (d) Comparison of gene-pairs between male and female networks, over 80% of the pairs are common. The maximum number of edges in this network is 5151 gene-pairs. The order of genes in (b) is shown in (e); the prominent cluster near the origin are genes 1:40 and the more diffuse cluster from about 55 to the end. 
The TCA genes in cluster 1 (OGDH, IDH3A, ACO2) and cluster 2 (SDHA, SDHC, FH, SDHD, SDHB) show that many of the relationships, found for the TCA cycle genes for both sexes, fit into a wider pattern of gene for the separate sexes.
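Reordering nodes by the first eigenvector can be sketched without R. The code below is an illustrative stand-in for the `eigen` call, not the authors' procedure: it finds the leading eigenvector of a symmetric adjacency matrix by power iteration (with an identity shift, added here as an assumption to keep the iteration stable) and sorts nodes by descending component. The five-node graph and the sort direction are hypothetical choices.

```python
def leading_eigenvector(adj, iterations=500):
    """Leading eigenvector of a symmetric non-negative matrix by power iteration."""
    n = len(adj)
    vec = [1.0] * n
    for _ in range(iterations):
        # Multiply by (A + I); the shift leaves the eigenvectors unchanged.
        nxt = [vec[i] + sum(adj[i][k] * vec[k] for k in range(n))
               for i in range(n)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    return vec


def eigen_order(adj):
    """Node indices sorted by descending leading-eigenvector component."""
    vec = leading_eigenvector(adj)
    return sorted(range(len(adj)), key=lambda i: -vec[i])


# Hypothetical graph: a triangle (0, 1, 2) with a tail 2-3-4.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]

order = eigen_order(adj)
print(order)
```

Densely connected nodes receive the largest components and are placed together at the start of the ordering, which is why clusters show up as blocks when the network matrix is redrawn in this order.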
Effect of changing Z-score on Analysed Network Estimation.
| Z = 0.4 | 80890 | 454 | 2329 |
| Z = 0.8 | 71710 | 335 | 1226 |
| Z = 1.2 | 35410 | 124 | 0015 |
| Z = 1.4 | 14160 | 001 | 009 |
| Z = 1.6 | 13140 | 001 | 0010 |
The SynTReN simulated data for 100 samples was analysed using different Z-scores to select up- and down-regulated genes. Although the specificity increased at higher Z-scores the sensitivity was lower. Our strategy in looking for bio-markers is to accept relationships with lower significance at this stage but subsequently require that any useful pattern or clique is highly connected. In real situations, it is also important to require that the cliques are found in independent datasets. Our decision not to look at lower Z-scores than 0.4 is based on pragmatic biomarker requirements, where changes in expression have to be robust and indicate changes likely to be found by other methods.
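The effect of the Z-score cut-off on the number of up and down calls can be seen in a small sketch. It assumes the Z-score is computed per gene across samples using the sample standard deviation, which is an assumption on my part; the expression values and the function name are hypothetical.

```python
from statistics import mean, stdev


def discretize(values, z=0.4):
    """Code each value as +1 (up), -1 (down) or 0 (unchanged) by Z-score."""
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        return [0] * len(values)
    coded = []
    for value in values:
        score = (value - centre) / spread
        if score > z:
            coded.append(1)
        elif score < -z:
            coded.append(-1)
        else:
            coded.append(0)
    return coded


# Hypothetical expression of one gene across five samples.
expression = [1.0, 2.0, 3.0, 4.0, 5.0]

for cutoff in (0.4, 0.8, 1.2, 1.4):
    print(cutoff, discretize(expression, z=cutoff))
```

Raising the cut-off leaves fewer samples coded as up or down, which is consistent with the lower sensitivity reported at higher Z-scores.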