Matěj Holec, Ondřej Kuželka, Filip Železný.
Abstract
BACKGROUND: Set-level classification of gene expression data has received significant attention recently. In this setting, high-dimensional vectors of features corresponding to genes are converted into lower-dimensional vectors of features corresponding to biologically interpretable gene sets. The dimensionality reduction brings the promise of a decreased risk of overfitting, potentially resulting in improved accuracy of the learned classifiers. However, recent empirical research has not confirmed this expectation. Here we hypothesize that the reported unfavorable classification results in the set-level framework were due to the adoption of unsuitable gene sets, typically defined on the basis of the Gene Ontology and the KEGG database of metabolic networks. We explore an alternative approach to defining gene sets, based on regulatory interactions, which we expect to collect genes with more correlated expression. We hypothesize that such more correlated gene sets will make it possible to learn more accurate classifiers.
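The conversion from gene-level to set-level features described above can be illustrated with a minimal sketch. It assumes that each set-level feature is the plain average of the member genes' expression values. The abstract does not state which aggregation the authors used, and the function name `to_set_level`, the set names, and the toy values are hypothetical.

```python
from statistics import mean


def to_set_level(sample, gene_sets):
    """Map one sample's gene-level expression to set-level features.

    sample:    dict gene -> expression value
    gene_sets: dict set name -> list of member genes
    Assumption: a set-level feature is the mean over measured member genes.
    A set with no measured member gets None.
    """
    features = {}
    for name, genes in gene_sets.items():
        values = [sample[g] for g in genes if g in sample]
        features[name] = mean(values) if values else None
    return features


# Hypothetical toy values, not data from the study.
sample = {"sucA": 2.0, "sucD": 4.0, "sucB": 1.0, "sucC": 3.0, "cyoB": 5.0}
gene_sets = {"set_1": ["sucA", "sucD"], "set_2": ["sucB", "sucC", "cyoB"]}
set_features = to_set_level(sample, gene_sets)
```

Five gene-level features are reduced to two set-level features here, which is the dimensionality reduction the abstract refers to.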
Year: 2015 PMID: 26511329 PMCID: PMC4625461 DOI: 10.1186/s12859-015-0786-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Example of operon-based gene sets. a The operon bcsABZC contains the genes bcsC, bcsZ, bcsB, and bcsA, and the transcription units bcsABZC and bcsBZ. b COPR sets are consecutive sets of genes in operons that are always co-transcribed
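One possible reading of the COPR definition in the caption can be sketched as follows. The sketch assumes that "always co-transcribed" means occurring in exactly the same transcription units, and that the gene order (bcsA, bcsB, bcsZ, bcsC) follows the operon name. Both are assumptions, and `copr_sets` is a hypothetical helper, not the authors' code.

```python
from itertools import groupby


def copr_sets(operon_genes, transcription_units):
    """Split an ordered operon into maximal runs of consecutive genes
    that occur in exactly the same transcription units."""
    def signature(gene):
        # Indices of the transcription units that contain this gene.
        return frozenset(
            i for i, unit in enumerate(transcription_units) if gene in unit
        )
    return [list(run) for _, run in groupby(operon_genes, key=signature)]


# Gene order assumed from the operon name bcsABZC.
operon = ["bcsA", "bcsB", "bcsZ", "bcsC"]
units = [{"bcsA", "bcsB", "bcsZ", "bcsC"}, {"bcsB", "bcsZ"}]
sets = copr_sets(operon, units)
```

Under these assumptions, bcsB and bcsZ form one set because both belong to the two transcription units, while bcsA and bcsC each form a singleton set.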
Fig. 2 Example of transcription-factor-based gene sets. a The transcription factor Fur regulates 130 genes in total, including positively regulated genes (e.g., sucA, sucD) and negatively regulated genes (e.g., cyoB, sucB, sucC, entD). All these regulated genes constitute the gene set. b A complex regulon defined by the genes sucA, sucD, sucB and sucC can be divided into two strict regulons defined by two pairs of genes, (sucA, sucD) and (sucB, sucC). All three regulons are regulated by the same set of transcription factors and no others: CRP, ArcA, IHF, Fur and FNR
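The split of a regulon into strict regulons can be sketched by grouping genes on their regulators. The sketch assumes that a regulon collects genes sharing the same set of transcription factors, and that a strict regulon additionally requires the same sign of regulation for every factor. The caption suggests this reading but does not state it. Only the Fur signs come from the caption; the signs for the other factors are placeholders, and `group_by_regulators` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_regulators(regulation, strict=False):
    """Group genes by their regulators.

    regulation: dict gene -> dict transcription factor -> sign ('+' or '-')
    strict=False: genes sharing the same set of factors form one group.
    strict=True:  genes must also share the sign of every factor.
    """
    groups = defaultdict(list)
    for gene, factors in regulation.items():
        key = frozenset(factors.items()) if strict else frozenset(factors)
        groups[key].append(gene)
    return sorted(sorted(genes) for genes in groups.values())


# Placeholder signs for illustration only; not taken from the source.
shared = {"CRP": "+", "ArcA": "-", "IHF": "-", "FNR": "-"}
regulation = {
    "sucA": {**shared, "Fur": "+"},  # Fur signs follow the caption
    "sucD": {**shared, "Fur": "+"},
    "sucB": {**shared, "Fur": "-"},
    "sucC": {**shared, "Fur": "-"},
}
regulons = group_by_regulators(regulation)
strict_regulons = group_by_regulators(regulation, strict=True)
```

With these inputs the four genes form a single regulon and split into the two pairs named in the caption once signs are taken into account.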
A summary of gene-set types and their properties
| Gene-set type | # sets | # genes | Median # of genes in set | Mean # of genes in set | Max. # of genes in set |
|---|---|---|---|---|---|
| Operon Based | | | | | |
| Operon (OPR) | 2649 | 4524 | 1 | 1.708 | 16 |
| Transcriptional unit (TU) | 3213 | 4524 | 1 | 1.685 | 16 |
| Continuous subsequence (COPR) | 3164 | 4524 | 1 | 1.430 | 12 |
| Transcription Factor Based | | | | | |
| Transcription factor (TF) | 186 | 1685 | 7 | 24.720 | 534 |
| Regulon (REG) | 459 | 1685 | 2 | 3.671 | 61 |
| Strict regulon (SREG) | 541 | 1685 | 2 | 3.115 | 51 |
| Conventional | | | | | |
| GO+KEGG | 260 | 2734 | 12 | 31.830 | 847 |
For each type, the smallest sets contain exactly one gene. The “# genes” column contains the number of genes included in at least one set of the given type. Since the sets are not disjoint, # genes / # sets ≠ mean. The table does not list the seven randomized gene-set collections, which have exactly the same statistics as the respective listed types except that their member genes are permuted
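The remark that # genes / # sets differs from the mean set size when sets overlap can be checked on a small example. The following is a minimal sketch with hypothetical toy sets. The helper `summarize` is an illustration and does not reproduce how the table was generated.

```python
from statistics import mean, median


def summarize(gene_sets):
    """Compute the table's columns for a collection of gene sets."""
    sizes = [len(s) for s in gene_sets]
    covered = set().union(*gene_sets)  # genes in at least one set
    return {
        "n_sets": len(gene_sets),
        "n_genes": len(covered),
        "median": median(sizes),
        "mean": mean(sizes),
        "max": max(sizes),
    }


# Hypothetical overlapping sets: g2 and g3 each belong to two sets.
toy = [{"g1", "g2", "g3"}, {"g2", "g3"}, {"g4"}]
stats = summarize(toy)
ratio = stats["n_genes"] / stats["n_sets"]
```

Four distinct genes spread over three sets give a ratio of 4/3, while the mean set size is 2, because shared genes are counted once in "# genes" but once per set in the mean.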
List of gene expression series collected from the Gene Expression Omnibus (10 largest series for E. coli K12)
| Series id | Platform id | # phenotypes |
|---|---|---|
| GSE6836 | GPL199 | 62 |
| GSE33147 | GPL199 | 30 |
| GSE10160-1 | GPL199 | 9 |
| GSE10160-2 | GPL3154 | 4 |
| GSE35371 | GPL3154 | 20 |
| GSE21869 | GPL199 | 5 |
| GSE17505 | GPL199 | 10 |
| GSE34023 (a) | GPL3154 | 7 |
| GSE7398 (a) | GPL199 | 8 |
| GSE4778 | GPL199 | 4 |
The series marked with “a” were omitted due to their involvement in the development of RegulonDB
Results obtained with the newly proposed gene sets on the selection data sets
| Gene-set type | Mean accuracy [%] | Sum of ranks |
|---|---|---|
| Operon (OPR) | 81.50 | 196.00 |
| Transcriptional unit (TU) | 82.17 | 198.50 |
| Continuous subsequence (COPR) | | |
| Transcription factor (TF) | 78.69 | 179.00 |
| Regulon (REG) | | |
| Strict regulon (SREG) | 82.50 | 209.50 |
Columns contain the mean accuracies and sum-of-ranks indicators over the datasets; a higher rank indicates better performance. Here, the best-ranked gene-set types from the two categories (operon-based and transcription-factor-based) are COPR and REG, respectively
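A sum-of-ranks indicator of this kind can be sketched as follows. The sketch assumes that gene-set types are ranked within each dataset by accuracy, that the best type gets the largest rank, and that tied types share the mean of their ranks. The half-integer sums in the table are consistent with such tie handling, but the exact procedure is not given here. The function names and toy accuracies are hypothetical.

```python
def average_ranks(values):
    """Rank values so the highest gets the largest rank; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def sum_of_ranks(accuracy_by_dataset):
    """accuracy_by_dataset: list of dicts mapping gene-set type -> accuracy."""
    totals = {}
    for accuracies in accuracy_by_dataset:
        types = list(accuracies)
        ranks = average_ranks([accuracies[t] for t in types])
        for t, r in zip(types, ranks):
            totals[t] = totals.get(t, 0.0) + r
    return totals


# Hypothetical accuracies on two datasets, not results from the study.
datasets = [{"A": 80, "B": 90, "C": 90}, {"A": 70, "B": 60, "C": 65}]
totals = sum_of_ranks(datasets)
```

In the first toy dataset, B and C tie for the top two positions and each receives rank 2.5, which is how fractional sums arise.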
Summary of the main experimental findings
| Control | COPR sets (novel) | REG sets (novel) | KEGG+GO sets (conventional) |
|---|---|---|---|
| Randomized | | | |
| Gene-level | | | |
Both selected types of the newly proposed gene sets (i.e., COPR and REG) perform significantly better than their randomized and gene-level versions. In contrast, the state-of-the-art gene-set type (KEGG+GO) performs indistinguishably from its randomized version and significantly worse than its gene-level version. As detailed in the main text, the p-values correspond to the one-sided paired Wilcoxon test applied to the win/tie/loss counts determined by leave-one-out cross-validation of predictive accuracies
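For readers unfamiliar with the one-sided paired Wilcoxon test, the following is a minimal pure-Python sketch of its exact form for small samples. It assumes zero differences are dropped and tied absolute differences share the mean rank. How the win/tie/loss counts enter the test is described in the main text and is not reproduced here, so the paired toy scores and the function name are hypothetical.

```python
from itertools import product


def wilcoxon_one_sided(x, y):
    """Exact one-sided paired Wilcoxon signed-rank test of H1: x tends to exceed y.

    Returns (W_plus, p_value). All 2**n sign patterns are enumerated,
    so this is only practical for small samples.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    abs_sorted = sorted(abs(d) for d in diffs)
    # Mean rank for each distinct absolute difference.
    rank_of = {}
    for v in set(abs_sorted):
        first = abs_sorted.index(v) + 1
        count = abs_sorted.count(v)
        rank_of[v] = first + (count - 1) / 2
    ranks = [rank_of[abs(d)] for d in diffs]
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 each difference is positive or negative with probability 1/2.
    at_least = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r * s for r, s in zip(ranks, signs)) >= w_plus
    )
    return w_plus, at_least / 2 ** n


# Hypothetical paired scores on five datasets, not results from the study.
x = [5, 7, 6, 9, 8]
y = [4, 5, 3, 5, 3]
w_plus, p_value = wilcoxon_one_sided(x, y)
```

When every paired difference favors the first method, as in this toy input, only one of the 2^5 sign patterns reaches the observed statistic, which gives the smallest p-value attainable with five pairs.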
Fig. 3 Density plots of pair-wise gene expression correlations. Random: each two genes are randomly sampled from among all genes. Remaining plots: a gene-set is first sampled from a given category (GO+KEGG, REG, COPR), and the two genes are then sampled from that set
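The sampling scheme in the caption (first a gene set, then two genes from it) can be sketched as below. The sketch assumes Pearson correlation and skips sets with fewer than two genes. Neither detail is stated in the caption. The toy expression profiles and function names are hypothetical, and the density estimation step itself is not shown.

```python
import random
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)


def sample_within_set_correlations(expression, gene_sets, n_pairs, rng):
    """Draw a set, then two distinct genes from it, and record their correlation."""
    # Assumption: single-gene sets cannot yield a pair and are skipped.
    eligible = [s for s in gene_sets if len(s) >= 2]
    correlations = []
    for _ in range(n_pairs):
        g1, g2 = rng.sample(rng.choice(eligible), 2)
        correlations.append(pearson(expression[g1], expression[g2]))
    return correlations


# Hypothetical expression profiles over four samples.
expression = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 3, 2, 1]}
gene_sets = [["g1", "g2"], ["g3"]]
corr = sample_within_set_correlations(expression, gene_sets, 5, random.Random(0))
```

Sampling the set first means that every set is equally likely to contribute a pair regardless of its size, which differs from drawing pairs uniformly over all within-set gene pairs.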