Jiří Kléma, František Malinka, Filip Železný.
Abstract
BACKGROUND: One of the major challenges in the analysis of gene expression data is to identify local patterns composed of genes showing coherent expression across subsets of experimental conditions. Such patterns may provide an understanding of underlying biological processes related to these conditions. This understanding can further be improved by providing concise characterizations of the genes and situations delimiting the pattern.
Keywords: Biclustering; Enrichment analysis; Gene expression; Ontology; Symbolic machine learning
Year: 2017 PMID: 29513193 PMCID: PMC5657082 DOI: 10.1186/s12864-017-4132-5
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Drosophila ovary table statistic
| | Complete dataset (all) | Train | Test (keepLocations) | Test (keepGenes) | Test (bd) |
|---|---|---|---|---|---|
| # of rows/genes | 6,510 | 5,447 | 1,063 | 5,447 | 1,063 |
| # of columns/locations | 100 | 84 | 84 | 16 | 16 |
Fig. 1 Segmentation of an imaginal disc. An example of segmentation of an imaginal disc (left), together with its annotation by the Drosophila ontology terms (right). The disc is split into 20 segments distinguished by color; this split was found to best capture the gene expression patterns observed in the individual in situ hybridization images. The annotation stems from [40]
Imaginal disc dataset statistic
| | Complete dataset (all) | Train | Test (keepLocations) | Test (keepGenes) | Test (bd) |
|---|---|---|---|---|---|
| # of rows/genes | 1,207 | 1,010 | 197 | 1,010 | 197 |
| # of columns/locations | 72 | 60 | 60 | 12 | 12 |
The number of annotation terms available for our experimental datasets
| Dataset | GO | KEGG | DAO | DLO |
|---|---|---|---|---|
| Ovary | 8,407 | 1,605 | - | 100 |
| IDiscs | 5,083 | 423 | 147 | - |
Evaluation results of the proposed approaches to semantic biclustering
| Dataset | Method | AUROC | # of biclusters | # of terms per bicluster |
|---|---|---|---|---|
| Ovary | Bicluster Enrichment | 0.823 ±0.006 | 11.8 ±1.5 | 64.8 ±3.4 |
| Ovary | Rules (JRip) | 0.636 ±0.01 | 102.6 ±21.5 | 7.1 ±0.61 |
| Ovary | Tree (J48) | 0.659 ±0.01 | 109.9 ±5.2 | 25.4 ±2.0 |
| IDiscs | Bicluster Enrichment | 0.608 ±0.03 | 16.4 ±4.7 | 47.9 ±2.13 |
| IDiscs | Rules (JRip) | 0.565 ±0.01 | 25.9 ±6.2 | 7.89 ±0.53 |
| IDiscs | Tree (J48) | 0.627 ±0.05 | 20.6 ±11.09 | 11.01 ±4.71 |
Biological homogeneity of the found biclusters in terms of their enrichment
| Dataset | Method | % enriched |
|---|---|---|
| Ovary | Bicluster Enrichment | 0.952 ±0.063 |
| Ovary | Rules (JRip) | 0.981 ±0.017 |
| Ovary | Tree (J48) | 0.974 ±0.021 |
| IDiscs | Bicluster Enrichment | 0.851 ±0.102 |
| IDiscs | Rules (JRip) | 0.962 ±0.041 |
| IDiscs | Tree (J48) | 0.931 ±0.043 |
Fig. 2 Semantic biclustering ROC curves for the Drosophila ovary table (left) and the imaginal disc dataset (right)
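The area under a ROC curve such as those in Fig. 2 can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, assuming binary labels and real-valued scores; the function name `auroc` and the toy data are hypothetical and are not taken from the paper, which does not state how its AUROC values were computed.

```python
def auroc(labels, scores):
    """Pairwise AUROC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 3 positives, 3 negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
value = auroc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

The quadratic pairwise loop is chosen for readability; a rank-based formulation gives the same value and scales better to large matrices.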
Fig. 3 Train and test matrices
Generalization in terms of genes and locations. The table compares the AUROC for three different settings
| Dataset | Method | kG | kL | bd |
|---|---|---|---|---|
| Ovary | Bicluster Enrichment | 0.929 ±0.013 | 0.677 ±0.03 | 0.818 ±0.014 |
| Ovary | Rules (JRip) | 0.692 ±0.02 | 0.583 ±0.01 | 0.583 ±0.02 |
| Ovary | Tree (J48) | 0.725 ±0.002 | 0.604 ±0.01 | 0.604 ±0.02 |
| IDiscs | Bicluster Enrichment | 0.705 ±0.06 | 0.560 ±0.02 | 0.593 ±0.03 |
| IDiscs | Rules (JRip) | 0.588 ±0.01 | 0.546 ±0.01 | 0.537 ±0.02 |
| IDiscs | Tree (J48) | 0.630 ±0.06 | 0.627 ±0.05 | 0.602 ±0.04 |
kG (keepGenes) tests generalization across locations, kL (keepLocations) tests generalization across genes, and bd tests generalization in both dimensions
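The three test settings can be pictured as the blocks obtained when both the rows (genes) and the columns (locations) of the expression matrix are divided into a training part and a held-out part. The sketch below illustrates this under stated assumptions: the random shuffling, the function name `split_matrix`, and the list-of-lists representation are hypothetical, since the source only reports the resulting block sizes.

```python
import random


def split_matrix(matrix, n_train_rows, n_train_cols, seed=0):
    """Split a genes x locations matrix into a train block and three test blocks."""
    rng = random.Random(seed)
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    rng.shuffle(rows)  # assumption: genes are assigned to train/test at random
    rng.shuffle(cols)  # assumption: locations are assigned at random as well
    train_rows, test_rows = rows[:n_train_rows], rows[n_train_rows:]
    train_cols, test_cols = cols[:n_train_cols], cols[n_train_cols:]

    def block(row_idx, col_idx):
        return [[matrix[r][c] for c in col_idx] for r in row_idx]

    return {
        "train": block(train_rows, train_cols),
        "keepLocations": block(test_rows, train_cols),  # unseen genes, known locations
        "keepGenes": block(train_rows, test_cols),      # known genes, unseen locations
        "bd": block(test_rows, test_cols),              # unseen genes and locations
    }


# Placeholder matrix with the Drosophila ovary dimensions (6,510 genes x 100 locations).
demo = [[0] * 100 for _ in range(6510)]
blocks = split_matrix(demo, n_train_rows=5447, n_train_cols=84)
shapes = {name: (len(b), len(b[0])) for name, b in blocks.items()}
```

With these sizes, `shapes` reproduces the row and column counts listed for the ovary dataset: 5,447 x 84 for training, 1,063 x 84 for keepLocations, 5,447 x 16 for keepGenes, and 1,063 x 16 for bd.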
Runtimes (in seconds) of the rule and tree learning methods on the DOT and IDiscs datasets. The time to transform the original matrix into an ARFF file (Build ARFF) and the time to build the classification models were measured separately
| Split | DOT: Build ARFF | DOT: Build model (J48) | DOT: Build model (JRip) | DOT: Test model (J48) | DOT: Test model (JRip) | IDiscs: Build ARFF | IDiscs: Build model (J48) | IDiscs: Build model (JRip) | IDiscs: Test model (J48) | IDiscs: Test model (JRip) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1,033 | 1,237 | 26,810 | 17.00 | 23.44 | 274 | 59.59 | 510.84 | 3.08 | 3.11 |
| 2 | 1,091 | 1,503 | 21,384 | 19.45 | 18.67 | 272 | 38.03 | 557.92 | 2.93 | 3.19 |
| 3 | 1,042 | 1,076 | 19,519 | 19.09 | 18.19 | 287 | 71.62 | 363.00 | 3.16 | 3.16 |
| 4 | 1,096 | 1,300 | 20,054 | 17.59 | 19.07 | 270 | 64.65 | 438.87 | 3.16 | 3.25 |
| 5 | 1,127 | 2,010 | 20,605 | 18.61 | 21.22 | 278 | 39.47 | 941.30 | 3.20 | 3.64 |
| 6 | 1,121 | 1,999 | 24,568 | 19.38 | 18.69 | 260 | 39.77 | 550.50 | 3.11 | 3.05 |
| 7 | 1,097 | 1,656 | 25,279 | 18.90 | 18.60 | 281 | 47.61 | 288.14 | 2.98 | 3.00 |
| 8 | 1,058 | 1,087 | 22,459 | 26.47 | 18.48 | 269 | 44.00 | 641.16 | 3.14 | 3.26 |
| 9 | 1,023 | 1,236 | 14,062 | 17.81 | 18.24 | 288 | 54.83 | 201.10 | 3.25 | 2.91 |
| 10 | 1,268 | 1,583 | 27,299 | 18.81 | 21.07 | 276 | 42.83 | 506.14 | 2.96 | 3.06 |
| Mean | 1,096 | 1,469 | 22,204 | 19.31 | 19.57 | 629.4 | 50.24 | 499.9 | 3.10 | 3.16 |
| Std. dev. | ±70.6 | ±343 | ±3,995 | ±2.64 | ±1.75 | ±32.3 | ±11.78 | ±204.8 | ±0.11 | ±0.2 |
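The two summary rows of the runtime table can be recomputed from the ten per-split values. As a small sketch, the snippet below does this for the DOT "Build ARFF" column using Python's standard `statistics` module; that the ± row is a sample standard deviation (n − 1 denominator) is an assumption, although it reproduces the reported figure for this column.

```python
from statistics import mean, stdev

# DOT "Build ARFF" runtimes in seconds for splits 1-10, copied from the table.
dot_build_arff = [1033, 1091, 1042, 1096, 1127, 1121, 1097, 1058, 1023, 1268]

avg = mean(dot_build_arff)
spread = stdev(dot_build_arff)  # sample standard deviation (n - 1 denominator)

print(f"{avg:.0f} ±{spread:.1f}")  # prints "1096 ±70.6"
```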
Runtimes (in seconds) of bi-directional enrichment on DOT and IDiscs datasets
| Split | DOT: Prepare data | DOT: Build model | DOT: Test model | IDiscs: Prepare data | IDiscs: Build model | IDiscs: Test model |
|---|---|---|---|---|---|---|
| 1 | 21.80 | 74.75 | 278.79 | 14.75 | 133.14 | 70.42 |
| 2 | 20.44 | 122.27 | 233.85 | 13.96 | 112.36 | 53.41 |
| 3 | 14.76 | 100.80 | 259.17 | 10.11 | 101.49 | 49.12 |
| 4 | 16.05 | 87.42 | 223.64 | 9.36 | 107.10 | 47.32 |
| 5 | 14.54 | 120.49 | 266.52 | 9.28 | 72.78 | 60.17 |
| 6 | 16.98 | 110.70 | 228.80 | 13.87 | 124.81 | 45.06 |
| 7 | 14.79 | 100.55 | 231.63 | 9.51 | 153.33 | 82.83 |
| 8 | 14.43 | 80.02 | 229.41 | 14.08 | 144.09 | 50.18 |
| 9 | 14.58 | 94.29 | 204.34 | 9.73 | 176.95 | 61.83 |
| 10 | 14.02 | 103.77 | 230.10 | 15.60 | 90.13 | 45.86 |
| Mean | 16.24 | 99.51 | 238.63 | 12.03 | 121.62 | 56.62 |
| Std. dev. | ±2.73 | ±15.88 | ±22.46 | ±2.61 | ±31.26 | ±12.30 |