Emanuel Weitschek, Silvia Di Lauro, Eleonora Cappelli, Paola Bertolazzi, Giovanni Felici.
Abstract
BACKGROUND: The rapid growth of Next Generation Sequencing data demands new knowledge extraction methods. In particular, RNA sequencing gene expression experiments stand out for case-control studies on cancer, which can be addressed with supervised machine learning techniques that extract human-interpretable models composed of genes and their relation to the investigated disease. State-of-the-art rule-based classifiers are designed to extract a single classification model, possibly composed of a few relevant genes. Conversely, we aim to create a large knowledge base composed of many rule-based models, and thus determine which genes could be involved in the analyzed tumor. Such a comprehensive, open-access knowledge base is needed to disseminate novel insights about cancer.
Keywords: Big data; Cancer; Classification; Knowledge extraction
Year: 2018 PMID: 30367574 PMCID: PMC6191971 DOI: 10.1186/s12859-018-2299-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The homepage of CamurWeb
Fig. 2 The classification section of CamurWeb
Fig. 3 The results page of CamurWeb
Fig. 4 The software architecture of CamurWeb
An example of RNA-seq data matrix
| Aliquot | ENSG00000130309.9 | ENSG00000101189.6 | ... | ENSG00000260597.1 | Class |
|---|---|---|---|---|---|
| TCGA-4G.. | 0 | 9.7872338 | ... | 0.141 | Tumoral |
| TCGA-W5.. | 0.0323 | 1.4725 | ... | 0.62107 | Normal |
| ... | ... | ... | ... | ... | ... |
| TCGA-ZH.. | 0.06223 | 8.7757 | ... | 0.4818 | Tumoral |
Rows are indexed by tissues and columns by genes (except the last column, which contains the class). Each element of the matrix is the FPKM gene expression value of the corresponding gene in the corresponding tissue.
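The matrix layout described above can be read with a few lines of Python. This is only a minimal sketch: the tab-separated format, the `RAW` excerpt, and the `load_matrix` helper are assumptions made for illustration, not the format or code used by CamurWeb. The truncated aliquot identifiers are kept as they appear in the example.

```python
import csv
import io

# Hypothetical tab-separated excerpt mirroring the example matrix; the real
# file layout is not specified in the source. Aliquot IDs stay truncated.
RAW = (
    "Aliquot\tENSG00000130309.9\tENSG00000101189.6\tENSG00000260597.1\tClass\n"
    "TCGA-4G..\t0\t9,7872338\t0,141\tTumoral\n"
    "TCGA-W5..\t0,0323\t1,4725\t0,62107\tNormal\n"
    "TCGA-ZH..\t0,06223\t8,7757\t0,4818\tTumoral\n"
)


def load_matrix(text):
    """Return (gene_ids, samples); each sample is (aliquot, fpkm_by_gene, label)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    gene_ids = header[1:-1]  # first column is the aliquot, last one is the class
    samples = []
    for row in reader:
        # Accept decimal commas as well as decimal points before calling float().
        values = [float(cell.replace(",", ".")) for cell in row[1:-1]]
        samples.append((row[0], dict(zip(gene_ids, values)), row[-1]))
    return gene_ids, samples


gene_ids, samples = load_matrix(RAW)
for aliquot, fpkm, label in samples:
    print(aliquot, label, fpkm["ENSG00000101189.6"])
```

Keeping the class label separate from the per-gene values makes it straightforward to hand the expression values to a classifier and the labels to its training target.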
The considered data of The Cancer Genome Atlas extracted from the Genomic Data Commons portal
| Cancer | # of tissues | # of tumoral | # of normal | % of tumoral | File size (MB) |
|---|---|---|---|---|---|
| ACC | 79 | 79 | 0 | 100 | 45.08 |
| BLCA | 433 | 414 | 19 | 95.61 | 250.69 |
| BRCA | 1222 | 1102 | 120 | 90.18 | 592.77 |
| CESC | 309 | 304 | 5 | 98.38 | 180.67 |
| CHOL | 45 | 36 | 9 | 80.00 | 26.49 |
| COAD | 521 | 478 | 43 | 91.75 | 293.15 |
| DLBC | 48 | 48 | 0 | 100 | 28.62 |
| ESCA | 173 | 161 | 12 | 93.06 | 117.00 |
| GBM | 174 | 156 | 18 | 89.66 | 107.08 |
| HNSC | 546 | 500 | 46 | 91.58 | 317.43 |
| KICH | 89 | 65 | 24 | 73.03 | 52.83 |
| KIRC | 611 | 538 | 73 | 88.05 | 372.75 |
| KIRP | 321 | 288 | 33 | 89.72 | 187.99 |
| LAML | 173 | 173 | 0 | 100 | 98.28 |
| LGG | 534 | 534 | 0 | 100 | 319.55 |
| LIHC | 424 | 371 | 53 | 87.50 | 233.13 |
| LUAD | 594 | 533 | 61 | 89.73 | 353.07 |
| LUSC | 551 | 502 | 49 | 91.11 | 333.09 |
| MESO | 86 | 86 | 0 | 100 | 50.96 |
| OV | 309 | 309 | 0 | 100 | 238.69 |
| PAAD | 182 | 177 | 5 | 97.25 | 108.34 |
| PCPG | 186 | 178 | 8 | 95.70 | 107.82 |
| READ | 177 | 166 | 11 | 93.79 | 100.34 |
| SARC | 265 | 259 | 6 | 97.74 | 152.34 |
| STAD | 407 | 375 | 32 | 92.14 | 268.86 |
| TGCT | 156 | 156 | 0 | 100 | 95.25 |
| THYM | 121 | 119 | 2 | 98.35 | 72.01 |
| UCEC | 587 | 551 | 36 | 93.87 | 336.61 |
| UCS | 56 | 56 | 0 | 100 | 34.28 |
| UVM | 80 | 80 | 0 | 100 | 43.96 |
For each cancer dataset, the number of tissues, the numbers of tumoral and normal ones, the percentage of tumoral ones, and the file size in MB are reported.
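The percentage column can be recomputed from the tumoral and normal counts, which is a quick consistency check on the table. The sketch below uses a handful of rows copied from the table; the `percent_tumoral` helper and the `counts` dictionary are names introduced here for illustration only.

```python
def percent_tumoral(n_tumoral, n_normal):
    """Share of tumoral tissues as a percentage, rounded to two decimals."""
    total = n_tumoral + n_normal
    return round(100 * n_tumoral / total, 2)


# (tumoral, normal) counts copied from four rows of the table.
counts = {"ACC": (79, 0), "BLCA": (414, 19), "CHOL": (36, 9), "KICH": (65, 24)}

for cancer, (tumoral, normal) in counts.items():
    print(cancer, tumoral + normal, percent_tumoral(tumoral, normal))

# Datasets with at least one normal tissue, i.e. with both classes present.
two_class = [cancer for cancer, (_, normal) in counts.items() if normal > 0]
print(two_class)
```

The datasets with no normal tissues do not appear in the table of classification results. That is consistent with a case-control analysis needing both classes, although this reading is an inference rather than something the text states.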
Results of the classification analyses with CamurWeb
| Cancer | Execution time (h:mm:ss) | # of iterations | # of rules | # of genes |
|---|---|---|---|---|
| BLCA | 4:36:52 | 100 | 334 | 164 |
| BRCA | 190:29:57 | 30 | 3015 | 1847 |
| CESC | 0:01:50 | 20 | 5 | 3 |
| CHOL | 0:00:13 | 47 | 3 | 2 |
| COAD | 1:48:12 | 100 | 90 | 32 |
| ESCA | 0:56:09 | 100 | 229 | 122 |
| GBM | 14:21:12 | 100 | 1487 | 832 |
| HNSC | 84:27:30 | 100 | 3201 | 1363 |
| KICH | 0:00:52 | 26 | 8 | 5 |
| KIRC | 6:36:45 | 100 | 470 | 183 |
| KIRP | 0:01:17 | 9 | 3 | 2 |
| LIHC | 24:08:10 | 100 | 1890 | 854 |
| LUAD | 12:06:36 | 100 | 775 | 298 |
| LUSC | 0:06:23 | 32 | 8 | 5 |
| PAAD | 0:29:37 | 100 | 132 | 71 |
| PCPG | 6:35:40 | 100 | 348 | 173 |
| READ | 0:01:11 | 23 | 6 | 5 |
| SARC | 7:42:24 | 100 | 358 | 164 |
| STAD | 2:04:16 | 100 | 416 | 243 |
| THYM | 0:00:19 | 14 | 3 | 3 |
| UCEC | 3:52:26 | 100 | 496 | 209 |
For each considered cancer, we report the execution time, the number of iterations performed, and the number of rules and genes extracted by CAMUR.
Fig. 5 The results page of the classification analyses on the LUSC tumor
Most represented genes in the rules extracted from the HNSC tumor
| Gene | Occurrences |
|---|---|
| ENSG00000130309.9 | 1934 |
| ENSG00000197467.12 | 467 |
| ENSG00000101189.6 | 354 |
| ENSG00000260597.1 | 250 |
| ENSG00000197766.6 | 218 |
| ... | ... |
Pairs of genes that co-occur most often in the classification rules related to the HNSC tumor
| Gene 1 | Gene 2 | Occurrences |
|---|---|---|
| ENSG00000260597.1 | ENSG00000130309.9 | 250 |
| ENSG00000130309.9 | ENSG00000197766.6 | 203 |
| ENSG00000256229.6 | ENSG00000130309.9 | 167 |
| ENSG00000164114.17 | ENSG00000130309.9 | 165 |
| ... | ... | ... |
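Occurrence tables of this kind can be derived by counting, over all extracted rules, how often each gene and each unordered pair of genes appears. The sketch below assumes each rule has been reduced to the set of genes it mentions and that a gene counts once per rule. The toy `rules` list and the `count_occurrences` helper are hypothetical, and CamurWeb's own counting may differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy rules: each rule is reduced to the set of genes it mentions.
rules = [
    {"ENSG00000130309.9", "ENSG00000260597.1"},
    {"ENSG00000130309.9", "ENSG00000197766.6"},
    {"ENSG00000130309.9", "ENSG00000260597.1", "ENSG00000197766.6"},
    {"ENSG00000101189.6"},
]


def count_occurrences(rules):
    """Count per-gene and per-pair occurrences, one count per rule."""
    gene_counts = Counter()
    pair_counts = Counter()
    for genes in rules:
        gene_counts.update(genes)
        # Sort first so that each unordered pair maps to a single key.
        pair_counts.update(combinations(sorted(genes), 2))
    return gene_counts, pair_counts


gene_counts, pair_counts = count_occurrences(rules)
print(gene_counts.most_common(3))
print(pair_counts.most_common(3))
```

Under this counting, a pair can never occur more often than either of its genes. In the HNSC tables, ENSG00000260597.1 occurs 250 times on its own and 250 times together with ENSG00000130309.9, which would mean it never appears in a rule without ENSG00000130309.9.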
Most represented genes in the rules extracted from the LIHC tumor
| Gene | Occurrences |
|---|---|
| ENSG00000187730.7 | 413 |
| ENSG00000158882.11 | 376 |
| ENSG00000231856.2 | 295 |
| ENSG00000164283.11 | 229 |
| ... | ... |
Most frequent genes in the rules extracted from the BRCA tumor
| Gene | Occurrences |
|---|---|
| ENSG00000136158.9 | 1078 |
| ENSG00000165197.4 | 993 |
| ENSG00000099953.8 | 725 |
| ENSG00000157766.14 | 515 |
| ... | ... |