Valerio Cestarelli, Giulia Fiscon, Giovanni Felici, Paola Bertolazzi, Emanuel Weitschek.
Abstract
MOTIVATION: Methods for extracting knowledge from Next Generation Sequencing data are in high demand. In this work, we focus on RNA-seq gene expression analysis, and specifically on case-control studies with rule-based supervised classification algorithms that build a model able to discriminate cases from controls. State-of-the-art algorithms compute a single classification model that contains few features (genes). In contrast, our goal is to extract more knowledge by computing many classification models, and thereby to identify most of the genes related to the predicted class.
Year: 2015 PMID: 26519501 PMCID: PMC4795614 DOI: 10.1093/bioinformatics/btv635
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Example of the breast cancer RNA-seq data matrix extracted from The Cancer Genome Atlas (TCGA)
| SampleID | ANO8 | C1orf27 | TRPM6 | Class |
|---|---|---|---|---|
| A8-A09D | 2.64 | 5.42 | 0.38 | Breast cancer |
| BH-A0DH | 1.46 | 6.47 | 0.76 | Normal |
| GM-A2DC | 2.22 | 22.50 | 0.53 | Breast cancer |
| GM-A2D9 | 3.13 | 14.21 | 0.61 | Breast cancer |
| ⋯ | ⋯ | ⋯ | ⋯ | ⋯ |
| GM-A2DB | 3.86 | 5.15 | 0.59 | Breast cancer |
The rows correspond to the samples and the columns to their features (gene expression profiles). The cells contain the gene expression measure Reads Per Kilobase per Million mapped reads (RPKM) explained in Section 2.3.
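As a reminder of what the cell values represent, the standard RPKM definition can be sketched as follows. This is a minimal illustration of the general formula, RPKM = 10^9 · C / (N · L), where C is the number of reads mapped to the gene, N the total number of mapped reads in the sample and L the gene length in base pairs. The function name and the example numbers are hypothetical and are not taken from the TCGA data.

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase per Million mapped reads (standard definition).

    RPKM = 1e9 * C / (N * L), with C = reads mapped to the gene,
    N = total mapped reads in the sample, L = gene length in base pairs.
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return 1e9 * read_count / (total_mapped_reads * gene_length_bp)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow:
# 500 reads on a 2 kb gene in a sample with 25 million mapped reads.
example_rpkm = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=25_000_000)
print(example_rpkm)
```

Normalizing by both gene length and sequencing depth is what makes values comparable across genes and across samples in a matrix like the one above.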
Fig. 1. Component diagram of the MSE part of the CAMUR software package
Fig. 2. Screenshot of the MSA part of the CAMUR software package: it displays the initial parameters configuration available to the user
Summary of the analyzed data sets
| Cancer | Tissues | Tumoral | Normal | Genes | Size [MB] |
|---|---|---|---|---|---|
| BRCA | 884 | 783 | 101 | 20532 | 292 |
| HNSC | 295 | 264 | 31 | 20532 | 92 |
| STAD | 271 | 238 | 33 | 29699 | 56 |
The three data sets are extracted from The Cancer Genome Atlas. The numbers refer to the sequenced tissues, belonging to the tumoral and normal classes (Tissues, Tumoral and Normal columns). It is worth noting that for each data set the number of analyzed samples corresponds to the number of tumoral tissues (Tumoral column). The last two columns report the number of genes and the size of each data set.
CAMUR execution times
| Cancer | Total time [h:m] | Loose mode time [h:m] | Strict mode time [h:m] |
|---|---|---|---|
| BRCA | 6 h:56 m | 6 h:17 m | 0 h:39 m |
| HNSC | 0 h:33 m | 0 h:26 m | 0 h:7 m |
| STAD | 0 h:27 m | 0 h:17 m | 0 h:10 m |
The execution times for Breast (BRCA), Head and Neck (HNSC) and Stomach (STAD) cancer. Times are reported in hours and minutes.
A portion of the output of the ‘list of attributes’ query
| Gene | Occurrences |
|---|---|
| ADAMTS5 | 109 |
| MMP11 | 102 |
| FIGF | 84 |
| SDPR | 82 |
| COL10A1 | 51 |
List of the 12 most common genes (listed row-wise) extracted by CAMUR
| MMP11 | ADAMTS5 | SDPR |
| FIGF | CGB7 | COL10A1 |
| TMEM220 | ARHGAP20 | SPRY2 |
| ACSM5 | FXYD1 | EPDR1 |
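A ranking of this kind can be derived from per-gene occurrence counts such as those in the ‘list of attributes’ output. The sketch below assumes that an occurrence means "the gene appears in one of the computed classification models"; that interpretation, the function name and the toy models are illustrative assumptions, not CAMUR's actual implementation or data.

```python
from collections import Counter

# Hypothetical toy input: each classification model is represented
# by the set of genes that appear in its rules.
toy_models = [
    {"MMP11", "FIGF"},
    {"ADAMTS5", "CGB7"},
    {"MMP11", "EPDR1"},
    {"ADAMTS5", "MMP11"},
]


def gene_occurrences(models):
    """Count in how many models each gene appears (at most once per model)."""
    counts = Counter()
    for genes in models:
        counts.update(set(genes))
    return counts


occurrences = gene_occurrences(toy_models)

# Rank by decreasing count; ties are broken alphabetically for a stable order.
ranked_genes = sorted(occurrences.items(), key=lambda item: (-item[1], item[0]))
for gene, count in ranked_genes:
    print(gene, count)
```

Taking the first 12 entries of such a ranking would give a list in the same spirit as the table above.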
An example of the output for query 5
| Gene 1 | Gene 2 | Occurrences |
|---|---|---|
| FIGF | MMP11 | 100 |
| CGB7 | ADAMTS5 | 73 |
| SDPR | ANXA1 | 37 |
| EPDR1 | MMP11 | 34 |
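The pair counts above could be produced by a co-occurrence tally over the computed models. The sketch below assumes that a pair's occurrence count is the number of models in which both genes appear together; the source does not define query 5 here, so this reading, the function name and the toy models are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy input: each model is the set of genes used in its rules.
toy_pair_models = [
    {"FIGF", "MMP11"},
    {"FIGF", "MMP11", "SDPR"},
    {"CGB7", "ADAMTS5"},
]


def pair_occurrences(models):
    """Count, for every unordered gene pair, the models containing both genes."""
    counts = Counter()
    for genes in models:
        # Sorting makes (A, B) and (B, A) map to the same key.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


pair_counts = pair_occurrences(toy_pair_models)
for (gene_1, gene_2), count in pair_counts.most_common():
    print(gene_1, gene_2, count)
```

Storing each pair as a sorted tuple keeps the count independent of the order in which genes appear inside a model.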
Fig. 3. Euler-Venn diagram of the CAMUR gene lists for BRCA, HNSC and STAD: (a) diagram of the overlapping genes extracted by CAMUR; (b) diagram of the overlap between the CAMUR gene lists and the differentially expressed genes