Shu-Chuan Chen, Tsung-Hsien Tsai, Cheng-Han Chung, Wen-Hsiung Li.
Abstract
BACKGROUND: The purpose of gene expression analysis is to look for associations between the regulation of gene expression levels and phenotypic variations. Such associations, derived from gene expression profiles, have been used to determine whether the induction or repression of genes corresponds to phenotypic variations in areas including cell regulation, clinical diagnosis and drug development. Statistical analyses of microarray data have been developed to address the gene selection problem. However, these methods do not inform us of causality between genes and phenotypes. In this paper, we propose the dynamic association rule algorithm (DAR algorithm), which helps one efficiently select a subset of significant genes for subsequent analysis. The DAR algorithm is based on association rules from market basket analysis in marketing. We first propose a statistical approach, based on constructing a one-sided confidence interval and hypothesis testing, to determine whether an association rule is meaningful. Based on this statistical method, we then developed the DAR algorithm for gene expression data analysis. The method was applied to four microarray datasets and one Next Generation Sequencing (NGS) dataset: the Mice Apo A1 dataset, the whole genome expression dataset of mouse embryonic stem cells, expression profiling of the bone marrow of leukemia patients, the Microarray Quality Control (MAQC) dataset and the RNA-seq dataset of a mouse genomic imprinting study. The proposed method was compared with the t-test on the expression profiling of the bone marrow of leukemia patients.
Year: 2015 PMID: 26467206 PMCID: PMC4606551 DOI: 10.1186/s12864-015-1970-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
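The association rules mentioned in the abstract can be illustrated with a small sketch. It assumes that expression has already been discretized to −1, 0 or 1 (the coding used in the rule tables of this record; the discretization thresholds are not given here) and uses the usual market basket definitions of support and confidence. The function name `rule_stats` and the toy data are hypothetical and are not taken from the paper.

```python
def rule_stats(levels, labels, level, label):
    """Support and confidence of the rule '<level> gene -> <label>'.

    levels: discretized expression of one gene per sample (-1, 0 or 1)
    labels: phenotype of each sample (e.g. "case" / "control")
    """
    if len(levels) != len(labels) or not levels:
        raise ValueError("levels and labels must be non-empty and equally long")
    n = len(levels)
    # Samples in which the antecedent (gene at the given level) holds.
    n_antecedent = sum(1 for x in levels if x == level)
    # Samples in which antecedent and consequent hold together.
    n_both = sum(1 for x, y in zip(levels, labels) if x == level and y == label)
    support = n_both / n
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Toy data (made up): ten samples of one gene.
levels = [1, 1, 1, 0, -1, -1, 1, 0, -1, 1]
labels = ["case", "case", "case", "control", "control",
          "control", "case", "case", "control", "control"]

support, confidence = rule_stats(levels, labels, 1, "case")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Whether a confidence value of this kind is large enough to be called meaningful is the question the one-sided confidence interval is meant to answer.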
Fig. 3 The DAR algorithm for microarray data analysis
Fig. 1 The distribution of confidence for all combinations of p1 = (0.1, 0.3, 0.5, 0.7, 0.9) and p2 = (0.1, 0.3, 0.5, 0.7, 0.9) under independence when n = 50
Fig. 2 The distribution of confidence under independence when n = 1000
Critical values when n = 1000
| p2 | Level | p1 = 0.1 | p1 = 0.3 | p1 = 0.5 | p1 = 0.7 | p1 = 0.9 |
|---|---|---|---|---|---|---|
| 0.1 | 80 % | 0.1245 | 0.1135 | 0.1105 | 0.1095 | 0.1075 |
| | 90 % | 0.1395 | 0.1225 | 0.1165 | 0.1145 | 0.1125 |
| | 95 % | 0.1515 | 0.1285 | 0.1225 | 0.1185 | 0.1165 |
| | 99 % | 0.1755 | 0.1415 | 0.1315 | 0.1265 | 0.1235 |
| 0.3 | 80 % | 0.3385 | 0.3215 | 0.3165 | 0.3145 | 0.3125 |
| | 90 % | 0.3585 | 0.3335 | 0.3255 | 0.3215 | 0.3195 |
| | 95 % | 0.3755 | 0.3435 | 0.3335 | 0.3285 | 0.3245 |
| | 99 % | 0.4095 | 0.3625 | 0.3475 | 0.3405 | 0.3355 |
| 0.5 | 80 % | 0.5415 | 0.5235 | 0.5185 | 0.5155 | 0.5135 |
| | 90 % | 0.5635 | 0.5365 | 0.5285 | 0.5235 | 0.5205 |
| | 95 % | 0.5825 | 0.5475 | 0.5365 | 0.5305 | 0.5265 |
| | 99 % | 0.6165 | 0.5665 | 0.5515 | 0.5435 | 0.5385 |
| 0.7 | 80 % | 0.7385 | 0.7215 | 0.7165 | 0.7145 | 0.7125 |
| | 90 % | 0.7585 | 0.7335 | 0.7255 | 0.7215 | 0.7195 |
| | 95 % | 0.7745 | 0.7425 | 0.7325 | 0.7275 | 0.7245 |
| | 99 % | 0.8035 | 0.7605 | 0.7465 | 0.7395 | 0.7345 |
| 0.9 | 80 % | 0.9255 | 0.9145 | 0.9105 | 0.9095 | 0.9075 |
| | 90 % | 0.9375 | 0.9215 | 0.9165 | 0.9135 | 0.9125 |
| | 95 % | 0.9465 | 0.9275 | 0.9215 | 0.9175 | 0.9155 |
| | 99 % | 0.9635 | 0.9375 | 0.9295 | 0.9255 | 0.9225 |
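The critical values above can be roughly reproduced with a normal approximation. This is only a sketch under stated assumptions, not the authors' procedure: it assumes p1 is the probability of the antecedent and p2 the probability of the consequent, so that under independence the confidence has mean p2 and variance of about p2(1 − p2)/(n·p1). The function names `approx_critical_value` and `is_meaningful` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_critical_value(p1, p2, n, level):
    """Approximate one-sided upper critical value of the confidence
    of a rule under independence (normal approximation, an assumption).

    p1: probability of the antecedent, p2: probability of the consequent,
    n: number of samples, level: e.g. 0.95 for a 95 % one-sided bound.
    """
    z = NormalDist().inv_cdf(level)
    # About n * p1 samples satisfy the antecedent; among them the
    # consequent occurs with probability p2 if the two are independent.
    return p2 + z * sqrt(p2 * (1.0 - p2) / (n * p1))


def is_meaningful(confidence, p1, p2, n, level=0.95):
    """Flag a rule whose observed confidence exceeds the critical value."""
    return confidence > approx_critical_value(p1, p2, n, level)


print(round(approx_critical_value(0.1, 0.1, 1000, 0.95), 4))
print(is_meaningful(0.60, p1=0.5, p2=0.5, n=1000))
```

For p1 = p2 = 0.1 at the 95 % level the approximation gives about 0.149, against 0.1515 in the table. The gap is largest where the antecedent is rare or the level is high, so the tabulated values should be preferred when they are available.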
The results of the single antecedent rules mining when the DAR algorithm is applied to the data of Callow et al. [19]
| Support/Confidence | Meaningful rules | Screen rules | Final genes |
|---|---|---|---|
| 10/80 | 216 | 198 | 198 |
| 10/85 | 216 | 198 | 198 |
| 10/90 | 215 | 197 | 197 |
| 20/80 | 20 | 14 | 14 |
| 20/85 | 20 | 14 | 14 |
| 20/90 | 19 | 13 | 13 |
| 30/80 | 14 | 9 | 9 |
| 30/85 | 14 | 9 | 9 |
| 30/90 | 13 | 8 | 8 |
| Dynamic Support and Confidence | 14 | 9 | 9 |
The influential genes found when the DAR algorithm was applied to the data in Callow et al. [19]
| Rule type | ID | Gene | Expression level | Influential in S. Dudoit 2002 | Multiple |
|---|---|---|---|---|---|
| Single Antecedent | 540 | Apo AI | −1 | Yes | 4.00E−04 |
| | 1496 | SPATA5L1 | −1 | Yes | 0.0156 |
| | 1739 | Apo CIII | −1 | Yes | 4.00E−04 |
| | 2149 | Apo AI | −1 | Yes | 4.00E−04 |
| | 2537 | Apo CIII | −1 | Yes | 7.00E−04 |
| | 4139 | SC5D | −1 | Yes | 5.00E−04 |
| | 4941 | SC5D | −1 | Yes | 0.0086 |
| | 5356 | Apo AI | −1 | Yes | 7.00E−04 |
| | 2296 | CASP6 | −1 | New | 0.4745 |
| Additional in Double antecedent | 5053 | BLANK | 1 | New | 1 |
| | 5419 | DTNBP1 | 1 | New | 1 |
| | 6215 | MAK | 1 | New | 1 |
| | 6245 | BLANK | 1 | New | 1 |
| | 6379 | CAS1 | 1 | New | 1 |
The results of the double antecedent rules mining when the DAR algorithm was applied to the data of Callow et al. [19]
| Support/Confidence | Meaningful rules | Screen rules | Final genes |
|---|---|---|---|
| 10/80 | 1,298,658 | 23,076 | 454 |
| 10/85 | 1,298,224 | 23,075 | 454 |
| 10/90 | 1,298,010 | 23,074 | 454 |
| 20/80 | 120,241 | 312 | 52 |
| 20/85 | 119,807 | 311 | 52 |
| 20/90 | 113,393 | 310 | 52 |
| 30/80 | 83,228 | 65 | 14 |
| 30/85 | 82,794 | 64 | 14 |
| 30/90 | 76,830 | 63 | 14 |
| Dynamic Support and Confidence | 82,361 | 64 | 14 |
Fig. 4 Overlaps between the gene sets in significant association rules, Oct4-sorted, Oct4-RNAi, and RA-induction. The numbers in the Venn diagrams give the sizes of the intersections of the contiguous regions
The results of the marker genes of ESC or differentiation from the dataset of Zhou et al. [20]
| Marker genes of ESC | RefSeq ID | Sig. in Oct4-sorted+ | Sig. in ‘1 to Oct4+’ | Multiple |
|---|---|---|---|---|
| Oct4 | NM_013633 | + | + | 4.00E−04 |
| Sox2 | NM_011443 | + | - | 6.00E−04 |
| Nanog | NM_024865 | + | + | 0.0038 |
| Esrrb | NM_011934 | + | + | 6.00E−04 |
| Tcl1 | NM_009337 | + | + | 4.00E−04 |
| Dppa5 | NM_025274 | + | + | 4.00E−04 |
| Utf1 | NM_009482 | + | + | 4.00E−04 |
| Marker genes of differentiation | | Sig. in Oct4-sorted− | Sig. in ‘−1 to Oct4+’ | |
| Tcf7l2 | NM_009333 | + | + | 0.1526 |
| Gata4 | NM_008092 | + | + | 6.00E−04 |
| Gata6 | NM_010258 | + | + | 0.0032 |
| Tgfbr3 | NM_011578 | + | - | 1 |
| Foxa2 | NM_010446 | + | + | 4.00E−04 |
| Bmp2 | NM_007553 | + | + | 0.017 |
| Cited2 | NM_010828 | + | + | 4.00E−04 |
The results for the identified core regulators from the dataset of Zhou et al. [20]
| Gene name | RefSeq ID | Sig. in Oct4+ sorted | Sig. in ‘1 to Oct4+’ |
|---|---|---|---|
| Oct4 | NM_013633 | + | + |
| Sox2 | NM_011443 | + | - |
| Nanog | NM_024865 | + | + |
| Stat3 | NM_213659 | - | - |
| Esrrb | NM_011934 | + | + |
| Sall4 | NM_201395 | - | - |
| Nr5a2 | NM_030676 | + | + |
| Otx2 | NM_144841 | + | - |
| Tcf7 | NM_009331 | + | + |
| Etv5 | NM_023794 | + | + |
| Utf1 | NM_009482 | + | + |
| Tcfap2c | NM_009335 | + | + |
| Mtf2 | NM_013827 | + | + |
| Rest | NM_011263 | + | + |
| Rbpsuh | NM_009035 | + | + |
Number of significant probes and non-redundant genes under the tested methods
| | Association rule | t-test | Common gene |
|---|---|---|---|
| ᵃAML > ALL (# of probes) | 576 (683) | 86 (97) | 41 |
| ᵇALL > AML (# of probes) | 396 (467) | 615 (672) | 201 |
ᵃ Number of non-redundant Ensembl genes expressed significantly higher in AML than in ALL
ᵇ Number of non-redundant Ensembl genes expressed significantly higher in ALL than in AML
Significant pathways in terms of biological process in the GO database
| ALL > AML (−1 to AML) in association rule | | ALL > AML in t-test | |
|---|---|---|---|
| Biological Process | p-value | Biological Process | p-value |
| … | 5.05E−25 | … | 8.38E−25 |
| … | 1.65E−21 | … | 2.80E−23 |
| … | 2.20E−13 | … | 3.20E−21 |
| … | 5.27E−07 | … | 3.23E−16 |
| … | 1.30E−06 | … | 1.41E−13 |
| … | 3.88E−06 | … | 7.06E−10 |
| … | 1.59E−04 | … | 6.09E−08 |
| … | 2.00E−04 | … | 1.10E−07 |
| … | 7.37E−04 | … | 4.00E−07 |
| … | 1.01E−03 | Response to stimulus | 4.04E−07 |
| … | 3.10E−03 | mRNA processing | 2.10E−06 |
| … | 5.49E−03 | … | 2.70E−06 |
| … | 7.40E−03 | … | 5.52E−06 |
| Protein complex biogenesis | 1.61E−02 | Response to stress | 6.65E−05 |
| Protein complex assembly | 1.61E−02 | … | 1.73E−04 |
| … | 3.09E−02 | … | 2.75E−04 |
| … | 3.10E−02 | … | 3.43E−04 |
| Transcription from RNA polymerase II promoter | 4.21E−02 | Purine nucleobase metabolic process | 8.90E−04 |
| … | 4.84E−02 | Protein phosphorylation | 1.03E−03 |
| | | mRNA splicing, via spliceosome | 1.40E−03 |
| | | Meiosis | 3.33E−03 |
| | | … | 4.12E−03 |
| | | RNA splicing | 8.94E−03 |
| | | RNA splicing, via transesterification reactions | 8.94E−03 |
| | | Cellular defense response | 2.28E−02 |
| | | Cell proliferation | 3.93E−02 |
| | | Nitrogen compound metabolic process | 4.54E−02 |
| | | … | 4.63E−02 |
| | | Regulation of carbohydrate metabolic process | 4.66E−02 |
| AML > ALL (+1 to AML) in association rule | | AML > ALL in t-test | |
| Biological Process | p-value | Biological Process | p-value |
| Metabolic process | 1.20E−10 | … | 1.14E−02 |
| Cell communication | 8.26E−10 | | |
| Developmental process | 3.19E−09 | | |
| … | 5.32E−09 | | |
| Immune response | 6.40E−09 | | |
| Immune system process | 4.00E−08 | | |
| Primary metabolic process | 5.37E−08 | | |
| Macrophage activation | 1.43E−07 | | |
| Response to stimulus | 1.71E−07 | | |
| System development | 8.39E−06 | | |
| Cell death | 6.55E−05 | | |
| Apoptotic process | 6.55E−05 | | |
| Death | 7.05E−05 | | |
| Proteolysis | 2.85E−04 | | |
| Hemopoiesis | 6.24E−04 | | |
| Protein metabolic process | 1.30E−03 | | |
| Negative regulation of apoptotic process | 1.51E−03 | | |
| Transport | 4.90E−03 | | |
| Localization | 8.73E−03 | | |
| Biological regulation | 9.95E−03 | | |
| Angiogenesis | 1.46E−02 | | |
| Regulation of biological process | 2.34E−02 | | |
| B cell mediated immunity | 7.41E−02 | | |
| Skeletal system development | 8.16E−02 | | |
| Cellular defense response | 8.65E−02 | | |
| Mesoderm development | 9.51E−02 | | |
Bolded terms indicate the significant pathways that appear in both the DAR and t-test results
Number of significant DAR rules using five different confidence levels. Entries give the number of rules, with the number of genes overlapping Yamaguchi et al.’s study in parentheses.
| Association rules | New method | | | | |
|---|---|---|---|---|---|
| Expression of rules | 70 % confidence | 75 % confidence | 80 % confidence | 85 % confidence | 90 % confidence |
| 1 geneA -> case | 1055 (267) | 575 (158) | 168 (38) | 34 (6) | 9 (0) |
| −1 geneA -> case | 869 (255) | 456 (129) | 117 (30) | 11 (2) | 5 (1) |
| 1 geneA -> control | 609 (9) | 581 (9) | 304 (5) | 304 (5) | 282 (5) |
| −1 geneA -> control | 737 (8) | 732 (8) | 332 (3) | 332 (3) | 320 (3) |
| 0 geneA -> case | 236 (2) | 125 (2) | 36 (36) | 2 (0) | 0 |
| 0 geneA -> control | 2478 (701) | 2370 (696) | 1474 (444) | 876 (238) | 454 (132) |
Expressed imprinted genes in the RNA-seq data of Yamaguchi et al. [22]
| Gene symbol | Expressing allele | At least two out of ten FC > 1.5 | | |
|---|---|---|---|---|
| | | Significant (up, down) or not | Intersect with significant association rules | Multiple t-test (adj. p-value) |
| Mest | P | Down | | 1 |
| H19 | M | Up | | 1 |
| Meg3 | M | Down | | 1 |
| Grb10 | M | Up | 1 geneA -> paternalKO | 1 |
| Rian | M | Down | | 1 |
| Peg10 | P | Down | −1 geneA -> paternalKO | 1 |
| Cdkn1c | M | Up | | 1 |
| Peg3 | P | up/down | | 1 |
| Sgce | P | Down | −1 geneA -> paternalKO | 1 |
| Asb4 | M | Up | 1 geneA -> paternalKO | 1 |
| Cmah | M | Down | | 1 |
| Impact | P | Down | | 1 |
| Pon2 | M | Down | | 1 |
| Ube3a | M | Up | | 1 |
| Peg13 | P | Down | | 1 |
| Phlda2 | M | Down | | 1 |
| Dcn | M | Up | | 1 |
| Airn | P | up/down | | 1 |
| Ddc | P | Down | | 1 |
| Zim1 | M | Up | | 1 |
| Magel2 | P | Down | | 1 |
| Begain | P | up/down | | 1 |
| Tspan32 | M | Down | −1 geneA -> paternalKO | 1 |
| Art5 | M | Down | | 1 |
| Wt1 | M | Up | | 0.818182 |
| Qpct | M | Down | | 1 |
| Atp10a | M | Down | | 1 |
| Nespas | P | Down | −1 geneA -> paternalKO | 1 |
| Tnfrsf23 | M | Down | | 1 |
| Tfpi2 | M | Down | | 1 |
| Sfmbt2 | P | up/down | | 1 |
| Nap1l5 | P | Up | 1 geneA -> paternalKO | 1 |
| Slc22a18 | M | Down | | 1 |
| Th | M | Up | 1 geneA -> paternalKO | 1 |
| Usp29 | P | Down | | 1 |
| Cntn3 | M | Down | | 1 |
| Mst1r | M | Down | | 1 |
| Calcr | M | Down | | 1 |
| Kcnq1 | M | Down | | 1 |
Fig. 5 Number of common genes between DAR and compared methods in TAQ data. a DAR method with 90 % confidence level; b DAR method with 70 % confidence level