Mary E Edgerton, Douglas H Fisher, Lianhong Tang, Lewis J Frey, Zhihua Chen.
Abstract
We use Backward Chaining Rule Induction (BCRI), a novel data mining method for hypothesizing causative mechanisms, to mine lung cancer gene expression array data for mechanisms that could impact survival. Initially, a supervised learning system is used to generate a prediction model in the form of "IF <conditions> THEN <outcome>" style rules. Next, each antecedent (i.e., an IF condition) of a previously discovered rule becomes the outcome class for a subsequent application of supervised rule induction. This step is repeated until a termination condition is satisfied. "Chains" of rules are created by working backward from an initial condition (e.g., survival status). Through this iterative process of "backward chaining," BCRI searches for rules that describe plausible gene interactions for subsequent validation. Thus, BCRI is a semi-supervised approach that constrains the search through the vast space of plausible causal mechanisms by using a top-level outcome to kick-start the process. We demonstrate the general BCRI task sequence, how to implement it, the validation process, and how BCRI rules discovered from lung cancer microarray data can be combined with prior knowledge to generate hypotheses about functional genomics.
Keywords: C4.5; class discovery; data analysis; decision trees; microarray; molecular mechanisms; non-small cell lung cancer; semi-supervised methods; systems biology
Year: 2007 PMID: 19455237 PMCID: PMC2312096
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Candidate split points.
Pseudocode for BCRI. Data is a data set such as the Beer et al. data set. Classes is a set of class labels that are included in and used to classify Data. RuleInducer is a function of two parameters (i.e., a supervised rule induction system) that learns if-then rules to predict a TargetCond from a DataSet. PriorityFn is a function that takes an if-then rule and returns a floating point priority value associated with the rule. This priority value is used to order the rule on a priority queue, and the priority queue guides the exploration of rules to which backward chaining is applied. TerminateFn is a function that decides whether backward chaining should stop at a given rule.
| Function Wrapper-BCRI |
| Returns a RuleSet |
| With parameters DataSet Data |
| TargetSet Classes |
| RuleSet Function RuleInducer (DataSet, TargetCond) |
| float Function PriorityFn (Rule), |
| bool Function TerminateFn (Rule) { |
| PQ = InitializePriorityQueue(PriorityFn); |
| FOR each class in Classes, Enqueue(PQ, [class |
| WHILE (NOT Empty(PQ)) { |
| R = Dequeue(PQ);/* and place R in Results SET*/ |
| IF (NOT TerminateFn(R)) { |
| FOR each a IN ANTECEDENTS(R) { |
| Children = RuleInducer (Data, a); |
| FOR each c IN Children Enqueue(PQ, c) |
| } |
| } |
| }/* end WHILE */ |
| }/* end BCRI */ |
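The pseudocode above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Rule`, `wrapper_bcri`, `TOY_RULES`, and `toy_inducer` are hypothetical, the queue is seeded with the rules induced for each top-level class (the seeding step is truncated in the pseudocode as extracted), and a larger priority value is assumed to mean "explore sooner". The toy inducer simply looks up three of the Low Risk rules reported in this paper instead of learning them from data.

```python
import heapq
import itertools
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Rule(NamedTuple):
    antecedents: Tuple[str, ...]  # the IF conditions
    consequent: str               # the THEN outcome


def wrapper_bcri(
    data,
    classes: Iterable[str],
    rule_inducer: Callable[[object, str], List[Rule]],
    priority_fn: Callable[[Rule], float],
    terminate_fn: Callable[[Rule], bool],
) -> List[Rule]:
    tie = itertools.count()  # tie-breaker so heapq never compares Rule objects
    pq: list = []

    def enqueue(rule: Rule) -> None:
        # ASSUMPTION: larger priority = explored sooner; heapq is a min-heap,
        # so the priority is negated.
        heapq.heappush(pq, (-priority_fn(rule), next(tie), rule))

    # ASSUMPTION: seed the queue with the rules induced for each top-level class.
    for cls in classes:
        for rule in rule_inducer(data, cls):
            enqueue(rule)

    results: List[Rule] = []
    while pq:
        _, _, rule = heapq.heappop(pq)
        results.append(rule)  # every dequeued rule is kept in the result set
        if not terminate_fn(rule):
            # Backward chaining: each antecedent becomes the next target.
            for antecedent in rule.antecedents:
                for child in rule_inducer(data, antecedent):
                    enqueue(child)
    return results


# Hypothetical stand-in for a supervised rule inducer such as C4.5.
TOY_RULES = {
    "Risk = Low": [Rule(("Stage = 1",), "Risk = Low")],
    "Stage = 1": [Rule(("ELA2 > 163.3",), "Stage = 1")],
    "ELA2 > 163.3": [
        Rule(("MRPL19 <= 161.4", "EIF2S1 > 52", "KRT15 <= 616.8"), "ELA2 > 163.3")
    ],
}


def toy_inducer(_data, target: str) -> List[Rule]:
    return TOY_RULES.get(target, [])


chain = wrapper_bcri(None, ["Risk = Low"], toy_inducer, lambda r: 1.0, lambda r: False)
for r in chain:
    print(" & ".join(r.antecedents), "->", r.consequent)
```

With a constant priority the tie-breaker makes the queue first-in, first-out, so the chain is explored breadth-first. The toy run stops because the inducer returns no rules for the leaf conditions; on real data, `terminate_fn` would have to bound the search.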
Gene Names, Locus, Function, and Rule Depth. The genes are grouped according to their gene locus. Rule depth, described in the text, is given to indicate the closeness of the terms in the rules. Note that many genes appear at the same rule depth for both high and low risk classes. Gene transcripts located on the same chromosome arm are shown in bold if their rules are within a single depth unit of one another, suggestive of related transcription control. Abbreviations are HUGO compliant except where noted by an asterisk. Locus is based on the LocusLink information and Function is based on the Gene Ontology information as reported in Genecards.
| Gene | Name | Locus | Function | Rule depth (Low Risk) | Rule depth (High Risk) |
|---|---|---|---|---|---|
| IDS | Iduronate 2-sulfatase precursor | Xq28 | metabolism | 3 | 3 |
| MRPL19 | 60S ribosomal protein L19, mitochondrial precursor | 2q11.1–11.2 | protein biosynthesis | 2 | 2 |
| TRIP12 | Thyroid receptor interacting protein 12 | 2q36.3 | ubiquitin-protein ligase activity | 3 | 3 |
| ANXA5 | Annexin V | 4q28–q32 | phospholipase inhibitor activity | 2 | |
| SC4MOL | Sterol C-4 methyl oxidase-like | 4q32–q34 | steroid metabolism | 3 | |
| H3FD* (HIST1H3E) | H3 histone family, member D (H3FD) | 6p21.3 | chromosome organization | 2 | |
| KIAA01618* (POM121) | Nuclear envelope pore membrane protein (POM121) | 7q11.23 | transport | 3 | |
| FRDA | Frataxin, mitochondrial precursor | 9q13–q21.1 | inositol/phosphatidylinositol kinase activity | 3 | |
| EIF2S1 | Eukaryotic translation initiation factor 2 subunit 1 | 14q23.3 | protein biosynthesis | 3 | |
| SERPINA1 | Alpha-1-antitrypsin precursor | 14q32.1 | | 1 | |
| AKAP13* (LBC) | A-kinase anchoring protein | 15q24–25 | intracellular signaling cascade | 3 | |
| CTRL | Chymotrypsin-like protease CTRL-1 precursor | 16q22.1 | proteolysis and peptidolysis | 3 | 3 |
| KRT13 | Keratin, type I cytoskeletal 13 | 17q12–q21.2 | structural constituent of cytoskeleton | 3 | |
| DDX5 | Probable RNA-dependent helicase p68 | 17q21 | ATP-dependent helicase activity | 2 | |
| KRT15 | Keratin, type I cytoskeletal 15 | 17q21.2 | structural constituent of cytoskeleton | 3 | |
| NAPG | N-ethylmaleimide-sensitive factor attachment protein, beta | 18p11.21 | intracellular transporter activity | 3 | 3 |
| SLC14A2 | Urea transporter, kidney | 18q12.1–q21.1 | urea transport | 3 | |
| ELA2 | Leukocyte elastase precursor | 19p13.3 | proteolysis and peptidolysis | 1 | 1 |
| PLAB* (GDF15) | Growth differentiation factor 15 | 19p31.1–13.2 | signal transduction (TGF-β) | 2 | |
Figure 2. C45W-BCRI rules expressed as an AND/OR graph.
*The rule to predict MRPL19 < 161.4 in the Low Risk trace will also predict MRPL19 < 161.4 in the High Risk trace. It is not shown again in the High Risk trace.
Rules induced by backward chaining of Risk top-level categories. The ordering of rules is not strictly indicative of the order in which they were discovered. Each rule number is given with its associated depth in the backward chaining process. Indentation indicates a parent-child relationship. "acc" denotes accuracy of prediction and "cov" denotes the number of cases covered by the IF condition over the total number of samples. A notation of (i) indicates that the rule is interesting, either because of an established association in the literature or because a plausible hypothesis can be inferred from it.
| Rule #/Depth | Rule |
|---|---|
| 1./0 | (Stage = 1) → (Risk = Low) |
| 2./1 | (ELA2 > 163.3) → (Stage = 1) |
| 3./2 | (MRPL19 ≤ 161.4) & (EIF2S1 > 52) & (KRT15 ≤ 616.8) → (ELA2 > 163.3) |
| 4./3 | (TRIP12 ≤ 1176) & (NAPG ≤ 243) → (MRPL19 ≤ 161.4) |
| 5./3 | (FRDA > 37.8) → (EIF2S1 > 52) |
| 6./3 | (CTRL > 194.4) & (IDS ≤ 163.3) → (KRT15 ≤ 616.8) |
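The depth values in the rule listing above follow from the chaining structure: a rule that predicts the top-level class has depth 0, and a rule that predicts an antecedent of a depth-d rule has depth d + 1. The sketch below recomputes the depths for rules 1 to 6 from their antecedents and consequents. The function name `rule_depths` and the tuple encoding are illustrative assumptions, not part of the original method.

```python
from typing import Dict, List, Tuple

# (rule number, antecedents, consequent), transcribed from the Low Risk rules above.
RULES: List[Tuple[int, List[str], str]] = [
    (1, ["Stage = 1"], "Risk = Low"),
    (2, ["ELA2 > 163.3"], "Stage = 1"),
    (3, ["MRPL19 <= 161.4", "EIF2S1 > 52", "KRT15 <= 616.8"], "ELA2 > 163.3"),
    (4, ["TRIP12 <= 1176", "NAPG <= 243"], "MRPL19 <= 161.4"),
    (5, ["FRDA > 37.8"], "EIF2S1 > 52"),
    (6, ["CTRL > 194.4", "IDS <= 163.3"], "KRT15 <= 616.8"),
]


def rule_depths(rules, top_level: str) -> Dict[int, int]:
    """Assign each rule its depth in the backward chain rooted at top_level."""
    depths: Dict[int, int] = {}
    # Depth of the rule in which each condition first appears as an antecedent.
    # The top-level class is given -1 so that rules predicting it get depth 0.
    cond_depth = {top_level: -1}
    pending = list(rules)
    while pending:
        progressed = False
        for rule in list(pending):
            number, antecedents, consequent = rule
            if consequent in cond_depth:
                depths[number] = cond_depth[consequent] + 1
                for a in antecedents:
                    cond_depth.setdefault(a, depths[number])
                pending.remove(rule)
                progressed = True
        if not progressed:
            break  # remaining rules are not connected to the top-level outcome
    return depths


depths = rule_depths(RULES, "Risk = Low")
print(depths)  # {1: 0, 2: 1, 3: 2, 4: 3, 5: 3, 6: 3}
```

The computed depths agree with the Rule #/Depth column for rules 1 to 6.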
Density of References for Molecular Species (Gene or Gene Product) in Rules. To be considered a positive finding (+), a reference linking the gene specified in the row with either lung cancer (column 2) or with cancer in general (column 3) must have been identified in a PubMed search. Otherwise, there is no known correlation (–).
| Gene | Lung cancer | Cancer in general |
|---|---|---|
| ELA2 | + | + |
| SERPINA1 | + | + |
| MRPL19 | − | − |
| EIF2S1 | + | + |
| KRT15 | − | + |
| TRIP12 | − | + |
| NAPG | − | − |
| FRDA | − | + |
| CTRL | − | + |
| IDS | − | − |
| ANXA5 | + | + |
| PLAB* (GDF15) | + | + |
| H3FD* (HIST1H3E) | + | + |
| DDX5 | − | + |
| AKAP13 | + | + |
| SLC14A2 | − | − |
| KIAA01618* (POM121) | − | + |
| SC4MOL | − | + |
| KRT13 | − | − |
| 19 | 6 | 12 |
Knowledge Introduced by Rule. The integer 1 is used to indicate a positive result in response to the question presented at the head of the column and 0 is used to indicate a null result.
| Rule # | Corresponds to a previously established or hypothesized interaction for lung cancer or cancer in general? | Supports, specializes, or contradicts a previously forwarded interaction? | Suggests a new question? | Has a plausible hypothesis for a mechanism requiring further study? | Coverage |
|---|---|---|---|---|---|
| 2 | 1 | 1 | 0 | 1 | 46/61 |
| 3 | 1 | 1 | 0 | 1 | 45/61 |
| 4 | 0 | 0 | 1 | 1 | 53/61 |
| 5 | 0 | 0 | 1 | 0 | 57/61 |
| 6 | 0 | 0 | 1 | 1 | 54/61 |
| 8 | 1 | 1 | 0 | 1 | 12/61 |
| 9 | 0 | 0 | 1 | 0 | 8/61 |
| 10 | 0 | 0 | 1 | 1 | 5/61 |
| 11 | 0 | 0 | 0 | 0 | 3/61 |
| 12 | 1 | 1 | 0 | 1 | 5/61 |
| 13 | 0 | 0 | 1 | 0 | 4/61 |
| 14 | 0 | 0 | 1 | 0 | 3/61 |
| 15 | 1 | 1 | 1 | 2 | 44/61 |
| 16 | 0 | 0 | 1 | 1 | 57/61 |
| 17 | 0 | 0 | 1 | 0 | 54/61 |
| 18 | 0 | 0 | 1 | 1 | 53/61 |
| 19 | 0 | 0 | 1 | 1 | 57/61 |
| 17 Rules | 5 Previously Hypothesized | 5 Specialization of Hypothesis | 12 New Associations | 12 Plausible Associations | |
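The summary row can be checked by summing each indicator column. The short sketch below transcribes the four indicator columns from the table above (the dictionary name `ROWS` is only for illustration) and recomputes the totals. Note that the fourth total comes out to 12 only when the value 2 recorded for rule 15 is counted as two hypotheses; this reading is an assumption.

```python
# rule number -> (previously hypothesized, supports/specializes/contradicts,
#                 suggests new question, plausible hypothesis)
ROWS = {
    2: (1, 1, 0, 1),
    3: (1, 1, 0, 1),
    4: (0, 0, 1, 1),
    5: (0, 0, 1, 0),
    6: (0, 0, 1, 1),
    8: (1, 1, 0, 1),
    9: (0, 0, 1, 0),
    10: (0, 0, 1, 1),
    11: (0, 0, 0, 0),
    12: (1, 1, 0, 1),
    13: (0, 0, 1, 0),
    14: (0, 0, 1, 0),
    15: (1, 1, 1, 2),
    16: (0, 0, 1, 1),
    17: (0, 0, 1, 0),
    18: (0, 0, 1, 1),
    19: (0, 0, 1, 1),
}

# zip(*...) transposes the rows so each tuple holds one column.
totals = [sum(column) for column in zip(*ROWS.values())]

print(len(ROWS))  # 17
print(totals)     # [5, 5, 12, 12]
```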
Figure 3.Pathway Assist™ diagram showing SERPINA1 and ELA2 relationships of Example 1. The protein products are indicated by the large ovals, a binding interaction is indicated by the purple dot relationship between the ovals, and gene expression regulation is indicated by a square along a dotted line.
Details of Example 1 relationships given by Pathway Assist™.
| Binding | ELA2 ---- SERPINA1 | |
| Regulation | ELA2 ---| SERPINA1 | negative |
| Regulation | SERPINA1 ---| ELA2 | negative |
Figure 4. Pathway Assist™ diagram illustrating FXN and EIF2S1 relationships of Example 2.
Details of Example 2 relationships given by Pathway Assist™.
| Regulation | heme ---| EIF2S1 | negative |
| MolSynthesis | FXN ---> heme | unknown |
Figure 5. Pathway Assist™ diagram of KRT13 and DDX5 relationships of Example 3.
Details of relationships of Example 3 given by Pathway Assist™.
| Regulation | KRT13 ---> assemble | unknown |
| Regulation | DDX5 ---> assemble | unknown |