Francisco J Lopez, Armando Blanco, Fernando Garcia, Carlos Cano, Antonio Marin.
Abstract
BACKGROUND: In recent years, the mapping of diverse genomes has generated huge amounts of biological data, which are currently dispersed across many databases. Integrating the information available in these databases is required to unveil possible associations among already known data. Biological data are often imprecise and noisy. Fuzzy set theory is especially suitable for modelling imprecise data, while association rules are well suited to integrating heterogeneous data.
Year: 2008 PMID: 18284669 PMCID: PMC2277399 DOI: 10.1186/1471-2105-9-107
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Linguistic labels defined for continuous features. This figure describes how the membership functions are defined for each fuzzy set in the corresponding continuous domain.
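As a rough illustration of what such membership functions can look like, the sketch below defines three linguistic labels with trapezoidal shapes. The trapezoidal form, the label names and every breakpoint are assumptions chosen for the example; the figure itself is not reproduced here, so the actual definitions may differ.

```python
NEG_INF = float("-inf")
POS_INF = float("inf")


def trapezoid(x, a, b, c, d):
    """Membership degree of x in a trapezoid with support (a, d) and core [b, c]."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)  # rising edge
    return (d - x) / (d - c)      # falling edge


# Hypothetical breakpoints on an arbitrary 0-100 scale (not taken from the paper).
LABELS = {
    "SHORT": (NEG_INF, NEG_INF, 20, 40),
    "MEDIUM": (20, 40, 60, 80),
    "LARGE": (60, 80, POS_INF, POS_INF),
}


def fuzzify(x):
    """Return the membership degree of x in every linguistic label."""
    return {label: trapezoid(x, *params) for label, params in LABELS.items()}


if __name__ == "__main__":
    print(fuzzify(30))  # a value on the border between SHORT and MEDIUM
```

With these breakpoints, a value halfway between two cores belongs to both neighbouring labels with degree 0.5, which is the behaviour that distinguishes fuzzy labels from a hard partition.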
Thresholds and total number of rules
| Variables | CF & Conf. threshold | Support threshold | Total number of rules | FDR |
| Structural variables | 0.1 | 0.01 | 24 | 0.093 |
| Molecular Function & Structural variables | 0.4 | 0.004 | 20 | 0.042 |
| Biological Process & Structural variables | 0.5 | 0.004 | 7 | 0.050 |
| Cellular Component & Structural variables | 0.5 | 0.004 | 12 | 0.011 |
| Protein abundance & Responsiveness & TATA box | 0.1 | 0.002 | 15 | 0.000 |
| Protein abundance & Structural variables | 0.1 | 0.002 | 4 | 0.040 |
| Protein abundance & Molecular Function | 0.2 | 0.002 | 19 | 0.109 |
| Protein abundance & Biological Process | 0.4 | 0.002 | 21 | 0.005 |
| Protein abundance & Cellular Component | 0.3 | 0.002 | 14 | 0.011 |
| Responsiveness & Structural variables | 0.1 | 0.002 | 10 | 0.044 |
| Responsiveness & Molecular Function | 0.3 | 0.002 | 23 | 0.069 |
| Responsiveness & Biological Process | 0.6 | 0.002 | 19 | 0.002 |
| Responsiveness & Cellular Component | 0.4 | 0.002 | 19 | 0.011 |
| TATA box & Structural variables | 0.1 | 0.002 | 8 | 0.098 |
| TATA box & Molecular Function | 0.3 | 0.002 | 26 | 0.213 |
| TATA box & Biological Process | 0.5 | 0.002 | 15 | 0.131 |
| TATA box & Cellular Component | 0.3 | 0.002 | 12 | 0.260 |
| Cho et al. – EDA (grouping 1) | 0.4 | 0.001 | 23 | 0.318 |
| Cho et al. – EDA (grouping 2) | 0.4 | 0.001 | 6 | 0.115 |
| Cho et al. – G&S SHAVING (grouping 1) | 0.6 | 0.002 | 45 | 0.006 |
| Cho et al. – G&S SHAVING (grouping 2) | 0.6 | 0.002 | 36 | 0.003 |
| Gasch et al. – EDA (grouping 1) | 0.4 | 0.001 | 17 | 0.005 |
| Gasch et al. – EDA (grouping 2) | 0.4 | 0.001 | 21 | 0.004 |
| Gasch et al. – G&S SHAVING (grouping 1) | 0.6 | 0.001 | 56 | 0.023 |
| Gasch et al. – G&S SHAVING (grouping 2) | 0.7 | 0.001 | 35 | 0.019 |
This table shows the certainty factor (CF), confidence and support thresholds set in each experiment, as well as the total number of rules and the false discovery rate (FDR) obtained in each case.
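The three rule quality measures thresholded above can be sketched as follows. This excerpt does not define them, so the sketch makes assumptions: a transaction's degree in an itemset is the minimum of its item memberships, support is the mean of those degrees, and the certainty factor follows the usual definition from the association rule literature. The item names and data are toy values.

```python
def itemset_degree(transaction, itemset):
    """Degree to which a transaction contains an itemset (minimum t-norm, an assumption)."""
    return min(transaction.get(item, 0.0) for item in itemset)


def support(transactions, itemset):
    """Fuzzy support: mean membership degree of the itemset over all transactions."""
    return sum(itemset_degree(t, itemset) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, relative to the antecedent alone."""
    supp_a = support(transactions, antecedent)
    if supp_a == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / supp_a


def certainty_factor(conf, supp_c):
    """Certainty factor of a rule given its confidence and the consequent's support."""
    if conf > supp_c:
        return (conf - supp_c) / (1 - supp_c)
    if conf < supp_c:
        return (conf - supp_c) / supp_c
    return 0.0


# Toy transactions: item -> membership degree (hypothetical values).
A = frozenset({"Gene length = SHORT"})
C = frozenset({"Gene orientation = TANDEM"})
DATA = [
    {"Gene length = SHORT": 1.0, "Gene orientation = TANDEM": 1.0},
    {"Gene length = SHORT": 0.5, "Gene orientation = TANDEM": 1.0},
    {"Gene length = SHORT": 0.5},
    {},
]

if __name__ == "__main__":
    conf = confidence(DATA, A, C)
    print(support(DATA, A | C), conf, certainty_factor(conf, support(DATA, C)))
```

Under this definition the CF is positive only when the antecedent raises the likelihood of the consequent above its baseline support, which is why a rule can have high confidence and still a low CF.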
Structural variables
| Sup. | Conf. | CF | Association rule |
| 0.12 | 0.40 | 0.15 | |
| 0.12 | 0.38 | 0.14 | |
| 0.12 | 0.41 | 0.16 | |
| 0.12 | 0.40 | 0.14 | |
| 0.13 | 0.41 | 0.17 | |
| 0.13 | 0.43 | 0.18 | |
| 0.13 | 0.44 | 0.21 | |
| 0.13 | 0.44 | 0.22 | |
| 0.18 | 0.63 | 0.24 | |
| 0.23 | 0.56 | 0.15 | |
| 0.20 | 0.40 | 0.16 | |
| 0.20 | 0.68 | 0.37 | |
| 0.19 | 0.36 | 0.10 | |
| 0.19 | 0.65 | 0.27 | |
| 0.13 | 0.42 | 0.17 | |
| 0.13 | 0.41 | 0.17 | |
| 0.14 | 0.46 | 0.23 | |
| 0.14 | 0.46 | 0.23 | |
| 0.038 | 0.48 | 0.12 | |
| 0.010 | 0.41 | 0.17 | |
| 0.015 | 0.39 | 0.14 | |
This table shows the selected rules involving structural features.
Protein abundance, responsiveness and TATA box
| Sup. | Conf. | CF | Association rule |
| 0.092 | 0.48 | 0.12 | |
| 0.087 | 0.45 | 0.22 | |
| 0.10 | 0.40 | 0.16 | |
| 0.10 | 0.35 | 0.13 | |
| 0.11 | 0.39 | 0.14 | |
| 0.074 | 0.40 | 0.15 | |
| 0.096 | 0.37 | 0.12 | |
| 0.11 | 0.44 | 0.21 | |
| 0.11 | 0.38 | 0.17 | |
| 0.10 | 0.37 | 0.10 | |
| 0.055 | 0.41 | 0.17 | |
| 0.058 | 0.44 | 0.21 | |
This table shows some rules obtained when looking for relations between the protein abundance, the responsiveness, the TATA box and the remaining variables.
GO terms and structural variables. First approach
| Sup. | Conf. | CF | Association rule |
| 0.0041 | 0.88 | 0.84 | |
| 0.0017 | 1 | 1 | |
| 0.023 | 0.57 | 0.39 | |
This table shows some rules obtained when looking for relations between the GO terms and the structural variables. These rules were obtained with the first approach, i.e. when considering all the rules involving GO terms.
GO terms. Rule reduction rate
| Variables | Number of rules before | Number of rules after | Rule reduction rate |
| Molecular Function & Structural variables | 38 | 20 | 47% |
| Biological Process & Structural variables | 11 | 7 | 36% |
| Cellular Component & Structural variables | 24 | 12 | 50% |
| Protein abundance & Molecular Function | 34 | 19 | 44% |
| Protein abundance & Biological Process | 37 | 21 | 43% |
| Protein abundance & Cellular Component | 23 | 14 | 39% |
| Responsiveness & Molecular Function | 45 | 23 | 49% |
| Responsiveness & Biological Process | 28 | 19 | 32% |
| Responsiveness & Cellular Component | 50 | 19 | 62% |
| TATA box & Molecular Function | 53 | 26 | 51% |
| TATA box & Biological Process | 17 | 15 | 12% |
| TATA box & Cellular Component | 37 | 12 | 68% |
| Cho et al. – EDA (grouping 1) | 24 | 23 | 4% |
| Cho et al. – EDA (grouping 2) | 6 | 6 | 0% |
| Cho et al. – G&S SHAVING (grouping 1) | 98 | 45 | 54% |
| Cho et al. – G&S SHAVING (grouping 2) | 79 | 36 | 54% |
| Gasch et al. – EDA (grouping 1) | 21 | 17 | 19% |
| Gasch et al. – EDA (grouping 2) | 25 | 21 | 16% |
| Gasch et al. – G&S SHAVING (grouping 1) | 95 | 56 | 41% |
| Gasch et al. – G&S SHAVING (grouping 2) | 77 | 35 | 55% |
This table shows the number of rules obtained in the experiments where GO terms are involved before and after applying the rule reduction. There is also a column indicating the rule reduction rate.
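The rates in the table are consistent with reading the rule reduction rate as the fraction of rules removed, rounded to a whole percentage. A minimal sketch of that reading (the function name is made up for the example):

```python
def rule_reduction_rate(before, after):
    """Percentage of rules removed by the reduction step, rounded to an integer."""
    return round(100 * (before - after) / before)


if __name__ == "__main__":
    # (before, after) pairs copied from rows of the table above.
    for before, after in [(38, 20), (37, 12), (77, 35), (6, 6)]:
        print(before, after, f"{rule_reduction_rate(before, after)}%")
```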
GO terms and structural variables. Second approach
| Sup. | Conf. | CF | Association rule |
| 0.028 | 0.77 | 0.67 | |
| 0.01 | 0.78 | 0.69 | |
This table shows some rules obtained when looking for relations between the GO terms and the structural variables. These rules were obtained with the second approach, i.e. when groups of rules representing the same knowledge are merged into one general rule.
Biclusters
| Sup. | Conf. | CF | Association rule |
| 0.0029 | 0.54 | 0.45 | |
| 0.0033 | 0.61 | 0.45 | |
| 0.0018 | 0.68 | 0.46 | |
| 0.0022 | 0.80 | 0.74 | |
| 0.0012 | 0.43 | 0.40 | |
| 0.0039 | 0.65 | 0.5 | |
| 0.0029 | 0.48 | 0.44 | |
| 0.0033 | 0.81 | 0.73 | |
| 0.0036 | 0.89 | 0.85 | |
| 0.0037 | 0.90 | 0.89 | |
| 0.0037 | 0.90 | 0.89 | |
| 0.0037 | 0.90 | 0.87 | |
| 0.0035 | 0.86 | 0.78 | |
| 0.0035 | 0.86 | 0.85 | |
| 0.0035 | 0.86 | 0.85 | |
| 0.0107 | 0.92 | 0.89 | |
| 0.0073 | 0.63 | 0.41 | |
| 0.0019 | 0.71 | 0.69 | |
| 0.0017 | 0.64 | 0.61 | |
| 0.0017 | 0.64 | 0.62 | |
This table shows some rules obtained when looking for relations between the gene expression patterns discovered by the biclustering algorithms and the remaining variables.
Figure 2. Biclusters 1 & 2. This figure shows the gene expression pattern represented by biclusters 1 (A) and 2 (B).
Figure 3. Biclusters 3 & 4. This figure shows the gene expression pattern represented by biclusters 3 (A) and 4 (B).
Figure 4. Biclusters 5 & 6. This figure shows the gene expression pattern represented by biclusters 5 (A) and 6 (B).
ANOVAs for Fuzzy – Crisp comparison
| Rule quality measure | | Mean-Crisp | Mean-Fuzzy |
| Support | 1, 80 | 0.0080 | 0.0073 |
| Confidence | 1, 13 | 0.777 | 0.757 |
| Certainty Factor | 1, 47 | 0.622 | 0.606 |
This table shows the results of the ANOVAs carried out to compare fuzzy and crisp Supports, Confidences and Certainty Factors.
Some rules obtained with the fuzzy and crisp algorithms
| C-Sup. | F-Sup. | C-Conf. | F-Conf. | C-CF | F-CF | Association rule |
| 0.0039 | 0.0030 | 0.70 | 0.53 | 0.60 | 0.43 | |
| 0.0044 | 0.0036 | 1 | 0.81 | 1 | 0.75 | |
| 0.0055 | 0.0044 | 0.71 | 0.56 | 0.58 | 0.41 | |
| 0.0032 | 0.038 | 0.39 | 0.48 | 0.09 | 0.12 | |
This table shows some rules which were obtained with the fuzzy and crisp algorithms. For each of them their fuzzy and crisp Support, Confidence and CF values are provided.
Figure 5. Comparison between fuzzy and crisp results 1. A) The histogram shows the distribution of the genes annotated in the term electron transport along the protein abundance domain. The graph below describes how the fuzzy sets are defined in this domain. The red dashed lines show the percentiles p33 and p66, i.e. the borders of the crisp sets. B) The same but for the genes annotated in the term snoRNA binding. Only the percentile p66 is shown in this case.
Figure 6. Comparison between fuzzy and crisp results 2. A) The histogram shows the distribution of the genes that belong to bicluster 5 along the responsiveness domain. The graph below describes how the fuzzy sets are defined in this domain. The red dashed lines show the percentiles p33 and p66, i.e. the borders of the crisp sets. B) The same but for the genes located at chromosome 16 and the intergenic length domain.
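The crisp sets referred to in these captions are bounded by the percentiles p33 and p66. A minimal sketch of such a partition is given below; the label names, the inclusive treatment of the borders and the use of `statistics.quantiles` with the `inclusive` method are assumptions made for the example.

```python
from statistics import quantiles


def crisp_borders(values):
    """Return (p33, p66) of the observed values."""
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points: p1..p99
    return cuts[32], cuts[65]


def crisp_label(x, p33, p66):
    """Assign exactly one label; a value just past a border changes label abruptly."""
    if x <= p33:
        return "LOW"
    if x <= p66:
        return "MEDIUM"
    return "HIGH"


if __name__ == "__main__":
    p33, p66 = crisp_borders(list(range(101)))  # toy domain 0..100
    for x in (32.9, 33.1, 66.1):
        print(x, crisp_label(x, p33, p66))
```

Two genes with nearly identical values on either side of p33 receive different crisp labels, whereas fuzzy sets would give both genes partial membership in the two neighbouring labels.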
An example of a frequent item list
| Index | Item | Support |
| 1 | {Gene orientation = TANDEM} | 7 |
| 2 | {Gene length = SHORT} | 6 |
| 3 | {Intergenic length = MEDIUM} | 5.98 |
| 4 | {Intergenic length = LARGE} | 4.4 |
| 5 | {Gene length = LARGE} | 4.4 |
| 6 | {Intergenic length = SHORT} | 4 |
| ... | ... | ... |
This table shows an example of a frequent item list obtained during the first step of the Fuzzy TD-FP Growth algorithm.
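The non-integer supports in the table suggest that each item's support is the sum of its membership degrees over all transactions. The sketch below builds a frequent item list under that assumption, dropping items below a minimum support and sorting the rest in descending order, as is usual in the first pass of FP-growth style algorithms. The function name, threshold and data are illustrative only.

```python
from collections import defaultdict


def frequent_item_list(transactions, min_support):
    """Sum membership degrees per item, filter by min_support, sort descending."""
    totals = defaultdict(float)
    for transaction in transactions:
        for item, degree in transaction.items():
            totals[item] += degree
    frequent = [(item, total) for item, total in totals.items() if total >= min_support]
    # Sort by support (highest first); break ties by item name for a stable order.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent


# Toy transactions: item -> membership degree (hypothetical values).
TOY = [
    {"Gene orientation = TANDEM": 1.0, "Gene length = SHORT": 0.75, "Gene length = LARGE": 0.25},
    {"Gene orientation = TANDEM": 1.0, "Gene length = SHORT": 0.5},
    {"Gene length = SHORT": 0.25, "Gene length = LARGE": 0.25},
]

if __name__ == "__main__":
    for index, (item, total) in enumerate(frequent_item_list(TOY, 1.0), start=1):
        print(index, item, total)
```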
Figure 7. Complete Fuzzy-FP Tree. This figure shows an example of a complete Fuzzy-FP tree. Each node contains two membership degree lists; only one is shown in the figure for clarity, since initially both contain the same values.
Figure 8. Procedure for Fuzzy-FP Tree construction. This figure shows the pseudocode of the algorithm followed to build the Fuzzy-FP tree.
Figure 9. Frequent itemsets generation. This figure shows the pseudocode of the algorithm followed to traverse the Fuzzy-FP tree and obtain the frequent itemsets.