Adrien Coulet, Malika Smaïl-Tabbone, Pascale Benlian, Amedeo Napoli, Marie-Dominique Devignes.
Abstract
BACKGROUND: The complexity and volume of post-genomic data are two major factors limiting the application of Knowledge Discovery in Databases (KDD) methods in the life sciences. Bio-ontologies can now play key roles in knowledge discovery in the life sciences by providing semantics to data and to extracted units. They do so by taking advantage of progress in Semantic Web technologies, both in the understanding and in the availability of tools for knowledge representation, extraction, and reasoning.
Year: 2008 PMID: 18460176 PMCID: PMC2367630 DOI: 10.1186/1471-2105-9-S4-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of the proposed method. The KDD process is divided into three main steps: data preparation, data mining, and data interpretation. The figure details data preparation within the KDD process and illustrates our method of data selection guided by domain knowledge. Data relevant to the study are collected from various resources such as genomic variation databases, published pharmacogenomic studies and private datasets. Various operations are applied to these data: cleaning, integration and transformation. These operations involve first the instantiation of a knowledge base (KB) (1), and second the design of the “initial dataset” (2). In this study, a dataset is defined as a relation between a set of objects (rows) and a set of attributes (columns). A mapping is then built between the objects and attributes of this dataset and the instances of the KB (3). Data selection results from the definition of a subset of instances in the KB (4), which allows the selection of the corresponding objects and attributes through the mapping. This process takes the initial dataset and the KB as inputs, is controlled by the domain expert, and yields the “reduced dataset”. Characteristics of the ontology, such as subsumption relationships, properties and class descriptions, are used to guide the definition of meaningful instance subsets. These subsets are in turn used for data selection. Data mining algorithms are then applied to the reduced dataset. The results of the mining operation are interpreted in terms of knowledge units that can eventually be integrated into the knowledge base.
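The selection step described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the toy dataset, the attribute-to-instance mapping, the class names, and the helpers `select_instances` and `reduce_dataset` are all hypothetical and do not come from the paper, which relies on an ontology-backed knowledge base rather than plain dictionaries.

```python
# Hypothetical toy data; none of these names or values come from the paper.
# Initial dataset: a relation between objects (rows) and attributes (columns).
initial_dataset = {
    "patient_1": {"variant_a": 1, "variant_b": 0, "ldl_level": "high"},
    "patient_2": {"variant_a": 0, "variant_b": 1, "ldl_level": "normal"},
}

# Step (3): mapping between dataset attributes and KB instances.
attribute_to_instance = {
    "variant_a": "kb:variant_a",
    "variant_b": "kb:variant_b",
    "ldl_level": "kb:ldl_concentration",
}

# Toy stand-in for the KB: each instance lists the classes it belongs to.
# A real KB would infer these memberships from subsumption relationships.
instance_classes = {
    "kb:variant_a": {"Variant", "CodingVariant"},
    "kb:variant_b": {"Variant"},
    "kb:ldl_concentration": {"Phenotype"},
}


def select_instances(wanted_class):
    """Step (4): define a subset of KB instances, here by class membership."""
    return {
        instance
        for instance, classes in instance_classes.items()
        if wanted_class in classes
    }


def reduce_dataset(dataset, mapping, selected_instances):
    """Keep only the attributes whose mapped instance was selected."""
    kept = [attr for attr, inst in mapping.items() if inst in selected_instances]
    return {
        obj: {attr: row[attr] for attr in kept if attr in row}
        for obj, row in dataset.items()
    }


selected = select_instances("CodingVariant")
reduced_dataset = reduce_dataset(initial_dataset, attribute_to_instance, selected)
print(reduced_dataset)
```

Selecting by class membership is just one way to define an instance subset; the caption also mentions properties and class descriptions as possible guides, which this sketch does not cover.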
Figure 2. Articulation between data and knowledge. Some classes of the SNP- and SO-Pharm ontologies are shown, as well as their assigned instances. The mapping between the objects and attributes of the FH dataset and the instances of the KB is shown schematically.
Table 1. Quantitative characterization of data mining results depending on attribute selection.
Table 1 gives the size of the output (number of itemsets and number of clusters) for the two data mining methods used in this experiment. Each column corresponds to a different selection of attributes in the FH dataset.
| Number of Variants | 289 | 231 | 126 | 198 |
| FI (FCI) {ratio FI/FCI} | 6928 (255) {27.17} | 314 (24) {13.08} | 304 (12) {25.33} | 300 (28) {10.71} |
| Clusters | 194 | 186 | 56 | 40 |
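The FI/FCI ratios in Table 1 can be recomputed from the counts, and the difference between the two counts can be illustrated on a toy example. The sketch below assumes that FI and FCI stand for frequent itemsets and frequent closed itemsets, the usual reading of these abbreviations. The transactions and the `min_support` threshold are hypothetical, and the brute-force enumeration is for illustration only, not the mining algorithm used in the study.

```python
from itertools import combinations

# Counts copied from Table 1: (FI, FCI) for each attribute selection.
table_1 = [(6928, 255), (314, 24), (304, 12), (300, 28)]
table_ratios = [round(fi / fci, 2) for fi, fci in table_1]
print(table_ratios)

# Hypothetical toy transactions, only to show what "closed" means.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "c"},
    {"c"},
]
min_support = 2  # assumed absolute support threshold


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


# Brute-force enumeration of all frequent itemsets (fine for toy data only).
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        count = support(set(combo))
        if count >= min_support:
            frequent[frozenset(combo)] = count

# An itemset is closed if no proper superset has the same support.
closed = {
    itemset: count
    for itemset, count in frequent.items()
    if not any(
        itemset < other and count == other_count
        for other, other_count in frequent.items()
    )
}

print(len(frequent), len(closed), round(len(frequent) / len(closed), 2))
```

A high FI/FCI ratio means that many frequent itemsets share their support with a larger itemset, so the closed itemsets summarize the same information more compactly.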
Figure 3. Tag-SNP variant unification. This figure focuses on some classes and instances from Figure 2. It develops the description of Haplotype and of the isHaplotypeMemberOf and isTaggedBy object properties, which are used to illustrate functional dependencies between instances of variants and tag_snp.
Table 2. Characteristics of the FH dataset.
The FH dataset results from a clinical study on Familial Hypercholesterolemia. Its size and composition are described in Table 2. Phenotype refers to phenotypic attributes, including for instance LDL concentration in blood. Genotype attributes include 289 genomic variations of the LDLR gene and 3 attributes relating to the presence of mutations in 3 other genes.
| Objects | 125 |
| Attributes (phenotype) | 12 |
| Attributes (genotype) | 292 |
| Attributes (total) | 304 |