Visual management of large scale data mining projects.

Abstract

This paper describes a unified framework for visualizing the preparations for, and results of, hundreds of machine learning experiments. These experiments were designed to improve the accuracy of enzyme functional predictions from sequence, and in many cases were successful. Our system provides graphical user interfaces for defining and exploring training datasets and various representational alternatives, for inspecting the hypotheses induced by various types of learning algorithms, for visualizing the global results, and for inspecting in detail the results for specific training sets (functions) and examples (proteins). The visualization tools serve as a navigational aid through a large amount of sequence data and induced knowledge. They provided substantial help in understanding both the significance and the underlying biological explanations of our successes and failures. Using these visualizations, it was possible to efficiently identify weaknesses of the modular sequence representations and induction algorithms, which in turn suggested better learning strategies.

Our data mining visualization toolkit was developed in the context of accurately predicting enzyme function from protein sequence data. Previous work demonstrated that approximately 6% of enzyme protein sequences are likely to be assigned incorrect functions on the basis of sequence similarity alone. To test the hypothesis that more detailed sequence analysis, using machine learning techniques and modular domain representations, could address many of these failures, we designed a series of more than 250 experiments using information-theoretic decision tree induction and naive Bayesian learning on local sequence domain representations of problematic enzyme function classes. In more than half of these cases, our methods were able to perfectly discriminate among the various possible functions of similar sequences. We developed and tested our visualization techniques on this application.
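The abstract does not give the details of the induction step, so the following is only a minimal sketch of the general idea behind information-theoretic decision tree induction over a modular domain representation. It assumes each protein is represented as a set of domain names paired with a function label; the names `entropy`, `information_gain`, `domA`, `function_1`, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, domain):
    """Entropy reduction from splitting on presence/absence of one domain.

    `examples` is a list of (set_of_domains, function_label) pairs.
    """
    labels = [label for _, label in examples]
    present = [label for domains, label in examples if domain in domains]
    absent = [label for domains, label in examples if domain not in domains]
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (present, absent)
        if part
    )
    return entropy(labels) - remainder


# Hypothetical toy data: domain names and function labels are made up.
examples = [
    ({"domA", "domB"}, "function_1"),
    ({"domA", "domB"}, "function_1"),
    ({"domA", "domC"}, "function_2"),
    ({"domA", "domC"}, "function_2"),
]

# A decision tree learner would split first on the highest-gain domain.
best = max(["domA", "domB", "domC"], key=lambda d: information_gain(examples, d))
```

In this toy case the shared domain `domA` carries no information about function, while `domB` and `domC` each separate the two classes completely, which illustrates how similar sequences with different functions might be told apart by local domain content.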

Year: 2000 PMID: 10902176 PMCID: PMC2709531 DOI: 10.1142/9789814447331_0026

Source DB: PubMed Journal: Pac Symp Biocomput ISSN: 2335-6928
