There is no doubt that both computational biology and
bioinformatics, and the interface of computer science and
biology in general, are central to the future of biological
research. The disciplines span a process that begins with data
collection, analysis, classification, and integration, and
ends with interpretation, modeling, visualization, and
prediction. Data mining plays a role in the middle of this
process. Overall, the focus is on identifying opportunities
and developing computational solutions (including algorithms,
models, tools, and databases) that can be used for
experimental design, data analysis and interpretation, and
hypothesis generation.

Data mining is the search for hidden trends within large sets of
data. Data mining approaches are needed at all levels of genomics
and proteomics analyses. These studies can provide a wealth of
information and rapidly generate large quantities of data from the
analysis of biological specimens from healthy and diseased
tissues. The high dimensionality of data generated from these
studies will require the development of improved bioinformatics
and computational biology tools for efficient and accurate data
analyses.

This issue of the Journal of Biomedicine and Biotechnology
consists of seventeen papers that describe different
applications of data mining to both genomics and proteomics
studies in yeast, plants, and human cells and tissues.
Papers by Bensmail et al, Ghosh and Chinnaiyan, and Mao et al
present different classification and clustering approaches for
disease biomarker discovery. Genomics and proteomics studies
have shown great promise and have been applied to
generating expression profiles and elucidating
expression networks in different organisms, as shown in the
papers by Samsa et al, Mungur et al, Liu et al,
Baldwin et al, and Joy et al. Data mining in
genomics and proteomics studies reveals new
regulatory pathways and mechanisms in different health and
disease conditions as presented by Wren and Garner, and
provides comparative sequence analysis approaches as presented
by Gambin and Otto, and Gao et al. Those studies have also
provided approaches for subcellular localization of proteins,
suggesting that such approaches can produce an objective
systematics for protein location and provide an important
starting point for discovering sequence motifs that determine
localization, as presented by Chen and Murphy. Chen et al
studied the performance of five nonparametric tests for selecting
genes and showed that the popular F test does not perform well
on gene expression data, because heterogeneous behavior, which
the F test does not assume, dominates such data.
Corder et al applied a statistical approach called
grade of membership (GOM) and found that brain
hypoperfusion contributes to dementia, possibly to Alzheimer's
disease (AD) pathogenesis; their results also raise the
possibility that the APOE ϵ4 allele contributes directly to
heart valve and myocardial damage. Hand and Heard present in their
review article various tools for finding relevant subgroups in
gene expression data. Alkharouf et al use an online analytical
processing (OLAP) cube to mine a time series
experiment designed to identify genes associated with
resistance of soybean to the soybean cyst nematode, which is a
devastating pest of soybean. Brylinski et al created a
sequence-to-structure library based on the complete Protein
Data Bank (PDB). An early-stage folding conformation and
information entropy were then used for structure analysis and
classification.

Whilst postgenomic science is producing vast data torrents, it is
well known that data do not equal knowledge and so the extraction
of the most meaningful parts of these data is key to the
generation of useful new knowledge. More sophisticated data mining
strategies are needed for mining such high-dimensional data to
generate useful relationships, rules, and predictions.
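
As a small illustration of the gene-selection point summarized above (the F test versus rank-based alternatives), the following Python sketch compares the one-way F statistic with the Kruskal-Wallis H statistic on a toy gene. It is a minimal sketch under stated assumptions: Kruskal-Wallis is chosen here only as a familiar nonparametric test and is not claimed to be one of the five tests studied by Chen et al; the function names and the expression values are hypothetical, and no tie correction is applied.

```python
from itertools import chain


def f_statistic(groups):
    """One-way ANOVA F statistic; `groups` holds one list of values per class."""
    values = list(chain.from_iterable(groups))
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of positions i..j
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic computed from ranks (no tie correction)."""
    values = list(chain.from_iterable(groups))
    n = len(values)
    ranks = average_ranks(values)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical expression values for one gene in two classes.
clean = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
# Same gene, with one extreme measurement in the second class.
with_outlier = [[1.0, 2.0, 3.0], [4.0, 5.0, 600.0]]

for name, data in (("clean", clean), ("with outlier", with_outlier)):
    print(name, "F =", round(f_statistic(data), 3),
          "H =", round(kruskal_wallis_h(data), 3))
```

In this toy case the two classes are perfectly separated in both data sets, so the rank-based H statistic is identical for both, whereas the F statistic falls from 13.5 to roughly 1.03 once the single extreme value inflates the within-class variance. This is only one simple way in which heterogeneous data can mislead a variance-based test; it does not reproduce the analysis in the paper.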