
Hard Data Analytics Problems Make for Better Data Analysis Algorithms: Bioinformatics as an Example.

Jaume Bacardit, Paweł Widera, Nicola Lazzarini, Natalio Krasnogor.

Abstract

Data mining and knowledge discovery techniques have greatly progressed in the last decade. They are now able to handle larger and larger datasets, process heterogeneous information, integrate complex metadata, and extract and visualize new knowledge. Often these advances were driven by new challenges arising from real-world domains, with biology and biotechnology a prime source of diverse and hard (e.g., high volume, high throughput, high variety, and high noise) data analytics problems. The aim of this article is to show the broad spectrum of data mining tasks and challenges present in biological data, and how these challenges have driven us over the years to design new data mining and knowledge discovery procedures for biodata. This is illustrated with the help of two kinds of case studies. The first kind is focused on the field of protein structure prediction, where we have contributed in several areas: by designing, through regression, functions that can distinguish between good and bad models of a protein's predicted structure; by creating new measures to characterize aspects of a protein's structure associated with individual positions in a protein's sequence, measures containing information that might be useful for protein structure prediction; and by creating accurate estimators of these structural aspects. The second kind of case study is focused on omics data analytics, a class of biological data characterized by extremely high dimensionality. Our methods were able not only to generate very accurate classification models, but also to discover new biological knowledge that was later ratified by experimentalists. Finally, we describe several strategies to tightly integrate knowledge extraction and data mining in order to create a new class of biodata mining algorithms that can natively embrace the complexity of biological data, efficiently generate accurate information in the form of classification/regression models, and extract valuable new knowledge. Thus, a complete data-to-information-to-knowledge pipeline is presented.

Year:  2014        PMID: 25276500      PMCID: PMC4174911          DOI: 10.1089/big.2014.0023

Source DB:  PubMed          Journal:  Big Data        ISSN: 2167-6461            Impact factor:   2.128


