
Current Developments in Machine Learning Techniques in Biological Data Mining.

Gerard G Dumancas, Indra Adrianto, Ghalib Bello, Mikhail Dozmorov.

Abstract

This supplement focuses on the use of machine learning techniques to extract meaningful information from biological data. Published under Bioinformatics and Biology Insights, it aims to provide scientists and researchers working in this rapidly evolving field with online, open-access articles authored by leading international experts. Advances in biology have created massive opportunities for implementing modern computational and statistical techniques. Machine learning, a subfield of computer science, has in particular become an indispensable tool across a wide spectrum of bioinformatics applications. It is broadly used to investigate the mechanisms underlying specific diseases and to support biomarker discovery. Growth in this area brings a need for up-to-date, high-quality scholarly articles that advance the knowledge of scientists and researchers in the various applications of machine learning to mining biological data.


Year:  2017        PMID: 28469415      PMCID: PMC5390918          DOI: 10.1177/1177932216687545

Source DB:  PubMed          Journal:  Bioinform Biol Insights        ISSN: 1177-9322


Biology has evolved into a field that depends on mining meaningful information from data to answer the next biological question. Machine learning techniques are computational methods that use “experience” to improve performance or to make accurate predictions. Within the context of this supplement, experience refers to past information, such as electronic data available to the learner, which is then used for analysis.[1] Over the years, the collection of biological information has grown rapidly owing to the development and improvement of technologies and facilities. An excellent example is the Human Genome Project, which was launched in 1990 by the US Department of Energy and the US National Institutes of Health and completed in 2003. The rapid growth of these massive data sets created the need for computational and statistical methods to organize, maintain, and interpret biological results.[2] There is strong motivation to use machine learning methods in knowledge discovery and data mining to generate models with biological implications.
The relationship between machine learning and biology has a long and complex history. The perceptron, one of the earliest machine learning techniques, was an attempt to mimic and model the behavior of a biological neuron, and the field of artificial neural networks (ANNs) emerged from that attempt.[3] This supplement covers a wide array of topics involving the use of machine learning techniques to extract meaningful information from genetic and clinical data, with the primary objective of answering biological questions.
This supplement covers applications of machine learning in several areas, including predicting human leukocyte antigen (HLA)-peptide binding activity, integrating disparate short-read alignment algorithms for mapping next-generation sequencing reads to a reference genome, identifying similar diseases by semantic and genomic similarity, and developing risk assessment tools for predicting life expectancy using a genetic algorithm (GA) and weighted quantile sum (WQS) regression.
Network analysis, along with other machine learning techniques, is reviewed by Luo et al.[4] Their review summarizes methods and tools for predicting the binding properties of HLAs. The HLA system produces a variety of peptides that play a critical role in immune system regulation by recognizing foreign antigens and presenting them to different types of immune cells. The review covers a wide variety of machine learning methods for predicting HLA binding, from regression-based techniques and decision trees through support vector machines (SVMs), hidden Markov models, and ANNs. The authors discuss the strengths and limitations of each method and propose a combination approach that uses multiple types of prediction methods to address some of those limitations.
The complexity of the immune system is also addressed in the paper by Dozmorov et al.[5] The authors compare 2 methods, csSAM and DSection, for detecting cell type–specific differential expression by analyzing RNA-seq data obtained from a heterogeneous mixture of immune cells from healthy individuals and from patients diagnosed with the autoimmune disease systemic lupus erythematosus. Both methods use linear regression to estimate cell type–specific gene expression differences, whereas DSection also uses a Bayesian approach to estimate cell proportions.
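The regression idea behind cell type–specific expression estimates can be illustrated with a minimal sketch. It assumes that the cell-type fractions of each sample are known and that the measured expression of a gene is approximately the fraction-weighted sum of its expression in each cell type. This is not the csSAM or DSection implementation; the function names, the synthetic data, and the choice to fit each group separately and subtract the estimates are all illustrative assumptions.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def estimate_cell_type_expression(mixed, proportions):
    """Least-squares estimate of one gene's expression in each cell type.

    mixed:       expression of the gene in each heterogeneous sample
    proportions: per-sample list of cell-type fractions (assumed known)
    """
    k = len(proportions[0])
    # Normal equations of the regression: (P^T P) beta = P^T y
    ptp = [[sum(p[i] * p[j] for p in proportions) for j in range(k)] for i in range(k)]
    pty = [sum(p[i] * y for p, y in zip(proportions, mixed)) for i in range(k)]
    return solve_linear_system(ptp, pty)

def simulate_group(true_expression, n_samples, rng):
    """Synthetic mixtures: random cell fractions times per-cell-type expression."""
    proportions, mixed = [], []
    for _ in range(n_samples):
        raw = [rng.random() + 0.1 for _ in true_expression]
        total = sum(raw)
        fractions = [v / total for v in raw]
        proportions.append(fractions)
        signal = sum(f * e for f, e in zip(fractions, true_expression))
        mixed.append(signal + rng.gauss(0.0, 0.01))  # small measurement noise
    return mixed, proportions

rng = random.Random(0)
# Hypothetical gene: only the second cell type differs between the groups.
control_truth = [5.0, 2.0, 8.0]
case_truth = [5.0, 6.0, 8.0]
control_est = estimate_cell_type_expression(*simulate_group(control_truth, 40, rng))
case_est = estimate_cell_type_expression(*simulate_group(case_truth, 40, rng))
differences = [c - h for c, h in zip(case_est, control_est)]
print([round(d, 2) for d in differences])
```

In this toy setting the difference is recovered only for the cell type that actually changes, which is the kind of signal that can be diluted when the mixture is analyzed as a whole.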
Dozmorov et al[5] compared the results of the cell type–specific differential expression analysis with the genes differentially expressed in the heterogeneous mixture of cells. In addition to evaluating csSAM and DSection on RNA-seq data, the authors provide a brief methodologic overview of gold standard tools for differential expression analysis of RNA-seq data.
Adams et al[6] examined the use of a GA for determining variables predictive of mortality. Their manuscript offers a novel GA-based method for selecting predictive variables from health questionnaire data. The selected variables are then used to construct predictive models of 5-year mortality using various machine learning techniques: gradient boosting, ANN, elastic net, SVM, ridge regression, logistic regression, random forest, least absolute shrinkage and selection operator (LASSO), partial least squares-discriminant analysis, and decision trees. Parametric and nonparametric machine learning algorithms are emerging computational methods with increasing applications in bioinformatics and computational biology. The results of this study are expected to give computational biologists and bioinformaticians insight into using a GA in conjunction with machine learning techniques to select important variables efficiently and to determine their collective prediction accuracy. Optimizing variable selection for questionnaire data and constructing predictive models from the selected variables are of interest to researchers and clinicians alike.
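As a rough illustration of GA-based variable selection, the sketch below evolves binary masks over the candidate variables and scores each mask by the held-out accuracy of a simple classifier. The configuration used by Adams et al is not described here, so everything in the block is an assumption made for illustration: the synthetic data, the nearest-centroid classifier used as the fitness model, the size penalty, and the GA settings (truncation selection, single-point crossover, bit-flip mutation, elitism).

```python
import random

def make_data(rng, n_rows=200, n_vars=10, informative=(0, 3, 7)):
    """Synthetic questionnaire-like data; only `informative` columns track the outcome."""
    rows, labels = [], []
    for _ in range(n_rows):
        y = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_vars)]
        for j in informative:
            row[j] += 1.5 * y
        rows.append(row)
        labels.append(y)
    return rows, labels

def centroid_accuracy(mask, train, valid):
    """Validation accuracy of a nearest-centroid classifier on the masked variables."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    (x_train, y_train), (x_valid, y_valid) = train, valid
    centroids = {}
    for label in (0, 1):
        members = [x for x, y in zip(x_train, y_train) if y == label]
        centroids[label] = [sum(m[j] for m in members) / len(members) for j in cols]
    correct = 0
    for x, y in zip(x_valid, y_valid):
        dist = {lab: sum((x[j] - c) ** 2 for j, c in zip(cols, cen))
                for lab, cen in centroids.items()}
        if min(dist, key=dist.get) == y:
            correct += 1
    return correct / len(y_valid)

def fitness(mask, train, valid, penalty=0.01):
    # Reward validation accuracy and lightly penalize the number of selected variables.
    return centroid_accuracy(mask, train, valid) - penalty * sum(mask)

def genetic_selection(train, valid, n_vars, rng, pop_size=30, generations=25,
                      mutation_rate=0.05):
    """Evolve binary masks; returns the best mask in the final population."""
    def score(mask):
        return fitness(mask, train, valid)

    population = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: pop_size // 2]        # truncation selection
        children = [ranked[0][:]]                # elitism: carry over the best mask
        while len(children) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, n_vars)       # single-point crossover
            child = mom[:cut] + dad[cut:]
            # Bit-flip mutation
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=score)

rng = random.Random(1)
rows, labels = make_data(rng)
train = (rows[:140], labels[:140])
valid = (rows[140:], labels[140:])
best_mask = genetic_selection(train, valid, n_vars=10, rng=rng)
selected = [j for j, keep in enumerate(best_mask) if keep]
print("selected variables:", selected)
```

Because the fitness is measured on the validation split, a realistic analysis would keep a further untouched test set to report the accuracy of the final model.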
The study by Adams et al demonstrates the feasibility of various machine learning techniques for developing both prognostic and explanatory models from data collected via surveys or questionnaires.[6]
Bello et al[7] used a statistical technique known as weighted quantile sum (WQS) regression to develop a model that condenses the information from a variety of health markers into a composite index, referred to as the health status metric (HSM). The HSM can be used as a holistic measure of overall health and also as a risk score for predicting all-cause mortality. Indeed, the results of their study indicated that the index was highly predictive of life expectancy and of long-term health-related outcomes such as hospital utilization. Weighted quantile sum regression is a novel, penalized regression method developed to handle multicollinear data, that is, data in which complex correlation patterns exist among multiple variables. It controls the variance inflation arising from multicollinearity by imposing nonnegativity and unit-sum constraints on the model coefficients. It is a powerful alternative to other penalized regression techniques commonly used in machine learning, such as LASSO and the elastic net. The study demonstrates the utility of WQS in predictive modeling and in the development of risk scores.
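The structure of a WQS index can be sketched as follows: each marker is converted to a quantile score, the scores are combined with weights that are nonnegative and sum to 1, and the outcome is regressed on the resulting index. The block is a toy illustration under stated assumptions, not the procedure used by Bello et al: it uses three synthetic correlated markers, a continuous outcome, and a coarse grid search over the weight simplex as a stand-in for a proper constrained optimizer. All function names are hypothetical.

```python
import random

def quantile_scores(values, n_quantiles=4):
    """Replace each value by its rank-based quantile bin (0 .. n_quantiles - 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    scores = [0] * len(values)
    for rank, i in enumerate(order):
        scores[i] = rank * n_quantiles // len(values)
    return scores

def wqs_index(scored_columns, weights):
    """Weighted sum of quantile scores for each subject."""
    n = len(scored_columns[0])
    return [sum(w * col[i] for w, col in zip(weights, scored_columns)) for i in range(n)]

def simple_regression(x, y):
    """Ordinary least squares of y on x; returns (residual sum of squares, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        return float("inf"), 0.0
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    sse = sum((b - mean_y - slope * (a - mean_x)) ** 2 for a, b in zip(x, y))
    return sse, slope

def fit_wqs_grid(scored_columns, y, step=0.1):
    """Grid search over nonnegative, unit-sum weights for exactly three variables."""
    steps = round(1 / step)
    best = None
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            weights = [a / steps, b / steps, (steps - a - b) / steps]
            sse, slope = simple_regression(wqs_index(scored_columns, weights), y)
            if best is None or sse < best[0]:
                best = (sse, weights, slope)
    return best[1], best[2]

rng = random.Random(2)
n = 300
# Three markers sharing a common factor, so they are mutually correlated.
shared = [rng.gauss(0.0, 1.0) for _ in range(n)]
markers = [[s + rng.gauss(0.0, 1.0) for s in shared] for _ in range(3)]
scored = [quantile_scores(m) for m in markers]
true_weights = [0.6, 0.3, 0.1]
y = [2.0 * idx + rng.gauss(0.0, 0.5) for idx in wqs_index(scored, true_weights)]

weights, slope = fit_wqs_grid(scored, y)
print("estimated weights:", weights, "slope:", round(slope, 2))
```

Because the weights are confined to the simplex, correlated markers cannot receive large coefficients of opposite sign, which is the intuition behind the control of variance inflation described above.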
References

1.  B-Cell and Monocyte Contribution to Systemic Lupus Erythematosus Identified by Cell-Type-Specific Differential Expression Analysis in RNA-Seq Data.

Authors:  Mikhail G Dozmorov; Nicolas Dominguez; Krista Bean; Susan R Macwana; Virginia Roberts; Edmund Glass; Judith A James; Joel M Guthridge
Journal:  Bioinform Biol Insights       Date:  2015-10-08

2.  Machine Learning Methods for Predicting HLA-Peptide Binding Activity.

Authors:  Heng Luo; Hao Ye; Hui Wen Ng; Leming Shi; Weida Tong; Donna L Mendrick; Huixiao Hong
Journal:  Bioinform Biol Insights       Date:  2015-10-11

3.  Machine learning and its applications to biology.

Authors:  Adi L Tarca; Vincent J Carey; Xue-wen Chen; Roberto Romero; Sorin Drăghici
Journal:  PLoS Comput Biol       Date:  2007-06

4.  Development and Validation of a Clinical Risk-Assessment Tool Predictive of All-Cause Mortality.

Authors:  Ghalib A Bello; Gerard G Dumancas; Chris Gennings
Journal:  Bioinform Biol Insights       Date:  2015-09-01

5.  Development and Application of a Genetic Algorithm for Variable Optimization and Predictive Modeling of Five-Year Mortality Using Questionnaire Data.

Authors:  Lucas J Adams; Ghalib Bello; Gerard G Dumancas
Journal:  Bioinform Biol Insights       Date:  2015-11-08
