Using genetic algorithms to select most predictive protein features.

Andrew Kernytsky, Burkhard Rost.

Abstract

Many important characteristics of proteins, such as biochemical activity and subcellular localization, present a challenge to machine-learning methods: it is often difficult to encode the appropriate input features at the residue level for the purpose of making a prediction for the entire protein. The problem is usually that the biophysics of the connection between a machine-learning method's input (sequence feature) and its output (observed phenomenon to be predicted) remains unknown; in other words, we may only know that a certain protein is an enzyme (output) without knowing which region may contain the active site residues (input). The goal then becomes to dissect a protein into a vast set of sequence-derived features and to correlate those features with the desired output. We introduce a framework that begins with a set of global sequence features and then vastly expands the feature space by generically encoding the coexistence of residue-based features. It is this combination of individual features, that is, the step from the fractions of serine and buried residues (input space 20 + 2) to the fraction of buried serine (input space 20 * 2), that implicitly shifts the search space from global feature inputs to features that can capture very local evidence such as the individual residues of a catalytic triad. The vast feature space created is explored by a genetic algorithm (GA) paired with neural networks and support vector machines. We find that the GA is critical for selecting combinations of features that are neither too general, resulting in poor performance, nor too specific, leading to overtraining. The final framework manages to effectively sample a feature space that is far too large for exhaustive enumeration. We demonstrate the power of the concept by applying it to prediction of protein enzymatic activity. (c) 2008 Wiley-Liss, Inc.
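
The step from 20 + 2 global fractions to 20 * 2 co-occurrence fractions, and the GA search over feature subsets, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The two-letter state alphabet ("b" for buried, "e" for exposed), the function names, and all GA settings (population size, one-point crossover, mutation rate) are assumptions made for illustration. The fitness function is supplied by the caller; in the abstract's setting it would be the performance of a neural network or support vector machine trained on the selected features, which this sketch does not include.

```python
import random
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STATES = "be"  # assumed per-residue labels: b = buried, e = exposed


def global_features(seq, states):
    """Fraction of each residue type and of each state: 20 + 2 values."""
    n = len(seq)
    feats = {aa: seq.count(aa) / n for aa in AMINO_ACIDS}
    feats.update({s: states.count(s) / n for s in STATES})
    return feats


def pair_features(seq, states):
    """Fraction of each (residue, state) co-occurrence: 20 * 2 values."""
    n = len(seq)
    counts = Counter(zip(seq, states))
    return {aa + s: counts[(aa, s)] / n
            for aa, s in product(AMINO_ACIDS, STATES)}


def select_features(fitness, n_features, pop_size=20, generations=30,
                    mutation_rate=0.05, seed=0):
    """Toy GA over boolean masks; `fitness` maps a mask to a score."""
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half unchanged, refill with mutated offspring.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


# Hypothetical 4-residue example: three serines, two buried positions.
seq, states = "SSAS", "bebe"
print(global_features(seq, states)["S"])  # 0.75 (fraction of serine)
print(global_features(seq, states)["b"])  # 0.5  (fraction buried)
print(pair_features(seq, states)["Sb"])   # 0.25 (fraction of buried serine)
```

In the example, the global fractions of serine (0.75) and of buried residues (0.5) do not determine the fraction of buried serine (0.25), which is the kind of local information the combined features are meant to expose. A mask returned by `select_features` would then pick a subset of these features to pass to a classifier.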

Year:  2009        PMID: 18798568     DOI: 10.1002/prot.22211

Source DB:  PubMed          Journal:  Proteins        ISSN: 0887-3585


  8 in total

1.  Genetic algorithm optimization in drug design QSAR: Bayesian-regularized genetic neural networks (BRGNN) and genetic algorithm-optimized support vectors machines (GA-SVM). (Review)

Authors:  Michael Fernandez; Julio Caballero; Leyden Fernandez; Akinori Sarai
Journal:  Mol Divers       Date:  2010-03-20       Impact factor: 2.943

2.  Contrastive learning on protein embeddings enlightens midnight zone.

Authors:  Michael Heinzinger; Maria Littmann; Ian Sillitoe; Nicola Bordin; Christine Orengo; Burkhard Rost
Journal:  NAR Genom Bioinform       Date:  2022-06-11

3.  Machine learning on normalized protein sequences.

Authors:  Dominik Heider; Jens Verheyen; Daniel Hoffmann
Journal:  BMC Res Notes       Date:  2011-03-31

4.  Insights into the classification of small GTPases.

Authors:  Dominik Heider; Sascha Hauke; Martin Pyka; Daniel Kessler
Journal:  Adv Appl Bioinform Chem       Date:  2010-05-21

5.  Improved Bevirimat resistance prediction by combination of structural and sequence-based classifiers.

Authors:  J Nikolaj Dybowski; Mona Riemenschneider; Sascha Hauke; Martin Pyka; Jens Verheyen; Daniel Hoffmann; Dominik Heider
Journal:  BioData Min       Date:  2011-11-14       Impact factor: 2.522

6.  Predicting Bevirimat resistance of HIV-1 from genotype.

Authors:  Dominik Heider; Jens Verheyen; Daniel Hoffmann
Journal:  BMC Bioinformatics       Date:  2010-01-20       Impact factor: 3.169

7.  Automatic quantitative MRI texture analysis in small-for-gestational-age fetuses discriminates abnormal neonatal neurobehavior.

Authors:  Magdalena Sanz-Cortes; Giuseppe A Ratta; Francesc Figueras; Elisenda Bonet-Carne; Nelly Padilla; Angela Arranz; Nuria Bargallo; Eduard Gratacos
Journal:  PLoS One       Date:  2013-07-26       Impact factor: 3.240

8.  Effective automated feature construction and selection for classification of biological sequences.

Authors:  Uday Kamath; Kenneth De Jong; Amarda Shehu
Journal:  PLoS One       Date:  2014-07-17       Impact factor: 3.240
