Correct machine learning on protein sequences: a peer-reviewing perspective.

Ian Walsh, Gianluca Pollastri, Silvio C E Tosatto.   

Abstract

Machine learning methods are becoming increasingly popular for predicting protein features from sequences. Machine learning in bioinformatics can be powerful but also carries the risk of introducing unexpected biases, which may lead to an overestimation of performance. This article presents a set of guidelines to help both peer reviewers and authors avoid common machine learning pitfalls. Understanding the biology is necessary to produce useful data sets, which have to be large and diverse. Separating the training and test processes is imperative to avoid overselling method performance, which also depends on several hidden parameters. A novel predictor must always be compared with several existing methods, including simple baseline strategies. Using the presented guidelines will help nonspecialists appreciate the critical issues in machine learning.
© The Author 2015. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

Keywords:  evaluation; machine learning; posttranslational modification; predictor; protein sequence; training

Year:  2015        PMID: 26411473     DOI: 10.1093/bib/bbv082

Source DB:  PubMed          Journal:  Brief Bioinform        ISSN: 1467-5463            Impact factor:   11.622


  18 in total

1.  DOME: recommendations for supervised machine learning validation in biology.

Authors:  Ian Walsh; Dmytro Fishman; Dario Garcia-Gasulla; Tiina Titma; Gianluca Pollastri; Jennifer Harrow; Fotis E Psomopoulos; Silvio C E Tosatto
Journal:  Nat Methods       Date:  2021-07-27       Impact factor: 28.547

2.  PON-P and PON-P2 predictor performance in CAGI challenges: Lessons learned.

Authors:  Abhishek Niroula; Mauno Vihinen
Journal:  Hum Mutat       Date:  2017-05-02       Impact factor: 4.878

3.  A guide to machine learning for biologists. (Review)

Authors:  Joe G Greener; Shaun M Kandathil; Lewis Moffat; David T Jones
Journal:  Nat Rev Mol Cell Biol       Date:  2021-09-13       Impact factor: 94.444

4.  Constructing benchmark test sets for biological sequence analysis using independent set algorithms.

Authors:  Samantha Petti; Sean R Eddy
Journal:  PLoS Comput Biol       Date:  2022-03-07       Impact factor: 4.475

5.  Machine learning predicts new anti-CRISPR proteins.

Authors:  Simon Eitzinger; Amina Asif; Kyle E Watters; Anthony T Iavarone; Gavin J Knott; Jennifer A Doudna; Fayyaz Ul Amir Afsar Minhas
Journal:  Nucleic Acids Res       Date:  2020-05-21       Impact factor: 16.971

6.  Performance of in silico tools for the evaluation of p16INK4a (CDKN2A) variants in CAGI.

Authors:  Marco Carraro; Giovanni Minervini; Manuel Giollo; Yana Bromberg; Emidio Capriotti; Rita Casadio; Roland Dunbrack; Lisa Elefanti; Pietro Fariselli; Carlo Ferrari; Julian Gough; Panagiotis Katsonis; Emanuela Leonardi; Olivier Lichtarge; Chiara Menin; Pier Luigi Martelli; Abhishek Niroula; Lipika R Pal; Susanna Repo; Maria Chiara Scaini; Mauno Vihinen; Qiong Wei; Qifang Xu; Yuedong Yang; Yizhou Yin; Jan Zaucha; Huiying Zhao; Yaoqi Zhou; Steven E Brenner; John Moult; Silvio C E Tosatto
Journal:  Hum Mutat       Date:  2017-05-16       Impact factor: 4.878

7.  Collaborative representation-based classification of microarray gene expression data.

Authors:  Lizhen Shen; Hua Jiang; Mingfang He; Guoqing Liu
Journal:  PLoS One       Date:  2017-12-13       Impact factor: 3.240

8.  PON-tstab: Protein Variant Stability Predictor. Importance of Training Data Quality.

Authors:  Yang Yang; Siddhaling Urolagin; Abhishek Niroula; Xuesong Ding; Bairong Shen; Mauno Vihinen
Journal:  Int J Mol Sci       Date:  2018-03-28       Impact factor: 5.923

9.  KEAP1 Cancer Mutants: A Large-Scale Molecular Dynamics Study of Protein Stability.

Authors:  Carter J Wilson; Megan Chang; Mikko Karttunen; Wing-Yiu Choy
Journal:  Int J Mol Sci       Date:  2021-05-20       Impact factor: 5.923

10.  PredictSNP2: A Unified Platform for Accurately Evaluating SNP Effects by Exploiting the Different Characteristics of Variants in Distinct Genomic Regions.

Authors:  Jaroslav Bendl; Miloš Musil; Jan Štourač; Jaroslav Zendulka; Jiří Damborský; Jan Brezovský
Journal:  PLoS Comput Biol       Date:  2016-05-25       Impact factor: 4.475
