Ivan Limongelli (1, 2), Simone Marini (3), Riccardo Bellazzi (4)
1. IRCCS Policlinico S. Matteo, Pzz.le Volontari del Sangue 2, 27100, Pavia, Italy. ivan.limongelli@unipv.it
2. Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 1, 27100, Pavia, Italy. ivan.limongelli@unipv.it
3. Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 1, 27100, Pavia, Italy. simone.marini@unipv.it
4. Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Via Ferrata 1, 27100, Pavia, Italy. riccardo.bellazzi@unipv.it
Abstract
BACKGROUND: High-throughput sequencing technologies can identify the whole genomic variation of an individual. Gene-targeted and whole-exome experiments focus mainly on coding sequence variants affecting single or multiple nucleotides. Analyzing the biological significance of this multitude of genomic variants is challenging and computationally demanding. RESULTS: We present PaPI, a new machine-learning approach that classifies and scores human coding variants by estimating the probability that they damage the function of the encoded protein. The novelty of this approach lies in its use of pseudo amino acid composition, through which wild-type and mutated protein sequences are represented in a discrete model. A machine-learning classifier was trained on a set of known deleterious and benign coding variants in order to score unobserved variants by taking into account hidden sequence patterns in the human genome that potentially lead to disease. We show that the combination of amphiphilic pseudo amino acid composition, evolutionary conservation and homologous-protein-based methods outperforms several prediction algorithms and can also score complex variants such as deletions, insertions and indels. CONCLUSIONS: This paper describes a machine-learning approach to predicting the deleteriousness of human coding variants. A freely available web application (http://papi.unipv.it) implementing the presented method can score up to thousands of variants in a single run.
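To make the idea of representing a protein sequence "in a discrete model" concrete, the following is a minimal sketch of a pseudo amino acid composition vector. It is not PaPI's implementation: the function name `pseaac`, the parameters `lam` and `w`, the use of a single property (the Kyte-Doolittle hydropathy scale) and the toy sequences are all assumptions made for illustration. The amphiphilic variant named in the abstract combines hydrophobicity and hydrophilicity correlation terms; only the hydrophobicity term is shown here.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (an illustrative choice of property scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def normalize(scale):
    """Standardize a property scale to zero mean and unit variance over the 20 residues."""
    mu = mean(scale.values())
    sigma = pstdev(scale.values())
    return {aa: (value - mu) / sigma for aa, value in scale.items()}


H = normalize(KYTE_DOOLITTLE)


def pseaac(sequence, lam=3, w=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector.

    The first 20 components are residue frequencies; the remaining lam
    components are sequence-order correlation factors for residues that
    are 1..lam positions apart, weighted by w (hypothetical defaults).
    """
    seq = sequence.upper()
    if any(aa not in H for aa in seq):
        raise ValueError("sequence contains a non-standard residue")
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")

    counts = Counter(seq)
    freqs = [counts[aa] / n for aa in AMINO_ACIDS]

    # Correlation factor of rank k: mean product of the property values
    # of residues separated by k positions.
    taus = [
        sum(H[seq[i]] * H[seq[i + k]] for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]

    denom = sum(freqs) + w * sum(taus)
    return [f / denom for f in freqs] + [w * t / denom for t in taus]


# Toy wild-type sequence and a single-residue substitution (made-up data).
wild_type = "MKTAYIAKQRQISFVKSHFSRQ"
mutant = wild_type[:5] + "P" + wild_type[6:]

wt_vec = pseaac(wild_type)
mut_vec = pseaac(mutant)
# Per-component difference between mutant and wild-type representations.
delta = [m - x for m, x in zip(mut_vec, wt_vec)]
```

Because both sequences map to vectors of the same fixed length regardless of sequence length, insertions and deletions can be represented in the same way as substitutions. A classifier could then take `wt_vec`, `mut_vec` or `delta` as input features, alongside conservation scores, although how PaPI combines them is not specified here.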