José Juan Almagro Armenteros (1,2), Casper Kaae Sønderby (2), Søren Kaae Sønderby (2), Henrik Nielsen (1), Ole Winther (2,3).
1. Department of Bio and Health Informatics, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
2. The Bioinformatics Centre, Department of Biology, University of Copenhagen, 2200 Copenhagen N, Denmark.
3. DTU Compute, Technical University of Denmark, 2800 Kgs. Lyngby, Denmark.
Abstract
MOTIVATION: The prediction of eukaryotic protein subcellular localization is a well-studied topic in bioinformatics due to its relevance in proteomics research. Many machine learning methods have been successfully applied to this task, but in most of them, predictions rely on the annotation of homologues from knowledge databases. For novel proteins with no annotated homologues, and for predicting the effects of sequence variants, it is desirable to have methods that predict protein properties from sequence information alone.
RESULTS: Here, we present a prediction algorithm that uses deep neural networks to predict protein subcellular localization from sequence information alone. At its core, the prediction model uses a recurrent neural network that processes the entire protein sequence and an attention mechanism that identifies the protein regions important for subcellular localization. The model was trained and tested on a protein dataset extracted from one of the latest UniProt releases, in which experimentally annotated proteins follow more stringent criteria than previously. We demonstrate that our model achieves good accuracy (78% for 10 categories; 92% for membrane-bound or soluble), outperforming current state-of-the-art algorithms, including those relying on homology information.
AVAILABILITY AND IMPLEMENTATION: The method is available as a web server at http://www.cbs.dtu.dk/services/DeepLoc. Example code is available at https://github.com/JJAlmagro/subcellular_localization. The dataset is available at http://www.cbs.dtu.dk/services/DeepLoc/data.php.
CONTACT: jjalma@dtu.dk.
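The attention idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the DeepLoc implementation: it assumes that some recurrent network has already produced one hidden vector per residue, and it uses plain dot-product scoring against a single query vector. The names `attention_pool`, `hidden_states` and `query_vector`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden: Sequence[Sequence[float]], query: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Return (weights, context) for per-position hidden vectors.

    Assumption: scoring is a dot product with a query vector, a generic
    attention form that may differ from the scoring used in the paper.
    """
    # One scalar score per sequence position.
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden]
    # Normalise scores into weights that sum to 1 across positions.
    weights = softmax(scores)
    # Context vector: attention-weighted average of the hidden vectors.
    dim = len(hidden[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return weights, context


# Toy hidden states for a 4-residue sequence (made-up numbers).
hidden_states = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [0.1, 0.1]]
query_vector = [1.0, 1.0]  # hypothetical stand-in for a learned parameter

weights, context = attention_pool(hidden_states, query_vector)
# Position with the largest weight, i.e. the residue the sketch "attends" to most.
top_position = max(range(len(weights)), key=weights.__getitem__)
```

In a sketch like this, the weights give one value per residue, which is what makes it possible to inspect which sequence regions contributed most to a prediction, while the context vector is the fixed-length summary that a downstream classifier would consume.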