Raphael R Eguchi (1), Po-Ssu Huang (2).
1. Department of Biochemistry, School of Medicine, Stanford University, Shriram Center for Bioengineering and Chemical Engineering, 443 via Ortega, Room 036, Stanford, CA 94305, USA.
2. Department of Bioengineering, Schools of Engineering and Medicine, Stanford University, Shriram Center for Bioengineering and Chemical Engineering, 443 via Ortega, Room 036, Stanford, CA 94305, USA.
Abstract
MOTIVATION: Recent advances in computational methods have facilitated large-scale sampling of protein structures, leading to breakthroughs in protein structural prediction and enabling de novo protein design. Establishing methods to identify candidate structures that can lead to native folds or designable structures remains a challenge, since few existing metrics capture high-level structural features such as architectures, folds and conformity to conserved structural motifs. Convolutional Neural Networks (CNNs) have been successfully used in semantic segmentation, a subfield of image classification in which a class label is predicted for every pixel. Here, we apply semantic segmentation to protein structures as a novel strategy for fold identification and structure quality assessment.
RESULTS: We train a CNN that assigns each residue in a multi-domain protein to one of 38 architecture classes designated by the CATH database. Our model achieves a high per-residue accuracy of 90.8% on the test set (95.0% average per-class accuracy; 87.8% average per-structure accuracy). We demonstrate that individual class probabilities can be used as a metric that indicates the degree to which a randomly generated structure assumes a specific fold, as well as a metric that highlights non-conformative regions of a protein belonging to a known class. These capabilities yield a powerful tool for guiding structural sampling for both structural prediction and design.
AVAILABILITY AND IMPLEMENTATION: The trained classifier network, parser network, and entropy calculation scripts are available for download at https://git.io/fp6bd, with detailed usage instructions provided at the download page. A step-by-step tutorial for setup is provided at https://goo.gl/e8GB2S. All Rosetta commands, RosettaRemodel blueprints, and predictions for all datasets used in the study are available in the Supplementary Information.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
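The abstract reports three accuracies (per-residue, average per-class, average per-structure) and mentions entropy calculation scripts, but does not spell out the formulas. The following is a minimal sketch of one plausible reading of those quantities, not the authors' implementation: the function names `segmentation_accuracies` and `residue_entropy`, the averaging conventions, and the toy labels are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def segmentation_accuracies(structures):
    """Compute three accuracies for per-residue class predictions.

    `structures` is a list of (true_labels, predicted_labels) pairs, one pair
    per protein, with one label per residue. The definitions below are
    assumptions for illustration, not taken from the paper:
      - per-residue: correct residues / all residues, pooled over proteins
      - per-class: accuracy within each true class, averaged over classes
      - per-structure: accuracy within each protein, averaged over proteins
    """
    total = correct = 0
    class_total = defaultdict(int)
    class_correct = defaultdict(int)
    structure_acc = []

    for true, pred in structures:
        if len(true) != len(pred):
            raise ValueError("label sequences must have equal length")
        if not true:
            continue  # skip empty structures to avoid division by zero
        hits = 0
        for t, p in zip(true, pred):
            class_total[t] += 1
            if t == p:
                class_correct[t] += 1
                hits += 1
        total += len(true)
        correct += hits
        structure_acc.append(hits / len(true))

    if total == 0:
        raise ValueError("no residues to score")

    per_residue = correct / total
    per_class = sum(class_correct[c] / n for c, n in class_total.items()) / len(class_total)
    per_structure = sum(structure_acc) / len(structure_acc)
    return per_residue, per_class, per_structure


def residue_entropy(probs):
    """Shannon entropy (in nats) of one residue's class-probability vector.

    Low entropy means the prediction is concentrated on one class; high
    entropy means the prediction is spread across classes.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy example: labels are CATH architecture codes used purely as placeholders.
toy = [
    (["3.40", "3.40", "1.10", "1.10"], ["3.40", "3.40", "1.10", "3.40"]),
    (["1.10", "1.10"], ["1.10", "1.10"]),
]
per_residue, per_class, per_structure = segmentation_accuracies(toy)
print(per_residue, per_class, per_structure)
print(residue_entropy([0.25, 0.25, 0.25, 0.25]), residue_entropy([1.0, 0.0, 0.0, 0.0]))
```

The three numbers can differ, as they do in the abstract, because pooling over residues weights large proteins and common classes more heavily than averaging over classes or structures does.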