Amelia Villegas-Morcillo1, Stavros Makrodimitris2,3, Roeland C H J van Ham2,3, Angel M Gomez1, Victoria Sanchez1, Marcel J T Reinders2,4. 1. Department of Signal Theory, Telematics and Communications, University of Granada, 18071 Granada, Spain. 2. Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands. 3. Keygene N.V., 6708PW Wageningen, The Netherlands. 4. Leiden Computational Biology Center, Leiden University Medical Center, 2333ZC Leiden, The Netherlands.
Abstract
MOTIVATION: Protein function prediction is a difficult bioinformatics problem. Many recent methods use deep neural networks to learn complex sequence representations and predict function from these. Deep supervised models require a lot of labeled training data, which are not available for this task. However, a very large number of protein sequences without functional labels are available. RESULTS: We applied an existing deep sequence model that had been pretrained in an unsupervised setting to the supervised task of protein molecular function prediction. We found that this complex feature representation is effective for this task, outperforming hand-crafted features such as one-hot encoding of amino acids, k-mer counts, secondary structure and backbone angles. It also partly negates the need for complex prediction models, as a two-layer perceptron was enough to achieve competitive performance in the third Critical Assessment of Functional Annotation benchmark. We also show that combining this sequence representation with protein 3D structure information does not lead to performance improvement, hinting that 3D structure is also potentially learned during the unsupervised pretraining. AVAILABILITY AND IMPLEMENTATION: Implementations of all used models can be found at https://github.com/stamakro/GCN-for-Structure-and-Function. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
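The abstract describes feeding a fixed, pretrained sequence embedding into a two-layer perceptron that outputs probabilities for multiple molecular-function terms. A minimal NumPy sketch of that prediction head is shown below; the embedding dimension, hidden size, and number of GO terms are illustrative placeholders, not values taken from the paper, and the random vector stands in for an actual pretrained embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_layer_mlp(x, W1, b1, W2, b2):
    """Two-layer perceptron for multi-label function prediction.

    x: (d,) fixed protein embedding from a pretrained sequence model.
    Returns per-term probabilities (terms treated as independent labels).
    """
    h = np.maximum(0.0, x @ W1 + b1)       # hidden layer with ReLU
    logits = h @ W2 + b2                   # one logit per function term
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid, not softmax: labels are not exclusive

# Illustrative sizes (hypothetical, not from the paper)
d, hidden, n_terms = 1024, 256, 100
W1 = rng.normal(0.0, 0.01, (d, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(0.0, 0.01, (hidden, n_terms)); b2 = np.zeros(n_terms)

emb = rng.normal(size=d)  # placeholder for a pretrained embedding vector
probs = two_layer_mlp(emb, W1, b1, W2, b2)
print(probs.shape)
```

The sigmoid output reflects the multi-label nature of function prediction: a protein can be annotated with several Gene Ontology terms at once, so each term gets its own independent probability.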
Authors: Rahmad Akbar; Habib Bashour; Puneet Rawat; Philippe A Robert; Eva Smorodina; Tudor-Stefan Cotet; Karine Flem-Karlsen; Robert Frank; Brij Bhushan Mehta; Mai Ha Vu; Talip Zengin; Jose Gutierrez-Marcos; Fridtjof Lund-Johansen; Jan Terje Andersen; Victor Greiff Journal: MAbs Date: 2022 Jan-Dec Impact factor: 5.857
Authors: Zichen Wang; Steven A Combs; Ryan Brand; Miguel Romero Calvo; Panpan Xu; George Price; Nataliya Golovach; Emmanuel O Salawu; Colby J Wise; Sri Priya Ponnapalli; Peter M Clark Journal: Sci Rep Date: 2022-04-27 Impact factor: 4.996
Authors: Michael Heinzinger; Maria Littmann; Ian Sillitoe; Nicola Bordin; Christine Orengo; Burkhard Rost Journal: NAR Genom Bioinform Date: 2022-06-11