
Functional prediction of hypothetical proteins in human adenoviruses.

Abstract

Assigning functional information to hypothetical proteins in virus genomes is crucial for gaining insight into their proteomes. Human adenoviruses are medium-sized viruses that cause a range of diseases. Their genomes encode proteins of uncharacterized function, known as hypothetical proteins. Using a wide range of protein function prediction servers, functional information was obtained about these hypothetical proteins. A comparison of the output of these servers revealed that some produced functional information, while others provided little functional information about these human adenovirus hypothetical proteins. The PFP, ESG, PSIPRED, 3d2GO, and ProtFun servers produced the most functional information regarding these hypothetical proteins.


Keywords: Hypothetical protein; human adenovirus; pathogen; protein function prediction; web server

Year: 2015 PMID: 26664031 PMCID: PMC4658645 DOI: 10.6026/97320630011466

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Human adenoviruses (HAdVs) are double-stranded DNA viruses with genomes of around 35 kb [1]. These viruses cause a wide variety of diseases such as acute respiratory disease [2], keratoconjunctivitis [3], and gastroenteritis [4], which makes HAdVs important human pathogens. There are 7 species of human adenoviruses, species A-G, which are further divided into different strains/types. This typing was originally based on serum neutralization and hemagglutination inhibition, but it increasingly relies on bioinformatics and genomics approaches because whole genome sequences are now available [5]. In recent years, the availability of whole genome sequences of various organisms has increased dramatically due to next generation sequencing methods. For example, there was a 21% annual increase in the number of virus nucleotide base-pairs in GenBank and an overall annual increase in all GenBank nucleotide base-pairs of 43.6% in 2014 [6]. Many of the proteins in sequenced genomes are annotated as “hypothetical proteins.” These are predicted proteins that have neither experimental evidence for their translation [7] nor a characterized function [8]. To better understand the genomes to which these proteins belong, it will be extremely helpful to assign functions to them. Even with their relatively small genome size compared with prokaryotes and eukaryotes, HAdVs possess several hypothetical proteins that need to be functionally annotated. A myriad of computational approaches to protein function prediction have been developed, ranging from template-based methods, in which a template with known function and structure is used to predict the function of a query sequence [9], to text mining methods [10] and computational intelligence methods [11]. In this study, we used several well-known protein function prediction servers to annotate HAdV hypothetical proteins.
We found that some of these servers provided little to no information about the function of these HAdV hypothetical proteins, while others provided information that could potentially be used by wet bench biologists to further experimentally characterize these proteins. These results can serve as a guide to users, particularly wet bench biologists, as to which servers to use to predict the function of hypothetical proteins, particularly those belonging to viruses.

Methodology

Twenty-eight hypothetical proteins across 11 HAdVs (Table 1; see supplementary material) were obtained from GenBank [6] by searching these genomes for the keyword “hypothetical”. Three additional proteins not explicitly annotated as hypothetical (AAT97486, AAT97487, AAT97489 from HAdV-4) were included because their BLASTP hits to other hypothetical proteins make them very likely hypothetical as well. One of the 31 proteins, ADN06471 from HAdV-40/41, although annotated as hypothetical, is known to be expressed [12]. All 31 proteins were then submitted to several sequence-based protein function prediction servers: PFP (Protein Function Prediction) [13], ESG (Extended Similarity Group) [13], ARGOT2 [14], BAR+ [15], PSIPRED [16], ProtFun [17], and dcGO [18]. The hypothetical proteins were also submitted to the fold recognition server Phyre2 [19] in order to determine their folds. Protein domain prediction was performed using the protein families database server Pfam [20] and the SMART server (Simple Modular Architecture Research Tool) [21]. The homology modeling server SWISS-MODEL [22] and the MuFOLD protein structure prediction server [23] were used to predict the structures of the hypothetical proteins. Successfully predicted structures were then submitted to the structure-based server 3d2GO [24]. Tables were then constructed for all servers' function predictions for each individual protein, for the protein domain predictions, and for the fold predictions.

Results

The average length of the 31 hypothetical proteins from 11 different human adenovirus genomes was 124 amino acids, with a maximum of 224 and a minimum of 58 (Table 1). The PFP server predicted functions for all 31 hypothetical proteins, some with high confidence, such as beta1-adrenergic receptor activity at 92% confidence for protein ACN88103 and purine nucleotide binding at 100% for protein AAW65500 (Table 2; see supplementary material). The ESG server was not as successful as the PFP server, but still predicted functions for 26 of the 31 hypothetical proteins. For instance, GTPase activity and GTP binding were predicted at 99% confidence for AGF90820, and lyase activity and aldehyde-lyase activity were predicted at 89% confidence for ACN88103 (Table 2). ARGOT2 predicted a function for only 7 hypothetical proteins, such as hydrolase activity at 100% confidence for protein AGE46441 and transferase activity at 85% confidence for protein AAT97487 (Table 3; see supplementary material). As also shown in Table 3, BAR+ was unable to predict a function for any of the hypothetical proteins. Similarly, the dcGO server was unable to predict a function for any of the hypothetical proteins (table not shown). The PSIPRED server predicted functions for all 31 hypothetical proteins, such as GTP binding at 94% probability for AFH58045 and oxidoreductase activity at 99% probability for protein AAT97539 (Table 4; see supplementary material). The fold recognition server Phyre2 identified potential folds in 8 of the 31 hypothetical proteins (Table 4). These folds include pyruvate kinase C-terminal domain-like at 17.70% confidence for AFH58048 and barrel-sandwich hybrid at 25.10% confidence for protein AAW65505.
The ProtFun server predicted functions for 24 of the 31 proteins, along with categorical information concerning gene ontology and whether the protein was an enzyme or not (Table 5; see supplementary material). Protein AAT97531 was predicted to play a role in the cell envelope with 53% probability, to be an enzyme with 46% probability, and to be a structural protein with 27% probability. Additionally, protein AFH58048 was predicted to play a role in transport and binding with 74% probability, to be a non-enzyme with 82% probability, and to be a growth factor with 7% probability (Table 5). The homology modeling server SWISS-MODEL did not produce a structure for any of the 31 hypothetical proteins for use with the 3d2GO server. However, the structure-based 3d2GO server predicted a function for 22 of the 31 hypothetical proteins from structures proposed by MuFOLD (Table 6; see supplementary material). For example, 3d2GO predicted oxidoreductase activity at 29% confidence as a function for AAW33184 and transport at 61% confidence for protein AAW65506. The protein family server Pfam found no domains for any of the hypothetical proteins (Table 7; see supplementary material). In contrast, the protein domain prediction server SMART produced results for 25 of the 31 hypothetical proteins, with the majority containing low complexity regions (Table 7).
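The per-server counts reported above can be put side by side as coverage fractions. The following is only a minimal sketch: the counts are copied from the Results text, while the function names `coverage` and `ranked` are illustrative and not part of the study.

```python
# Number of the 31 hypothetical proteins for which each server returned a
# result, as stated in the Results text. Names of helpers are illustrative.
TOTAL_PROTEINS = 31

predicted_counts = {
    "PFP": 31,
    "ESG": 26,
    "ARGOT2": 7,
    "BAR+": 0,
    "dcGO": 0,
    "PSIPRED": 31,
    "Phyre2": 8,       # folds rather than functions
    "ProtFun": 24,
    "SWISS-MODEL": 0,  # no structure produced
    "3d2GO": 22,       # run on structures proposed by MuFOLD
    "Pfam": 0,         # domains
    "SMART": 25,       # value from Results; mostly low complexity regions
}


def coverage(counts, total):
    """Return {server: fraction of proteins for which it gave a result}."""
    return {server: n / total for server, n in counts.items()}


def ranked(counts, total):
    """Return (server, fraction) pairs, highest coverage first.

    Python's sort is stable, so servers with equal coverage keep the
    order in which they appear in `counts`.
    """
    cov = coverage(counts, total)
    return sorted(cov.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for server, fraction in ranked(predicted_counts, TOTAL_PROTEINS):
        print(f"{server:<12}{fraction:6.1%}")
```

Coverage alone says nothing about how informative or reliable a prediction is, so it should be read together with the reported confidence values.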

Discussion

The PFP server predicted some form of “binding” in 25 of its 31 function predictions, and had an average prediction confidence of 81% (Table 2). Additionally, the ESG server made function predictions for 26 of the 31 proteins, averaging 50% confidence. ESG did not predict a function for all proteins as PFP did, but it provided more complete functional information, albeit with average to low confidence. For example, for protein AAT97533, the terms “4-hydroxy-tetrahydrodipicolinate reductase”; “oxidoreductase activity”; “oxidoreductase activity, acting on CH or CH2 groups, NAD or NADP as acceptor”; “NADP binding”; “NAD binding”; and “NADPH binding” were predicted at 32% confidence (Table 2). Also, for protein ADN06471, the terms “N-acetyltransferase activity”; “transferase activity”; “transferase activity, transferring acyl groups”; and “transferase activity, transferring acyl groups other than amino-acyl groups” were predicted at 53% confidence. ARGOT2 predicted only 7 functions, averaging 80% confidence (Table 3). The BAR+ and dcGO servers were both unable to predict a function for any of the proteins as shown in Table 3. PSIPRED predicted a function for all 31 proteins, averaging 91% confidence (Table 4). The function of “structural constituent of ribosome” was predicted for 7 of the 31 proteins. Also, some form of “binding” was predicted for 16 of the 31 proteins and ranged from “calcium ion binding” to “actin binding”. While the PSIPRED predictions were rather vague, the confidence of the predictions remained high across all 31 hypothetical proteins. Additionally, the fold recognition server Phyre2 identified only 8 potential matching folds out of a possible 31, with an average confidence of 16.60%, where confidence is the probability that the query sequence and the template are homologous (Table 4). Moreover, since Phyre2 utilizes fold recognition, the information it provides allows users to gain insight into the fold of a protein.
ProtFun provided a more thorough functional prediction for each protein for which it could predict a function. ProtFun made predictions for 24 of the 31 hypothetical proteins (Table 5). Not only did ProtFun predict functions for these 24 proteins, it also predicted whether each protein was an enzyme or nonenzyme, and its gene ontology (GO) category. Across these predictions, function prediction confidence averaged 29%, enzyme/nonenzyme prediction confidence averaged 63%, and gene ontology prediction confidence averaged 17%. SWISS-MODEL did not find templates for any of the proteins and therefore could not produce a structure to use as input to the 3d2GO server. However, MuFOLD predicted a structure for 22 of the 31 hypothetical proteins (Table 6). Furthermore, the structure-based server 3d2GO used those predicted structures to predict a function for the 22 proteins (Table 6). Average prediction confidence was 50%, and the server was able to predict a function from every structure proposed by MuFOLD. The function for protein AAW33433 was predicted to be RNA binding, ribosome, ribonucleoprotein complex, structural molecule activity, intracellular, translation, rRNA binding and structural constituent of ribosome at 99% confidence, but aside from this thorough prediction, most other predictions were rather vague, such as “cytosol”, “cytoplasm”, and “membrane” (Table 6). While Pfam and SMART are not strictly protein function prediction servers, we wanted to investigate whether they could provide pertinent domain information for the HAdV hypothetical proteins. Pfam did not find any domains in these proteins. Further, while the SMART server did find matching regions for 26 of the 31 hypothetical proteins, the information provided by the server was minimal, as 23 of the 26 matches were “low complexity regions” and the other 3 were classified as “signal peptide regions” (Table 7).
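The Discussion weighs two separate quantities for each server: how many proteins received a prediction and how confident those predictions were on average. The sketch below tabulates the figures quoted in the text and filters on both at once. The `shortlist` helper and its thresholds are hypothetical and are not a criterion used by the authors. Confidence scores are also defined differently by each server (for Phyre2 it is a probability of homology), so cross-server comparisons are indicative at best.

```python
from dataclasses import dataclass

TOTAL_PROTEINS = 31


@dataclass(frozen=True)
class ServerSummary:
    name: str
    n_predicted: int        # proteins with a prediction, out of 31
    mean_confidence: float  # average reported confidence, in percent

    @property
    def coverage(self) -> float:
        return self.n_predicted / TOTAL_PROTEINS


# Values quoted in the Discussion; ProtFun uses its function-category average.
SUMMARIES = [
    ServerSummary("PFP", 31, 81.0),
    ServerSummary("ESG", 26, 50.0),
    ServerSummary("ARGOT2", 7, 80.0),
    ServerSummary("PSIPRED", 31, 91.0),
    ServerSummary("Phyre2", 8, 16.6),
    ServerSummary("ProtFun", 24, 29.0),
    ServerSummary("3d2GO", 22, 50.0),
]


def shortlist(summaries, min_coverage, min_confidence):
    """Names of servers meeting both (hypothetical) thresholds."""
    return [
        s.name
        for s in summaries
        if s.coverage >= min_coverage and s.mean_confidence >= min_confidence
    ]


if __name__ == "__main__":
    # Illustrative thresholds only: at least half of the proteins covered
    # and an average confidence of at least 50%.
    print(shortlist(SUMMARIES, min_coverage=0.5, min_confidence=50.0))
```

With these illustrative thresholds, ARGOT2 drops out on coverage and ProtFun on average confidence, even though ProtFun returns richer categorical output. A purely numeric filter of this kind therefore cannot replace a reading of what each server actually reports.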

Conclusions

It is apparent from the results that no single server produces the most complete functional determination of these “hard” HAdV hypothetical proteins. The servers that provided the most information were PFP, ESG, PSIPRED, 3d2GO, and ProtFun. The servers that provided very little or no functional information were ARGOT2, BAR+, and dcGO. We believe that the best option for functional prediction of hypothetical proteins is to use a multitude of servers and analyze the results they produce. Furthermore, we agree with Radivojac et al. [25] that these servers need to be improved in order to better predict protein function.
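One simple way to analyze results from several servers together is to group predictions by protein and order them by reported confidence. The sketch below does this for a handful of predictions quoted in the Results. The `Prediction` record and the two helper functions are illustrative assumptions, not a workflow described by the authors.

```python
from collections import defaultdict
from typing import NamedTuple


class Prediction(NamedTuple):
    protein: str       # GenBank protein accession
    server: str
    annotation: str
    confidence: float  # percent, as reported by the server


# A few predictions quoted in the Results text.
PREDICTIONS = [
    Prediction("ACN88103", "PFP", "beta1-adrenergic receptor activity", 92.0),
    Prediction("ACN88103", "ESG", "lyase activity; aldehyde-lyase activity", 89.0),
    Prediction("AFH58048", "Phyre2", "pyruvate kinase C-terminal domain-like (fold)", 17.7),
    Prediction("AFH58048", "ProtFun", "transport and binding", 74.0),
    Prediction("AAT97487", "ARGOT2", "transferase activity", 85.0),
]


def by_protein(predictions):
    """Group predictions per protein, highest confidence first."""
    grouped = defaultdict(list)
    for prediction in predictions:
        grouped[prediction.protein].append(prediction)
    for hits in grouped.values():
        hits.sort(key=lambda p: p.confidence, reverse=True)
    return dict(grouped)


def single_server_only(grouped):
    """Proteins whose listed predictions all come from one server."""
    return sorted(
        protein
        for protein, hits in grouped.items()
        if len({hit.server for hit in hits}) == 1
    )


if __name__ == "__main__":
    grouped = by_protein(PREDICTIONS)
    for protein, hits in grouped.items():
        print(protein)
        for hit in hits:
            print(f"  {hit.server:<8}{hit.confidence:5.1f}%  {hit.annotation}")
    print("single-server proteins:", single_server_only(grouped))
```

For ACN88103, two servers report unrelated functions at similarly high confidence, which shows why the combined output still needs to be interpreted rather than simply ranked.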

24 in total

Review 1. Genetic content and evolution of adenoviruses.

Authors: Andrew J Davison; Mária Benkő; Balázs Harrach
Journal: J Gen Virol Date: 2003-11 Impact factor: 3.891

2. Expression of adenovirus type 5 E4 Orf2 protein during lytic infection.

Authors: I Dix; K N Leppard
Journal: J Gen Virol Date: 1995-04 Impact factor: 3.891

Review 3. Roles for text mining in protein function prediction.

Authors: Karin M Verspoor
Journal: Methods Mol Biol Date: 2014

4. BAR-PLUS: the Bologna Annotation Resource Plus for functional and structural annotation of protein sequences.

Authors: Damiano Piovesan; Pier Luigi Martelli; Piero Fariselli; Andrea Zauli; Ivan Rossi; Rita Casadio
Journal: Nucleic Acids Res Date: 2011-05-26 Impact factor: 16.971

5. GenBank.

Authors: Dennis A Benson; Karen Clark; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2014-11-20 Impact factor: 19.160

6. SMART: recent updates, new developments and status in 2015.

Authors: Ivica Letunic; Tobias Doerks; Peer Bork
Journal: Nucleic Acids Res Date: 2014-10-09 Impact factor: 16.971

7. Outbreak of epidemic keratoconjunctivitis caused by human adenovirus type 56, China, 2012.

Authors: Guohong Huang; Wenqing Yao; Wei Yu; Lingling Mao; Haibo Sun; Wei Yao; Jiang Tian; Ling Wang; Zhijian Bo; Zhen Zhu; Yan Zhang; Zhuo Zhao; Wenbo Xu
Journal: PLoS One Date: 2014-10-24 Impact factor: 3.240

Review 8. A survey of computational intelligence techniques in protein function prediction.

Authors: Arvind Kumar Tiwari; Rajeev Srivastava
Journal: Int J Proteomics Date: 2014-12-11

9. The Phyre2 web portal for protein modeling, prediction and analysis.

Authors: Lawrence A Kelley; Stefans Mezulis; Christopher M Yates; Mark N Wass; Michael J E Sternberg
Journal: Nat Protoc Date: 2015-05-07 Impact factor: 13.491

10. Adenovirus serotype 3 and 7 infection with acute respiratory failure in children in Taiwan, 2010-2011.

Authors: Chen-Yin Lai; Chia-Jie Lee; Chun-Yi Lu; Ping-Ing Lee; Pei-Lan Shao; En-Ting Wu; Ching-Chia Wang; Boon-Fatt Tan; Hsin-Yu Chang; Shao-Hsuan Hsia; Jainn-Jim Lin; Luan-Yin Chang; Yhu-Chering Huang; Li-Min Huang
Journal: PLoS One Date: 2013-01-10 Impact factor: 3.240
