Zoie Amatore1, Susan Gunn2, Laura K Harris1.
Abstract
Scientific advancement is hindered without proper genome annotation because biologists lack a complete understanding of cellular protein functions. In bacterial cells, hypothetical proteins (HPs) are open reading frames with unknown functions. HPs result from either an outdated database or insufficient experimental evidence (i.e., indeterminate annotation). While automated annotation reviews help keep genome annotation up to date, manual reviews are often needed to verify proper annotation. Students can provide the manual review necessary to improve genome annotation. This paper outlines an innovative classroom project that determines whether HPs have outdated or indeterminate annotation. The Hypothetical Protein Characterization Project uses multiple well-documented, freely available, web-based bioinformatics resources that analyze an amino acid sequence to (1) detect sequence similarities to other proteins, (2) identify domains, (3) predict tertiary structure, including active site characterization and potential binding ligands, and (4) determine cellular location. Enough evidence can be generated from these analyses to support re-annotation of HPs or to prioritize HPs for experimental examinations such as structural determination via X-ray crystallography. Additionally, this paper details several approaches for selecting HPs to characterize using the Hypothetical Protein Characterization Project. These approaches include student- and instructor-directed random selection, selection using differential gene expression from mRNA expression data, and selection based on phylogenetic relations. This paper also provides additional resources to support instructional use of the Hypothetical Protein Characterization Project, such as example assignment instructions with grading rubrics, links to training videos on YouTube, and several step-by-step example projects to demonstrate and interpret the range of achievable results that students might encounter.
Educational use of the Hypothetical Protein Characterization Project provides students with an opportunity to learn and apply knowledge of bioinformatic programs to address scientific questions. The project is highly customizable in that HP selection and analysis can be specifically formulated based on the scope and purpose of each student's investigations. Programs used for HP analysis can be easily adapted to course learning objectives. The project can be used in both online and in-seat instruction for a wide variety of undergraduate and graduate classes as well as undergraduate capstone, honors, and experiential learning projects.
Keywords: bioinformatics; classroom; education; genome annotation; hypothetical protein; undergraduate
Year: 2020 PMID: 33365016 PMCID: PMC7750189 DOI: 10.3389/fmicb.2020.577497
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Example studies considered in the development of the Hypothetical Protein Characterization Project.
| | 10 | CDD-BLAST, Pfam, PS2, STRING, QFinder, ExPASy ProtParam, SOSUI, DISULFIND | |
| | 35 | PSI-BLAST, ExPASy ProtParam, CDD-BLAST, Pfam, PS2, 3DLigandSite, STITCH, STRING, PSORTb, SOSUI, DISULFIND | |
| | 204 (41%) | BLAST, FASTA, HMMER, SBASE, CATH, SUPERFAMILY, InterPro, SYSTERS, CDART, SMART, GPCRpred, Discovery Studio, STITCH, STRING, iPfam, ExPASy ProtParam, PSORTb, PSLpred, LOCTree3, TMHMM, HMMTOP, SignalP 4.1, SecretomeP, VirulentPred, DBETH server | |
| | 1055 (55%) | BLASTP, ExPASy ProtParam, PSORTb, CELLO, TMHMM, SignalP 4.1, HHPred, HMMSCAN, Pfam, InterPro, SUPERFAMILY, VirulentPred, VICMpred | |
| | 540 | InterPro, Pfam, BLASTP, CELLO2GO, GO FEAT, STRING, ExPASy ProtParam, VICMpred, MP3, I-TASSER | |
| | 172 (47%) | GO FEAT, Pfam, CATH, SUPERFAMILY, VICMpred, CDART, CDD-BLAST, ExPASy ProtParam, PSORTb, TopHat, Gipsy, VirulentPred, STRING, PSIPRED, Modeler | |
| | 6 | CDD-BLAST, Pfam, PS2, STRING, QFinder, ExPASy ProtParam, PSORTb, DISULFIND | |
| | 344 | BLASTP, ExPASy ProtParam, PSLpred, CELLO, ScanProsite, SMART, Motif Scan, PFP-FunDSeqE, VirulentPred, PFP, Argot2, PSIPRED, Modeler | |
| Vaccinia virus | 1 (100%) | BLAST, GOR IV server, I-TASSER, ExPASy ProtParam | PSI-BLAST and Clustal Omega used to select model template for I-TASSER |
| Human adenovirus | 28 | BLASTP, Pfam, SMART, Phyre2, SWISS-MODEL, MuFOLD, PFP, ESG, Argot2, BAR+, PSIPRED, ProtFun, dcGO, 3d2GO | |
| | 38 (16%) | BLASTP, Pfam, CATH, SUPERFAMILY, InterPro, MOTIF, CDART, SMART, SVMProt, ProtoNet, I-TASSER, ExPASy ProtParam, Virus PLoc, TMHMM, HMMTOP, DISULFIND | |
FIGURE 1. Schematic of Hypothetical Protein Characterization Project. The Hypothetical Protein Characterization Project provides students with a process that generates evidence to address whether a hypothetical protein (HP) is accurately labeled. The HP can be selected randomly, through differential gene expression analysis using established statistical methods, or through phylogenetic relations established by sequence similarity. Once selected, the HP’s amino acid sequence is analyzed by individual web-accessible programs for (1) detection of sequence similarities, (2) identification of protein domains, (3) 3D predictive modeling of the HP’s structure, including active site and potential ligand binding partners, and (4) determination of protein cellular location. If results from these analyses provide sufficient evidence to support a function for the HP, the results can be provided directly to knowledgebases so the protein’s public record can be updated. Otherwise, the HP needs experimental examination before a function can be assigned.
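The random-selection entry point in the schematic can be illustrated with a short script. This is only a sketch under stated assumptions: it presumes students start from a protein FASTA file whose header lines contain the phrase "hypothetical protein", and the function names (`parse_fasta`, `pick_hypothetical_proteins`) and the demo records are hypothetical, not part of the published project.

```python
import random


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def pick_hypothetical_proteins(records, n, seed=None):
    """Randomly pick up to n records whose header mentions 'hypothetical protein'."""
    candidates = [r for r in records if "hypothetical protein" in r[0].lower()]
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(candidates, min(n, len(candidates)))


# Invented demo records; real input would be a downloaded proteome.
EXAMPLE_FASTA = """\
>demo_0001 hypothetical protein [Example organism]
MKTLLVAGAVL
LSACSS
>demo_0002 DNA gyrase subunit A [Example organism]
MSDLAREITPVNIEEELKSS
>demo_0003 Hypothetical protein [Example organism]
MNNQRKKA
"""

records = parse_fasta(EXAMPLE_FASTA)
selected = pick_hypothetical_proteins(records, n=1, seed=0)
for header, seq in selected:
    print(header, len(seq))
```

An instructor-directed variant could pass a pre-filtered list of records to the same function so that students draw only from an approved pool.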
Selected approaches for hypothetical protein selection.
| Random | Student-directed | Complete student autonomy to select HPs for characterization | Beginner | C |
| | Instructor-directed | Instructors limit student ability to select HPs for characterization ( | Beginner | C |
| Differential Gene Expression | Single-gene Analysis | Use of statistical method(s) ( | Intermediate | C, E, H, G |
| | Singular Enrichment Analysis | Gene enrichment analysis comparing groups of significant HPs with similar differential expression as defined by single-gene analysis | Intermediate | C, E, H, G |
| | Gene Set Enrichment Analysis | Gene enrichment analysis comparing a group of the most differentially expressed HPs to a gene signature ( | Advanced | E, H, G |
| Phylogenetic Relations | N/A | HPs for characterization are selected for their sequence similarities to proteins with established tertiary structures | Intermediate | E, H, G |
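The single-gene and singular enrichment rows of this table reduce to threshold filters and set overlaps. The snippet below is a minimal illustration rather than the authors' pipeline: it assumes p-values and signed fold changes were already computed elsewhere (for example with a Welch's T-test), the default cutoffs mirror those quoted for the volcano plot in Figure 2 (p < 0.05, fold change beyond ±5), and every gene ID and number is invented for the demo.

```python
def select_by_thresholds(stats, p_cutoff=0.05, fc_cutoff=5.0):
    """Split genes into over- and under-expressed sets.

    stats maps gene ID -> (p_value, signed_fold_change), where a negative
    fold change denotes under-expression in the experimental group.
    """
    over, under = set(), set()
    for gene, (p_value, fold_change) in stats.items():
        if p_value >= p_cutoff:
            continue  # not statistically significant
        if fold_change > fc_cutoff:
            over.add(gene)
        elif fold_change < -fc_cutoff:
            under.add(gene)
    return over, under


# Invented example data for two mRNA expression datasets.
dataset_a = {
    "hp_001": (0.001, 8.2),
    "hp_002": (0.03, -6.5),
    "hp_003": (0.04, 2.0),
    "hp_004": (0.01, 5.5),
    "gyrA": (0.2, 9.0),
}
dataset_b = {
    "hp_001": (0.02, 3.1),
    "hp_002": (0.01, -2.2),
    "hp_003": (0.5, 4.0),
    "hp_004": (0.04, 1.5),
    "gyrA": (0.01, 2.0),
}
hypothetical = {"hp_001", "hp_002", "hp_003", "hp_004"}

# Single-gene analysis: significance plus a fold-change cutoff.
over_a, under_a = select_by_thresholds(dataset_a)
print("single-gene candidates:", sorted((over_a | under_a) & hypothetical))

# Singular enrichment analysis: overlap of significant over-expressed
# genes between the two datasets (fold-change cutoff relaxed to 0 here).
sig_over_a, _ = select_by_thresholds(dataset_a, fc_cutoff=0.0)
sig_over_b, _ = select_by_thresholds(dataset_b, fc_cutoff=0.0)
shared_hps = sig_over_a & sig_over_b & hypothetical
print("overlap candidates:", sorted(shared_hps))
```

Keeping the cutoffs as parameters lets a class discuss how the candidate list changes when the thresholds are tightened or relaxed.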
FIGURE 2. Schematics of differential gene expression approaches for hypothetical protein (HP) selection. (A) Volcano plot of mRNA expression data from Gene Expression Omnibus accession number GSE46687 identified HPs with statistical (two-tailed Welch’s T-test p-value < 0.05) and biological relevance [fold change (FC) > 5 for over-expressed or <–5 for under-expressed genes in experimental compared to control groups] to antibiotic resistance in Staphylococcus aureus that could be selected for the Hypothetical Protein Characterization Project. (B) Venn diagram illustrates conceptually how HPs are selected from singular enrichment analysis using the overlap of statistically significant (e.g., T-test p-value < 0.05) over-expressed genes between two mRNA expression datasets. The same concept applies to selecting under-expressed HPs. (C) Schematic shows how HPs can be selected from gene signature comparison using Gene Set Enrichment Analysis (GSEA). Gene signatures are gene lists ranked by their differential expression based on single-gene analysis (e.g., T-score or FC). A gene signature is generated for each of two mRNA expression datasets. One signature is chosen from which the 500 most over- and under-expressed genes are taken to derive positive and negative query gene sets, respectively. Each query gene set is compared individually to the second gene signature, which is used as reference for GSEA. GSEA calculates an enrichment plot with a maximum enrichment score. GSEA identifies leading-edge genes, which are genes that contribute most to reaching the maximum enrichment score. HPs among leading-edge genes are selected for the Hypothetical Protein Characterization Project.
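The gene signature comparison in panel (C) can be sketched in a few lines. This is a simplified stand-in, not the GSEA software: it uses an unweighted running sum (each hit adds 1/hits, each miss subtracts 1/misses), whereas the published GSEA method weights hits by their ranking score and assesses significance by permutation. The helper names, the toy signatures, and the query set size of 3 (in place of 500) are all assumptions made for the demo.

```python
def query_gene_sets(signature, n):
    """Return (positive, negative) query sets: the n highest- and n lowest-scoring genes.

    signature maps gene ID -> differential expression score (e.g., T-score).
    """
    ranked = sorted(signature, key=signature.get, reverse=True)
    return set(ranked[:n]), set(ranked[-n:])


def enrichment(reference, gene_set):
    """Unweighted running-sum enrichment of gene_set in a reference signature.

    Returns (enrichment_score, leading_edge_genes).
    """
    ranked = sorted(reference, key=reference.get, reverse=True)
    hits = [gene in gene_set for gene in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, reference genes")
    running, best, best_idx = 0.0, 0.0, -1
    for i, is_hit in enumerate(hits):
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best, best_idx = running, i
    if best >= 0:
        # Positive score: hits ranked at or before the peak.
        leading = [g for g in ranked[: best_idx + 1] if g in gene_set]
    else:
        # Negative score: hits ranked after the trough.
        leading = [g for g in ranked[best_idx:] if g in gene_set]
    return best, leading


# Invented toy signatures; IDs starting with "hp_" stand for hypothetical proteins.
signature_a = {"g1": 5.0, "g2": 4.0, "hp_x": 3.5, "g3": 0.5,
               "g4": -0.2, "hp_y": -3.0, "g5": -4.0, "g6": -4.5}
signature_b = {"hp_x": 6.0, "g1": 4.2, "g4": 1.0, "g3": 0.3,
               "g2": -0.1, "hp_y": -2.0, "g6": -3.3, "g5": -5.0}

positive, negative = query_gene_sets(signature_a, n=3)
es_pos, leading_pos = enrichment(signature_b, positive)
es_neg, leading_neg = enrichment(signature_b, negative)

selected_hps = [g for g in leading_pos + leading_neg if g.startswith("hp_")]
print(round(es_pos, 3), leading_pos)
print(round(es_neg, 3), leading_neg)
print("HPs among leading-edge genes:", selected_hps)
```

For real datasets, the dedicated GSEA tool remains the appropriate choice; the sketch only shows why leading-edge genes are the ones ranked before the running sum peaks (or after it bottoms out).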
Selected analysis programs for Hypothetical Protein Characterization Project.
| Sequence Similarity Detection | BLASTP | Detects regions of similarity between protein sequences to help predict function and evolutionary relationships among gene families. |
| | PSI-BLAST | Detects distant relationships between proteins. |
| Domain Identification | Pfam | Database of protein families and domains. Provides students with the domain organization of the protein, family annotation, and protein searches against database models. |
| | CD-Search | Protein annotation resource that contains annotated sequence alignment models for domains and complete proteins. The output identifies domains by matching the query against position-specific scoring matrices. |
| 3D Predictive Modeling | PHYRE2 | Predicts protein structure, function, and the effects of mutations. The software detects homologs to build 3D models, note binding sites, and analyze amino acids. |
| | 3DLigandSite | Predicts ligand binding sites using the predicted protein structure. |
| Cellular Location Determination | SOSUI | Predicts transmembrane alpha-helical domains by scanning the protein sequence for hydrophobic regions. |
| | PSORTb | Uses multiple modules that analyze biological features known to relate to subcellular localization in order to predict a protein's localization site. Covers localization features of both Gram-negative and Gram-positive bacteria. |
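To give students a feel for what "scanning for hydrophobic regions" means, the idea can be approximated with a sliding-window hydropathy average. The sketch below is not SOSUI's algorithm. It substitutes the Kyte-Doolittle hydropathy scale with a 19-residue window and a cutoff of 1.6, a commonly cited rule of thumb for membrane-spanning segments; the function names and the demo sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    # Non-standard residues (e.g., X) are scored as neutral; this is a simplification.
    values = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in seq.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    profile = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]  # slide the window by one residue
        profile.append(total / window)
    return profile


def hydrophobic_segments(seq, window=19, threshold=1.6):
    """Merge windows scoring above threshold into 1-based inclusive (start, end) ranges."""
    segments = []
    for i, score in enumerate(hydropathy_profile(seq, window)):
        if score <= threshold:
            continue
        start, end = i + 1, i + window
        if segments and start <= segments[-1][1] + 1:
            segments[-1] = (segments[-1][0], end)  # extend the previous range
        else:
            segments.append((start, end))
    return segments


# Invented sequence: a 20-residue hydrophobic core flanked by charged residues.
demo = "MKRSDE" + "L" * 10 + "A" * 4 + "V" * 6 + "DEKRSGNQ"
print(hydrophobic_segments(demo))
print(hydrophobic_segments("DEKR" * 6))  # hydrophilic control
```

Comparing such a crude profile with the output of SOSUI or PSORTb for the same protein can prompt discussion of why dedicated predictors use additional features beyond average hydropathy.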
FIGURE 3. Predictive 3D Models for Hypothetical Protein Characterization Project Examples. (A) Completeness of Phyre2 model of AUH26_00140 shows AUH26_00140 has outdated annotation. (B) Completeness of Phyre2 model of L2624_01843 suggests L2624_01843 has outdated annotation. (C) Lack of completeness of Phyre2 model of WP_002214142 supports the conclusion that WP_002214142 is an example of indeterminate annotation. (D) Lack of completeness of Phyre2 model of YP_009724396 indicates YP_009724396 is an example of indeterminate annotation. All images are colored by rainbow from N terminus to C terminus.