Brendan R E Ansell, Bernard J Pope, Peter Georgeson, Samantha J Emery-Corbin, Aaron R Jex.
Abstract
Background: Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of such divergent biological systems. Conversely, protein structure tolerates substantial variation in primary amino acid sequence and is thus a robust indicator of biochemical function. Structural proteomics is poised to become a standard part of pathogen genomics research; however, informatic methods are now required to assign confidence in large volumes of predicted structures.
Aims: Our aim was to predict the proteome of a neglected human pathogen, Giardia duodenalis, and stratify predicted structures into high- and lower-confidence categories using a variety of metrics in isolation and combination.
Year: 2019 PMID: 30520990 PMCID: PMC6312909 DOI: 10.1093/gigascience/giy150
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
I-TASSER output metrics and additional features used in this study
| Feature class | I-TASSER output feature | Additional feature |
|---|---|---|
| Predicted structure metrics | C-score: Convergence score | SS-sd: Standard deviation in proportional secondary structure predictions |
| | C-sd: Error in convergence score | |
| | TM-model: Estimated TM score | |
| | TM-model-sd: Error in TM-model | |
| | RMSD-model: Estimated RMSD | |
| | RMSD-model-sd: Error in RMSD-model | |
| Structural homology metrics | % AA ID: Amino acid identity across region of structural homology | |
| | TM score: Template modeling score | |
| | RMSD: Root mean squared deviation in alpha-carbon atom position | |
| | Coverage: Relative coverage in 3D space | |
| Comparative sequence metrics | Length ratio: Ratio of query peptide-to-reference peptide AA length | |
| | Pfam match: Presence of at least one identical Pfam domain annotated in both query and reference peptides | |
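Two of the tabulated metrics have simple closed forms, and a minimal sketch may make them concrete. The function names `rmsd` and `length_ratio` are hypothetical and do not come from the study. The RMSD sketch assumes the two alpha-carbon coordinate lists are already aligned residue-by-residue and superposed; tools such as TM-align perform those steps themselves, and they are not reproduced here.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between paired alpha-carbon coordinates.

    Assumes both lists hold (x, y, z) tuples for the same aligned residues
    and that the structures have already been superposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def length_ratio(query_seq, reference_seq):
    """Ratio of query peptide length to reference peptide length (in residues)."""
    if not reference_seq:
        raise ValueError("reference sequence must not be empty")
    return len(query_seq) / len(reference_seq)


# Toy inputs for illustration only; these are not data from the study.
model_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference_ca = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(model_ca, reference_ca))
print(length_ratio("MKV", "MKVLAA"))
```

A length ratio near 1 indicates that query and reference peptides are of similar size; how the study weighted this feature is determined by the classifier, not by this sketch.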
Figure 1: Pfam code agreement as a proxy for predicted protein structure quality. A query peptide sequence is submitted to I-TASSER software to predict its 3D structure (colored blue). Metrics describing the predicted structure (“model”) are extracted for downstream analysis. The model is compared with empirically determined protein crystal structures available in the PDB using TM-align, from which the closest structural homologue is identified (“reference”; colored red). Metrics describing the alignment are also extracted. Pfam codes are assigned to the primary peptide sequences that constitute the model and reference structures using InterProScan software (lower right side). The presence of at least one matching Pfam code assigned to the query and reference peptides (“Pfam match”) indicates greater likelihood of structural similarity between the model and the reference. Models with this feature are assigned as “high-confidence.” The ability of each extracted metric (“Feature”) to predict the high-confidence category (“Factor”) is assessed, and then a random forest (RF) classifier is trained to identify the factor using all available features.
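The labeling rule described in the caption can be sketched as a set intersection. The function names and the Pfam accessions below are illustrative placeholders, not values from the study, and labeling every model without a shared code as "lower-confidence" is an assumption based on the two categories used elsewhere in this record.

```python
def pfam_match(query_pfam, reference_pfam):
    """Return True if the query and reference peptides share at least one Pfam code."""
    return bool(set(query_pfam) & set(reference_pfam))


def confidence_label(query_pfam, reference_pfam):
    """Assign the training label (the "factor") for one model-reference pair.

    Assumption: any pair without a shared Pfam code, including pairs where
    one or both peptides have no Pfam annotation, is labeled lower-confidence.
    """
    if pfam_match(query_pfam, reference_pfam):
        return "high-confidence"
    return "lower-confidence"


# Placeholder accessions used only to exercise the functions.
print(confidence_label(["PF00001", "PF00002"], ["PF00002"]))
print(confidence_label(["PF00001"], ["PF00003"]))
print(confidence_label([], ["PF00003"]))
```

The label produced this way is what the individual metrics and the classifier are then asked to predict.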
Figure 2: Structure prediction and homology searching elaborates putative functions for query peptides. (A) Intersection of predicted structures for which Pfam codes were available via query or reference peptides. The majority of structures predicted from BLAST-annotated peptides (blue vertical bars) had at least one Pfam annotation that matched the reference structure. The majority of peptides that lacked BLAST annotation (i.e., “hypothetical proteins”; black vertical bars) also lacked Pfam codes. A total of 824 proteins (792 hypothetical) for which no Pfam codes were annotated in the query or the reference are not displayed. (B) Differential abundance of Pfam codes assigned to query and reference peptides for 1,095 high-confidence pairs. (C) Number of unique Pfam codes available for query (orange) and reference (teal) peptides for 1,095 high-confidence pairs. The right-shifted distribution in reference-derived Pfam codes indicates an overall increase in annotation via this method.
Random forest classifier performance discriminating high- from lower-confidence predicted protein structures
| Metric set | Actual class | Hold-out data: predicted HC | Hold-out data: predicted LC | Hold-out data: class error | All data: predicted HC | All data: predicted LC | All data: class error |
|---|---|---|---|---|---|---|---|
| All metrics | HC | 228 | 22 | 0.088 | 1054 | 34 | 0.031 |
| All metrics | LC | 24 | 226 | 0.096 | 305 | 3437 | 0.082 |
| % AA identity omitted | HC | 227 | 23 | 0.092 | 1048 | 40 | 0.037 |
| % AA identity omitted | LC | 29 | 221 | 0.116 | 349 | 3393 | 0.093 |
Metrics for 71 mainly ribosomal protein structures were insufficient for inclusion in data sets for the random forest.
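The class error column is consistent with the fraction of models in each actual class that were assigned to the other class. A minimal sketch of that arithmetic, using the hold-out counts for the all-metrics classifier, is shown here; the function name `class_errors` and the dictionary layout are assumptions made for illustration.

```python
def class_errors(confusion):
    """Per-class error from a confusion matrix.

    `confusion` maps each actual class to a dict of predicted-class counts.
    The error for a class is the share of its members predicted as anything else.
    """
    errors = {}
    for actual, row in confusion.items():
        total = sum(row.values())
        if total == 0:
            raise ValueError(f"no observations for class {actual!r}")
        misclassified = total - row.get(actual, 0)
        errors[actual] = misclassified / total
    return errors


# Hold-out counts for the all-metrics classifier, taken from the table.
hold_out = {
    "HC": {"HC": 228, "LC": 22},
    "LC": {"HC": 24, "LC": 226},
}
print({label: round(err, 3) for label, err in class_errors(hold_out).items()})
# {'HC': 0.088, 'LC': 0.096}
```

The same calculation reproduces the all-data column, for example 34 / (1054 + 34) ≈ 0.031 for the HC class.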
Figure 3: A random forest classifier correctly identifies the majority of high-confidence models using I-TASSER software output and derived metrics. (A) Relative importance of 12 metrics used to predict the presence of matching Pfam terms between query peptides and reference peptides identified via structural homology searching. (B) Receiver operating characteristic curves for the best-performing individual metrics (AUC ≥0.7; Table 1) and the random forest classifier (“Exact_match_prediction”). The unbroken x = y line represents chance prediction.
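For readers unfamiliar with the AUC values mentioned in panel B, the area under a receiver operating characteristic curve for a single metric equals the probability that a randomly chosen high-confidence model scores higher than a randomly chosen lower-confidence model, with ties counted as one half. The sketch below uses that pairwise definition. It is not the study's code, the function name `roc_auc` is hypothetical, the scores are toy values, and it assumes that larger metric values point toward the high-confidence class (a metric such as RMSD would need to be negated first).

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    `scores` are metric values; `labels` are truthy for high-confidence models.
    Runs in O(n_pos * n_neg), which is fine for a small illustration.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Toy metric values and labels for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
print(roc_auc(toy_scores, toy_labels))  # 0.75
```

An AUC of 0.5 corresponds to the x = y chance line in the figure, and 1.0 to perfect separation of the two classes.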
Performance of classifier trained on Giardia duodenalis data on 100 models predicted for Homo sapiens
| Metric set | Actual class | Predicted HC | Predicted LC | Class error |
|---|---|---|---|---|
| All metrics | HC | 43 | 2 | 0.04 |
| All metrics | LC | 9 | 46 | 0.16 |
Figure 4: Distribution of I-TASSER software output and derived metrics across high-confidence, high-confidence-like, lower-confidence, and lower-confidence-like models. The random forest classifier's prediction of confidence status (“Exact_match_prediction”) is outlined in black.
Figure 5: Computationally predicted structures for putative ferredoxin:NAD(P)H reductases (FNRs). The high-confidence-like structure predicted for GL_87577 is similar to the predicted C-terminus of an Entamoeba histolytica protein previously annotated as glutamate synthase (EhNO1) [27]. EhNO1 exhibits FNR activity and, unlike bacterial enzymes such as the Thermotoga maritima FNR (PDB code: 4YLF), does not require an alpha subunit. Tm FNR beta subunit: purple; alpha subunit: blue; FMN co-factor: green.