Jason E McDermott1,2, John R Cort1, Ernesto S Nakayasu1, Jonathan N Pruneda2, Christopher Overall3, Joshua N Adkins1.
Abstract
BACKGROUND: Although pathogenic Gram-negative bacteria lack their own ubiquitination machinery, they have evolved or acquired virulence effectors that can manipulate the host ubiquitination process through structural and/or functional mimicry of host machinery. Many such effectors have been identified in a wide variety of bacterial pathogens that share little sequence similarity amongst themselves or with eukaryotic ubiquitin E3 ligases.
Keywords: Machine learning; Protein function; Sequence analysis; Ubiquitination; Virulence
Year: 2019 PMID: 31211016 PMCID: PMC6557245 DOI: 10.7717/peerj.7055
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Reduced amino acid (RED) encodings.
| Name | Groups | Notes | Reference |
|---|---|---|---|
| NAT | ACDEFGHIKLMNPQRSTVWY | No encoding | |
| RED1 | SFTNKYEQCWPHDR | Hydrophilic | |
| RED2 | AGILMV | Hydrophobic | |
| RED3 | CILMVFWY | Low | |
| RED4 | SFTNYQCWPH | Hydrophobic | This study |
| RED5 | SFTNKYEQCWHDR | Hydrophilic | This study |
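As a rough illustration of how a reduced encoding like those in the table above could be applied, the sketch below maps each residue to one symbol per group and then extracts overlapping kmers. It uses the two complementary groups listed in the table (`AGILMV` as hydrophobic, `SFTNKYEQCWPHDR` as hydrophilic). The output symbols `B` and `J`, the `X` placeholder for unrecognized residues, and the function names are assumptions made for this example, not details taken from the study.

```python
# Two-group reduction, assuming the hydrophobic/hydrophilic groups from the table.
HYDROPHOBIC = set("AGILMV")
HYDROPHILIC = set("SFTNKYEQCWPHDR")


def reduce_sequence(seq, unknown="X"):
    """Map each residue to 'B' (hydrophobic) or 'J' (hydrophilic).

    The symbols 'B' and 'J' are hypothetical labels chosen for this sketch.
    Residues outside both groups are replaced by `unknown`.
    """
    out = []
    for aa in seq.upper():
        if aa in HYDROPHOBIC:
            out.append("B")
        elif aa in HYDROPHILIC:
            out.append("J")
        else:
            out.append(unknown)
    return "".join(out)


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


reduced = reduce_sequence("MKVLA")
reduced_kmers = kmers(reduced, 3)
```

With only two symbols, many distinct native peptides collapse onto the same reduced kmer, which is what allows longer kmers to recur across unrelated sequences.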
Cross-validation toy example.
| Example | Family | Class |
|---|---|---|
| A | 1 | positive |
| B | 1 | positive |
| C | 2 | positive |
| D | 2 | positive |
| E | 3 | negative |
| F | 3 | negative |
| G | 4 | negative |
| H | 4 | negative |
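The toy example can be turned into family-wise folds along the following lines. This is a minimal sketch that assumes the second column is a family (cluster) label and that holds out one whole family per fold. The study reports 100-fold family-wise cross-validation, so its exact splitting scheme may differ from this leave-one-family-out version. The function name `family_wise_folds` is hypothetical.

```python
# Toy data from the table: (example, family, class).
# Treating the second column as a family label is an assumption.
EXAMPLES = [
    ("A", 1, "positive"), ("B", 1, "positive"),
    ("C", 2, "positive"), ("D", 2, "positive"),
    ("E", 3, "negative"), ("F", 3, "negative"),
    ("G", 4, "negative"), ("H", 4, "negative"),
]


def family_wise_folds(examples):
    """Yield (held_out_family, train_names, test_names), one fold per family.

    Every member of the held-out family goes to the test set, so no
    family contributes sequences to both sides of a split.
    """
    families = sorted({fam for _, fam, _ in examples})
    for held_out in families:
        train = [name for name, fam, _ in examples if fam != held_out]
        test = [name for name, fam, _ in examples if fam == held_out]
        yield held_out, train, test


folds = list(family_wise_folds(EXAMPLES))
```

Keeping whole families together is the point of the design: a model cannot score well on a test example merely because a close relative appeared in training.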
Best model performance.
| Encoding | Kmer length | AUC |
|---|---|---|
| NAT | 17 | 0.851 |
| RED0 | 14 | 0.903 |
| RED1 | 6 | 0.803 |
| RED2 | 8 | 0.742 |
| RED3 | 6 | 0.884 |
| RED4 | 13 | 0.814 |
Figure 1. Amino acid reduction based on physicochemical properties is important.
Models were evaluated using the standard hydrophobic/hydrophilic reduction alphabet (RED0) and randomly divided sets of amino acids (RND0), both with a kmer length of 14. Performance was measured as AUC under 100-fold family-wise cross-validation. The plot shows that dividing amino acids into hydrophobic and hydrophilic residues outperforms a random division.
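A random control alphabet of the kind described above could be generated roughly as follows. This sketch assumes a two-group split. The group size of 6 and the fixed seed are illustrative choices only, since the caption does not say how the random sets were sized or sampled. The function name is hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_two_group_alphabet(group_size=6, seed=0):
    """Split the 20 amino acids into two random, non-overlapping groups.

    `group_size` (size of the first group) and `seed` are assumptions
    made for this sketch, not parameters reported in the study.
    """
    rng = random.Random(seed)
    shuffled = rng.sample(list(AMINO_ACIDS), len(AMINO_ACIDS))
    first = set(shuffled[:group_size])
    second = set(shuffled[group_size:])
    return first, second


group_a, group_b = random_two_group_alphabet()
```

Repeating this with different seeds gives a set of random alphabets whose performance can be compared against the physicochemically motivated split.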
Figure 2. Assessing the information content of reduced amino acid kmers for ubiquitin ligase prediction.
Family-normalized counts of kmer occurrence were calculated for the positive and negative examples used in the study, and a differential score was derived from them. A score of 1.0 signifies a kmer that is absolutely conserved in every known ubiquitin ligase example and absent from the negative examples; a score of 0 is neutral in terms of representation. Each panel shows a different amino acid encoding, with the kmer length on the X axis and the box and whiskers representing the overall distribution of scores for all observed kmers. The red box indicates the minimal 10 kmer model described in the text. This plot shows that the simple hydrophobic/hydrophilic encoding (RED0) displays the greatest flexibility for the longest kmer lengths when predicting this class of proteins.
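The caption does not give the exact formula for the differential score, so the following is only one plausible reading, offered as a sketch. For each family it takes the fraction of member sequences containing the kmer, averages those fractions over families, and subtracts the negative-set value from the positive-set value. Under this assumed definition, a kmer present in every positive example and in no negative example scores 1.0, and a kmer equally represented in both sets scores 0. All function names are hypothetical.

```python
def family_normalized_frequency(kmer, sequences_by_family):
    """Mean over families of the fraction of member sequences containing kmer.

    `sequences_by_family` maps a family label to a list of (reduced)
    sequences. Empty families are skipped. This normalization is an
    assumption, not the study's stated formula.
    """
    fractions = []
    for members in sequences_by_family.values():
        if not members:
            continue
        hits = sum(1 for seq in members if kmer in seq)
        fractions.append(hits / len(members))
    if not fractions:
        return 0.0
    return sum(fractions) / len(fractions)


def differential_score(kmer, positives, negatives):
    """Positive-set frequency minus negative-set frequency."""
    return (family_normalized_frequency(kmer, positives)
            - family_normalized_frequency(kmer, negatives))


# Made-up reduced sequences, for illustration only.
positives = {"fam1": ["BJBJ", "BJBB"], "fam2": ["JBJB"]}
negatives = {"fam3": ["JJJJ"], "fam4": ["BBBB", "JJBB"]}
score_conserved = differential_score("BJB", positives, negatives)
score_depleted = differential_score("BB", positives, negatives)
```

Averaging within each family first keeps a large family of near-identical sequences from dominating the score.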
Figure 3. Discriminating peptides in E3 ligase domains.
For each position in the example E3 ligases shown, a differential score was calculated that represents how unique the kmer at that location is across all known ubiquitin ligase examples used in the study. Examples shown are (A) the HECT-like Salmonella Typhimurium SopA, (B) the NEL family Salmonella Typhimurium SspH2, (C) the Pseudomonas syringae AvrPtoB, and (D) the recently discovered Legionella pneumophila RavN. The score was normalized for sequence families; a score of 1.0 represents a position that is completely conserved in the positive examples and not present in the negative examples. Kmers with scores greater than 0.2 (dotted line) are significantly predictive of the functional class. Known E3 ligase domains are indicated by the shaded boxes. RavN is a recently discovered E3 ubiquitin ligase with no sequence similarity to any existing example and was not included in our training set. Combined with the ability of SIEVEUb to accurately predict ubiquitin ligase function, these plots indicate that some of the most predictive kmers lie within the known domains, despite the family-wise cross-validation approach that was used to prevent trivial sequence similarity inside families from affecting the results.
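Given a per-position score track like the ones plotted, the regions covered by predictive kmers could be located with a short scan such as the one below. It assumes each score is indexed by the start position of its kmer and applies the 0.2 cutoff from the caption. Merging of overlapping or touching kmers and the function name `predictive_regions` are choices made for this sketch.

```python
def predictive_regions(scores, k, threshold=0.2):
    """Return merged (start, end) spans covered by kmers scoring above threshold.

    Positions are 0-based and `end` is exclusive. `scores[i]` is assumed
    to be the score of the kmer starting at residue i.
    """
    regions = []
    for start, score in enumerate(scores):
        if score <= threshold:
            continue
        end = start + k
        if regions and start <= regions[-1][1]:
            # Overlaps or touches the previous span: extend it.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Made-up scores, for illustration only.
example_scores = [0.0, 0.3, 0.25, 0.0, 0.0, 0.0, 0.5]
spans = predictive_regions(example_scores, k=3)
```

The resulting spans can then be compared against annotated domain boundaries to see whether high-scoring kmers fall inside known E3 ligase domains.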
Proteins predicted to be similar to ubiquitin ligase mimic set. *annotation based on sequence comparison only.
| APZ00_07775 | 0.62 | 0 | 8 | 0 | 2-methylfumaryl-CoA hydratase | ||
| KKKWG1_2059 | 0.61 | 0 | 15 | 0 | UPF0758 family protein | ||
| LV28_06870 | 0.60 | 17 | 0 | 0 | Benzaldehyde dehydrogenase | ||
| AB185_15825 | 0.58 | 0 | 5 | 0 | N-acetyltransferase ElaA | ||
| PMI0843 | 0.57 | 6 | 4 | 1 | Low-affinity putrescine importer PlaP | ||
| NC_006155 | 0.53 | 63 | 8 | 2 | hypothetical protein | ||
| LV28_00130 | 0.53 | 17 | 0 | 0 | MBL-fold metallo-hydrolase superfamily | ||
| APZ00_04010 | 0.52 | 0 | 8 | 0 | Soluble lytic murein transglycosylase | ||
| APH_0317 | 0.51 | 0 | 24 | 0 | fabH | 3-oxoacyl-[acyl-carrier-protein] synthase | |