Krzysztof Odrzywolek, Zuzanna Karwowska, Jan Majta, Aleksander Byrski, Kaja Milanowska-Zabel, Tomasz Kosciolek.
Abstract
Understanding the function of microbial proteins is essential to reveal the clinical potential of the microbiome. The application of high-throughput sequencing technologies allows for fast and increasingly cheap acquisition of data from microbial communities. However, many of the inferred protein sequences are novel and not catalogued, so the possibility of predicting their function through conventional homology-based approaches is limited, which indicates the need for further research on alignment-free methods. Here, we leverage a deep-learning-based representation of proteins to assess its utility in alignment-free analysis of microbial proteins. We trained a language model on the Unified Human Gastrointestinal Protein (UHGP) catalogue and validated the resulting protein representation on the bacterial part of the SwissProt database. Finally, we present a use case on proteins involved in short-chain fatty acid (SCFA) metabolism. Results indicate that the deep learning model manages to accurately represent features related to protein structure and function, allowing for alignment-free protein analyses. Technologies that contextualize metagenomic data are a promising direction to deeply understand the microbiome.
Year: 2022 PMID: 35725732 PMCID: PMC9209496 DOI: 10.1038/s41598-022-14055-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Workflow showing the training of the model and its subsequent use in analyses. (A) Training of the embedding model using the UHGP dataset. (B) Using the Bacterial SwissProt dataset and the embedding model to analyse the information encoded in the embeddings.
Results from metagenomic validation of the trained models. Exponential Cross-Entropy (ECE) measures how good the model is at the training task, which is predicting the next or the previous amino acid in a protein sequence. More detailed results can be found in Supplementary Tables 1 and 2.
| Dataset | EBI-ENA Study Accession ID | ECE, model trained on UHGP | ECE, model trained on Pfam |
|---|---|---|---|
| | PRJEB37249 | 15.3 ± 0.6 | |
| healthy subset | PRJNA762199 | 13.44 ± 0.2 | |
12.
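The caption describes ECE only informally. A minimal sketch of how such a score could be computed, assuming (this is an assumption, not stated in the record) that ECE is the exponential of the mean per-residue cross-entropy, i.e. a perplexity-like quantity. The function name and the toy probabilities are hypothetical.

```python
import math


def exponential_cross_entropy(true_residue_probs):
    """Return exp(mean negative log-probability).

    `true_residue_probs` holds, for each predicted position, the probability
    the model assigned to the amino acid that actually occurs there.
    Assumed definition: ECE = exp(mean cross-entropy).
    """
    if not true_residue_probs:
        raise ValueError("need at least one prediction")
    negative_log_likelihoods = [-math.log(p) for p in true_residue_probs]
    return math.exp(sum(negative_log_likelihoods) / len(negative_log_likelihoods))


# A model guessing uniformly over the 20 standard amino acids.
uniform_ece = exponential_cross_entropy([1 / 20] * 50)

# A model that always assigns probability 1 to the correct residue.
perfect_ece = exponential_cross_entropy([1.0] * 50)
```

Under this assumed definition, a uniform guess over 20 amino acids scores 20 and a perfect model scores 1, so lower values indicate a better fit to the training task.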
Description of Bacterial SwissProt ontology databases. For the label recovery task, we used a number of ontologies that can be assigned to a protein. These ontologies are based on 3D protein structure (SUPFAM, GENE 3D), domains (Pfam, InterPro), or function (GO, KO, EC numbers), or they provide information about the organism of origin (taxonomy).
| Database | Category | Description | Bacterial SwissProt #Proteins | Bacterial SwissProt #Classes |
|---|---|---|---|---|
| SUPFAM | Structure | SUPFAM associates sequence families from Pfam with SCOP structural families using profile matching to produce sequence superfamilies of known structure | 147,137 | 989 |
| GENE 3D | Structure | GENE 3D contains protein domain assignments for sequences from all of the major sequence databases. Domains are predicted using a library of representative profile HMMs, derived from CATH superfamilies or directly mapped from structures in the CATH database | 116,919 | 1173 |
| InterPro | Sequence and domain | InterPro brings together 11 protein family databases (CATH-Gene3D, HAMAP, PANTHER, Pfam, PRINTS, ProDom, PROSITE Patterns, PROSITE Profiles, SMART, SUPERFAMILY, and TIGRFAMs). Each database provides a specific signature i.e. position-specific score matrices, hidden Markov models and profiles etc. to increase the sensitivity of protein classification | 198,677 | 12,244 |
| KO (KEGG Orthology) | Function | KO is a database of molecular functions. Each molecular function is represented in terms of a manually defined functional ortholog that together create molecular networks (pathways). Each functional ortholog is defined from experimentally characterized genes and proteins in specific organisms, which are then used to assign orthologous genes in other organisms, based on sequence similarity | 177,018 | 6614 |
| GO (Gene Ontology) | Function | GO is a controlled terminology that can be used to consistently and structurally identify genes and gene products. The GO terms are organized within a directed acyclic graph (DAG), and each GO term has a described relationship to one or more other terms in the same domain (i.e. biological process, molecular function, or cellular location) | 192,990 | 5799 |
| eggNOG | Function and taxonomy | eggNOG is a database of orthology relationships, gene evolutionary histories and functional annotations. It is built on the concept of OGs (orthologous groups) that are the result of a non-supervised analysis of thousands of genomes and relationships between all their genes | 162,261 | 15,932 |
| EC number | Function | EC numbers are a manually assigned nomenclature that describes enzymes, based on the chemical reactions they catalyse | 193,198 | 3005 |
| Pfam | Sequence and domain | Pfam is a database of protein families and domains. Each Pfam family has a seed alignment that contains a representative set of sequences for the entry. This alignment is used to build a profile hidden Markov model, which is then searched against the sequence database called pfamseq using the HMMER software | 120,184 | 5551 |
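| Taxonomy: Order | Taxonomy | UniProt uses the NCBI taxonomic database to assign taxonomic identifiers to nucleotide sequences | 200,536 | 132 |
| Taxonomy: Family | Taxonomy | UniProt uses the NCBI taxonomic database to assign taxonomic identifiers to nucleotide sequences | 198,996 | 274 |
| Taxonomy: Genus | Taxonomy | UniProt uses the NCBI taxonomic database to assign taxonomic identifiers to nucleotide sequences | 200,615 | 660 |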
Figure 2. The degree of correctness in the recovery of labels using deep, k-mer-based, and amino acid frequency representations, and MMseqs2, a state-of-the-art protein search tool. The recovery is measured by four metrics: Intersection over Union (IoU), F1 Score, Precision, and Recall. Ontologies are sorted by average results.
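The four metrics named in the Figure 2 caption can be illustrated for a single protein. This is a minimal sketch assuming the recovered and true labels are compared as sets; the function name and the placeholder labels are hypothetical, and the record does not say how scores are aggregated across proteins.

```python
def label_recovery_scores(true_labels, predicted_labels):
    """Compare two label sets and return IoU, precision, recall and F1."""
    true_set, predicted_set = set(true_labels), set(predicted_labels)
    overlap = len(true_set & predicted_set)
    union = len(true_set | predicted_set)

    # Guard every ratio against an empty denominator.
    precision = overlap / len(predicted_set) if predicted_set else 0.0
    recall = overlap / len(true_set) if true_set else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    iou = overlap / union if union else 0.0
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}


# Placeholder labels: two of three true labels are recovered,
# plus one label that should not have been assigned.
scores = label_recovery_scores({"a", "b", "c"}, {"a", "b", "d"})
```

In this toy case the overlap is 2 and the union is 4, so IoU is 0.5 while precision, recall and F1 are each 2/3, which shows that IoU penalizes a partial match more heavily than F1 does.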
Figure 3. Visualization of the first two UMAP components of Bacterial SwissProt embeddings. (A) Proteins colored by Recovery Error Rate, the metric that quantifies how hard it was to recover a protein’s labels based on its neighbors. (B) Proteins colored by percentage of transmembrane residues in a protein chain; adapted from Perdigão et al.[56]. (C) Proteins colored by sequence length.
Figure 4. Deep embeddings UMAP projection of Bacterial SwissProt colored by KO. (A) Transferase proteins that share the same Pfam domain and belong to the EC 2.5.1 class: UDP-N-acetylglucosamine 1-carboxyvinyltransferase (K00790) in dark green, 3-phosphoshikimate 1-carboxyvinyltransferase (K00800) in brown. (B) GTP-binding proteins sharing Pfam domains: Elongation Factor G (K02355) in purple, Peptide chain release factor (K02837) in pink. (C) All Bacterial SwissProt proteins. (D) Proteins that belong to the tRNA ligases class (EC 6.1.1): Cysteine (K01883) in dark green, Arginine (K01887) in blue, Glutamate (K01885) in navy blue, Glutamine (K01886) in cyan, Valine (K01873) in pink, and Isoleucine (K01870) in light green. (E) Ribosomal proteins: 30S ribosomal protein S1 (K02961) in light green, 50S ribosomal protein L14 (K02874) in light blue, 50S ribosomal protein L36 (K02919) in black, 50S ribosomal protein L35 (K02916) in dark green, and 50S ribosomal protein L15 (K02876) in purple.
Figure 5. Visualization of EC 2.7.2 proteins in the deep embedding space. (A) Deep embeddings of EC 2.7.2 proteins visualized with UMAP. Colors correspond to EC numbers and shapes to Pfam domains. Axes represent the first two UMAP components. (B) Domain architecture of EC 2.7.2. (C) The mean distance between EC 2.7.2 proteins and 500 random proteins from the SwissProt space, distinguishing embedding-based distances (green) from ClustalO distances (red). Values for both methods were calculated as averages of pairwise distances between all proteins within given clusters. (D) Comparison of embedding-based and sequence-based (ClustalO) distances to EC 2.7.2.1 proteins. The distances were divided into those within the EC 2.7.2.1 protein group, from EC 2.7.2.1 to other EC 2.7.2 proteins, and from EC 2.7.2.1 to randomly selected proteins. The embedding-based distance, unlike the sequence-based distance, separates the distances from EC 2.7.2.1 to other members of EC 2.7.2 from the distances from EC 2.7.2.1 to random proteins. Marginal histograms show the distribution of the two analysed distances in the three categories described above.
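The averaging of pairwise distances described in panels (C) and (D) can be sketched as follows. This is an illustration only: it assumes Euclidean distance between embedding vectors (the record does not name the metric), and the function names and 2-D toy vectors are hypothetical stand-ins for real protein embeddings.

```python
import math
from itertools import combinations


def mean_within_distance(vectors):
    """Average distance over all unordered pairs inside one group."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def mean_between_distance(group_a, group_b):
    """Average distance over all pairs drawn from two different groups."""
    if not group_a or not group_b:
        raise ValueError("both groups must be non-empty")
    total = sum(math.dist(a, b) for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))


# Toy embeddings: a tight group and a distant set of "random" proteins.
enzyme_group = [(0.0, 0.0), (0.0, 2.0)]
random_group = [(4.0, 0.0), (4.0, 2.0)]

within = mean_within_distance(enzyme_group)
between = mean_between_distance(enzyme_group, random_group)
```

A group whose within-group mean is clearly smaller than its mean distance to random proteins is the pattern the caption attributes to the embedding-based distance.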