Susanna R Grigson1, Jody C McKerral2, James G Mitchell2, Robert A Edwards2.
Abstract
BACKGROUND: Because the gap between the number of proteins being discovered and the number that have been functionally characterized keeps widening, protein function inference remains a fundamental challenge in computational biology. Known protein annotations are currently organized in human-curated ontologies; however, these ontologies may not organize all possible protein functions accurately. Meanwhile, recent advances in natural language processing and machine learning have produced models that embed amino acid sequences as vectors in n-dimensional space. So far, these embeddings have primarily been used to classify protein sequences according to manually constructed protein classification schemes.
Keywords: Bacteria; Function prediction; Machine learning; Protein ontology; Sequence embedding
Year: 2022 PMID: 36151519 PMCID: PMC9502642 DOI: 10.1186/s12859-022-04930-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 1 Procedure to embed amino acid sequences as vectors using Protvec models
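The caption does not spell out the embedding procedure, so the following is only a minimal sketch of the general ProtVec idea, assuming 3-mers are used as "words": a sequence is split into three shifted lists of non-overlapping 3-mers (the form typically fed to a word2vec-style trainer), and a sequence embedding is taken as the sum of its 3-mer vectors. The function names and the two-dimensional `toy_vectors` table are hypothetical and stand in for a trained model.

```python
from typing import Dict, List


def shifted_kmers(seq: str, k: int = 3) -> List[List[str]]:
    """Split seq into k lists of non-overlapping k-mers, one per start offset."""
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


def embed_sequence(seq: str, kmer_vectors: Dict[str, List[float]], k: int = 3) -> List[float]:
    """Sum the vectors of every overlapping k-mer in seq (unknown k-mers are skipped)."""
    dim = len(next(iter(kmer_vectors.values())))
    total = [0.0] * dim
    for i in range(len(seq) - k + 1):
        vec = kmer_vectors.get(seq[i:i + k])
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


# Hypothetical toy vectors; a real model would have one vector per observed 3-mer.
toy_vectors = {"MKV": [1.0, 0.0], "KVL": [0.0, 2.0], "VLA": [0.5, 0.5]}
sentences = shifted_kmers("MKVLA")
embedding = embed_sequence("MKVLA", toy_vectors)
```

Skipping unknown 3-mers is a design choice made for this sketch; a trained model might instead map them to a dedicated "unknown" vector.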
Fig. 2 (A) Distribution of the lengths of the 3-mer vectors in the Bacillus carbohydrate metabolism Protvec model. The shaded region corresponds to 3-mer vectors with a length greater than 16. (B) Comparison of the Bacillus carbohydrate metabolism Protvec model with the BLOSUM62 matrix. The number of occurrences (count) of each amino acid in 3-mer vectors with a length greater than 16 is compared with the value of each amino acid on the diagonal of the BLOSUM62 matrix
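The counting step behind panel B can be sketched as follows, assuming "length" means the Euclidean norm of a 3-mer vector. The function name and the toy vectors are hypothetical; only the threshold of 16 comes from the caption.

```python
import math
from collections import Counter
from typing import Dict, List


def long_kmer_residue_counts(kmer_vectors: Dict[str, List[float]], threshold: float) -> Counter:
    """Count amino acid occurrences in k-mers whose vector norm exceeds threshold."""
    counts: Counter = Counter()
    for kmer, vec in kmer_vectors.items():
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > threshold:
            counts.update(kmer)  # adds one count per residue in the k-mer
    return counts


# Hypothetical toy vectors, not values from the published model.
toy = {"WWC": [12.0, 12.0], "AAA": [1.0, 1.0], "CWG": [0.0, 17.0]}
counts = long_kmer_residue_counts(toy, 16.0)
```

These per-residue counts are what would then be set against the diagonal entries of BLOSUM62.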
Fig. 3 (A) Sequence embeddings of Bacillus carbohydrate metabolism sequences embedded using the Bacillus carbohydrate metabolism Protvec model, k-mer frequency and the Swiss-Prot Protvec model. Sequences are colored by their subclass and visualized using PCA. (B) CH index of Bacillus carbohydrate metabolism sequences (n = 5000) embedded using the Bacillus carbohydrate metabolism Protvec model, k-mer frequency and the Swiss-Prot Protvec model for K = 2:150 clusters. For each value of K, 500 bootstrap iterations were used
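For readers unfamiliar with the CH (Calinski-Harabasz) index used in panel B, here is a minimal pure-Python sketch of the standard definition: the between-cluster dispersion divided by K - 1, over the within-cluster dispersion divided by n - K. It is not the authors' code, the helper names are hypothetical, and the bootstrap resampling mentioned in the caption is not shown.

```python
from typing import Dict, Hashable, List, Sequence


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of a non-empty set of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points: Sequence[Sequence[float]], labels: Sequence[Hashable]) -> float:
    """CH index: higher values indicate tighter, better separated clusters."""
    n = len(points)
    clusters: Dict[Hashable, List[Sequence[float]]] = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("need at least 2 clusters and fewer clusters than points")
    overall = centroid(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        c = centroid(members)
        between += len(members) * sq_dist(c, overall)
        within += sum(sq_dist(p, c) for p in members)
    # Note: raises ZeroDivisionError if every point sits exactly on its centroid.
    return (between / (k - 1)) / (within / (n - k))


pts = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
good = calinski_harabasz(pts, [0, 0, 1, 1])
bad = calinski_harabasz(pts, [0, 1, 0, 1])
```

In this toy case the labelling that matches the two obvious groups scores far higher than the one that mixes them, which is the behaviour used to compare embeddings across values of K.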
Fig. 4 Comparison of Bacillus carbohydrate metabolism sequences grouped using agglomerative clustering on sequence embeddings using the Bacillus carbohydrate metabolism Protvec model and the SEED annotation hierarchy. The color joining the dendrograms is continuous across the Protvec dendrogram. Boxes are drawn around each subsystem in the SEED annotation hierarchy
Fig. 5 K-means clustering of unannotated Bacillus sequences embedded using a Protvec model trained with unannotated Bacillus sequences. Embedded sequences were grouped into 12 clusters and visualized using t-SNE. The 100 sequences closest to the centroid of each cluster are shown in separate colors and the centroid of each cluster is shown in black
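Selecting the sequences nearest each cluster centroid, as highlighted in this figure, could look roughly like the sketch below. It assumes cluster labels have already been produced by some K-means run and that closeness is measured by Euclidean distance; the function name and the toy data are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def closest_to_centroid(
    points: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    top_n: int,
) -> Dict[Hashable, List[int]]:
    """For each cluster, return indices of the top_n points nearest its centroid."""
    clusters: Dict[Hashable, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    result: Dict[Hashable, List[int]] = {}
    for lab, idxs in clusters.items():
        # Centroid is the coordinate-wise mean of the cluster members.
        c = [sum(col) / len(idxs) for col in zip(*(points[i] for i in idxs))]
        ranked = sorted(
            idxs,
            key=lambda i: sum((x - y) ** 2 for x, y in zip(points[i], c)),
        )
        result[lab] = ranked[:top_n]
    return result


# Hypothetical toy embeddings; the figure uses 12 clusters and top_n = 100.
pts = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
nearest = closest_to_centroid(pts, [0, 0, 0, 1, 1], top_n=2)
```

Returning indices rather than points keeps the result easy to map back to sequence identifiers.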