Varun S Sharma, Andrea Fossati, Rodolfo Ciuffa, Marija Buljan, Evan G Williams, Zhen Chen, Wenguang Shao, Patrick G A Pedrioli, Anthony W Purcell, María Rodríguez Martínez, Jiangning Song, Matteo Manica, Ruedi Aebersold, Chen Li.
Abstract
In molecular biology, it is generally assumed that the ensemble of expressed molecules, together with their activities and interactions, determines biological function, cellular states and phenotypes. Stable protein complexes, or macromolecular machines, are in turn the key functional entities mediating and modulating most biological processes. Although identifying protein complexes and their subunit composition can now be done inexpensively and at scale, determining their function remains challenging and labor intensive. This study describes Protein Complex Function predictor (PCfun), the first computational framework for the systematic annotation of protein complex functions using Gene Ontology (GO) terms. PCfun is built upon a word embedding generated with natural language processing techniques from 1 million open access PubMed Central articles. Specifically, PCfun uses two approaches to identify protein complex function: (i) an unsupervised approach that retrieves the nearest neighbor (NN) GO term word vectors for a protein complex query vector and (ii) a supervised approach using Random Forest (RF) models trained specifically to recover the GO terms of protein complex queries described in the CORUM protein complex database. PCfun consolidates both approaches by performing a hypergeometric statistical test of whether the top NN GO terms are enriched among the child terms of the GO terms predicted by the RF models. The documentation and implementation of the PCfun package are available at https://github.com/sharmavaruns/PCfun. We anticipate that PCfun will serve as a useful tool and novel paradigm for the large-scale characterization of protein complex function.
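As an illustration of the consolidation step described in the abstract, the following minimal Python sketch ranks GO term vectors by cosine similarity to a query vector and then applies a one-sided hypergeometric test to ask whether the top NN terms are over-represented among the child terms of one RF-predicted term. It is not the PCfun implementation or its API: the function names, the two-dimensional toy vectors, the placeholder identifiers such as `GO:A`, the choice of cosine similarity and the choice of all GO terms as the test population are assumptions made only for this example.

```python
from math import comb, sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_nn_terms(query_vec, go_vectors, k):
    """Return the k GO terms whose vectors are most similar to the query."""
    ranked = sorted(
        go_vectors,
        key=lambda term: cosine(query_vec, go_vectors[term]),
        reverse=True,
    )
    return ranked[:k]


def hypergeom_pvalue(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return tail / comb(population, draws)


# Hypothetical toy embedding: placeholder GO identifiers, 2-D vectors.
go_vectors = {
    "GO:A": [1.0, 0.0],
    "GO:B": [0.9, 0.1],
    "GO:C": [0.0, 1.0],
    "GO:D": [-1.0, 0.0],
}
query = [1.0, 0.0]  # stand-in for a protein complex query vector

top_nn = top_nn_terms(query, go_vectors, k=2)

# Hypothetical child terms of a single RF-predicted GO term.
children_of_rf_term = {"GO:A", "GO:B"}

overlap = sum(term in children_of_rf_term for term in top_nn)
p_value = hypergeom_pvalue(
    population=len(go_vectors),          # all GO terms considered
    successes=len(children_of_rf_term),  # terms that are children
    draws=len(top_nn),                   # size of the top-NN list
    observed=overlap,                    # top-NN terms that are children
)
print(top_nn, overlap, p_value)
```

In this toy case both top NN terms fall among the two child terms out of four candidates, so the tail probability is 1/6. A real analysis would use the full GO vocabulary as the population and would need multiple-testing correction across RF-predicted terms.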
Keywords: gene ontology; machine learning; natural language processing; protein complex function
Year: 2022 PMID: 35724564 PMCID: PMC9310514 DOI: 10.1093/bib/bbac239
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 13.994