Matineh Rahmatbakhsh, Mohamed Taha Moutaoufik, Alla Gagarinova, Mohan Babu.
Abstract
Motivation: Despite arduous and time-consuming experimental efforts, protein-protein interactions (PPIs) between many pathogenic microbes and their human host are still unknown, limiting both our understanding of the intricate interactions that occur during infection and the identification of therapeutic targets. Since computational tools offer a promising alternative, we developed HPiP (Host-Pathogen Interaction Prediction), an R/Bioconductor package that uses a series of amino acid sequence property descriptors and an ensemble of machine learning classifiers to predict the as-yet unmapped interactions between pathogen and host proteins.
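To illustrate what a sequence property descriptor does, the sketch below computes amino acid composition (the fraction of each of the 20 standard residues) for a pathogen protein and a host protein and concatenates the two vectors into one feature row. This is a minimal Python illustration, not the HPiP implementation (which is an R package): the function names, the choice of composition as the descriptor, and the two example sequences are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in `sequence` (20 values)."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_features(pathogen_seq, host_seq):
    """Concatenate both composition vectors into one 40-value feature row."""
    return amino_acid_composition(pathogen_seq) + amino_acid_composition(host_seq)


# Hypothetical toy sequences, used only to show the shape of the output.
features = pair_features("MKTAYIAKQR", "MAAGGLLK")
print(len(features))
```

Each pathogen-host pair becomes one fixed-length numeric row regardless of protein length, which is what allows a standard classifier to be trained on the pairs.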
Year: 2022 PMID: 35669347 PMCID: PMC9154073 DOI: 10.1093/bioadv/vbac038
Source DB: PubMed Journal: Bioinform Adv ISSN: 2635-0041
Fig. 1. HPiP workflow, parameter evaluation and validation of predicted interactions. (a) Five-step computational pipeline to predict HP-PPIs: (i) Building a literature-curated training set; (ii) Converting amino acid sequences to numerical values using physicochemical descriptors; (iii) Selecting protein sequence-based numerical variables from different descriptors as input for ML; (iv) Fine-tuning ML model parameters using the training set, followed by CV and model performance evaluation; and (v) Scoring HP-PPIs in the test set, and predicting multiprotein complexes from the HP-PPI network using fast greedy (FG), walk-trap (WT), label propagation (LP), multilevel community (ML) and Markov clustering (MCL) methods. Putative interacting protein pairs were subjected to functional enrichment analysis for associations with biological functions using GO and KEGG pathway terms. For acronyms of physicochemical descriptors, see Supplementary Table S1. RFE, recursive feature elimination; SVM, support vector machine; LR, logistic regression. (b, c) auROC comparing the performance of the ensemble with other ML classifiers, evaluated using 10-fold CV on the training (b) and test (c) sets. (d) Histogram (left) and distribution (right) of SARS-CoV-2-human protein interaction pairs at various ensemble scores; the arrow marks the ensemble cutoff score of ≥0.6, above which interactions are statistically significant (P-value ≤ 10⁻³); P-value computed by permutation test. (e) Overlap of SARS-CoV-2-human PPIs predicted in this study with literature-curated interactions from the BioGRID database and with experimentally derived PPIs from THP-1 cells obtained by the AP/MS method. (f, g) Evidence supporting the sequence-based predictions of SARS-CoV-2-human PPIs: colocalization of the interacting proteins to the same cellular compartment (f), and human host factors interacting with SARS-CoV-2 proteins that were vital for coronavirus infection according to published genetic screens (g).
The number in parentheses indicates the total number of interactions (e, f) or host factors (g) used for comparison. (h) Overlap of sequence-based predictions of SARS-CoV-2-human PPIs with associations detected by AP/MS in THP-1 cells or present in the BioGRID database at varying ensemble score cutoffs.
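The scoring step (v) and the ≥0.6 ensemble cutoff from panel (d) can be pictured with the short Python sketch below. It assumes that the ensemble score of a pair is the unweighted mean of the interaction probabilities returned by the individual classifiers; the caption does not state how HPiP combines classifiers, so that rule is an assumption here, as are the function names, the protein labels and the probabilities.

```python
def ensemble_score(probabilities):
    """Unweighted mean of per-classifier probabilities (an assumed rule)."""
    if not probabilities:
        raise ValueError("at least one classifier probability is required")
    return sum(probabilities) / len(probabilities)


def select_interactions(pair_probabilities, cutoff=0.6):
    """Keep pairs whose ensemble score reaches the cutoff, best score first."""
    scored = [
        (pair, ensemble_score(probs))
        for pair, probs in pair_probabilities.items()
    ]
    kept = [(pair, score) for pair, score in scored if score >= cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up probabilities from three hypothetical classifiers per protein pair.
pair_probabilities = {
    ("viral_A", "host_X"): [0.9, 0.7, 0.8],
    ("viral_A", "host_Y"): [0.2, 0.5, 0.4],
    ("viral_B", "host_Z"): [0.55, 0.5, 0.6],
}

selected = select_interactions(pair_probabilities)
for pair, score in selected:
    print(pair, round(score, 2))
```

Only the first pair survives the 0.6 cutoff in this toy input. Raising or lowering `cutoff` trades the number of retained pairs against their confidence, which is the kind of comparison panel (h) reports across score cutoffs.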