Panyu Ren, Xiaodi Yang, Tianpeng Wang, Yunpeng Hou, Ziding Zhang.
Abstract
Cryptosporidium parvum (C. parvum), one of the most studied species of the apicomplexan genus Cryptosporidium, causes cryptosporidiosis, a serious diarrheal disease found worldwide that can be deadly to immunodeficient individuals, newborn children, and animals. Proteome-wide identification of protein-protein interactions (PPIs) has proven valuable for the systematic understanding of the genome-phenome relationship. However, the PPIs of C. parvum remain largely unknown because few experimental studies have been carried out. We therefore combined three bioinformatics methods, i.e., interolog mapping (IM), domain-domain interaction (DDI)-based inference, and a machine learning (ML) method, to jointly predict the PPIs of C. parvum. Because experimental PPIs of C. parvum are lacking, we used the PPI data of Plasmodium falciparum (P. falciparum), which has the largest number of PPIs in Apicomplexa, to train an ML model to infer C. parvum PPIs. We took the consistent results of the three methods as the predicted high-confidence PPI network, which contains 4,578 PPIs covering 554 proteins. To further explore the biological significance of the constructed PPI network, we also conducted essential network and protein functional analyses, focusing mainly on hub proteins and functional modules. We anticipate that the constructed PPI network can become an important data resource to accelerate functional genomics studies of C. parvum and offer new hints for target discovery in drug/vaccine development.
Keywords: AC, Auto Covariance; AI, artificial intelligence; AP-MS, affinity purification coupled with mass spectrometry; BP, biological process; C. parvum, Cryptosporidium parvum; CC, cellular component; Cryptosporidium parvum; DDI, domain-domain interaction; DM, distributed-memory; DMI, domain-motif interaction; DPC, Di-peptide Composition; Doc2Vec, Document to vector; Domain-domain interaction; GO, Gene Ontology; HSP70s, 70kDa heat shock proteins; IM, interolog mapping; ITA, isothermal titration calorimetry; Interolog mapping; KEGG, Kyoto Encyclopedia of Genes and Genomes; LD, Local Descriptor; MF, molecular function; ML, machine learning; Machine learning; P. falciparum, Plasmodium falciparum; PCA, protein complementation assay; PPIs, protein-protein interactions; PR, Precision-Recall; Prediction; Protein–protein interaction; RF, Random Forest; ROC, Receiver Operating Characteristic; SGD, stochastic gradient descent; SPR, surface plasmon resonance; Y2H, yeast two-hybrid
Year: 2022 PMID: 35615014 PMCID: PMC9120227 DOI: 10.1016/j.csbj.2022.05.017
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 6.155
Fig. 1 Workflow of the proposed computational pipeline to predict C. parvum PPIs. In the dataset preparation step, we constructed positive and negative samples based on P. falciparum PPI data from BioGRID and the Swiss-Prot database, and divided the dataset into a training set (80%) and an independent test set (20%). We then extracted protein features using the Doc2Vec and DPC encoding schemes. The Doc2Vec model was trained on a compiled protein corpus covering sequences from Swiss-Prot, the P. falciparum PPI dataset, and the proteomes of P. falciparum and C. parvum. Based on the encoded feature vectors of the P. falciparum PPI dataset, we trained the ML classification model using the RF algorithm and assessed its performance. Finally, we transferred the trained ML model to predict C. parvum PPIs and combined it with two other traditional methods (i.e., IM and DDI inference) to obtain the high-confidence C. parvum PPI network.
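As an illustration of the DPC (Di-peptide Composition) encoding named in the workflow, the following is a minimal sketch, not the authors' code. It assumes the usual definition of DPC as the frequencies of the 400 adjacent amino-acid pairs. The function names `dpc_encode` and `encode_pair`, the skipping of non-standard residues, and the concatenation of the two protein vectors to represent a pair are all assumptions made for this example.

```python
from itertools import product

# The 20 standard amino acids and the 400 ordered dipeptides built from them.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dpc_encode(sequence):
    """Return a 400-dimensional vector of dipeptide frequencies.

    Dipeptides containing non-standard residues (e.g. 'X') are skipped;
    this handling is an assumption of the sketch.
    """
    seq = sequence.upper()
    counts = [0] * len(DIPEPTIDES)
    total = 0
    for i in range(len(seq) - 1):
        idx = INDEX.get(seq[i:i + 2])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [c / total for c in counts]


def encode_pair(seq_a, seq_b):
    """Represent a protein pair by concatenating both DPC vectors (assumed)."""
    return dpc_encode(seq_a) + dpc_encode(seq_b)
```

A pair vector built this way has 800 dimensions and could be passed to any standard classifier, such as a random forest.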
Fig. 2 (A) Performance of Random Forest (RF) classifiers based on different individual sequence encoding schemes in predicting P. falciparum PPIs. Areas under the Precision-Recall curves (AUPRC) indicate that Document to Vector (Doc2Vec) outperformed Di-peptide Composition (DPC), Local Descriptor (LD), and Auto Covariance (AC) on an independent test set. (B) Performance of RF classifiers based on the two best individual sequence encoding schemes (i.e., Doc2Vec and DPC) and on their combination (Doc2Vec + DPC) in predicting P. falciparum PPIs. AUPRC indicates that the combined sequence encoding scheme slightly outperformed each individual sequence encoding scheme.
Performance of RF classifiers based on individual or combined encoding schemes.
| Method | AUC | AUPRC |
|---|---|---|
| Doc2Vec | 0.957 | 0.762 |
| DPC | 0.955 | 0.745 |
| LD | 0.950 | 0.710 |
| AC | 0.928 | 0.687 |
| Doc2Vec + DPC | 0.963 | 0.768 |
| Doc2Vec + LD | 0.958 | 0.742 |
| Doc2Vec + AC | 0.957 | 0.752 |
| DPC + LD | 0.954 | 0.732 |
| DPC + AC | 0.953 | 0.741 |
| LD + AC | 0.951 | 0.712 |
| Doc2Vec + DPC + LD | 0.959 | 0.752 |
| Doc2Vec + DPC + AC | 0.956 | 0.755 |
| Doc2Vec + LD + AC | 0.958 | 0.742 |
| DPC + LD + AC | 0.954 | 0.731 |
| Doc2Vec + DPC + LD + AC | 0.959 | 0.749 |
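The AUPRC column in the table above summarizes a precision-recall curve in a single number. The source does not state how the area was computed; one common estimator is average precision, sketched below in plain Python. The function name `average_precision` and the toy labels and scores are hypothetical, and tied scores are not given any special treatment.

```python
def average_precision(labels, scores):
    """Estimate AUPRC as the mean precision at the rank of each positive.

    labels: iterable of 0/1 ground-truth values (1 = interacting pair).
    scores: classifier scores, higher meaning more likely to interact.
    """
    labels = list(labels)
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive sample is required")
    # Rank samples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / n_pos


# Toy example with hypothetical labels and scores.
example_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

In the toy example the two positives sit at ranks 1 and 3, so the estimate is (1/1 + 2/3) / 2. Because AUPRC depends on the positive-to-negative ratio, values are comparable only across methods evaluated on the same test set.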
Fig. 3 (A) Performance of Random Forest (RF) classifiers based on different individual sequence encoding schemes in predicting P. falciparum PPIs under the novel partition of the dataset (i.e., no proteins shared between the training set and the test set). Areas under the Precision-Recall curves (AUPRC) indicate that Di-peptide Composition (DPC) outperformed Document to Vector (Doc2Vec), Local Descriptor (LD), and Auto Covariance (AC) on an independent test set. (B) Performance of RF classifiers based on the two best individual sequence encoding schemes (i.e., DPC and Doc2Vec) and on their combination (Doc2Vec + DPC) in predicting P. falciparum PPIs. AUPRC indicates that combining the sequence encoding schemes significantly improves performance compared to each individual sequence encoding scheme.
Performance of RF classifiers based on individual or combined encoding schemes using the novel dataset partition (i.e., no proteins shared between the training set and the test set).
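The caption above describes a partition in which the training and test sets share no proteins. The source does not describe how that partition was built; the sketch below shows one plausible way, assuming proteins are first assigned to a training or test group and any pair spanning both groups is discarded. The name `protein_disjoint_split` and its parameters are hypothetical.

```python
import random


def protein_disjoint_split(pairs, test_fraction=0.2, seed=0):
    """Split protein pairs so that no protein occurs in both sets.

    pairs: list of (protein_a, protein_b) tuples.
    test_fraction: fraction of *proteins* (not pairs) assigned to the test group.
    """
    proteins = sorted({p for pair in pairs for p in pair})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = max(1, int(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])

    train, test = [], []
    for a, b in pairs:
        a_in_test = a in test_proteins
        b_in_test = b in test_proteins
        if a_in_test and b_in_test:
            test.append((a, b))
        elif not a_in_test and not b_in_test:
            train.append((a, b))
        # Pairs with one protein in each group are dropped.
    return train, test
```

Because cross-group pairs are dropped, the resulting pair-level split ratio generally differs from `test_fraction`, and the usable dataset is smaller than under a random pair-level split.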
| Method | AUC | AUPRC |
|---|---|---|
| Doc2Vec | 0.936 | 0.613 |
| DPC | 0.930 | 0.631 |
| LD | 0.919 | 0.509 |
| AC | 0.813 | 0.340 |
| Doc2Vec + DPC | 0.960 | 0.742 |
| Doc2Vec + LD | 0.942 | 0.635 |
| Doc2Vec + AC | 0.949 | 0.700 |
| DPC + LD | 0.928 | 0.558 |
| DPC + AC | 0.929 | 0.615 |
| LD + AC | 0.918 | 0.527 |
| Doc2Vec + DPC + LD | 0.944 | 0.650 |
| Doc2Vec + DPC + AC | 0.958 | 0.732 |
| Doc2Vec + LD + AC | 0.942 | 0.644 |
| DPC + LD + AC | 0.927 | 0.572 |
| Doc2Vec + DPC + LD + AC | 0.945 | 0.656 |
Fig. 4 Overlaps of predicted C. parvum PPIs among three computational prediction methods.
Fig. 5 Enriched GO terms and KEGG pathways of the predicted interactors of the hub proteins.
Fig. 6 GO term/KEGG pathway enrichment analysis of the DNA replication module in the categories of biological process (A), cellular component (B), molecular function (C), and KEGG pathway (D).
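The overlap among the three prediction methods can be pictured as a set intersection over unordered protein pairs. The sketch below assumes that the high-confidence network consists of pairs predicted by all three methods, which is one reading of "consistent results". The function names and protein identifiers are hypothetical.

```python
def normalize(pairs):
    """Treat (A, B) and (B, A) as the same undirected interaction."""
    return {tuple(sorted(pair)) for pair in pairs}


def consensus_ppis(im_pairs, ddi_pairs, ml_pairs):
    """Return the pairs predicted by all three methods (assumed consensus rule)."""
    return normalize(im_pairs) & normalize(ddi_pairs) & normalize(ml_pairs)


# Hypothetical predictions from each method.
im_pred = [("P1", "P2"), ("P2", "P3")]
ddi_pred = [("P2", "P1"), ("P3", "P4")]
ml_pred = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]

high_confidence = consensus_ppis(im_pred, ddi_pred, ml_pred)
```

Normalizing the order of each pair before intersecting keeps the same interaction from being missed when two methods list its proteins in opposite order.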