Rupanjali Chaudhuri, Faraz Alam Ansari, Muthukurussi Varieth Raghunandanan, Srinivasan Ramachandran.
Abstract
BACKGROUND: The availability of sequence data of human pathogenic fungi generates opportunities to develop bioinformatics tools and resources for vaccine development to benefit at-risk patients. DESCRIPTION: We have developed a fungal adhesin predictor and an immunoinformatics database with predicted adhesins. Based on literature search and domain analysis, we prepared a positive dataset comprising adhesin protein sequences from the human fungal pathogens Candida albicans, Candida glabrata, Aspergillus fumigatus, Coccidioides immitis, Coccidioides posadasii, Histoplasma capsulatum, Blastomyces dermatitidis, Pneumocystis carinii, Pneumocystis jirovecii and Paracoccidioides brasiliensis. The negative dataset consisted of proteins with a high probability of functioning intracellularly. We used 3945 compositional properties, including frequencies of mono, doublet, triplet, and multiplets of amino acids and hydrophobic properties, as input features of protein sequences to a Support Vector Machine. The best classifiers were identified through an exhaustive search of 588 parameters, selecting those with the best Matthews Correlation Coefficient (MCC) and the lowest coefficient of variation across the three-fold cross-validation datasets. The "FungalRV adhesin predictor" was built on three models whose average MCC was in the range 0.89-0.90 and whose coefficient of variation across the three-fold cross-validation datasets was in the range 1.2%-2.74% at a threshold score of 0. We obtained an overall MCC value of 0.8702 considering all 8 pathogens, namely C. albicans, C. glabrata, A. fumigatus, B. dermatitidis, C. immitis, C. posadasii, H. capsulatum and P. brasiliensis, thus showing high sensitivity and specificity at a threshold of 0.511. In the case of P. brasiliensis, the algorithm achieved a sensitivity of 66.67%. A total of 307 fungal adhesins and adhesin-like proteins were predicted from the entire proteomes of eight human pathogenic fungal species.
The immunoinformatics analysis data on these proteins were organized for easy analysis by users through a Web interface. The predicted adhesin sequences were processed through 18 immunoinformatics algorithms, and these data have been organized into a MySQL backend. A user-friendly interface has been developed for experimental researchers to retrieve information from the database.
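As an illustration of the compositional input features described in the abstract, the following is a minimal sketch that computes mono and doublet amino acid frequencies for one sequence. It covers only 420 of the 3945 properties; the triplet, multiplet, and hydrophobic features are not specified in enough detail here to reproduce. The function names `kmer_frequencies` and `composition_features` are hypothetical and are not taken from the FungalRV software.

```python
from collections import Counter
from itertools import product

# The 20 standard amino acid residues (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k):
    """Relative frequency of every possible k-mer over the 20 standard residues.

    Windows containing non-standard letters still count toward the total
    number of windows but get no feature of their own (an assumption).
    """
    sequence = sequence.upper()
    n_windows = len(sequence) - k + 1
    counts = Counter(sequence[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for very short input
    return {
        "".join(kmer): counts.get("".join(kmer), 0) / total
        for kmer in product(AMINO_ACIDS, repeat=k)
    }


def composition_features(sequence):
    """Mono (20) plus doublet (400) frequencies as one feature dictionary."""
    features = kmer_frequencies(sequence, 1)
    features.update(kmer_frequencies(sequence, 2))
    return features


features = composition_features("ACAC")
```

A fixed key order (for example, sorted keys) would be needed to turn these dictionaries into the feature vectors an SVM implementation expects.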
Year: 2011 PMID: 21496229 PMCID: PMC3224177 DOI: 10.1186/1471-2164-12-192
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
List of databases from which the human pathogenic fungal proteomes were sourced.
| Species | Source | Reference |
|---|---|---|
| | Candida Genome Database | [ |
| | Genolevures | [ |
| | J. Craig Venter Institute | [ |
| | Broad Institute | [ |
| | Broad Institute | [ |
| | Broad Institute | [ |
| | Broad Institute | [ |
| | Broad Institute | [ |
| | Sanger Institute | [ |
| | Broad Institute | [ |
| | Broad Institute | [ |
| | Broad Institute | [ |
| | Broad Institute | [ |
Figure 1. Support Vector Machine (SVM) run flowchart. The SVM was trained and tested following this flow process, and the best classifiers were selected.
Parameter Sets and Performances of three Selected Models to Identify Fungal Adhesins and Adhesin-Like Proteins in human pathogenic fungal species.
| Best model (classifier) | Kernel type | Parameters | Performance of best model (MCC) | Mean MCC for parameters | CV for parameters | Accuracy |
|---|---|---|---|---|---|---|
| 470a | RBF | g = 0.01 | 0.9189 | 0.8981 | 2.74% | 99.45% |
| 470b | RBF | g = 0.01 | 0.9044 | 0.8981 | 2.74% | 99.34% |
| 449c | RBF | g = 0.001 | 0.8876 | 0.8922 | 1.20% | 99.23% |
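The two selection criteria reported in this table, MCC and the coefficient of variation (CV) of MCC across folds, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the use of the sample standard deviation for the CV is an assumption, since the source does not say which form was used.

```python
import math
import statistics


def matthews_cc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when any marginal total is zero
    return (tp * tn - fp * fn) / denominator


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean.

    Assumption: the sample (n - 1) standard deviation is used.
    """
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical per-fold MCC values, for illustration only (not from the paper).
fold_mccs = [0.88, 0.90, 0.92]
mean_mcc = statistics.mean(fold_mccs)
cv_percent = coefficient_of_variation(fold_mccs)
```

Under this reading, a parameter set is preferred when `mean_mcc` is high and `cv_percent` is low, that is, when performance is both good and stable across the three folds.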
Figure 2. Receiver operating characteristic curve. The selected optimal threshold value (marked by arrow) for the "FungalRV adhesin predictor" is shown.
Summary of predictions by FungalRV adhesin predictor using optimal threshold of 0.511.
| Species | Number of Proteins | Number of Known | Number of adhesins | Number of hypothetical | Number of false |
|---|---|---|---|---|---|
| | 38 | 2 | 2 (100%) | 20 | 0 |
| | 81 | 14 | 14 (100%) | 0 | 1 |
| | 62 | 20 | 20 (100%) | 0 | 0 |
| | 33 | 1 | 1 (100%) | 10 | 2 |
| | 23 | 1 | 1 (100%) | 8 | 0 |
| | 27 | 1 | 1 (100%) | 13 | 1 |
| | 21 | 1 | 1 (100%) | 6 | 1 |
| | 27 | 3 | 2 (66.67%) | 11 | 0 |
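The percentages in parentheses in this table are per-species sensitivities: known adhesins recovered divided by known adhesins present. The sketch below reproduces that arithmetic and shows how a score might be turned into a call at the 0.511 threshold. The function names are hypothetical, and treating a score equal to the threshold as positive is an assumption.

```python
OPTIMAL_THRESHOLD = 0.511  # threshold reported for the FungalRV adhesin predictor


def sensitivity_percent(recovered, known):
    """Percentage of known adhesins that were predicted as adhesins."""
    return 100.0 * recovered / known


def is_predicted_adhesin(score, threshold=OPTIMAL_THRESHOLD):
    """Call a protein an adhesin when its score reaches the threshold.

    Assumption: a score exactly equal to the threshold counts as positive.
    """
    return score >= threshold


# Last row of the table: 2 of 3 known adhesins recovered.
last_row = round(sensitivity_percent(2, 3), 2)
```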
Parameter Sets and Performances of three Selected Models to Identify Fungal Adhesins and Adhesin-Like Proteins in other fungi (not pathogenic to human).
| Best model (classifier) | Kernel type | Parameters | Performance of best model (MCC) | Mean MCC for parameters | CV for parameters |
|---|---|---|---|---|---|
| 26a | polynomial | d = 2 | 0.9019 | 0.89 | 3.24% |
| 470b | RBF | g = 0.01 | 0.9044 | 0.8981 | 2.74% |
| 6c | polynomial | d = 1 | 0.9044 | 0.8945 | 0.9% |
Algorithms used to analyse predicted adhesins for Immunoinformatics.
| Algorithm | Description | Reference |
|---|---|---|
| 1. BLASTCLUST | Clusters protein or DNA sequences based on pairwise matches found using the BLAST algorithm in case of proteins or Mega BLAST algorithm for DNA. | [ |
| 2. OrthoMCL | OrthoMCL software was used to cluster proteins based on sequence similarity, using an all-against-all BLAST search of each species' proteome, followed by normalization of inter-species differences, and Markov clustering. | [ |
| 3. BetaWrap | Predicts the right-handed parallel beta-helix supersecondary structural motif in primary amino acid sequences by using beta-strand interactions learned from non-beta-helix structures. | [ |
| 4. Antigenic | Predicts potentially antigenic regions of a protein sequence, based on occurrence frequencies of amino acid residue types in known epitopes. | [ |
| 5. TargetP1.1 | Predicts the subcellular location of eukaryotic proteins based on the predicted presence of any of the N-terminal presequences: chloroplast transit peptide ( | [ |
| 5. SignalP 3.0 | Predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks and hidden Markov models. | [ |
| 6. TMHMM Server v. 2.0 | Predicts the transmembrane helices in proteins based on Hidden Markov Model. | [ |
| 7. Conserved Domain Database and Search Service, v2.22 | The Database is a collection of multiple sequence alignments for ancient domains and full-length proteins. It is used to identify the conserved domains present in a protein query sequence. | [ |
| 8. BlastP | It uses the BLAST algorithm to compare an amino acid query sequence against a protein sequence database. | [ |
| 9. ABCPred | Predict | [ |
| 10. BcePred | Predicts linear B-cell epitopes, using physico-chemical properties. | [ |
| 11. Discotope 1.2 | Predicts discontinuous B cell epitopes from protein three dimensional structures utilizing calculation of surface accessibility (estimated in terms of contact numbers) and a novel epitope propensity amino acid score. | [ |
| 12. BEPro | BEPro uses a combination of amino-acid propensity scores and half sphere exposure values at multiple distances to achieve state-of-the-art performance. | [ |
| 13. Propred | Predicts MHC Class-II binding regions in an antigen sequence, using quantitative matrices derived from published literature. It assists in locating promiscuous binding regions that are useful in selecting vaccine candidates. | [ |
| 14. IEDB-AR (Average Relative Binding Method) | Predicts IC(50) values allowing combination of searches involving different peptide sizes and alleles into a single global prediction. | [ |
| 15. Bimas | Ranks potential 8-mer, 9-mer, or 10-mer peptides based on a predicted half-time of dissociation to HLA class I molecules. The analysis is based on coefficient tables deduced from the published literature by Dr. Kenneth Parker, Children's Hospital Boston. | [ |
| 16. NetMHC 3.0 | Predicts binding of peptides to a number of different HLA alleles using artificial neural networks (ANNs) and weight matrices. | [ |
| 17. AlgPred | Predicts allergens in a query protein by several approaches: similarity to known epitopes; a search for MEME/MAST allergen motifs using MAST, assigning a protein as an allergen if it has any motif; SVM modules; and a BLAST search against 2890 allergen-representative peptides obtained from Bjorklund et al. 2005, assigning a protein as an allergen if it has a BLAST hit. | [ |
| 18. Allermatch | Predicts the potential allergenicity of proteins by bioinformatics approaches as recommended by the Codex Alimentarius and the FAO/WHO Expert Consultation on allergenicity of foods derived through modern biotechnology. | [ |
Figure 3. FungalRV adhesin predictor Web site. Users can paste or upload sequences in FASTA format for prediction of human pathogenic fungal adhesins and adhesin-like proteins.
Figure 4. FungalRV Immunoinformatics Web site. Users can query the FungalRV Immunoinformatics database for data useful from a reverse vaccinology point of view, corresponding to the 307 predicted adhesins and adhesin-like proteins and known vaccine candidates.
Figure 5. Number of sequence pairs in the shown ClustalW score (percent identity) ranges. This graph was plotted for the 307 predicted fungal adhesin and adhesin-like protein sequences from the selected eight human pathogenic fungal species. These data include sequences from the training set.
Figure 6. Overall FungalRV layout. The proteomes of the eight human pathogenic fungal species listed in the diagram were run through the "FungalRV adhesin predictor", yielding a list of 307 fungal adhesins and adhesin-like proteins. The diagram provides a layout of the analysis of the predicted proteins. All data are organized in relation to the primary key ORF ID. The analysis data obtained were arranged into the FungalRV Database, allowing users to query and export results in tab-delimited text format.
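The query-and-export step in this layout can be sketched as follows. This is an illustration only: the FungalRV schema is not given here, so the table name `adhesins`, its columns, and the two example rows are hypothetical, and an in-memory SQLite database stands in for the MySQL backend. The only details taken from the source are the ORF ID primary key and the tab-delimited output.

```python
import csv
import io
import sqlite3


def export_tab_delimited(connection, query, params=()):
    """Run a query and return the result as tab-delimited text with a header."""
    cursor = connection.execute(query, params)
    header = [column[0] for column in cursor.description]
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cursor.fetchall())
    return buffer.getvalue()


# Hypothetical schema and rows, keyed by ORF ID as in the layout above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE adhesins (orf_id TEXT PRIMARY KEY, species TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO adhesins VALUES (?, ?, ?)",
    [("ORF_0001", "species A", 1.25), ("ORF_0002", "species B", 0.75)],
)

result = export_tab_delimited(
    conn,
    "SELECT orf_id, score FROM adhesins WHERE score >= ? ORDER BY orf_id",
    (1.0,),
)
```

Keying every analysis table on the same ORF ID would let results from different algorithms be joined per protein before export.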