Meghana Kshirsagar, Jaime Carbonell, Judith Klein-Seetharaman.
Abstract
MOTIVATION: An important aspect of infectious disease research involves understanding the differences and commonalities in the infection mechanisms underlying various diseases. Systems biology-based approaches study infectious diseases by analyzing the interactions between the host species and the pathogen organisms. This work aims to combine the knowledge from experimental studies of host-pathogen interactions in several diseases to build stronger predictive models. Our approach is based on a formalism from machine learning called 'multitask learning', which considers the problem of building models across tasks that are related to each other. A 'task' in our scenario is the set of host-pathogen protein interactions involved in one disease. To integrate interactions from several tasks (i.e. diseases), our method exploits the similarity in the infection process across the diseases. In particular, we use the biological hypothesis that similar pathogens target the same critical biological processes in the host, in defining a common structure across the tasks.
Year: 2013 PMID: 23812987 PMCID: PMC3694681 DOI: 10.1093/bioinformatics/btt245
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (A) Host–pathogen PPI prediction where the host is human and the pathogens are bacteria. (B) An example depicting the commonality in the bacterial attack of human proteins. Pathway-1 and pathway-3 (highlighted) represent critical processes targeted by all bacterial species
Fig. 2. A schematic illustrating the pathway summarizing function S for a task. On the left are the examples from the input that are predicted to be positive. The matrix P holds the pathway vectors for each of these examples. The summary function aggregates the pathway vectors to get the pathway distribution
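The aggregation described in the Fig. 2 caption can be illustrated with a minimal sketch. It assumes that each pathway vector is a 0/1 membership vector and that the summary is obtained by summing the vectors of the predicted-positive examples and normalizing the result to sum to 1; both the normalization and the name `pathway_summary` are illustrative assumptions, not details taken from the paper.

```python
def pathway_summary(pathway_vectors, predicted_positive):
    """Aggregate pathway vectors of predicted-positive examples.

    pathway_vectors: list of equal-length 0/1 lists, one per example
                     (the rows of the matrix P in the caption).
    predicted_positive: list of booleans, True where the example was
                        predicted to be an interaction.
    Returns a list that sums to 1 (assumed normalization), or all
    zeros if no pathway is hit by any predicted-positive example.
    """
    n_pathways = len(pathway_vectors[0]) if pathway_vectors else 0
    counts = [0.0] * n_pathways
    for vector, is_positive in zip(pathway_vectors, predicted_positive):
        if is_positive:
            for j, member in enumerate(vector):
                counts[j] += member
    total = sum(counts)
    if total == 0:
        return counts
    return [c / total for c in counts]


# Toy data: four examples, three pathways (hypothetical values).
P = [[1, 0, 1],
     [0, 1, 1],
     [1, 1, 0],
     [0, 0, 1]]
predicted = [True, False, True, False]
summary = pathway_summary(P, predicted)
```

In this toy case only the first and third examples contribute, so the first pathway receives half of the total mass.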
Fig. 3. The exponential function for different values of C
Characteristics of all four interaction datasets used
| | B.anthracis | F.tularensis | Y.pestis | S.typhi |
|---|---|---|---|---|
| Total no. of bacterial proteins (‘reviewed’ protein set from UniprotKB) | 2321 | 1086 | 4600 | 3592 |
| Total no. of human–bacteria protein pairs | 59.4 M | 27.8 M | 117.7 M | 87.7 M |
| No. of known interactions | 3073 | 1383 | 4059 | 62 |
| No. of interactions with no missing features | 655 | 491 | 839 | 62 |
| Size of training data with 1:100 class ratio | 66 155 | 49 591 | 84 739 | 6262 |
| No. of unique features in the training data | 694 715 | 468 955 | 886 480 | 349 155 |
Note: Total no. of human proteins: 25 596; M, million. For each host–pathogen PPI dataset, the number of pathogen proteins, the size of the dataset and other such statistics are shown.
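The training-set sizes in the table follow from the 1:100 class ratio: each interaction with no missing features is paired with 100 negatives. The short check below reproduces the reported sizes. It assumes the four columns follow the B.anthracis, F.tularensis, Y.pestis, S.typhi order used elsewhere in the paper; the dictionary keys and the helper name `training_size` are illustrative.

```python
NEGATIVES_PER_POSITIVE = 100  # the 1:100 positive:negative class ratio

# Interactions with no missing features, per dataset (from the table).
# Column-to-species assignment is an assumption based on the order
# used in the rest of the paper.
positives = {
    "B.anthracis": 655,
    "F.tularensis": 491,
    "Y.pestis": 839,
    "S.typhi": 62,
}

# Training-set sizes as reported in the table.
reported_sizes = {
    "B.anthracis": 66155,
    "F.tularensis": 49591,
    "Y.pestis": 84739,
    "S.typhi": 6262,
}


def training_size(n_positive, negatives_per_positive=NEGATIVES_PER_POSITIVE):
    """Positives plus the negatives sampled at the given ratio."""
    return n_positive * (1 + negatives_per_positive)


computed_sizes = {name: training_size(n) for name, n in positives.items()}
```

For example, 655 positives plus 65 500 negatives gives the 66 155 examples listed in the first column.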
Fig. 4. Heatmap showing pathways enriched in each bacterial–human PPI dataset. The horizontal axis represents the pathways (about 2100 of them) and the vertical axis represents the four datasets. Each entry in the heatmap represents the P-value of a pathway w.r.t. one dataset. Darker values represent more enrichment. The black columns that span all four rows show the commonly enriched pathways
Conserved interactions in the form of interologs across the various host–bacterial datasets
| Human–bacteria PPI datasets compared | H-B versus H-F | H-B versus H-Y | H-B versus H-S | H-F versus H-Y | H-F versus H-S | H-Y versus H-S |
|---|---|---|---|---|---|---|
| Number of Interologs | 2 | 3 | 0 | 3 | 0 | 0 |
Note: H-X stands for human–pathogen, where the pathogen ‘X’ can be B, F, Y or S, referring to B.anthracis, F.tularensis, Y.pestis and S.typhi, respectively. The non-zero entry ‘2’ for ‘H-B versus H-F’ means there are two PPIs in the H-B dataset that have interologs in the H-F dataset.
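The counting described in the note can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes that a PPI in one dataset has an interolog in another when the same human protein interacts there with a homolog of the bacterial protein. The homolog map, the protein identifiers and the function name `count_interologs` are all hypothetical.

```python
def count_interologs(ppis_a, ppis_b, homologs_a_to_b):
    """Count PPIs in dataset A that have an interolog in dataset B.

    ppis_a, ppis_b: iterables of (human_protein, bacterial_protein).
    homologs_a_to_b: dict mapping a bacterial protein of species A to
                     the set of its homologs in species B (assumed to
                     be precomputed, e.g. by sequence similarity).
    """
    ppis_b = set(ppis_b)
    count = 0
    for human, bacterial in ppis_a:
        candidates = homologs_a_to_b.get(bacterial, ())
        if any((human, homolog) in ppis_b for homolog in candidates):
            count += 1
    return count


# Toy data with made-up identifiers.
h_b = [("HUM1", "BA1"), ("HUM2", "BA2"), ("HUM3", "BA3")]
h_f = [("HUM1", "FT1"), ("HUM2", "FT9")]
homologs = {"BA1": {"FT1"}, "BA2": {"FT2"}}

n_interologs = count_interologs(h_b, h_f, homologs)
```

Only the first toy PPI has a counterpart through a homolog here; the count is directional, matching the note's reading of "PPIs in the H-B dataset that have interologs in the H-F dataset".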
Averaged 10-fold CV performance for all methods for a positive:negative class ratio of 1:100
| Method | B.anthracis | F.tularensis | Y.pestis | S.typhi |
|---|---|---|---|---|
| Indep. | 27.8 ± 4 | 25.7 ± 5.4 | 28.8 ± 4 | 72.5 ± 11.4 |
| Coupled | 27 ± 3.9 | 25.5 ± 5 | 27.9 ± 3.4 | 69.8 ± 12.4 |
| Indep. Path. | 26.5 ± 4.7 | 26.1 ± 6.9 | 26.7 ± 4.3 | 69.1 ± 12.7 |
| Mean MTL | 25.2 ± 4.9 | 26.7 ± 4 | 27.5 ± 6.3 | 69.4 ± 12.1 |
| MTPL | | | | |
Note: Accuracy is reported as the F1 measure computed on the positive class. The standard deviation over the 10 folds is also reported. Bold values indicate the highest F1 value for each column (i.e. for that PPI dataset).
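As a reminder of what the table measures, the sketch below computes F1 on the positive class as the harmonic mean of precision and recall. The function name and the toy labels are illustrative; the table reports the same quantity as a percentage averaged over folds.

```python
def f1_positive(y_true, y_pred):
    """F1 on the positive class (label 1).

    Returns 0.0 when there are no true positives, which also avoids
    division by zero for precision or recall.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: 3 positives among 8 examples (hypothetical values).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_positive(y_true, y_pred)
```

With a 1:100 class ratio, plain accuracy would be dominated by the negatives, which is why a positive-class measure is the more informative choice here.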
P-values from pairwise t-tests of statistical significance
| B.anthracis | F.tularensis | Y.pestis | S.typhi |
|---|---|---|---|
| 4.1e-04 | 9.1e-04 | 2.2e-07 | 0.1 |
Note: We compare MTPL with the best baseline, ‘Indep.’, using results from 50 bootstrap sampling experiments. The null hypothesis is that there is no significant difference between the performance of MTPL and Indep. (MTPL = Indep.). a: Alt. hypothesis: MTPL > Indep. b: Alt. hypothesis: MTPL < Indep.
Pairwise model performance of MTPL
| Pairwise tasks | F1 | |
|---|---|---|
| Task-1, Task-2 | Task-1 | Task-2 |
| 31.4 | 30.1 | |
| 31.6 | 32 | |
| 73 | ||
| 30 | 32.1 | |
| 74.2 | ||
Note: F1 computed during 10-fold CV of various pairwise models from MTPL. The positive:negative class ratio was 1:100. The best F1 achieved for each task (i.e. for each bacterial species) is shown in bold. For example, B.anthracis has the best performance of 32 when it is coupled with S.typhi.
Fig. 5. The intersection of enriched human pathways from predicted interactions. The total numbers of enriched pathways for each bacterial species are B.anthracis: 250, F.tularensis: 164, Y.pestis: 400 and S.typhi: 40. The size of the intersection between all tasks’ enriched pathways is 17. The size of this intersection for the high-throughput datasets (excluding S.typhi) is much larger: 104
Five of the 17 commonly enriched pathways in the predicted interactions from MTPL
| Platelet activation, signaling and aggregation | |
| Integrin alpha IIb beta3 signaling | |
| Stabilization & expansion of E-cadherin adherens junction | |
| Post-translational regulation of adherens junction stability & disassembly | |
| Signaling by NGF |