Lyuba V Bozhilova, Alan V Whitmore, Jonny Wray, Gesine Reinert, Charlotte M Deane.
Abstract
BACKGROUND: Protein interaction databases often provide confidence scores for each recorded interaction based on the available experimental evidence. Protein interaction networks (PINs) are then built by thresholding on these scores, so that only interactions of sufficiently high quality are included. These networks are used to identify biologically relevant motifs or nodes using metrics such as degree or betweenness centrality. This type of analysis can be sensitive to the choice of threshold. If a node metric is to be useful for extracting biological signal, it should induce similar node rankings across PINs obtained at different reasonable confidence score thresholds.
Keywords: Confidence scores; Protein interaction networks; Protein-protein interactions; Ranking; Robustness
Year: 2019 PMID: 31462221 PMCID: PMC6714100 DOI: 10.1186/s12859-019-3036-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Thresholding effects in STRING networks. Average degree (a) and average local clustering coefficient (b) as functions of the threshold in the three STRING networks. The dotted vertical lines correspond to the four default STRING threshold values
Fig. 2 Metric rank similarity between consecutive thresholds. (a) In the four PINs, metrics were either consistently stable (e.g. degree and LOUD natural connectivity), consistently unstable (e.g. local clustering coefficient), or showed decreasing stability (e.g. betweenness). (b) The synthetic network based on a randomly rescored subset of the PVX network, SYN-PVX, and the network based on a Bernoulli random graph, SYN-GNP, exhibited different behaviour, with metrics showing the least similarity across thresholds in the SYN-GNP network
Fig. 3 Relaxed similarity between overall and threshold ranks in the scored PINs. (a) The overall ranks have been calculated over the medium-high confidence regions: 0.60 to 0.90 for the three STRING networks (black dotted lines) and 0.15 to 0.28 for the HitPredict network (pink dotted lines). (b) The STRING medium-high confidence interval was also used for the synthetic networks. The SYN-GNP network, where both structure and score allocation are uniform, exhibits lower relaxed similarity. The SYN-PVX network has inherently heterogeneous network structure, on which scores are assigned randomly. Score thresholding introduces the same rate of change in different parts of the network, so that the relative node degree, for example, may remain largely unchanged across a series of thresholds. In contrast, the heterogeneous score allocation in PINs makes rank reorderings more likely, and identifiability may be expected to be lower
Fig. 4 Rank instability of metrics in the scored networks. (a) Rank instability in the four PINs. The dotted lines correspond to 1%. Instability measures in the HPRED network were generally narrower. (b) Rank instability in the synthetic networks. Instability measures in PINs were generally lower and have been plotted for comparison. Note the different scales between plots in (a) and in (b)
Fig. 5 Confidence score distributions in each of the four studied PINs. Bin width in all four cases has been set to 0.01. Scores from the HitPredict network (bottom right) follow a different distribution and cannot necessarily be interpreted in the same way as STRING scores
Summary statistics for the six analysed networks
| Name | Network | Number of nodes | Number of edges | Edge density |
|---|---|---|---|---|
| PVX | STRING | 3255 | 344691 | ∼ 0.065 |
| ECOLI | STRING | 4144 | 583440 | ∼ 0.068 |
| YEAST | STRING | 6418 | 939998 | ∼ 0.046 |
| HPRED | HitPredict | 5673 | 113001 | ∼ 0.007 |
| SYN-GNP | Synthetic, Bernoulli | 500 | 7459 | ∼ 0.060 |
| SYN-PVX | Synthetic, randomised | 1000 | 30516 | ∼ 0.061 |
The left-most column gives the names by which the networks are referred to later in the text. The number of edges and the edge density refer to all scored edges, before any threshold is applied to the network
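The edge densities in the table can be checked from the node and edge counts. The sketch below assumes each network is a simple undirected graph, so that density is the number of edges divided by the number of possible node pairs, n(n − 1)/2; the source does not state the formula, and the function and variable names are illustrative only.

```python
def edge_density(n_nodes: int, n_edges: int) -> float:
    """Density of a simple undirected graph: edges present / possible node pairs."""
    possible_pairs = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_pairs


# (number of nodes, number of edges) copied from the summary table
networks = {
    "PVX": (3255, 344691),
    "ECOLI": (4144, 583440),
    "YEAST": (6418, 939998),
    "HPRED": (5673, 113001),
    "SYN-GNP": (500, 7459),
    "SYN-PVX": (1000, 30516),
}

densities = {name: edge_density(n, m) for name, (n, m) in networks.items()}
for name, density in densities.items():
    print(f"{name}: {density:.3f}")
```

Under this assumption the computed values agree with the tabulated densities to three decimal places.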
Fig. 6 Thresholding scored networks. A scored network, with edge widths corresponding to confidence scores (left). At a low threshold, only the lowest scoring edge CD is removed (middle). At a higher threshold, only the highest scoring edges AB and BC remain in the network (right). Edge scores are otherwise ignored in the thresholded networks
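The thresholding step in this caption can be sketched in a few lines. The edge list and scores below are hypothetical (the figure's actual values are not given here), and keeping edges with a score greater than or equal to the threshold is an assumption about the convention.

```python
def threshold_network(scored_edges, threshold):
    """Keep edges scoring at least `threshold`; scores are then discarded."""
    return {edge for edge, score in scored_edges.items() if score >= threshold}


def degrees(edges):
    """Node degree in an unweighted, undirected edge set."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg


# Hypothetical confidence scores, chosen so that CD is the weakest edge
# and AB, BC are the strongest, as in the caption.
scored_edges = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("A", "C"): 0.5,
    ("A", "D"): 0.4,
    ("C", "D"): 0.2,
}

low = threshold_network(scored_edges, 0.3)   # only CD is removed
high = threshold_network(scored_edges, 0.7)  # only AB and BC remain
print(sorted(low))
print(sorted(high), degrees(high))
```

Node metrics such as degree are then computed on each thresholded edge set separately, which is why their rankings can change with the threshold.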
The complete set of twenty-five standard and LOUD metrics, calculated at each node v
| Name | Details |
|---|---|
| Degree | Number of neighbours of v |
| Local clustering | Proportion of pairs of neighbours of v that are themselves connected |
| Redundancy | (Local clustering) × (Degree − 1) |
| PageRank | Calculated with the default damping factor |
| Closeness | Reciprocal of the sum, over all other nodes, of their shortest-path distance to v |
| Harmonic centrality | The sum, over all other nodes, of the reciprocal of their shortest-path distance to v |
| Betweenness | Measures how many shortest paths a node v lies on |
| | Number of edges in the step-one ego-network of v |
| | Number of nodes in the step-two ego-network of v |
| | Number of nodes that have exactly distance two to v |
| | A measure of relative local density calculated as |
| | The ratio of step-one to step-two neighbourhood sizes for v |
| LOUD Average local clustering | |
| LOUD Global clustering | |
| LOUD Average redundancy | |
| LOUD Average closeness | |
| LOUD Average path length | |
| LOUD Number of connected pairs | |
| LOUD Average betweenness | |
| LOUD Natural connectivity | |
| LOUD Average | |
| LOUD Average | |
| LOUD Average | |
| LOUD Average | |
| LOUD Average | |
Standard metrics are above the line break. LOUD metrics are below the line break. LOUD metrics are based on global metrics $f$ calculated both for each thresholded network $G$ and for the same network in which, in turn, each node $v$ has been isolated from its neighbours, $G_{-v}$. The difference between the two metrics is recorded as $f_{\mathrm{LOUD}}(v) = f(G) - f(G_{-v})$
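As an illustration of the LOUD construction, the sketch below uses the number of connected pairs as the global metric f and computes f(G) minus f of the graph with v isolated, for every node of a small hypothetical path graph. The adjacency-dictionary representation and all function names are assumptions made for this example, not the authors' implementation.

```python
from collections import deque


def connected_pairs(adj):
    """Global metric f: number of unordered node pairs joined by a path."""
    seen, total = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        total += size * (size - 1) // 2
    return total


def isolate(adj, v):
    """Copy of the graph in which v stays as a node but loses all its edges."""
    return {u: (set() if u == v else nbrs - {v}) for u, nbrs in adj.items()}


def loud(f, adj, v):
    """Difference between f on the full graph and f with v isolated."""
    return f(adj) - f(isolate(adj, v))


# Hypothetical path graph A - B - C - D
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
scores = {v: loud(connected_pairs, adj, v) for v in adj}
print(scores)
```

In this toy graph the inner nodes B and C receive higher scores than the end nodes, because isolating them disconnects more pairs. Any other global metric can be passed as `f` in the same way.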