Paolo Boldi, Marco Frasca, Dario Malchiodi.
Abstract
BACKGROUND: Supervised machine learning methods, when applied to the problem of automated protein-function prediction (AFP), require the availability of both positive examples (i.e., proteins known to possess a given protein function) and negative examples (proteins not associated with that function). Unfortunately, publicly available proteome and genome data sources such as the Gene Ontology rarely record the functions a protein does not possess. Negative selection, which consists in identifying informative negative examples, is therefore a central and challenging problem in AFP. Several heuristics have been proposed over the years to solve this problem; nevertheless, despite their effectiveness, to the best of our knowledge no previous work has studied which protein features are most relevant to this task, that is, which protein features help most in discriminating reliable from unreliable negatives.
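The setup described in the abstract can be sketched as follows. This is an illustrative sketch, not the paper's code: `split_examples` and the toy protein IDs are hypothetical, and the trivial heuristic passed in stands in for any real negative-selection strategy.

```python
# Hypothetical sketch: for a given GO term, annotated proteins are positives;
# all other proteins are only *candidate* negatives, from which a
# negative-selection heuristic picks an informative, reliable subset.

def split_examples(proteins, annotated, select_negatives):
    """Partition proteins into positives and heuristically chosen negatives."""
    positives = [p for p in proteins if p in annotated]
    candidates = [p for p in proteins if p not in annotated]
    negatives = select_negatives(candidates)  # any negative-selection heuristic
    return positives, negatives

# Toy usage: the naive baseline treats every unannotated protein as negative.
proteins = ["P1", "P2", "P3", "P4"]
annotated = {"P2", "P4"}  # proteins known to possess the GO term
pos, neg = split_examples(proteins, annotated, lambda c: list(c))
print(pos, neg)  # → ['P2', 'P4'] ['P1', 'P3']
```

The open question the paper addresses is which protein features make `select_negatives` discriminate reliable from unreliable candidates.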
Keywords: Biological networks; Negative example selection; Protein features; Protein function prediction
Year: 2018 PMID: 30453879 PMCID: PMC6245585 DOI: 10.1186/s12859-018-2385-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Description of data networks
| Organism | Nodes | Average degree | Components | Component size | Diameter | Weighted diameter |
|---|---|---|---|---|---|---|
| Yeast | 5586 | 38.4740 | 41 | 5483, 2–7 | 12 | 3.0481 |
| Mouse | 13921 | 59.9990 | 190 | 13417, 2–10 | 13 | 3.1857 |
| Human | 15154 | 47.5572 | 89 | 14951, 2–10 | 11 | 3.1552 |
Column Components gives the number of connected components in the network, while Component size gives the size of the largest component followed by the size range of the remaining components. Diameter is the length (in edges) of the longest shortest path between two nodes, ignoring edge weights; Weighted diameter takes edge weights into account
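The component and (unweighted) diameter statistics in the table can be computed with plain breadth-first search. A minimal sketch on a toy adjacency dict (the graph data here is illustrative, not the paper's networks):

```python
# Compute connected components and the unweighted diameter of an
# undirected graph given as {node: [neighbors]}.
from collections import deque

def bfs_distances(adj, src):
    """Hop distances from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def components(adj):
    """List of connected components, each a set of nodes."""
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp = set(bfs_distances(adj, s))
            seen |= comp
            comps.append(comp)
    return comps

def diameter(adj):
    """Longest shortest path (in edges) within any component."""
    return max(max(bfs_distances(adj, s).values()) for s in adj)

# Toy network: a 4-node path plus an isolated pair → 2 components.
adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [6], 6: [5]}
comps = components(adj)
print(len(comps), sorted(len(c) for c in comps), diameter(adj))
# → 2 [2, 4] 3
```

The weighted diameter would instead require weighted shortest paths (e.g. Dijkstra) over the network's weight matrix.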
Number of GO terms in the three GO branches for which |C| ≥ 20, where |C| denotes the cardinality of C (i.e., the number of negative proteins that became positive during the considered temporal period)
| Organism | CC | MF | BP |
|---|---|---|---|
| Yeast | 5 | 9 | 29 |
| Mouse | 62 | 75 | 512 |
| Human | 71 | 105 | 363 |
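The set C underlying the table above can be computed directly from two GO releases. A hedged sketch (function name and protein IDs are illustrative):

```python
# For one GO term, C is the set of proteins that were negative
# (unannotated) in the older GO release and positive in the newer one.

def temporal_gains(old_positives, new_positives):
    """Proteins that moved from negative to positive between releases."""
    return set(new_positives) - set(old_positives)

# Toy example with illustrative protein IDs.
old = {"P1", "P2"}
new = {"P1", "P2", "P3", "P7"}
C = temporal_gains(old, new)
print(sorted(C), len(C))  # → ['P3', 'P7'] 2
```

The table counts, per organism and GO branch, the terms for which this |C| reaches at least 20.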
Number of GO terms with 20–200 annotated proteins in the more recent GO release
| Organism | CC | MF | BP |
|---|---|---|---|
| Yeast | 9 | 18 | 41 |
| Mouse | 18 | 32 | 178 |
| Human | 41 | 64 | 153 |
The considered features for node i∈V and GO term k (formal definitions are given in the paper):

- Neighborhood mean
- Neighborhood variance
- Weighted degree
- Weighted clustering coefficient
- Number of annotations
- Closeness centrality
- Lin's index
- Harmonic centrality
- Betweenness
- Positive neighborhood
- Mean of positive neighborhood
- Positive closeness centrality
- Positive Lin's index
- Positive harmonic centrality
- 1-step random walk
- 2-step random walk
- 3-step random walk
C denotes the connected component containing i (together with its subset of positive nodes), d the shortest-path distance from s to t (computed using the network weight matrix), σ the number of shortest paths from s to t, and σ(u) the number of such paths that include u as an internal node. The remaining quantities are defined in Eqs. (1) and (2)
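To make the feature list concrete, here is an illustrative sketch of three of them on a weighted network stored as `{node: {neighbor: weight}}`. The paper's exact definitions (in particular its weighted distance d) are in the original text, so this approximation uses hop distances and is an assumption, not the authors' implementation:

```python
# Illustrative feature computations on a toy weighted adjacency dict.
from collections import deque

def weighted_degree(adj, i):
    """Sum of edge weights incident to node i."""
    return sum(adj[i].values())

def hop_distances(adj, src):
    """Unweighted BFS distances (an approximation of the paper's d)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def harmonic_centrality(adj, i):
    """Sum of reciprocal distances from i to every other reachable node."""
    d = hop_distances(adj, i)
    return sum(1.0 / l for n, l in d.items() if n != i)

def positive_neighborhood(adj, i, positives):
    """Total weight of edges from i to positively annotated neighbors."""
    return sum(w for n, w in adj[i].items() if n in positives)

adj = {1: {2: 0.5, 3: 1.0}, 2: {1: 0.5}, 3: {1: 1.0, 4: 2.0}, 4: {3: 2.0}}
print(weighted_degree(adj, 1))                # → 1.5
print(round(harmonic_centrality(adj, 1), 2))  # hops: 2→1, 3→1, 4→2 ⇒ 2.5
print(positive_neighborhood(adj, 1, {3}))     # → 1.0
```

The "positive" variants of the centrality features restrict the computation to positively annotated nodes in the same way `positive_neighborhood` restricts the degree.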
Fig. 1 Proportion of times features are selected by the SFFS algorithm on yeast (first two rows) and human (last two rows) data. Grey and black bars refer to term-unaware and term-aware protein features, respectively. The black horizontal dashed line corresponds to the mean value of the bars. For each organism, the two rows refer to the use of features f1–f14 and f1–f17, respectively. Panels a, d, g, l correspond to CC terms, b, e, h, m to MF terms, and c, f, i, n to BP terms
Fig. 2 Performance in terms of the F1 measure, averaged across GO branches, for linear SVM (a) and RF (b) classifiers
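The averaged F1 in Fig. 2 can be reproduced, under the assumption of macro-averaging over terms (an assumption, since the paper's exact aggregation is not shown here), as:

```python
# Per-term F1 from true positives / false positives / false negatives,
# then macro-averaged across GO terms (toy counts, illustrative only).

def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when TP = 0."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def macro_f1(per_term_counts):
    """Unweighted mean of per-term F1 scores."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_term_counts]
    return sum(scores) / len(scores)

# Toy (tp, fp, fn) counts for three GO terms.
counts = [(8, 2, 2), (5, 5, 0), (0, 3, 4)]
print(round(macro_f1(counts), 4))  # → 0.4889
```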
Fig. 3 Average F1 across GO branch terms on yeast data when the corresponding feature is removed
Fig. 4 Number of false negatives averaged across GO terms. Results in the first two rows are obtained on yeast data, whereas the last two rows refer to human data. The first (resp. second) and third (resp. fourth) rows show the results of the SVM (resp. RF) selection algorithm. Panels a, d, g correspond to CC terms, b, e, h to MF terms, and c, f, i to BP terms