Marco Frasca, Giuliano Grossi, Jessica Gliozzo, Marco Mesiti, Marco Notaro, Paolo Perlasca, Alessandro Petrini, Giorgio Valentini.
Abstract
BACKGROUND: Several problems in network biology and medicine can be cast into a framework where entities are represented through partially labeled networks, and the aim is to infer the labels (usually binary) of the unlabeled part. Connections represent functional or genetic similarity between entities, while the labelings are often highly unbalanced, that is, one class is largely under-represented. For instance, in automated protein function prediction (AFP), only a few proteins are annotated for most Gene Ontology terms, and in the disease-gene prioritization problem only a few genes are actually known to be involved in the etiology of a given disease. Imbalance-aware approaches that accurately predict node labels in biological networks are therefore required. Furthermore, such methods must be scalable, since the input data can be large, as, for instance, in the context of multi-species protein networks.
Keywords: Biological networks; GPU-based Hopfield nets; Large-sized networks; Node label prediction; Protein function prediction
Year: 2018 PMID: 30367594 PMCID: PMC6191976 DOI: 10.1186/s12859-018-2301-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 AFP problem as a set of binary classification problems. The aim is to determine the color/label of unlabeled nodes/proteins, given the graph topology and the labels of the known part of the graph.
Characteristics of protein networks
| Organism | Nodes | Average degree | Components | Largest component size | Diameter | Weighted diameter |
|---|---|---|---|---|---|---|
| Yeast | 6391 | 314.0563 | 1 | 6391 | 6 | 1.0925 |
| Mouse | 21151 | 596.3804 | 21 | 21105 | 9 | 1.8362 |
| Human | 19576 | 579.9477 | 2 | 19574 | 6 | 1.0302 |
Column Components denotes the number of connected components in the network, whereas Largest component size is the number of nodes in the largest connected component. Diameter is the number of edges on the longest shortest path between two nodes, without considering edge weights.
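The unweighted quantities in the table above can be illustrated with a minimal sketch. It assumes an undirected graph stored as an adjacency dictionary and measures the diameter on the largest connected component; both choices, the toy graph, and all function names are assumptions made for illustration, not details taken from the paper.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def connected_components(adj):
    """List of node sets, one per connected component."""
    seen = set()
    components = []
    for node in adj:
        if node not in seen:
            comp = set(bfs_distances(adj, node))
            seen |= comp
            components.append(comp)
    return components


def unweighted_diameter(adj, component):
    """Longest shortest path (in edges) inside one connected component."""
    return max(max(bfs_distances(adj, s).values()) for s in component)


# Hypothetical toy graph: a path 0-1-2-3 plus a separate edge 4-5.
toy = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
comps = connected_components(toy)
largest = max(comps, key=len)
summary = {
    "components": len(comps),
    "largest_component_size": len(largest),
    "diameter": unweighted_diameter(toy, largest),
}
print(summary)
```

Running one BFS per node is quadratic in the worst case, which is fine for a toy graph but would need a faster strategy on networks with tens of thousands of nodes.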
Number of GO terms with at least 5 annotations
| Organism | CC | MF | BP |
|---|---|---|---|
| Yeast | 50 | 74 | 191 |
| Mouse | 85 | 112 | 733 |
| Human | 115 | 193 | 580 |
Fig. 2 CPU/GPU schema of the PARCOSNET parallelization. Multiple CPU threads are launched in parallel, each one solving the AFP problem for a given class/protein function. The GPU thread blocks, each composed of several CUDA threads, first solve the coloring problem for the graph and then concurrently process all neurons of a given color, for all colors in sequence. A further fine-grained level of parallelism is finally introduced by assigning to each neuron a thread block to perform the neuron-level local computations.
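The color-by-color scheme in the caption can be sketched in plain Python. This is not PARCOSNET's implementation: the greedy coloring, the ±1 activation values, the zero threshold, and all names are assumptions chosen for illustration, and the inner loop only stands in for what CUDA threads would do concurrently.

```python
def greedy_coloring(adj):
    """Give each node the smallest color not used by its colored neighbors."""
    color = {}
    for node in sorted(adj):
        used = {color[nb] for nb in adj[node] if nb in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color


def colorwise_update(weights, state, color, threshold=0.0):
    """One sweep over the network: colors in sequence, one color at a time.

    Neurons sharing a color are never adjacent, so their new values depend
    only on neurons of other colors. Computing them from a snapshot is
    therefore equivalent to updating them one by one.
    """
    new_state = dict(state)
    for c in sorted(set(color.values())):
        group = [n for n in color if color[n] == c]
        snapshot = dict(new_state)
        for n in group:  # on a GPU this loop would be the parallel part
            field = sum(w * snapshot[nb] for nb, w in weights[n].items())
            new_state[n] = 1 if field - threshold > 0 else -1
    return new_state


# Hypothetical toy network: triangle 0-1-2 with node 3 attached to node 2.
weights = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0},
}
adj = {n: list(nbrs) for n, nbrs in weights.items()}
color = greedy_coloring(adj)
state = colorwise_update(weights, {0: 1, 1: 1, 2: 1, 3: -1}, color)
print(color, state)
```

The point of the coloring is that no two neurons updated together share an edge, so a simultaneous update within a color cannot read a value that is being written in the same step.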
Average COSNET performance in predicting GO protein functions
| Organism | Precision | Recall | F | AUPRC |
|---|---|---|---|---|
| Yeast - CC | 0.3467 | 0.5139 | 0.4023 | 0.363 |
| Yeast - MF | 0.3317 | 0.4609 | 0.3815 | 0.3237 |
| Yeast - BP | 0.4042 | 0.5231 | 0.4486 | 0.4079 |
| Mouse - CC | 0.2068 | 0.2383 | 0.2198 | 0.1517 |
| Mouse - MF | 0.1861 | 0.2377 | 0.2068 | 0.1411 |
| Mouse - BP | 0.1473 | 0.1725 | 0.1582 | 0.0935 |
| Human - CC | 0.2238 | 0.2765 | 0.2448 | 0.1709 |
| Human - MF | 0.1847 | 0.2302 | 0.2032 | 0.1356 |
| Human - BP | 0.1485 | 0.1793 | 0.1611 | 0.0972 |
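As a reading aid for the columns above, the following minimal sketch computes precision, recall, and F for binary labels, assuming F denotes the usual F-measure (the harmonic mean of precision and recall). The two toy GO terms and all names are hypothetical. It also shows that an F value averaged over terms generally differs from the harmonic mean of the averaged precision and recall, which would be consistent with the table rows not satisfying that identity.

```python
def precision_recall_f(y_true, y_pred):
    """Precision, recall and F-measure for the positive (label 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Two hypothetical, unbalanced terms: (true labels, predicted labels).
terms = [
    ([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]),
    ([1, 0, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0]),
]
scores = [precision_recall_f(t, p) for t, p in terms]

# Average each metric over terms, as a per-term average table would.
mean_p, mean_r, mean_f = (sum(col) / len(scores) for col in zip(*scores))

# Harmonic mean of the averaged precision and recall, for comparison.
f_of_means = 2 * mean_p * mean_r / (mean_p + mean_r)
print(round(mean_f, 4), round(f_of_means, 4))
```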
Average CPU time in seconds for COSNET and PARCOSNET to perform a CV cycle on one GO term
| Method | Yeast | Mouse | Human |
|---|---|---|---|
| COSNET | 8.86 | 107.57 | 84.03 |
| PARCOSNET | 0.28 | 1.71 | 1.49 |
| | 0.09 | 0.54 | 0.48 |
| | 0.07 | 0.33 | 0.31 |
| | 0.06 | 0.28 | 0.26 |
Average CPU occupancy in percentage for PARCOSNET to perform a CV cycle on every GO term. Optimal scalability is achieved when the occupancy reaches 100%×n, with n the number of CPU threads
| Method | Yeast | Mouse | Human |
|---|---|---|---|
| PARCOSNET | 99% | 100% | 100% |
| | 345% | 364% | 368% |
| | 613% | 659% | 692% |
| | 815% | 941% | 975% |
Average memory usage in gigabytes (GB) for COSNET and PARCOSNET when running cross-validation to predict GO terms
| Method | Yeast | Mouse | Human |
|---|---|---|---|
| COSNET | 0.40 | 3.73 | 3.26 |
| PARCOSNET | 0.27 | 1.13 | 0.94 |
Average CPU time in seconds and maximum memory consumption in GB for PARCOSNET to perform a single CV cycle on one class over the synthetic datasets
| Density | Measure | 500k nodes | 1000k nodes | 1500k nodes |
|---|---|---|---|---|
| 50 | Time (s) | 45.3 | 137 | 330 |
| 50 | Memory (GB) | 1.94 | 3.86 | 5.81 |
| 100 | Time (s) | 54.3 | 166 | 360 |
| 100 | Memory (GB) | 3.69 | 7.37 | 11.1 |
| 300 | Time (s) | 92.9 | 248 | 609 |
| 300 | Memory (GB) | 10.7 | 21.4 | 32.1 |
Average speed-up $T_{\mathrm{COSNET}}/T_{\mathrm{PARCOSNET}}$, where $T_{\mathrm{COSNET}}$ and $T_{\mathrm{PARCOSNET}}$ are the average execution times of COSNET and PARCOSNET to perform an entire cross-validation on one GO term
| Method | Yeast | Mouse | Human |
|---|---|---|---|
| PARCOSNET | 31.64x | 62.90x | 56.39x |
| | 98.44x | 199.20x | 175.06x |
| | 126.57x | 325.96x | 271.06x |
| | 147.66x | 384.18x | 323.19x |
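The speed-up figures can be cross-checked against the CPU-time table with a short sketch. It assumes that the first row of that table is the COSNET time and that the four remaining rows are PARCOSNET configurations in the same order as the speed-up rows; the variable names are illustrative.

```python
# Average CPU times in seconds, copied from the CPU-time table.
cosnet_time = {"Yeast": 8.86, "Mouse": 107.57, "Human": 84.03}
parcosnet_times = [
    {"Yeast": 0.28, "Mouse": 1.71, "Human": 1.49},
    {"Yeast": 0.09, "Mouse": 0.54, "Human": 0.48},
    {"Yeast": 0.07, "Mouse": 0.33, "Human": 0.31},
    {"Yeast": 0.06, "Mouse": 0.28, "Human": 0.26},
]
# Speed-ups as reported in the speed-up table, row by row.
reported = [
    {"Yeast": 31.64, "Mouse": 62.90, "Human": 56.39},
    {"Yeast": 98.44, "Mouse": 199.20, "Human": 175.06},
    {"Yeast": 126.57, "Mouse": 325.96, "Human": 271.06},
    {"Yeast": 147.66, "Mouse": 384.18, "Human": 323.19},
]


def speed_up(t_sequential, t_parallel):
    """Ratio of sequential to parallel execution time."""
    return t_sequential / t_parallel


# Largest gap between a recomputed ratio and the reported value.
max_deviation = max(
    abs(speed_up(cosnet_time[org], row[org]) - rep[org])
    for row, rep in zip(parcosnet_times, reported)
    for org in cosnet_time
)
print(f"largest deviation from the reported speed-ups: {max_deviation:.4f}")
```

Under these assumptions every recomputed ratio lands within about 0.01 of the reported value, which supports reading the first CPU-time row as COSNET and the rest as PARCOSNET.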