Marco Mesiti, Matteo Re, Giorgio Valentini.
Abstract
BACKGROUND: Network-based learning algorithms for automated function prediction (AFP) are negatively affected by the limited coverage of experimental data and by the limited number of functional annotations known a priori. As a consequence, their application to model organisms is often restricted to well-characterized biological processes and pathways, and their effectiveness with poorly annotated species is relatively limited. A possible solution is to construct big networks that include multiple species, but this in turn poses challenging computational problems, owing to the scalability limitations of existing algorithms and the main memory requirements induced by the construction of big networks. Distributed computation or the use of large computers could in principle address these issues, but they raise further algorithmic problems and require resources that simple off-the-shelf computers cannot provide.
Keywords: Big data analysis; Biomolecular networks; Network-based learning
Year: 2014 PMID: 24843788 PMCID: PMC4006453 DOI: 10.1186/2047-217X-3-5
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1. A sample net. A graphical representation of a sample Neo4j net.
Figure 2. Implementation of the 1-step algorithm in Cypher. The notation (i)-[e:rtype]->(j) represents a relationship e of type rtype between nodes i and j. Dot notation is used to access a single property of a node or edge.
Figure 3. Efficient disk access with (a) Shards: Int1, …, IntK refer to the K intervals into which the vertices are split, and S1, …, SK to the corresponding shards; (b) Parallel Sliding Windows.
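The interval and shard layout named in the Figure 3 caption can be illustrated with a small in-memory sketch. It assumes the GraphChi-style convention that shard S_k holds the edges whose destination vertex falls in interval Int_k, sorted by source vertex. The function names `split_intervals`, `build_shards` and `sliding_window` are hypothetical, and equal-width intervals are a simplification made for illustration.

```python
from bisect import bisect_left, bisect_right


def split_intervals(num_vertices, k):
    """Split vertex ids 0..num_vertices-1 into k contiguous (start, end) intervals.

    `end` is exclusive. Equal-width intervals are an illustrative assumption.
    """
    bounds = [i * num_vertices // k for i in range(k + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(k)]


def build_shards(edges, intervals):
    """Put each (src, dst, weight) edge in the shard of its destination interval,
    then sort every shard by source vertex (assumed shard convention)."""
    starts = [start for start, _ in intervals]
    shards = [[] for _ in intervals]
    for src, dst, weight in edges:
        shard_index = bisect_right(starts, dst) - 1
        shards[shard_index].append((src, dst, weight))
    for shard in shards:
        shard.sort(key=lambda edge: edge[0])
    return shards


def sliding_window(shard, interval):
    """Return the contiguous block of a source-sorted shard whose sources lie in `interval`."""
    sources = [src for src, _, _ in shard]
    start, end = interval
    return shard[bisect_left(sources, start):bisect_left(sources, end)]


intervals = split_intervals(6, 3)
edges = [(5, 0, 1.0), (1, 3, 0.5), (0, 3, 0.2), (2, 5, 0.9), (3, 1, 0.4)]
shards = build_shards(edges, intervals)
```

Because every shard is sorted by source, the out-edges of one vertex interval form a single contiguous block in each of the other shards, so processing an interval needs one full shard plus one sequential window per remaining shard.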
Figure 4. Pseudocode of the vertex-centric implementation of the 1-step algorithm.
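The pseudocode of Figure 4 is not reproduced in this record. As a rough illustration of what a vertex-centric update looks like, the sketch below assumes that the 1-step score of a vertex is the sum of the weights of the edges linking it to positively labelled neighbours. That scoring rule, the function name `one_step_scores` and the adjacency-dict layout are assumptions made for illustration, not the authors' code.

```python
def one_step_scores(in_edges, positives):
    """Vertex-centric sketch: each vertex is scored from its own edge list only.

    in_edges:  dict mapping vertex -> list of (neighbour, weight) pairs
    positives: set of vertices annotated with the functional class
    Assumed rule: score = sum of weights of edges to positive neighbours.
    """
    scores = {}
    for vertex, neighbours in in_edges.items():
        # The update for `vertex` touches nothing beyond its incident edges,
        # so vertices can be processed one interval at a time.
        scores[vertex] = sum(w for nb, w in neighbours if nb in positives)
    return scores


graph = {
    "a": [("b", 0.5), ("c", 0.2)],
    "b": [("a", 0.5)],
    "c": [("a", 0.2)],
    "d": [],
}
scores = one_step_scores(graph, positives={"b", "c"})
```

Here vertex `a` receives a positive score because both of its neighbours are labelled, while the isolated vertex `d` scores zero.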
CAFA2 bacteria species and their proteins available in SwissProt (May 2013)
| SwissProt organism ID | Genus | Number of proteins |
|---|---|---|
| 83333 | Escherichia | 4431 |
| 224308 | Bacillus | 4188 |
| 99287 | Salmonella | 1771 |
| 208964 | Pseudomonas | 1245 |
| 321314 | Salmonella | 882 |
| 160488 | Pseudomonas | 693 |
| 223283 | Pseudomonas | 675 |
| 85962 | Helicobacter | 581 |
| 170187 | Streptococcus | 502 |
| 243273 | Mycoplasma | 483 |
The first column reports the SwissProt organism identifier, the last one the number of proteins.
Figure 5. Construction of the bacteria net. Data flows from the different sources of information, construction of the data-type specific networks, and network integration.
Public databases exploited for the construction of protein profiles
| Database | Description |
|---|---|
| Pfam | Protein domain |
| Protein superfamilies | Structural and functional annotations |
| PRINTS | Motif fingerprints |
| PROSITE | Protein domains and families |
| InterPro | Integrated resource of protein families, domains and functional sites |
| EggNOG | Evolutionary genealogy of genes: Non-supervised Orthologous Groups |
| SMART | Simple Modular Architecture Research Tool (database annotations) |
| SwissProt | Manually curated keywords describing the function of the proteins at different degrees of abstraction |
Summary of the distribution of the number of positives across the 381 GO BP classes involved in the functional labelling of the 17638 proteins included in the bacterial multi-species protein network
| Min. | 1st quartile | Median | Mean | 3rd quartile | Max. |
|---|---|---|---|---|---|
| 21.0 | 31.0 | 53.0 | 135.4 | 131.0 | 2000.0 |
Selected species from the core region of the STRING protein networks database
| NCBI taxonomy ID | Organism | Number of proteins |
|---|---|---|
| 3218 | Physcomitrella | 10352 |
| 3702 | Arabidopsis | 23576 |
| 7227 | Drosophila | 12845 |
| 7739 | Branchiostoma | 16418 |
| 8364 | Xenopus (Silurana) | 13678 |
| 9031 | Gallus | 13119 |
| 9258 | Ornithorhynchus | 13333 |
| 9606 | Homo | 20140 |
| 9615 | Canis lupus | 16912 |
| 10090 | Mus | 20023 |
| 13616 | Monodelphis | 15409 |
| 39947 | Oryza sativa | 13330 |
| 69293 | Gasterosteus | 13307 |
Each species is represented by at least 10000 proteins.
Empirical time complexity of the main and secondary memory-based implementations of network based algorithms for multi-species function prediction with the
| | ||||||
|---|---|---|---|---|---|---|
| 8.11s | 27.92s | 8.84s | – | 208.27s | 12.32s | |
| 16.05s | 54.36s | 16.98s | – | 408.57s | 25.06s | |
| 23.95s | 81.18s | 25.12s | – | 621.92s | 36.51s | |
Average AUC, precision at 20% recall (P20R) and precision at 40% recall (P40R) across 381 GO BP terms, estimated through 5-fold cross-validation
| AUC | P20R | P40R |
|---|---|---|
| 0.8744 | 0.2264 | 0.1673 |
| 0.8590 | 0.1318 | 0.0893 |
| 0.8419 | 0.1064 | 0.0713 |
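For readers unfamiliar with the metrics reported in the table above, the following sketch computes AUC and precision at a fixed recall level for a single term from scored examples. It assumes that precision at recall r is the highest precision among ranking cutoffs whose recall is at least r; other conventions exist, and this record does not say which one was used. The function names `auc` and `precision_at_recall` and the toy data are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_recall(scores, labels, recall_level):
    """Highest precision over ranking cutoffs whose recall >= recall_level.

    Tied scores are ranked in input order; this is a simplification.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = 0.0
    true_pos = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= recall_level:
            best = max(best, true_pos / rank)
    return best


# Toy data: ten proteins, five of them annotated with the term.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
toy_auc = auc(toy_scores, toy_labels)
toy_p20r = precision_at_recall(toy_scores, toy_labels, 0.2)
toy_p40r = precision_at_recall(toy_scores, toy_labels, 0.4)
```

Averaging such per-term values over all terms and cross-validation folds would give figures of the kind tabulated here, under the stated assumption about how precision at a recall level is defined.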
: Average per-term empirical time complexity between and implementations
| | ||||
|---|---|---|---|---|
| 189.60s | 20.44s | 2520.00s | 21.46s | |
| 367.82s | 31.68s | 4919.35s | 33.19s | |
| 549.84s | 45.73s | 7333.10s | 46.69s | |
Average AUC, precision at 20% recall (P20R) and precision at 40% recall (P40R) across 50 GO terms, estimated through 5-fold cross-validation
| AUC | P20R | P40R |
|---|---|---|
| 0.8601 | 0.1449 | 0.0943 |
| 0.9667 | 0.1329 | 0.0929 |
| 0.9598 | 0.0927 | 0.0785 |
Comparison of the average AUC, precision at 20% recall (P20R) and precision at 40% recall (P40R) between multi-species and single-species approaches with 301 species of bacteria
| AUC | P20R | P40R |
|---|---|---|
| 0.8744 | 0.2264 | 0.1673 |
| 0.8590 | 0.1318 | 0.0893 |
| 0.8419 | 0.1064 | 0.0713 |
| 0.8263 | 0.1801 | 0.1176 |
| 0.8146 | 0.1059 | 0.0647 |
| 0.8179 | 0.1009 | 0.0563 |