Jananan Sylvestre Pathmanathan, Philippe Lopez, François-Joseph Lapointe, Eric Bapteste.
Abstract
Genes evolve by point mutations, but also by shuffling, fusion, and fission of genetic fragments. Therefore, similarity between two sequences can be due to common ancestry producing homology, and/or partial sharing of component fragments. Disentangling these processes is especially challenging in large molecular data sets, because of computational time. In this article, we present CompositeSearch, a memory-efficient, fast, and scalable method to detect composite gene families in large data sets (typically in the range of several million sequences). CompositeSearch generalizes the use of similarity networks to detect composite and component gene families with a greater recall, accuracy, and precision than recent programs (FusedTriplets and MosaicFinder). Moreover, CompositeSearch provides user-friendly quality descriptions regarding the distribution and primary sequence conservation of these gene families, allowing critical biological analyses of these data.
Keywords: bioinformatics; evolution; molecular evolution; network analysis; protein sequence analysis
Year: 2018 PMID: 29092069 PMCID: PMC5850286 DOI: 10.1093/molbev/msx283
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. (A) Top: Example of a composite gene. Gene family 3 evolved from a composite of families 1 and 2. Bottom: Sequences from family 3 partially align with sequences from families 1 and 2. (B) Similarity network of a composite gene family (red) and its component gene families (green and purple). MosaicFinder will detect only the top case where composite genes form a clique, whereas CompositeSearch detects composite gene families forming a clique (top) or quasi-clique (bottom).
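The pattern in panel (A) can be illustrated with a small toy sketch: a gene is a composite candidate when it aligns with two other sequences on non-overlapping regions while those two sequences share no similarity with each other. This is only an illustration of the general idea, not CompositeSearch's actual algorithm, which works on gene families and also handles quasi-cliques. The function name, the tuple layout `(query, subject, query_start, query_end)`, the `max_overlap` parameter, and the gene names and coordinates are all hypothetical.

```python
from itertools import combinations


def find_composite_candidates(hits, max_overlap=0):
    """Return {gene: [(component_a, component_b), ...]}.

    A gene is reported when it aligns with two partners on regions that
    overlap by at most `max_overlap` positions, and the two partners have
    no recorded similarity to each other.
    `hits` is an iterable of (query, subject, query_start, query_end).
    """
    similar = set()   # unordered pairs of sequences with any recorded hit
    regions = {}      # query -> list of (subject, start, end) on the query
    for query, subject, start, end in hits:
        similar.add(frozenset((query, subject)))
        regions.setdefault(query, []).append((subject, start, end))

    candidates = {}
    for gene, aligned in regions.items():
        for (a, a_start, a_end), (b, b_start, b_end) in combinations(aligned, 2):
            if a == b:
                continue
            # Number of query positions covered by both alignments
            # (zero or negative means the regions do not overlap).
            overlap = min(a_end, b_end) - max(a_start, b_start) + 1
            if overlap > max_overlap:
                continue
            # Components must not be similar to each other.
            if frozenset((a, b)) in similar:
                continue
            candidates.setdefault(gene, []).append((a, b))
    return candidates


# Made-up alignments: g3 matches g1 on its first part and g2 on its second part.
hits = [
    ("g3", "g1", 1, 120),
    ("g3", "g2", 130, 260),
    ("g1", "g3", 1, 120),
    ("g2", "g3", 1, 131),
]
print(find_composite_candidates(hits))
```

In this toy input, `g3` plays the role of a family 3 sequence, with `g1` and `g2` standing in for sequences of families 1 and 2. Adding a hit between `g1` and `g2` would remove `g3` from the candidates, since the two partners would then be similar to each other.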
Performance Comparison of CompositeSearch, FusedTriplets, and MosaicFinder.
| Data | Nodes | Edges | Software | #CPU | Runtime | Memory (GB) |
|---|---|---|---|---|---|---|
| 1 | 338,868 | 71,946,457 | MosaicFinder | 1 | 548 h 27 min | 82 |
| | | | FusedTriplets | 1 | 70 h 47 min | 18 |
| | | | CompositeSearch | 1 | 00 h 12 min | 2.5 |
| | | | CompositeSearch | 10 | 00 h 06 min | 2.5 |
| 2 | 3,166,706 | 282,789,792 | MosaicFinder | 1 | — | — |
| | | | FusedTriplets | 1 | — | — |
| | | | CompositeSearch | 10 | 08 h 48 min | 32 |
Note: We compared the performance of CompositeSearch, FusedTriplets, and MosaicFinder on the same Linux machine with Intel Xeon E5-2630 v2 2.60-GHz processors and 256 GB of RAM. Data set (1) is a sequence similarity network (SSN) built from complete plasmid genomes (NCBI, December 2014), and data set (2) is an SSN built from HCH metagenomes (Sangwan et al. 2012). CompositeSearch outperforms FusedTriplets and MosaicFinder even with one CPU, as shown for data set (1). On data set (2), FusedTriplets and MosaicFinder stopped after running out of memory, whereas CompositeSearch ran to completion.
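To put the single-CPU figures for data set (1) on a common scale, the runtimes and memory footprints from the table can be converted into ratios relative to CompositeSearch. The sketch below only re-expresses the values reported in the table; the variable and function names are illustrative.

```python
def to_minutes(hours, minutes):
    """Convert an 'H h M min' runtime into minutes."""
    return hours * 60 + minutes


# Single-CPU results for data set (1), copied from the table.
runtimes_min = {
    "MosaicFinder": to_minutes(548, 27),
    "FusedTriplets": to_minutes(70, 47),
    "CompositeSearch": to_minutes(0, 12),
}
memory_gb = {
    "MosaicFinder": 82,
    "FusedTriplets": 18,
    "CompositeSearch": 2.5,
}

baseline = "CompositeSearch"
# How many times longer each tool ran, and how much more memory it used.
speedup = {tool: t / runtimes_min[baseline] for tool, t in runtimes_min.items()}
memory_ratio = {tool: m / memory_gb[baseline] for tool, m in memory_gb.items()}

for tool in runtimes_min:
    print(f"{tool}: {speedup[tool]:.0f}x runtime, {memory_ratio[tool]:.1f}x memory")
```

With these numbers, CompositeSearch on one CPU is roughly 2,700 times faster than MosaicFinder and roughly 350 times faster than FusedTriplets, while using about 33 and 7 times less memory, respectively.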