CompositeSearch: A Generalized Network Approach for Composite Gene Families Detection.

Jananan Sylvestre Pathmanathan¹, Philippe Lopez¹, François-Joseph Lapointe², Eric Bapteste¹.

Abstract

Genes evolve by point mutations, but also by shuffling, fusion, and fission of genetic fragments. Therefore, similarity between two sequences can be due to common ancestry producing homology, and/or partial sharing of component fragments. Disentangling these processes is especially challenging in large molecular data sets, because of computational time. In this article, we present CompositeSearch, a memory-efficient, fast, and scalable method to detect composite gene families in large data sets (typically in the range of several million sequences). CompositeSearch generalizes the use of similarity networks to detect composite and component gene families with a greater recall, accuracy, and precision than recent programs (FusedTriplets and MosaicFinder). Moreover, CompositeSearch provides user-friendly quality descriptions regarding the distribution and primary sequence conservation of these gene families allowing critical biological analyses of these data.

Keywords: bioinformatics; evolution; molecular evolution; network analysis; protein sequence analysis

Year: 2018 PMID: 29092069 PMCID: PMC5850286 DOI: 10.1093/molbev/msx283

Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240

Genetic sequences evolve through multiple processes beyond point mutations. In particular, the remodeling of genes by shuffling of genetic fragments, fusion, and fission, as well as de novo gene emergence, contributes to the creation and diversification of gene families (Kawai et al. 2003; Moore et al. 2008; Kaessmann 2010; Marsh and Teichmann 2010; Wu et al. 2012; Promponas et al. 2014; Bornberg-Bauer et al. 2015; McLysaght and Guerzoni 2015; Ruiz-Orera et al. 2015; Guerzoni and McLysaght 2016; Lees et al. 2016; Meheust et al. 2016). Therefore, genetic sequences show similarity with one another for diverse reasons, that is, common ancestry producing homology, and/or partial sharing of component fragments (Song et al. 2008; Haggerty et al. 2014). These processes must be disentangled to understand the rules and constraints on gene evolution. Although gene remodeling has been especially studied in eukaryotes (Kawai et al. 2003; Patthy 2003; Ekman et al. 2007; Nakamura et al. 2007; Meheust et al. 2016) and in cultured prokaryotes (Enright et al. 1999; Marcotte et al. 1999; Enright and Ouzounis 2000, 2001; Snel et al. 2000; Jachiet et al. 2013), analyses of large molecular data sets remain a computational bottleneck (Salim et al. 2011; Jachiet et al. 2013). For instance, a large-scale investigation of how remodeled genes evolved in prokaryotes would require comparing millions of coding sequences from the thousands of complete genomes available, but previous detection methods are unable to handle such large data sets. In this article, we present CompositeSearch, a memory-efficient, fast, and scalable method to detect composite gene families in large data sets, typically in the range of several million sequences. Composite genes are the result of the fusion of partial or complete nonhomologous DNA fragments, called components, or of the fission of a larger gene into dissociated persistent fragments (fig. 1).
CompositeSearch generalizes the use of similarity networks to detect composite and component gene families with a greater recall, accuracy, and precision than recent programs, FusedTriplets and MosaicFinder (Jachiet et al. 2013). Moreover, it provides user-friendly quality descriptions regarding the distribution and primary sequence conservation of these gene families allowing critical biological analyses of these data, and it is used as an input for the reconstruction of multirooted gene networks (Haggerty et al. 2014).

Fig. 1

(A) Top: Example of a composite gene. Gene family 3 evolved from a composite of families 1 and 2. Bottom: Sequences from family 3 partially align with sequences from families 1 and 2. (B) Similarity network of a composite gene family (red) and its component gene families (green and purple). MosaicFinder will detect only the top case where composite genes form a clique, whereas CompositeSearch detects composite gene families forming a clique (top) or quasi-clique (bottom).

New Approach

Here, we present CompositeSearch, a memory-efficient, fast, and scalable method, implemented in C++, which detects composite gene families in large data sets (typically in the range of several million sequences). Composite genes are traditionally defined based on their apparent modularity: they are composed of segments (i.e., components) that have evolved separately in distinct gene families (Patthy 2003; Song et al. 2008; Jachiet et al. 2013). Under this definition, composite genes can be the result of fusion of components, or involved as progenitors in fission events, after which associations of components are split into separate gene families. CompositeSearch generalizes the use of sequence similarity networks (SSNs) to detect composite and component gene families. SSNs are undirected graphs, where each node represents a unique sequence and each edge represents the similarity between connected sequences (given similarity criteria, such as a minimum percentage identity, a maximum BLAST E value [Altschul et al. 1990], and a minimum mutual coverage, i.e., the minimal length covered by the matching parts with respect to the total length of each compared sequence) (Jachiet et al. 2013; Corel et al. 2016). For a given comparison between two sequences, the alignment, score, and E value are not symmetric: they can vary depending on which sequence is used as the query. Thus, the network is first symmetrized by considering the best match of each pairwise comparison. As the greatest asymmetry is found in the better-scoring comparisons (i.e., at a much more stringent threshold than the ones used for network reconstruction; Atkinson et al. 2009), this procedure does not impact the topology. This network’s structure captures much of the history of gene evolution: not only divergence by point mutations but also recombinations, fusions, and fission events (Adai et al. 2004; Jachiet et al. 2013).
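As an illustration of the network construction described above, the following Python sketch filters directed BLAST-like hits and symmetrizes them by keeping the best match of each pair. It is a minimal sketch under stated assumptions, not the CompositeSearch implementation (which is written in C++): the `Hit` record, the `build_ssn` function, and the default thresholds (typical values for gene families) are hypothetical names and choices introduced here for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One directed pairwise comparison (hypothetical record layout)."""
    query: str
    subject: str
    evalue: float
    pident: float  # percentage identity
    q_cov: float   # fraction of the query covered by the match
    s_cov: float   # fraction of the subject covered by the match


def build_ssn(hits, max_evalue=1e-5, min_pident=30.0, min_cov=0.8):
    """Return {frozenset({a, b}): best Hit} for pairs passing the criteria."""
    edges = {}
    for h in hits:
        if h.query == h.subject:
            continue  # self-hits carry no information
        if h.evalue > max_evalue or h.pident < min_pident:
            continue
        if min(h.q_cov, h.s_cov) < min_cov:
            continue  # mutual coverage: both sequences must be covered
        key = frozenset((h.query, h.subject))  # undirected edge
        best = edges.get(key)
        # Symmetrization: keep the better of the A->B and B->A comparisons.
        if best is None or h.evalue < best.evalue:
            edges[key] = h
    return edges


hits = [
    Hit("A", "B", 1e-20, 45.0, 0.90, 0.85),
    Hit("B", "A", 1e-25, 45.0, 0.85, 0.90),
    Hit("A", "C", 1e-3, 50.0, 0.95, 0.95),  # fails the E value criterion
]
ssn = build_ssn(hits)
```

In this sketch, setting `min_cov` to 0 would retain partial matches, which are the matches that connect a composite gene to its components.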
Typically, gene families form subgraphs with high connectivity, in which connected sequences display significant BLAST E values ≤ 1E−5, mutual covers ≥ 80%, and %ID ≥ 30%. By contrast, superfamilies (Atkinson et al. 2009) and composite gene families (Song et al. 2008; Jachiet et al. 2013, 2014; Haggerty et al. 2014; Meheust et al. 2016) introduce more complex informative patterns in SSNs. Using these graphs to identify composite genes and gene families, CompositeSearch shows a greater recall, accuracy, and precision than recent programs, FusedTriplets (FT) and MosaicFinder (MF). In short, these two programs are helpful but limited in scope. FT cannot handle large data sets and does not define composite gene families. MF is also unable to analyze large data sets (due to memory and speed limitations). Although it identifies composite and component gene families, MF is only meant to find highly conserved composite gene families that form minimal clique separators in sequence similarity networks. The “clique” condition implies that MF misses divergent (e.g., ancient or fast evolving) composite gene families (whose members do not necessarily connect all together in sequence similarity networks) (fig. 1). The “separator” condition implies that MF will fail to detect composite genes in data sets with highly remodeled genes. Indeed, the repeated use of gene components introduces cyclic paths in sequence similarity networks, which turns composite families into local, but not global, separators. Beyond its larger scope and better performance, CompositeSearch can also provide quality descriptions (absent from MF and FT) regarding the size and primary sequence conservation of composite and component gene families, easing critical biological analyses of these data. CompositeSearch is available at https://github.com/TeamAIRE/CompositeSearch, last accessed November 2, 2017.
For a detailed description of the algorithm, see supplementary Materials and Methods, Supplementary Material online.
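To make the notion of a composite gene in a similarity network more concrete, the sketch below applies a simplified triplet test: a gene is flagged when two of its neighbors match nonoverlapping regions of it and are not similar to each other. This is an illustrative simplification and not the CompositeSearch algorithm, which is described in the supplementary material; the function name `composite_candidates`, the input layout, and the `max_overlap` tolerance are assumptions made here.

```python
from itertools import combinations


def composite_candidates(alignments, max_overlap=20):
    """Flag genes matched on distinct regions by two unrelated neighbors.

    alignments maps (gene, partner) to the (start, end) region of `gene`
    covered by its match with `partner`. Both (x, y) and (y, x) are expected.
    """
    neighbors = {}
    for gene, partner in alignments:
        neighbors.setdefault(gene, set()).add(partner)

    candidates = set()
    for gene, partners in neighbors.items():
        for a, b in combinations(sorted(partners), 2):
            if b in neighbors.get(a, ()):
                continue  # a and b are similar: not distinct components
            (s1, e1) = alignments[(gene, a)]
            (s2, e2) = alignments[(gene, b)]
            overlap = min(e1, e2) - max(s1, s2) + 1  # negative if disjoint
            if overlap <= max_overlap:
                candidates.add(gene)
                break
    return candidates


# Gene C matches A on positions 1-100 and B on positions 120-250.
alignments = {
    ("C", "A"): (1, 100), ("A", "C"): (1, 100),
    ("C", "B"): (120, 250), ("B", "C"): (1, 130),
}
found = composite_candidates(alignments)
```

Here `found` contains only `"C"`, since A and B each have a single neighbor and share no edge with each other.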

Results

Benchmark on Simulated Data

We tested and compared CompositeSearch with FT and MF (Jachiet et al. 2013) on 100 replicates of simulated data, covering a large range of parameters and simulating 2-component and 3-component composites (supplementary fig. S4 and Materials and Methods, Supplementary Material online). We explored the effect of gene family divergence and multiple component reassortments on composite gene detection under the hypothesis that the more divergent gene families are, the harder they are to detect. The sensitivity and specificity of each program are summarized in supplementary table S1, Supplementary Material online. In terms of detection of composite genes, CompositeSearch performs as well as FT, with identical True Positive Rate (TPR) and False Positive Rate (FPR), but, unlike FT, CompositeSearch returns composite gene families. CompositeSearch also has a higher TPR than MF, especially for divergent composite sequences, with a similar 1% FPR. Therefore, CompositeSearch will find additional composite genes with respect to MF, thanks to the detection of composite genes forming quasi-cliques. As CompositeSearch is able to detect the number of components for each composite, we created a more detailed table (supplementary table S2, Supplementary Material online) showing the sensitivity and specificity of CompositeSearch in detecting the exact number of components.
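For reference, the TPR and FPR used in this benchmark follow the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The short Python sketch below computes both from a set of predicted composites and a set of simulated (true) composites; the function name `rates` and the toy gene identifiers are hypothetical and are not taken from the benchmark itself.

```python
def rates(predicted, truth, universe):
    """Return (TPR, FPR) for a set of predictions over a gene universe."""
    predicted, truth, universe = set(predicted), set(truth), set(universe)
    tp = len(predicted & truth)             # composites correctly detected
    fp = len(predicted - truth)             # noncomposites wrongly detected
    fn = len(truth - predicted)             # composites missed
    tn = len(universe - predicted - truth)  # noncomposites correctly ignored
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


universe = [f"g{i}" for i in range(1, 11)]  # 10 simulated genes
truth = {"g1", "g2", "g3", "g4"}            # 4 true composites
predicted = {"g1", "g2", "g3", "g5"}        # 3 hits, 1 false alarm
tpr, fpr = rates(predicted, truth, universe)
```

With these toy sets, three of the four true composites are recovered (TPR = 0.75) and one of the six noncomposites is wrongly reported (FPR = 1/6).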

Benchmark on Real Data

We also used a data set of 204,894 viral proteins from Jachiet et al. (2014) to benchmark our software against real data. CompositeSearch detected 21,623 composite genes clustered in 5,532 families, vastly outperforming MF (5,845 composites in 1,718 families). FT found slightly more composites (23,305), but did not return any families. This slight increase in the number of composites detected by FT was mainly due to BLAST overextending matches on real data, thus producing false positives.

Performance

Because its algorithm uses a dichotomous search to browse the network and because it is multithreaded, CompositeSearch outperforms both FT and MF in terms of speed and memory use, even on one CPU, when these programs are compared on a Linux machine with Intel Xeon CPU E5-2630 v2 2.60-GHz processors and 256 GB RAM. This is especially noticeable for large metagenomic data sets (table 1). By contrast, after construction of the SSN, the detection of composite genes and composite gene families runs in a few seconds to a few minutes, depending on the network’s size.
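The text does not detail how the dichotomous (binary) search is applied. One plausible use, shown here purely as an assumption, is to locate all edges of a node in an edge list sorted by source node, which takes logarithmic rather than linear time per lookup. The function `neighbors_of` and the parallel-array layout are hypothetical.

```python
from bisect import bisect_left, bisect_right


def neighbors_of(sources, targets, node):
    """Return the neighbors of `node` by binary search.

    `sources` and `targets` are parallel lists describing edges sorted by
    source node, with every undirected edge stored in both directions.
    """
    lo = bisect_left(sources, node)   # first edge whose source is `node`
    hi = bisect_right(sources, node)  # one past the last such edge
    return targets[lo:hi]


edges = sorted([(1, 2), (2, 1), (1, 3), (3, 1), (3, 4), (4, 3)])
sources = [s for s, _ in edges]
targets = [t for _, t in edges]
```

With this layout, `neighbors_of(sources, targets, 3)` returns `[1, 4]`, and a node absent from the network yields an empty list.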

Table 1

CompositeSearch, FusedTriplets, and MosaicFinder Performance Comparison.

Data	Nodes	Edges	Software	#CPU	Runtime	Memory (GB)
1	338,868	71,946,457	MosaicFinder	1	548 h 27 min	82
			FusedTriplets	1	70 h 47 min	18
			CompositeSearch	1	00 h 12 min	2.5
			CompositeSearch	10	00 h 06 min	2.5
2	3,166,706	282,789,792	MosaicFinder	1	—	—
			FusedTriplets	1	—	—
			CompositeSearch	10	08 h 48 min	32

Note: We compared the performance of CompositeSearch, FusedTriplets, and MosaicFinder on the same Linux machine with Intel Xeon CPU E5-2630 v2 2.60-GHz processors and 256 GB RAM. Data set (1) is an SSN from complete plasmid genomes (NCBI December 2014) and data set (2) is an SSN from HCH metagenomes (Sangwan et al. 2012). CompositeSearch outperforms FusedTriplets and MosaicFinder even with one CPU, as shown for data set (1). On data set (2), FusedTriplets and MosaicFinder stopped after running out of memory, which was not the case for CompositeSearch.
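The single-CPU figures reported for data set (1) in table 1 can be turned into relative speedups and memory ratios with simple arithmetic, as in the following Python sketch. Only the runtimes and memory values come from the table; the variable and function names are illustrative.

```python
def to_minutes(hours, minutes):
    """Convert an 'H h M min' runtime to minutes."""
    return hours * 60 + minutes


# Single-CPU runs on data set (1), as reported in table 1.
runtime_min = {
    "MosaicFinder": to_minutes(548, 27),
    "FusedTriplets": to_minutes(70, 47),
    "CompositeSearch": to_minutes(0, 12),
}
memory_gb = {"MosaicFinder": 82, "FusedTriplets": 18, "CompositeSearch": 2.5}

base_time = runtime_min["CompositeSearch"]
base_mem = memory_gb["CompositeSearch"]
speedup = {name: t / base_time for name, t in runtime_min.items()}
memory_ratio = {name: m / base_mem for name, m in memory_gb.items()}
```

On these figures, CompositeSearch on one CPU is roughly 2,700 times faster than MosaicFinder and roughly 350 times faster than FusedTriplets, while using about 33 and 7 times less memory, respectively.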

Discussion

CompositeSearch is an efficient tool that detects composite genes and composite gene families. It allows investigating the process of gene remodeling in large data sets, for example metagenomes and/or thousands of complete genomes. Although CompositeSearch is faster than currently available software, such as FusedTriplets and MosaicFinder, it can still be improved. We observed that in CompositeSearch, the most time-consuming step is the detection of gene families, using a depth-first search (DFS) algorithm that runs on a single CPU. Parallelized algorithms that detect connected components are available (Kang et al. 2009; Iverson et al. 2015), but they usually require high computational resources. As CompositeSearch was developed with maximum portability in mind, these algorithms are not yet implemented but could be in a future version.
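The single-threaded DFS step mentioned above can be sketched as a connected-component search on the filtered similarity network, each component being treated as one gene family. The Python code below is a minimal illustration of that general technique under this assumption, not the CompositeSearch source; the function name `connected_components` is hypothetical.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components with an iterative DFS."""
    adjacency = {n: [] for n in nodes}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        stack = [start]  # explicit stack avoids recursion limits
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return components


families = connected_components(
    ["a", "b", "c", "d", "e", "f"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
```

In this toy network, `families` holds three groups: `a`, `b`, and `c` together, `d` and `e` together, and the singleton `f`. Each node and edge is visited once, so the search is linear in the size of the network but inherently sequential, which is consistent with this step dominating the runtime.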

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

References

1. Detecting protein function and protein-protein interactions from genome sequences.

Authors: E M Marcotte; M Pellegrini; H L Ng; D W Rice; T O Yeates; D Eisenberg
Journal: Science Date: 1999-07-30 Impact factor: 47.728

2. Responses of ferns to red light are mediated by an unconventional photoreceptor.

Authors: Hiroko Kawai; Takeshi Kanegae; Steen Christensen; Tomohiro Kiyosue; Yoshikatsu Sato; Takato Imaizumi; Akeo Kadota; Masamitsu Wada
Journal: Nature Date: 2003-01-16 Impact factor: 49.962

Review 3. Emergence of de novo proteins from 'dark genomic matter' by 'grow slow and moult'.

Authors: Erich Bornberg-Bauer; Jonathan Schmitz; Magdalena Heberlein
Journal: Biochem Soc Trans Date: 2015-10 Impact factor: 5.407

4. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies.

Authors: Holly J Atkinson; John H Morris; Thomas E Ferrin; Patricia C Babbitt
Journal: PLoS One Date: 2009-02-03 Impact factor: 3.240

5. Comparative metagenomic analysis of soil microbial communities across three hexachlorocyclohexane contamination levels.

Authors: Naseer Sangwan; Pushp Lata; Vatsala Dwivedi; Amit Singh; Neha Niharika; Jasvinder Kaur; Shailly Anand; Jaya Malhotra; Swati Jindal; Aeshna Nigam; Devi Lal; Ankita Dua; Anjali Saxena; Nidhi Garg; Mansi Verma; Jaspreet Kaur; Udita Mukherjee; Jack A Gilbert; Scot E Dowd; Rajagopal Raman; Paramjit Khurana; Jitendra P Khurana; Rup Lal
Journal: PLoS One Date: 2012-09-28 Impact factor: 3.240

6. Evolution at the subgene level: domain rearrangements in the Drosophila phylogeny.

Authors: Yi-Chieh Wu; Matthew D Rasmussen; Manolis Kellis
Journal: Mol Biol Evol Date: 2011-09-07 Impact factor: 16.240

7. Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions.

Authors: A J Enright; C A Ouzounis
Journal: Genome Biol Date: 2001 Impact factor: 13.583

8. De Novo Genes Arise at a Slow but Steady Rate along the Primate Lineage and Have Been Subject to Incomplete Lineage Sorting.

Authors: Daniele Guerzoni; Aoife McLysaght
Journal: Genome Biol Evol Date: 2016-04-25 Impact factor: 3.416

Review 9. New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation.

Authors: Aoife McLysaght; Daniele Guerzoni
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2015-09-26 Impact factor: 6.237

Review 10. Network-Thinking: Graphs to Analyze Microbial Complexity and Evolution.

Authors: Eduardo Corel; Philippe Lopez; Raphaël Méheust; Eric Bapteste
Journal: Trends Microbiol Date: 2016-01-13 Impact factor: 17.079
