Selectome update: quality control and computational improvements to a database of positive selection.

Sébastien Moretti, Balazs Laurenczy, Walid H Gharib, Briséïs Castella, Arnold Kuzniar, Hannes Schabauer, Romain A Studer, Mario Valle, Nicolas Salamin, Heinz Stockinger, Marc Robinson-Rechavi.

Abstract

Selectome (http://selectome.unil.ch/) is a database of positive selection, based on a branch-site likelihood test. This model estimates the number of nonsynonymous substitutions (dN) and synonymous substitutions (dS) to evaluate the variation in selective pressure (dN/dS ratio) over branches and over sites. Since the original release of Selectome, we have benchmarked and implemented a thorough quality control procedure on multiple sequence alignments, aiming to minimize false-positive results. We have also improved the computational efficiency of the branch-site test implementation, allowing larger data sets and more frequent updates. Release 6 of Selectome includes all gene trees from Ensembl for Primates and Glires, as well as a large set of vertebrate gene trees. A total of 6810 gene trees have some evidence of positive selection. Finally, the web interface has been improved to be more responsive and to facilitate searches and browsing.

Year: 2013 PMID: 24225318 PMCID: PMC3964977 DOI: 10.1093/nar/gkt1065

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Selectome is a database of positive selection (1). It provides users with access to precomputed estimates of positive selection from the branch-site test (2) mapped to branches of gene trees (including speciations and duplications), and to amino-acid sites of multiple sequence alignments (MSAs). This allows the detection of episodic selection, which is an important component of protein evolution (3). Selectome's first release was based on TreeFam A (PLACEHOLDER FOR NAR DATABASE UPDATE). While this choice was made to ensure high quality, it posed two problems: one is that TreeFam A was, by design, incomplete, and the other is that TreeFam has not been regularly updated. We have thus decided to move to Ensembl Compara (4) to obtain gene trees and MSAs. Ensembl Compara provides a set of gene trees and MSAs as complete as possible, updated with every release of Ensembl (5). Moreover, using Ensembl's gene trees and MSAs allows easy extension to other taxonomic groups, which are covered by the Ensembl Genomes projects (6). The transition from TreeFam A to TreeFam A + B and then to Ensembl Compara has raised two major challenges: (i) computing branch-site positive selection (2) on hundreds of thousands of branches from thousands of gene trees is a major computational challenge, especially considering that CodeML from PAML (7) has never been optimized with respect to computational efficiency; (ii) the MSAs provided by the automated Compara pipeline, while sufficient for many purposes, contain many misaligned regions, which induce false positives in tests for positive selection, especially for the branch-site test (8–10) (the same is true of other pipelines). These false-positive issues led us to label several releases after the transition away from TreeFam A as 'beta'. We present the latest release of Selectome (release 6), which is the first release based on Ensembl Compara to take advantage of improvements concerning both computational efficiency and MSA quality control.

CHANGES IN DATABASE CONTENT

A summary of the content of Selectome release 6 is presented in Table 1. We define taxon-specific subtrees as monophyletic groups, which contain only sequences from the target taxon (Figure 1). We have computed branch-site tests for positive selection for all internal branches of all gene trees of Primates and of Glires, which contained at least six sequences (leaves of the subtree) after alignment quality control. We have also computed the tests for all internal branches of small- to medium-sized gene trees, which cover all Euteleostomi. As in previous releases of Selectome (1), multiple testing is controlled with a q-value of 10% computed over the union of all test results (all branches, all trees); this was done separately for each taxonomic group (i.e. Primates, Glires, Euteleostomi).
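The multiple-testing step described above can be illustrated with a minimal sketch. It assumes Benjamini-Hochberg adjusted p-values as a conservative stand-in for q-values (the text does not specify how q-values are estimated), and the function names and the `(taxon, branch_id, pvalue)` tuple layout are hypothetical. The correction is applied separately to each taxonomic group, over all branches of all trees in that group.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed stand-in for q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_branches(tests, threshold=0.10):
    """tests: iterable of (taxon, branch_id, pvalue); hypothetical layout.

    The adjustment pools every branch of every tree within one taxonomic
    group, and each group is handled independently of the others.
    """
    by_taxon = {}
    for taxon, branch, p in tests:
        by_taxon.setdefault(taxon, []).append((branch, p))
    hits = []
    for taxon, items in by_taxon.items():
        adjusted = bh_adjust([p for _, p in items])
        hits.extend((taxon, b) for (b, _), q in zip(items, adjusted)
                    if q <= threshold)
    return hits
```

With this design, a branch is reported only if its adjusted value is at most 0.10 within its own taxonomic group, so a large number of tests in one group does not change the calls made in another.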

Table 1.

Statistics on release 06 of Selectome

Taxonomic group	Species number	Ensembl release	Subtrees^a: total	Subtrees^a: filtered^b	Subtrees^a: computed	Subtrees^a: with positive selection	Sequences per subtree: median	Sequences per subtree: max
Euteleostomi	54	68	19 940	15 923	13 695^c	6543	32	139
Glires	7	71	20 114	4656^d	4656	136	6	257
Primates	10	70	20 300	15 738	15 738	131	8	180

^a Pruned from larger Ensembl Compara trees, according to the taxonomic group.

^b Subtrees with at least six sequences after alignment quality filtering.

^c The largest gene trees were not computed.

^d Many Glires subtrees do not have six sequences before or after our filtering.
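As a quick cross-check, the counts in Table 1 can be related to figures quoted elsewhere in the article. The sketch below only re-derives numbers from the table; the dictionary layout and function name are illustrative choices.

```python
# Subtree counts copied from Table 1.
TABLE1 = {
    "Euteleostomi": {"filtered": 15923, "computed": 13695, "positive": 6543},
    "Glires": {"filtered": 4656, "computed": 4656, "positive": 136},
    "Primates": {"filtered": 15738, "computed": 15738, "positive": 131},
}


def summarize(table):
    """Return derived totals: positives overall, uncomputed, and % positive."""
    total_positive = sum(row["positive"] for row in table.values())
    not_computed = {taxon: row["filtered"] - row["computed"]
                    for taxon, row in table.items()}
    percent_positive = {taxon: 100.0 * row["positive"] / row["computed"]
                        for taxon, row in table.items()}
    return total_positive, not_computed, percent_positive


total, skipped, percent = summarize(TABLE1)
print(total)                           # subtrees with positive selection
print(skipped["Euteleostomi"])         # filtered but not computed
print(round(percent["Euteleostomi"]))  # percentage of computed subtrees
```

The three printed values correspond to the 6810 gene trees mentioned in the abstract, the 2228 largest Euteleostomi subtrees that were not computed, and the 48% of Euteleostomi subtrees with positive selection.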

Figure 1.

Selectome subtrees from Ensembl Compara gene tree. Left, the tree for human gene ENSGT00410000025651 from Ensembl release 68. Right, the subtrees selected for use in Selectome. Note that (i) as the tree is rooted in Amniota (i.e. there are no homologs detected outside Amniota), which is a subset of Euteleostomi, this node was chosen for the subtree for Euteleostomi; (ii) there are four Primate subtrees, due to gene duplications; (iii) only the Glires subtree with at least six sequences was used; (iv) some Primate or Glires subtrees can differ from the Ensembl tree because they use later Ensembl releases (Table 1).

Since Selectome is now based on Ensembl, all cross-references, taxonomic information, keywords, and other information are now from Ensembl, and no longer from TreeFam.

We have first tackled the computational challenge of updating Selectome by a better use of computing infrastructure. CodeML has been ported to the Swiss multi-scientific computing grid SMSCG (http://www.smscg.ch). All computations for Primates data were done on this infrastructure, using a customized GC3pie framework (11), which notably manages submissions and error messages. We experienced a failure rate of 0.7%, i.e. submission/execution issues that are due to the Grid infrastructure (including exceeding allocated execution time for single jobs). All erroneous jobs were successfully resubmitted. Thus, 67 054 job pairs (H0 and H1 hypotheses of the test sequentially on the same node) were successfully computed on SMSCG, and 276 were computed on the Vital-IT computer cluster (http://www.vital-it.ch), because they exceeded the runtime limit of SMSCG.
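Each job pair yields two maximized log-likelihoods, one under H0 and one under H1, which are combined in a likelihood ratio test. The sketch below assumes the statistic is compared with a chi-square distribution with one degree of freedom, a common convention for this test that the text itself does not state; the function name and the example log-likelihood values are hypothetical.

```python
import math


def branch_site_lrt(lnl_h0, lnl_h1):
    """Likelihood ratio test for one branch (one H0/H1 job pair).

    Returns the statistic 2 * (lnL_H1 - lnL_H0) and its p-value under a
    chi-square distribution with 1 degree of freedom (an assumption here).
    """
    # Numerical noise can make H1 look slightly worse than H0; clamp at zero.
    stat = max(0.0, 2.0 * (lnl_h1 - lnl_h0))
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for a single branch.
stat, p = branch_site_lrt(lnl_h0=-10234.8, lnl_h1=-10229.3)
print(f"LRT = {stat:.2f}, p = {p:.4g}")
```

The p-values obtained this way for all branches are the input to the multiple-testing correction applied per taxonomic group.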
Secondly, we have optimized CodeML for the branch-site test. Briefly, SlimCodeML (12) is an optimized sequential version of CodeML, which provides identical results to the original code. All computations for Euteleostomi and Glires were performed using SlimCodeML on the Vital-IT cluster. For Euteleostomi, the 2228 largest subtrees were not computed because of time limitations on the cluster. This showed again an intrinsic performance/scalability problem of (Slim)CodeML with respect to large data sets.

In the original Selectome pipeline, poorly aligned regions were removed using GBLOCKS (13), but both our experience and published benchmarks (8–10,14) indicate that this is insufficient to remove unreliable regions of MSAs, which cause false positives for the branch-site test of positive selection. The Selectome pipeline now includes the following: realignment with PAGAN (15); masking of amino-acids that have a low consistency score from M-Coffee (16); and masking of amino-acids that have a low score from GUIDANCE (17). In addition, MaxAlign (18) is used to remove sequences that have few unambiguous sites, relative to the rest of the alignment, and TrimAl (19) is used to remove columns with few unambiguous sites. Detailed procedures and thresholds for each release are provided at http://selectome.unil.ch/cgi-bin/methods.cgi. Of note, Privman et al. (14) showed that the loss of true positives by filtering was outweighed by the removal of false positives. In total, 8.7% of MSA columns were removed before selection computations for Primates, versus 4.4% in Selectome 5 (GBLOCKS based pipeline); 12% of columns were removed for Glires, and 34% of columns for Euteleostomi, consistent with the expectation that more divergent sequences are more difficult to align reliably.
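The two kinds of filtering described above, masking individual residues with low reliability scores and removing columns with few unambiguous sites, can be sketched as follows. This is an illustration only: it does not call M-Coffee, GUIDANCE, or TrimAl, and the function names, the per-residue score matrix, and the thresholds are all hypothetical (the thresholds actually used are documented on the Selectome methods page).

```python
def mask_low_scores(alignment, scores, min_score):
    """Replace residues scoring below min_score with 'x'; gaps are kept.

    alignment: list of equal-length aligned sequences.
    scores: per-residue reliability scores with the same shape (hypothetical).
    """
    masked = []
    for seq, row in zip(alignment, scores):
        masked.append("".join(
            aa if aa == "-" or score >= min_score else "x"
            for aa, score in zip(seq, row)))
    return masked


def drop_sparse_columns(alignment, min_fraction):
    """Keep columns whose fraction of unambiguous residues (neither a gap
    nor a masked 'x') is at least min_fraction."""
    n_seqs = len(alignment)
    keep = [j for j in range(len(alignment[0]))
            if sum(seq[j] not in "-x" for seq in alignment) / n_seqs
            >= min_fraction]
    return ["".join(seq[j] for j in keep) for seq in alignment]


msa = ["MKV-L", "MRVAL", "MKIAL"]
scores = [[9, 2, 9, 0, 9], [9, 9, 9, 3, 9], [9, 9, 9, 9, 9]]
masked = mask_low_scores(msa, scores, min_score=5)
trimmed = drop_sparse_columns(masked, min_fraction=0.5)
print(masked)
print(trimmed)
```

Masking keeps the alignment coordinates intact, so unreliable residues are excluded without shifting the remaining sites, whereas column removal shortens the alignment.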
In more detail, in Selectome 5, in Primates we identified 246 678 out of 1 149 639 sites (21%) as under positive selection, including long continuous stretches of 'positively selected' sites, which manual examination showed to be alignment or gene model errors [consistent with (10)]. In Selectome 6, filtering reduced the number of sites analyzed to 392 104, of which 61 119 are identified as under positive selection (16%); there are no more long stretches of sites, and manual inspection does not identify any obvious false positives. Further benchmarking of this pipeline shows that it masks not only MSA regions that are difficult to align because of low complexity or alignment heuristics, but also gene model errors, which are a major source of false positives in MSAs from genomics (Moretti and Robinson-Rechavi, in preparation). By gene model errors, we mean errors in exon boundaries, in coding sequence start or stop, or in prediction or choice of transcript from the gene; all these can lead to the alignment of nonhomologous sites.

MSAs that have fewer than six sequences, or no aligned columns left after the filtering pipeline, are not included in Selectome; this is notably the case for many Glires subtrees (Table 1).
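The inclusion rule stated above is simple enough to write down directly. The sketch below assumes a filtered MSA is represented as a list of equal-length strings; the function and constant names are hypothetical.

```python
MIN_SEQUENCES = 6  # minimum number of sequences stated in the text


def is_included(filtered_msa):
    """Return True if a filtered MSA qualifies for inclusion.

    filtered_msa: list of equal-length aligned sequences remaining after
    the filtering pipeline (an assumed representation).
    """
    if len(filtered_msa) < MIN_SEQUENCES:
        return False
    # At least one aligned column must survive filtering.
    return len(filtered_msa[0]) > 0
```

For example, a subtree whose alignment keeps only five sequences after filtering is excluded, regardless of how many columns remain.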

CHANGES IN WEB INTERFACE

The Selectome web interface is similar to the original TreeFam interface, but with specific enrichments. We list here the main improvements of the interface since Selectome release 1.

Improved search: For keyword search, queries are faster, thanks to the use of Sphinx (http://sphinxsearch.com), and queries are automatically restricted to the most relevant field (e.g. gene, species, cross-reference), which can then be manually modified. For advanced search, a species tree of interest can be chosen (i.e. Euteleostomi, Primates, Glires). Query results can now be viewed by genes or by gene families (subtrees), and sorting is possible according to each column (e.g. selection, taxon, gene name). Moreover, results can be filtered by species or keyword.

Improved graphical user interface: Each query result includes a preview of the gene tree with selection highlighted. On the gene family (subtree) view, positive selection is now indicated by a highlight of the whole branch, rather than a discrete box on the node; there is easy navigation between subtrees from the same Ensembl family; and it is possible to change the size of the gene tree image. For MSA visualization (with the annotation of detected sites under positive selection) in Jalview (20), unreliably aligned sites (not used for computation) can be masked (indicated by the character 'x').

Finally, we provide a DAS service (http://selectome.unil.ch/das/selectome) for integration with other resources [distributed annotation system (21)]. Selectome is also indexed and searchable by the ExPASy portal (http://expasy.org/), and external links to Ensembl point toward the version of Ensembl used for each result to ensure consistency; of note, linking to specific versions is not yet possible for Ensembl Genomes.

CONCLUSIONS AND PERSPECTIVES

Selectome presents, to our knowledge, the only phylogenomic database of branch-site positive selection (other resources are discussed in 1). The most significant progress since the first release is the improved MSA filtering, which dramatically reduces false positives, and allows us to use different input sources: if the input includes low-quality sequences, gene or transcript models or alignments, they are not used for positive selection inference. The use of Ensembl and the improved computational efficiency allow us to present for the first time a database with complete computations of branch-site positive selection for the two most studied mammalian clades: Primates and Glires. The next release of Selectome will also include the Drosophila clade. The major future challenge of Selectome is to further increase computational efficiency, to allow complete computations on large clades such as vertebrates (Euteleostomi), arthropods or green plants. The use of Ensembl and the existence of the Ensembl Genomes projects provide consistent data sources for most clades of interest. We have recently confirmed that the branch-site test can be reliably used even on deep nodes of such clades (22); the results of our partial release on Euteleostomi moreover confirm that with these larger gene trees, we have satisfactory power to detect positive selection (Table 1). The proportion of Euteleostomi genes with positive selection (48%) is lower than the 77% reported previously on a smaller sample (23) (biased toward genes conserved among vertebrates), but remains high, and should be further investigated. A potential problem, which we have not yet addressed, is synonymous rate variation between sites (24), which has been shown to be a problem for the site test but has not been investigated for the branch-site test. As methods of detecting episodic positive selection improve, they will be taken into account in Selectome.
Given the runtime issues for large data sets, we have developed new, parallel and highly optimized software for the branch-site model: FastCodeML (Valle et al., in preparation; ftp://ftp.vital-it.ch/tools/FastCodeML/). Tests show that running this software on a supercomputer allows computing positive selection even on the largest Ensembl Compara gene trees. Future Selectome releases will thus use FastCodeML on a mixture of commodity computers and large computer clusters, and eventually on computational grids. Our aim is to provide yearly updates that cover Ensembl-type data as completely as possible, given the constraints on MSA quality.

FUNDING

Project UNIL.5 (Grid/Selectome) of the ‘AAA/SWITCH–e-infrastructure for e-science’ program; the Swiss Platform for High-Performance and High-Productivity Computing (HP2C); the Swiss National Science Foundation [31003A 133011/1 to M.R.R. and CR32I3_143768 to N.S. and M.R.R.]; Etat de Vaud; Fondation du 450ème anniversaire de l'Université de Lausanne and Swiss National Science Foundation [132476 and 136477 to R.A.S.]. Cluster computations were performed at the Vital-IT (http://www.vital-it.ch) Center for high-performance computing of the SIB Swiss Institute of Bioinformatics. Funding for open access charge: Etat de Vaud.

Conflict of interest statement. None declared.

23 in total

1. Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments.

Authors: Gerard Talavera; Jose Castresana
Journal: Syst Biol Date: 2007-08 Impact factor: 15.683

2. PAML 4: phylogenetic analysis by maximum likelihood.

Authors: Ziheng Yang
Journal: Mol Biol Evol Date: 2007-05-04 Impact factor: 16.240

3. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates.

Authors: Albert J Vilella; Jessica Severin; Abel Ureta-Vidal; Li Heng; Richard Durbin; Ewan Birney
Journal: Genome Res Date: 2008-11-24 Impact factor: 9.043

4. Pervasive positive selection on duplicated and nonduplicated vertebrate protein coding genes.

Authors: Romain A Studer; Simon Penel; Laurent Duret; Marc Robinson-Rechavi
Journal: Genome Res Date: 2008-06-18 Impact factor: 9.043

5. Jalview Version 2--a multiple sequence alignment editor and analysis workbench.

Authors: Andrew M Waterhouse; James B Procter; David M A Martin; Michèle Clamp; Geoffrey J Barton
Journal: Bioinformatics Date: 2009-01-16 Impact factor: 6.937

6. MaxAlign: maximizing usable data in an alignment.

Authors: Rodrigo Gouveia-Oliveira; Peter W Sackett; Anders G Pedersen
Journal: BMC Bioinformatics Date: 2007-08-28 Impact factor: 3.169

7. The distributed annotation system.

Authors: R D Dowell; R M Jokerst; A Day; S R Eddy; L Stein
Journal: BMC Bioinformatics Date: 2001-10-10 Impact factor: 3.169

8. The branch-site test of positive selection is surprisingly robust but lacks power under synonymous substitution saturation and variation in GC.

Authors: Walid H Gharib; Marc Robinson-Rechavi
Journal: Mol Biol Evol Date: 2013-04-04 Impact factor: 16.240

9. Selectome: a database of positive selection.

Authors: Estelle Proux; Romain A Studer; Sébastien Moretti; Marc Robinson-Rechavi
Journal: Nucleic Acids Res Date: 2008-10-28 Impact factor: 16.971

10. M-Coffee: combining multiple sequence alignment methods with T-Coffee.

Authors: Iain M Wallace; Orla O'Sullivan; Desmond G Higgins; Cedric Notredame
Journal: Nucleic Acids Res Date: 2006-03-23 Impact factor: 16.971
