SonicParanoid: fast, accurate and easy orthology inference.

Salvatore Cosentino, Wataru Iwasaki

Abstract

Motivation: Orthology inference constitutes a common base of many genome-based studies, as a pre-requisite for annotating new genomes, finding target genes for biotechnological applications and revealing the evolutionary history of life. Although its importance keeps rising with the ever-growing number of sequenced genomes, existing tools are computationally demanding and difficult to employ.
Results: Here, we present SonicParanoid, which is faster than, but comparably accurate to, the well-established tools with a balanced precision-recall trade-off. Furthermore, SonicParanoid substantially relieves the difficulties of orthology inference for those who need to construct and maintain their own genomic datasets.
Availability and implementation: SonicParanoid is available with a GNU GPLv3 license on the Python Package Index and BitBucket. Documentation is available at http://iwasakilab.bs.s.u-tokyo.ac.jp/sonicparanoid.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year:  2019        PMID: 30032301      PMCID: PMC6298048          DOI: 10.1093/bioinformatics/bty631

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Owing to recent advances in DNA sequencing technologies, the number of completely sequenced genomes is growing at an accelerated pace. The accurate inference of orthologous genes encoded on multiple genomes is the key to various analyses based on those datasets (Altenhoff and Dessimoz, 2012). For example, comparative genomics, genome annotation, phylogenomics and the development of genome databases all depend on reliable orthology inference. There are dozens of tools available for orthology inference, including InParanoid (Sonnhammer and Östlund, 2015) and its extension MultiParanoid (Alexeyenko et al.), OrthoMCL (Li, 2003), Hieranoid (Kaduk and Sonnhammer, 2017), OMA (Train et al.), Proteinortho (Lechner et al.), OrthoFinder (Emms and Kelly, 2015) and PANTHER (Mi et al.). InParanoid is one of the oldest and most popular tools and has the best trade-off between specificity and recall (Altenhoff et al.). Here, we present SonicParanoid, which is a fast, accurate and easy-to-use tool for multi-species orthology inference.

2 Materials and methods

SonicParanoid borrows the concepts used in the graph-based algorithm of InParanoid (Remm et al.) because of its reported accuracy (Altenhoff et al.; Chen et al.), but it changes the core algorithm, and it speeds up and automates the entire process (Supplementary Fig. S1). To reduce the computational time, second-pass alignments and bootstrapping tests are skipped, and MMseqs2 (Steinegger and Söding, 2017) is used instead of legacy-BLAST (Altschul et al.). Furthermore, SonicParanoid adopts a new scoring function and a configurable threshold that take into account sequence-length differences (Supplementary Material and Supplementary Figs S2 and S3). Another difference from the original InParanoid algorithm is the way in which overlapping groups of orthologs are clustered. In Remm et al., groups are merged or removed based on comparisons of confidence scores of the grouped orthologs. In contrast, SonicParanoid treats groups as elements of numeric sets, which makes the algorithm faster with minimal difference in accuracy (Supplementary Fig. S4). Finally, because the multi-species orthology inference step is fully automated, users can avoid the cumbersome and error-prone collection of ortholog tables and creation of configuration files required by MultiParanoid. Details are described in Supplementary Material.
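The set-based view of ortholog groups can be illustrated with a minimal sketch. The text above states only that groups are treated as elements of numeric sets; the merge rule used here (groups sharing at least one member are unified) and the function name merge_overlapping_groups are assumptions made for illustration, not the procedure documented in the Supplementary Material.

```python
def merge_overlapping_groups(groups):
    """Merge groups (sets of integer protein IDs) that share at least one member.

    Hypothetical illustration only: the actual SonicParanoid clustering rule
    is described in the paper's Supplementary Material.
    """
    merged = []
    for group in groups:
        current = set(group)
        remaining = []
        for existing in merged:
            if current & existing:      # overlap found: absorb the existing group
                current |= existing
            else:                       # no overlap: keep the existing group as is
                remaining.append(existing)
        remaining.append(current)
        merged = remaining
    return merged


# Toy input: integer IDs stand in for proteins.
groups = [{1, 2}, {2, 3}, {7, 8}, {9}, {8, 10}]
clusters = merge_overlapping_groups(groups)
print(sorted(sorted(c) for c in clusters))  # [[1, 2, 3], [7, 8, 10], [9]]
```

Representing proteins as integers lets overlap checks rely on fast built-in set operations instead of per-ortholog score comparisons, which is the kind of saving the paragraph above alludes to.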

3 Results and discussion

3.1 Speed evaluation

We ran SonicParanoid, InParanoid (ver 4.1), OrthoFinder (ver 2.1.2) and Proteinortho (ver 5.15) on a benchmark proteome dataset provided by the Quest for Orthologs (QfO) consortium (Gabaldón et al.). Details including the hardware and software settings are described in Supplementary Material. In addition to InParanoid, OrthoFinder and Proteinortho were selected because they were reported to be the fastest among the existing tools (Lechner et al.). When the complete QfO dataset (including 66 proteomes) was processed using eight CPUs, InParanoid, OrthoFinder on BLASTP (Camacho et al.), OrthoFinder on DIAMOND (Buchfink et al.) and Proteinortho required 2079.9, 222.1, 8.2 and 212.4 h, respectively, while SonicParanoid in the default mode reduced the running time to 4.3 h (i.e. 482.8, 51.6, 1.9 and 49.3 times faster, respectively) (Fig. 1a, left). When the execution time was measured without considering the time required for the all-vs-all alignments, SonicParanoid was still faster than the other methods (Fig. 1a, right). This trend was also observed with the eukaryotic and prokaryotic subsets of the QfO dataset as input (Supplementary Figs S5 and S6). We also conducted a scalability analysis of SonicParanoid on large-scale genomic data (Supplementary Fig. S7) as well as a memory usage analysis (Supplementary Fig. S8), which show that SonicParanoid is applicable to large genomic datasets.
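The speed-up folds are simply ratios of running times. The short sketch below recomputes them from the rounded hours quoted above; the results come close to, but do not exactly match, the reported folds, presumably because the reported values were derived from unrounded timings (an assumption, since the raw timings are not given here).

```python
# Running times in hours, as quoted in the text (rounded values).
baseline_hours = {
    "InParanoid": 2079.9,
    "OrthoFinder (BLASTP)": 222.1,
    "OrthoFinder (DIAMOND)": 8.2,
    "Proteinortho": 212.4,
}
sonicparanoid_hours = 4.3  # SonicParanoid, default mode

# Speed-up fold = baseline time / SonicParanoid time.
speedups = {
    tool: hours / sonicparanoid_hours for tool, hours in baseline_hours.items()
}

for tool, fold in speedups.items():
    print(f"{tool}: about {fold:.1f} times faster")
```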
Fig. 1. Speed and accuracy of SonicParanoid. (a) Speed-up of SonicParanoid (in the fast, default and sensitive modes) in relation to InParanoid, OrthoFinder2 (on BLASTP and DIAMOND) and Proteinortho on the complete 2011 version of the QfO dataset (left: complete execution, right: orthology inference step only). The numbers on the tiles and their colors indicate the speed-up folds. The numbers in the square brackets represent the execution time in hours (left) or minutes (right). Eight processors were used for each tool. (b) Accuracy of SonicParanoid and 13 other orthology inference tools assessed by the QfO benchmark (generalized species tree discordance test at the last universal common ancestor level). The x-axis represents the numbers of completed tree samples out of 50 000 trials (larger is better), while the y-axis represents the average tree error (smaller is better).

3.2 Accuracy evaluation

The accuracy of SonicParanoid was compared with that of 13 orthology inference methods using the benchmark service from the QfO community (Altenhoff et al.). Details on the construction of the benchmark datasets are described in the Supplementary Material. Overall, the results indicate that SonicParanoid is a good alternative to the existing methods (Fig. 1b and Supplementary Figs S9 and S10). SonicParanoid (especially in the default and sensitive modes) showed accuracy similar to that of InParanoid and was near the Pareto frontiers in most tests.
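For readers unfamiliar with the term, a method lies on the Pareto frontier of such a benchmark when no other method is at least as good on both axes and strictly better on one. The sketch below computes the frontier for the two axes of Fig. 1b (completed trees, larger is better; average tree error, smaller is better). The method names and numbers are invented for illustration and are not benchmark results.

```python
def pareto_front(points):
    """Return the names of methods that no other method dominates.

    Each point is (name, completed_trees, avg_tree_error):
    more completed trees is better, lower error is better.
    """
    front = []
    for name, trees, err in points:
        dominated = any(
            (t2 >= trees and e2 <= err) and (t2 > trees or e2 < err)
            for _, t2, e2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical values, NOT taken from the QfO benchmark.
methods = [
    ("A", 30000, 0.20),
    ("B", 20000, 0.15),
    ("C", 18000, 0.22),  # worse than A on both axes
    ("D", 10000, 0.10),
]
print(pareto_front(methods))  # ['A', 'B', 'D']
```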

3.3 Usability

To meet various needs from quick assessment to detailed analysis, SonicParanoid provides fast, default and sensitive modes that employ different filtering thresholds for the alignment process. While the default mode would suit most studies, the fast and sensitive modes may be used for evolutionarily close and distant species, respectively (Supplementary Figs S9 and S11). SonicParanoid only requires users to provide a directory containing the input proteomes to generate ortholog relationship files between proteome pairs and the multi-species ortholog table file (Supplementary Fig. S1). In contrast, InParanoid and MultiParanoid require users to write programs to generate input files for every proteome pair, perform the required InParanoid runs and collect the generated ortholog relationship files themselves. The entire process is considerably error-prone if performed manually, and is difficult for users lacking programming skills. SonicParanoid also allows seamless addition and deletion of proteomes by reusing the results from previous runs, which is beneficial to users who need to maintain their own orthology databases. Even if the run is interrupted before the computation is completed, for example by a power failure, it can be easily resumed without losing previously computed results. Multi-species ortholog tables for subsets of proteomes in a previous run can also be quickly computed.
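The bookkeeping burden of pairwise runs, and the benefit of reusing earlier results, can be made concrete with a small sketch. It enumerates the unordered proteome pairs that need a pairwise ortholog table and subtracts those already computed. The function pairs_to_compute and the way finished pairs are tracked are hypothetical; how SonicParanoid itself records previous runs is not described in this section.

```python
from itertools import combinations


def pairs_to_compute(proteomes, done=frozenset()):
    """Return the unordered proteome pairs that still need a pairwise run.

    `done` holds pairs (as frozensets) finished in earlier runs.
    Hypothetical helper for illustration only.
    """
    all_pairs = {frozenset(pair) for pair in combinations(sorted(proteomes), 2)}
    return all_pairs - set(done)


# First run on three proteomes: every pair must be computed.
first_run = pairs_to_compute(["a", "b", "c"])

# Adding proteome "d": only the pairs involving "d" are new.
second_run = pairs_to_compute(["a", "b", "c", "d"], done=first_run)

print(len(first_run), len(second_run))    # 3 3
print(len(pairs_to_compute(range(66))))   # 2145 pairs for 66 proteomes
```

With 66 proteomes, as in the complete QfO dataset, there are 66 × 65 / 2 = 2145 pairs, which gives a sense of why preparing and collecting these runs by hand is error-prone.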

4 Conclusion

We developed SonicParanoid as a fast, accurate and easy orthology inference tool and evaluated its speed and accuracy using standardized datasets and benchmarks. SonicParanoid also has high scalability; it can analyze the 276 proteomes used to build the InParanoid8 orthology database (Sonnhammer and Östlund, 2015) in less than 5 days using only 8 CPUs in the default mode. Considering the ever-growing number of new genomes (Cochrane et al.), we believe the speed, scalability, accuracy and ease-of-use of SonicParanoid will contribute to annotating new genomes, finding target genes in medical and biotechnological applications and revealing the evolutionary history of life.
References

1.  Inferring orthology and paralogy.

Authors:  Adrian M Altenhoff; Christophe Dessimoz
Journal:  Methods Mol Biol       Date:  2012

2.  Automatic clustering of orthologs and inparalogs shared by multiple proteomes.

Authors:  Andrey Alexeyenko; Ivica Tamas; Gang Liu; Erik L L Sonnhammer
Journal:  Bioinformatics       Date:  2006-07-15       Impact factor: 6.937

3.  Improved orthology inference with Hieranoid 2.

Authors:  Mateusz Kaduk; Erik Sonnhammer
Journal:  Bioinformatics       Date:  2017-04-15       Impact factor: 6.937

4.  OrthoMCL: identification of ortholog groups for eukaryotic genomes.

Authors:  Li Li; Christian J Stoeckert; David S Roos
Journal:  Genome Res       Date:  2003-09       Impact factor: 9.043

5.  InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic.

Authors:  Erik L L Sonnhammer; Gabriel Östlund
Journal:  Nucleic Acids Res       Date:  2014-11-27       Impact factor: 16.971

6.  The International Nucleotide Sequence Database Collaboration.

Authors:  Guy Cochrane; Ilene Karsch-Mizrachi; Toshihisa Takagi
Journal:  Nucleic Acids Res       Date:  2015-12-10       Impact factor: 16.971

7.  Joining forces in the quest for orthologs.

Authors:  Toni Gabaldón; Christophe Dessimoz; Julie Huxley-Jones; Albert J Vilella; Erik Ll Sonnhammer; Suzanna Lewis
Journal:  Genome Biol       Date:  2009-09-29       Impact factor: 13.583

8.  Assessing performance of orthology detection strategies applied to eukaryotic genomes.

Authors:  Feng Chen; Aaron J Mackey; Jeroen K Vermunt; David S Roos
Journal:  PLoS One       Date:  2007-04-18       Impact factor: 3.240

9.  Standardized benchmarking in the quest for orthologs.

Authors:  Adrian M Altenhoff; Brigitte Boeckmann; Salvador Capella-Gutierrez; Daniel A Dalquen; Todd DeLuca; Kristoffer Forslund; Jaime Huerta-Cepas; Benjamin Linard; Cécile Pereira; Leszek P Pryszcz; Fabian Schreiber; Alan Sousa da Silva; Damian Szklarczyk; Clément-Marie Train; Peer Bork; Odile Lecompte; Christian von Mering; Ioannis Xenarios; Kimmen Sjölander; Lars Juhl Jensen; Maria J Martin; Matthieu Muffato; Toni Gabaldón; Suzanna E Lewis; Paul D Thomas; Erik Sonnhammer; Christophe Dessimoz
Journal:  Nat Methods       Date:  2016-04-04       Impact factor: 28.547

10.  Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference.

Authors:  Clément-Marie Train; Natasha M Glover; Gaston H Gonnet; Adrian M Altenhoff; Christophe Dessimoz
Journal:  Bioinformatics       Date:  2017-07-15       Impact factor: 6.937
