
KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold.

Takuya Aramaki¹, Romain Blanc-Mathieu¹, Hisashi Endo¹, Koichi Ohkubo¹,², Minoru Kanehisa¹, Susumu Goto³, Hiroyuki Ogata¹.

Abstract

SUMMARY: KofamKOALA is a web server to assign KEGG Orthologs (KOs) to protein sequences by homology search against a database of profile hidden Markov models (KOfam) with pre-computed adaptive score thresholds. KofamKOALA is faster than existing KO assignment tools, and its accuracy is comparable to that of the best-performing tools. Function annotation by KofamKOALA helps link genes to KEGG resources such as the KEGG pathway maps and facilitates molecular network reconstruction.
AVAILABILITY AND IMPLEMENTATION: KofamKOALA, KofamScan and KOfam are freely available from GenomeNet (https://www.genome.jp/tools/kofamkoala/).
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2020 PMID: 31742321 PMCID: PMC7141845 DOI: 10.1093/bioinformatics/btz859

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Automatic gene function annotation is an important first step in interpreting genomic data. The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a widely used reference knowledge base that helps investigate genomic functions by linking genes to biological knowledge such as metabolic pathways and molecular networks (Kanehisa and Goto, 2000). In KEGG, the KEGG Orthology (KO) database, a manually curated large collection of protein families (i.e. KO families), serves as a baseline reference to link genes with other KEGG resources such as metabolic maps through K number identifiers. Currently, KOs are assigned to 12 934 525 (48%) of the 27 173 868 protein sequences in the KEGG GENES database.

Three existing tools, BlastKOALA, GhostKOALA (Kanehisa et al., 2016) and KAAS (Moriya et al., 2007), are available to assign KOs to protein sequences. These tools use homology search software such as BLAST (Altschul et al., 1997) and GHOSTX (Suzuki et al., 2014) to search amino acid sequences against GENES. To reduce the lengthy computational times required for multiple pairwise sequence comparisons, these tools use selected representative sequences from GENES to build their target database. In this study, we propose to employ profile hidden Markov models (pHMMs) to compress the database and to define adaptive thresholds for similarity scores, which can be used for reliable KO assignments.

2 Implementation

For each set of protein sequences in GENES annotated with a given KO, we generate a pHMM as follows. First, sequence redundancy in the sequence set is reduced by CD-HIT (Fu et al., 2012) with a 100% sequence identity clustering cutoff. Next, MAFFT (Katoh et al., 2005) and HMMER/hmmbuild (Eddy, 2008) are used to align the sequences and to generate a pHMM, respectively.

A family-specific adaptive score threshold is computed for each KO family as follows. For a given KO family, non-redundant sequences belonging to the family are randomly divided into three groups. One of the groups is used as the positive training dataset, while the sequences in the remaining two groups are used to generate a pHMM. Sequences belonging to the remaining KO families serve as the negative training dataset for the KO under consideration. Sequence similarity scores (bit scores) between the pHMM and the sequences in the positive and negative training datasets are computed using HMMER/hmmsearch. Based on the distributions of the two sets of bit scores for the sequences in the positive and negative datasets, we determine a threshold score, T, which maximizes the F-measure (Supplementary Data). This procedure is repeated three times, with each of the three groups serving as the positive training dataset in turn. Finally, the adaptive score threshold for the KO family is defined as the average of the three T values, and this average is used as the criterion to assign the KO family to new sequences.

The database of pHMMs with the adaptive score thresholds was named KOfam. When the present study was performed using KEGG release 88.2, KOfam contained 20 654 pHMMs. We developed the KofamScan software and employed it in the KofamKOALA webserver to annotate genes using KOfam and to link them with other KEGG resources through K numbers for versatile function investigation.
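The threshold selection described above can be illustrated with a minimal Python sketch. It assumes that the bit scores of each cross-validation round have already been computed (e.g. with hmmsearch) and are passed in as plain lists; the function names, the choice of observed positive scores as candidate thresholds, the use of "score >= T" as the membership rule and the toy scores are all assumptions of this sketch, not details taken from the paper or its Supplementary Data.

```python
from statistics import mean


def f_measure(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(pos_scores, neg_scores):
    """Return (T, F) where T is the candidate threshold maximizing F.

    Assumption of this sketch: a sequence is called a family member when
    its bit score is >= T, and candidates are the observed positive scores.
    Ties are resolved in favor of the lowest candidate.
    """
    best_t, best_f = None, -1.0
    for t in sorted(set(pos_scores)):
        tp = sum(s >= t for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= t for s in neg_scores)
        f = f_measure(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


def adaptive_threshold(folds):
    """Average the per-round optimal thresholds.

    folds: list of (positive_scores, negative_scores), one pair per round.
    """
    return mean(best_threshold(pos, neg)[0] for pos, neg in folds)


# Toy bit scores for three cross-validation rounds (hypothetical values).
toy_folds = [
    ([120.0, 95.0, 80.0], [30.0, 85.0]),
    ([110.0, 90.0, 60.0], [40.0, 20.0]),
    ([130.0, 100.0, 70.0], [75.0, 10.0]),
]
toy_threshold = adaptive_threshold(toy_folds)
print(toy_threshold)
```

With these toy scores the per-round optima are 80, 60 and 70, so the averaged threshold is 70. A real pipeline would rebuild the pHMM from the two held-in groups in each round before scoring.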

3 Assessment and discussions

To compare the performance of KofamScan with that of BlastKOALA, GhostKOALA and KAAS, we used 40 genomes (20 eukaryotes and 20 prokaryotes; Supplementary Table S1) randomly selected from the 6030 genomes recorded in GENES as test queries. This test set contains 383 202 sequences (143 662 sequences with KOs) corresponding to 16 166 distinct KOs. From GENES, we removed all the genomes belonging to the genera of the selected 40 test query genomes. Then, using the remaining GENES sequences with KO annotations, we generated a test KOfam database for this assessment. For BlastKOALA, GhostKOALA and KAAS, we used the default target databases of their respective webservers after removing genomes from the genera that were represented by the test queries. The KOfam database created for this assessment contained 20 394 pHMMs, of which 9414 were represented by prokaryotic sequences.

For the 40 genomes constituting our test set, the performance (F-measure) was comparable among KofamScan (0.866), BlastKOALA (0.889) and GhostKOALA (0.862), while KAAS showed a lower F-measure (0.810) (Table 1). To perform another test using only the 20 prokaryotic genomes as test queries, we reduced the target databases either by excluding pHMMs composed exclusively of eukaryotic sequences (for KofamScan) or by using the target database for prokaryotes (for BlastKOALA, GhostKOALA and KAAS). Again, the performance of KofamScan (F = 0.875) was comparable to that of BlastKOALA (0.846) and GhostKOALA (0.886), while KAAS showed the lowest F-measure (0.786).

Table 1. Comparison of the performance of KofamScan with other tools

Metric	KofamScan	BlastKOALA	GhostKOALA	KAAS
Entire database (40 genomes)
Precision	0.844	0.835	0.787	0.895
Recall	0.888	0.950	0.952	0.739
F-measure	0.866	0.889	0.862	0.810
Prokaryote database (20 genomes)
Precision	0.906	0.906	0.907	0.881
Recall	0.846	0.793	0.867	0.709
F-measure	0.875	0.846	0.886	0.786
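As a quick consistency check, the F-measures in Table 1 can be recomputed from the precision and recall rows, assuming the standard definition F = 2PR / (P + R). The sketch below is illustrative only; the dictionary layout and the 0.001 tolerance (to allow for rounding of the published three-decimal values) are choices made here.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) copied from Table 1.
table1 = {
    "entire": {
        "KofamScan": (0.844, 0.888, 0.866),
        "BlastKOALA": (0.835, 0.950, 0.889),
        "GhostKOALA": (0.787, 0.952, 0.862),
        "KAAS": (0.895, 0.739, 0.810),
    },
    "prokaryote": {
        "KofamScan": (0.906, 0.846, 0.875),
        "BlastKOALA": (0.906, 0.793, 0.846),
        "GhostKOALA": (0.907, 0.867, 0.886),
        "KAAS": (0.881, 0.709, 0.786),
    },
}

# Collect any entry whose recomputed F differs from the reported value
# by more than the rounding tolerance.
mismatches = [
    (dataset, tool)
    for dataset, tools in table1.items()
    for tool, (p, r, reported) in tools.items()
    if abs(f_measure(p, r) - reported) > 0.001
]
print(mismatches)
```

An empty list indicates that every reported F-measure agrees with its precision and recall to within rounding.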

Regarding CPU time, KofamScan was 69, 2.1 and 1.1 times faster than BlastKOALA, KAAS and GhostKOALA, respectively, when they were tested for the annotation of 40 genomes (Supplementary Tables S2 and S3). CPU time for this calculation was 2 h 26 m 18 s for KofamScan, while it was over 168 h for BlastKOALA. For the test with 20 prokaryote genomes, KofamScan was 83, 1.9 and 1.8 times faster than BlastKOALA, KAAS and GhostKOALA, respectively. The required CPU time was 11 m 59 s for KofamScan, while it was over 16 h for BlastKOALA. The latter result indicates that KofamScan benefits more from the reduction of the target database than the three other tools, while remaining among the tools with the highest F-measures.

We developed a database of pHMMs based on the KO and GENES databases. The adaptive score thresholds are pre-computed for individual KO families and can be used to assign KOs (K numbers) to sequences using KofamScan and KofamKOALA. Sequence matches with scores above the thresholds are considered more reliable than other matches and are thus highlighted with ‘*’ marks in the output of these tools. KofamScan users can customize KOfam by choosing a subset of KOs, so that they can focus on the annotation of a specific class of proteins while reducing computational time. The KofamKOALA webserver has additional functions to automatically send the search results to KEGG Mapper for reconstruction of pathways (PATHWAY), pathway modules (MODULE) and hierarchical function classifications (BRITE).
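The flagging rule described above can be sketched in a few lines of Python. This is not KofamScan's code or its output format: the function name, the tuple layout, the gene names and the threshold values are hypothetical, and treating a score exactly equal to the threshold as not flagged is an assumption made here to follow the wording "above the thresholds". Restricting the `thresholds` dictionary to a subset of KOs mimics, in spirit, the customization of KOfam to a specific class of proteins.

```python
def flag_hits(hits, thresholds):
    """Mark each hit with '*' when its score exceeds the KO's threshold.

    hits: iterable of (gene, ko, score) tuples.
    thresholds: dict mapping K number -> adaptive score threshold.
    KOs absent from `thresholds` are never flagged (assumption of this sketch).
    """
    flagged = []
    for gene, ko, score in hits:
        threshold = thresholds.get(ko)
        reliable = threshold is not None and score > threshold
        flagged.append(("*" if reliable else " ", gene, ko, score))
    return flagged


# Hypothetical thresholds and hits, for illustration only.
toy_thresholds = {"K00001": 250.0, "K00002": 180.5}
toy_hits = [
    ("gene_1", "K00001", 312.4),  # above threshold -> flagged
    ("gene_1", "K00002", 95.0),   # below threshold -> not flagged
    ("gene_2", "K09999", 400.0),  # KO not in the chosen subset -> not flagged
]
flagged = flag_hits(toy_hits, toy_thresholds)
for mark, gene, ko, score in flagged:
    print(f"{mark} {gene}\t{ko}\t{score}")
```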

8 in total

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

3. BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences.

Authors: Minoru Kanehisa; Yoko Sato; Kanae Morishima
Journal: J Mol Biol Date: 2015-11-14 Impact factor: 5.469

4. MAFFT version 5: improvement in accuracy of multiple sequence alignment.

Authors: Kazutaka Katoh; Kei-ichi Kuma; Hiroyuki Toh; Takashi Miyata
Journal: Nucleic Acids Res Date: 2005-01-20 Impact factor: 16.971

5. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

6. KAAS: an automatic genome annotation and pathway reconstruction server.

Authors: Yuki Moriya; Masumi Itoh; Shujiro Okuda; Akiyasu C Yoshizawa; Minoru Kanehisa
Journal: Nucleic Acids Res Date: 2007-05-25 Impact factor: 16.971

7. A probabilistic model of local sequence alignment that simplifies statistical significance estimation.

Authors: Sean R Eddy
Journal: PLoS Comput Biol Date: 2008-05-30 Impact factor: 4.475

8. GHOSTX: an improved sequence homology search algorithm using a query suffix array and a database suffix array.

Authors: Shuji Suzuki; Masanori Kakuta; Takashi Ishida; Yutaka Akiyama
Journal: PLoS One Date: 2014-08-06 Impact factor: 3.240
