Raíssa Silva (1,2), Kleber Padovani (2), Fabiana Góes (3), Ronnie Alves (4,5)
1. Vale Institute of Technology, Boaventura da Silva, 955, Belém, BR, 66055-090, Brazil.
2. PPGCC, Federal University of Pará, Augusto Corrêa, 01, Belém, BR, 66075-110, Brazil.
3. ICMC, University of São Paulo, Trab. São Carlense, 400, São Carlos, BR, 13566-590, Brazil.
4. Vale Institute of Technology, Boaventura da Silva, 955, Belém, BR, 66055-090, Brazil. ronnie.alves@itv.org.
5. PPGCC, Federal University of Pará, Augusto Corrêa, 01, Belém, BR, 66075-110, Brazil. ronnie.alves@itv.org.
Abstract
BACKGROUND: Microbes play a fundamental economic, social, and environmental role in our society. Metagenomics makes it possible to investigate microbes in their natural environments (complex communities) and to study their interactions. How microbes act is usually estimated from the functions they perform in those environments, and those functions are attributed to their genes. Advances in next-generation sequencing technology have facilitated metagenomics research, but they also create a heavy computational burden: large and complex biological datasets are available as never before. Many gene predictors are available to aid the gene annotation process, but they do not handle the complexities of metagenomic data appropriately. There is also no standard metagenomic benchmark dataset for gene prediction. Without one, gene predictors may report inflated results while obscuring their false discovery rates.
RESULTS: We introduce geneRFinder, a machine-learning-based gene predictor that outperforms state-of-the-art gene prediction tools across our benchmark using a single pre-trained Random Forest model. On high-complexity metagenomes, the average prediction rate of geneRFinder differed from those of Prodigal and FragGeneScan by 54% and 64%, respectively. The specificity of geneRFinder exceeded that of FragGeneScan by 79 percentage points, the largest gap observed, and that of Prodigal by 66 percentage points. According to McNemar's test, all percentage differences between the predictors' performances are statistically significant for all datasets at the 99% confidence level.
CONCLUSIONS: We provide geneRFinder, an approach for gene prediction across metagenomes of differing complexity, available at gitlab.com/r.lorenna/generfinder and https://osf.io/w2yd6/ . We also provide a novel, comprehensive benchmark dataset for gene prediction, based on the Critical Assessment of Metagenome Interpretation (CAMI) challenge and containing labeled data from gene regions, available at https://sourceforge.net/p/generfinder-benchmark .
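The two evaluation ideas named in the RESULTS, specificity and McNemar's test for comparing two predictors on the same labeled sequences, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names and the toy labels are hypothetical, and the exact (binomial) form of McNemar's test is used here as one common variant, since the abstract does not say which variant was applied. A p-value below 0.01 corresponds to significance at the 99% confidence level.

```python
from math import comb


def specificity(y_true, y_pred):
    """Fraction of true non-genes (label 0) that were predicted as non-genes."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tn / (tn + fp) if (tn + fp) else float("nan")


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on the discordant pairs of two predictors.

    Returns (b, c, p_value), where b counts sequences that only predictor A
    classified correctly and c counts those only predictor B got right.
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the predictors never disagree in correctness
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy labels: 1 = gene, 0 = non-gene (intergenic) sequence.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # hypothetical predictor A
pred_b = [1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1]  # hypothetical predictor B

print("specificity A:", specificity(y_true, pred_a))
print("specificity B:", specificity(y_true, pred_b))
print("McNemar (b, c, p):", mcnemar_exact(y_true, pred_a, pred_b))
```

McNemar's test suits this kind of comparison because both predictors are scored on the same sequences, so only the cases where exactly one of them is correct carry information about which one performs better.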