BioAutoML: automated feature engineering and metalearning to predict noncoding RNAs in bacteria.

Robson P Bonidia, Anderson P Avila Santos, Breno L S de Almeida, Peter F Stadler, Ulisses N da Rocha, Danilo S Sanches, André C P L F de Carvalho.

Abstract

Recent technological advances have led to an exponential expansion of biological sequence data and to the extraction of meaningful information through Machine Learning (ML) algorithms. This knowledge has improved the understanding of mechanisms related to several fatal diseases, e.g. cancer and coronavirus disease 2019, helping to develop innovative solutions, such as CRISPR-based gene editing, coronavirus vaccines and precision medicine. These advances benefit our society and economy, directly impacting people's lives in various areas, such as health care, drug discovery, forensic analysis and food processing. Nevertheless, ML-based approaches to biological data require representative, quantitative and informative features. Many ML algorithms can handle only numerical data, and therefore sequences need to be translated into a numerical feature vector. This process, known as feature extraction, is a fundamental step for developing high-quality ML-based models in bioinformatics, because it enables the feature engineering stage, in which suitable features are designed and selected. Feature engineering, ML algorithm selection and hyperparameter tuning are often manual and time-consuming processes, requiring extensive domain knowledge. To deal with this problem, we present a new package: BioAutoML. BioAutoML automatically runs an end-to-end ML pipeline: it extracts numerical and informative features from biological sequence databases using the MathFeature package, and it automates feature selection, ML algorithm recommendation and hyperparameter tuning of the selected algorithm(s) using Automated ML (AutoML). BioAutoML has two components, divided into four modules: (1) automated feature engineering (feature extraction and selection modules) and (2) metalearning (algorithm recommendation and hyperparameter tuning modules).
We experimentally evaluate BioAutoML in two different scenarios: (i) prediction of the three main classes of noncoding RNAs (ncRNAs) and (ii) prediction of the eight categories of ncRNAs in bacteria, including housekeeping and regulatory types. To assess its predictive performance, BioAutoML is experimentally compared with two other AutoML tools (RECIPE and TPOT). According to the experimental results, BioAutoML can accelerate new studies, reducing the cost of feature engineering processing and either keeping or improving predictive performance. BioAutoML is freely available at https://github.com/Bonidia/BioAutoML.
© The Author(s) 2022. Published by Oxford University Press.
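
As an illustration of the feature extraction step mentioned in the abstract (translating a sequence into a numerical feature vector), here is a minimal sketch. It computes relative k-mer frequencies, a generic descriptor chosen only as an example. It does not use the MathFeature or BioAutoML APIs, and the function name `kmer_frequencies` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Map a nucleotide sequence to a fixed-length vector of k-mer frequencies.

    Hypothetical helper for illustration only; not part of BioAutoML.
    """
    # Treat RNA input as DNA so a single alphabet covers both cases.
    sequence = sequence.upper().replace("U", "T")

    # Every possible k-mer, in lexicographic order, defines the vector layout.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]

    # Slide a window of length k and skip windows with ambiguous symbols (e.g. N).
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    counts = Counter(w for w in windows if set(w) <= set(alphabet))

    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k, or no valid window: return an all-zero vector.
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# A sequence of any length becomes a vector of length len(alphabet) ** k.
features = kmer_frequencies("ACGUACGU", k=2)
print(len(features))
```

Because the vector length depends only on `k` and the alphabet, sequences of different lengths map to feature vectors of the same size, which is what most ML algorithms expect as input.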

Year:  2022        PMID: 35753697      PMCID: PMC9294424          DOI: 10.1093/bib/bbac218

Source DB:  PubMed          Journal:  Brief Bioinform        ISSN: 1467-5463            Impact factor:   13.994


