Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks.

Florian Mock, Fleming Kretschmer, Anton Kriese, Sebastian Böcker, Manja Marz.

Abstract

Taxonomic classification, that is, the assignment to biological clades with shared ancestry, is a common task in genetics, mainly based on a genome similarity search of large genome databases. The classification quality depends heavily on the database, since representative relatives must be present. Many genomic sequences cannot be classified at all or only with a high misclassification rate. Here we present BERTax, a deep neural network program based on natural language processing to precisely classify the superkingdom and phylum of DNA sequences taxonomically without the need for a known representative relative from a database. We show BERTax to be at least on par with the state-of-the-art approaches when taxonomically similar species are part of the training data. For novel organisms, however, BERTax clearly outperforms any existing approach. Finally, we show that BERTax can also be combined with database approaches to further increase the prediction quality in almost all cases. Since BERTax is not based on similar entries in databases, it allows precise taxonomic classification of a broader range of genomic sequences, thus increasing the overall information gain.

Keywords:  deep learning; meta genome; taxonomic classification

Year: 2022
Date: 2022-08-26
PMID: 36018838
PMCID: PMC9436379
DOI: 10.1073/pnas.2122636119

Source DB: PubMed
Journal: Proc Natl Acad Sci U S A
ISSN: 0027-8424
Impact factor: 12.779

