
DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome.

Yanrong Ji, Zhihan Zhou, Han Liu, Ramana V Davuluri.

Abstract

MOTIVATION: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios.
RESULTS: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy, and efficiency. We show that the single pre-trained transformer model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites, and transcription factor binding sites after easy fine-tuning using small task-specific labelled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks.
AVAILABILITY: The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub at https://github.com/jerryji1993/DNABERT.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) (2021). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
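
The abstract does not spell out how a DNA sequence becomes model input. As published, DNABERT represents a sequence as overlapping k-mer tokens (with k between 3 and 6), so that each token carries some of its neighbouring nucleotide context. The following is a minimal sketch of that tokenization step only; the function name `kmer_tokenize` is hypothetical and is not taken from the DNABERT repository.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 6) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    Illustrative helper (hypothetical name), not the DNABERT code itself.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence of length n yields n - k + 1 tokens (none if n < k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # Space-separated k-mers are a common way to feed such tokens
    # to a BERT-style tokenizer.
    print(" ".join(kmer_tokenize("ATGGCTAAC", k=3)))
    # prints: ATG TGG GGC GCT CTA TAA AAC
```

A 9-nucleotide sequence with k = 3 yields 7 tokens; adjacent tokens share k - 1 nucleotides, which is why masking schemes for such inputs typically mask contiguous spans rather than single tokens.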

Year:  2021        PMID: 33538820     DOI: 10.1093/bioinformatics/btab083

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  17 in total

1.  Deep learning predicts DNA methylation regulatory variants in the human brain and elucidates the genetics of psychiatric disorders.

Authors:  Jiyun Zhou; Qiang Chen; Patricia R Braun; Kira A Perzel Mandell; Andrew E Jaffe; Hao Yang Tan; Thomas M Hyde; Joel E Kleinman; James B Potash; Gen Shinozaki; Daniel R Weinberger; Shizhong Han
Journal:  Proc Natl Acad Sci U S A       Date:  2022-08-15       Impact factor: 12.779

2.  Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks.

Authors:  Florian Mock; Fleming Kretschmer; Anton Kriese; Sebastian Böcker; Manja Marz
Journal:  Proc Natl Acad Sci U S A       Date:  2022-08-26       Impact factor: 12.779

3.  iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species.

Authors:  Pengyu Zhang; Hongming Zhang; Hao Wu
Journal:  Nucleic Acids Res       Date:  2022-10-14       Impact factor: 19.160

4.  Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution.

Authors:  Meng Yang; Lichao Huang; Haiping Huang; Hui Tang; Nan Zhang; Huanming Yang; Jihong Wu; Feng Mu
Journal:  Nucleic Acids Res       Date:  2022-08-12       Impact factor: 19.160

5.  Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning.

Authors:  Alex X Lu; Amy X Lu; Iva Pritišanac; Taraneh Zarin; Julie D Forman-Kay; Alan M Moses
Journal:  PLoS Comput Biol       Date:  2022-06-29       Impact factor: 4.779

6.  GeMI: interactive interface for transformer-based Genomic Metadata Integration.

Authors:  Giuseppe Serna Garcia; Michele Leone; Anna Bernasconi; Mark J Carman
Journal:  Database (Oxford)       Date:  2022-06-03       Impact factor: 4.462

7.  A general framework for predicting the transcriptomic consequences of non-coding variation and small molecules.

Authors:  Moustafa Abdalla; Mohamed Abdalla
Journal:  PLoS Comput Biol       Date:  2022-04-14       Impact factor: 4.779

8.  Multi-objective data enhancement for deep learning-based ultrasound analysis.

Authors:  Chengkai Piao; Mengyue Lv; Shujie Wang; Rongyan Zhou; Yuchen Wang; Jinmao Wei; Jian Liu
Journal:  BMC Bioinformatics       Date:  2022-10-20       Impact factor: 3.307

9.  Supervised promoter recognition: a benchmark framework.

Authors:  Raul I Perez Martell; Alison Ziesel; Hosna Jabbari; Ulrike Stege
Journal:  BMC Bioinformatics       Date:  2022-04-02       Impact factor: 3.169

Review 10.  Representation learning applications in biological sequence analysis.

Authors:  Hitoshi Iuchi; Taro Matsutani; Keisuke Yamada; Natsuki Iwano; Shunsuke Sumi; Shion Hosoda; Shitao Zhao; Tsukasa Fukunaga; Michiaki Hamada
Journal:  Comput Struct Biotechnol J       Date:  2021-05-23       Impact factor: 7.271
