Yanrong Ji1, Zhihan Zhou2, Han Liu2, Ramana V Davuluri3. 1. Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL 60611, USA. 2. Department of Computer Science, Northwestern University, Evanston, IL 60208, USA. 3. Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11794, USA.
Abstract
MOTIVATION: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex due to polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. RESULTS: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture a global and transferrable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory element prediction and demonstrate its ease of use, accuracy, and efficiency. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on the prediction of promoters, splice sites, and transcription factor binding sites after straightforward fine-tuning with small task-specific labelled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences, for better interpretability and accurate identification of conserved sequence motifs and candidate functional genetic variants. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned for many other sequence analysis tasks. AVAILABILITY: The source code and the pre-trained and fine-tuned models for DNABERT are available on GitHub at https://github.com/jerryji1993/DNABERT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
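DNABERT represents a DNA sequence as a series of overlapping k-mer tokens (k = 3 to 6 in the paper) before feeding it to the transformer encoder. A minimal sketch of that tokenization step, with an illustrative function name not taken from the released code:

```python
def seq_to_kmers(seq, k=6):
    """Split a DNA sequence into overlapping k-mer tokens (stride 1).

    For a sequence of length n this yields n - k + 1 tokens, so every
    nucleotide contributes to k tokens and the model sees both its
    upstream and downstream context.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Example: a 7-nt sequence becomes two 6-mer tokens.
print(seq_to_kmers("ATGGCTA", k=6))  # ['ATGGCT', 'TGGCTA']
```

The resulting token list is then mapped to vocabulary indices (the 4^k possible k-mers plus special tokens) for BERT-style masked pre-training and fine-tuning.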