Hanyu Luo, Wenyu Shan, Cheng Chen, Pingjian Ding, Lingyun Luo.
Abstract
DNA-protein binding plays a pivotal role in regulating gene expression and evolution, and the computational identification of DNA-protein binding has drawn increasing attention in bioinformatics. Recently, variants of BERT have been used to capture the semantic information of DNA sequences for predicting DNA-protein binding. In this study, we apply a task-specific pre-training strategy to BERT using large-scale multi-source DNA-protein binding data and present TFBert. TFBert treats DNA sequences as natural sentences and k-mer nucleotides as words. It can effectively extract upstream and downstream nucleotide context information by pre-training on 690 unlabeled ChIP-seq datasets. Experiments show that the pre-trained model achieves promising performance on every one of the 690 ChIP-seq datasets after simple fine-tuning, especially on small datasets. The average AUC is 94.7%, outperforming existing popular methods. In conclusion, this study provides a variant of BERT based on pre-training that achieves state-of-the-art results in predicting DNA-protein binding. We believe that TFBert can provide insights into other biological sequence classification problems.
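The idea of treating a DNA sequence as a sentence of k-mer words can be illustrated with a minimal sketch. The abstract does not state the value of k, the stride, or the tokenizer actually used by TFBert, so the function name `kmer_tokenize`, the default `k=3`, and the sliding window with `stride=1` are all assumptions made purely for illustration.

```python
from typing import List


def kmer_tokenize(sequence: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into k-mer "words" with a sliding window.

    Hypothetical helper: k and stride are illustrative defaults, not
    values taken from the TFBert paper.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    sequence = sequence.upper()
    # Each window of length k becomes one token; windows overlap when stride < k.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


# Build a whitespace-separated "sentence" that a BERT-style tokenizer
# with a k-mer vocabulary could consume.
tokens = kmer_tokenize("ATGCGTAC", k=3)
sentence = " ".join(tokens)
print(sentence)
```

With overlapping windows, each nucleotide appears in up to k neighbouring tokens, which is one way a model could see both upstream and downstream context for a given position.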
Keywords: BERT; Biological sequence; DNA–protein binding; Pre-training
Year: 2022 PMID: 36136096 DOI: 10.1007/s12539-022-00537-9
Source DB: PubMed Journal: Interdiscip Sci ISSN: 1867-1462 Impact factor: 3.492