Literature DB >> 36136096

Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training.

Hanyu Luo1, Wenyu Shan1, Cheng Chen1, Pingjian Ding1, Lingyun Luo2,3.   

Abstract

The DNA-protein binding plays a pivotal role in regulating gene expression and evolution, and computational identification of DNA-protein has drawn more and more attention in bioinformatics. Recently, variants of BERT are also used to capture the semantic information of DNA sequences for predicting DNA-protein bindings. In this study, we leverage a task-specific pre-training strategy on BERT using large-scale multi-source DNA-protein binding data and present TFBert. TFBert treats DNA sequences as natural sentences and k-mer nucleotides as words. It can effectively extract upstream and downstream nucleotide context information by pre-training the 690 unlabeled ChIP-seq datasets. Experiments show that the pre-trained model can achieve promising performance on every single dataset in the 690 ChIP-seq datasets after simple fine tuning, especially on small datasets. The average AUC is 94.7%, outperforming existing popular methods. In conclusion, this study provides a variant of BERT based on pre-training and achieved state-of-the-art results in predicting DNA-protein bindings. We believe that TFBert can provide insights into other biological sequence classification problems.
© 2022. International Association of Scientists in the Interdisciplinary Areas.

Entities:  

Keywords:  BERT; Biological sequence; DNA–protein binding; Pre-training

Year:  2022        PMID: 36136096     DOI: 10.1007/s12539-022-00537-9

Source DB:  PubMed          Journal:  Interdiscip Sci        ISSN: 1867-1462            Impact factor:   3.492


  21 in total

Review 1.  DNA binding sites: representation and discovery.

Authors:  G D Stormo
Journal:  Bioinformatics       Date:  2000-01       Impact factor: 6.937

Review 2.  Applied bioinformatics for the identification of regulatory elements.

Authors:  Wyeth W Wasserman; Albin Sandelin
Journal:  Nat Rev Genet       Date:  2004-04       Impact factor: 53.242

3.  Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.

Authors:  Babak Alipanahi; Andrew Delong; Matthew T Weirauch; Brendan J Frey
Journal:  Nat Biotechnol       Date:  2015-07-27       Impact factor: 54.908

4.  DNA-binding specificities of human transcription factors.

Authors:  Arttu Jolma; Jian Yan; Thomas Whitington; Jarkko Toivonen; Kazuhiro R Nitta; Pasi Rastas; Ekaterina Morgunova; Martin Enge; Mikko Taipale; Gonghong Wei; Kimmo Palin; Juan M Vaquerizas; Renaud Vincentelli; Nicholas M Luscombe; Timothy R Hughes; Patrick Lemaire; Esko Ukkonen; Teemu Kivioja; Jussi Taipale
Journal:  Cell       Date:  2013-01-17       Impact factor: 41.582

Review 5.  The Human Transcription Factors.

Authors:  Samuel A Lambert; Arttu Jolma; Laura F Campitelli; Pratyush K Das; Yimeng Yin; Mihai Albu; Xiaoting Chen; Jussi Taipale; Timothy R Hughes; Matthew T Weirauch
Journal:  Cell       Date:  2018-02-08       Impact factor: 41.582

6.  Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning.

Authors:  Jiajun Hong; Yongchao Luo; Yang Zhang; Junbiao Ying; Weiwei Xue; Tian Xie; Lin Tao; Feng Zhu
Journal:  Brief Bioinform       Date:  2020-07-15       Impact factor: 11.622

7.  Assessing computational tools for the discovery of transcription factor binding sites.

Authors:  Martin Tompa; Nan Li; Timothy L Bailey; George M Church; Bart De Moor; Eleazar Eskin; Alexander V Favorov; Martin C Frith; Yutao Fu; W James Kent; Vsevolod J Makeev; Andrei A Mironov; William Stafford Noble; Giulio Pavesi; Graziano Pesole; Mireille Régnier; Nicolas Simonis; Saurabh Sinha; Gert Thijs; Jacques van Helden; Mathias Vandenbogaert; Zhiping Weng; Christopher Workman; Chun Ye; Zhou Zhu
Journal:  Nat Biotechnol       Date:  2005-01       Impact factor: 54.908

8.  TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.

Authors:  V Matys; O V Kel-Margoulis; E Fricke; I Liebich; S Land; A Barre-Dirrie; I Reuter; D Chekmenev; M Krull; K Hornischer; N Voss; P Stegmaier; B Lewicki-Potapov; H Saxel; A E Kel; E Wingender
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

9.  iGHBP: Computational identification of growth hormone binding proteins from sequences using extremely randomised tree.

Authors:  Shaherin Basith; Balachandran Manavalan; Tae Hwan Shin; Gwang Lee
Journal:  Comput Struct Biotechnol J       Date:  2018-10-24       Impact factor: 7.271

10.  JASPAR 2020: update of the open-access database of transcription factor binding profiles.

Authors:  Oriol Fornes; Jaime A Castro-Mondragon; Aziz Khan; Robin van der Lee; Xi Zhang; Phillip A Richmond; Bhavi P Modi; Solenne Correard; Marius Gheorghe; Damir Baranašić; Walter Santana-Garcia; Ge Tan; Jeanne Chèneby; Benoit Ballester; François Parcy; Albin Sandelin; Boris Lenhard; Wyeth W Wasserman; Anthony Mathelier
Journal:  Nucleic Acids Res       Date:  2020-01-08       Impact factor: 16.971

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.