Warning: Undefined array key "mm" in /www/wwwroot/www.ai-bt.com/si.php on line 10 Deprecated: trim(): Passing null to parameter #1 ($string) of type string is deprecated in /www/wwwroot/www.ai-bt.com/si.php on line 10 Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training.

Literature DB >> 36136096

Improving language model of human genome for DNA-protein binding prediction based on task-specific pre-training.

Hanyu Luo¹, Wenyu Shan¹, Cheng Chen¹, Pingjian Ding¹, Lingyun Luo^2,3.

Abstract

The DNA-protein binding plays a pivotal role in regulating gene expression and evolution, and computational identification of DNA-protein has drawn more and more attention in bioinformatics. Recently, variants of BERT are also used to capture the semantic information of DNA sequences for predicting DNA-protein bindings. In this study, we leverage a task-specific pre-training strategy on BERT using large-scale multi-source DNA-protein binding data and present TFBert. TFBert treats DNA sequences as natural sentences and k-mer nucleotides as words. It can effectively extract upstream and downstream nucleotide context information by pre-training the 690 unlabeled ChIP-seq datasets. Experiments show that the pre-trained model can achieve promising performance on every single dataset in the 690 ChIP-seq datasets after simple fine tuning, especially on small datasets. The average AUC is 94.7%, outperforming existing popular methods. In conclusion, this study provides a variant of BERT based on pre-training and achieved state-of-the-art results in predicting DNA-protein bindings. We believe that TFBert can provide insights into other biological sequence classification problems.

Entities: Chemical

Keywords: BERT; Biological sequence; DNA–protein binding; Pre-training

Year: 2022 PMID： 36136096 DOI： 10.1007/s12539-022-00537-9

Source DB: PubMed Journal: Interdiscip Sci ISSN： 1867-1462 Impact factor: 3.492

Keyword Cloud
References

21 in total

Review 1. DNA binding sites: representation and discovery.

Authors: G D Stormo
Journal: Bioinformatics Date: 2000-01 Impact factor: 6.937

Review 2. Applied bioinformatics for the identification of regulatory elements.

Authors: Wyeth W Wasserman; Albin Sandelin
Journal: Nat Rev Genet Date: 2004-04 Impact factor: 53.242

3. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning.

Authors: Babak Alipanahi; Andrew Delong; Matthew T Weirauch; Brendan J Frey
Journal: Nat Biotechnol Date: 2015-07-27 Impact factor: 54.908

4. DNA-binding specificities of human transcription factors.

Authors: Arttu Jolma; Jian Yan; Thomas Whitington; Jarkko Toivonen; Kazuhiro R Nitta; Pasi Rastas; Ekaterina Morgunova; Martin Enge; Mikko Taipale; Gonghong Wei; Kimmo Palin; Juan M Vaquerizas; Renaud Vincentelli; Nicholas M Luscombe; Timothy R Hughes; Patrick Lemaire; Esko Ukkonen; Teemu Kivioja; Jussi Taipale
Journal: Cell Date: 2013-01-17 Impact factor: 41.582

Review 5. The Human Transcription Factors.

Authors: Samuel A Lambert; Arttu Jolma; Laura F Campitelli; Pratyush K Das; Yimeng Yin; Mihai Albu; Xiaoting Chen; Jussi Taipale; Timothy R Hughes; Matthew T Weirauch
Journal: Cell Date: 2018-02-08 Impact factor: 41.582

6. Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning.

Authors: Jiajun Hong; Yongchao Luo; Yang Zhang; Junbiao Ying; Weiwei Xue; Tian Xie; Lin Tao; Feng Zhu
Journal: Brief Bioinform Date: 2020-07-15 Impact factor: 11.622

7. Assessing computational tools for the discovery of transcription factor binding sites.

Authors: Martin Tompa; Nan Li; Timothy L Bailey; George M Church; Bart De Moor; Eleazar Eskin; Alexander V Favorov; Martin C Frith; Yutao Fu; W James Kent; Vsevolod J Makeev; Andrei A Mironov; William Stafford Noble; Giulio Pavesi; Graziano Pesole; Mireille Régnier; Nicolas Simonis; Saurabh Sinha; Gert Thijs; Jacques van Helden; Mathias Vandenbogaert; Zhiping Weng; Christopher Workman; Chun Ye; Zhou Zhu
Journal: Nat Biotechnol Date: 2005-01 Impact factor: 54.908

8. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.

Authors: V Matys; O V Kel-Margoulis; E Fricke; I Liebich; S Land; A Barre-Dirrie; I Reuter; D Chekmenev; M Krull; K Hornischer; N Voss; P Stegmaier; B Lewicki-Potapov; H Saxel; A E Kel; E Wingender
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. iGHBP: Computational identification of growth hormone binding proteins from sequences using extremely randomised tree.

Authors: Shaherin Basith; Balachandran Manavalan; Tae Hwan Shin; Gwang Lee
Journal: Comput Struct Biotechnol J Date: 2018-10-24 Impact factor: 7.271

10. JASPAR 2020: update of the open-access database of transcription factor binding profiles.

Authors: Oriol Fornes; Jaime A Castro-Mondragon; Aziz Khan; Robin van der Lee; Xi Zhang; Phillip A Richmond; Bhavi P Modi; Solenne Correard; Marius Gheorghe; Damir Baranašić; Walter Santana-Garcia; Ge Tan; Jeanne Chèneby; Benoit Ballester; François Parcy; Albin Sandelin; Boris Lenhard; Wyeth W Wasserman; Anthony Mathelier
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971