A Contrastive Learning Pre-Training Method for Motif Occupancy Identification.

Ken Lin, Xiongwen Quan, Wenya Yin, Han Zhang.

Abstract

Motif occupancy identification is a binary classification task that predicts the binding of DNA motif instances to transcription factors, and several sequence-based methods have been proposed for it. However, because they are trained directly end to end, these methods lack biological interpretability in their sequence representations. In this work, we propose a contrastive learning method to pre-train an interpretable and robust DNA encoding for motif occupancy identification. We construct two alternative models to pre-train the DNA sequence encoder: a self-supervised model and a supervised model. We augment the original sequences for contrastive learning with the edit operations defined in edit distance. Specifically, we propose a sequence similarity criterion based on the Needleman-Wunsch algorithm to discriminate positive from negative sample pairs in self-supervised learning. Finally, a DNN classifier is fine-tuned together with the pre-trained encoder to predict motif occupancy. Both proposed contrastive learning models outperform the baseline end-to-end CNN model and the SimCLR method, reaching AUCs of 0.811 and 0.823, respectively. Compared with the baseline method, our models are more robust on small samples. In addition, the self-supervised model proves practicable for transfer learning.
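
The augmentation and pairing steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`augment`, `nw_score`, `similarity`, `is_positive_pair`), the scoring values (match +1, mismatch -1, gap -1), the normalization by the longer sequence length, and the 0.7 threshold are all assumptions chosen for illustration. The sketch applies random edit operations (substitution, insertion, deletion) to a DNA sequence and uses the Needleman-Wunsch global alignment score to decide whether two sequences would count as a positive pair.

```python
import random

BASES = "ACGT"


def augment(seq, n_edits=2, rng=None):
    """Return a copy of seq after up to n_edits random edit operations."""
    if rng is None:
        rng = random.Random()
    chars = list(seq)
    for _ in range(n_edits):
        op = rng.choice(("substitute", "insert", "delete"))
        if op == "substitute" and chars:
            i = rng.randrange(len(chars))
            chars[i] = rng.choice([b for b in BASES if b != chars[i]])
        elif op == "insert":
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(BASES))
        elif op == "delete" and len(chars) > 1:
            del chars[rng.randrange(len(chars))]
        # An operation that cannot be applied (e.g. delete on a
        # one-base sequence) is skipped.
    return "".join(chars)


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score (scoring values assumed)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def similarity(a, b, match=1, mismatch=-1, gap=-1):
    """Alignment score divided by the best score the longer sequence allows."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return nw_score(a, b, match, mismatch, gap) / (match * longest)


def is_positive_pair(a, b, threshold=0.7):
    """Hypothetical criterion: similar enough sequences form a positive pair."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    rng = random.Random(0)
    anchor = "ACGTACGTACGTACGTACGT"
    view = augment(anchor, n_edits=2, rng=rng)
    print(view, similarity(anchor, view), is_positive_pair(anchor, view))
```

In a self-supervised setup of this kind, pairs that pass the similarity check would be pulled together by the contrastive loss and the remaining pairs pushed apart; the encoder and the loss are outside the scope of this sketch.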

Keywords:  contrastive learning; data augmentation; edit distance; motif occupancy identification; pre-training; sequence similarity

Year: 2022   PMID: 35563090   PMCID: PMC9103107   DOI: 10.3390/ijms23094699

Source DB: PubMed   Journal: Int J Mol Sci   ISSN: 1422-0067   Impact factor: 6.208


