Shuangquan Zhang1, Lili Yang2, Xiaotian Wu3, Nan Sheng1, Yuan Fu4, Anjun Ma5, Yan Wang1,3. 1. Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China. 2. Department of Obstetrics, the First Hospital of Jilin University, Changchun, 130012, China. 3. School of Artificial Intelligence, Jilin University, Changchun, 130012, China. 4. Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, Ceredigion, United Kingdom. 5. Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA.
Abstract
MOTIVATION: Prediction of transcription factor binding sites (TFBSs) is a crucial step in revealing the functions of transcription factors (TFs) from high-throughput sequencing data. Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) provides insight into TFBSs and nucleosome positioning by probing open chromatin, and can simultaneously reveal multiple TFBSs compared with traditional technologies. Existing tools based on convolutional neural networks (CNNs) find only fixed-length TFBSs from ATAC-seq data. A graph neural network (GNN) can be considered an extension of a CNN and has great potential for finding multiple TFBSs of different lengths from ATAC-seq data. RESULTS: We develop a motif predictor called MMGraph, based on a three-layer GNN and the coexisting probability of k-mers, for finding multiple motifs from ATAC-seq data. Experiments conducted on 88 ATAC-seq datasets indicate that MMGraph achieved the best performance, with an area of eight metrics radar (AEMR) score of 2.31, and could find 207 higher-quality motifs than other existing tools. AVAILABILITY: MMGraph is wrapped in a Python package, which is available at https://github.com/zhangsq06/MMGraph.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Identifying transcription factor binding sites (TFBSs), which are short and conserved DNA motifs, is crucial for studying transcription factor (TF) regulation. The footprints of Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) can simultaneously reveal more kinds of TFBSs than traditional sequencing technologies (Bentsen ). Furthermore, a graph neural network (GNN) can be considered an extension of the convolutional neural network; it learns the representation of target nodes via their neighbors (Feng ). GNNs therefore have great potential for finding multiple motifs from ATAC-seq data. In this study, we develop a robust and effective tool, named MMGraph, for finding multiple motifs from ATAC-seq data.
2. Materials and methods
MMGraph is based on GNN and coexisting probability of k-mers, where the coexisting probability represents the degree of association between k-mers. It consists of three components (Fig. 1): (i) a heterogeneous graph; (ii) a three-layer GNN model to get embeddings of k-mers and sequences; and (iii) coexisting probability calculation for finding multiple motifs.
Fig. 1.
The whole MMGraph workflow consists of three steps
2.1 Constructing the heterogeneous graph
The sequence set contains sequences , . If contains TFBSs, we set the label of as positive ‘1’, otherwise as negative ‘0’, . is trimmed into fragments of the same length (k-mers) as with a step of one base. We set , as used in a previous paper (Fletez-Brant ). is a k-mer set that contains all unique existing in . is a k-mer set that contains all unique in all of . Three edge types are defined (i.e. similarity edges, coexisting edges and inclusive edges) to build a heterogeneous graph where and are nodes.
1. Similarity edges. Similarity edges measure mismatch between nodes using the Hamming distance (Norouzi ). The similarity edge weight between two nodes and is calculated by formula (1):
where represents the weight matrix of similarity edges, .
2. Coexisting edges. Coexisting edges measure the coexisting probability between nodes. The coexisting edge weight between two k-mers and is calculated by formula (2):
where represents the weight matrix of coexisting edges, represents the total number of that contain ; is the total number of that contain both and .
3. Inclusive edges. Inclusive edges measure the dependency degree between and . The inclusive edge weight between and is calculated by transferring the concept of term frequency-inverse document frequency (TF-IDF) (Yun-Tao ) to formula (3):
where represents the weight matrix of inclusive edges, is the number of existing in , .
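The three edge types can be illustrated with a minimal Python sketch. It is not the MMGraph implementation. The function names, the toy sequences and the value of `k` are hypothetical. The coexisting weight is assumed to be the fraction of sequences containing one k-mer that also contain the other, and the inclusive weight is assumed to follow the textbook TF-IDF definition; the exact forms of formulas (1) to (3) may differ.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Slide a window of width k along seq with a step of one base."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of mismatched positions between two equal-length k-mers."""
    if len(a) != len(b):
        raise ValueError("k-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def coexisting_weight(a, b, kmer_sets):
    """Assumed form: share of the sequences containing `a` that also contain `b`."""
    n_a = sum(1 for ks in kmer_sets if a in ks)
    n_ab = sum(1 for ks in kmer_sets if a in ks and b in ks)
    return n_ab / n_a if n_a else 0.0


def inclusive_weight(kmer, seq_kmers, kmer_sets):
    """Assumed form: textbook TF-IDF of one k-mer within one sequence."""
    tf = Counter(seq_kmers)[kmer] / len(seq_kmers)
    df = sum(1 for ks in kmer_sets if kmer in ks)
    return tf * math.log(len(kmer_sets) / df) if df else 0.0


# Toy data; k = 4 is illustrative only, not the value used in the paper.
sequences = ["ACGTACGT", "ACGTTTTT", "GGGGCCCC"]
k = 4
per_seq = [kmers(s, k) for s in sequences]   # k-mers of each sequence, in order
kmer_sets = [set(ks) for ks in per_seq]      # unique k-mers of each sequence

print(hamming("ACGT", "ACGA"))
print(coexisting_weight("ACGT", "CGTA", kmer_sets))
print(inclusive_weight("ACGT", per_seq[0], kmer_sets))
```

How the Hamming distance is converted into a similarity weight is not shown here, because that mapping belongs to formula (1).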
2.2 Building the GNN model
MMGraph decomposes the heterogeneous graph into three sub-graphs, i.e. the similarity graph, the coexisting graph and the inclusive graph. Then, it uses the labels of as targets to train a three-layer GNN model. The first layer learns the concatenated embedding of as , where and are the embedding dimensions of and from the similarity graph and the coexisting graph, respectively. The second layer learns the embedding of as from the inclusive graph, where is the dimension of . The third layer is a fully connected layer, which identifies whether the label of is ‘1’ (i.e. it contains TFBSs) or not, using a threshold of 0.5 (Supplementary material).
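The data flow through the three layers can be sketched as a single forward pass in NumPy. This is an assumption-laden illustration rather than the MMGraph model: the row-normalised propagation rule, the ReLU and sigmoid activations, the one-hot input features, the random weights and all dimensions are hypothetical choices, and training against the sequence labels is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)


def normalize_rows(a):
    """Row-normalise a non-negative weight matrix; all-zero rows stay zero."""
    s = a.sum(axis=1, keepdims=True)
    return np.divide(a, s, out=np.zeros_like(a, dtype=float), where=s > 0)


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward(sim, coex, incl, params):
    """One forward pass; sim/coex are k-mer x k-mer, incl is sequence x k-mer."""
    x = np.eye(sim.shape[0])  # assumed one-hot input feature per k-mer
    # Layer 1: k-mer embeddings from the two k-mer graphs, then concatenation.
    h_sim = relu(normalize_rows(sim) @ x @ params["w_sim"])
    h_coex = relu(normalize_rows(coex) @ x @ params["w_coex"])
    h_kmer = np.concatenate([h_sim, h_coex], axis=1)
    # Layer 2: sequence embeddings aggregated over the inclusive graph.
    h_seq = relu(normalize_rows(incl) @ h_kmer @ params["w_seq"])
    # Layer 3: fully connected layer giving one probability per sequence.
    prob = sigmoid(h_seq @ params["w_out"] + params["b_out"]).ravel()
    return h_kmer, h_seq, prob


# Hypothetical sizes and random weights, for illustration only.
n_kmers, n_seqs, d1, d2, d3 = 6, 4, 3, 3, 5
sim = rng.random((n_kmers, n_kmers))
coex = rng.random((n_kmers, n_kmers))
incl = rng.random((n_seqs, n_kmers))
params = {
    "w_sim": rng.normal(size=(n_kmers, d1)),
    "w_coex": rng.normal(size=(n_kmers, d2)),
    "w_seq": rng.normal(size=(d1 + d2, d3)),
    "w_out": rng.normal(size=(d3, 1)),
    "b_out": 0.0,
}

h_kmer, h_seq, prob = forward(sim, coex, incl, params)
labels = (prob > 0.5).astype(int)  # threshold of 0.5, as in the text
```

In practice the weights would be fitted by minimising a classification loss on the sequence labels, after which `h_kmer` and `h_seq` play the role of the k-mer and sequence embeddings used for motif finding.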
2.3 Finding multiple motifs
After getting the embeddings of and , we calculate the mutual information (MI) of and from their corresponding embeddings and . The MI measures the information that and share. To indicate MI for different with labels, we subdivide into by the sequences’ labels ‘1’ and ‘0’, as well as , , , and for , , , and . The whole motif-finding procedure includes four steps.
1. Obtain matrix by calculating , where , , , .
2. Get the denoised matrix by , where the background noise .
3. Get the set that contains all unique in , s.t. , , then locate the interval of in as , where , is the starting position of in . If multiple exist in , there will be multiple .
4. Then get two centered on as and , where , . If , which means that and have a strong relation, we merge and into a candidate TFBS as ; otherwise, and will not be considered. For all , we can find multiple candidates in . If multiple overlap, they may be merged into a longer TFBS in . Finally, we can find multiple with different lengths.
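The final merging of overlapping candidates, which is what produces motifs of different lengths, can be sketched as a standard interval merge. The function name, the half-open `(start, end)` convention and the example positions are assumptions made for illustration; the MI computation and the denoising of the earlier steps are not reproduced here.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals; `end` is exclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous interval: extend it to cover both.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical candidate TFBS positions within one sequence.
candidates = [(3, 11), (8, 16), (30, 38)]
print(merge_intervals(candidates))  # [(3, 16), (30, 38)]
```

With this convention, two candidates that merely touch (one ends where the next starts) are kept separate; a looser criterion would replace `<` with `<=`.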
3. Experiments and results
The experiments were conducted on 88 ATAC-seq datasets. For each sequence set, the sequences were acquired from ATAC-seq footprints and were all positive sequences with label ‘1’. We randomly shuffled all bases within a positive sequence to generate a negative sequence with label ‘0’ (Alipanahi ). The negative sequence did not contain TFBSs but had the same GC content as the positive sequence. The area of eight metrics radar (AEMR) was used to evaluate each tool’s capability to identify whether a sequence contains TFBSs (Zhang ), and the P-value was used to measure the quality of the motifs found by each tool. Our results show that MMGraph achieved the highest AEMR score of 2.31 while , and could find 207 higher-quality motifs compared with the four comparison tools (Supplementary data).
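The construction of negative sequences can be illustrated with a short Python sketch. Shuffling the bases of a sequence keeps its base composition, and therefore its GC content, unchanged. The function names, the seed and the example sequence are hypothetical and are not taken from the MMGraph code.

```python
import random


def shuffled_negative(seq, rng):
    """Return a copy of seq with its bases in random order."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def gc_content(seq):
    """Fraction of bases that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


rng = random.Random(0)          # fixed seed so the example is reproducible
positive = "ACGTGGCCATATGC"     # made-up positive sequence (label '1')
negative = shuffled_negative(positive, rng)  # label '0'

# Composition is preserved, so GC content matches exactly.
print(gc_content(positive) == gc_content(negative))  # True
```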
Funding
This work was supported by the National Natural Science Foundation of China [62072212]; Jilin province project (20220508125RC) and the Chinese Postdoctoral Science Foundation [2021M691211].
Conflict of Interest: The authors declare that they have no conflict of interest.