Literature DB >> 30267681

DNA Motif Recognition Modeling from Protein Sequences.

Abstract

Although the existing works on DNA motif discovery on DNA sequences are plethoric, mechanistic knowledge to infer DNA motifs from protein sequences across multiple DNA-binding domain families without conducting any wet-lab experiments is still lacking. Therefore, the k-spectrum recognition modeling is proposed to address the issues at the highest possible resolutions. The k-spectrum model can capture DNA motif patterns from protein sequences at the resolution in which local sequence context and nucleotide dependency can be taken into account completely. Multiple evaluation metrics are adopted and measured on millions of k-mer binding intensities from 92 proteins across 5 DNA-binding families (i.e., bHLH, bZIP, ETS, Forkhead, and Homeodomain), demonstrating its competitive edges. In addition, it not only can contribute to DNA motif recognition modeling but also can help prioritize the observed or even unobserved binding of single nucleotide variants on transcription factor binding sites in a genome-wide manner.

Entities: CellLine Chemical Disease Mutation Species

Keywords: Bioinformatics; Computational Biology; DNA Motifs; Genetics; Quantitative Genetics

Year: 2018 PMID： 30267681 PMCID： PMC6153143 DOI： 10.1016/j.isci.2018.09.003

Source DB: PubMed Journal: iScience ISSN： 2589-0042

Introduction

According to a robust estimation (Li and Biggin, 2015), 73% of protein expression is regulated by gene transcription. Such a percentage is substantially higher than the other steps such as translation (8%), protein degradation (8%), and mRNA degradation (11%). A recent study also indicated that most individuals have unique repertoires of gene transcription activities, which can contribute to phenotypic variations (Barrera et al., 2016) and thus the difficulties in developing personalized medicine. Therefore, understanding gene transcription forms the important basis for personalized medicine development. Especially, the protein-DNA binding interactions between transcription factors (TFs) and transcription factor binding sites (TFBSs) are the essential components in eukaryotic gene transcription where TFs bind to TFBSs in a sequence-specific manner as evidenced by the study that 17.5%–19.5% of the top expression quantitative trait loci are overlapped with the annotated TFBSs (1000 Genomes Project Consortium et al., 2015). Therefore, substantial efforts have been made into elucidating the DNA-binding specificity patterns (TFBS patterns) of TFs, which are the essential proteins for gene transcription. TFs bind onto specific DNA sites (TFBSs) on regulatory regions (e.g., promoters and enhancers), controlling when and where each gene is transcribed. Given its central importance, the existing protein-DNA binding structures have been analyzed to decipher the protein-DNA binding interactions between TFs and TFBSs (Luscombe et al., 2001) on specific TF families (e.g., zinc fingers [Krishna et al., 2003]). Bonding and force types, bending of the DNA (Jones et al., 1999), TF conservation, and mutations (Luscombe and Thornton, 2002) have been discovered as the key factors that can determine the binding amino acid residues on the TF side (Jones et al., 2003). In addition, many have tried to generalize the binding rules (i.e., the one-to-one mapping between amino acids from TFs and nucleotides from TFBSs). Unfortunately, despite the efforts (Mandel-Gutfreund et al., 1995, Mandel-Gutfreund and Margalit, 1998, Luscombe and Thornton, 2002), it is suggested that there is not any general binding rule across different TF families (Sarai and Kono, 2005). To address this, high-throughput biotechnologies have been developed such as chromatin immunoprecipitation (ChIP) sequencing technologies (e.g., ChIP-Chip, ChIP-seq, and ChIP-exo), ChIA-PET, HT-SELEX, CHAMP, and ORGANIC (Jung et al., 2017, Kasinathan et al., 2014). In particular, protein binding microarray (PBM) has been developed as a high-throughput technology that can discover the in vitro TFBSs (Barrera et al., 2016). Each PBM run can measure binding intensities of a given TF to all possible DNA k-mers (k ≥ 8). With these technologies, international projects (e.g., ENCODE, GTEx, and FANTOM) have been successfully launched (ENCODE Project Consortium, 2012, GTEx Consortium, 2015, Forrest et al., 2014), accumulating massive TFBS data that we can study for human TFs on a genome-wide scale; for instance, the sequencing data from the ENCODE consortium has enabled the systematic genome-wide discovery and characterization of human TFBSs (Kheradpour and Kellis, 2014). In addition, Jolma et al. have also characterized the DNA-binding specificity landscape of human TFs for large-scale TFBS data (Jolma et al., 2013). The de facto modeling of TFBS is usually in the form of either solid consensus sequences or position weight matrices (PWMs). Nonetheless, it is well known that nucleotide dependencies and indel operations exist (Tomovic and Oakeley, 2007). Therefore, Wong et al. have proposed to adopt probabilistic modeling to tackle these challenges (Wong et al., 2013). Transcription Factor Flexible Model has also been proposed by Mathelier and Wasserman to solve these issues with variable sequence lengths (Mathelier and Wasserman, 2013). Recently, deep convolutional neural networks have also been applied for large-scale modeling (Alipanahi et al., 2015). Therefore, TFBS modeling is still an active but central challenge in nucleic acids research. However, these studies are limited to the DNA side. In recent years, given the exponential increase in data, people started to realize that we can predict DNA motifs from the protein side; for instance, Pelossof et al. have successfully adopted the PBM data to predict the DNA-binding affinities of different Homeodomain proteins (Pelossof et al., 2015). Gupta et al. have also proposed a random forest recognition model to predict DNA motif matrices from the protein sequences of the C2H2 zinc finger family (Gupta et al., 2014). Unfortunately, both studies are limited to a single DNA-binding family. In addition, its modeling methodology is either limited to k-mer independence assumption or PWM assumption. The actual modeling performance across multiple DNA-binding families remain speculative. Therefore, in this study, an approach that can take into account sequence context and nucleotide dependency for k-mer dependence modeling at the highest possible resolution is proposed.

DNA Motif Recognition Modeling

For this study, the proposed framework is summarized in Figure 1. The major step descriptions can be referenced from the figure caption. Further details are provided in the following subsections.

Figure 1

Proposed Framework Overview

(Step 1) The input protein sequence is given for elucidating its DNA-binding specificities. (Step 2) The input protein sequence is scanned with Pfam to retrieve its DNA-binding domain (D) annotation. (Step 3) Based on D, the corresponding recognition model can be retrieved from the previous study (Wong, 2015). (Steps 4 and 5) The domain (D) annotation and its sequence are fed into the retrieved model. (Step 6) A draft DNA k-spectrum is generated from the model. (Step 7) The draft DNA k-spectrum is forwarded into the Markov random field model trained for detailed refinement. (Step 8) The predicted DNA k-spectrum profile is generated from the Markov random field model. (Step 9) Evaluations and applications are conducted under leave-one-out cross-validations (LOOCVs).

Proposed Framework Overview (Step 1) The input protein sequence is given for elucidating its DNA-binding specificities. (Step 2) The input protein sequence is scanned with Pfam to retrieve its DNA-binding domain (D) annotation. (Step 3) Based on D, the corresponding recognition model can be retrieved from the previous study (Wong, 2015). (Steps 4 and 5) The domain (D) annotation and its sequence are fed into the retrieved model. (Step 6) A draft DNA k-spectrum is generated from the model. (Step 7) The draft DNA k-spectrum is forwarded into the Markov random field model trained for detailed refinement. (Step 8) The predicted DNA k-spectrum profile is generated from the Markov random field model. (Step 9) Evaluations and applications are conducted under leave-one-out cross-validations (LOOCVs).

Data Sources

The DNA-binding-family-specific recognition models are retrieved from the previous study (Wong et al., 2015). The training and testing k-spectrum data (i.e., e-scores), if available, are retrieved from UniPROBE in October 2017 (Barrera et al., 2016). In particular, the top five DNA-binding domain families with the largest available PBM data are selected and tabulated in Table S1. For each family, leave-one-out cross-validations (LOOCVs) are conducted for fair evaluations. The official Pfam web server is chosen in October 2017 (Finn et al., 2016). The ground truth DNA motif matrices are based on CIS-BP (v1.02) in human (Weirauch et al., 2014).

DNA k-Spectrum Recognition Model Training

DNA k-spectrum refers to the complete quantitative modeling on m DNA k-mers {kmer1,kmer2, …,kmer} of length k. The DNA kmers can be selected based on the DNA-binding intensity ranking from PBM experiments. Although progress has been made in DNA motif recognition modeling, we lack in silico DNA motif recognition models that can achieve such a high resolution for the TF of interest across multiple DNA-binding families. Therefore, we aim at developing the first-of-its-kind in silico DNA k-spectrum recognition model that can fully capture the DNA-binding specificities of TFs at a very high resolution across multiple DNA-binding families. As the intermediate step, we have already developed a unique model Model, which can predict specificity-determining residue-nucleotide interactions between TFBSs and TFs with known annotations from protein-DNA binding sequences with the help of random forest model training on the existing protein-DNA binding complex structures from Protein Data Bank (PDB) (Wong et al., 2015). Briefly, given a TF whose DNA-binding specificity we would like to find, we can run the existing protein domain annotation programs (e.g., InterPro and PFam) to annotate its DNA-binding domains (DBDs) on its protein sequence. For the DBD of interest (D), we collect T known PBM k-spectrum profiles from UniProbe where t denotes the DBD instance index for model training or building. In each profile , we have the binding intensities of all DNA k-mers of interest:where denotes the actual binding intensity of kmer as measured by PBM. In addition, we can also collect the corresponding k-spectrum recognition profile from the model Model using DNA motif recognition modeling techniques for each DBD instance based on its protein sequence (Wong et al., 2015):where denotes the binding intensity of kmer as recognized by Model. As a result, our objective is to build a mathematical model that can predict from , given the T known instances of the DBD of interest (D). Statistically, it is a sparse multi-task regression problem where the predictor variables are and the response variables are . However, such a formulation does not take into account the sequence neighborhood among different k-mers. Therefore, a Markov Random Field (MRF) model is proposed to capture the sequence neighborhood information for sparse multi-task regression. Mathematically, the energy function E of Model is given by:where T is the number of DBD instances, M is the number of DNA k-mers, g and f are the MRF clique potential functions, and NB (a) is the sequence neighborhood of kmer, which can be defined in different settings. An example of the sequence neighborhood NB (kmer = AAA) with one substitution error is illustrated in Figure 2. To ensure its function convexity, the classic least square error estimation formulation is adopted: and . Given the training data and , we can take partial derivatives to the energy function E with respect to the regression parameters {s} and {i}, resulting in the typical ordinary least aquares estimations for the model training of Model.

Figure 2

A Markov Random Field Example of the Sequence Neighborhood NB (kmer = AAA) with One Substitution Error

Reverse complements are not shown for illustrative purposes.

A Markov Random Field Example of the Sequence Neighborhood NB (kmer = AAA) with One Substitution Error Reverse complements are not shown for illustrative purposes.

DNA k-Spectrum Recognition Model Testing

For each DBD of interest (D), its known PBM k-spectrum profiles have already been collected from UniProbe where t denotes the DBD instance index for model training on the top M k-mers (e-scores). On the other hand, its corresponding k-spectrum recognition profiles have also been generated via the model Model (Wong et al., 2015). To test the model Model, LOOCVs are conducted on those DBD instances. For each left-out DBD instance indexed by t, given its k-spectrum recognition profile , we have to compute for its predicted PBM k-spectrum profile via the model Model, which has been trained on the other instances. Unfortunately, we do not have any closed form solution without any distribution assumption. Therefore, the iterative formula is obtained by taking partial derivatives to the energy function E with respect to each for all kmer (i.e., ): Such a formulation has the linear complexity . Convexity is also guaranteed because of the least square error estimation setting. Therefore, it can be easily iterated from random initialization until convergence or maximal number of iterations, resulting in the overall linear complexity where I is the actual number of iterations (i.e., less than 50 iterations in most cases). At the implementation level, we can also combine the equations for all kmer into a system of linear equations and solve it via the matrix inverse operation if sufficient memory is available.

DNA k-mer Neighborhood Modeling

Given the central importance of the DNA k-mer neighborhood modeling (NB), different possible definitions are parameterized and compared in this study. In particular, three distance metrics are proposed for measuring k-mer sequence similarity: Hamming distance (Norouzi et al., 2012), longest common subsequence (LCS) distance (Paterson and Dančík, 1994), and Levenshtein distance (also known as edit distance) (Yujian and Bo, 2007). Notably, LCS distance is a special case of Levenshtein distance. To extensively cover all possible scenarios, the distance thresholds from one to five are enumerated for each distance metric, resulting in 15 different definitions of DNA k-mer neighborhood NB.

Results

DNA-Binding Intensity Correlation

Given the linear model complexities, for each DBD (D), the aforementioned LOOCV procedures are conducted to obtain the predicted PBM k-spectrum profiles and compared with the actual PBM k-spectrum profiles where k = 8, following previous studies (Weirauch et al., 2013, Wong et al., 2013, Zhao and Stormo, 2011, Chen et al., 2007). Spearman rank correlation coefficients are computed for evaluations as depicted in Figure 3. It can be observed that the proposed approach performs better than the previous approach in capturing DNA-binding intensities (Wong et al., 2015); for instance, for most of the cases, it can achieve the correlation values well above 0.5 and up to 0.9. However, the proposed approach cannot work very well with the bHLH and bZIP families partly due to their dimerization nature. The bHLH family is especially the worst because its TFs are known for its structural dimer flexibility as well as loose DNA-binding contacts, leading to diverse DNA motif instances that are difficult to be captured (Ellenberger, 1994).

Figure 3

DNA-Binding Intensity Correlation Comparisons under Different k-mer Neighborhood Settings

Box plots on the Spearman rank correlations between the actual binding intensities of k-mers and the predicted binding intensities of k-mers using the previous method (Wong et al., 2015), and the current method denoted in red and blue colors, respectively.

(A) Number of top k-mers (M) = 100.

(B) Number of top k-mers (M) = 1,000.

DNA-Binding Intensity Correlation Comparisons under Different k-mer Neighborhood Settings Box plots on the Spearman rank correlations between the actual binding intensities of k-mers and the predicted binding intensities of k-mers using the previous method (Wong et al., 2015), and the current method denoted in red and blue colors, respectively. (A) Number of top k-mers (M) = 100. (B) Number of top k-mers (M) = 1,000. To investigate it further, the MRF networks of the hidden variables (i.e., the predicted PBM k-spectrum profile variables) of the five DNA-binding families when the number of top k-mers (M) is 100 are plotted in Figures 4 and S1–S4. The corresponding performance values are visualized in Figure 3. The edge betweenness community detection method is applied to segregate each Markov network into different communities for modularity maximization (i.e., the “cluster_edge_betweenness” function in R). Interestingly, based on the community structures, we can observe several characteristics and reasons for the aforementioned family-specific performance difference. (1) The bHLH family has multiple disjoint communities that impose difficulties in synergizing the sequence neighborhood information for binding intensity correlation modeling. (2) The bZIP and Homeodomain families have multiple singleton communities that can isolate the related k-mers from other communities for network modeling. (3) The ETS and Forkhead families have strongly overlapping communities that can boost sequence neighborhood information for DNA motif recognition modeling.

Figure 4

Markov Random Field Network of the Hidden Variables for the bHLH Family When 100 Top k-mers Are Selected (M) for Modeling

Hamming distance threshold of one is adopted for sequence neighborhood connections while reverse complements are considered equivalent. The edge betweenness community detection method has been adopted to segregate the Markov network into different communities for modularity maximization (i.e., “cluster_edge_betweenness” function in R), as highlighted in different colors. The node sizes are proportional to the node degrees.

Markov Random Field Network of the Hidden Variables for the bHLH Family When 100 Top k-mers Are Selected (M) for Modeling Hamming distance threshold of one is adopted for sequence neighborhood connections while reverse complements are considered equivalent. The edge betweenness community detection method has been adopted to segregate the Markov network into different communities for modularity maximization (i.e., “cluster_edge_betweenness” function in R), as highlighted in different colors. The node sizes are proportional to the node degrees. Based on these observations, we may wonder whether we can improve the modeling performance by increasing the number of DNA k-mers. In this study, the number of top DNA k-mers M is varied among {100,200,300,500,1,000} as depicted in Figure S5. Interestingly, it can be observed that the proposed approach can be improved with the increasing number of top k-mers, regardless of the DNA-binding family type. It indicates that our MRF modeling can be scaled for improvement once rich k-mer information is made available. Two left-out examples are shown in Figure 5. The first example concerns the DNA-binding specificities of ERG, which is an oncogene related to hematopoiesis, whereas the second example belongs to PROP1, which is responsible for hormone regulation. Additional examples are depicted in Figures S6–S8.

Figure 5

Examples of Left-One-Out Cross-Validation Predictions on 1,000 Top k-mers Whose Neighborhood Is Defined Based on the Hamming Distance Threshold of One

Each dot represents one k-mer. The evaluations are based on Spearman rank correlations (r). Random initialization shows the initial guess before the iterations. The figures are drawn using R, where the red curves are local polynomial regression fittings with α = 2/3 and the p values are computed using algorithm AS 89 (Best and Roberts, 1975). Additional examples are depicted in Figures S6–S8 for illustration purposes.

(A) ERG from ETS family.

(B) PROP1 from Homeodomain family.

Examples of Left-One-Out Cross-Validation Predictions on 1,000 Top k-mers Whose Neighborhood Is Defined Based on the Hamming Distance Threshold of One Each dot represents one k-mer. The evaluations are based on Spearman rank correlations (r). Random initialization shows the initial guess before the iterations. The figures are drawn using R, where the red curves are local polynomial regression fittings with α = 2/3 and the p values are computed using algorithm AS 89 (Best and Roberts, 1975). Additional examples are depicted in Figures S6–S8 for illustration purposes. (A) ERG from ETS family. (B) PROP1 from Homeodomain family.

DNA Motif Pattern Recognition

Given those top k-mers ranked, consistent with the previous study by Zhao and Stormo (Zhao and Stormo, 2011), we pick the top 25 scoring k-mers and compare them with the previous studies' top 25 k-mers (Wong et al., 2015). In particular, for each left-out DBD instance, the literature procedures are followed to align the top 25 scoring k-mers and build DNA motif matrices, which are then compared with the ground truth DNA motif matrices in CIS-BP (v1.02) as indexed by the protein names (Weirauch et al., 2014). To quantify the DNA motif matrix pattern similarities, the previously published motif matrix distance is adopted (Wong et al., 2013). All resultant DNA motif matrices are compared with the ground truth DNA motif matrices in CIS-BP (v1.02) (Weirauch et al., 2014). The results are depicted in Figure 6.

Figure 6

Complete DNA Motif Pattern Recognition Performance Comparisons under Different k-mer Neighborhood Settings

Distribution plots on the DNA motif matrix distances (Wong et al., 2013) between the actual DNA motif matrices in CIS-BP (v1.02) (Weirauch et al., 2014) and the predicted DNA motif matrices using the previous method (Wong et al., 2015) and the current method denoted in red and blue colors, respectively The horizontal axis denotes different neighborhood distance metrics, whereas the vertical axis denotes different distance thresholds.

Complete DNA Motif Pattern Recognition Performance Comparisons under Different k-mer Neighborhood Settings Distribution plots on the DNA motif matrix distances (Wong et al., 2013) between the actual DNA motif matrices in CIS-BP (v1.02) (Weirauch et al., 2014) and the predicted DNA motif matrices using the previous method (Wong et al., 2015) and the current method denoted in red and blue colors, respectively The horizontal axis denotes different neighborhood distance metrics, whereas the vertical axis denotes different distance thresholds. From the figure, it can be observed that the motif distance distribution of our generated motifs are skewed more toward the left side (zero side) than the previous approach (Wong et al., 2015); it indicates that the DNA motif matrices generated by our approach are more similar to the ground truth DNA motif matrices than those generated by the previous approach (Wong et al., 2015). In particular, if the k-mer neighborhood is defined using the LCS distance, the DNA motif matrices are consistently similar to the ground truth across different thresholds. It is important as the ground truth DNA motif matrices in CIS-BP (v1.02) are built on both in vivo and in vitro technologies such as ChIP-seq, HT-SELEX, and PBM. It can be observed that the DNA motif matrices generated in this study not only can capture the in vitro DNA-binding specificities but also have the potential to infer in vivo DNA-binding specificities. Examples are visualized in Figure 7. Additional examples are depicted in Figures S9–S12 for illustration purposes.

Figure 7

Examples of DNA Motif Matrices Generated by the Previous Approach and Our Approach and the Actual DNA Motif Matrices as Measured using PBM

The first two motif matrices are based on the left-one-out cross-validation predictions on 1,000 top k-mers whose neighborhood is defined based on the Hamming distance threshold of one. All settings strictly follow the protocols established by Zhao and Stormo (Zhao and Stormo, 2011). Additional examples are depicted in Figures S9–S11 for illustration purposes.

(A) HLH-30 from bHLH family.

(B) YAP3 from bZIP family.

(D) FOXO3 from Forkhead family.

Examples of DNA Motif Matrices Generated by the Previous Approach and Our Approach and the Actual DNA Motif Matrices as Measured using PBM The first two motif matrices are based on the left-one-out cross-validation predictions on 1,000 top k-mers whose neighborhood is defined based on the Hamming distance threshold of one. All settings strictly follow the protocols established by Zhao and Stormo (Zhao and Stormo, 2011). Additional examples are depicted in Figures S9–S11 for illustration purposes. (A) HLH-30 from bHLH family. (B) YAP3 from bZIP family. (C) EHF from ETS family. (D) FOXO3 from Forkhead family. Although the proposed approach has been extensively benchmarked across five DNA-binding families (i.e., bHLH, bZIP, ETS, Forkhead, and Homeodomain), one may wonder about its performance compared with the existing motif recognition modeling approaches. Therefore, the existing works on DNA motif elucidation are surveyed. Unfortunately, most of the existing works are based on verified DNA sequences (e.g., ChIP-seq, HiTS-FLIP, and DNase hypersensitivity data) (Wang et al., 2014, Khamis et al., 2018, Dai et al., 2017). It is not fair to do the comparisons. On the other hand, the most related work is limited to a specific DNA-binding family (i.e., homeodomain) (Pelossof et al., 2015). Therefore, the performance comparisons are conducted on the Homeodomain motif recognition tasks here. Pelossof et al. have proposed an affinity regression approach to infer DNA motifs from Homeodomain protein sequences (Pelossof et al., 2015). It is applied to generate and compare their DNA motifs against our DNA motifs. All experimental settings follow the standard setting by Zhao and Stormo (Zhao and Stormo, 2011). The results are depicted in Figure 8. It can be observed that our approach can generate the Homeodomain motifs more similar to the CIS-BP motifs than the other approaches, demonstrating its competitive edges. The comparison is significant as the Pelossof method already has 163 Homeodomain motif data for model building, whereas our approach only has 19 Homeodomain motif data for model building under LOOCVs. Some of the motif examples are depicted in Figure S13 for illustration purposes. Its success can be attributed to the k-mer community segregation ability of the underlying MRF modeling as exemplified in Figure S4, where you can see that the Homeodomain top k-mers are segregated into different communities, consistent with the existing knowledge that the Homeodomain family members have independently evolved into different subtypes for its DNA-binding specificity over the past years (Berger et al., 2008).

Figure 8

Homeodomain DNA Motif Pattern Recognition Performance Comparisons under Different k-mer Neighborhood Settings

Distribution plots on the Homeodomain DNA motif matrix distances (Wong et al., 2013) between the actual DNA motif matrices in CIS-BP (v1.02) (Weirauch et al., 2014) and the predicted DNA motif matrices using the previous method (Wong et al., 2015), Pelossof method (Pelossof et al., 2015), and the current method denoted in red, blue, and green colors, respectively. The horizontal axis denotes different neighborhood distance metrics, whereas the vertical axis denotes different distance thresholds of the current method.

Homeodomain DNA Motif Pattern Recognition Performance Comparisons under Different k-mer Neighborhood Settings Distribution plots on the Homeodomain DNA motif matrix distances (Wong et al., 2013) between the actual DNA motif matrices in CIS-BP (v1.02) (Weirauch et al., 2014) and the predicted DNA motif matrices using the previous method (Wong et al., 2015), Pelossof method (Pelossof et al., 2015), and the current method denoted in red, blue, and green colors, respectively. The horizontal axis denotes different neighborhood distance metrics, whereas the vertical axis denotes different distance thresholds of the current method.

Single Nucleotide Variants on DNA Motifs

In the previous studies, significant efforts have been made and relied on DNA motif matrices for regulatory single nucleotide variant (rSNV) prioritization (Macintyre et al., 2010, Guo et al., 2013, Zeng et al., 2016) with additional genomic information such as DNase accessibility and chromatin features (Shi et al., 2016, Li et al., 2016) on TFBSs; for instance, atSNP was proposed to address the rSNV prioritization challenge with ENCODE and JASPAR motif matrices (Zuo et al., 2015). SNP2TFBS was also built as an online resource that can bridge the annotation gap between rSNV and the classic SNV databases based on existing motif matrices automatically (Kumar et al., 2017). Unfortunately, motif matrices are well known for their position independence assumption (Benos et al., 2002). Although several improvements in DNA motif modeling have been proposed to address the issues (Mathelier and Wasserman, 2013, Wong et al., 2013), the combinatorial space of the sequence context in DNA motif modeling remains computationally expensive to be modeled around the rSNVs. The current study offers a novel opportunity for us to quantify the effects of rSNVs on DNA motifs, thanks to its high-resolution k-spectrum modeling methodology. Given an rSNV on DNA motif instances known to be bound by a family-specific TF, we can apply our model to compute the score difference between the reference sequence and its variant; for example, at Chr3: 38591950 (on Human Assembly GRCh37), an SNV (rs972777761) has been reported where the reference adenine (A) is changed to guanine (G) (or thymine [T] to cytosine [C] on the opposite strand). At the same time, we have also detected a surrounding motif instance (AAGGAAGTG) as bound by the ELF1 protein from the ETS family between Chr3:38591945 and Chr3:38591953. We can enumerate the reference k-mers and alternate k-mers surrounding the SNV, for instance, the 8-mers AGGAGTG and AGGAGTG, and compute their score difference using the ETS family model in our study. Since the scores represent k-mer binding intensities, the score difference and its sign could quantify the DNA-binding effect of the candidate SNV for mechanistic prioritization. In this case, our model returns −0.15, which indicates that the SNV could disrupt the DNA-binding affinity of ELF1 and thus its downstream regulation. To provide genome-wide insights, we have adopted the DNA motifs generated in our previous section and scanned the humane genome (GRCh37) using TFBSTools (Tan and Lenhard, 2016). In particular, we have selected the advanced genome-wide phylogenetic footprinting function (i.e., “searchPairBSgenome” in R) to improve the scanning performance by chaining the human genome (GRCh37) over the mouse genome (mm10), taking advantage of evolutionary conservation (Wasserman and Sandelin, 2004). Once scanned, we overlapped those motif instances with the clinically verified SNVs from ClinVar (version 20171029), resulting in 303,666 motif instances overlapping with known SNVs (Fisher's exact test p value < 0.001). For each DNA-binding family, we focus on the top DNA motif instances as ranked by TFBSTools (Tan and Lenhard, 2016). The family-specific results with different top ranks are depicted in Figure 9. For completeness, additional results with other parameter settings are depicted in Figure S14.

Figure 9

Violin Plots on the k-mer Score Difference Distributions of the SNVs at Top-Ranked Motif Instances via Family-Specific Recognition Modeling

The observed SNVs are retrieved from the clinically verified dataset, ClinVar (version 20171029), whereas the DNA motif instances are ranked by TFBSTools (Tan and Lenhard, 2016). Other settings are visualized in Figure S14.

(A) SNVs at top 5% motif instances.

(B) SNVs at top 10% motif instances.

Violin Plots on the k-mer Score Difference Distributions of the SNVs at Top-Ranked Motif Instances via Family-Specific Recognition Modeling The observed SNVs are retrieved from the clinically verified dataset, ClinVar (version 20171029), whereas the DNA motif instances are ranked by TFBSTools (Tan and Lenhard, 2016). Other settings are visualized in Figure S14. (A) SNVs at top 5% motif instances. (B) SNVs at top 10% motif instances. Interestingly, it can be observed that different DNA-binding families have strikingly different score difference distributions of observed SNVs from the clinically verified dataset, ClinVar (version 20171029). The bHLH, ETS, and Forkhead families have the score difference distributions skewed toward negativity. It indicates that the SNVs on the DNA motif instances bound by the TFs from those three families usually act by disrupting the corresponding protein-DNA binding interactions. In contrast, we observed the opposite trend for the bZIP family. The trend is even more complicated for the Homeodomain family as we can observe bimodal distributions for the SNVs on the DNA motif instances bound by its TFs, regardless of the DNA motif instance ranks as observed from Figures 9 and S14. On the other hand, we also observe that the clinically verified SNVs are more neutral than the possible SNVs on the DNA motif instances which are yet to be observed, consistent with the previous finding that negative selection pressures have been observed on DNA motif instances (Vorontsov et al., 2016). To support the aforementioned insights, the SNVs overlapping with the Homeodomain motif instances are extracted, resulting in 126,924 SNVs clinically verified in ClinVar (version 20171029). On the other hand, the previous setting is followed to generate an equal number of random SNVs as the control. Therefore, it constitutes a two-class SNV dataset (i.e., ClinVar versus control) for rSNV prioritization benchmark comparisons. The previous methods and the affinity regression approach (Pelossof et al., 2015) are run on those SNVs for prioritization. The k-mer frequencies and its differences are computed to ascertain the significance of each rSNV on those Homeodomain motif instances. The resultant areas under receiver operating characteristics values on different motifs are depicted and summarized in Figure 10. It is not surprising that different approaches have their own competitive edges on different DNA motifs for rSNV prioritization. Therefore, we seek to summarize them in the second figure, where we can observe that our approach is slightly better than the previous approaches as a whole. However, we note that we cannot ascertain any statistical significance using hypothesis testing (i.e., t test and rank-sum test) here. It indicates that our approach is comparable to the existing state-of-the-art approaches including the previous method published in Nucleic Acids Research and the affinity regression method published in Nature Biotechnology.

Figure 10

Single Nucleotide Variants (SNV) Prioritization on DNA Motifs according to ClinVar and TFBSTools

SNV prioritization performance comparisons based on area under receiver operating characteristics (ROCs) on Homeodomain motif rSNVs using the previous method (Wong et al., 2015), Pelossof method (Pelossof et al., 2015), and the current method denoted in red, blue, and green colors, respectively.

(A) Bar chart details.

(B) Violin and boxplot summary.

Single Nucleotide Variants (SNV) Prioritization on DNA Motifs according to ClinVar and TFBSTools SNV prioritization performance comparisons based on area under receiver operating characteristics (ROCs) on Homeodomain motif rSNVs using the previous method (Wong et al., 2015), Pelossof method (Pelossof et al., 2015), and the current method denoted in red, blue, and green colors, respectively. (A) Bar chart details. (B) Violin and boxplot summary.

Discussion

DNA motif recognition modeling offers opportunities for us to infer DNA motifs from protein sequences. Such an approach is not only applicable in the cases in which the direct evidences are unavailable but also holds promise for us to understand the DNA-binding specificities from the first principle on the protein side. The proposed approach directly addressed such issues at the unprecedented resolution based on the k-spectrum MRF modeling. It has been extensively benchmarked on millions of k-mer binding intensities from 92 TFs across 5 DNA-binding families bHLH, bZIP, ETS, Forkhead, and Homeodomain, as tabulated in Table S1. The DNA-binding intensity correlation results demonstrate that the proposed approach is robust against different numbers of top k-mers. In particular, it can be scaled and keeps improving with increasing k-mers. The DNA motif pattern recognition results also reveal that it can capture not only the in vitro patterns but also the in vivo patterns in CIS-BP (v1.02). Last, the DNA motif patterns have been overlapped with the clinically verified SNVs, revealing genome-wide insights into the DNA-binding mechanisms across five DNA-binding families. Thanks to the underlying formulation, the models also have the potential to predict the deleteriousness of unobserved SNVs for the DNA-binding specificities of TFs. It is especially important to uncover unobserved deleterious SNVs as the current studies estimated that, even if we have 500,000 sequenced individuals, we can only observe 12% of all possible SNVs under the protein-coding variant subset (Zou et al., 2016). Therefore, our approaches should be promising based on the first principle in the near future. As the future works, one may be interested in integrating the existing DNA shape data into the modeling process (Yang et al., 2013). However, it is subject to data availability as well as reliability since the current DNA shape data are mostly computationally predicted (Zhou et al., 2013) and may not be applicable to our DNA-binding specificity studies (Rossi et al., 2017).

Limitation of Study

The current study is limited to five DNA-binding families: bHLH, bZIP, ETS, Forkhead, and Homeodomain because of data availability. In the future, it should be extended to other families such as zinc finger. In addition, the study can be benchmarked with longer k-mers than the current ones. The Markov assumption here can be investigated further under different Markov orders.

Methods

All methods can be found in the accompanying Transparent Methods supplemental file.

50 in total

1. A normalized Levenshtein distance metric.

Authors: Li Yujian; Liu Bo
Journal: IEEE Trans Pattern Anal Mach Intell Date: 2007-06 Impact factor: 6.226

2. RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors.

Authors: Xiaoyu Chen; Timothy R Hughes; Quaid Morris
Journal: Bioinformatics Date: 2007-07-01 Impact factor: 6.937

3. Evaluation of methods for modeling transcription factor sequence specificity.

Authors: Matthew T Weirauch; Atina Cote; Raquel Norel; Matti Annala; Yue Zhao; Todd R Riley; Julio Saez-Rodriguez; Thomas Cokelaer; Anastasia Vedenko; Shaheynoor Talukder; Harmen J Bussemaker; Quaid D Morris; Martha L Bulyk; Gustavo Stolovitzky; Timothy R Hughes
Journal: Nat Biotechnol Date: 2013-01-27 Impact factor: 54.908

4. A promoter-level mammalian expression atlas.

Authors: Alistair R R Forrest; Hideya Kawaji; Michael Rehli; J Kenneth Baillie; Michiel J L de Hoon; Vanja Haberle; Timo Lassmann; Ivan V Kulakovskiy; Marina Lizio; Masayoshi Itoh; Robin Andersson; Christopher J Mungall; Terrence F Meehan; Sebastian Schmeier; Nicolas Bertin; Mette Jørgensen; Emmanuel Dimont; Erik Arner; Christian Schmidl; Ulf Schaefer; Yulia A Medvedeva; Charles Plessy; Morana Vitezic; Jessica Severin; Colin A Semple; Yuri Ishizu; Robert S Young; Margherita Francescatto; Intikhab Alam; Davide Albanese; Gabriel M Altschuler; Takahiro Arakawa; John A C Archer; Peter Arner; Magda Babina; Sarah Rennie; Piotr J Balwierz; Anthony G Beckhouse; Swati Pradhan-Bhatt; Judith A Blake; Antje Blumenthal; Beatrice Bodega; Alessandro Bonetti; James Briggs; Frank Brombacher; A Maxwell Burroughs; Andrea Califano; Carlo V Cannistraci; Daniel Carbajo; Yun Chen; Marco Chierici; Yari Ciani; Hans C Clevers; Emiliano Dalla; Carrie A Davis; Michael Detmar; Alexander D Diehl; Taeko Dohi; Finn Drabløs; Albert S B Edge; Matthias Edinger; Karl Ekwall; Mitsuhiro Endoh; Hideki Enomoto; Michela Fagiolini; Lynsey Fairbairn; Hai Fang; Mary C Farach-Carson; Geoffrey J Faulkner; Alexander V Favorov; Malcolm E Fisher; Martin C Frith; Rie Fujita; Shiro Fukuda; Cesare Furlanello; Masaaki Furino; Jun-ichi Furusawa; Teunis B Geijtenbeek; Andrew P Gibson; Thomas Gingeras; Daniel Goldowitz; Julian Gough; Sven Guhl; Reto Guler; Stefano Gustincich; Thomas J Ha; Masahide Hamaguchi; Mitsuko Hara; Matthias Harbers; Jayson Harshbarger; Akira Hasegawa; Yuki Hasegawa; Takehiro Hashimoto; Meenhard Herlyn; Kelly J Hitchens; Shannan J Ho Sui; Oliver M Hofmann; Ilka Hoof; Furni Hori; Lukasz Huminiecki; Kei Iida; Tomokatsu Ikawa; Boris R Jankovic; Hui Jia; Anagha Joshi; Giuseppe Jurman; Bogumil Kaczkowski; Chieko Kai; Kaoru Kaida; Ai Kaiho; Kazuhiro Kajiyama; Mutsumi Kanamori-Katayama; Artem S Kasianov; Takeya Kasukawa; Shintaro Katayama; Sachi Kato; Shuji Kawaguchi; Hiroshi Kawamoto; Yuki I Kawamura; Tsugumi Kawashima; Judith S Kempfle; Tony J Kenna; Juha Kere; Levon M Khachigian; Toshio Kitamura; S Peter Klinken; Alan J Knox; Miki Kojima; Soichi Kojima; Naoto Kondo; Haruhiko Koseki; Shigeo Koyasu; Sarah Krampitz; Atsutaka Kubosaki; Andrew T Kwon; Jeroen F J Laros; Weonju Lee; Andreas Lennartsson; Kang Li; Berit Lilje; Leonard Lipovich; Alan Mackay-Sim; Ri-ichiroh Manabe; Jessica C Mar; Benoit Marchand; Anthony Mathelier; Niklas Mejhert; Alison Meynert; Yosuke Mizuno; David A de Lima Morais; Hiromasa Morikawa; Mitsuru Morimoto; Kazuyo Moro; Efthymios Motakis; Hozumi Motohashi; Christine L Mummery; Mitsuyoshi Murata; Sayaka Nagao-Sato; Yutaka Nakachi; Fumio Nakahara; Toshiyuki Nakamura; Yukio Nakamura; Kenichi Nakazato; Erik van Nimwegen; Noriko Ninomiya; Hiromi Nishiyori; Shohei Noma; Shohei Noma; Tadasuke Noazaki; Soichi Ogishima; Naganari Ohkura; Hiroko Ohimiya; Hiroshi Ohno; Mitsuhiro Ohshima; Mariko Okada-Hatakeyama; Yasushi Okazaki; Valerio Orlando; Dmitry A Ovchinnikov; Arnab Pain; Robert Passier; Margaret Patrikakis; Helena Persson; Silvano Piazza; James G D Prendergast; Owen J L Rackham; Jordan A Ramilowski; Mamoon Rashid; Timothy Ravasi; Patrizia Rizzu; Marco Roncador; Sugata Roy; Morten B Rye; Eri Saijyo; Antti Sajantila; Akiko Saka; Shimon Sakaguchi; Mizuho Sakai; Hiroki Sato; Suzana Savvi; Alka Saxena; Claudio Schneider; Erik A Schultes; Gundula G Schulze-Tanzil; Anita Schwegmann; Thierry Sengstag; Guojun Sheng; Hisashi Shimoji; Yishai Shimoni; Jay W Shin; Christophe Simon; Daisuke Sugiyama; Takaai Sugiyama; Masanori Suzuki; Naoko Suzuki; Rolf K Swoboda; Peter A C 't Hoen; Michihira Tagami; Naoko Takahashi; Jun Takai; Hiroshi Tanaka; Hideki Tatsukawa; Zuotian Tatum; Mark Thompson; Hiroo Toyodo; Tetsuro Toyoda; Elvind Valen; Marc van de Wetering; Linda M van den Berg; Roberto Verado; Dipti Vijayan; Ilya E Vorontsov; Wyeth W Wasserman; Shoko Watanabe; Christine A Wells; Louise N Winteringham; Ernst Wolvetang; Emily J Wood; Yoko Yamaguchi; Masayuki Yamamoto; Misako Yoneda; Yohei Yonekura; Shigehiro Yoshida; Susan E Zabierowski; Peter G Zhang; Xiaobei Zhao; Silvia Zucchelli; Kim M Summers; Harukazu Suzuki; Carsten O Daub; Jun Kawai; Peter Heutink; Winston Hide; Tom C Freeman; Boris Lenhard; Vladimir B Bajic; Martin S Taylor; Vsevolod J Makeev; Albin Sandelin; David A Hume; Piero Carninci; Yoshihide Hayashizaki
Journal: Nature Date: 2014-03-27 Impact factor: 49.962

5. A global reference for human genetic variation.

Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

6. Negative selection maintains transcription factor binding motifs in human cancer.

Authors: Ilya E Vorontsov; Grigory Khimulya; Elena N Lukianova; Daria D Nikolaeva; Irina A Eliseeva; Ivan V Kulakovskiy; Vsevolod J Makeev
Journal: BMC Genomics Date: 2016-06-23 Impact factor: 3.969

7. Correspondence: DNA shape is insufficient to explain binding.

Authors: Matthew J Rossi; William K M Lai; B Franklin Pugh
Journal: Nat Commun Date: 2017-06-05 Impact factor: 14.919

8. Affinity regression predicts the recognition code of nucleic acid-binding proteins.

Authors: Raphael Pelossof; Irtisha Singh; Julie L Yang; Matthew T Weirauch; Timothy R Hughes; Christina S Leslie
Journal: Nat Biotechnol Date: 2015-11-16 Impact factor: 54.908

9. Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape.

Authors: Hanjun Dai; Ramzan Umarov; Hiroyuki Kuwahara; Yu Li; Le Song; Xin Gao
Journal: Bioinformatics Date: 2017-11-15 Impact factor: 6.937

10. TFBSTools: an R/bioconductor package for transcription factor binding site analysis.

Authors: Ge Tan; Boris Lenhard
Journal: Bioinformatics Date: 2016-01-21 Impact factor: 6.937

3 in total

1. Integrating genome-wide association study with regulatory SNP annotation information identified candidate genes and pathways for schizophrenia.

Authors: Xiao Liang; Sen Wang; Li Liu; Yanan Du; Bolun Cheng; Yan Wen; Yan Zhao; Miao Ding; Shiqiang Cheng; Mei Ma; Lu Zhang; Xin Qi; Ping Li; Xiong Guo; Feng Zhang
Journal: Aging (Albany NY) Date: 2019-06-07 Impact factor: 5.682

2. Genome-Wide Identification and Functional Characterization of GATA Transcription Factor Gene Family in Alternaria alternata.

Authors: Yanan Chen; Yingzi Cao; Yunpeng Gai; Haijie Ma; Zengrong Zhu; Kuang-Ren Chung; Hongye Li
Journal: J Fungi (Basel) Date: 2021-11-26

3. Biolayer interferometry for DNA-protein interactions.

Authors: John K Barrows; Michael W Van Dyke
Journal: PLoS One Date: 2022-02-02 Impact factor: 3.240

3 in total