Discovering protein-DNA binding sequence patterns using association rule mining.

Kwong-Sak Leung¹, Ka-Chun Wong, Tak-Ming Chan, Man-Hon Wong, Kin-Hong Lee, Chi-Kong Lau, Stephen K W Tsui.

Abstract

Protein-DNA bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) play an essential role in transcriptional regulation. Over the past decades, significant efforts have been made to study the principles of protein-DNA binding. However, it is generally accepted that there are no simple one-to-one rules between amino acids and nucleotides. Many methods impose complicated features beyond sequence patterns. Protein-DNA bindings are formed from associated amino acid and nucleotide sequence pairs, which determine many functional characteristics. Therefore, it is desirable to investigate associated sequence patterns between TFs and TFBSs. With increasing computational power, the availability of massive experimental databases on DNA and proteins, and mature data mining techniques, we propose a framework to discover associated TF-TFBS binding sequence patterns in the most explicit and interpretable form from TRANSFAC. The framework is based on association rule mining with the Apriori algorithm. The patterns found are evaluated by quantitative measurements at several levels on TRANSFAC. With further independent verification from the literature, the Protein Data Bank and homology modeling, there is strong evidence that the discovered patterns reveal real TF-TFBS bindings across different TFs and TFBSs, which can drive further research toward a better understanding of TF-TFBS bindings.

Year: 2010 PMID: 20529874 PMCID: PMC2965231 DOI: 10.1093/nar/gkq500

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

We first introduce protein–DNA bindings in this section. Existing bioinformatics methods are briefly described, followed by the layout of this article.

Protein–DNA binding

Protein–DNA binding plays a central role in genetic activities such as transcription, packaging, rearrangement, and replication (1,2). Therefore, it is very important to identify and understand protein–DNA bindings as the basis for further deciphering biological systems. We focus on protein–DNA bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs), which are the primary regulatory activities with abundant data support. TFs bind in a sequence-specific manner to TFBSs to regulate gene transcription (gene expression). The DNA binding domain(s) of a TF can recognize and bind to a collection of similar TFBSs, from which a conserved pattern called a motif can be obtained. TFBSs, the nucleotide fragments bound by TFs, are short (usually about 5–20 bp), lie in the cis-regulatory/intergenic regions and can be located at very different distances from the transcription start site. It is expensive and laborious to experimentally identify TF–TFBS binding sequence pairs, for example, using DNA footprinting (3) or gel electrophoresis (4). The technology of chromatin immunoprecipitation (ChIP) (5,6) measures the binding of a particular TF to DNA of co-regulated genes on a genome-wide scale in vivo, but at low resolution. Further processing is needed to extract precise TFBSs (7). TRANSFAC (8) is one of the largest and most representative databases for regulatory elements including TFs, TFBSs, nucleotide distribution matrices of the TFBSs and regulated genes. The data are expertly annotated and manually curated from peer-reviewed and experimentally proven publications. Other annotation databases of TF families and binding domains are also available [e.g. PROSITE (9), Pfam (10)]. It is even more difficult and time-consuming to extract high-resolution 3D TF–TFBS complex structures with X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopic analysis.
Nevertheless, the high-quality TF–TFBS binding structures provide valuable insights into verifications of putative principles of binding. The Protein Data Bank (PDB) (11) serves as a representative repository of such experimentally extracted protein–DNA (in particular TF–TFBS) complexes with high resolution at atomic levels. However, the available 3D structures are far from complete. As a result, there is strong motivation to have automatic methods, particularly, computational approaches based on existing abundant data, to provide testable candidates of novel TF domains and/or TFBS motifs with high confidence to guide and accelerate the wet-lab experiments.

Existing methods

The earliest computational methods related to TF–TFBS bindings discovered the motifs of TF domains and TFBSs separately. In addition, researchers have tried hard to generalize one-to-one binding codes from existing 3D structures. Data mining methods have also been proposed, using feature transformations and machine learning to decipher complicated binding rules. They are briefly described as follows:

Motif discovery

TF domain and TFBS sequences are somewhat conserved due to their functional similarity and importance. By exploiting conservation in the sequences, bioinformatics methods called motif discovery save some of the expensive and laborious laboratory experiments. Motif discovery (6) can be categorized into two types: (i) motif matching and (ii) de novo motif discovery. (i) Motif matching identifies putative TF domains (9,10) or TFBSs (12) based on motif knowledge obtained from annotated data. (ii) De novo motif discovery predicts conserved patterns without prior knowledge of their appearance, based on certain motif models and scoring functions (13,14), from a set of protein/DNA promoter sequences with similar regulatory functions. While de novo motif discovery is successful for well-conserved TF functional domain motifs, the counterpart for TFBSs remains very challenging, with poor performance on real benchmarks (6,15,16). A significant limitation of motif discovery is the lack of linkage between the binding counterparts for revealing TF–TFBS relationships.

One-to-one binding codes

Numerous studies have been carried out to analyze existing protein–DNA binding 3D structures comprehensively (2,17,18) or with a focus on specific families (1) [e.g. zinc fingers (19)]. Various properties have been discovered concerning, e.g. bonding and force types, TF conservation and mutation (1), and bending of the DNA (17). Some are already applicable to predicting binding amino acids on the TF side (20). However, annotated data are far from complete. Alternatively, researchers have searched hard for general binding ‘codes’ between proteins and DNA, in particular the one-to-one mapping between amino acids from TFs and nucleotides from TFBSs. Despite many proposed one-to-one binding propensity mappings (1,21,22), a consensus has emerged that there are no simple binding ‘codes’ (23).

Data mining

In the hope of better understanding protein–DNA bindings, many data mining approaches have also been proposed (24). Researchers transform additional detailed information, such as base compositions, structures, thermodynamic properties (25,26) and expression data (27), into sophisticated features that fit certain data mining techniques. Although some approaches may provide interpretable rules, most of them have stringent data requirements that cannot be met trivially. Existing data beyond sequences are also insufficient and limited for practitioners. These methods usually extract complicated features rather than working on interpretable data directly. Many data mining techniques, such as neural networks, support vector machines (SVM) (28) and regressions (24), may generate rules that are not trivial to interpret. Furthermore, many data mining approaches are based on specific families or particular data sets, so the generality of the results is limited. On the other hand, sequences are the most readily available primary data and carry important information for protein–DNA bindings (23). It is desirable to make use of the large-scale and comprehensive sequence data to mine explicit and interpretable TF–TFBS binding rules.

Article layout

In this article, we propose a framework based on association rule mining to discover protein–DNA binding sequence patterns from TRANSFAC. The article is laid out as follows: the proposed methods are presented in the ‘Materials and Methods’ section; experimental results and verifications are reported in the ‘Results and Analysis’ and ‘Verifications’ sections, respectively; and the approach is discussed in the ‘Discussion’ section.

MATERIALS AND METHODS

In this section, we propose a framework for mining, discovering and verifying TF–TFBS bindings on large-scale databases. The framework starts from data cleansing and transformation on TRANSFAC, and then applies association rule mining to discover TF–TFBS binding sequence patterns. Comprehensive 3D verifications and evaluations are carried out on PDB. Detailed bonding analysis is performed to provide strong support for the discovered rules. In the following subsections, the Apriori algorithm for association rule mining is first introduced. We then elaborate on how the algorithm is applied to protein–DNA binding pattern discovery. Finally, we present how the data are preprocessed for the task, with a running example.

Association rule mining and Apriori Algorithm

Association rule mining (29) aims to discover sets of items that frequently co-occur, called frequent itemsets, across a large number of data samples; an itemset is frequent if its count reaches a threshold called the minimum support (30). The support of an itemset is defined as the number of data samples in which all the items in the itemset co-occur. In the case of protein–DNA binding, the binding domains of TFs can recognize and form strong bonds with certain sequence-specific patterns of the TFBSs. Therefore, they are likely to co-occur frequently among the combinations of all possible TF and TFBS subsequences, and can thus be identified by association rule mining. In this study, we use the notation k-mer (a subsequence of k amino acid or nucleotide residues) to represent a candidate item. A frequent TF–TFBS itemset is a pair consisting of a TF k-mer and a TFBS k-mer (the two k's can differ), or simply a pair, co-occurring with a frequency no less than the minimum support in the TF–TFBS sequence records (TRANSFAC database).

The Apriori algorithm proposed by Agrawal et al. (29) is a classical approach to finding frequent itemsets. It is outlined in Algorithm 1 in Appendix 1. It is a branch and bound algorithm for discovering association rules in a database. Its downward closure property (every subset of a frequent itemset is itself frequent) allows candidates containing an infrequent subset to be pruned without losing any frequent itemset. The algorithm first obtains the frequent 1-itemsets. Iteratively, it uses the frequent n-itemsets (itemsets with n items) to generate all possible candidate (n+1)-itemsets, which are then evaluated for their supports (30). If the support of an (n+1)-itemset is lower than the threshold, the (n+1)-itemset is removed; the remaining (n+1)-itemsets are the frequent (n+1)-itemsets. The procedure is repeated until no frequent itemsets remain.
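The level-wise procedure described above can be illustrated with a minimal Python sketch. This is not the authors' Algorithm 1; the function name `apriori`, the representation of each data sample as a set of item strings, and the toy samples in the usage note are assumptions made for illustration only.

```python
from itertools import combinations

def apriori(samples, min_support):
    """Return {itemset: support} for every frequent itemset (illustrative sketch)."""
    samples = [frozenset(s) for s in samples]

    def support(itemset):
        # Number of data samples in which all items of the itemset co-occur.
        return sum(1 for s in samples if itemset <= s)

    # Frequent 1-itemsets.
    current = {}
    for item in {i for s in samples for i in s}:
        single = frozenset([item])
        count = support(single)
        if count >= min_support:
            current[single] = count
    frequent = dict(current)

    n = 1
    while current:
        # Join frequent n-itemsets into candidate (n+1)-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            # Downward closure: keep a candidate only if all n-subsets are frequent.
            if len(union) == n + 1 and all(
                frozenset(sub) in current for sub in combinations(union, n)
            ):
                candidates.add(union)
        # Evaluate supports and drop candidates below the threshold.
        current = {}
        for cand in candidates:
            count = support(cand)
            if count >= min_support:
                current[cand] = count
        frequent.update(current)
        n += 1
    return frequent
```

For example, with the made-up samples `[{"AGGTC", "CEGCK"}, {"AGGTC", "CEGCK", "GCKGF"}, {"AGGTC", "CEGCK"}, {"AAACA", "HNLSL"}]` and a minimum support of 2, the sketch keeps the two single items AGGTC and CEGCK and the pair containing both.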

Discovering associated TF–TFBS sequence patterns

To formulate the TF–TFBS sequence pattern discovery problem as association rule mining, we have to transform the protein–DNA binding records into itemsets of k-mers. An illustrative example of the TF–TFBS binding records from TRANSFAC 2008.3 is shown in Figure 1. The TF (e.g. T01333 RXR-γ) can bind to several TFBS DNA sequences. The DNA sequences may differ in length due to experimental methods and noise. Both the TF and TFBS sequences are chopped into overlapping short k-mers, as illustrated in Figure 2 (first part). Together with the corresponding reverse complements (e.g. GACCT and its reverse complement AGGTC), they form one data sample. To generate the itemsets, all the k-mers are recorded in a binary array in which k-mers that appear are marked 1, and 0 otherwise. Thus, the length of the array depends on the number of all possible TF k-mers and TFBS k-mers (Figure 2, second part). Since k is usually short (4–6), all the possible 4^k combinations of TFBS DNA k-mers can be adopted. However, it is computationally infeasible to enumerate all the possible 20^k combinations of TF k-mers. Thus a data-driven approach is employed, scanning the whole of TRANSFAC to obtain frequent TF amino acid k-mers.
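As a rough illustration of this transformation, the following Python sketch chops a sequence into overlapping k-mers, adds reverse complements for DNA, and marks the k-mers in a binary array. The helper names (`reverse_complement`, `kmers`, `dna_items`, `binary_row`) and the example sequence are hypothetical and are not taken from the paper.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return dna.translate(COMPLEMENT)[::-1]

def kmers(seq, k):
    """Set of all overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def dna_items(tfbs, k):
    """k-mers of a TFBS sequence together with their reverse complements."""
    forward = kmers(tfbs, k)
    return forward | {reverse_complement(m) for m in forward}

def binary_row(present, vocabulary):
    """Binary array: 1 where a vocabulary k-mer is present, 0 otherwise."""
    return [1 if item in present else 0 for item in vocabulary]

# All 4^k DNA k-mers can be enumerated for small k (here k = 5, 1024 items).
dna_vocab = ["".join(p) for p in product("ACGT", repeat=5)]

# Made-up TFBS fragment, used only to show the shape of the output.
row = binary_row(dna_items("AGGTCA", 5), dna_vocab)
```

The TF side would use the same `kmers` helper, but with a vocabulary collected from the data rather than enumerated, since 20^k grows too quickly.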

Figure 1.

TFBS sequences of a TF (TRANSFAC 2008.3 ID: T01333).

Figure 2.

Flowchart of the proposed framework to discover association rules from TRANSFAC.

Since there are multiple TFBSs for each TF (e.g. Figure 1), a question arises: how should the ‘commonly found’ TFBS k-mers of a TF be defined? Without loss of generality, the majority rule (31) is applied. If the majority of a TF's TFBS sequences contain a certain DNA residue k-mer, then the k-mer is considered ‘commonly found’. We set the majority to be 50% for TFBS k-mers. We only count the number of TFBS sequences in which a certain k-mer appears, so as not to be biased by multiple occurrences of the k-mer in only a few TFBS sequences. Figure 1 illustrates an example with five TFBS sequences. The TFBS DNA k-mer AGGTC (or its reverse complement GACCT) can be found in three of the TFBS sequences. The k-mer appears in 60% (3/5) of the TFBS sequences of the TF, and thus is considered ‘commonly found’. On the other hand, GTTCA is not considered ‘commonly found’ because it appears in only 2 (40%) of the 5 TFBS sequences of the TF. After all valid TF data samples are transformed into itemsets, the Apriori algorithm is applied to generate frequent TF–TFBS k-mer sequence patterns (the links in Figure 2, second part). The special feature of this study is that the co-occurring pairs should contain both TF and TFBS k-mer items, as illustrated in the third part of Figure 2. In the current study, we only consider one TF k-mer with one TFBS k-mer in the frequent itemsets, but it is straightforward in principle to generalize to multiple TF and TFBS k-mers. The heavy computational cost of this generalization, when applied to the large TRANSFAC database, prevents us from doing so at this time.
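The majority rule can be sketched in Python as follows. The five sequences are invented stand-ins chosen to reproduce the 3/5 and 2/5 counts of the example; they are not the actual Figure 1 sequences. Treating a fraction of exactly 50% as passing is also an assumption, since the text does not say whether the threshold is inclusive.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def items_of(seq, k):
    """k-mers of a DNA sequence plus their reverse complements."""
    forward = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return forward | {m.translate(COMPLEMENT)[::-1] for m in forward}

def commonly_found(tfbs_sequences, k, majority=0.5):
    """k-mers present in at least `majority` of a TF's TFBS sequences.

    Each sequence contributes at most once per k-mer, so repeated
    occurrences inside a single sequence do not inflate the count.
    The inclusive comparison (>=) is an assumption of this sketch.
    """
    counts = {}
    for seq in tfbs_sequences:
        for item in items_of(seq, k):
            counts[item] = counts.get(item, 0) + 1
    total = len(tfbs_sequences)
    return {item for item, c in counts.items() if c / total >= majority}

# Hypothetical sequences: AGGTC/GACCT occurs in three, GTTCA in two.
example = ["TTAGGTCAA", "CCGACCTGG", "AGGTCGTTCA", "TTGTTCATT", "CCCCCCCCC"]
common = commonly_found(example, 5)
```

With these made-up sequences, AGGTC and GACCT pass the 50% threshold (3/5), whereas GTTCA (2/5) does not.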
Finally, the association rules are computed based on the confidence measurements for the frequent itemsets, which are defined as follows:

$$\mathrm{conf}(k\text{-mer}_{DNA} \Rightarrow k\text{-mer}_{AA}) = \frac{\mathrm{support}(k\text{-mer}_{DNA} \cup k\text{-mer}_{AA})}{\mathrm{support}(k\text{-mer}_{DNA})}$$

$$\mathrm{conf}(k\text{-mer}_{DNA} \Leftarrow k\text{-mer}_{AA}) = \frac{\mathrm{support}(k\text{-mer}_{DNA} \cup k\text{-mer}_{AA})}{\mathrm{support}(k\text{-mer}_{AA})}$$

where conf(k-merDNA ⇒ k-merAA) is called the forward confidence, conf(k-merDNA ⇐ k-merAA) is called the backward confidence and support(X) is the support of itemset X. For each association rule, its forward confidence measures the posterior probability that the corresponding amino acid k-mer can be found in a TF's sequence if the DNA k-mer is commonly found in the TF's TFBS sequences. Its backward confidence measures the posterior probability that the corresponding DNA k-mer can be commonly found in a TF's TFBS sequences if the amino acid k-mer is found in the TF's sequence. The minimum of the two is taken as the confidence in this article. The higher the confidence, the better the association rule (Figure 2, fourth part). The whole proposed approach is summarized in Figure 2.
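A small Python sketch of the two confidences and their minimum is given below. It assumes each TF data set has already been reduced to a set of item strings, and that both k-mers occur at least once; the function name `confidences` and the toy records in the usage note are illustrative assumptions.

```python
def confidences(records, dna_kmer, aa_kmer):
    """Forward confidence, backward confidence and their minimum for one pair.

    `records` is a list of sets of items, one set per TF data set.
    Assumes both k-mers occur in at least one record (no zero division guard).
    """
    n_dna = sum(1 for r in records if dna_kmer in r)
    n_aa = sum(1 for r in records if aa_kmer in r)
    n_both = sum(1 for r in records if dna_kmer in r and aa_kmer in r)
    forward = n_both / n_dna   # conf(DNA k-mer => AA k-mer)
    backward = n_both / n_aa   # conf(DNA k-mer <= AA k-mer)
    return forward, backward, min(forward, backward)
```

As a made-up usage example, if a pair co-occurs in 4 records, the DNA k-mer occurs in 5 records and the amino acid k-mer in 8, the forward confidence is 0.8, the backward confidence is 0.5, and the confidence reported for the pair is 0.5.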

Data preparation

To apply the methodology to TRANSFAC, TF and TFBS data were downloaded and extracted from the flat files of TRANSFAC 2008.3 [a free public (older) version is also available (http://www.gene-regulation.com/pub/database.html)]. Entries without sequence data were discarded. Since a TF can bind to one or more TFBSs, TFBS data were grouped by TF. TFBS sequences were extracted for each TF to form a TF data set (a TF sequence and the corresponding TFBS sequences), which was finally transformed into itemsets. To avoid sampling error, TF data sets with fewer than five TFBS sequences were discarded. Furthermore, redundancy among TF sequences was removed by BLASTClust using 90% TF sequence identity (32). Only one TF data set was selected for each cluster. Note that we only used sequence data in TRANSFAC. No prior information other than sequences (e.g. the binding domains of TFs) was used. Importantly, it turns out that the results of the proposed approach can be verified by annotations, 3D structures from PDB and even homology modeling, as described in the subsequent sections. After data preparation, 631 TF data sets (listed in Table 5 in Appendix 1) were selected. The minimum support (30) was set to seven TF data sets to avoid sampling error. For the values of k, we tried 4–6 for both TF k-mers and TFBS k-mers, resulting in 9 (3 × 3) different combinations. In particular, 256 DNA 4-mers, 1024 DNA 5-mers and 4096 DNA 6-mers were adopted for TFBSs, whereas 99 621 amino acid 4-mers, 82 561 amino acid 5-mers and 39 320 amino acid 6-mers were adopted for TFs, as the frequent 1-itemsets.

Table 5.

631 TRANSFAC 2008.3 IDs and factor names used in this article

ID	Factor name	ID	Factor name	ID	Factor name	ID	Factor name	ID	Factor name	ID	Factor name
T00003	AS-CT3	T00842	Tra-1(long form)	T01950	HNF-1α-B	T04378	Mad	T08676	STAT6	T09986	NF-AT4
T00008	Adf-1	T00843	Ttk69K	T01951	HNF-1α-C	T04446	Nkx5-1	T08787	ARF1isoform-1	T09990	CDP-isoform1
T00011	ADR1	T00851	T3R-β1	T01973	REST-form2	T04539	RPN4	T08797	MCB1	T10028	SREBP-1c
T00019	AhR	T00863	Ubx	T01992	Abd-A	T04610	SXR	T08805	WRKY1	T10030	POU3F2
T00026	Antp	T00886	v-ErbA	T02003	Cdx-3	T04651	ER-β	T08823	E2F	T10059	GCMa
T00028	YAP1	T00891	HNF-1β-A	T02008	Ems	T04665	Xvent-1	T08853	myogenin	T10068	COUP-TF2
T00033	AP-2α	T00893	v-Jun	T02030	Sd	T04674	IRF-7A	T08858	REVERB-α	T10083	HNF-3α
T00063	Bcd	T00894	Vmw65	T02033	HsfA1	T04675	MRF-2-isoform1	T08863	S8	T10144	Gfi1b
T00077	CACCC-binding factor	T00895	v-Myb	T02039	HAC1	T04679	dri	T08868	CTCF	T10187	NF-E2p45
T00079	Cad	T00899	WT1	T02050	Nkx6-2	T04728	CDC5L	T08878	Opaque-2	T10207	GATA-6
T00080	CBF1	T00910	YB-1	T02054	HOX11	T04733	Alfin1	T08972	EAR2	T10209	Nkx2-1
T00104	C/EBPα	T00915	YY1	T02063	KNOX3	T04734	Topors-isoform1	T08972	EAR2	T10211	Evi-1
T00106	C/EBP	T00917	Zen-1	T02068	PU.1	T04783	mtTFA	T08985	Pti4	T10265	LRH-1
T00109	C/EBPδ	T00918	Zeste	T02099	Zen-2	T04784	PF1	T08989	Fra-1	T10276	Erm
T00112	c-Ets-1	T00923	Zta	T02100	Zeste	T04811	FOXP1a	T08994	HIF-1α-isoform1	T10282	Otx2
T00113	c-Ets-2	T00925	AMT1	T02128	SAP-1b	T04817	LIM1	T09001	BPC1	T10317	IA-1
T00115	c-Ets-168	T00937	HBP-1a	T02142	OCA-B	T04819	EmBP-1a	T09018	N-Myc	T10331	NRF-1
T00117	CF1	T00938	HBP-1b	T02216	TFIIA-α/β precursor (major)	T04886	Tel-2b	T09033	TEF-1	T10392	GATA-3
T00120	CF2-II	T00969	POU3F1	T02217	TFIIA-α/β precursor (minor)	T04931	p73α	T09051	AhR	T10393	GATA-2
T00128	HOXA4	T01005	MEF2A-isoform1	T02235	PEBP2αB1	T04957	EKLF	T09059	SEF2-1B	T10429	PU.1
T00140	c-Myc	T01017	CRE-BP2	T02248	StuAp	T04961	GLI2α	T09071	AG	T10459	Alx-3
T00151	CP2a	T01019	Elf-1	T02256	AML1a	T04996	ZBP89	T09089	PIF3	T10462	Prop-1
T00163	CREB	T01027	BAS1	T02288	HFH-1	T04998	Tel-2a	T09093	IPF1	T10473	TEF-5
T00167	ATF-2-xbb4	T01035	Isl-1α	T02290	FOXD3	T04999	Tel-2c	T09097	SRY	T10482	AP-2γ
T00176	CTF-1	T01051	FOXA4a	T02291	Croc	T05021	NERF-1a	T09098	SREBP-2	T10484	TEF-3
T00177	CTF-2	T01053	HNF-3β	T02294	FOXI1a	T05051	BTEB3	T09102	FOXO4	T10543	Sox5
T00179	CUP2	T01059	MNB1a	T02302	GCM	T05137	CIZ6-1	T09106	RelA-p65	T10573	DREB1A
T00183	DBP	T01072	TEF	T02313	MIBP1	T05181	DSF	T09117	E2F-1	T10588	Snai3
T00193	Dfd	T01074	Ap	T02330	G/HBF-1	T05553	MYBAS1	T09129	BCL-6	T10638	HY5
T00204	E12	T01078	GBF1	T02361	CREBβ	T05587	BZI-1	T09156	TGIF-isoform2	T10644	MTF-1
T00208	E74A	T01083	NF-μNR	T02378	USF1	T05682	ERRα1	T09158	BZR1	T10664	Gfi1
T00217	EcR	T01085	abaA	T02419	Sp3	T05705	GATA-1	T09159	PITX2A	T10666	SRY
T00253	En	T01109	TCF-1(P)	T02420	Sox13	T05706	GATA-2	T09162	Pax-3	T10674	MafK
T00262	ER-α	T01112	EBF1-L	T02422	HNF-4α2	T05707	GATA-3	T09177	MyoD	T10712	DMRT1
T00264	ER-α	T01147	SF-1isoform2	T02429	HNF-4α1	T05708	GATA-4	T09178	C/EBPα	T10720	GCR1
T00272	Eve	T01152	T3R-α1	T02463	GBF1	T05737	PCF3	T09182	Pax-5	T10721	DMRT2
T00295	Ftz	T01154	c-Rel	T02469	AP-2β	T05743	ABI4	T09183	WRKY53	T10723	DMRT3
T00296	FTZ-F1	T01258	MSN4	T02529	PPARγ1	T05770	DREB1A	T09184	Pax-8	T10725	DMRT7
T00301	GAGA factor	T01265	MAC1	T02636	CBF1	T05834	CBF2	T09190	AGL15	T10727	DMRT4
T00302	GAL4	T01274	ABF2	T02639	ANT	T05835	DRF1.1	T09194	NF-AT1C	T10731	DMRT5
T00303	GAL80	T01275	mat1-Mc	T02654	ERF2	T05837	DRF1.3	T09195	SPL14	T10739	MRP1
T00315	GBF	T01286	ROX1	T02669	EmBP-1a	T05929	SUSIBA2	T09196	HSF2A	T10745	HSFA2
T00329	Glass	T01313	ATF3	T02672	GBF1	T05943	FOXP1d	T09199	STAT5A	T10747	MTF-1
T00330	GLI1	T01333	RXR-γ	T02690	Dof2	T05975	E2F1	T09218	Msx-1	T10754	ABF1
T00331	GLI3	T01346	Arnt	T02691	Dof3	T05977	PEND	T09225	En-1	T10760	HAP1
T00337	GR-α	T01350	T3R-β2	T02772	GCNF	T05982	POTH1	T09226	Lhx2	T10795	C/EBPγ
T00349	HAP2	T01352	PPARα	T02786	RITA-1	T06004	DeltaNp63α	T09230	Prep1	T10849	STB5
T00350	HAP3	T01388	C/EBP	T02789	bZIP910	T06029	Sox17	T09243	MafG	T10854	GCN4
T00368	HNF-1α-A	T01400	Ets-1deltaVII	T02790	bZIP911	T06043	AGP1	T09287	MITF-A2	T10881	TRAB1
T00377	HOXA5	T01422	ste11	T02807	OSBZ8	T06137	p73β	T09304	Smad4	T10928	TGA2
T00383	HSF	T01427	p300	T02809	ROM1	T06168	p63α	T09319	IRF-1	T10958	ATHB-2
T00385	HSF1	T01431	c-Maf (long form)	T02810	ROM2	T06341	BEL5	T09323	IRF-1	T10959	PCF1
T00386	HSTF	T01470	Ik-2	T02818	GLN3	T06356	Rim101p	T09343	SRF	T10960	PCF2
T00395	Hb	T01471	Ik-3	T02825	gaf2	T06404	WRKY38	T09355	Alx-4	T11115	ZIC1
T00401	ICP4	T01476	Abd-B	T02841	FACB	T06429	HIC-1-isoform2	T09356	HOXA3	T11136	DEC2
T00445	KNIRPS	T01477	BR-CZ1	T02846	UAY	T06532	NAC69-1	T09383	GABP-α	T11158	HELIOS-B
T00456	Kr	T01478	BR-CZ2	T02878	TCF-4E	T06533	MYB80	T09424	WRKY2	T11164	FOXJ1
T00458	LAC9	T01479	BR-CZ3	T02897	Sox6-Isoform1	T06537	Ci	T09426	Sp3-isoform1	T11166	FOXF1
T00459	C/EBPβ(LAP)	T01480	BR-CZ4	T02905	LEF-1	T08158	ABZ1	T09427	RAP-1-xbb1	T11180	Gli1
T00480	MAL63	T01481	Pbx1a	T02907	MYB305	T08251	FBI-1	T09431	Sp1	T11200	DEC1
T00487	MATα2	T01482	Exd	T02929	MYB340	T08252	NF-AT3	T09441	RBP-Jκ	T11217	Gzf1
T00488	MATa1	T01484	Cdx-1	T02936	FOXO1	T08279	USF1	T09444	CPRF-3	T11246	ZIC2
T00489	Max-isoform2	T01492	STAT1α	T02983	Pax-4a	T08291	GATA-1	T09449	CPRF-2	T11250	Brachyury
T00490	MAZ	T01517	Twi	T02999	OCSBF-1	T08292	GATA-1isoform1	T09450	CPRF-1	T11256	GCMb
T00497	MBP-1(1)	T01527	RORα1	T03031	Pax-2.1	T08293	GATA-1	T09462	Egr-1	T11258	GCMa
T00500	MCM1	T01528	RORα2	T03178	SQUA	T08298	Kaiso	T09478	TGA1a	T11310	MafA
T00509	MIG1	T01556	SREBP-1a	T03227	CAT8	T08300	ER-α-L	T09507	Sox-xbb1	T11372	HOXB8
T00529	MZF1B-C	T01590	P (long form)	T03256	HNF-3β	T08313	USF2a	T09514	HTF4γ	T11383	HOXD13
T00535	NF-1	T01592	C1 (long form)	T03258	HNF-6β	T08318	Elf-1	T09531	ATF-4	T11390	Cart-1
T00594	RelA-p65	T01599	LCR-F1	T03388	Meis-1a	T08319	Zec	T09540	c-Krox	T11394	PR-β
T00625	ZEB(1124AA)	T01615	Su	T03389	Meis-1b	T08321	p53-isoform1	T09548	IRF-3	T11402	Crx
T00627	NIT2	T01649	HES-1	T03447	LHX3b	T08323	p53	T09561	Roaz	T11425	Chx10
T00642	POU2F1	T01660	PR-α	T03481	SKN7	T08340	Egr-2	T09569	Hlf	T11440	FAC1-xbb1
T00644	POU2F1a	T01661	PRA	T03491	MED8	T08348	RXR-α	T09571	MYB1	T11453	TAF-1
T00651	POU5F1	T01664	TR2-11	T03500	MOT3	T08358	GATA-4	T09588	E4BP4	T13753	HsfB1
T00653	POU5F1(Oct-5)	T01667	RFX2	T03524	PDR1	T08409	GAMYB	T09608	Kid3	T13760	ABF1
T00669	Ovo-B	T01669	RFX2	T03525	PDR3	T08410	PBF	T09623	ATF6	T13794	TGA1
T00677	Pax-1	T01670	RFX3	T03538	RCS1	T08411	SED	T09629	MYBJS1	T13809	AGL2
T00689	PHO2	T01671	RFX3	T03541	RFX1	T08415	CBT	T09635	AP1	T13810	Dof4
T00690	PHO4	T01673	RFX1	T03556	RGT1	T08431	PPARα	T09649	cel-let-7	T13811	AGL3
T00691	Pit-1A	T01675	Nkx2-5	T03593	Pax-9a	T08441	Sox10	T09701	cel-miR-84	T14002	GKLF
T00696	PRB	T01679	PacC	T03594	Pax-9b	T08445	Elk-1-isoform1	T09706	hsa-let-7a	T14118	ASR-1
T00697	PRB	T01692	T3R-β1	T03600	SIP4	T08466	c-Jun	T09707	hsa-let-7b	T14187	AIRE-isoform1
T00699	Prd	T01705	HOXA7	T03612	NK-4	T08475	GR-α	T09718	hsa-miR-23a	T14230	WRKY40
T00709	qa-1F	T01710	HoxA-9	T03707	XBP1	T08482	VDR	T09727	hsa-miR-103	T14231	RP58
T00710	R	T01735	HOXB7	T03717	ZAP1	T08487	AR	T09729	hsa-miR-107	T14234	WRKY18
T00715	RAP1	T01737	HOXB8	T03718	WRKY1	T08492	LRH-1-xbb1	T09731	dme-miR-2a	T14258	Nkx3-2
T00719	RAR-α1	T01755	HOXD9	T03722	ZAP1	T08493	c-Fos	T09732	dme-miR-2b	T14268	MIZF
T00725	REB1	T01757	HOXD10	T03975	SPF1	T08505	COUP-TF1	T09737	dme-miR-7	T14302	C1-Myb
T00731	RME1	T01784	MEF-2A	T03994	ID1	T08520	TBP	T09741	dme-miR-13a	T14317	Myb-15
T00737	SAP-1a	T01786	E12	T04001	ATHB-9	T08528	AR	T09742	dme-miR-13b	T14381	ATHB-1
T00746	SGF-3	T01799	Tal-1	T04096	Smad3	T08544	MOVO-B	T09793	dme-let-7	T14382	ATHB-5
T00751	Sn	T01814	Pax-6/Pd-5a	T04146	HLTF	T08546	Ovo1a	T09806	hsa-miR-1	T14442	STF1
T00761	SRF	T01823	Pax-2	T04166	FOXD3	T08571	GATA-2	T09810	hsa-miR-124a	T14444	TGA1
T00763	SRF	T01838	Sox4	T04169	FOXJ2 (long isoform)	T08577	ZBRK1	T09812	hsa-miR-130a	T14447	PBF
T00767	Sry-δ	T01841	WT1-del2	T04176	FOXO4	T08580	STAT3	T09819	hsa-miR-125a	T14485	XBP-1
T00769	Sry-β	T01851	HMGI	T04255	Nkx3-1	T08583	CCA1	T09824	hsa-miR-206	T14491	CBNAC
T00776	SWI5	T01865	Oct-2.3	T04280	FOXP3	T08584	LHY	T09840	hsa-miR-130b	T14517	Zic3
T00788	T-Ag	T01866	Oct-2.4	T04297	Nkx6-1	T08613	ZNF219	T09880	dre-miR-430a	T14521	ZF5
T00789	Tll	T01867	Oct-2.6	T04312	NURR1-isoform1	T08615	PLZFB	T09892	c-Myb-isoform1	T14517	Zic3
T00798	TBP	T01882	unc-86	T04323	Nkx2-5	T08619	WEREWOLF	T09914	SF-1	T14573	FUS3
T00810	TFE3-L	T01888	POU6F1(c2)	T04324	DREF	T08621	HAHB-4	T09923	RREB-1	T14681	Spz1
T00812	TFEB-isoform1	T01897	Cf1a	T04336	Nkx2-8	T08624	Sox9	T09942	HNF-3β	T14827	DEAF-1
T00814	TFE3-S	T01900	PDM-1	T04337	Nkx2-2	T08630	CAR	T09949	FOXC1	T14951	Ncx
T00830	TGA1b	T01944	NF-AT1	T04345	TBX5-L	T08667	SZF1-1	T09960	TR4	T14954	OG-2
										T14992	Pitx3

Apriori algorithm was then applied to discover frequently co-occurring TF–TFBS k-mer pairs (2-itemsets). Finally, the resultant pairs were rescanned in TRANSFAC to measure their forward and backward confidences (33).

RESULTS AND ANALYSIS

In this section, the discovered rules are reported, followed by analysis with different measurements.

Rules discovered

Varying k from 4 to 6 for both TF k-mers and TFBS k-mers, we obtained nine sets of associated pairs. For each set, the forward and backward confidences of each pair were calculated. Then, the pairs in the same set were sorted in descending order by the minimum of their forward and backward confidences. The nine sets of rules (pairs) exhibit a similar trend: the number of rules decreases as the association criterion becomes more stringent (with higher confidence levels). The TFBS 5-mer settings in general retain the most rules when the confidence level is high (≥0.5), indicating more conserved and significant results. Therefore, we focus on them and use TFBS 5-mer–TF 5-mer as the representative example throughout the article. The results for all other settings are available in the Supplementary Data. The number of rules (pairs) discovered is summarized in Table 1. For instance, there are 70 TF 5-mer–TFBS 5-mer pairs without any further removal (in the N column) with both forward and backward confidences ≥0.5. Considering direct and reverse complement TFBS DNA k-mers as equivalent, we further removed the duplicated pairs (e.g. leaving AGGTC–CEGCK and removing GACCT–CEGCK because AGGTC and GACCT are reverse complements). The results are shown in the N′ column in Table 1. For instance, the 70 TF 5-mer–TFBS 5-mer pairs were reduced to 35 at a confidence level of 0.5. Furthermore, we found that most pairs could be merged to form a longer pair. For instance, GGTCA–SGYHY and GGTCA–GYHYG could be merged to form the pair GGTCA–SGYHYG. The pairs were therefore merged, and the resulting rule numbers are shown in the Nm column in Table 1. For instance, the 35 TF 5-mer–TFBS 5-mer pairs are merged into 11 pairs at a confidence level of 0.5.
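The two post-processing steps above can be sketched in Python. The text does not spell out the exact merging procedure, so the overlap criterion below (two TF k-mers that share k−1 residues and pair with the same DNA k-mer) is an assumption, as are the function names `canonical`, `dedup` and `merge_overlap`.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(dna):
    """Pick one representative of a DNA k-mer and its reverse complement."""
    rc = dna.translate(COMPLEMENT)[::-1]
    return min(dna, rc)  # alphabetically smaller strand (arbitrary convention)

def dedup(pairs):
    """Treat (DNA, AA) pairs on either strand as the same pair."""
    return sorted({(canonical(dna), aa) for dna, aa in pairs})

def merge_overlap(left, right):
    """Merge two k-mers overlapping by k-1 residues, else return None."""
    if left[1:] == right[:-1]:
        return left + right[-1]
    return None

unique = dedup([("AGGTC", "CEGCK"), ("GACCT", "CEGCK")])
merged = merge_overlap("SGYHY", "GYHYG")
```

Here `unique` keeps only AGGTC–CEGCK, and `merged` is SGYHYG, matching the two examples in the text.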

Table 1.

Number of the TFBS 5-mer–TF 5-mer pairs across different confidence levels

Confidence	N	N′	Nm	S
0.0	262	131	29	9.88 ± 3.68
0.1	262	131	29	9.88 ± 3.68
0.2	240	120	24	10.14 ± 3.73
0.3	180	90	23	10.63 ± 4.11
0.4	126	63	21	11.40 ± 4.59
0.5	70	35	11	13.63 ± 5.05
0.6	24	12	8	15.08 ± 5.28
0.7	6	3	2	10.33 ± 2.36
0.8	0	0	0	N/A
0.9	0	0	0	N/A
1.0	0	0	0	N/A

N, number of pairs; N′, number of pairs (duplicated pairs removed); Nm, number of merged pairs; S, mean and SD of the support of the pairs.

Quantitative analysis

The support of each pair, i.e. the number of TF data sets supporting it, was counted. In general, the average support increases with the confidence level, as shown for the TFBS 5-mer–TF 5-mer pairs in the S column of Table 1. The overall results are summarized in Supplementary Table S4. Support measures the degree of co-occurrence between a TF amino acid k-mer and a TFBS DNA k-mer. Forward and backward confidences account for the cases in which one of the two is absent. The remaining case, in which both are absent, is not covered by these measurements. To take it into account, ϕ-coefficients (35) were measured for each pair, as shown in the ϕ column in Table 2. The overall results are summarized in Supplementary Table S5. Most values are >0.4, indicating that positive correlations exist between the two k-mers of the pairs.
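For reference, the ϕ-coefficient of a pair can be computed from a 2 × 2 contingency table of TF data sets. The sketch below assumes the standard definition of the ϕ-coefficient; the function name, argument names and counts are illustrative and not taken from the paper.

```python
from math import sqrt

def phi_coefficient(n_both, n_dna_only, n_aa_only, n_neither):
    """Phi coefficient of a 2x2 table of presence/absence counts.

    n_both:     data sets containing both k-mers
    n_dna_only: data sets containing only the DNA k-mer
    n_aa_only:  data sets containing only the amino acid k-mer
    n_neither:  data sets containing neither (the 'both absent' case)
    Assumes no row or column total is zero.
    """
    numerator = n_both * n_neither - n_dna_only * n_aa_only
    denominator = sqrt(
        (n_both + n_dna_only)       # DNA k-mer present
        * (n_aa_only + n_neither)   # DNA k-mer absent
        * (n_both + n_aa_only)      # AA k-mer present
        * (n_dna_only + n_neither)  # AA k-mer absent
    )
    return numerator / denominator

phi = phi_coefficient(4, 1, 4, 11)
```

Unlike support and confidence, the value depends on `n_neither`, which is how the case of both k-mers being absent enters the measurement.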

Table 2.

Quantitative measurements for the TFBS 5-mer–TF 5-mer pairs across different confidence levels

Confidence	ϕ	L	FC	BC
0.0	0.49 ± 0.11	17.92 ± 7.34	1.89 ± 0.67	3.50 ± 2.29
0.1	0.49 ± 0.11	17.92 ± 7.34	1.89 ± 0.67	3.50 ± 2.29
0.2	0.51 ± 0.11	18.32 ± 7.46	1.94 ± 0.68	3.51 ± 2.30
0.3	0.54 ± 0.10	19.81 ± 7.79	2.02 ± 0.64	3.46 ± 2.31
0.4	0.58 ± 0.09	21.41 ± 8.53	2.23 ± 0.66	3.61 ± 2.40
0.5	0.64 ± 0.07	22.57 ± 10.46	2.49 ± 0.70	4.35 ± 2.65
0.6	0.71 ± 0.06	25.80 ± 13.76	3.33 ± 0.57	4.21 ± 2.55
0.7	0.79 ± 0.03	42.07 ± 14.87	3.70 ± 0.29	4.87 ± 0.00
0.8	N/A	N/A	N/A	N/A
0.9	N/A	N/A	N/A	N/A
1.0	N/A	N/A	N/A	N/A

ϕ, mean and SD of ϕ-coefficient; L, mean and SD of lift; FC, mean and SD of forward conviction; BC, mean and SD of backward conviction.

Consider the following scenario: if a TFBS DNA k-mer and a TF amino acid k-mer are both frequently found in the data sets, it is very likely that they co-occur frequently merely by chance. Forward and backward confidences already play an important role in pruning such pairs. For clarity, however, lift (36), the ratio of the actual support to the expected support, was also measured for each pair, where the expected support was calculated from the random model in which the TFBS DNA k-mer is independent of the TF amino acid k-mer. For instance, the average lift for the TFBS 5-mer–TF 5-mer pairs is shown in the L column in Table 2. The overall results are summarized in Supplementary Table S6. Most values of the lift are >5. Thus the DNA residue k-mer and the amino acid residue k-mer of most pairs co-occur at least five times more frequently than predicted under the independence assumption made by the lift measurement. To estimate the validity of the pairs, both forward and backward convictions (in the same directions as the forward and backward confidences, respectively) (36) were measured for each pair. The measurements were averaged for each set of pairs. For instance, the average forward and backward convictions for the TFBS 5-mer–TF 5-mer pairs are shown in the FC and BC columns in Table 2. The overall results are summarized in Supplementary Tables S7 and S8. Most values are >1. The pairs therefore commit fewer errors than predicted under the statistical independence assumption made by the conviction measurements. In other words, the pairs would have committed more errors if the association between the TFBS k-mer and the TF k-mer had happened purely by chance.
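Lift and conviction can be computed from the same counts as the confidences. The sketch below assumes the standard definitions (lift as the observed co-occurrence over that expected under independence, and conviction of a rule A ⇒ B as (1 − P(B)) / (1 − conf(A ⇒ B))); the function names and counts are illustrative assumptions.

```python
def lift(n_total, n_dna, n_aa, n_both):
    """Ratio of actual support to the support expected under independence."""
    expected = n_dna * n_aa / n_total
    return n_both / expected

def conviction(n_total, n_antecedent, n_consequent, n_both):
    """Conviction of the rule antecedent => consequent.

    Returns infinity when the rule never fails (confidence of 1).
    """
    confidence = n_both / n_antecedent
    if confidence == 1:
        return float("inf")
    return (1 - n_consequent / n_total) / (1 - confidence)

# Made-up counts: 20 data sets, DNA k-mer in 5, AA k-mer in 8, both in 4.
pair_lift = lift(20, 5, 8, 4)
forward_conviction = conviction(20, 5, 8, 4)   # DNA k-mer => AA k-mer
backward_conviction = conviction(20, 8, 5, 4)  # AA k-mer => DNA k-mer
```

A lift of 1 and a conviction of 1 both correspond to statistical independence, so values above 1 indicate an association stronger than chance.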

Annotation analysis

If the pairs in our results are the actual binding cores between TFs and TFBSs, most of their TF amino acid k-mers should be inside DNA binding domains. Thus, the TF amino acid k-mers were scanned in TRANSFAC to check whether they were within the annotated DNA binding domains. As stated in the previous section, the set of TFBS 4-mer–TF 4-mer pairs constitutes all the pairs in the other sets by the downward closure property. Thus only the TF amino acid 4-mers of the set of TFBS 4-mer–TF 4-mer pairs were needed for the checking: of the 792 TF amino acid 4-mers, of them were found within the DNA binding domains listed in the ‘PFAM 18’ list downloaded from DBD (37) on 25 January 2010.
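The domain check described above can be illustrated as follows. This is only a sketch: it assumes that domain annotations are available as 0-based, half-open (start, end) intervals on the TF sequence, which may not match how TRANSFAC or DBD store them, and the function name is hypothetical.

```python
def kmer_in_domains(sequence, kmer, domains):
    """Return True if some occurrence of `kmer` lies entirely inside a domain.

    sequence : TF amino acid sequence (string)
    kmer     : TF amino acid k-mer to look for
    domains  : list of (start, end) intervals, 0-based and half-open (assumed format)
    """
    start = sequence.find(kmer)
    while start != -1:
        end = start + len(kmer)
        if any(d_start <= start and end <= d_end for d_start, d_end in domains):
            return True
        # Continue with the next (possibly overlapping) occurrence.
        start = sequence.find(kmer, start + 1)
    return False
```

For a toy sequence such as "MKCEGCKGFFRRTI" with a single annotated interval (2, 12), the 4-mer CEGC would be reported as lying within the domain.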

Empirical analysis

Since the number of results is quite large, they were tabulated from a statistical perspective in the previous sections. This section provides readers with empirical insights into the results obtained. Compared with the other sets, the set of TFBS 5-mer–TF 5-mer pairs is relatively invariant to confidence level pruning, which motivates an in-depth empirical analysis of this set. The pairs are listed in Table 3.

Table 3.

The set of TFBS 5-mer–TF 5-mer pairs (duplicated pairs removed and sorted in alphabetical order)

Confidence	Forward confidence	Backward confidence	Pairs	Confidence	Forward confidence	Backward confidence	Pairs	Confidence	Forward confidence	Backward confidence	Pairs
0.7	0.7	0.8	AAACA–HNLSL	0.4	0.4	0.7	AGGTC–CQYCR	0.3	0.5	0.3	GCCAC–ARRSR
0.5	0.5	0.7	AAACA–IRHNL	0.2	0.2	0.6	AGGTC–CVVCG	0.4	0.5	0.4	GCCAC–ESARR
0.5	0.5	0.6	AAACA–KPPYS	0.6	0.6	0.7	AGGTC–EGCKG	0.4	0.4	0.6	GCCAC–KQSNR
0.4	0.4	0.7	AAACA–NLSLN	0.2	0.2	0.7	AGGTC–FFRRT	0.4	0.6	0.4	GCCAC–NRESA
0.6	0.6	0.6	AAACA–NSIRH	0.2	0.2	0.8	AGGTC–FRRTI	0.4	0.4	0.6	GCCAC–QSNRE
0.5	0.5	0.6	AAACA–PPYSY	0.6	0.6	0.6	AGGTC–GCKGF	0.4	0.5	0.4	GCCAC–RESAR
0.4	0.4	0.6	AAACA–PYSYI	0.3	0.3	0.5	AGGTC–GFFKR	0.4	0.4	0.6	GCCAC–RKQSN
0.4	0.4	0.6	AAACA–QNSIR	0.4	0.4	0.5	AGGTC–GFFRR	0.4	0.4	0.5	GCCAC–RLRKQ
0.7	0.7	0.8	AAACA–RHNLS	0.3	0.3	0.6	AGGTC–KGFFK	0.4	0.4	0.5	GCCAC–RRSRL
0.5	0.5	0.8	AAACA–SIRHN	0.4	0.4	0.5	AGGTC–KGFFR	0.4	0.4	0.6	GCCAC–RSRLR
0.4	0.4	0.6	AAACA–WQNSI	0.4	0.4	0.9	AGGTC–RNRCQ	0.4	0.5	0.4	GCCAC–SARRS
0.4	0.4	0.6	AACAA–HNLSL	0.3	0.3	0.5	AGGTC–TCEGC	0.4	0.5	0.4	GCCAC–SNRES
0.3	0.3	0.6	AACAA–IRHNL	0.4	0.4	0.5	AGGTC–VCGDK	0.4	0.5	0.4	GCCAC–SRLRK
0.3	0.3	0.5	AACAA–NSIRH	0.2	0.2	0.5	AGGTC–VVCGD	0.6	0.6	0.8	GGTCA–CEGCK
0.3	0.3	0.7	AACAA–PMNAF	0.2	0.7	0.2	ATTAA–FQNRR	0.2	0.2	0.9	GGTCA–CGDKA
0.4	0.4	0.6	AACAA–RHNLS	0.2	0.6	0.2	ATTAA–IWFQN	0.5	0.5	0.6	GGTCA–CKGFF
0.3	0.3	0.7	AACAA–RPMNA	0.2	0.6	0.2	ATTAA–KIWFQ	0.3	0.3	0.9	GGTCA–CQYCR
0.3	0.3	0.7	AACAA–SIRHN	0.3	0.5	0.3	ATTAA–NRRMK	0.2	0.2	0.8	GGTCA–CVVCG
0.2	0.6	0.2	AAGGT–CKGFF	0.3	0.5	0.3	ATTAA–QNRRM	0.1	0.1	1	GGTCA–DLVLD
0.2	0.5	0.2	AATTA–FQNRR	0.2	0.7	0.2	ATTAA–WFQNR	0.5	0.5	0.8	GGTCA–EGCKG
0.3	0.3	0.3	AATTA–NRRAK	0.2	0.5	0.2	CACCC–GEKPY	0.2	0.2	0.8	GGTCA–FFKRS
0.4	0.4	0.5	AATTA–QNRRA	0.1	0.5	0.1	CACCC–HTGEK	0.2	0.2	0.8	GGTCA–FFRRT
0.3	0.3	0.7	AATTA–QVWFQ	0.1	0.5	0.1	CACCC–TGEKP	0.2	0.2	1	GGTCA–FRRTI
0.5	0.5	0.5	AATTA–VWFQN	0.5	0.5	0.5	CCACG–ARRSR	0.5	0.5	0.7	GGTCA–GCKGF
0.2	0.5	0.2	AATTA–WFQNR	0.5	0.5	0.6	CCACG–ESARR	0.2	0.2	0.5	GGTCA–GFFKR
0.5	0.5	0.7	ACGTG–ARRSR	0.3	0.3	0.7	CCACG–KQSNR	0.3	0.3	0.6	GGTCA–GFFRR
0.1	0.1	0.7	ACGTG–ERELK	0.2	0.2	0.6	CCACG–LRKQA	0.1	0.1	0.6	GGTCA–GYHYG
0.5	0.5	0.9	ACGTG–ESARR	0.6	0.6	0.6	CCACG–NRESA	0.1	0.1	1	GGTCA–ITCEG
0.2	0.2	0.8	ACGTG–KQSNR	0.3	0.3	0.6	CCACG–QSNRE	0.2	0.2	0.6	GGTCA–KGFFK
0.2	0.2	0.7	ACGTG–LRKQA	0.5	0.5	0.6	CCACG–RESAR	0.3	0.3	0.6	GGTCA–KGFFR
0.6	0.6	0.9	ACGTG–NRESA	0.2	0.2	0.7	CCACG–RKQAE	0.1	0.1	1	GGTCA–NRCQY
0.2	0.2	0.7	ACGTG–QSNRE	0.3	0.3	0.7	CCACG–RKQSN	0.1	0.1	1	GGTCA–RCQYC
0.5	0.5	0.9	ACGTG–RESAR	0.3	0.3	0.5	CCACG–RLRKQ	0.1	0.1	0.8	GGTCA–RNQCQ
0.1	0.1	0.7	ACGTG–RKQAE	0.3	0.3	0.6	CCACG–RRSRL	0.3	0.3	1	GGTCA–RNRCQ
0.2	0.2	0.8	ACGTG–RKQSN	0.3	0.3	0.6	CCACG–RSRLR	0.2	0.2	1	GGTCA–SCEGC
0.2	0.2	0.6	ACGTG–RLRKQ	0.5	0.5	0.6	CCACG–SARRS	0.1	0.1	0.5	GGTCA–SGYHY
0.2	0.2	0.7	ACGTG–RRSRL	0.5	0.5	0.6	CCACG–SNRES	0.3	0.3	0.6	GGTCA–TCEGC
0.2	0.2	0.8	ACGTG–RSRLR	0.4	0.4	0.4	CCACG–SRLRK	0.3	0.3	0.6	GGTCA–VCGDK
0.5	0.5	0.9	ACGTG–SARRS	0.5	0.5	0.5	CGGAA–LRYYY	0.2	0.2	0.7	GGTCA–VVCGD
0.5	0.5	0.9	ACGTG–SNRES	0.5	0.5	0.8	CTTCC–LRYYY	0.5	0.5	0.7	GTCAA–KYGQK
0.3	0.3	0.5	ACGTG–SRLRK	0.4	0.4	0.7	CTTCC–LWQFL	0.5	0.5	0.7	GTCAA–RKYGQ
0.6	0.7	0.6	AGGTC–CEGCK	0.4	0.7	0.4	GATAA–CNACG	0.5	0.5	0.7	GTCAA–WRKYG
0.3	0.3	0.8	AGGTC–CGDKA	0.4	0.7	0.4	GATAA–LCNAC	0.7	0.7	1	TGACA–NWFIN
0.6	0.7	0.6	AGGTC–CKGFF	0.6	0.7	0.6	GATAA–NACGL

Among the 131 pairs in Table 3, the TFBS DNA k-mers are quite conserved. There are only 15 distinct TFBS DNA k-mers. Each TFBS DNA k-mer forms pairs with 8.73 TF amino acid k-mers on average. One reason may be that the specificity of a DNA k-mer is lower, given its alphabet size (4) compared with the amino acid alphabet size (20).

To act as a DNA binding protein, a TF needs to provide a basic interacting surface for the recognition of the major/minor grooves as well as the phosphate backbone of DNA. Therefore, we searched through the set of pairs in Table 3 to count the frequency of occurrence of each residue. Interestingly, we found that the basic residues lysine (50 times) and arginine (131 times) occur at the highest frequency among the 131 TFBS–TF pairs. On the other hand, hydrophobic residues (38) such as isoleucine (15) and valine (13) occur at the lowest frequency. These results suggest the potential of the TF sequences for being the binding sequences between TFs and TFBSs.

Furthermore, as the nucleotides of TFBSs are negatively charged, it can be deduced that the TF amino acid residues binding them should be positively charged. Thus the frequencies were further examined by charge. Among the 131 pairs, the positively charged residues arginine (R) and lysine (K) occur 131 and 50 times, respectively. In contrast, the negatively charged residues aspartic acid (D) and glutamic acid (E) occur 8 and 30 times, respectively. This discrepancy supports their potential for being the binding sequences between TFs and TFBSs.
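The counting described above can be sketched in a few lines. The example below is illustrative only: it uses three pairs taken from Table 3 rather than the full set, so its numbers do not reproduce the counts quoted in the text, and the exact counting convention used in the article (for example, how overlapping k-mers are treated) is not stated and is assumed here to be a plain count over every residue of every TF k-mer.

```python
from collections import Counter

def summarize_pairs(pairs):
    """Summarize a list of (TFBS k-mer, TF k-mer) pairs.

    Returns (number of distinct TFBS k-mers,
             average number of TF k-mers per TFBS k-mer,
             Counter of amino acid residues over all TF k-mers).
    """
    dna_kmers = {dna for dna, _ in pairs}
    residue_counts = Counter(res for _, tf in pairs for res in tf)
    return len(dna_kmers), len(pairs) / len(dna_kmers), residue_counts

# Toy subset of Table 3 (not the full 131 pairs).
toy_pairs = [("GGTCA", "CEGCK"), ("GGTCA", "GFFRR"), ("CTTCC", "LRYYY")]
n_dna, avg_tf_per_dna, counts = summarize_pairs(toy_pairs)
```

Grouping the resulting counts by residue class (for example R and K versus D and E) gives the kind of charge comparison made in the paragraph above.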

Experimental analysis

This section follows the same approach as the empirical analysis. The set of TFBS 5-mer–TF 5-mer pairs in Table 3 was selected for experimental analysis. Out of the 131 pairs, five were selected and analyzed. The first pair is GGTCA–CEGCK, which has been experimentally shown to be a pair of binding sequences in Ref. (39). The TF amino acid k-mer (CEGCK) is considered part of the P-box (CEGCKG) within the DNA binding domain of Bp-nhr-2, which is believed to bind the DNA k-mer (GGTCA). The second pair is AAACA–IRHNL, mentioned in Ref. (40). Based on the corresponding PDB entry 3CO6, the pair is believed to be a binding pair between a TF and a TFBS, as shown in Figure 3a. Similarly, the remaining pairs are GATAA–NACGL, GGTCA–GFFRR and CTTCC–LRYYY. They are found as binding pairs in PDB entries 3DFV (41), 3DZY (42) and 2NNY (43), as shown in Figure 3b, c and d, respectively. These five pairs show that the pairs generated by the proposed approach are supported by biological evidence in the literature.

Two of the structures (3CO6 and 2NNY) were further analyzed in terms of hydrogen bonding, which reflects the specificity of the interaction between the amino acids and the bases, as shown in Figure 4a and b. The hydrogen bonds are highlighted as black lines, together with the residues that make contact with the bases (only predicted residues); these support the significance and accuracy of the predicted TF–TFBS pairs. Nevertheless, as the proposed approach is applied to a large-scale database, such an extensive and detailed analysis of all the binding core pairs discovered is not practical. Therefore, a scalable verification approach is presented in the next section to verify the massive results generated.

Figure 3.

Four representative TF–TFBS pairs are shown in ribbon diagram. (a) AAACA–IRHNL pair in 3CO6, (b) GATAA–NACGL pair in 3DFV, (c) GGTCA–GFFRR pair in 3DZY and (d) CTTCC–LRYYY pair in 2NNY are shown. The TF amino acids and TFBS nucleotides are highlighted in ball and stick format. The sequences of the TF–TFBS pairs are also labeled in the figures. The figures are generated using Protein Workshop (34).

Figure 4.

The interactions between the TF and TFBS of two representative pairs (a) AAACA–IRHNL in 3CO6 and (b) CTTCC–LRYYY in 2NNY are shown. The proteins are shown in ribbon diagram with the highlighted TF amino acids in ball and stick format. The helices and strands are colored in red and cyan, respectively. The amino acids that interact with the nucleotides are labeled. The hydrogen bonds are shown in dark line. The figures are generated using DS visualizer, Accelrys.

VERIFICATIONS

In this section, we verify the discovered pairs against external data sources, in particular the experimentally determined 3D protein–DNA complex structures from PDB. Homology modeling has also been carried out for further verification.

Verification by PDB

In this article, PDB is selected for providing 3D protein–DNA complex data for 3D structural verification. The PDB data were downloaded from RCSB PDB (http://www.pdb.org) from 16 September 2009 to 22 September 2009, where the protein–DNA complexes were selected based on the entry-type list provided in ftp://ftp.wwpdb.org/. For each set of pairs in Supplementary Table S2, each pair is independently evaluated as shown in Figure 5. For each pair, its TF k-mer is used to query which PDB chain has the TF k-mer. Once the corresponding set of PDB chains has been identified and returned, its redundancy is removed by BLASTClust using 90% sequence identity (32). The removal is to ensure that redundant PDB chains are not double counted. After the removal, the pair is evaluated for binding in the 3D space:

Figure 5.

Flowchart of 3D verification for each set of pairs.

A TFBS k-mer–TF k-mer pair is considered binding for a PDB chain if and only if an atom of the TFBS k-mer and an atom of the TF k-mer are close to each other. Two atoms are considered close if and only if their distance is <3.5 Å (25,28). With the pair evaluated in its PDB chains, the chains can be classified into the following three categories:

(a) PDB chains having only the TF k-mer
(b) PDB chains having both the TF k-mer and the TFBS k-mer, in which the pair binds together
(c) PDB chains having both the TF k-mer and the TFBS k-mer, in which the pair does not bind together

The number of chains in each category is counted and converted into the following performance metrics:

TFBS prediction score = (b + c)/(a + b + c)
TFBS binding prediction score = b/(a + b + c)
Binding prediction score = b/(b + c)

Given the PDB chains returned by a TF k-mer query, the TFBS prediction score measures the proportion of PDB chains that contain the corresponding TFBS k-mer. In other words, it measures the backward confidence of a pair in PDB. The TFBS binding prediction score is a more stringent metric: it measures the proportion of PDB chains in which the corresponding TFBS k-mer binds the queried TF k-mer. Lastly, the binding prediction score is the most important metric: among the PDB chains that contain both k-mers, it measures the proportion in which the pair actually binds. To verify the cases in which (b + c) = 0 (i.e. the pairs do not appear in PDB), homology modeling is also performed.

For each setting, we have a set of pairs. For each pair, the above performance metrics are calculated. The overall results are averaged and summarized in Supplementary Tables S9–S11. For each setting, we also have a set of merged pairs. For each merged pair, the above performance metrics are also calculated. The overall results are averaged and summarized in Supplementary Tables S12–S14. Note that the most conservative calculation has been used for each performance metric for each pair: if a performance metric of a pair does not have enough PDB data for calculation, for instance when (b + c) = 0 or (a + b + c) = 0, a value of zero is assigned to that metric. Despite this setting, the performance metrics of the pairs remain reasonable. They are shown to be significantly better than the maximal performance of 50 random runs in a later section.

Nevertheless, although the above metrics capture the performance of a pair quantitatively, the most important point is to know how many generated pairs could be verified, that is, have at least one binding evidence in the PDB data (b > 0). To gain more insight, the numbers of pairs with at least one related PDB chain [(b + c) > 0] are tabulated in Supplementary Tables S15 and S16. Correspondingly, the percentages of verified pairs [(number of pairs with b > 0)/(number of pairs with (b + c) > 0)] are calculated and tabulated in Supplementary Tables S17 and S18. In these tables, the percentage of verified pairs is high enough to justify that the proposed approach has produced pairs proven to be binding in PDB. For instance, the statistics for the TFBS 5-mer–TF 5-mer pairs are extracted in Table 4 and Figure 6. Among the 80 TFBS 5-mer–TF 5-mer pairs with at least one related PDB chain [(b + c) > 0] when the confidence level = 0.0, more than 81% have at least one binding evidence (b > 0).
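The binding criterion and the three scores can be written down directly. The sketch below assumes that atom coordinates are available as (x, y, z) tuples in ångströms; the function names are hypothetical, and parsing PDB files is deliberately left out.

```python
import math

def is_binding(tf_atoms, tfbs_atoms, cutoff=3.5):
    """True if any TF k-mer atom is closer than `cutoff` (in angstroms)
    to any TFBS k-mer atom. Atoms are (x, y, z) coordinate tuples."""
    return any(math.dist(p, q) < cutoff for p in tf_atoms for q in tfbs_atoms)

def pdb_scores(a, b, c):
    """Return (TFBS prediction score, TFBS binding prediction score,
    binding prediction score) from the chain counts a, b and c.

    Follows the conservative convention in the text: a score whose
    denominator is zero is set to zero.
    """
    total = a + b + c
    both = b + c
    tfbs_prediction = both / total if total else 0.0
    tfbs_binding_prediction = b / total if total else 0.0
    binding_prediction = b / both if both else 0.0
    return tfbs_prediction, tfbs_binding_prediction, binding_prediction
```

For example, a pair with a = 2, b = 3 and c = 1 has both k-mers in four of six chains and binds in three of those four, so the binding prediction score is 0.75. A pair with no PDB chain at all receives zero for every score.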

Table 4.

Number of the TFBS 5-mer–TF 5-mer pairs verified across different confidence levels

Confidence	Pairs with (b + c) > 0	Pairs with b > 0	Merged pairs with (b + c) > 0	Merged pairs with b > 0
0.0	80	65	19	16
0.1	80	65	19	16
0.2	71	59	15	13
0.3	50	44	15	13
0.4	32	28	12	11
0.5	19	17	7	6
0.6	9	9	5	5
0.7	2	2	1	1
0.8	0	0	0	0
0.9	0	0	0	0
1.0	0	0	0	0

Columns from left to right: confidence level; number of the TFBS 5-mer–TF 5-mer pairs with at least one related PDB chain [(b + c) > 0]; number of the TFBS 5-mer–TF 5-mer pairs with at least one PDB chain as a binding evidence (b > 0); number of the TFBS 5-mer–TF 5-mer merged pairs with at least one related PDB chain [(b + c) > 0]; number of the TFBS 5-mer–TF 5-mer merged pairs with at least one PDB chain as a binding evidence (b > 0).

Figure 6.

Percentage of the TFBS 5-mer–TF 5-mer pairs verified across different confidence levels.

631 TRANSFAC 2008.3 IDs and factor names used in this article

The TFBS–TF pairs that we found to have binding evidence in the PDB show typical structural features of DNA–protein interactions. Such features include the 'recognition helix' of the DNA-binding protein making base contacts in the major groove and direct hydrogen bonds between the side chains and the bases. These interactions play a crucial role in DNA recognition and site-specific binding, respectively (44). Interestingly, the nucleotides of the TFBS are located in the major groove of the DNA, where they are close to, and make contact with, the amino acids of the 'recognition helix' of the TF (as shown, for example, in Figure 3). The verification is considered satisfactory since those pairs not found in PDB [(b + c) = 0] may be unannotated discoveries, as shown in the following verification by homology modeling.

Verification by homology modeling

Regarding the pairs without any related PDB chain [(b + c) = 0], there are no PDB data with which to verify them. Thus, we have taken the most conservative approach and assigned zero to their performance metrics in the aforementioned evaluations. Nevertheless, we believe that most of those pairs are true and that our approach can be used as an effective protein–DNA binding discovery tool. To support this, six TFBS 5-mer–TF 5-mer pairs were taken and merged. The resultant pair ACGTG–SNRESARRSR was analyzed by homology modeling as follows. The model of the DNA–protein complex was built by homology modeling (INSIGHT II, MSI) based on the structure of the GCN4–DNA complex (1YSA) (45). Briefly, three amino acids (R234S, T236R and A238S) and two nucleotides (T29C and A31T) were mutated in the original structure. The side chains of the mutated amino acids were chosen from the rotamer database and examined using Ramachandran plots to prevent any steric effect. The interactions between the amino acids and the nucleotides were searched based on the hydrogen bond distance.

As shown in Figure 7, we found that the pair ACGTG–SNRESARRSR exists in plants as the basic leucine-zipper (bZIP) transcription factor which binds to G-box binding factors (GBF) of DNA (46). Moreover, the ACGTG sequence is the consensus sequence, which is defined as the G-box core and is located at the major groove of the double-stranded DNA. It is believed that the G-box core is the DNA sequence of GBF that provides the specificity of the binding to bZIP proteins. To further understand the interactions between the TF and the TFBS, we built a model by homology modeling based on the structure of the GCN4–DNA complex (1YSA) (45). As shown in the model, the protein helix fits into the major groove of the DNA very well and forms extensive interactions (black lines) between the amino acids and the nucleotides. Interestingly, the mutations of the protein (R234S, T236R and A238S) and of the nucleotides (T29C and A31T) increase the number of hydrogen bonds compared with the original structure (1YSA), suggesting binding specificity between this TF–TFBS pair. In conclusion, we believe that the protein–DNA binding sequence patterns found using association rule mining on the large-scale database reveal real TF–TFBS pairs in physiologically relevant situations, and that this method could guide the discovery of new and undescribed TF–TFBS pairs in the future.

Figure 7.

The pair ACGTG–SNRESARRSR using homology modeling.
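To illustrate how six TF 5-mers can give the merged 10-mer SNRESARRSR, the sketch below chains k-mers that overlap by k − 1 residues. This is an assumption about the merging step made purely for illustration; the article's actual merging procedure is defined elsewhere and may differ, and the function name is hypothetical.

```python
def merge_overlapping_kmers(kmers):
    """Merge k-mers that form a single chain with (k - 1)-residue overlaps.

    Raises ValueError if the k-mers do not form exactly one simple chain.
    """
    kmers = set(kmers)
    k = len(next(iter(kmers)))
    by_prefix = {kmer[:k - 1]: kmer for kmer in kmers}
    suffixes = {kmer[1:] for kmer in kmers}
    # The chain starts at the k-mer whose prefix is not the suffix of any other k-mer.
    starts = [kmer for kmer in kmers if kmer[:k - 1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("k-mers do not form a single chain")
    current = merged = starts[0]
    used = {current}
    while current[1:] in by_prefix:
        nxt = by_prefix[current[1:]]
        if nxt in used:
            break
        merged += nxt[-1]          # each step extends the merged string by one residue
        used.add(nxt)
        current = nxt
    if used != kmers:
        raise ValueError("some k-mers could not be chained")
    return merged

# The six TF 5-mers paired with ACGTG in Table 3 that tile the merged sequence.
tf_5mers = ["SNRES", "NRESA", "RESAR", "ESARR", "SARRS", "ARRSR"]
merged_tf = merge_overlapping_kmers(tf_5mers)   # "SNRESARRSR"
```

Only simple, unbranched chains are handled; k-mers that share a prefix or form cycles would need a more general assembly strategy.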

Verification by random analysis

For each set of pairs in Supplementary Table S1, we used a random process to generate a random set with the same number of pairs. Within a random set, the pairs were randomly sampled from all the combinations of the k-mers used in the proposed approach. Fifty random runs were performed. The maximal performance metrics of the 50 random runs are summarized in Supplementary Tables S19–S21. Their performance is compared with that of the proposed approach in Figure 8. It can be observed that the performance of the proposed approach is significantly better than the best of the 50 random runs. For instance, the binding prediction score of the 131 TFBS 5-mer–TF 5-mer pairs generated is 0.36±0.39 on average, whereas the maximal binding prediction score over 50 random runs is only 0.00509±0.06492 on average. Similar observations can be made for the merged pairs in Supplementary Tables S22–S24. It can be concluded that the performance of the proposed approach is very unlikely to have arisen purely by chance in PDB.
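A random baseline of this kind could be sketched as below. Several details are assumptions, since the text does not specify them: pairs are sampled without replacement from the Cartesian product of the k-mers, a run is summarized by the mean of a per-pair score, and the scoring function is supplied by the caller (for example, a binding prediction score computed from PDB). All names are hypothetical.

```python
import itertools
import random

def random_pair_sets(dna_kmers, tf_kmers, n_pairs, n_runs=50, seed=0):
    """Draw `n_runs` random sets of `n_pairs` (DNA k-mer, TF k-mer) pairs,
    sampled without replacement from all k-mer combinations."""
    rng = random.Random(seed)
    all_pairs = list(itertools.product(dna_kmers, tf_kmers))
    return [rng.sample(all_pairs, n_pairs) for _ in range(n_runs)]

def best_random_score(random_sets, score_pair):
    """Return the highest mean per-pair score over all random runs.

    `score_pair` is a caller-supplied function mapping a pair to a number.
    """
    return max(sum(score_pair(p) for p in s) / len(s) for s in random_sets)
```

Comparing the score of the discovered pairs against `best_random_score` mirrors the comparison with the best of the 50 random runs described above.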

Figure 8.

Performance Comparison for PDB verifications. (a) TFBS prediction score, (b) TFBS binding prediction score, (c) binding prediction score (d) TFBS prediction score (merged pairs), (e) TFBS binding prediction score (merged pairs) and (f) binding prediction score (merged pairs) are shown.

DISCUSSION

In this article, we have proposed a framework based on association rule mining with the Apriori algorithm to discover associated TF–TFBS binding sequence patterns in the most explicit and interpretable form from TRANSFAC. With the downward closure property, the algorithm is exact: it is guaranteed to generate all frequent TFBS k-mer–TF k-mer pairs from TRANSFAC. The approach relies merely on sequence information, without any prior knowledge of TF binding domains or protein–DNA 3D structure data. From comprehensive evaluations, the statistics of the discovered patterns are shown to reflect meaningful binding characteristics. According to the external literature, PDB data and homology modeling, a good number of the TF–TFBS binding patterns discovered have been verified by experiments and annotations. They exhibit atomic-level bindings between the respective TF binding domains and specific nucleotides of the TFBS in experimentally determined protein–DNA 3D structures. In fact, most of the pairs discovered are the binding cores of the TF binding domains and TFBSs, respectively. The proposed approach has great potential for discovering intuitive and interpretable rules of TF–TFBS binding mechanisms. Such rules can reveal TF binding domains, detailed interactions between amino acids and nucleotides, and accurate TFBS sequence motifs, and can help in understanding and deciphering protein–DNA interactions. The approach also offers strategic help in reducing the labor and costs involved in wet-lab experiments. With increasing computational power and more sophisticated mining approaches, the proposed methodology can be further improved to discover more intriguing TF–TFBS binding patterns and rules.

In the future, approximate associations will be considered to handle experimental and biological noise, although the inevitable computational burden needs to be carefully handled, and much more effort is needed to distinguish real signals from the large number of false positives introduced by loosening the pattern matching and clustering. Combinatorial associations between multiple TF and TFBS k-mers will be another challenging topic. We will also seek further real applications of the approach on experimentally verifiable TF–TFBS bindings.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Research Grants Council of the Hong Kong SAR, China (project numbers: CUHK41407 and CUHK414708). Funding for open access charge: Block Grant Project of the Chinese University of Hong Kong, ref #2150591; Focused Investment Scheme D on Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong (project number: 1904014).

Conflict of interest statement. None declared.

Algorithm 1 Pseudocode of Apriori algorithm (29)
data: A dataset of itemsets
L_n: Frequent n-itemsets
C_n: Candidate n-itemsets
x: An itemset
minsupport: Minimum support
i ← 1;
Scan data to get L_i;
while L_i ≠ ∅ do
    C_{i + 1} ← Extend(L_i);
    L_{i + 1} ← ∅;
    for x ∈ C_{i + 1} do
        if Support(x) > minsupport then
            L_{i + 1} ← L_{i + 1} ∪ {x};
        end if
    end for
    i ← i + 1;
end while
Notes:
Extend(L_i) is the function 'Candidate itemset generation procedure' stated in (29). Support(x) returns the support (30) of the itemset x. A frequent n-itemset is an n-itemset whose support is higher than minsupport.
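For readers who prefer runnable code, the pseudocode can be sketched in Python as follows. This is a generic, unoptimized Apriori written for illustration, not the implementation used in the article: the candidate generation is a simple join-and-prune step standing in for Extend, support is taken as the fraction of records containing an itemset, and the strict comparison with minsupport follows the wording of the notes above (many formulations use "at least" instead). The toy records are hypothetical.

```python
from itertools import combinations

def apriori(transactions, minsupport):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) > minsupport}
    frequent = {}
    while level:
        for x in level:
            frequent[x] = support(x)
        size = len(next(iter(level))) + 1
        # Join step: unions of two frequent itemsets that have the next size.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step (downward closure): every subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, size - 1))}
        level = {c for c in candidates if support(c) > minsupport}
    return frequent

# Hypothetical toy records, each holding one TFBS k-mer and one TF k-mer.
records = [{"GGTCA", "CEGCK"}, {"GGTCA", "CEGCK"}, {"GGTCA", "GFFRR"}, {"AAACA", "IRHNL"}]
frequent_itemsets = apriori(records, minsupport=0.4)
```

With these toy records and a minimum support of 0.4, the frequent itemsets are {GGTCA}, {CEGCK} and the pair {GGTCA, CEGCK}.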

39 in total

1. Cloning and characterization of two nuclear receptors from the filarial nematode Brugia pahangi.

Authors: J Moore; E Devaney
Journal: Biochem J Date: 1999-11-15 Impact factor: 3.857

Review 2. Plant bZIP G-box binding factors. Modular structure and activation mechanisms.

Authors: Y Sibéril; P Doireau; P Gantet
Journal: Eur J Biochem Date: 2001-11

3. Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level.

Authors: N M Luscombe; R A Laskowski; J M Thornton
Journal: Nucleic Acids Res Date: 2001-07-01 Impact factor: 16.971

4. Using electrostatic potentials to predict DNA-binding sites on DNA-binding proteins.

Authors: Susan Jones; Hugh P Shanahan; Helen M Berman; Janet M Thornton
Journal: Nucleic Acids Res Date: 2003-12-15 Impact factor: 16.971

5. Structural classification of zinc fingers: survey and summary.

Authors: S Sri Krishna; Indraneel Majumdar; Nick V Grishin
Journal: Nucleic Acids Res Date: 2003-01-15 Impact factor: 16.971

6. MATCH: A tool for searching transcription factor binding sites in DNA sequences.

Authors: A E Kel; E Gössling; I Reuter; E Cheremushkin; O V Kel-Margoulis; E Wingender
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

7. Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity.

Authors: Nicholas M Luscombe; Janet M Thornton
Journal: J Mol Biol Date: 2002-07-26 Impact factor: 5.469

8. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments.

Authors: X Shirley Liu; Douglas L Brutlag; Jun S Liu
Journal: Nat Biotechnol Date: 2002-07-08 Impact factor: 54.908

9. Structure of the intact PPAR-gamma-RXR- nuclear receptor complex on DNA.

Authors: Vikas Chandra; Pengxiang Huang; Yoshitomo Hamuro; Srilatha Raghuram; Yongjun Wang; Thomas P Burris; Fraydoon Rastinejad
Journal: Nature Date: 2008-11-20 Impact factor: 49.962

Review 10. An overview of the structures of protein-DNA complexes.

Authors: N M Luscombe; S E Austin; H M Berman; J M Thornton
Journal: Genome Biol Date: 2000-06-09 Impact factor: 13.583

