Identification of upstream transcription factor binding sites in orthologous genes using mixed Student's t-test statistics.

Tinghua Huang¹, Hong Xiao¹, Qi Tian¹, Zhen He¹, Cheng Yuan¹, Zezhao Lin¹, Xuejun Gao¹, Min Yao¹.

Abstract

BACKGROUND: Transcription factors (TFs) regulate the transcription of DNA to messenger RNA by binding to upstream sequence motifs. Identifying the locations of known motifs in whole genomes is computationally intensive.
METHODOLOGY/PRINCIPAL FINDINGS: This study presents a computational tool, named "Grit", for screening TF-binding sites (TFBS) by coordinating transcription factors to their promoter sequences in orthologous genes. This tool employs a newly developed mixed Student's t-test statistical method that detects high-scoring binding sites utilizing conservation information among species. The program performs sequence scanning at a rate of 3.2 Mbp/s on a quad-core Amazon server and has been benchmarked against well-established ChIP-Seq datasets, putting Grit amongst the top-ranked TFBS predictors. It significantly outperforms the well-known transcription factor motif scanning tools Pscan (by 4.8%) and FIMO (by 17.8%) in analyzing well-documented ChIP-Atlas human genome ChIP-Seq datasets.
SIGNIFICANCE: Grit is a good alternative to currently available motif scanning tools.

Substances：
Transcription Factors

Year: 2022 PMID： 35671296 PMCID： PMC9205514 DOI： 10.1371/journal.pcbi.1009773

Source DB: PubMed Journal: PLoS Comput Biol ISSN： 1553-734X Impact factor: 4.779

This is a PLOS Computational Biology Software paper.

1. Introduction

A DNA sequence motif is a short conserved pattern that can be bound by regulatory proteins, such as transcription factors (TFs). DNA sequence motifs represent functionally important regions of the genome and are among the basic units of molecular evolution, usually conserved among species [1]. Locating these motifs in the genome and understanding their function is fundamental to building molecular models of biological processes such as human diseases [2, 3]. Researchers often face the task of identifying putative binding sites for TFs in whole genomes, termed “motif scanning” [4]. Over the past several decades, many computational pipelines have been described that use position weight matrices (PWMs) for this task. MAST searches DNA motifs against a database composed of short sequences and assigns a score to each target sequence, assuming that every motif occurs once [5]. MCAST uses a hidden Markov model (HMM) to scan DNA sequences for regions comprising one or more of the given motifs [6], whereas SWAN utilizes a log likelihood ratio (LLR) scoring system built by training a two-state HMM on the background sequences [1]. FIMO computes an LLR score for each position in a DNA sequence and converts this score to a q-value using dynamic programming methods [7]. TRAP introduces a physical binding model to predict the relative binding affinity of a transcription factor for a given sequence [8]. PWMScan scans sequence motifs using Bowtie [9] or “matrix_scan”, which employs a conventional search algorithm [10]. The Python-based program Motif scraper searches for motifs specified as a text string using IUPAC degenerate bases [11]. Several tools, such as Toucan [12], OTFBS [13], and CREME [14], count all matches in the target and control sequences and apply binomial statistics to test for over-representation.
Other tools such as Clover [15], PAP [16], oPOSSUM [17], Pscan [18], and TFM_Explorer [19] scan sequence sets from co-regulated or co-expressed genes with TF motifs, and assess motifs that are significantly over- or under-represented, to identify common regulators of the sequence sets. WeederH was designed for discovering conserved TFBS and distal regulatory modules in sequences from orthologous genes [20]. MatrixREDUCE predicts functional transcription factor binding through alignment-free and affinity-based analyses of orthologous promoter sequences in closely related species [21]. Table 1 summarizes motif scanning parameters of 18 currently available tools.

Table 1

Functionality of currently available motif scanning tools.

Tool	Scan single DNA	Scan multiple DNAs^a	Report single p-value^b	Report multiple p-value^c	Species specific	Utilize conservation information	Source code available	Compared^d	Release date
MAST	✓		✓		✓		✓		1998
MCAST	✓		✓		✓		✓		2003
OTFBS		✓		✓		✓			2003
CREME		✓		✓		✓			2003
Clover		✓		✓		✓	✓	✓	2004
Toucan		✓		✓		✓			2005
PAP		✓		✓		✓			2006
oPOSSUM		✓		✓		✓			2007
TRAP	✓					✓	✓		2007
WeederH		✓		✓		✓	✓		2007
MatrixREDUCE		✓		✓			✓		2008
Pscan		✓		✓		✓	✓	✓	2009
TFM_EXPLORER		✓		✓		✓	✓		2010
SWAN	✓		✓			✓	✓	✓	2010
FIMO	✓		✓		✓		✓	✓	2011
PWMScan	✓		✓				✓	✓	2018
Motif scraper	✓				✓		✓		2018
Grit		✓	✓		✓	✓	✓	✓	2021

^a Designed to scan multiple sequence sets.

^b Report p-value for target genome.

^c Report p-value for multiple sequence sets.

^d Selected for performance assessment.

To overcome the shortcomings of currently available tools, as described below, a novel motif-scanning algorithm, “Grit”, was developed that identifies genome-wide upstream TFBS from a known collection of PWMs for promoters of orthologous genes without the need for sequence alignment. This study addresses the question of finding significant associations between TFs and orthologous gene sets by introducing a new computational framework that uses mixed Student’s t-test statistics and adopts new ways of constructing promoter sequence sets of orthologous genes. Its application to the human genome has yielded fruitful results, demonstrating its suitability as a motif scanning tool.

2. Design and implementation

2.1. Building putative promoter sets for orthologous genes

PWMs for TFs were obtained from the JASPAR database release 2020, referred to here as JASPAR-2020 [22], which comprises a collection of TF motifs for human. The Ensembl Biomart web tool release 100 [23] was used to extract the putative promoter sequences, 2 kb upstream of the transcript, for all genes in 294 genomes (S1 Table). The promoter set for orthologous genes used for TFBS scanning was built by first comparing the cDNA sequences from the target genome (TG, human) with the cDNA sequences from the other 293 reference genomes (RGs, genomes other than human) to identify orthologous gene clusters, and then pooling the 2 kb upstream sequences of the orthologous genes. The BLASTN parameters were “-word_size 11 -reward 2 -penalty -3 -gapopen 5 -gapextend 2 -evalue 1e-6”, and the BEST-to-BEST approach based on the E-value (mutual best hit) was used to define orthologous gene pairs between two species. This promoter set is referred to as the “2K-set” and is available from the Grit website; the promoter sequence for the TG in this set is referred to as the “TPS”. A background promoter sequence set was randomly selected from the 2K-set and named the “BPS”.
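The BEST-to-BEST (mutual best hit) rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: it assumes the BLASTN results for both search directions have already been parsed into (query, subject, E-value) tuples, and the function names and gene identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# One parsed BLASTN hit: (query_id, subject_id, e_value).
# This tuple layout is an assumption made for the example.
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to the subject with the lowest E-value."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {query: subject for query, (subject, _) in best.items()}


def mutual_best_hits(forward: Iterable[Hit], reverse: Iterable[Hit],
                     max_evalue: float = 1e-6) -> List[Tuple[str, str]]:
    """Return (target_gene, reference_gene) pairs that are each other's best hit."""
    fwd = best_hits(h for h in forward if h[2] <= max_evalue)
    rev = best_hits(h for h in reverse if h[2] <= max_evalue)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical hits: human cDNAs against a reference genome, and the reverse search.
forward_hits = [("hGENE1", "rGENE1", 1e-50), ("hGENE1", "rGENE2", 1e-20),
                ("hGENE2", "rGENE2", 1e-30), ("hGENE3", "rGENE2", 1e-40)]
reverse_hits = [("rGENE1", "hGENE1", 1e-48), ("rGENE2", "hGENE3", 1e-41),
                ("rGENE2", "hGENE2", 1e-29)]
pairs = mutual_best_hits(forward_hits, reverse_hits)
```

In this toy input, hGENE2 has rGENE2 as its best hit, but rGENE2 points back to hGENE3, so only the reciprocal pairs are kept. Ties in E-value are broken by input order here, which is a simplification.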

2.2. Statistical identification of TFBS in a target genome

First, we obtained a statistical score based on a component of an HMM (Eq 1). This raw score (RS) follows the ideas of existing statistical approaches [15] and amounts to averaging likelihood ratios (LRs) over a sequence. The LR describes a motif being present at one particular location in a sequence, where w is the width of the motif, L denotes the location being considered, L_k is the nucleotide at position k within this location, p(L_k) is the background probability of observing nucleotide L_k, estimated from the frequency of L_k in that sequence, and q(k, L_k) is the probability of observing nucleotide L_k at position k of the motif, estimated from the motif frequencies. The RS for a motif in a sequence s of length l is the natural logarithm of the average of the LRs taken over all locations of s, where M is the number of locations in the sequence, calculated as l − w + 1. Statistically significant TFBS in the target genome are then identified by a mixed Student’s t-test.
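Read literally, the definitions above give RS = ln((1/M) · Σ_L Π_{k=1..w} q(k, L_k) / p(L_k)), with the sum running over the M = l − w + 1 locations. The Python sketch below assumes that form; it is an illustration, not the Grit source code. The PWM layout (one dictionary of base probabilities per motif position) and the function name are hypothetical, and the PWM is assumed to contain no zero probabilities (for example, after adding pseudocounts).

```python
import math
from collections import Counter
from typing import Dict, List


def raw_score(sequence: str, pwm: List[Dict[str, float]]) -> float:
    """Natural log of the mean likelihood ratio over all l - w + 1 windows."""
    w = len(pwm)
    m = len(sequence) - w + 1  # number of candidate locations, M
    if m < 1:
        raise ValueError("sequence is shorter than the motif")
    # Background p(.) estimated from base frequencies in this sequence.
    counts = Counter(sequence)
    background = {base: n / len(sequence) for base, n in counts.items()}
    total = 0.0
    for start in range(m):
        lr = 1.0
        for k in range(w):
            base = sequence[start + k]  # bases absent from the PWM raise KeyError
            lr *= pwm[k][base] / background[base]
        total += lr
    return math.log(total / m)


# Hypothetical two-column motif that strongly prefers "AC".
strong_ac = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
]
score = raw_score("ACGTACGT", strong_ac)
```

A PWM that matches the background exactly yields a likelihood ratio of 1 at every location and therefore an RS of 0, which is a convenient sanity check for any implementation.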

2.3. Theory of mixed Student’s t-test

We tested the significance of the RS of a gene in the TG for a given motif, assuming that the RSs for the sequences in the 2K-set for this gene were normally distributed. To do so, we propose a new statistical approach that extends the canonical Student’s t-test, namely the “mixed Student’s t-test”: given a background set (bkg) and a testing set (obs), we determine whether one value (one) from obs is significantly different from the mean of the values in bkg, where one, obs, and bkg are the RSs for the TPS, the 2K-set, and the BPS, respectively. The mixed Student’s t-test statistic is calculated by combining the one-sample Student’s t-test and the independent two-sample Student’s t-test. The calculations of the t-statistic (t’) and degrees of freedom (df) of the mixed Student’s t-test are presented as Eqs 2 and 3, respectively. The p-values are estimated with the “cdflib” function [24]. The coefficient of conserved variation (CCV, Eq 4) and standard difference (SD, Eq 5) are also calculated; they indicate the degree of conservation of the TFBS among species and the magnitude of the difference in RS between the TG and the RGs, respectively.
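The text describes the mixed statistic as a combination of the one-sample and the independent two-sample Student's t-tests. Eqs 2 and 3 are not reproduced in this text, so the sketch below does not implement the mixed test itself; it only computes the two textbook statistics that the mixed test is built from. The pooled-variance form of the two-sample test, the sign conventions, and the function names are assumptions made for illustration.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def one_sample_t(sample: Sequence[float], value: float) -> Tuple[float, int]:
    """Classical one-sample t statistic of `sample` against a fixed value."""
    n = len(sample)
    t = (mean(sample) - value) / math.sqrt(variance(sample) / n)
    return t, n - 1  # (t statistic, degrees of freedom)


def two_sample_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Classical independent two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical raw scores: obs for one orthologous promoter set, bkg for background.
obs = [-2.1, -1.4, -3.0, -0.8, -2.6]
bkg = [-9.5, -11.2, -10.4, -8.9, -10.1]
t_one, df_one = one_sample_t(obs, value=-0.8)
t_two, df_two = two_sample_t(obs, bkg)
```

Converting either statistic to a p-value requires the Student's t cumulative distribution function, which the paper obtains from cdflib; that step is left out here to keep the sketch free of third-party dependencies.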

2.4. Development of the Grit software

Utilizing the mixed Student’s t-test statistics, we developed a tool, called Grit, for screening TFBS by coordinating TFs to their promoter sequences in orthologous genes. The tool takes JASPAR-2020 (specified by the -m option), the 2K-set (-i option), and the BPS (-b option) as its input. Running the tool with default options (-n 10 -z 200 -s 1 -t 0.05 -p 0) produces a result file (-o option) containing the predicted TFBS. There are three major steps built into the program: 1) calculate the RS for each PWM in each promoter set for orthologous genes using Eq 1; 2) calculate the p-values for the significance of the RS of each gene for each given PWM using the mixed Student’s t-test statistics; and 3) perform multiple testing correction for all p-values using the FDR method [25], and retain the TFBS with FDR ≤ the threshold defined by the -t option. The source code has been deposited in GitHub and is available under an academic free license.
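Step 3 can be illustrated with a short sketch of FDR adjustment. The text does not say which FDR procedure reference [25] describes; the code below assumes the widely used Benjamini-Hochberg step-up procedure, and the function name and p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for four candidate TFBS; keep those at or below the threshold.
p_vals = [0.001, 0.02, 0.04, 0.5]
fdr = benjamini_hochberg(p_vals)
threshold = 0.05  # plays the role of the value passed with -t
retained = [i for i, q in enumerate(fdr) if q <= threshold]
```

With these inputs the third site has a raw p-value below 0.05 but an adjusted value above it, so it is dropped; this is the practical effect of correcting millions of PWM-by-gene tests.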

2.5. Performance assessment methods of the Grit software

The ChIP-Seq datasets were obtained from the ReMap website release 2020, referred to as ReMap-2020 [26], and the ChIP-Atlas website release 2021, referred to as Atlas-2021 [27]. True positives (TP) were defined as predicted binding sites overlapping 80% with experiment-supported binding sites from the ReMap or ChIP-Atlas ChIP-Seq datasets. False positives (FP) were defined as predicted binding sites that did not overlap with experiment-supported binding sites, and false negatives (FN) were defined as experiment-supported binding sites that were not identified. Performance was assessed by calculating sensitivity [Sn = TP/(TP + FN)], positive predictive value [PPV = TP/(TP + FP)], and geometric accuracy [ACCg = √(Sn × PPV)], as reported by Sand et al. [28] and Jayaram et al. [29], for all of the datasets analyzed. All assessments of the six tools were performed in parallel on Amazon EC2 computation services. For software such as PWMScan and Clover, where a local version was not available or was too slow to analyze all PWMs, a random subset of the transcription factors (35 TFs) was analyzed. The Sn, PPV, and ACCg values for each tool were compared by paired Student’s t-test.
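The three metrics are simple to compute from the TP, FP, and FN counts. The sketch below takes geometric accuracy as the geometric mean of Sn and PPV, which is the form consistent with the values reported in Table 2 (for example, Sn = 0.61 and PPV = 0.80 give ACCg ≈ 0.70). The function name and the counts are hypothetical.

```python
import math
from typing import Tuple


def performance(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (Sn, PPV, ACCg) from true positive, false positive, false negative counts."""
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("Sn or PPV is undefined for these counts")
    sn = tp / (tp + fn)           # sensitivity
    ppv = tp / (tp + fp)          # positive predictive value
    accg = math.sqrt(sn * ppv)    # geometric accuracy, assumed geometric mean
    return sn, ppv, accg


# Hypothetical counts for one TF on one ChIP-Seq dataset.
sn, ppv, accg = performance(tp=50, fp=50, fn=50)
```

Because ACCg is a geometric mean, a scanner cannot score well by maximizing only one of the two quantities: very high Sn with very low PPV still gives a low ACCg.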

3. Results

3.1. Mixed Student’s t-test with simulated datasets

Two normally distributed datasets were generated and used as the bkg (mean = −10, standard deviation = 5, gray) and obs (mean = −2, standard deviation = 7, dark green) datasets. Three values from obs (located at the 25th, 50th, and 75th percentiles, red) were tested for significance and produced p-values of 1.0, 0.03, and 1E-25, respectively (Fig 1A). Three representative genes from the human genome were used for testing: purple for a gene at the 75% quantile of CCV and 25% quantile of SD, dark green for a gene at the 50% quantile of CCV and 75% quantile of SD, and red for a gene at the 25% quantile of CCV and 75% quantile of SD, all of which produced p-values less than 1E-6 (Fig 1B). The p-values, CCV, and SD for each entity (one) in obs were calculated. As shown in Fig 1C, the p-values decreased with increasing CCV and SD, indicating that the mixed Student’s t-test favors TFBS having a higher CCV, a higher SD, or both. We also noted that the mixed t-test behaved as a one-sample t-test when the distributions of values in obs and bkg were the same, and as a two-sample t-test when the observation (one) was located at the mean of obs (S1 Text).

Fig 1

Validating the mixed Student’s t-test with simulated and real-world data.

A and B. Simulated data and representative real-world data testing; the distribution of bkg is colored gray and that of obs dark green; the one values of interest are indicated with vertical lines. C. Parameter testing. The X axis represents the coefficient of conserved variation (CCV, 0 to 2.0), the Y axis represents the standard difference (SD, 0 to 3.0), and the Z axis represents 1 − p-value. D. Schematic of the pipeline used for the study.


3.2. Prediction of TFBS in human genome using Grit

A schematic of the pipeline used in this study is shown in Fig 1D, which included: (1) BLAST the TG against the RGs; (2) build the 2K-set for homologous genes using the BEST-to-BEST approach; (3) run Grit using JASPAR-2020 and the 2K-set; and (4) assess the performance of Grit using public ChIP-Seq datasets. The promoter set contained 2 kb sequences for the putative promoter regions of 35,342 homologous gene clusters. To estimate the accuracy of this promoter set, we compared it with the experimentally validated human promoters available in the EPD database, which contains 29,598 human gene promoter sequences [30]. The TPS contained promoter sequences for 93.2% of these genes, as judged by alignment of the EPD sequences to the TPS with an E-value < 1E-6. Grit was then used to identify TFBS in the human genome by applying it to the 2K-set. The Grit run took 22 h and identified 7.57 million significant TFBS for 537 TFs (FDR ≤ 0.05). A TF was assigned to a target gene if at least one TFBS for that TF was found for the gene. Grit's predictions were assessed together with those of other publicly available motif scanning tools designed for high-throughput analysis (six scanners in total), using the 829 ReMap-2020 datasets (S2 Table) obtained from the ReMap database [26]. The results are shown in Fig 2. FIMO and Swan consistently achieved higher Sn but lower PPV for ChIP-Seq datasets than the other tools (p-value ≤ 0.05). The average Sn of Grit was lower than that of FIMO, but the average PPV of Grit was the highest among all competitors. As a result, Grit attained the highest average ACCg, followed by FIMO, Swan, Pscan, and PWMScan; Clover had the lowest ACCg. Notably, Grit outperformed FIMO by 29% as evaluated by ACCg (p-value ≤ 0.05). The number of predicted targets for FIMO and Swan was strikingly high, covering approximately 80% of human genes on average, whereas the numbers of predicted targets for Grit, Pscan, and PWMScan were significantly smaller, with Clover producing the fewest targets (p-value ≤ 0.05).

Fig 2

Performance assessment of motif scanners in analyzing ReMap datasets.

A total of six scanners (Clover, FIMO, Grit, Pscan, PWMScan, and Swan) were evaluated based on the parameters of sensitivity (Sn), positive predictive value (PPV), geometric accuracy (ACCg), and total number of predicted transcription factor binding sites (TFBS, Count).


3.3. Performance of Grit using ChIP-Atlas datasets

Additionally, the performance of the six scanners was evaluated using 111 high-quality Atlas-2021 target gene sets (S3 Table) collected from experimentally validated human ChIP-Atlas data with literature support [27]. Table 2 lists a random subset of the assessment results. The Sn values of FIMO were higher than those of Grit (33.0%, p-value ≤ 0.05), whereas the PPV values of Grit were higher than those of FIMO (2.15-fold, p-value ≤ 0.05). The ACCg values of Grit were higher than those of FIMO (17.8% on average, p-value ≤ 0.05), indicating that Grit performed better than FIMO on Atlas-2021. Furthermore, Grit slightly outperformed Pscan (4.8% on average, p-value ≤ 0.05). Analysis using the JASPAR-2020, ReMap-2020, and Atlas-2021 datasets identified Grit, Pscan, and FIMO as the best tools for identifying TFBS (complete prediction results for all the tools are provided on the Grit website), ranking the tools by ACCg in the order Grit > Pscan > FIMO > Swan > PWMScan > Clover.

Table 2

Performance of Grit, FIMO, and Pscan on publicly available ChIP-Seq datasets with literature support.

Motif	Grit			FIMO			Pscan
Motif	Sn	PPV	ACCg	Sn	PPV	ACCg	Sn	PPV	ACCg
ASCL1	0.51	0.15	0.28	0.94	0.04	0.20	0.46	0.16	0.27
CDX2	0.44	0.18	0.28	0.69	0.07	0.22	0.43	0.20	0.30
DUX4	0.55	0.08	0.21	0.78	0.04	0.18	0.53	0.09	0.22
E2F1	0.61	0.80	0.70	0.31	0.49	0.39	0.58	0.81	0.69
ELK4	0.69	0.22	0.39	0.93	0.10	0.31	0.71	0.35	0.50
FLI1	0.48	0.77	0.61	0.56	0.44	0.49	0.42	0.81	0.58
GATA3	0.48	0.68	0.57	0.50	0.32	0.40	0.52	0.67	0.59
GLI2	0.52	0.37	0.44	0.70	0.17	0.35	0.30	0.33	0.31
HNF4G	0.46	0.27	0.35	0.86	0.10	0.30	0.29	0.29	0.29
JUND	0.35	0.71	0.50	0.57	0.33	0.43	0.30	0.69	0.46
MAFF	0.44	0.18	0.28	0.67	0.08	0.23	0.49	0.20	0.31
MEF2A	0.57	0.28	0.40	0.87	0.10	0.29	0.54	0.29	0.39
MXI1	0.40	0.72	0.54	0.59	0.38	0.47	0.34	0.72	0.49
NFE2	0.50	0.62	0.56	0.62	0.29	0.42	0.27	0.56	0.39
NFIC	0.34	0.59	0.45	0.73	0.26	0.44	0.32	0.58	0.43
NR2C2	0.49	0.13	0.25	0.91	0.05	0.21	0.31	0.12	0.19
NRF1	0.71	0.72	0.71	0.87	0.38	0.58	0.49	0.78	0.62
OTX2	0.60	0.51	0.55	0.47	0.26	0.35	0.37	0.56	0.46
PAX5	0.59	0.63	0.61	0.82	0.29	0.49	0.50	0.63	0.56
RUNX3	0.37	0.60	0.47	0.56	0.28	0.40	0.26	0.65	0.41
SP1	0.75	0.81	0.78	0.98	0.27	0.52	0.50	0.83	0.65
SPI1	0.42	0.78	0.57	0.78	0.31	0.49	0.46	0.78	0.60
SRF	0.36	0.63	0.48	0.41	0.31	0.36	0.46	0.62	0.53
TBX21	0.44	0.59	0.51	0.74	0.26	0.44	0.18	0.56	0.32
TCF7L2	0.40	0.62	0.50	0.71	0.25	0.42	0.29	0.59	0.42
Average	0.50	0.51	0.48	0.70	0.24	0.37	0.41	0.51	0.44

*Full information available in S3 Table. Sn = sensitivity, PPV = positive predictive value, ACCg = geometric accuracy.

3.4. Differences among Grit and other prediction tools

The prediction results of Grit and the five other tools were compared. Of the TFBS in the ChIP-Atlas datasets, 38.9% were not identified by the other five prediction tools; 32.8% were identified by both Grit and the other tools; 11.5% were identified by the other tools but not by Grit; and 16.8% were identified by Grit but not by the other tools. A total of 2.9% of the best TFBS identified by Grit for the same gene did not overlap with those identified by the other tools. A comparison of the numbers of TFBS identified by Grit and by the other five tools showed that each tool produced dramatically different prediction results (Fig 3A and 3B). To show the features that distinguish Grit from the other tools, we investigated the distributions of CCV and SD for all Grit TFBS and for Grit-specific TFBS (TFBS detected by Grit but not by another tool, written Grit − other, where the “−” symbol denotes subtraction). The results indicated that Grit − FIMO and Grit − Swan had significantly higher CCV values, while Grit − Clover, Grit − Pscan, and Grit − PWMScan had significantly higher SD values, than Grit TFBS (p-value ≤ 0.05, Fig 3C and 3D).

Fig 3

Differences among the results of Grit and other prediction tools.

A. Number of transcription factor binding sites (TFBS) that overlap (light green) between Grit and other prediction tools, and Grit-specific (red) or other-tool-specific (light blue) TFBS. B. Overall comparison of the numbers of TFBS between the prediction results of Grit and other prediction tools. C and D. Violin plots showing the distributions of the coefficient of conserved variation (CCV) and standard difference (SD), respectively, for TFBS identified by Grit but not by other tools (Grit − other, where the symbol “−” denotes subtraction).


4. Discussion

4.1. Comparative genomics is required for TFBS prediction

Identification of TFBS is essential for understanding how TFs regulate gene expression, ultimately controlling processes such as cell cycle progression, stress response, or stem cell differentiation into adult tissues [31-33]. A typical computational problem is deciding, given a PWM, whether a nucleotide sequence contains a valid instance of the TFBS modeled by that PWM [4]. Reliable predictions on a single sequence are nearly impossible without further filtering because of the redundant information available on promoter sequences [18]. The activities of functionally important TFs are highly conserved among both closely related and distant species, so their binding sites occur frequently in orthologous genes [1]. A gene can be compared with its orthologs by analyzing sequence conservation in evolutionarily preserved transcribed regions, which enables the identification of orthologous gene sets, and TFBS can be predicted from the promoter sequences of these genes [21]. Although the predicted TFBS require further wet-lab validation, this in silico approach has gained wide popularity with the increasing availability of PWMs [29]. Functionally important binding sites in closely related species can be identified by promoter sequence alignment and phylogenetic footprinting methods [16, 34–36]. However, promoters of orthologous genes in distantly related species are often poorly conserved, and identification of TF binding sites in these sequences is difficult [1, 20]. This study used cross-species comparison to build co-regulated orthologous gene sets, without the need for non-coding sequence alignment. Therefore, this approach is well suited for comparative genomics across large evolutionary divergences, where existing alignment-based methods are not feasible.
The rationale is that the promoters of most of the genes targeted by the same TF(s) should contain significantly higher scores for TFBS than some suitably computed numbers obtained from a collection of unrelated genes or a random background model.

4.2. Mixed Student’s t-test is useful in discovering TFBS

Over-represented motif analysis has been performed by counting the number of matches and mismatches in target and control sequences and applying the hypergeometric distribution [14, 37]. A more intricate procedure, accounting for sequences with zero, one, two, or three or more matches in the target and control sets, has been reported [31]. Several studies have suggested counting all matches in the target and control sequences, and proposed two different binomial formulas for assessing motif over-representation [12, 14]. Notably, the widely used Pscan program calculates an RS similar to Clover's and applies a z-test to analyze over- or under-representation of TFBS. The p-value is computed by counting the number of times a random dataset yields a score higher than the input sequence set. Our tool, Grit, calculates an RS similar to those of Clover and Pscan; the RS is the average exponent of the standard motif matrix score and is proportional to the factor’s total equilibrium occupancy of the TFBS in the sequence in a simple thermodynamic model [38-40]. Note that the RS is a function of the length of the promoter sequence S: if S is extended to include nucleotides that do not coordinate the motif, the RS decreases. Given sets of equal-length target and control sequences, it is possible to test for significance by ranking the RSs from both sets and performing statistical analyses. The newly developed mixed Student’s t-test was performed for the TG and RGs for sites where the TFBS were expected to be conserved. Additionally, we considered the possibility of motif variation among highly diverged RG species, such as pig or cattle, because of significant changes in the binding scores of TFs among orthologous genes in the 2K-set of reference species. However, given a sufficiently large number of RGs, the binding affinity scores should show a normal distribution.
The statistical analysis prefers to detect TFBS either conserved among species (high CCV) or having significant RS differences between the target and control sequences (high SD), or both. In contrast to the statistical test implemented in other tools, which produce a “whole” p-value for the gene set but fail to tell whether a specific sequence has certain TFBS or not, the mixed Student’s t-test is not only able to utilize the information from comparative genomics, but also produces a theoretical p-value for an individual sequence of interest.

4.3. Single- and multi-species prediction tools

FIMO, Swan, and PWMScan were designed to not only identify potential matches to a motif, but also for potential matches that are greater than expected by chance, considering the genomic background [1, 7, 10]. All were designed for TFBS prediction in a single-species and produced a large number of TFBS as expected. Compared with these tools, Grit identified significantly smaller numbers of binding sites, which highlights the major differences between these tools. Grit has been designed to predict TFBS based on PWMs, and these sites were either highly conserved or had high RS among the promoter sequences. With the added condition that the TFBS were required to be highly conserved among species, which was not a criterion for single-species scanners, the final lists produced were relatively small, have a higher CCV, and were thus likely to be more suitable for further experimental validation. Clover and Pscan were designed for multi-species TFBS scanning [15, 18]. Similar to the Clover algorithm, Grit computed an RS for each input sequence, representing the average likelihood of each TFBS to a promoter. Regulatory regions of higher eukaryotes often contain multiple binding sites for the same transcription factor, with weaker “shadow” copies of the motif also present [41]. Therefore, considering the average score of multiple matches per sequence will likely aid in the discovery of functional motifs. Another issue is the definition of a “background” suitable for assessing the significance of the results obtained. In Clover, this is performed by shuffling the columns of the motif, or by building random sequence sets of the same size and length of the sequence set investigated [15]. However, the algorithm implemented in Clover is computationally intensive, taking 15 days to process 25 PWMs for the human genome. 
Similar to Pscan, Grit treats the input sequences as a sample taken from a “universe” composed of all promoter sequences available for the species investigated. Several subsamples are taken from the universe (default size = 200 and n = 10) and used as the background. For each promoter set of orthologous genes, the RSs obtained from the input sequence set can be compared with the RSs of the subsets randomly taken from the whole-genome promoter set, and the p-value can be produced by the mixed Student’s t-test.
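The background construction described above can be sketched as repeated random subsampling from the promoter universe. This is an illustration under stated assumptions, not the Grit implementation: sampling without replacement within each subsample, the optional seed, and the function and promoter names are all hypothetical; only the default size of 200 and n of 10 come from the text.

```python
import random
from typing import List, Optional, Sequence


def draw_background(universe: Sequence[str], size: int = 200, n: int = 10,
                    seed: Optional[int] = None) -> List[List[str]]:
    """Draw n random subsamples of `size` promoters each from the universe."""
    pool = list(universe)
    if size > len(pool):
        raise ValueError("subsample size exceeds the number of promoters")
    rng = random.Random(seed)  # local generator so results are reproducible
    # Each subsample is drawn without replacement; subsamples may overlap.
    return [rng.sample(pool, size) for _ in range(n)]


# Hypothetical universe of promoter identifiers.
universe = [f"promoter_{i}" for i in range(1000)]
subsamples = draw_background(universe, seed=0)
```

Each subsample would then be scored with the same PWM as the input set, so that the RSs of the background and of the orthologous promoter set are directly comparable.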

Availability and future directions

Grit is a good alternative to currently available motif scanning tools and is publicly available at http://www.thua45.cn/grit under an academic free license. Future directions include the development of algorithms, such as gene-set enrichment analysis, for analyzing transcriptome data.

S1 Text. Proof of the mixed Student’s t-test. (DOCX)

S1 Table. Detailed information for genomes in Ensembl Biomart web tool release 100. (DOCX)

S2 Table. Detailed information of automatically annotated ChIP-Seq datasets obtained from the ReMap-2020 database. (DOCX)

S3 Table. Detailed information of the publicly available ChIP-Seq datasets (Atlas-2021) with literature support. (DOCX)

10 Mar 2022

Dear Dr. Yao,

Thank you very much for submitting your manuscript "Identification of upstream transcription factor binding sites in orthologous genes using mixed Student’s t-test statistics" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission.
We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Manja Marz Software Editor PLOS Computational Biology Manja Marz Software Editor PLOS Computational Biology *********************** Reviewer's Responses to Questions Comments to the Authors: Please note here if the review is uploaded as an attachment. Reviewer #1: Some minor picks: Line 12: "3.2 Mb/s" - is it mega-base-pairs (Mbp/s)? Please indicate so as to avoid mix up with MB (file size). Line 44: Suggest change "as mentioned below" to "as described below". Line 54: Suggest change "referred to here as" to "referred here as", deleting "to". The download source code package is missing an essential README file. Some major concerns: (1) This manuscript is well written in general, however, it lack a clear description as how Grit is designed to work: what's the input, what's output, what's the process in which what algorithms was used to perform the motif recognitions calculate and how the reference (Jasper database) was used to make decisions, etc. Figure 1 D looks superfacial at this point. It seems necessary to add a section to serve this purpose. (2) The approaches in this report appear to be innovative in terms of using a new mixed Student t-test to determine the cut off thresholds. The authors compared the outcomes with this tool against a number of other computational tools. However, is it reasonable to suggest that where discrepancies exist between this and other tools, would the TFBS under the spotlights be further verified (e.g. with wet lab methods) in order to make a stronger claim? At the lease this should be a discussion point, I think. Reviewer #2: In the present manuscript, Huang et al. 
describe a computational tool ("Grit") that identifies transcription factor binding sites (TFBS) in (alignment-free) upstream DNA sequences of orthologous genes, based on PWMs (position weight matrices) and a new mixed Student's t-test. The authors tested their tool on simulated data and ChIP-Seq datasets from the human genome ChIP-Atlas and compared it with several state-of-the-art TFBS prediction tools. The study is well written and, in general, well executed, but falls short at some points:

1. It is not entirely clear what the motivation/idea of the new statistical test is. It seems to work on the simulated and real datasets used, but the authors could elaborate more on the theory and model of their mixed Student's t-test.

2. In Table 1, the authors state that their tool Grit is species-specific and can identify TFBS conserved among species. This statement is quite ambiguous, and since they only showed results on human datasets, it seems that Grit is human-specific, or at least not tested on other species.

3. It is not clear what kind of multiple testing approach the authors used (if any) to correct their calculated p-values. This is crucial for their statistics and predictions and cannot be omitted!

4. What helpful information does Figure 1C provide?

5. In Figure 3B: What is the difference between "TFBS identified by Grit but not other applications" and "Grit TFBS that do not overlap with other applications"?

6. There are several minor typos throughout the manuscript, which should be carefully checked and corrected.

The authors present an interesting variation of PWM-based TFBS prediction tools, with an, in general, interesting application of a new statistical model. However, as described in the authors' own words, their tool only "marginally outperforms" current state-of-the-art tools based on one human dataset. To improve their study, the authors should emphasize the details and ideas of their statistical model.
They should definitely consider implementing multiple testing corrections and finally apply and test their tool on more datasets.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Zhiliang Hu

Reviewer #2: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us.
Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

16 Mar 2022

Submitted filename: Response-to-reviewers.docx

30 Apr 2022

Dear Dr. Yao,

We are pleased to inform you that your manuscript 'Identification of upstream transcription factor binding sites in orthologous genes using mixed Student’s t-test statistics' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email. A member of our team will be in touch with a set of requests. Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards.
Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Manja Marz
Software Editor
PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: Thank you for your efforts to address my concerns.

Reviewer #2: The authors answered all my questions and concerns about their manuscript to my satisfaction and improved their manuscript accordingly.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
Reviewer #1: Yes

Reviewer #2: Yes

**********

Do you want your identity to be public for this peer review?

Reviewer #1: No

Reviewer #2: No

2 Jun 2022

PCOMPBIOL-D-21-02278R1

Identification of upstream transcription factor binding sites in orthologous genes using mixed Student’s t-test statistics

Dear Dr Yao,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Agnes Pap

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol
