
Identification of candidate transcription factor binding sites in the cattle genome.

Derek M Bickhart, George E Liu.

Abstract

A resource that provides candidate transcription factor binding sites (TFBSs) does not currently exist for cattle. Such data are necessary, as predicted sites may serve as excellent starting locations for future omics studies to develop transcriptional regulation hypotheses. To generate this resource, we employed a phylogenetic footprinting approach (using sequence conservation across cattle, human and dog) together with position-specific scoring matrices to identify 379,333 putative TFBSs upstream of nearly 8000 Mammalian Gene Collection (MGC) annotated genes within the cattle genome. Comparisons of our predictions to known binding site loci within the PCK1, ACTA1 and G6PC promoter regions revealed 75% sensitivity for our method of discovery. Additionally, we intersected our predictions with known cattle SNP variants in dbSNP and on the Illumina BovineHD 770k and Bos 1 SNP chips, finding 7534, 444 and 346 overlaps, respectively. Due to our stringent filtering criteria, these results represent high-quality predictions of putative TFBSs within the cattle genome. All binding site predictions are freely available at http://bfgl.anri.barc.usda.gov/BovineTFBS/ or http://199.133.54.77/BovineTFBS.
Copyright © 2013. Production and hosting by Elsevier Ltd.

Year:  2013        PMID: 23433959      PMCID: PMC4357788          DOI: 10.1016/j.gpb.2012.10.004

Source DB:  PubMed          Journal:  Genomics Proteomics Bioinformatics        ISSN: 1672-0229            Impact factor:   7.691


Introduction

The detection of functional transcription factor binding sites (TFBSs) remains an elusive goal in the post-genome world [1]. Much of the difficulty in TFBS discovery comes from the short length of the binding sequences as well as their degenerate nature. Additionally, transcription factors (TFs) can often bind sequences completely dissimilar to their canonical TFBS motif, and some TFBSs can be lineage-specific, thereby confounding comparative evolutionary discovery [2]. Given these difficulties, experimental discovery and annotation remain the most reliable method for TFBS discovery; however, such validation is often unavailable for individual TFs. The proliferation of genome sequencing and assembly has made possible the use of comparative genomics approaches for TFBS discovery. The reasoning behind these approaches is that conserved sequence upstream of a gene is likely to contain essential TFBSs, because selective pressure conserves the sequence across several different species [3]. Since TFs typically bind to non-coding regions of the genome, accurate identification of their binding sites could provide important context to the recent influx of genetic variation discovery studies that often identify variants outside of coding regions.
In silico prediction methods for the detection of TFBSs can be classified by the order in which they apply sequence homology among several related species. The first method, termed the alignment-free method, uses motif detection algorithms on unaligned genomic sequence prior to comparisons of sequence homology [4]. By contrast, phylogenetic footprinting uses conserved sequence alignments across several animal species as a starting point for TFBS motif detection [4]. Each technique has its own benefits and disadvantages that follow from its starting approach.
In our previous study, we identified novel TFBSs upstream of the phosphoenolpyruvate carboxykinase (PEPCK or PCK1) promoter and applied a TFBS prediction algorithm to detect TFBSs upstream of all genes available in the human genome [5]. In this study, we applied a phylogenetic footprinting approach using the transcription factor binding site locator (TFLOC) algorithm initially developed to detect conserved TFBSs within multiple genome alignments for the University of California, Santa Cruz (UCSC) genome browser [6].

Implementation

TFLOC uses a position-specific scoring matrix (PSSM) algorithm to identify putative TFBSs across multiple genome alignment files through the generation of a similarity matrix score for each putative position [5]. The PSSMs that we used were derived from the JASPAR CORE, FAM and phyloFACTs databases, which contain freely available consensus TFBS scoring matrices that were experimentally determined or statistically predicted [7]. We chose the Btau4.0 reference assembly for our analysis for two reasons: (1) it is currently the most extensively annotated cattle reference assembly and (2) simultaneous comparative alignments against other mammalian genomes already exist for Btau4.0 (downloaded from: http://hgdownload.cse.ucsc.edu/goldenPath/bosTau4/multiz5way/). We chose the 1000 bp upstream multiple alignment file (maf) for our analysis, as proximal TFBSs tend to be found within 1000 bp of the transcription start site (TSS) of a gene [8]. After downloading the 1000 bp maf from the UCSC genome browser, we removed alignments from platypus (Ornithorhynchus anatinus) and mouse (Mus musculus) due to their large sequence divergences from cattle (Bos taurus). Promoter sequence for similar genes could not always be found within the genomes of these animals: platypus and mouse shared promoter sequence synteny with cattle only 1437 times (1437/8740; 16%) and 7649 times (7649/8740; 88%), respectively, compared to 8440 times (97%) for human (Homo sapiens) and 8165 times (93%) for dog (Canis lupus familiaris). By focusing on multiple alignments containing sequences only from cattle, human and dog, we were able to investigate 7764 locations that had homology among the three species, as opposed to only 1335 locations if we had required alignments containing sequences from all five species.
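The general idea of scoring aligned sequences with a PSSM can be sketched as follows. This is a minimal illustration and not TFLOC's actual implementation: the count matrix, the pseudocount, the uniform background, the score threshold and all function names are hypothetical, and the aligned sequences are assumed to share the same columns, with gap-containing windows simply skipped.

```python
import math

BASES = "ACGT"

def log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds PSSM."""
    pssm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pssm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pssm

def scan(seq, pssm):
    """Yield (start, score) for every window made only of A/C/G/T."""
    width = len(pssm)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing gaps or ambiguous bases
        yield start, sum(pssm[i][base] for i, base in enumerate(window))

def conserved_hits(aligned, pssm, min_score):
    """Return alignment columns where every species scores >= min_score."""
    per_species = [dict(scan(seq, pssm)) for seq in aligned.values()]
    shared = set(per_species[0]).intersection(*per_species[1:])
    return sorted(
        start for start in shared
        if all(scores[start] >= min_score for scores in per_species)
    )

# Hypothetical motif resembling "TGAC", built from 10 example sites.
COUNTS = [
    {"A": 1, "C": 0, "G": 0, "T": 9},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 9, "C": 1, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
ALIGNED = {
    "cattle": "TTGACGTTT",
    "human": "ATGACGTAA",
    "dog": "CTGACGTCC",
}

if __name__ == "__main__":
    print(conserved_hits(ALIGNED, log_odds(COUNTS), min_score=5.0))
```

Requiring the hit in all three species is what makes this a footprinting-style filter: a strong match in cattle alone would be discarded.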

Application

Computational TFBS prediction methods are often marred by high false positive rates (FPRs), so we initially sought to define stringent filters for the algorithm in order to focus on highly likely TFBS motifs. Similar to Liu et al.’s approach [5], we tested the fit of raw TFLOC prediction scores for all surveyed PSSMs to a Gaussian distribution and found that 176 out of 315 of the motifs (55.9%) deviated significantly from a Gaussian distribution. We also identified 8 different distribution types for TFLOC prediction scores, similar again to the previous report [5]. For all subsequent predictions, we accounted for non-Gaussian distributions of TFLOC scores by using fine-tuned filtering values for each PSSM. The final filter values were derived from an empirical test consisting of comparisons between well-characterized TFBSs identified within the PCK1, ACTA1 and G6PC promoters and TFLOC predictions. We chose these promoter regions for estimating the sensitivity and specificity of our predictions because they have relatively high numbers of coordinate-converted TFBS positions. Including promoter regions with fewer characterized TFBS positions could artificially penalize the specificity estimate because of a lack of experimental TFBS information rather than a real flaw in our algorithm. Using as a standard [10-13] the 44 characterized sites upstream of the human PCK1, ACTA1 and G6PC genes that could be converted to Btau4.0 reference coordinates with the liftOver tool [9], we measured the overlap of predictions at incremental cutoff values (Table 1). We defined sensitivity as the number of true positives divided by the sum of true positives and false negatives. Specificity was defined as the percentage of predictions that overlapped known sites by at least 50%, similar to the criterion described previously [5].
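The two measures defined above can be sketched in a few lines. This is an illustrative reading rather than the authors' script: the function names are hypothetical, intervals are assumed to be 0-based and half-open, and the 50% overlap is assumed here to be measured relative to the length of the known site (the text does not state which interval serves as the denominator).

```python
def overlap_fraction(pred, known):
    """Fraction of the known site covered by the prediction."""
    start = max(pred[0], known[0])
    end = min(pred[1], known[1])
    return max(0, end - start) / (known[1] - known[0])

def evaluate(known, predicted, min_frac=0.5):
    """Return (sensitivity, specificity) as defined in the text.

    sensitivity = TP / (TP + FN), counted over known sites
    specificity = share of predictions overlapping a known site
                  by at least min_frac
    """
    true_pos = sum(
        1 for k in known
        if any(overlap_fraction(p, k) >= min_frac for p in predicted)
    )
    false_neg = len(known) - true_pos
    supported = sum(
        1 for p in predicted
        if any(overlap_fraction(p, k) >= min_frac for k in known)
    )
    sensitivity = true_pos / (true_pos + false_neg) if known else 0.0
    specificity = supported / len(predicted) if predicted else 0.0
    return sensitivity, specificity

if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    known_sites = [(100, 110), (200, 212)]
    predictions = [(98, 108), (205, 230), (300, 310)]
    print(evaluate(known_sites, predictions))
```

Applied to the counts reported in Table 1 for the 0.04% cutoff, the same formulas give 33 / (33 + 11) = 75% sensitivity and 78 / 359 ≈ 22% specificity.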
A cutoff value of 0.04% was chosen for future TFBS predictions as it produced superior sensitivity and specificity while making fewer overall predictions than the higher cutoff values (Figure 1). While it is very likely that some TFBSs with in vivo activity may have been excluded due to this stringent filter, our analysis focused on putative, high percent similarity binding sites that likely have functional significance due to their conservation across species.
Table 1

Performance of TFLOC predictions at various score thresholds.

                              TFLOC threshold (×0.01%)
                              1     2     3     4     5     6     MultiTF
Known sites                   44    44    44    44    44    44    44
Predicted sites               24    31    32    33    33    33    29
False negatives               20    13    12    11    11    11    15
50% overlapping predictions   24    40    66    78    81    84    88
Total predictions             93    179   284   359   416   452   215
Sensitivity (%)               55    70    72    75    75    75    66
Specificity (%)               25    22    23    22    20    19    41
Figure 1

Comparison of known and predicted sites upstream of the PCK1 gene. The chromosome position (Chr13: 59,379,179–59,379,654) is listed at the top of the diagram, with vertical gray bars serving as scale bar markers. Known PCK1 TFBSs are represented in the top track by black bars (previously identified in [11]) and blue bars (identified in [5]). TFBS predictions made by TFLOC using a 3-way alignment of human, dog and cow are depicted in the following three tracks. Predictions from JASPAR CORE, JASPAR FAM and JASPAR PHYLOFACTS are represented by red, grey and green bars, respectively. Additional UCSC tracks include gap locations, RefSeq annotated genes, cow mRNAs mapped to the reference genome, and 5-way multiz alignment and conservation.

We also compared TFBS detection results from our method to results from another phylogenetic footprinting tool, MultiTF [14]. Using the same 1 kb upstream regions from PCK1, ACTA1 and G6PC, we aligned the sequences with the Mulan webtool (http://mulan.dcode.org/) and loaded the alignments in MultiTF for TFBS detection using the default settings. Only 29 of the 44 known TFBSs in the three genes were detected by MultiTF (66% sensitivity) compared to the 33 sites identified by our method (75% sensitivity). Both methods made a similar number of predictions within the analyzed regions (361 predictions for MultiTF and 359 for TFLOC). Therefore, differences in predicted sites may be attributed to the use of different TFBS PSSMs, as our method used the JASPAR databases [7], while MultiTF uses the TRANSFAC database [15]. Although both methods provide high degrees of sensitivity for TFBS detection in promoter regions, TFLOC detected four more experimentally validated TFBSs at the 0.04% cutoff filter than MultiTF did.
Our analysis predicted 379,333 TFBSs upstream of 7764 MGC annotated loci within the Btau4.0 reference assembly. Many of the placed MGC annotations (683 loci) on Btau4.0 lacked sequence conservation in either dog or human, so we were unable to predict TFBSs in these regions. Another portion of MGC upstream alignments (293 loci) was removed because the upstream region fell within gap or repeat regions of the Btau4.0 assembly. Despite these losses, we were able to predict TFBSs at ∼80% (7764 out of 9706) of the currently annotated MGC loci in the Btau4.0 assembly and ∼89% (7764 out of 8740) of the MGC loci present in the maf alignment.
We then checked for previously annotated variants that might overlap with our predictions by comparing our TFBS loci with the more than 9 million SNP variant calls within the cattle genome that are present in the dbSNP variant repository (http://www.ncbi.nlm.nih.gov/projects/SNP/) [16]. Since the variants present in dbSNP have coordinates on the UMD3.1 reference assembly, we used UCSC’s liftOver tool to convert the SNP coordinates to the Btau4.0 assembly (>98% conversion rate). We identified 7534 TFBS predictions that overlapped with variant SNP loci (Table S1). We also compared our TFBS loci with SNPs present on the Illumina BovineHD 770k and Affymetrix Bos 1 SNP chips and identified 444 and 346 intersections, respectively. Given the potential for SNP variants to change the sequences of TFBSs and thereby affect the binding affinities of TFs, we counted the number of SNP–TFBS intersections that fell within conserved (monomorphic) nucleotides of the TFBS consensus sequence (Table S1). We found a high number of SNP–TFBS intersections that changed conserved TFBS consensus sequences (5598 in dbSNP; 243 in Bos 1 and 327 in BovineHD 770k). These SNP–TFBS intersections were identified upstream of 1887 MGC annotated genes. Several of these overlaps occurred upstream of essential genes, such as the CTCF binding site of HLA-DMA (encoding histocompatibility antigen, DM alpha chain), the NKX3_1 binding site of LYZ1 (encoding lysozyme) and the FOXF2 binding site of HSP40/DNAJB4 (encoding heat shock protein 40/DnaJ homolog, subfamily B, member 4).
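An intersection of SNP positions with predicted TFBS intervals can be sketched with a sorted-position lookup. This is a hedged illustration, not the pipeline used in the study: the function names and example coordinates are hypothetical, all coordinates are assumed to be already on the same assembly (for example after a liftOver conversion), and sites are assumed to be 0-based, half-open intervals.

```python
from bisect import bisect_left
from collections import defaultdict

def index_snps(snps):
    """Group SNP positions by chromosome and sort them for binary search."""
    by_chrom = defaultdict(list)
    for chrom, pos in snps:
        by_chrom[chrom].append(pos)
    for positions in by_chrom.values():
        positions.sort()
    return by_chrom

def snps_in_sites(sites, snp_index):
    """Map each (chrom, start, end) site to the SNP positions inside it.

    Sites without any SNP are omitted from the result.
    """
    hits = {}
    for chrom, start, end in sites:
        positions = snp_index.get(chrom, [])
        lo = bisect_left(positions, start)  # first SNP with pos >= start
        hi = bisect_left(positions, end)    # first SNP with pos >= end
        if hi > lo:
            hits[(chrom, start, end)] = positions[lo:hi]
    return hits

if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    snps = [("chr13", 105), ("chr13", 250), ("chr1", 105)]
    sites = [("chr13", 100, 110), ("chr13", 200, 250), ("chr2", 0, 10)]
    print(snps_in_sites(sites, index_snps(snps)))
```

Sorting once per chromosome keeps each site lookup logarithmic in the number of SNPs, which matters when millions of variants are compared against hundreds of thousands of predicted sites.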

Conclusion

In this study, we identified 379,333 putative transcription factor binding sites (TFBSs) within the promoter regions of 7764 annotated genes in the cattle genome. Intersecting our predicted TFBSs with known SNP sites revealed variants within conserved TFBS consensus positions in dbSNP (5598 sites), on the Bos 1 array (243 sites) and on the BovineHD 770k array (327 sites). Future GWAS, QTL mapping and whole-genome sequencing studies could investigate these SNP–TFBS intersections to link variants within our TFBS predictions to phenotypes. Currently, our predictions represent high-priority regions of interest for future surveys such as RNA-seq, which can associate differences in expression with animal genotypes. All TFBS predictions and SNP marker intersections are freely available at http://bfgl.anri.barc.usda.gov/BovineTFBS/ or http://199.133.54.77/BovineTFBS.

Authors’ contributions

DMB and GEL designed the procedures, carried out the experiments, wrote the draft and corrected the manuscript. Both authors read and approved the final manuscript.

Competing interests

The authors declare no competing interests.

References (16 in total)

1.  dbSNP: a database of single nucleotide polymorphisms.

Authors:  E M Smigielski; K Sirotkin; M Ward; S T Sherry
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  The human genome browser at UCSC.

Authors:  W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal:  Genome Res       Date:  2002-06       Impact factor: 9.043

3.  In silico identification of metazoan transcriptional regulatory regions.

Authors:  Wyeth W Wasserman; William Krivan
Journal:  Naturwissenschaften       Date:  2003-03-27

4.  A vision for the future of genomics research.

Authors:  Francis S Collins; Eric D Green; Alan E Guttmacher; Mark S Guyer
Journal:  Nature       Date:  2003-04-14       Impact factor: 49.962

5.  Conservation of an insulin response unit between mouse and human glucose-6-phosphatase catalytic subunit gene promoters: transcription factor FKHR binds the insulin response sequence.

Authors:  J E Ayala; R S Streeper; J S Desgrosellier; S K Durham; A Suwanichkul; C A Svitek; J K Goldman; F G Barr; D R Powell; R M O'Brien
Journal:  Diabetes       Date:  1999-09       Impact factor: 9.461

6.  TRANSFAC: transcriptional regulation, from patterns to profiles.

Authors:  V Matys; E Fricke; R Geffers; E Gössling; M Haubrock; R Hehl; K Hornischer; D Karas; A E Kel; O V Kel-Margoulis; D-U Kloos; S Land; B Lewicki-Potapov; H Michael; R Münch; I Reuter; S Rotert; H Saxel; M Scheer; S Thiele; E Wingender
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

7.  Evolution of gene regulation of pluripotency--the case for wiki tracks at genome browsers.

Authors:  Georg Fuellen; Stephan Struckmann
Journal:  Biol Direct       Date:  2010-12-29       Impact factor: 4.540

8.  Visualization and exploration of conserved regulatory modules using ReXSpecies 2.

Authors:  Stephan Struckmann; Daniel Esch; Hans Schöler; Georg Fuellen
Journal:  BMC Evol Biol       Date:  2011-09-24       Impact factor: 3.260

9.  JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles.

Authors:  Elodie Portales-Casamar; Supat Thongjuea; Andrew T Kwon; David Arenillas; Xiaobei Zhao; Eivind Valen; Dimas Yusuf; Boris Lenhard; Wyeth W Wasserman; Albin Sandelin
Journal:  Nucleic Acids Res       Date:  2009-11-11       Impact factor: 16.971

10.  A survey of genomic properties for the detection of regulatory polymorphisms.

Authors:  Stephen B Montgomery; Obi L Griffith; Johanna M Schuetz; Angela Brooks-Wilson; Steven J M Jones
Journal:  PLoS Comput Biol       Date:  2007-04-25       Impact factor: 4.475
