
AnimalTFDB 3.0: a comprehensive resource for annotation and prediction of animal transcription factors.

Hui Hu, Ya-Ru Miao, Long-Hao Jia, Qing-Yang Yu, Qiong Zhang, An-Yuan Guo.

Abstract

The Animal Transcription Factor DataBase (AnimalTFDB) aims to provide the most comprehensive and accurate information on animal transcription factors (TFs) and cofactors. It has been maintained and updated for seven years, and we will continue to improve it. We recently updated AnimalTFDB to version 3.0 (http://bioinfo.life.hust.edu.cn/AnimalTFDB/) with more data and functions. AnimalTFDB 3.0 contains 125,135 TF genes and 80,060 transcription cofactor genes from 97 animal genomes. Besides the expansion in data quantity, several new features and functions have been added: (i) more accurate TF family assignment rules; (ii) classification of transcription cofactors; (iii) TF binding site information; (iv) GWAS phenotype information for human TFs; (v) TF expression in 22 animal species; (vi) a TF binding site prediction tool that identifies potential binding TFs for nucleotide sequences; and (vii) a separate web interface for human TFs (HumanTFDB). The new version of AnimalTFDB provides a comprehensive annotation and classification of TFs and cofactors, and will be a useful resource for studies of TFs and transcriptional regulation.

Year: 2019; PMID: 30204897; PMCID: PMC6323978; DOI: 10.1093/nar/gky822

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971


INTRODUCTION

Transcription factors (TFs) are proteins with sequence-specific DNA-binding domains (DBDs) that bind target DNA to promote or suppress gene transcription (1) and play key roles in a wide range of biological processes (2). Accurate identification of TFs is the basis for studying their functions. Several TF databases exist; for example, the PlantTFDB databases (3,4) provide the most comprehensive collection of well-defined plant TFs to date. For animals, there are databases such as The Human Transcription Factors database (5) and REGULATOR (6), which cover a single genome and 77 metazoan species, respectively. Our AnimalTFDB is the first and most comprehensive animal TF database that includes the classification and annotation of genome-wide TFs and cofactors. AnimalTFDB was first built in 2011 (7) and was updated to AnimalTFDB v2.0 in 2015 (8) with more species and annotations. It has been accessed millions of times, cited hundreds of times and widely used for functional studies of animal TFs and for TF prediction. As one of the major types of regulators in biological processes and diseases, TFs have been studied from many aspects, such as functions and regulatory mechanisms (9), evolutionary analysis (10), drug target analysis (11,12), TF-related diseases or phenotypes (13–15), TF regulatory networks (16), TF target prediction (17), and TF-related single nucleotide polymorphisms (SNPs) (18). The regulatory networks and functional interactions between TFs and their target binding sites play key roles in cancers and other diseases (19,20). The DNA binding sites of hundreds of vertebrate TFs have been determined and collected in several databases. HOCOMOCO contains transcription factor binding site (TFBS) models for several hundred human and mouse TFs (21). In addition, TRANSFAC and JASPAR (22) cover TFBS of several animal species, and the Cis-BP database (23) contains 6559 TFBS of 340 species.
These resources laid the foundation for research on TF regulation. Because a TFBS is a short DNA sequence, genomic variants such as SNPs and mutations can affect TF binding and regulation. Genome-wide association studies (GWAS) (24) have identified many phenotype-related variants across the genome, which may be a useful resource for exploring TF-related variants and phenotypes (25,26). In the past 4 years, the number of species in the Ensembl database has doubled. To meet the urgent demand of data-driven research, we upgraded AnimalTFDB to version 3.0, which covers more species, more TFs and cofactors with the latest annotation, and new functions. In addition, TF-related GWAS phenotype and TFBS information was integrated, and a TFBS prediction tool was provided. The new AnimalTFDB3.0 will be a useful resource for transcriptional regulation and comparative genomic research.

DATA SOURCE AND SUMMARY

All protein sequences of 97 animal genomes were downloaded from the Ensembl database (version 92) (27). In AnimalTFDB3.0, we identified 125,135 TFs and 80,060 transcription cofactors in 97 animal species (Table 1) using the improved prediction pipeline described in the next section. There are 1665 TFs (7.34% of protein-coding genes) and 1025 cofactors (4.52%) in human. The numbers of TFs and cofactors in the 97 species are shown in Supplementary Table S1. TFs account for 5–8.5% of protein-coding genes in vertebrates, while this proportion drops to 2.9–4% in other eukaryotic organisms (Supplementary Table S1). The 'Species' page is shown in Figure 1A. We collected a large number of annotations from the NCBI Entrez Gene and Ensembl databases, including basic information, gene phenotypes, homologous genes, and Gene Ontology (GO) terms. We acquired protein-protein interaction (PPI) data from BioGRID (28) and HPRD (29). Protein functional domains were predicted with PfamScan against all protein domain models in the Pfam database, while signaling pathway information was obtained from the BioCarta (https://cgap.nci.nih.gov/Pathways/BioCarta_Pathways) and KEGG databases.

Table 1. Data summary in AnimalTFDB3.0 database

AnimalTFDB                   | Version 1.0 | Version 2.0 | Version 3.0
Species                      | 50          | 65          | 97
TF families                  | 72          | 70          | 73
TF genes                     | 52,722      | 72,336      | 125,135
Cofactor genes               | 9066        | 21,053      | 80,060
CRF genes                    | 3476        | 6502        | Merged into cofactors
Cofactor families            | 0           | 0           | 83
Species with expression data | 0           | 9           | 22
Phenotype                    | No          | Yes         | Yes
DBDs WebLogo                 | No          | Yes         | Yes
TF prediction server         | No          | Yes         | Yes
BLAST search                 | No          | Yes         | Yes
PPI network                  | No          | No          | Yes
GWAS                         | No          | No          | Yes
TFBS                         | No          | No          | Yes
TFBS prediction server       | No          | No          | Yes

Figure 1. New features of AnimalTFDB3.0. (A) Part of the 'Species' page. (B) The families and categories of transcription cofactors. (C) An example of a PPI network. (D) An example of TFBS information. (E) The GWAS phenotype related information of human TFs. (F) The TFBS prediction server and an example of a prediction result.

Next, 4257 gene-SNP pairs (2469 for TFs and 1796 for transcription cofactors) with the corresponding GWAS phenotypes were gathered from the latest GWAS Catalog (25) and dbSNP (release 144) (30). Furthermore, TFBS for 18,952 TFs of 51 species were integrated from the HOCOMOCO (21), TRANSFAC, JASPAR (22) and CIS-BP databases. In addition, we collected TF expression data for 22 animal species from TCGA (31), the EMBL-EBI Expression Atlas (32), RNA-seq data published by Li et al. (33) and the Bgee database (34), as well as human protein expression data from the Human Protein Map (35). Compared with the previous two versions, AnimalTFDB3.0 contains more data and more data types (Table 1).

IMPROVED CONTENT AND NEW FEATURES

Animal TF family and assignment rules

TFs are typically characterized and classified into specific families by their conserved DBDs. Starting from the TF families of AnimalTFDB2.0, we extracted several new families from the ‘Others’ group and merged some families after a systematic literature review. The five new TF families extracted from the ‘Others’ group of the previous version are zf-CCCH, LRRFIP, DACH, GCFC and CSRNP. In addition, we moved the CEP-1 family into the ‘Others’ group because it has only one TF, and merged the C/EBP and TF_bZIP 2 families into the TF_bZIP family because they share the same DBD (36). Finally, we obtained 73 TF families in AnimalTFDB3.0, including an ‘Others’ group that contains orphan TFs. We set up three rules to classify a TF into its correct family. First, if a superfamily has several families, we classify the TFs based on the family-specific domain. For example, the zf-C2H2 superfamily includes two families: zf-C2H2 and ZBTB. Proteins containing both zf-C2H2 and ZBTB domains were assigned to the ZBTB family, while proteins with only the zf-C2H2 domain were classified into the zf-C2H2 family. Second, if a TF has multiple unrelated DBDs, we assign it to the family with the smallest E-value in DBD prediction. Third, some proteins were predicted to contain DBDs but are annotated as enzymes based on their functional domains and functions; we removed them according to their enzyme-related domains. For example, some acetyltransferases were also predicted to contain a zf-C2HC domain, so they were removed based on the predicted acetyltransferase domain MOZ_SAS.
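
The three assignment rules can be illustrated with a minimal Python sketch. It is not the AnimalTFDB implementation: the `DomainHit` record, the two lookup tables and the assumption that every non-enzyme hit is a DBD are hypothetical simplifications, and only the zf-C2H2/ZBTB and MOZ_SAS examples come from the text.

```python
from dataclasses import dataclass

# Hypothetical lookup tables for illustration; only the zf-C2H2/ZBTB and
# MOZ_SAS entries are examples taken from the text.
SUPERFAMILY_SPECIFIC = {"zf-C2H2": ["ZBTB"]}  # superfamily -> family-specific domains
ENZYME_DOMAINS = {"MOZ_SAS"}                  # enzyme domains that veto a TF call


@dataclass
class DomainHit:
    domain: str    # name of the predicted domain
    evalue: float  # E-value reported by the domain search


def assign_family(hits):
    """Return the TF family for one protein, or None if it is not kept as a TF."""
    domains = {h.domain for h in hits}
    # Rule 3: proteins carrying an enzyme-related domain are removed.
    if domains & ENZYME_DOMAINS:
        return None
    if not hits:
        return None
    # Rule 1: a family-specific domain wins over its superfamily domain.
    for superfamily, specific in SUPERFAMILY_SPECIFIC.items():
        if superfamily in domains:
            for family in specific:
                if family in domains:
                    return family
    # Rule 2: among unrelated DBDs, keep the one with the smallest E-value.
    return min(hits, key=lambda h: h.evalue).domain
```

In this sketch the enzyme check runs first, so a protein with both a DBD and an enzyme domain is never assigned to a family; the order in which the original pipeline applies the rules is not stated in the text.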

TF prediction pipeline

Based on the TF families and classification rules, we built the TF prediction pipeline. The Hidden Markov Model (HMM) profiles for the DBDs of 58 TF families were downloaded from the latest Pfam database (version 31.0) (37), and the profiles of 14 TF families were built in-house with HMMER (v3.1b2) (38) from the DBD sequences of classical model species (human, mouse, zebrafish and fly). The in-house HMM files of the 14 TF families can be downloaded from the ‘Download’ or ‘Document’ page. Next, we ran the hmmsearch program in the HMMER package to search all protein sequences against all DBD HMM profiles to predict TFs in each species. To improve prediction accuracy, we set family-specific E-value thresholds (Supplementary Table S2 and the online document page) based on manual curation rather than using a fixed cutoff; for instance, 1e-3 for the zf-C2H2 domain and 1e-20 for zf-CCCH. In addition, orphan TFs that are the only member of their family and are reported as TFs in the literature were categorized into the ‘Others’ group.
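
As an illustration of family-specific cutoffs, the following Python sketch filters hits parsed from an hmmsearch `--tblout` table. The two thresholds are the examples quoted above; `DEFAULT_CUTOFF`, the use of the full-sequence E-value and the function names are assumptions made for this sketch, and the complete threshold list is in Supplementary Table S2.

```python
# Thresholds for zf-C2H2 and zf-CCCH are the two examples quoted in the text;
# DEFAULT_CUTOFF is a placeholder assumption, not a value from the paper.
FAMILY_CUTOFFS = {"zf-C2H2": 1e-3, "zf-CCCH": 1e-20}
DEFAULT_CUTOFF = 1e-5


def parse_tblout(lines):
    """Yield (protein, family, evalue) from hmmsearch --tblout style lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # tblout columns: target name, target accession, query name,
        # query accession, full-sequence E-value, ...
        yield fields[0], fields[2], float(fields[4])


def filter_hits(hits, cutoffs=FAMILY_CUTOFFS, default=DEFAULT_CUTOFF):
    """Keep hits whose E-value passes the cutoff of their own family."""
    return [
        (protein, family, evalue)
        for protein, family, evalue in hits
        if evalue <= cutoffs.get(family, default)
    ]
```

With a single fixed cutoff, a permissive value would admit weak zf-CCCH matches while a strict one would discard genuine zf-C2H2 hits; a per-family lookup avoids that trade-off.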

Identification of transcription cofactors and their family rules

Here, we define transcription cofactors as proteins that modify chromatin status or interact with TFs to activate or repress the transcription of genes. In AnimalTFDB3.0, the chromatin remodeling factors were merged into transcription cofactors. As in version 2.0, we collected human transcription cofactors from the Tcof-DB v2 database (39) and from the GO database according to related GO terms. After manual curation and removal of redundant genes, we obtained 1,025 human transcription cofactors. Cofactors in the other 96 species were identified by reciprocal best-hit BLAST between each species and human, with E-value ≤1e-4, coverage ≥50% and identity ≥30%. Transcription cofactors were divided into 83 families and the following five major categories according to their protein families and functions (Figure 1B): the ‘Co-activator/repressors’ category contains genes annotated as coactivators or corepressors; the ‘Histone-modifying Enzymes’ category contains genes encoding histone modification enzymes; the ‘Chromatin Remodeling Factors’ category contains genes collected according to GO annotations related to chromatin remodeling, excluding the histone modification enzymes; the ‘General Cofactors’ category contains transcription cofactors involved in the initiation or elongation of transcription; and the ‘Cell Cycle’ category contains cell cycle-associated transcription cofactors. Cofactors that do not belong to any of these categories were classified as ‘Other Cofactors’.
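
The reciprocal best-hit step can be sketched in Python as follows. The E-value, coverage and identity cutoffs are those given above; the `BlastHit` record and the choice of bit score to rank hits are assumptions of this sketch, since the text does not state how the best hit is selected.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    query: str
    subject: str
    evalue: float
    identity: float  # percent identity, 0-100
    coverage: float  # percent of the query covered by the alignment, 0-100
    bitscore: float


def passes(hit, max_evalue=1e-4, min_coverage=50.0, min_identity=30.0):
    """Apply the E-value, coverage and identity cutoffs quoted in the text."""
    return (hit.evalue <= max_evalue
            and hit.coverage >= min_coverage
            and hit.identity >= min_identity)


def best_hits(hits):
    """Map each query to its top subject among passing hits.

    Ranking by bit score is an assumption of this sketch.
    """
    best = {}
    for hit in filter(passes, hits):
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def reciprocal_best_hits(human_vs_species, species_vs_human):
    """Return {human_gene: species_gene} pairs that are each other's best hit."""
    forward = best_hits(human_vs_species)
    backward = best_hits(species_vs_human)
    return {h: s for h, s in forward.items() if backward.get(s) == h}
```

Requiring the match in both directions keeps only pairs that are each other's top hit, which reduces the chance of transferring a cofactor annotation to a paralog.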

Gene expression

In AnimalTFDB3.0, we provide gene expression information for TFs and transcription cofactors in 22 species, covering normal tissues, cell lines and cancers in human, and normal tissues and cells in other species. These expression data show that the proportion of expressed TFs varies from 37% to 99%, and that of cofactors from 41% to 100%, across the 22 species (Supplementary Table S3). Human TF and cofactor expression in 16 normal tissues collected from the EBI Expression Atlas (http://www.ebi.ac.uk/gxa/download.html) shows that a total of 52.62% of TFs and 88.15% of cofactors are expressed in the 16 tissues. About 6% of TFs are expressed in only one tissue, whereas this proportion drops to 1% for cofactors.

PPI network

TFs act as important regulators of transcription, and a large number of proteins interact with them directly or indirectly to affect transcription. We obtained PPI data for 19 species from BioGRID (28) and human PPI data from HPRD (29). To display the interactions clearly, we visualized the PPI networks with Cytoscape.js (http://js.cytoscape.org/) (Figure 1C). The two node colors distinguish TFs from other genes, and the edges represent interactions between these proteins and the selected TF.
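
A network of this kind can be handed to Cytoscape.js as a flat list of node and edge elements. The Python sketch below builds such a list for one selected TF; the function name and the `kind` data field (which a stylesheet could map to the two node colors) are hypothetical and are not taken from the AnimalTFDB site.

```python
def build_elements(selected_tf, partners, tf_names):
    """Build a Cytoscape.js-style elements list for one TF and its partners.

    selected_tf: name of the TF at the center of the network
    partners:    iterable of interacting protein names
    tf_names:    set of names known to be TFs (used to label node kind)
    """
    def node(name):
        kind = "TF" if name in tf_names else "other"
        return {"data": {"id": name, "kind": kind}}

    elements = [node(selected_tf)]
    for partner in sorted(set(partners) - {selected_tf}):
        elements.append(node(partner))
        # An element with source and target in its data is treated as an edge.
        elements.append({"data": {"id": f"{selected_tf}-{partner}",
                                  "source": selected_tf,
                                  "target": partner}})
    return elements
```

The returned list can be serialized with `json.dumps` and passed as the `elements` option when the graph is created in the browser.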

TF related GWAS phenotypes

We collected the latest human GWAS data and SNP annotation data from the GWAS Catalog (25) and dbSNP (release 144) (30), respectively. By mapping the phenotype-associated SNPs identified by GWAS to the genomic locations of TF and transcription cofactor genes, we obtained the SNPs located in TFs and cofactors, comprising 2469 TF-SNP pairs (680 TFs) and 1796 cofactor-SNP pairs (538 cofactors) with the corresponding GWAS phenotypes. These data indicate that 40.84% of TFs and 52.49% of cofactors are related to disease phenotypes. For each GWAS SNP located in a TF or cofactor, the position, disease and reference literature are shown on the page (Figure 1E).
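
The mapping of SNPs to gene locations amounts to an interval lookup, sketched below in Python. The tuple layouts, the 1-based inclusive coordinates and the function names are assumptions of this sketch; the text does not describe the actual implementation or whether flanking regions were included.

```python
from collections import defaultdict


def map_snps_to_genes(snps, genes):
    """Return (gene, snp_id, phenotype) for every SNP inside a gene interval.

    snps:  iterable of (snp_id, chrom, position, phenotype)
    genes: iterable of (gene, chrom, start, end), 1-based inclusive coordinates
    """
    by_chrom = defaultdict(list)
    for gene, chrom, start, end in genes:
        by_chrom[chrom].append((start, end, gene))

    pairs = []
    for snp_id, chrom, position, phenotype in snps:
        for start, end, gene in by_chrom.get(chrom, []):
            if start <= position <= end:
                pairs.append((gene, snp_id, phenotype))
    return pairs


def percent_with_phenotype(pairs, total_genes):
    """Percentage of genes that have at least one mapped GWAS SNP."""
    return 100.0 * len({gene for gene, _, _ in pairs}) / total_genes
```

The percentages reported above follow from the same calculation: 680 of 1665 human TFs gives 40.84%, and 538 of 1025 human cofactors gives 52.49%.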

TFBS and its prediction server

TFs regulate gene transcription by binding to specific DNA sequences on target genes. We extracted the TFBS of vertebrate TFs by integrating data from the HOCOMOCO (v11) (21), JASPAR (22), TRANSFAC (version 2017) and CIS-BP (23) databases, which include TFBS for 18,952 TFs (1335 human TFs, 886 mouse TFs and TFs of 49 other species). The MEME Suite (40) was used to draw the logo of each TFBS (Figure 1D). Identifying TF targets is a key step in understanding TF functions. To help users identify TF binding sites in their nucleotide sequences, a TFBS prediction server (http://bioinfo.life.hust.edu.cn/AnimalTFDB/#!/tfbs_predict) was built in the current version. The human TF motif matrices used for prediction were gathered from the TRANSFAC, JASPAR, HOCOMOCO and CIS-BP databases. We also collected TF motifs from hTFtarget (http://bioinfo.life.hust.edu.cn/hTFtarget), which were predicted from ChIP-Seq peaks called with MACS2 (41). The TFBS prediction server scans these TFBS matrices against user input sequences with the motif detection function of the FIMO tool (42) in the MEME Suite. The prediction result shows the TFBS sequence, score, P-value, Q-value and detailed alignment information (Figure 1F).
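
To make the idea of scanning a motif matrix over a sequence concrete, the following Python sketch scores each window of a sequence against a log-odds matrix. It is a simplified stand-in and not FIMO: it scans the forward strand only, uses a uniform background and a pseudocount chosen for illustration, and does not compute the P-values or Q-values that the server reports.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log-odds scoring matrix."""
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, site, score) for forward-strand windows scoring >= threshold."""
    sequence = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in BASES for base in window):
            continue
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 3)))
    return hits
```

A fixed score threshold is used here for brevity; in practice a significance threshold is usually preferred, because raw scores are not comparable between motifs of different widths.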

HumanTFDB web interface

Human is the most intensively studied species. To let users browse and search human TFs directly, we built a separate web interface, the Human Transcription Factor Database (HumanTFDB, http://bioinfo.life.hust.edu.cn/HumanTFDB/). In HumanTFDB, users can browse, search and download human TFs and cofactors. It also retains the ‘Predict TF’, ‘Predict TFBS’ and ‘Blast’ web tools.

DISCUSSION

With the increasing number of sequenced and well-annotated animal genomes, we updated AnimalTFDB to version 3.0 and added several new features. AnimalTFDB3.0 provides TFs and cofactors for 97 animal genomes. Most importantly, the accuracy of TF prediction was improved by adjusting the TF family assignment rules and prediction cutoffs. We compared the human TFs in AnimalTFDB3.0 with the TFs in a recent paper by Lambert et al. (5) and with TRANSFAC data. Of the 1639 TFs in Lambert's paper, 1566 (95.55%) are in AnimalTFDB3.0. The remaining 73 genes (4.45%) are either annotated as ‘Likely to be sequence specific TF’ on their website or lack literature evidence. In contrast, most of the 157 TFs unique to AnimalTFDB3.0 are well-established TFs, such as transcriptional repressors (LRRFIP1, LRRFIP2, MIER1, ID1/2/3/4 etc.) and activators (SMAD2, SMAD6, SMAD7, UBTF, TCF19, TCF25 etc.). Of the 736 human TFs in TRANSFAC, 598 (81.25%) are TFs or cofactors in AnimalTFDB3.0. Most of the remaining 138 genes are not TFs; they include nuclear ribonucleoproteins (HNRNPA1, HNRNPDL and HNRNPL), transporters (ABCG2, SLC22A1, SLC22A3 and SLC6A2) and enzymes (ADAR, HNRNPAB and HSD17B4). These comparisons highlight the accuracy of our TF prediction results. The GWAS phenotype information for human TFs and the TFBS information will provide useful resources for further exploration of TF function and regulation. The TFBS prediction server and PPI network will help users analyze TF targets and their regulatory networks. The HumanTFDB web interface offers convenient access for researchers studying human TFs. Overall, we believe these improvements make AnimalTFDB more comprehensive and more useful. Genomic data for various species will undoubtedly continue to grow, and we will continue to update AnimalTFDB regularly to make it a core resource for TF regulation.
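
A comparison between two TF lists of the kind described above reduces to set operations on gene identifiers. The Python sketch below is illustrative only; the function name and output keys are hypothetical, and it assumes that both lists already use the same identifier system.

```python
def compare_sets(reference, ours):
    """Summarize the overlap between a reference TF list and our own list."""
    reference, ours = set(reference), set(ours)
    shared = reference & ours
    return {
        "shared": len(shared),
        "shared_pct": round(100.0 * len(shared) / len(reference), 2),
        "reference_only": sorted(reference - ours),
        "ours_only": sorted(ours - reference),
    }
```

For example, 1566 shared genes out of 1639 reference TFs corresponds to a shared percentage of 95.55.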
  41 in total

1.  dbSNP: the NCBI database of genetic variation.

Authors:  S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

Review 2.  Classification of human B-ZIP proteins based on dimerization properties.

Authors:  Charles Vinson; Max Myakishev; Asha Acharya; Alain A Mir; Jonathan R Moll; Maria Bonovich
Journal:  Mol Cell Biol       Date:  2002-09       Impact factor: 4.272

Review 3.  rSNP_Guide: an integrated database-tools system for studying SNPs and site-directed mutations in transcription factor binding sites.

Authors:  Julia V Ponomarenko; Galina V Orlova; Tatyana I Merkulova; Elena V Gorshkova; Oleg N Fokin; Gennady V Vasiliev; Anatoly S Frolov; Mikhail P Ponomarenko
Journal:  Hum Mutat       Date:  2002-10       Impact factor: 4.878

Review 4.  Unraveling transcription regulatory networks by protein-DNA and protein-protein interaction mapping.

Authors:  Albertha J M Walhout
Journal:  Genome Res       Date:  2006-10-19       Impact factor: 9.043

Review 5.  Mast cell transcription factors--regulators of cell fate and phenotype.

Authors:  Sagi Tshori; Hovav Nechushtan
Journal:  Biochim Biophys Acta       Date:  2011-01-12

6.  FIMO: scanning for occurrences of a given motif.

Authors:  Charles E Grant; Timothy L Bailey; William Stafford Noble
Journal:  Bioinformatics       Date:  2011-02-16       Impact factor: 6.937

7.  AnimalTFDB: a comprehensive animal transcription factor database.

Authors:  Hong-Mei Zhang; Hu Chen; Wei Liu; Hui Liu; Jing Gong; Huili Wang; An-Yuan Guo
Journal:  Nucleic Acids Res       Date:  2011-11-12       Impact factor: 16.971

8.  Human Protein Reference Database--2009 update.

Authors:  T S Keshava Prasad; Renu Goel; Kumaran Kandasamy; Shivakumar Keerthikumar; Sameer Kumar; Suresh Mathivanan; Deepthi Telikicherla; Rajesh Raju; Beema Shafreen; Abhilash Venugopal; Lavanya Balakrishnan; Arivusudar Marimuthu; Sutopa Banerjee; Devi S Somanathan; Aimy Sebastian; Sandhya Rani; Somak Ray; C J Harrys Kishore; Sashi Kanth; Mukhtar Ahmed; Manoj K Kashyap; Riaz Mohmood; Y L Ramachandra; V Krishna; B Abdul Rahiman; Sujatha Mohan; Prathibha Ranganathan; Subhashri Ramabadran; Raghothama Chaerkady; Akhilesh Pandey
Journal:  Nucleic Acids Res       Date:  2008-11-06       Impact factor: 16.971

9.  PlantTFDB: a comprehensive plant transcription factor database.

Authors:  An-Yuan Guo; Xin Chen; Ge Gao; He Zhang; Qi-Hui Zhu; Xiao-Chuan Liu; Ying-Fu Zhong; Xiaocheng Gu; Kun He; Jingchu Luo
Journal:  Nucleic Acids Res       Date:  2007-10-12       Impact factor: 16.971

10.  Model-based analysis of ChIP-Seq (MACS).

Authors:  Yong Zhang; Tao Liu; Clifford A Meyer; Jérôme Eeckhoute; David S Johnson; Bradley E Bernstein; Chad Nusbaum; Richard M Myers; Myles Brown; Wei Li; X Shirley Liu
Journal:  Genome Biol       Date:  2008-09-17       Impact factor: 13.583
