
AnimalTFDB 2.0: a resource for expression, prediction and functional study of animal transcription factors.

Hong-Mei Zhang1, Teng Liu1, Chun-Jie Liu1, Shuangyang Song1, Xiantong Zhang1, Wei Liu1, Haibo Jia1, Yu Xue1, An-Yuan Guo2.   

Abstract

Transcription factors (TFs) are key regulators of gene expression. Here we updated the animal TF database AnimalTFDB to version 2.0 (http://bioinfo.life.hust.edu.cn/AnimalTFDB/). Using an improved prediction pipeline, we identified 72 336 TF genes, 21 053 transcription co-factor genes and 6502 chromatin remodeling factor genes from 65 species covering the main animal lineages. In addition to the abundant annotations of the previous version (basic information, gene model, protein functional domain, gene ontology, pathway, protein interaction, ortholog and paralog, etc.), the updated version adds several new features and functions: (i) gene expression from RNA-Seq for nine model species, (ii) gene phenotype information, (iii) multiple sequence alignment of TF DNA-binding domains, with the weblogo and phylogenetic tree based on the alignment, (iv) a TF prediction server to identify new TFs from input sequences and (v) a BLAST server to search against TFs in AnimalTFDB. A new web interface was designed for AnimalTFDB 2.0, allowing users to browse and search all data in the database. We aim to maintain AnimalTFDB as a solid resource for TF identification and studies of transcription regulation and comparative genomics.
© The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2014        PMID: 25262351      PMCID: PMC4384004          DOI: 10.1093/nar/gku887

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Transcription factors (TFs) are key regulators of gene expression in all organisms. They are usually classified into different families by their DNA-binding domains (DBDs). TF genes typically account for more than 5% of the genes in vertebrates and angiosperms (1,2). The human genome is estimated to contain ∼1700 TF genes, more than 7% of its protein-coding genes (3). As with plant TF databases (4–6), several databases cover TFs in one or more animal genomes, such as Riken mouse TFdb (7), FlyTF (8), TFCat (9), TFCONES (10), ITFP (11) and DBD (12). However, all these databases were built before 2010 and have not been updated in recent years. In 2011, we characterized the TF families and constructed a comprehensive animal TF database (AnimalTFDB) (2), which contains TFs, co-factors and chromatin remodeling factors (CRFs) in 50 animal species. AnimalTFDB has been accessed thousands of times and widely used for functional and evolutionary studies. Recent advances in high-throughput transcriptome sequencing (RNA-Seq) provide powerful ways to quantify gene expression in a sample. Many expression data sets are available for different tissues of human and model species, such as those from the human body map project (13), the TCGA project (14) and studies of the evolution of gene expression (15,16). Thus, it is feasible and very useful to explore the expression of TFs from these RNA-Seq data. In the past 3 years, many genomes were sequenced and the number of species in the Ensembl database increased by more than a quarter (17). An updated animal TF database that includes the newly sequenced genomes is therefore needed, as is an online animal TF prediction server.

To meet the requirements of data-driven research, we improved the prediction pipeline and updated AnimalTFDB to version 2.0 (http://bioinfo.life.hust.edu.cn/AnimalTFDB/). Compared with the previous version, AnimalTFDB 2.0 covers more species and new types of annotations, including gene phenotype and expression data in nine species. An online TF prediction server was set up. Multiple sequence alignments of TF DBD sequences and phylogenetic trees for each TF family of every species were also constructed. Taken together, AnimalTFDB 2.0 provides users with comprehensive animal TF lists, annotations and prediction tools.

MATERIALS AND METHODS

Data sources

We downloaded all the protein sequences of 65 animal genomes from Ensembl (version 75) (17) to identify their TFs, transcription co-factors and CRFs. We obtained most of the gene annotations from the NCBI Entrez Gene and Ensembl databases, including basic information, orthologs, paralogs, phenotype, Gene Ontology (GO) and gene model. The protein–protein interaction information was parsed from the BioGRID (18) and HPRD (19) databases. The pathway annotations were extracted from the BioCarta (http://www.biocarta.com/) and KEGG databases. Putative functional domains were identified with the PfamScan program (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/).

AnimalTFDB 2.0 provides rich gene expression information. We downloaded the human gene expression data of cancers, tissues and cell lines from TCGA (https://tcga-data.nci.nih.gov/tcga/findArchives.htm) and EBI Expression Atlas (http://www.ebi.ac.uk/gxa/download.html). The expression data of the human proteome were parsed from two recent Nature papers (20,21). The gene expression data of Drosophila melanogaster and Caenorhabditis elegans were extracted from the data published by Li et al. (22). Our collaborators Drs Yu Xue and Haibo Jia kindly provided the unpublished gene expression data of Danio rerio. We downloaded the raw data for Rattus norvegicus, Bos taurus and Gallus gallus from the NCBI GEO DataSets published by the Burge group (16) and estimated gene expression levels with the TopHat (23) and Cufflinks (24) programs. The gene expression data for Mus musculus and Macaca mulatta were downloaded from RhesusBase (25,26), where they were estimated from the RNA-Seq data published by the Burge (16), Kaessmann (15) and Chuan-Yun Li (25) groups.

Animal TF family and assignment rules

TFs are usually characterized and classified into specific families by their DBDs. After reviewing recently published literature, we added two TF families that were absent from AnimalTFDB 1.0, NCU-G1 and CEP-1; CEP-1 exists only in C. elegans. In addition, the nuclear receptor superfamily was reclassified. In AnimalTFDB 1.0 it was grouped into 12 subfamilies based on InterPro (27) and Pfam (28) annotations; in the updated version, we classified it into seven subfamilies according to the classification method of the nuclear receptor nomenclature committee (29). The nuclear TF Y (NFY) was also classified into three subfamilies based on its three different subunits. AnimalTFDB 2.0 contains 70 TF families, one of which, named ‘Others’, includes some orphan TFs. In most cases, a TF has only one type of DBD, so it is easy to assign it to the correct family. In some cases, however, a TF has more than one type of DBD. To classify such TFs correctly, we checked all human and mouse TFs that have multiple types of DBDs and then set up two rules. First, if a superfamily contains several subfamilies, we classified the TFs based on the subfamily DBD. For example, the Homeobox superfamily has four subfamilies: Pou, CUT, TF_Otx and other Homeobox. In this superfamily, all TFs have a Homeobox domain, and some of them also have one of the Pou, CUT and TF_Otx subfamily signature domains. We assigned these TFs to a specific family based on their subfamily signature domain. Second, if a TF has more than one unrelated DBD, we classified it into the family of the DBD with the smallest E-value. We checked the resulting classification of human and mouse TFs and found the method to be reasonable.
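The two assignment rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `DomainHit` type, the `assign_family` function and the placeholder domain names `DBD_A` and `DBD_B` are hypothetical, and only the Homeobox subfamily names come from the text. The sketch also assumes that a subfamily signature domain takes precedence even when an unrelated DBD is present, which the text does not state.

```python
from dataclasses import dataclass

# Homeobox subfamily signature domains named in the text. A real rule table
# would cover every superfamily; this one is deliberately reduced.
SUBFAMILY_SIGNATURES = {"Homeobox": ("Pou", "CUT", "TF_Otx")}


@dataclass(frozen=True)
class DomainHit:
    domain: str    # DBD name, e.g. "Homeobox"
    evalue: float  # E-value of the domain match


def assign_family(hits):
    """Return one family name for a protein, or None if it has no DBD hit."""
    if not hits:
        return None
    found = {h.domain for h in hits}
    # Rule 1: within a superfamily, the subfamily signature domain decides.
    for superfamily, signatures in SUBFAMILY_SIGNATURES.items():
        if superfamily in found:
            for signature in signatures:
                if signature in found:
                    return signature
    # Rule 2: among unrelated DBDs, keep the one with the smallest E-value.
    return min(hits, key=lambda h: h.evalue).domain


# Toy inputs with invented E-values.
pou_like = [DomainHit("Homeobox", 1e-20), DomainHit("Pou", 1e-10)]
two_dbds = [DomainHit("DBD_A", 1e-5), DomainHit("DBD_B", 1e-12)]
print(assign_family(pou_like))  # subfamily wins despite the larger E-value
print(assign_family(two_dbds))  # smallest E-value wins
```

A protein that carries only the Homeobox domain falls through to the second rule and is reported as `Homeobox`, which corresponds to the "other Homeobox" subfamily in the text.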

TF prediction pipeline

We refined the TF prediction pipeline by updating the hidden Markov model (HMM) profiles of TF DBDs and adjusting the TF family assignment rules. The latest HMM profiles for most DBDs were downloaded from Pfam version 27.0 (28). For the remaining domains without available Pfam HMM profiles, we rebuilt the HMM profiles using the sequences in representative species (human, mouse, zebrafish and fly). To predict TFs, we applied the hmmsearch program in HMMER 3.0 (30) to search all the protein sequences in each species against the HMM profiles with E-value 0.0001 as the cutoff. Then we assigned the TFs into different families according to the above family assignment rules.
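As a rough illustration of the filtering step, the sketch below parses hmmsearch tabular output (the format written with the `--tblout` option) and keeps hits at the stated E-value cutoff of 0.0001. The sample rows, protein names and the `parse_tblout` helper are invented for illustration, and treating the cutoff as inclusive and applying it to the full-sequence E-value are assumptions.

```python
EVALUE_CUTOFF = 1e-4  # cutoff stated in the text

# Invented rows laid out like hmmsearch --tblout output: whitespace-separated,
# target (protein) in column 1, query (profile) in column 3 and the
# full-sequence E-value in column 5. Lines starting with '#' are comments.
SAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias  ...
protA - Homeobox - 1.2e-20 70.1 0.3 1.5e-20 69.8 0.3 1.0 1 0 0 1 1 1 1 made-up protein
protB - Homeobox - 3.0e-03 12.4 0.1 4.1e-03 12.0 0.1 1.0 1 0 0 1 1 1 0 made-up protein
protA - Pou - 5.0e-09 33.0 0.0 6.0e-09 32.7 0.0 1.0 1 0 0 1 1 1 1 made-up protein
"""


def parse_tblout(text, cutoff=EVALUE_CUTOFF):
    """Map each protein to its (profile, E-value) hits that pass the cutoff."""
    hits = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        protein, profile, evalue = fields[0], fields[2], float(fields[4])
        if evalue <= cutoff:
            hits.setdefault(protein, []).append((profile, evalue))
    return hits


for protein, domain_hits in parse_tblout(SAMPLE_TBLOUT).items():
    print(protein, domain_hits)
```

Here `protB` is dropped because its only hit has an E-value above the cutoff, while `protA` keeps two DBD hits and would then go through the family assignment rules.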

Identification of transcription co-factors and CRFs

We also adjusted the identification method for transcription co-factors and CRFs. First, we extracted both for human from Tcof-DB (31) and from the GO database using related GO terms. For transcription co-factors, the GO terms used are ‘transcription coactivator activity’, ‘transcription corepressor activity’, ‘transcription co-factor activity’ and ‘regulation of transcription’. For CRFs, the GO annotations are ‘chromatin remodeling’, ‘chromatin-mediated maintenance of transcription’, ‘histone *ylation’, ‘histone .*ylase activity’ and ‘histone *transferase activity’. After manual curation and removal of redundant genes, 415 transcription co-factors and 142 CRFs were obtained in the human genome. To identify them in the other 64 species, we ran reciprocal best-hit Basic Local Alignment Search Tool (BLAST) searches between human and each of the other species, with thresholds of E-value ≤ 1e-4, coverage ≥ 50% and identity ≥ 30%.
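The reciprocal best-hit step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the `Hit` record and function names are hypothetical, hits are filtered by the three thresholds before the best hit is chosen, the best hit is the one with the highest bit score, and coverage is taken to be query coverage. The text specifies only the threshold values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str
    subject: str
    evalue: float
    identity: float  # percent identity
    coverage: float  # percent of the query covered (assumed definition)
    bitscore: float


def passes(hit):
    """Thresholds from the text: E-value <= 1e-4, coverage >= 50%, identity >= 30%."""
    return hit.evalue <= 1e-4 and hit.coverage >= 50.0 and hit.identity >= 30.0


def best_hits(hits):
    """Map each query to the subject of its highest-scoring passing hit."""
    best = {}
    for hit in hits:
        if not passes(hit):
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return {query: hit.subject for query, hit in best.items()}


def reciprocal_best_hits(forward, backward):
    """Keep pairs where each gene is the other's best hit."""
    fwd, bwd = best_hits(forward), best_hits(backward)
    return {q: s for q, s in fwd.items() if bwd.get(s) == q}


# Invented hits: human genes (hs*) against another species (sp*) and back.
forward = [
    Hit("hsA", "spA", 1e-50, 80.0, 90.0, 200.0),
    Hit("hsA", "spB", 1e-30, 60.0, 80.0, 120.0),
    Hit("hsB", "spC", 1e-3, 40.0, 70.0, 40.0),    # fails the E-value threshold
]
backward = [
    Hit("spA", "hsA", 1e-48, 80.0, 88.0, 195.0),
    Hit("spB", "hsA", 1e-30, 60.0, 80.0, 120.0),
    Hit("spC", "hsB", 1e-10, 45.0, 40.0, 50.0),   # fails the coverage threshold
]
print(reciprocal_best_hits(forward, backward))
```

Only the `hsA`/`spA` pair survives: `spB` also points back to `hsA`, but `hsA` prefers `spA`, so that pair is not reciprocal.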

RESULTS

Genomic repertoires of three kinds of regulatory factors

Using the refined prediction pipeline, we identified 72 336 TFs, 21 053 transcription co-factors and 6502 CRFs in 65 animal species (Table 1). Their numbers and percentages in model species are shown in Table 2. In almost all vertebrates, TF genes make up 5–8.9% of the genes in the genome, whereas the proportion of TFs in invertebrates is less than 5% (Supplementary Table S1). The large increase in TF percentage in vertebrates compared with invertebrates is due to the two rounds of whole genome duplication that occurred in the stem vertebrate lineage, followed by retention of a higher number of TF duplicates (32,33). Among the vertebrates, the zebrafish has the most TF genes (2345) and the highest TF percentage (8.86%), because it retained more TF genes after the additional whole genome duplication (3R) in the teleost ancestor (34,35). In addition, transcription co-factors and CRFs make up on average about 1.8% and 0.6% of the protein-coding genes in vertebrates, which is also higher than in invertebrates.
Table 1.

Comparison of data contents between two versions of AnimalTFDB

| AnimalTFDB | Version 1.0 | Version 2.0 |
| --- | --- | --- |
| Species | 50 | 65 |
| TF families | 72 | 70 |
| TF genes | 52 722 | 72 336 |
| Co-factor genes | 9066 | 21 053 |
| CRF genes | 3476 | 6502 |
| Annotation | | |
| - gene function description | No | Yes |
| - expression | No | Yes |
| - phenotype | No | Yes |
| Multi-alignment of DBDs and their WebLogo | No | Yes |
| Phylogenetic tree | No | Yes |
| TF prediction server | No | Yes |
| BLAST search | No | Yes |

Table 2.

Summary of the expression data and TF numbers of model species in AnimalTFDB 2.0

| Species | Lineage | Expression [a] | TF (%) [b] | Expressed TF (%) [c] | Co-factor (%) [b] | Expressed co-factor (%) [c] | CRF (%) [b] | Expressed CRF (%) [c] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Homo sapiens | Primate | CA (27), TI (16,24), CL (22) | 1691 (7.4%) | 1589 (94.0%) | 462 (2.0%) | 430 (93.1%) | 155 (0.7%) | 140 (90.3%) |
| Macaca mulatta | Primate | TI (11) | 1418 (6.5%) | 964 (68.0%) | 378 (1.7%) | 291 (77.0%) | 118 (0.5%) | 95 (80.5%) |
| Mus musculus | Rodent | TI (10) | 1485 (6.5%) | 1227 (82.6%) | 397 (1.7%) | 390 (98.2%) | 122 (0.5%) | 118 (96.7%) |
| Rattus norvegicus | Rodent | TI (9) | 1375 (6.0%) | 1137 (82.7%) | 382 (1.7%) | 374 (97.9%) | 118 (0.5%) | 116 (98.3%) |
| Bos taurus | Laurasiatheria | TI (9) | 1280 (6.4%) | 1141 (89.1%) | 378 (1.9%) | 376 (99.5%) | 121 (0.6%) | 121 (100.0%) |
| Gallus gallus | Bird | TI (9) | 858 (5.5%) | 769 (89.6%) | 329 (2.1%) | 325 (98.8%) | 98 (0.6%) | 98 (100.0%) |
| Danio rerio | Fish | DS (8) | 2345 (8.9%) | 1756 (74.9%) | 315 (1.2%) | 306 (97.1%) | 100 (0.4%) | 97 (97.0%) |
| Drosophila melanogaster | Insect | TI (29), CL (19), DS (30) | 604 (4.3%) | 594 (98.3%) | 160 (1.1%) | 158 (98.8%) | 53 (0.4%) | 51 (96.2%) |
| Caenorhabditis elegans | Nematoda | TI (4), CT (14), DS (35) | 706 (3.4%) | 684 (96.9%) | 130 (0.6%) | 130 (100.0%) | 40 (0.2%) | 39 (97.5%) |

[a] CA, cancer; TI, tissue; DS, development stage; CL, cell line; CT, cell type. The number in brackets is the number of data sets of that type. The TI (16,24) of human indicates that there are 16 mRNA data sets and 24 protein data sets for human tissue expression data. All other gene expression data are from RNA-seq mRNA expression.

[b] The percentages in brackets are the percentages of TF (co-factor or CRF) genes among the protein-coding genes of the genome.

[c] The percentages in brackets are the percentages of expressed TF (co-factor or CRF) genes.
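The percentages described in footnote c of Table 2 can be reproduced from the counts in the same row. The short check below uses TF counts copied from Table 2; the helper name `expressed_percent` is only illustrative.

```python
# (TF genes, expressed TF genes) copied from Table 2.
TF_COUNTS = {
    "Homo sapiens": (1691, 1589),
    "Macaca mulatta": (1418, 964),
    "Danio rerio": (2345, 1756),
}


def expressed_percent(total, expressed):
    """Share of genes detected as expressed, as a percentage with one decimal."""
    return round(100.0 * expressed / total, 1)


for species, (total, expressed) in TF_COUNTS.items():
    print(f"{species}: {expressed_percent(total, expressed)}%")
```

The three values match the Expressed TF (%) column of Table 2 (94.0%, 68.0% and 74.9%).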


Comprehensive annotations

To construct a comprehensive knowledgebase for animal TFs, we provided rich information for them. Besides the abundant annotations provided in version 1.0, we collected gene function descriptions, gene expression at the mRNA and protein levels, and phenotype data from various public resources and annotated these factors accordingly (Figure 1). By checking the transcription regulation-related GO annotations with experimental evidence codes, we marked the regulators as experimentally validated or putative in seven model species. As a result, we found 426 TFs, 236 co-factors and 37 CRFs with experimental evidence in human. In addition, using the DBD sequences, we built multiple sequence alignments with ClustalW2 (36) and constructed phylogenetic trees for the TFs in each family of each species by applying the neighbor-joining method in the PHYLIP package (37) with 100 bootstrap replicates. The multiple sequence alignments and phylogenetic trees are displayed by Weblogo (38) and Phylogeny.fr (39), respectively (Figure 1A). The phylogenetic trees will help users infer the functions of poorly studied TFs.
Figure 1.

The new annotations and tools in AnimalTFDB 2.0. (A) The multiple sequence alignment of TF DBDs, the weblogo and phylogenetic tree based on the alignment in each TF family. (B) The TF prediction server and examples of prediction result. (C) The BLAST search server. (D) One example of gene expression information. (E) The gene phenotype information.

Gene expression

AnimalTFDB 2.0 provides gene expression information for nine species, covering normal tissues, cell lines, development stages and, in human, cancers (Table 2, Figure 1D). We considered a gene to be expressed if its RPKM ≥ 0.5, following the benchmark set by Xie et al. (40). More than 90% of the co-factors and CRFs were detected as expressed in at least one sample, except in Macaca mulatta, which may be caused by differences in its gene annotation between Ensembl and UCSC. However, the percentage of expressed TFs is lower than that of co-factors and CRFs in most of the species. We also made a general analysis of the TF expression pattern in 16 human normal tissues. More than 50% of TFs are expressed in at least 14 tissues and 32% of TFs are expressed in all 16 tissues, such as YBX1, YBX3, EGR1, ATF4, FOS, JUN and MYC. More than 7% of TFs are expressed in only one tissue, and most of them are expressed at low levels. The number of expressed TFs also differs between tissues, ranging from 800 in liver to 1200 in testis.
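The expression call and the tissue-breadth counts described above can be sketched in a few lines. The RPKM ≥ 0.5 threshold comes from the text; the gene names, tissue values and function names below are invented for illustration.

```python
RPKM_THRESHOLD = 0.5  # a gene counts as expressed at or above this RPKM


def expressed_tissues(rpkm_by_tissue, threshold=RPKM_THRESHOLD):
    """Return the sorted tissues in which a gene reaches the threshold."""
    return sorted(t for t, rpkm in rpkm_by_tissue.items() if rpkm >= threshold)


def breadth_summary(expression):
    """Map each gene to the number of tissues in which it is expressed."""
    return {gene: len(expressed_tissues(profile))
            for gene, profile in expression.items()}


# Toy RPKM values for three hypothetical TFs in three tissues.
toy_expression = {
    "TF_broad": {"liver": 12.0, "testis": 30.5, "brain": 8.1},
    "TF_specific": {"liver": 0.1, "testis": 2.4, "brain": 0.0},
    "TF_silent": {"liver": 0.2, "testis": 0.49, "brain": 0.0},
}

summary = breadth_summary(toy_expression)
print(summary)
print("expressed in every tissue:", [g for g, n in summary.items() if n == 3])
print("expressed in exactly one tissue:", [g for g, n in summary.items() if n == 1])
```

With real data, the same per-gene counts would yield the fractions reported for 16 human tissues, such as the share of TFs expressed in all tissues or in only one.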

TF prediction server

With the development of high-throughput sequencing technology, a growing number of genomes and transcriptomes are being or will be sequenced. A TF prediction server helps users identify TFs from their own protein sequences. We therefore set up a TF prediction server (http://bioinfo.life.hust.edu.cn/AnimalTFDB/prediction.shtml) in AnimalTFDB 2.0 (Figure 1B). The server uses the same prediction method and TF family assignment rules described above. The prediction result page reports the TF family, the alignment E-value and detailed alignment information. Currently, users can upload up to 1000 protein sequences and obtain results from our server within a few minutes.

BLAST search

To help users find homologous genes and explore the functions of poorly studied TFs, we provided a BLAST tool (http://bioinfo.life.hust.edu.cn/AnimalTFDB/blast.shtml) to search against the TFs in our database with protein or DNA sequences (Figure 1C). The protein sequences of all species or of one specific species can be selected as the BLAST database.

SUMMARY AND FUTURE PERSPECTIVES

We have updated AnimalTFDB to version 2.0, which provides TF, transcription co-factor and CRF repertoires in 65 species across 11 lineages. The abundant annotations, gene expression profiles and phylogenetic trees will be useful resources for further exploration of the physiological functions and evolutionary relationships of TFs. In addition, the TF prediction server in AnimalTFDB 2.0 will be helpful for TF identification in newly sequenced genomes. In the future, we will continue to work on this project in the following directions: refining the TF family assignment rules and prediction pipeline, collecting more types of useful annotations for the identified regulators, adding more species when new animal genome data become available and keeping the web interface compact, clear and easy to use. We aim to maintain a comprehensive animal TF database over the long term to provide a solid resource for studies of transcriptional regulation and comparative genomics.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.
  40 in total

1.  WebLogo: a sequence logo generator.

Authors:  Gavin E Crooks; Gary Hon; John-Marc Chandonia; Steven E Brenner
Journal:  Genome Res       Date:  2004-06       Impact factor: 9.043

2.  A genome-wide and nonredundant mouse transcription factor database.

Authors:  Mutsumi Kanamori; Hideaki Konno; Naoki Osato; Jun Kawai; Yoshihide Hayashizaki; Harukazu Suzuki
Journal:  Biochem Biophys Res Commun       Date:  2004-09-24       Impact factor: 3.575

Review 3.  Turning a hobby into a job: how duplicated genes find new functions.

Authors:  Gavin C Conant; Kenneth H Wolfe
Journal:  Nat Rev Genet       Date:  2008-12       Impact factor: 53.242

4.  Clustal W and Clustal X version 2.0.

Authors:  M A Larkin; G Blackshields; N P Brown; R Chenna; P A McGettigan; H McWilliam; F Valentin; I M Wallace; A Wilm; R Lopez; J D Thompson; T J Gibson; D G Higgins
Journal:  Bioinformatics       Date:  2007-09-10       Impact factor: 6.937

5.  ITFP: an integrated platform of mammalian transcription factors.

Authors:  Guangyong Zheng; Kang Tu; Qing Yang; Yun Xiong; Chaochun Wei; Lu Xie; Yangyong Zhu; Yixue Li
Journal:  Bioinformatics       Date:  2008-08-19       Impact factor: 6.937

6.  The gain and loss of genes during 600 million years of vertebrate evolution.

Authors:  Tine Blomme; Klaas Vandepoele; Stefanie De Bodt; Cedric Simillion; Steven Maere; Yves Van de Peer
Journal:  Genome Biol       Date:  2006-05-24       Impact factor: 13.583

7.  Phylogeny.fr: robust phylogenetic analysis for the non-specialist.

Authors:  A Dereeper; V Guignon; G Blanc; S Audic; S Buffet; F Chevenet; J-F Dufayard; S Guindon; V Lefort; M Lescot; J-M Claverie; O Gascuel
Journal:  Nucleic Acids Res       Date:  2008-04-19       Impact factor: 16.971

8.  PlantTFDB: a comprehensive plant transcription factor database.

Authors:  An-Yuan Guo; Xin Chen; Ge Gao; He Zhang; Qi-Hui Zhu; Xiao-Chuan Liu; Ying-Fu Zhong; Xiaocheng Gu; Kun He; Jingchu Luo
Journal:  Nucleic Acids Res       Date:  2007-10-12       Impact factor: 16.971

9.  TFCONES: a database of vertebrate transcription factor-encoding genes and their associated conserved noncoding elements.

Authors:  Alison P Lee; Yuchen Yang; Sydney Brenner; Byrappa Venkatesh
Journal:  BMC Genomics       Date:  2007-11-29       Impact factor: 3.969

10.  DBD--taxonomically broad transcription factor predictions: new content and functionality.

Authors:  Derek Wilson; Varodom Charoensawan; Sarah K Kummerfeld; Sarah A Teichmann
Journal:  Nucleic Acids Res       Date:  2007-12-11       Impact factor: 16.971

