Varodom Charoensawan, Derek Wilson, Sarah A Teichmann.
Abstract
Sequence-specific transcription factors (TFs) are important to genetic regulation in all organisms because they recognize and directly bind to regulatory regions on DNA. Here, we survey and summarize the TF resources available. We outline the organisms for which TF annotation is provided, and discuss the criteria and methods used by different databases to annotate TFs. Using genomic TF repertoires from ∼700 genomes across the tree of life, covering Bacteria, Archaea and Eukaryota, we review TF abundance with respect to the number of genes, as well as the structural complexity of TFs in diverse lineages. While typical eukaryotic TFs are longer than the average eukaryotic protein, the inverse is true for prokaryotes. Only in eukaryotes does the same family of DNA-binding domain (DBD) occur multiple times within one polypeptide chain. Reusing DBDs from the same family in this way potentially increases the length and diversity of DNA-recognition sequences. We examined the increase in TF abundance with the number of genes in genomes, using the largest set of prokaryotic and eukaryotic genomes to date. As pointed out before, the number of prokaryotic TFs increases faster than linearly. We further observe a similar relationship in eukaryotic genomes, with a slower increase in TFs.
Year: 2010 PMID: 20675356 PMCID: PMC2995046 DOI: 10.1093/nar/gkq617
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
TF resources currently available
| Database | Annotation | Organism | Comment |
|---|---|---|---|
| GTOP_TF | A/M | Prokaryotes | Covers over 150 prokaryotic genomes |
| BacTregulators* | A/M | Prokaryotes | AraC-XylS and TetR transcription regulator families. Last updated 2004 |
| PRODORIC | M | Bacteria | Contains protein–DNA interaction information |
| RegTransBase | M | Bacteria | Contains protein–DNA interaction information |
| ArchaeaTF | A/M | Archaea | Covers 37 archaeal genomes |
| CoryneRegNet | A/M | Corynebacteria | Contains protein–DNA interaction information |
| cTFbase | A/M | Cyanobacteria | Covers 21 cyanobacterial genomes |
| DBTBS | M | | Contains other literature-curated information for |
| RegulonDB | M | | Contains other literature-curated information for |
| TRANSFAC | M | Eukaryotes | Partially commercial. Licence required to access some restricted areas |
| JASPAR | A/M | Eukaryotes | Contains collections of experimentally defined TF binding sites |
| TrSDB* | A/M | Eukaryotes | Covers nine eukaryotic proteomes. Last updated 2004 |
| ITFP | A | Mammals | Contains TFs and target genes from human, mouse and rat |
| TFcat | M | Mammals | Contains manually curated TFs from human and mouse |
| TFdb* | A/M | Mouse | Based on LocusLink and GO annotations. Last updated 2004 |
| FTFD | A/M | Fungi | Covers 69 fungal and three oomycete genomes |
| PlanTAPDB | A/M | Plants | Contains taxonomic information of transcription-associated protein families |
| PlantTFDB | A/M | Plants | Integrates other plant databases: DPTF (poplar), DRTF (rice), DATF ( |
| PlnTFDB | A/M | Plants | Covers five model plant genomes |
| RARTF | A/M | | TF database devoted to |
| AtTFDB* | A/M | | Sister database, AtsicDB, contains |
| SoyDB | A/M | | Predicts TFs using InterProScan |
| wDBTF | A/M | | Predicts TFs from wheat Expressed Sequence Tags (ESTs) and mRNA |
| TOBFAC | A/M | | Predicts TFs from tobacco gene-space sequence reads (GSRs) |
| FlyTF | M | | TF database devoted to |
| EDGEdb | A/M | | Contains protein–DNA interaction information |
| DBD | A | Cellular organisms | Contains TF predictions of more than 1000 cellular organisms |
The databases fall into three categories: (I) prokaryotic TF databases; (II) eukaryotic TF databases; (III) databases that provide TF annotations for genomes from different superkingdoms. Databases that are no longer developed, or that have not been updated since 2004, are marked with asterisks. The year of the latest update is included in the comment field. Annotation methods are indicated as A (automated) and M (manually curated).
Figure 1. Historical timeline of TF resources. The timeline on the left shows the years of the first publications describing the databases (not to scale). The panel on the right shows how the number of completely sequenced eukaryotic and bacterial genomes has increased according to the Genomes OnLine Database (35). The TF resources are grouped according to their main annotation method (manual curation, automatic plus manual curation, or automatic). They are colored according to the organisms they annotate (blue for Bacteria, green for Archaea, red for Eukaryota and white if the resource covers two or three superkingdoms).
TF repertoires in the three main superkingdoms of life: Bacteria, Archaea, and Eukaryota
| | Bacteria | Archaea | Eukaryota | Cellular organisms |
|---|---|---|---|---|
| Proteins | ||||
| Proteins per species | 3140 | 1966 | 14 141 | 3885 |
| Length of all proteins (residues) | 322 | 289 | 465 | 328 |
| Domains assigned per protein | 1.41 | 1.30 | 1.53 | 1.42 |
| Distinct domain families per protein | 1.33 | 1.25 | 1.29 | 1.32 |
| Length of protein domains (residues) | 180 | 171 | 161 | 177 |
| TFs | ||||
| TFs per species | 131 | 60 | 325 | 155 |
| Distinct architectures per species | 39 | 19 | 45 | 39 |
| Length of TFs (residues) | 242 | 196 | 560 | 253 |
| DBDs per TF | 1.04 | 1.00 | 1.41 | 1.05 |
| Distinct DBD families per TF | 1.00 | 1.00 | 1.01 | 1.00 |
| Length of DBDs (residues) | 62 | 60 | 64 | 62 |
| Partner domains per TF | 0.58 | 0.25 | 0.24 | 0.49 |
| Distinct partner domain families per TF | 0.57 | 0.24 | 0.21 | 0.48 |
| Length of partner domains (residues) | 153 | 97 | 85 | 139 |
| TF content in genome (%) | 4.39 | 2.94 | 2.91 | 3.59 |
| DBDs | ||||
| DBD families | 61 | 15 | 77 | 131 |
| Superkingdom-specific DBD families | 43 | 0 | 69 | |
| Partner domain families | 228 | 55 | 795 | 938 |
| Superkingdom-specific partner domain families | 116 | 12 | 693 | |
| Distinct domain architectures | 605 | 118 | 2209 | 2779 |
| DBDs per species | 109 | 42 | 206 | 122 |
| Distinct DBD families per species | 23 | 12 | 27 | 24 |
Domain assignments are from Pfam. Median values of all species in each lineage are displayed. Mean values and their SDs for each property are described in ‘Supplementary Data’.
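Because each row is a median taken across species, rows cannot in general be combined arithmetically. For example, dividing the bacterial median of 131 TFs by the median of 3140 proteins does not give the 4.39% listed as TF content. The Python sketch below illustrates one possible reason, under the assumption that the percentage is a median of per-species ratios (the table does not state how the row was computed). The per-genome counts in `toy_genomes` are hypothetical and serve only to show that the two quantities can differ.

```python
from statistics import median

# Bacterial medians quoted in the table above.
MEDIAN_TFS = 131
MEDIAN_PROTEINS = 3140
REPORTED_TF_CONTENT = 4.39  # percent, as printed in the table

# Ratio of the two medians, in percent.
ratio_of_table_medians = 100 * MEDIAN_TFS / MEDIAN_PROTEINS

# Hypothetical (proteins, TFs) counts for three made-up genomes.
# These numbers are NOT from the paper; they only illustrate the effect.
toy_genomes = [(2000, 60), (3000, 150), (5000, 140)]


def median_of_ratios(genomes):
    """Median of the per-genome TF percentages."""
    return median(100 * tfs / proteins for proteins, tfs in genomes)


def ratio_of_medians(genomes):
    """Percentage obtained by dividing median TFs by median proteins."""
    med_proteins = median(proteins for proteins, _ in genomes)
    med_tfs = median(tfs for _, tfs in genomes)
    return 100 * med_tfs / med_proteins


if __name__ == "__main__":
    print(f"table: ratio of medians = {ratio_of_table_medians:.2f}%, "
          f"reported TF content = {REPORTED_TF_CONTENT}%")
    print(f"toy data: median of ratios = {median_of_ratios(toy_genomes):.2f}%, "
          f"ratio of medians = {ratio_of_medians(toy_genomes):.2f}%")
```

The same caution applies to the other derived rows, such as DBDs per TF.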
TF repertoires in three major eukaryotic kingdoms: Viridiplantae (plants), Fungi, and Metazoa (animals), plus all eukaryotes combined
| | Viridiplantae | Fungi | Metazoa | Eukaryota |
|---|---|---|---|---|
| Proteins | ||||
| Proteins per species | 27 235 | 9997 | 16 371 | 14 141 |
| Length of all proteins (residues) | 387 | 466 | 479 | 465 |
| Domains assigned per protein | 1.48 | 1.47 | 2.00 | 1.53 |
| Distinct domain families per protein | 1.24 | 1.29 | 1.37 | 1.29 |
| Length of protein domains (residues) | 158 | 185 | 150 | 161 |
| TFs | ||||
| TFs per species | 591 | 203 | 806 | 325 |
| Distinct architectures per species | 77 | 38 | 160.5 | 45 |
| Length of TFs (residues) | 375 | 604 | 545 | 560 |
| DBDs per TF | 1.13 | 1.36 | 2.75 | 1.41 |
| Distinct DBD families per TF | 1.01 | 1.00 | 1.03 | 1.01 |
| Length of DBDs (residues) | 73 | 65 | 56 | 64 |
| Partner domains per TF | 0.23 | 0.14 | 0.40 | 0.24 |
| Distinct partner domain families per TF | 0.20 | 0.13 | 0.35 | 0.21 |
| Length of partner domains (residues) | 90 | 83 | 85 | 85 |
| TF content in genome (%) | 2.12 | 2.53 | 4.65 | 2.91 |
| DBDs | ||||
| DBD families | 38 | 34 | 58 | 78 |
| Kingdom-specific DBD families | 12 | 6 | 26 | |
| DBDs per species | 602 | 152 | 2151 | 206 |
| Distinct DBD families per species | 37 | 25 | 53 | 27 |
The domain assignments are Pfam families. Median values of all species in each lineage are displayed.
Figure 2. (A) TF abundance against the number of genes per genome in different lineages across the tree of life. Each colored dot represents a genome; colors distinguish genomes from different phylogenetic groups. According to a linear model fitted on a log–log scale, TF expansion in bacteria closely follows a power law with an exponent close to quadratic (log T = 1.98 log G − 4.84, R² = 0.87, where T is the number of predicted TFs, G is the number of genes and R² is the coefficient of determination). The TF increase in eukaryotes has a lower exponent and a weaker correlation (log T = 1.23 log G − 2.53, R² = 0.61). (B) The number of unique DBD families increases linearly with the total number of proteins in bacteria (power law exponent = 1.00, R² = 0.71). In contrast, the number of families is largely independent of the number of genes in metazoans (pink, exponent = 0.09, R² = 0.11) and fungi (orange, exponent = 0.13, R² = 0.23). Grey dots represent other eukaryotic species that do not belong to the main kingdoms, such as apicomplexans and euglenozoans.
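To make the fitted relationships in panel A concrete, the following Python sketch evaluates the two regression lines for a given gene count and reports how much the predicted TF count grows when the gene count doubles. It assumes the logarithms are base 10, which the caption does not state; this choice gives TF counts of a plausible magnitude, whereas natural logarithms would not. The function and dictionary names are illustrative only.

```python
import math

# Slope and intercept of log T = a * log G + b, as quoted in the caption.
# Assumption: logarithms are base 10 (not stated in the caption).
FITS = {
    "bacteria": (1.98, -4.84),
    "eukaryotes": (1.23, -2.53),
}


def predicted_tfs(genes, lineage):
    """Predicted number of TFs for a genome with `genes` genes."""
    slope, intercept = FITS[lineage]
    return 10 ** (slope * math.log10(genes) + intercept)


def fold_change_on_doubling(lineage):
    """Factor by which predicted TFs grow when the gene count doubles.

    For a power law T = c * G**a this factor is 2**a, independent of G
    and of the logarithm base used in the fit.
    """
    slope, _ = FITS[lineage]
    return 2 ** slope


if __name__ == "__main__":
    for lineage, genes in [("bacteria", 4000), ("eukaryotes", 20000)]:
        print(f"{lineage}: ~{predicted_tfs(genes, lineage):.0f} TFs predicted "
              f"at {genes} genes; doubling genes multiplies TFs by "
              f"{fold_change_on_doubling(lineage):.2f}")
```

With an exponent near 2, doubling the number of genes in a bacterial genome roughly quadruples the predicted number of TFs, while the eukaryotic exponent of 1.23 gives a factor closer to 2.3. Given the R² values, individual genomes can deviate substantially from these predictions, especially in eukaryotes.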
Figure 3. Examples of lineage-specific DBDs and domain architectures of TFs across the tree of life. DBDs and TF architectures commonly found in different taxonomic groups are projected onto a simplified NCBI taxonomic tree. The DBDs and architectures shown at each taxonomic node are unique to its descendant branches. DBDs are represented by red oblongs, and other protein domains occurring within the same TFs (partner domains) are represented by colored rectangles.