Shuye Pu, Andrei L Turinsky, James Vlasblom, Tuan On, Xuejian Xiong, Andrew Emili, Zhaolei Zhang, Jack Greenblatt, John Parkinson, Shoshana J Wodak.
Abstract
Chromatin modification (CM) plays a key role in regulating transcription, DNA replication, repair and recombination. However, our knowledge of these processes in humans remains very limited. Here we use computational approaches to study proteins and functional domains involved in CM in humans. We analyze the abundance and the pair-wise domain-domain co-occurrences of 25 well-documented CM domains in 5 model organisms: yeast, worm, fly, mouse and human. Results show that domains involved in histone methylation, DNA methylation, and histone variants are remarkably expanded in metazoans, reflecting the increased demand for cell type-specific gene regulation. We find that CM domains tend to co-occur with a limited number of partner domains and are hence not promiscuous. This property is exploited to identify 47 potentially novel CM domains, including 24 DNA-binding domains, whose role in CM has received little attention so far. Lastly, we use a consensus machine learning approach to predict 379 novel CM genes (coding for 329 proteins) in humans based on domain compositions. Several of these predictions are supported by very recent experimental studies and others are slated for experimental verification. Identification of novel CM genes and domains in humans will aid our understanding of fundamental epigenetic processes that are important for stem cell differentiation and cancer biology. Information on all the candidate CM domains and genes reported here is publicly available.
Year: 2010 PMID: 21124763 PMCID: PMC2993927 DOI: 10.1371/journal.pone.0014122
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Venn diagrams illustrating the overlap between experimentally characterized CM genes from various data sources in yeast and human.
Numbers in parentheses denote the number of genes. Refer to the text for the detailed sources of the genes in each set.
Selected known CM domains.
| Pfam_Acc | Pfam_id | Function |
| PF00856 | SET | Protein lysine methyltransferase activity |
| PF08123 | DOT1 | H3K79 methyltransferase activity |
| PF02373 | JmjC | Histone demethylase activity |
| PF02375 | JmjN | Together with JmjC, appears to act as a histone demethylase |
| PF00628 | PHD | Methylated or unmethylated histone H3 binding |
| PF00385 | Chromo | Methylated histone H3 binding |
| PF00567 | TUDOR | Methylated histone binding |
| PF00855 | PWWP | H4K20me binding |
| PF02820 | MBT | Methylated histone binding |
| PF01853 | MOZ_SAS | Histone acetyltransferase activity |
| PF00583 | Acetyltransf_1 | Acetyltransferase activity, GNAT family |
| PF00850 | Hist_deacetyl | Histone deacetylase activity |
| PF02146 | SIR2 | NAD-dependent histone deacetylase activity |
| PF00439 | Bromodomain | Acetylated histone H3, H4 binding |
| PF03366 | YEATS | Putative histone binding domain |
| PF01426 | BAH | H3, H4 tail binding |
| PF00533 | BRCT | Phosphorylated H2A binding |
| PF00145 | DNA_methylase | DNA-binding, DNA methylase activity |
| PF01429 | MBD | Methylated DNA-binding |
| PF00271 | Helicase_C | ATP binding, helicase activity, nucleic acid binding |
| PF00176 | SNF2_N | DNA-binding, ATP binding |
| PF00249 | Myb_DNA-binding | DNA-binding |
| PF04433 | SWIRM | DNA-binding |
| PF00125 | Histone | DNA-binding |
| PF00538 | Linker_histone | DNA-binding |
A total of 25 Pfam domains occurring in well-documented CM proteins were selected as known CM domains (see the text for details). Function annotations of domains were obtained from the Pfam database whenever available, and otherwise from the literature. Numbers in parentheses denote literature references.
Performance of SVM classifiers.
| | Precision | Recall | F-measure | Accuracy |
| Leave-one-out | 0.5424 | 0.5646 | 0.5528 | 0.9489 |
| Re-substitution | 0.6539 | 0.7470 | 0.6967 | 0.9636 |
The re-substitution test examines the self-consistency of the classification method by classifying the training set itself. Precision = TP/(TP+FP), Recall = TP/(TP+FN), F-measure = 2×(Precision×Recall)/(Precision+Recall), Accuracy = (TP+TN)/(TP+FP+TN+FN), where TP = true positive, TN = true negative, FP = false positive, and FN = false negative. The F-measure [81] is the harmonic mean of Precision and Recall, and is a particularly useful performance measure when the dataset is unbalanced such that there are significantly more negative examples than positive ones. We chose not to measure Specificity (= TN/(TN+FP)) because it is less meaningful in such situations.
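The four measures defined above can be computed directly from confusion-matrix counts. The following is a minimal sketch; the function name and the counts in the usage example are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Precision, Recall, F-measure and Accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-measure is the harmonic mean of Precision and Recall.
    f_measure = 2 * (precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical, unbalanced counts (many more negatives than positives).
metrics = classification_metrics(tp=50, fp=40, tn=900, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With counts like these, Accuracy is dominated by the large number of true negatives, whereas the F-measure reflects only how well the positive class is recovered, which is why it is preferred for unbalanced data.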
Figure 2. Expansion in the number of known CM domains in 4 model organisms relative to that in yeast.
On the X-axis, figures in parentheses following each domain denote the number of genes in yeast. The Y-axis shows the fold increase over yeast when the number of domain-containing genes in yeast is non-zero; otherwise (for the MBT, MBD and DNA_methylase domains), it shows the absolute number of domain-containing genes in each organism.
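The Y-axis rule described in this caption can be written as a small function. This is only an illustrative sketch: the function name is hypothetical and the counts in the usage example are illustrative inputs, not values read from the figure.

```python
def expansion_value(count_in_organism, count_in_yeast):
    """Return the value plotted on the Y-axis for one domain in one organism.

    If the domain occurs in yeast, report the fold increase over yeast;
    if yeast has no gene with the domain, report the absolute count instead.
    """
    if count_in_yeast > 0:
        return count_in_organism / count_in_yeast
    return count_in_organism


# Illustrative counts only.
print(expansion_value(39, 8))  # domain present in yeast: fold increase
print(expansion_value(4, 0))   # domain absent in yeast: absolute count
```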
Figure 3. Abundance and combination partners of SET domains in yeast (y), worm (w), fly (f), mouse (m) and human (h) are shown as an illustration of domain neighborhood expansion as a function of domain abundance.
See Table 4 for SET domain abundance values in each organism. The prefix in front of each domain name indicates the source organism. Nodes represent domains and links represent co-occurrence relationship in a single protein. Size of the nodes is proportional to the number of domain-containing proteins in each genome, and nodes are colored red, magenta and green to denote known CM domains, candidate CM domains and non-CM domains, respectively. The figures on each edge indicate the numbers of proteins that contain the linked domain pairs. The thickness of edges is proportional to the Co-occurrence Score of the linked domain pairs (See Materials and Methods for definition of Co-occurrence Score).
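The edge labels in such a network (the number of proteins containing both linked domains) and the set of combination partners of a domain can be derived from per-protein domain compositions. The sketch below assumes a hypothetical input mapping from protein identifiers to lists of Pfam domain names; the toy proteins are invented, and the Co-occurrence Score itself is not computed because its definition is given only in Materials and Methods.

```python
from collections import Counter
from itertools import combinations


def count_domain_pairs(protein_domains):
    """Count, for every unordered domain pair, the proteins containing both."""
    pair_counts = Counter()
    for domains in protein_domains.values():
        # Use a set so repeated copies of a domain in one protein count once.
        for a, b in combinations(sorted(set(domains)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def combination_partners(pair_counts, domain):
    """Return the set of domains that co-occur with `domain` in any protein."""
    partners = set()
    for a, b in pair_counts:
        if a == domain:
            partners.add(b)
        elif b == domain:
            partners.add(a)
    return partners


# Invented toy proteome (hypothetical protein names).
toy = {
    "protA": ["SET", "PHD"],
    "protB": ["SET", "PHD", "Bromodomain"],
    "protC": ["Chromo"],
}
counts = count_domain_pairs(toy)
print(counts[("PHD", "SET")])               # proteins containing both domains
print(combination_partners(counts, "SET"))  # partner domains of SET
```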
Promiscuity of known CM domains in 5 model organisms.
| | Yeast | | | | Worm | | | | Fly | | | | Mouse | | | | Human | | | |
| Domain | Ab | Ap | Sp | P | Ab | Ap | Sp | P | Ab | Ap | Sp | P | Ab | Ap | Sp | P | Ab | Ap | Sp | P |
| SET | 8 | 2 | 5.2 | | 29 | 9 | 9.5 | 0.58 | 19 | 16 | 12.6 | 0.73 | 36 | 21 | 26.2 | 0.39 | 39 | 22 | 31.9 | 0.24 |
| DOT1 | 1 | 0 | 0.5 | | 6 | 0 | 1.6 | | 1 | 0 | 0.5 | | 1 | 1 | 0.5 | | 1 | 1 | 0.6 | |
| JmjC | 4 | 4 | 2.5 | 0.66 | 12 | 7 | 3.5 | 0.75 | 10 | 11 | 6.0 | | 24 | 10 | 16.7 | 0.27 | 24 | 10 | 18.8 | |
| JmjN | 3 | 3 | 1.8 | 0.55 | 2 | 5 | 0.5 | 0.27 | 4 | 6 | 2.1 | 0.65 | 10 | 5 | 6.2 | 0.46 | 9 | 5 | 6.2 | 0.45 |
| PHD | 15 | 9 | 10.3 | 0.48 | 25 | 19 | 8.0 | | 39 | 36 | 28.0 | 0.78 | 81 | 41 | 60.8 | | 87 | 43 | 71.5 | 0.06 |
| Chromo | 2 | 3 | 1.1 | 0.49 | 14 | 10 | 4.1 | | 14 | 13 | 8.9 | 0.78 | 23 | 16 | 16.0 | 0.57 | 26 | 15 | 20.6 | 0.33 |
| TUDOR | NA | NA | NA | NA | 8 | 2 | 2.2 | 0.40 | 15 | 7 | 9.7 | 0.41 | 12 | 9 | 7.8 | 0.66 | 14 | 10 | 10.3 | 0.57 |
| PWWP | 2 | 0 | 1.2 | | 1 | 0 | 0.2 | | 9 | 8 | 5.3 | 0.73 | 23 | 17 | 15.9 | 0.62 | 20 | 17 | 15.4 | 0.64 |
| MBT | NA | NA | NA | NA | 2 | 0 | 0.5 | | 3 | 3 | 1.5 | 0.45 | 9 | 4 | 5.6 | 0.41 | 9 | 4 | 6.2 | 0.36 |
| MOZ_SAS | 3 | 0 | 1.8 | | 4 | 0 | 1.0 | | 5 | 2 | 2.7 | 0.35 | 5 | 3 | 2.8 | 0.43 | 5 | 2 | 3.3 | 0.31 |
| Acetyltransf_1 | 14 | 3 | 9.5 | | 16 | 4 | 4.8 | 0.51 | 26 | 5 | 18.0 | | 30 | 5 | 21.2 | | 21 | 5 | 16.3 | |
| Hist_deacetyl | 5 | 0 | 3.1 | | 8 | 1 | 2.2 | 0.22 | 5 | 1 | 2.7 | | 12 | 1 | 7.6 | | 11 | 1 | 7.9 | |
| SIR2 | 5 | 1 | 3.1 | | 4 | 0 | 1.0 | | 5 | 0 | 2.7 | | 7 | 0 | 4.2 | | 8 | 0 | 5.5 | |
| Bromodomain | 10 | 8 | 6.7 | 0.70 | 15 | 19 | 4.5 | | 17 | 24 | 11.1 | | 38 | 30 | 27.5 | 0.60 | 42 | 30 | 34.4 | 0.38 |
| YEATS | 3 | 0 | 1.8 | | 2 | 0 | 0.5 | | 3 | 0 | 1.5 | | 4 | 0 | 2.3 | | 4 | 0 | 2.5 | |
| BAH | 5 | 4 | 3.2 | 0.64 | 4 | 8 | 1.0 | 0.48 | 6 | 11 | 3.3 | | 10 | 15 | 6.2 | | 11 | 14 | 7.8 | |
| BRCT | 10 | 10 | 6.6 | | 24 | 7 | 7.7 | 0.56 | 12 | 14 | 7.4 | | 19 | 25 | 13.1 | | 21 | 26 | 16.3 | |
| DNA_methylase | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 0 | 0.5 | | 4 | 4 | 2.2 | 0.53 | 4 | 4 | 2.5 | 0.56 |
| MBD | NA | NA | NA | NA | 2 | 4 | 0.5 | 0.27 | 5 | 9 | 2.7 | 0.75 | 11 | 9 | 7.0 | 0.70 | 11 | 9 | 7.8 | 0.66 |
| Helicase_C | 79 | 23 | 55.5 | | 82 | 32 | 30.0 | 0.56 | 77 | 40 | 56.3 | | 106 | 55 | 79.4 | | 114 | 56 | 92.0 | |
| SNF2_N | 17 | 13 | 11.7 | 0.67 | 24 | 14 | 7.6 | | 18 | 19 | 11.9 | | 31 | 24 | 22.1 | 0.60 | 33 | 24 | 26.7 | 0.44 |
| Myb_DNA-binding | 15 | 7 | 10.3 | 0.31 | 13 | 6 | 3.8 | 0.70 | 16 | 10 | 10.5 | 0.55 | 38 | 17 | 27.6 | 0.23 | 37 | 18 | 30.3 | |
| SWIRM | 5 | 2 | 3.1 | 0.36 | 3 | 2 | 0.7 | 0.30 | 3 | 2 | 1.5 | 0.37 | 5 | 4 | 2.9 | 0.54 | 5 | 4 | 3.3 | 0.53 |
| Histone | 11 | 1 | 7.4 | | 74 | 0 | 26.9 | | 98 | 5 | 71.5 | | 91 | 6 | 67.9 | | 86 | 6 | 70.7 | |
| Linker_histone | 1 | 0 | 0.6 | | 8 | 0 | 2.2 | | 24 | 0 | 16.4 | | 11 | 4 | 6.9 | 0.33 | 12 | 3 | 8.7 | |
Promiscuity was estimated using a simulation procedure that allows for domain pair duplication (see the text for details). Ab: abundance, defined as the number of proteins containing the domain in a genome. Ap: actual number of combination partners of a domain. Sp: number of combination partners of a domain obtained in simulations. P: empirical probability of observing at most Ap combination partners during simulation of random combinations. A low P value indicates that a domain has fewer actual combination partners than in most random simulations, and hence that the domain is selective when combining with other domains. For example, in human, the Ap, Sp and P values for the PHD domain are 43, 71.5 and 0.06, respectively; this means that the probability P(Sp≤Ap) = 0.06 or, in other words, that Ap is smaller than 94% of the simulated Sp values. Conversely, a high P value indicates that a domain is promiscuous when combining with other domains. We considered domains with P≤0.2 as selective (marked as underlined in the table) and domains with P>0.8 as promiscuous (marked as bold in the table). Domains with P values between 0.2 and 0.8 do not deviate from a random combination model. “NA” indicates that the domain is lacking in the organism.
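The definition of P and the two cut-offs given above can be sketched as follows. The function names and the list of simulated partner counts are hypothetical; the simulation that generates the Sp values (random domain combinations with domain pair duplication) is described in the text of the paper and is not modelled here.

```python
def empirical_p(ap, simulated_sp):
    """Fraction of simulations whose partner count Sp is at most Ap."""
    at_most = sum(1 for sp in simulated_sp if sp <= ap)
    return at_most / len(simulated_sp)


def classify_domain(p, selective_cutoff=0.2, promiscuous_cutoff=0.8):
    """Apply the thresholds from the table legend to an empirical P value."""
    if p <= selective_cutoff:
        return "selective"
    if p > promiscuous_cutoff:
        return "promiscuous"
    return "consistent with random combination"


# Invented partner counts from ten simulation runs (illustrative only).
simulated = [3, 5, 6, 7, 8, 9, 10, 11, 12, 13]
p = empirical_p(ap=4, simulated_sp=simulated)
print(p, classify_domain(p))
```

In this toy case only one of the ten simulated values is at most Ap, so the domain would be called selective.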
Figure 4. A domain co-occurrence network for known CM domains and their combination partners in human.
Nodes represent domains and each link represents co-occurrence relationship of two domains in proteins. Size of the nodes is proportional to the number of domain-containing proteins in each genome, and nodes are colored red, magenta and green, denoting known CM domains, candidate CM domains and non-CM domains, respectively. The thickness of edges is proportional to the Co-occurrence Score for the linked domain pair (See Materials and Methods for definition of Co-occurrence Score).
Candidate CM domains.
| Domain | Molecular Function | Biological Process | Co-occurring CM domain |
| | Unknown | cell proliferation | |
| | | DNA repair | |
| | | Unknown | |
| | a domain in the histone acetylase PCAF | regulation of transcription, DNA-dependent | |
| | | chromatin modification | |
| | | Unknown | |
| | Required for heterochromatin spreading | Unknown | |
| | | Unknown | |
| | | ATP-dependent chromatin remodeling | |
| | | Unknown | |
| | putative protein interaction domain | regulation of transcription | |
| | | chromatin remodeling | |
| | putative phosphatase | Unknown | |
| | histone binding | Unknown | |
| | ATP binding | DNA replication | |
| | TBP binding to suppress transcription | Unknown | |
| | | Unknown | |
| | Unknown | Unknown | |
| | Single-strand DNA-dependent ATPase | chromatin modification | |
| | | regulation of transcription, DNA-dependent | |
| | | Unknown | |
| | | Unknown | |
| | | Unknown | |
| | putative protein interaction domain | Unknown | |
| | | single strand break repair | |
| | | telomere maintenance via telomerase | |
| | | Unknown | |
| | protein-protein interaction domain | Unknown | |
| | putative protein-protein interaction domain | Unknown | |
| | catalytic domain of LSD1 | Unknown | |
| | RNA binding | mRNA processing, spliceosome assembly | |
| | | Unknown | |
| | Unknown | Unknown | |
| | | Unknown | |
| | Unknown | Unknown | |
| | | Unknown | |
| | | Unknown | |
| | | Unknown | |
| | mediates interaction with H3K79me | Unknown | |
| | Unknown | Unknown | |
| | | regulation of transcription | |
| | DNA helicase | Unknown | |
| | | Unknown | |
| | protein-protein interaction domain | Unknown | |
| | | regulation of transcription | |
| | | regulation of transcription, DNA-dependent | |
| | ATP binding | Unknown | |
The prediction of candidate CM domains was performed as described in the text. Function annotations are largely based on the literature and Pfam database. Co-occurring CM domain: known CM domains that combine with a candidate CM domain in a single protein.
Figure 5. Distribution of log odds ratio (LOR) scores of Pfam domains in the human genome.
LOR score measures enrichment of Pfam domains in known and predicted human CM genes. The vast majority of Pfam domains score less than zero, and are thus not enriched or are even under-represented in human CM genes. However, it is clear that the LOR scores of known CM domains and candidate CM domains skew towards the higher end of the LOR spectrum, indicating that these domains are enriched in human CM genes.
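As an illustration of how a log odds ratio captures enrichment, the sketch below computes a generic LOR from a 2×2 table of gene counts. This is an assumption for illustration only: the exact LOR definition used in the study is not given in this caption, and the function name, the natural logarithm, the 0.5 pseudocount and all counts below are hypothetical choices.

```python
import math


def log_odds_ratio(cm_with, cm_without, other_with, other_without,
                   pseudocount=0.5):
    """Generic log odds ratio for a domain in CM genes versus other genes.

    cm_with / cm_without: CM genes with / without the domain.
    other_with / other_without: non-CM genes with / without the domain.
    A pseudocount (assumed here) avoids division by zero for empty cells.
    """
    a = cm_with + pseudocount
    b = cm_without + pseudocount
    c = other_with + pseudocount
    d = other_without + pseudocount
    return math.log((a / b) / (c / d))


# Invented counts: positive score = enriched, negative = under-represented.
print(log_odds_ratio(20, 80, 5, 95))
print(log_odds_ratio(1, 99, 20, 80))
```

Under this generic definition a score of zero means the domain is equally frequent in both gene sets, which matches the reading of the figure: scores below zero indicate no enrichment or under-representation, and scores above zero indicate enrichment in CM genes.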