Oluwatosin Oluwadare1, Jianlin Cheng2,3.
Abstract
BACKGROUND: With the development of chromosomal conformation capturing techniques, particularly the Hi-C technique, the study of the spatial conformation of a genome is becoming an important topic in bioinformatics and computational biology. The Hi-C technique can generate genome-wide chromosomal interaction (contact) data, which can be used to investigate the higher-level organization of chromosomes, such as Topologically Associated Domains (TADs), i.e., locally packed chromosome regions bound together by intra-chromosomal contacts. The identification of the TADs for a genome is useful for studying gene regulation, genomic interaction, and genome function.
Keywords: CTCF; Chromosome conformation capturing; Chromosome organization; Clustering; Genome structure; Hi-C; Topologically associated domain (TAD)
Year: 2017 PMID: 29137603 PMCID: PMC5686814 DOI: 10.1186/s12859-017-1931-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Chromosome contact matrix, TADs, and the workflow of ClusterTAD. a The contact matrix of Chromosome 20 of the human embryonic stem cell (hESC). The x and y-axes represent the regions of the chromosome. b Representation of TADs along the main diagonal of a heat map visualizing a 100 × 100 chromosomal contact matrix at 40 KB resolution. The intensity of colors represents the value of interaction frequency in the matrix. The blue squares along the main diagonal denote the identified TADs in the contact matrix. c The workflow of ClusterTAD
Fig. 2 Illustration of the topologically associated domains. a Illustration of the basic elements related to a TAD: domain, border, boundary, and gap. A domain is a TAD. A boundary is the chromosomal region between two consecutive TADs. The border marks the start/end of a domain. A gap is a point with no interaction in the contact matrix. b The calculation of the TAD quality score. Two adjacent TADs are denoted as i and j. The area between TADs i and j that has few interactions is labeled as E. The intra(i) is the average contact frequency within a TAD (e.g. the area marked i). The inter(i, j) is the average contact frequency of the area marked as E. The difference between the two, intra(i) minus inter(i, j), is the quality of TAD i
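The quality score described in Fig. 2b can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `mean_block` and `tad_quality` are hypothetical, TADs are given as inclusive 0-based bin ranges, and area E is assumed to be the off-diagonal block whose rows belong to TAD i and whose columns belong to TAD j.

```python
def mean_block(matrix, rows, cols):
    """Average contact frequency over the block rows x cols."""
    values = [matrix[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


def tad_quality(matrix, tad_i, tad_j):
    """Quality of TAD i relative to the adjacent TAD j.

    tad_i, tad_j: (start, end) inclusive 0-based bin indices (assumed layout).
    Returns intra(i) - inter(i, j) as described in the caption.
    """
    bins_i = range(tad_i[0], tad_i[1] + 1)
    bins_j = range(tad_j[0], tad_j[1] + 1)
    intra = mean_block(matrix, bins_i, bins_i)   # area marked i
    inter = mean_block(matrix, bins_i, bins_j)   # area marked E (assumption)
    return intra - inter


# Toy 4 x 4 contact matrix with two dense 2 x 2 domains on the diagonal.
toy = [
    [10, 10, 2, 2],
    [10, 10, 2, 2],
    [2, 2, 10, 10],
    [2, 2, 10, 10],
]
print(tad_quality(toy, (0, 1), (2, 3)))  # 8.0
```

A larger positive score indicates that contacts inside the domain are much more frequent than contacts with its neighbor.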
Fig. 3 The results on the simulated dataset. a An elbow plot for the clustering results of ClusterTAD on the simulated dataset. The percentage of within-cluster variance is plotted against the number of clusters. The elbow point is at K = 5. b The Davies-Bouldin index (DBI) for the different clustering algorithms. c The Silhouette Index (SI) for the different clustering algorithms. d The average Intra-Inter difference scores for the TADs extracted by ClusterTAD with different combinations of clustering algorithms and distance metrics: HC-euclidean, KM-euclidean, HC-pearson, KM-pearson, HC-cityblock, KM-cityblock, and EM. HC denotes the hierarchical clustering algorithm, KM the K-means algorithm, and EM the expectation maximization algorithm. HC-euclidean represents the combination of the hierarchical clustering algorithm with the Euclidean distance metric
Fig. 4 The visualization of the TADs extracted for one chromosome contact map in the simulated dataset. Rows a to g represent the TADs extracted for K = 4, K = 5 and K = 6 (left, middle, and right columns, respectively) for the following combinations of clustering algorithms and distance metrics: (a) HC-euclidean, (b) KM-euclidean, (c) HC-pearson, (d) KM-pearson, (e) HC-cityblock, (f) KM-cityblock, and (g) EM. HC denotes the hierarchical clustering algorithm, KM the K-means algorithm, and EM the expectation maximization algorithm. HC-euclidean denotes the combination of the hierarchical clustering algorithm with the Euclidean distance metric. A TAD region identified on each contact heatmap is denoted by a blue square within the blue dots along its diagonal. The blue dots represent the boundary of a TAD region. The white squares along the diagonals are unrecognized TADs
The lists of TADs identified by the seven different algorithms in Fig. 4
| Algorithm | K = 4 | K = 5 | K = 6 |
|---|---|---|---|
| a | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (9,14), (15,20), (21,25), and (27,30)}. |
| b | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (9,14), (15,20), (21,25), and (27,30)}. |
| c | {(1,8), (9,14), (15,20), and (21,30)}. | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (15,20), (21,25), and (26,30)}. |
| d | {(1,8), (9,14), (15,20), and (21,30)}. | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (15,20), (21,25), and (26,30)}. |
| e | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (15,20), (21,25), and (26,30)}. |
| f | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (15,20), (21,25), and (26,30)}. |
| g | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (9,14), (15,20), (21,25), and (26,30)}. | {(1,8), (9,14), (15,20), (21,25), and (27,30)}. |
The table contains the lists of TADs extracted for K = 4, K = 5 and K = 6 (left, middle, and right columns, respectively) by the seven algorithms: (a) HC-euclidean, (b) KM-euclidean, (c) HC-pearson, (d) KM-pearson, (e) HC-cityblock, (f) KM-cityblock, and (g) EM. HC denotes the hierarchical clustering algorithm, KM the K-means algorithm, and EM the expectation maximization algorithm. HC-euclidean denotes the combination of the hierarchical clustering algorithm and the Euclidean distance metric. A TAD is represented as (start, end), where “start” is the TAD start region, and “end” is the TAD end region. The best TAD set for the synthetic data is {(1, 8), (9, 14), (15, 20), (21, 25), and (26, 30)}
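One way to read the table is to count how many TADs of the best set each algorithm recovers exactly. The sketch below does this for two entries copied from the table; the fraction-recovered measure and the name `recovered_fraction` are assumptions made for illustration and are not a metric reported in the source.

```python
# Best TAD set for the synthetic data, as stated in the table note.
REFERENCE = [(1, 8), (9, 14), (15, 20), (21, 25), (26, 30)]


def recovered_fraction(predicted, reference=REFERENCE):
    """Fraction of reference TADs reproduced with identical (start, end)."""
    hits = set(predicted) & set(reference)
    return len(hits) / len(reference)


# Two rows taken from the table above.
hc_pearson_k4 = [(1, 8), (9, 14), (15, 20), (21, 30)]              # row c, K = 4
hc_euclidean_k6 = [(1, 8), (9, 14), (15, 20), (21, 25), (27, 30)]  # row a, K = 6

print(recovered_fraction(hc_pearson_k4))    # 0.6
print(recovered_fraction(hc_euclidean_k6))  # 0.8
print(recovered_fraction(REFERENCE))        # 1.0
```

Exact matching is strict: the (27,30) domain in row a at K = 6 differs from the reference by a single bin and is still counted as a miss.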
Fig. 5 Evaluation on a real Hi-C dataset. a The workflow of the iterative application of ClusterTAD. b The average size of TADs identified for the mouse embryonic stem cell by three rounds of clustering of ClusterTAD (ClusterTAD_1, ClusterTAD_2, and ClusterTAD_3). c The average size of TADs identified for the mouse cortex cell by three rounds of clustering of ClusterTAD. d The box plot of the quality scores of TADs extracted for the mouse embryonic stem cell by the three rounds of clustering of ClusterTAD. e The box plot of the quality scores of TADs extracted for the mouse cortex cell by the three rounds of clustering of ClusterTAD
Fig. 6 Comparison of the quality scores, numbers and average sizes of TADs identified by TopDom, DI, and ClusterTAD on two mouse cell lines. a, b The comparison of the intra-inter difference scores; c, d the number of TADs; and e, f the average size of TADs for the mESC and mCortex cells, respectively
Fig. 7 The analysis of the consistency between TADs identified by ClusterTAD and other methods on the two mouse cell lines. a Four different cases in which TADs detected by two different methods are compared with each other. Case A: This refers to the case in which the TAD identified by method B exactly matches the TAD identified by method A. The TADs detected by the two methods have the same boundaries. Case B: This refers to the case in which a TAD detected by method A contains two or more domains detected by method B. The smaller TADs detected by method B are called sub-TADs of the TAD detected by method A. Case C: This represents the conflicting case in which the domain detected by method A does not match or contain the domains detected by method B even though there is some overlap between them. Case D: This refers to the rare case in which the region is not assigned to a TAD by method A, but is assigned to a TAD by method B. b The percentage of TADs detected by ClusterTAD for the mESC cell line that were also detected by TopDom and DI. c The percentage of TADs detected by ClusterTAD for the mCortex cell line that were also detected by TopDom and DI
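The four cases in Fig. 7a can be expressed as a small classifier. The sketch below is one possible reading of the caption, not the procedure used in the source: the names `classify` and `case_d_regions` are hypothetical, TADs are inclusive (start, end) pairs, and Case B is assumed to require that every overlapping method-B domain lies entirely inside the method-A TAD.

```python
def overlaps(x, y):
    """True if inclusive intervals x and y share at least one bin."""
    return x[0] <= y[1] and y[0] <= x[1]


def contains(outer, inner):
    """True if interval inner lies entirely within interval outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def classify(tad_a, tads_b):
    """Label one method-A TAD against all method-B TADs.

    Returns "A" (exact match), "B" (two or more sub-TADs),
    "C" (conflicting overlap), or None when nothing overlaps.
    """
    touching = [b for b in tads_b if overlaps(tad_a, b)]
    if not touching:
        return None
    if tad_a in tads_b:
        return "A"
    if len(touching) >= 2 and all(contains(tad_a, b) for b in touching):
        return "B"
    return "C"


def case_d_regions(tads_a, tads_b):
    """Method-B TADs lying in regions that method A left unassigned (Case D)."""
    return [b for b in tads_b if not any(overlaps(a, b) for a in tads_a)]


print(classify((1, 8), [(1, 8), (9, 14)]))           # A
print(classify((1, 10), [(1, 5), (6, 10)]))          # B
print(classify((1, 10), [(5, 14)]))                  # C
print(case_d_regions([(1, 8)], [(1, 8), (12, 20)]))  # [(12, 20)]
```

Under this reading, a method-A TAD that contains only a single smaller method-B domain falls into Case C; a different convention would need a different rule.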
Fig. 8 The enrichment analysis of active histone modification marks and CTCF binding sites at the domain boundary. The average peak number of active histone modification marks (promoter marks (Polymerase II and H3K4me3) and enhancer marks (H3K4me1 and H3K27ac)) and CTCF binding sites at the boundary regions identified by TopDom, DI and ClusterTAD for the mouse embryonic stem cell line (mESC) (a-e) and the mouse cortex cell line (mCortex) (f-j)