Bo Xia1,2,3,4, Dongyu Zhao1,2,3,4, Guangyu Wang1,2,3,4, Min Zhang2,3,4, Jie Lv1,2,3,4, Alin S Tomoiaga5, Yanqiang Li1,2,3,4, Xin Wang1,2,3,4, Shu Meng2,3,4, John P Cooke2,3,4, Qi Cao6, Lili Zhang7,8,9, Kaifu Chen10,11,12,13.
Abstract
Conversion between cell types, e.g., by induced expression of master transcription factors, holds great promise for cellular therapy. However, our ability to manipulate cell identity is constrained by incomplete information on cell identity genes (CIGs) and the regulation of their expression. Here, we develop CEFCIG, an artificial intelligence framework to uncover CIGs and further define their master regulators. On the basis of machine learning, CEFCIG reveals unique histone codes for transcriptional regulation of reported CIGs, and utilizes these codes to predict CIGs and their master regulators with high accuracy. Applying CEFCIG to 1,005 epigenetic profiles, our analysis uncovers the landscape of the regulatory network for identity genes in individual cell or tissue types. Together, this work provides insights into cell identity regulation, and delivers a powerful technique to facilitate regenerative medicine.
Year: 2020 PMID: 32483223 PMCID: PMC7264183 DOI: 10.1038/s41467-020-16539-4
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 CIGdiscover uncovers cell identity genes (CIGs) on the basis of the unique histone codes for their transcriptional regulation.
a, b The number of curated CIGs for each category (a) or cell type (b). c Heatmap to show −log10 P-value for the difference in histone modification features between CIGs and random control genes. P-values were determined by Wilcoxon test. d Bar plots to show importance, as determined by coefficient, for each individual feature used by the logistic regression model in CIGdiscover. e The probability of being a CIG, as calculated by CIGdiscover, plotted for each curated known CIG and control gene. f ROC curves to show accuracy of CIG prediction using CIGdiscover or other methods. g, h Bar plots to show enrichment of endothelial (g) or cardiovascular (h) related pathways in CIGs uncovered by CIGdiscover or in highly expressed genes for HUVECs. i Endothelial CIG score determined by CIGdiscover for each gene. Predicted EC CIGs with reported endothelial functions are marked in orange. Predicted EC CIGs without reported endothelial functions are marked in blue. Genes not predicted to be EC CIGs are in gray. j Summary of experimental verification results for predicted CIGs, positive control genes, and negative control genes.
Fig. 2 CIGdiscover is robust and thus resilient to small or noisy training datasets.
a ROC curves to show the performance of CIGdiscover and its variants trained on data from different numbers of cell types. b Scatterplot to show the ROC AUC values for CIGdiscover variants trained on data from different numbers of cell types. c ROC curves to show the performance of CIGdiscover and its variants trained with smaller numbers of known CIGs. d Scatterplot to show the ROC AUC values for CIGdiscover variants trained with smaller numbers of CIGs. e Heatmap to show ROC AUC values for CIGdiscover variants trained on one category of CIGs but tested on each of the other categories. f ROC curves to show the performance of CIGdiscover and its variants that each utilized all features of only one type of histone modification. g Bar plot to show P-values for the difference in performance between CIGdiscover and its variants that each utilized a smaller combination of histone modifications and RNA expression. P-values labeled in b and c indicate the difference between ROC curves for CIGdiscover and the associated variants.
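The ROC AUC values reported in these panels can be read as the probability that a randomly chosen known CIG receives a higher score than a randomly chosen control gene. The following is a minimal sketch of that pairwise definition, not the authors' code; the function name `roc_auc` and the example scores and labels are hypothetical and chosen only for illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via its pairwise (rank) definition.

    Returns the fraction of (positive, negative) pairs in which the
    positive has the higher score; tied pairs count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: label 1 = known CIG, label 0 = control gene.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(f"ROC AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the variants in panels b, d, and e are compared.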
Fig. 3 CIGnet as a network model to uncover master transcription factors in the regulation network of CIGs.
a, b Box plots to show the average number of network edges between genes (a) and the average number of transcription factors (b) within each gene group. For fair comparison, each gene group was defined to have the same number of genes. c Box plots to show CIG scores (left), number of parental edges (middle), and number of children edges (right) associated with reported master transcription factors and random control transcription factors. Box plots: center line is median, boxes show first and third quartiles, whiskers extend to the most extreme data points that are no more than 1.5-fold of the interquartile range from the box. P-values were determined by Wilcoxon’s test. d Bar plot to show the coefficient for each network feature utilized by the regression model in CIGnet. e ROC curves to show the performance of CIGnet and other conventional methods in recapturing known master transcription factors. f Cumulative percentage of genes plotted against cell-type specificity of gene features. g Bar plot to show changes in the percentage of induced endothelial cells derived from PSCs in which individual master transcription factors were disrupted using the CRISPR-Cas9 system, relative to that from wild-type PSCs. Two gRNAs, g1 and g2, were tested for each gene. A T7 endonuclease cleavage assay was used to confirm the cutting efficiency of the CRISPR-Cas9 system. P-values were determined by Student’s t-test. *P < 0.05. Source data are provided as a Source Data file.
Fig. 4 A comprehensive landscape of CIGs uncovered by CEFCIG.
a A cartoon to show the CEFCIG framework for uncovering cell identity genes and their master transcription factors. b Heatmap to show the specificity of CIGs from 57 cell types. c Heatmap to show the enrichment of individual pathways in CIGs defined for individual cell types. d Scatterplot to show −log10 enrichment P-value of cell-type-specific pathways in identity genes and highly expressed genes for each representative cell type. Each dot indicates one cell type. e, f One-dimensional scatterplots for network edge number (e) and closeness score (f) between CIGs from the same cell type (internal) or from different cell types (external). g–k Box plots to show CIG score (g), number of parental edges (h), number of children edges (i), number of parent transcription factors (j), and number of children transcription factors (k) associated with individual transcription factor groups. P-values were determined by Wilcoxon’s test (g–k). ***P < 0.001. Box plots: center line is median, boxes show first and third quartiles, whiskers extend to the most extreme data points that are no more than 1.5-fold of the interquartile range from the box.
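The box plot convention stated in the captions (median, quartile box, whiskers reaching the most extreme data points within 1.5-fold of the interquartile range from the box) can be illustrated with a short sketch. This is not the authors' plotting code: the function name `box_plot_stats`, the sample values, and the quartile interpolation rule (Python's `method="inclusive"`) are assumptions, since the captions do not say how quartiles were computed.

```python
import statistics


def box_plot_stats(values):
    """Summary statistics for a box plot with 1.5 x IQR whiskers."""
    data = sorted(values)
    # Assumption: linear-interpolation quartiles ("inclusive" method).
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observed points inside the fences.
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical values, e.g. edge counts for a group of transcription factors.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
for name, value in stats.items():
    print(name, value)
```

Note that the whiskers end at observed data points rather than at the fences themselves, which matches the wording of the captions.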