Wilson K M Wong1, Vinod Thorat2, Mugdha V Joglekar1, Charlotte X Dong1, Hugo Lee3, Yi Vee Chew4, Adwait Bhave2, Wayne J Hawthorne4, Feyza Engin3,5, Aniruddha Pant2, Louise T Dalgaard6, Sharda Bapat2, Anandwardhan A Hardikar1,6.
Abstract
Machine learning (ML) workflows enable unprejudiced and robust evaluation of complex datasets. Here, we analyzed over 490,000,000 data points to compare 10 different ML workflows in a large (N=11,652) training dataset of human pancreatic single-cell (sc-)transcriptomes to identify genes associated with the presence or absence of insulin transcript(s). The prediction accuracy and sensitivity of each ML workflow were tested in a separate validation dataset (N=2,913). Ensemble ML workflows, in particular the Random Forest ML algorithm, delivered high predictive power (AUC=0.83) and sensitivity (0.98) compared to other algorithms. The transcripts identified through these analyses also demonstrated significant correlation with insulin in bulk RNA-seq data from human islets. The top 10 features (including IAPP, ADCYAP1, LDHA and SST) common to the three Ensemble ML workflows were significantly dysregulated in scRNA-seq datasets from Ire1αβ-/- mice, which demonstrate dedifferentiation of pancreatic β-cells in a model of type 1 diabetes (T1D), and in pancreatic single cells from individuals with type 2 diabetes (T2D). Our findings provide a direct comparison of ML workflows in big data analyses, identify key elements associated with insulin transcription and provide workflows for future analyses.
Keywords: beta-cell; diabetes; human islet; insulin; machine-learning (ML) algorithms; single-cell RNA-sequencing (scRNAseq)
Year: 2022 PMID: 35399953 PMCID: PMC8986156 DOI: 10.3389/fendo.2022.853863
Source DB: PubMed Journal: Front Endocrinol (Lausanne) ISSN: 1664-2392 Impact factor: 6.055
Figure 1. Study design and performance of different ML workflows. A flowchart of our analytical plan is presented in (A). Previously published datasets of single-cell RNA-sequencing analyses from pancreatic islet cell preparations were randomly divided into a training (N = 11,652) and a validation (N = 2,913) set. The learning phase (Training) involved identifying features (genes) and their associated weights/coefficients in each of the 10 machine learning (ML) methods (listed 1-10). Weighted features were used in the prediction of insulin transcription (across 10 ML algorithms) to test the performance of these models in an independent validation set of samples (N = 2,913). ROC curve plots for each ML algorithm using validation set data are presented in (B). The area under the curve (AUC) for each tested workflow is presented along with a confusion matrix below the plot. Percent values are rounded to the nearest integer (and hence may not sum to exactly 100%) and represent true negative (red), true positive (green), false positive (yellow) and false negative (blue) samples identified in the validation set.
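The evaluation steps named in the Figure 1 caption (random split, confusion-matrix percentages, sensitivity, AUC) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the fixed seed and the 0.5 decision threshold are assumptions made for illustration, and the labels and scores in the demo are toy values, not study data. The only figures taken from the caption are the set sizes, which correspond to an 80/20 split of 14,565 cells.

```python
import random

def split_cells(n_cells, validation_fraction=0.2, seed=0):
    """Randomly partition cell indices into (training, validation) lists."""
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)  # seed is an illustrative assumption
    n_val = round(n_cells * validation_fraction)
    return indices[n_val:], indices[:n_val]

def confusion_percentages(y_true, y_pred):
    """TN/TP/FP/FN as integer-rounded percentages of all samples.

    Because each value is rounded separately, the four numbers
    need not add up to exactly 100.
    """
    counts = {"TN": 0, "TP": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            counts["TP" if pred == 1 else "FN"] += 1
        else:
            counts["FP" if pred == 1 else "TN"] += 1
    total = len(y_true)
    return {name: round(100 * n / total) for name, n in counts.items()}

def sensitivity(y_true, y_pred):
    """True-positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy demo: 1 = insulin-positive cell, 0 = insulin-negative cell.
train_idx, val_idx = split_cells(11652 + 2913)
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed threshold
print(len(train_idx), len(val_idx))
print(confusion_percentages(toy_labels, toy_preds))
print(sensitivity(toy_labels, toy_preds), roc_auc(toy_labels, toy_scores))
```

The pairwise AUC is quadratic in the number of cells and is used here only because it makes the definition explicit; a rank-based implementation would be preferable at the scale of the validation set.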
Figure 2. Performance and application of learned features in understanding insulin gene transcription. (A) A 2D clustering of pancreatic single cells assessed in this study using UMAP (Uniform Manifold Approximation and Projection). Cellular subtypes based on the UMAP clustering algorithm are labeled and graded (scale, inset) as per the level of insulin gene transcripts. (B) The performance of the learning models in accurately identifying insulin-positive (1) and insulin-negative (0) single cells from the validation dataset is presented. (C) Relative weighted rank contributions of the top 10 genes in each of the four listed ML algorithms are presented as spider plots, plotted in order of importance (starting at the 12 o'clock position and proceeding clockwise). Percent representation of each of the genes indicates their relative contribution in the set on the spider plot with a logarithmic scale (center=1% and outer circle=100%). A comparison of the gene features identified by the top three ensemble workflows is presented along with those identified by the Decision Tree classifier. (D) Pathways targeted by up to the top 100 features from each of the four selected ML methods (RF, Random Forest; GB, Gradient Boosting; ADAB, ADA Boost; DT, Decision Tree), identified using gene ontology (GO) function analysis, are presented in the Venn diagram. The numbers of GO terms enriched and common for top features (genes) in each ML method are plotted. (E) All significantly dysregulated genes identified from and common to the four ML algorithms (C) presented herein were assessed in the scRNA-seq dataset from Ire1αβ-/- mice. A bubble plot presenting fold-change and statistical significance (q-value) for each of the genes in Ire1αfl/fl and Ire1αβ-/- mice is shown. Blue represents downregulation, while red indicates increased abundance of transcripts in Ire1αβ-/- mice compared to control.
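The percent contributions and logarithmic radial scale described for panel (C) can be sketched as follows. This is a hedged illustration only: the caption does not say how the axis mapping was computed, so the `log_radius` formula (0 at 1%, 1 at 100%) is an assumption, and the gene names and importance values are placeholders rather than results from the study.

```python
import math

def relative_contributions(importances):
    """Rank features by importance and express each as a percent of the set."""
    total = sum(importances.values())
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, 100.0 * weight / total) for name, weight in ranked]

def log_radius(percent, floor=1.0, ceiling=100.0):
    """Map a percent onto [0, 1] on a log scale (assumed mapping):
    floor (1%) -> 0 at the center, ceiling (100%) -> 1 at the outer circle."""
    clipped = min(max(percent, floor), ceiling)
    return math.log10(clipped / floor) / math.log10(ceiling / floor)

# Placeholder importances; names and values are hypothetical.
toy_importances = {"gene_a": 6.0, "gene_b": 3.0, "gene_c": 1.0}
for name, pct in relative_contributions(toy_importances):
    print(name, round(pct, 1), round(log_radius(pct), 3))
```

On such a scale a feature contributing 10% sits halfway between the center and the outer circle, which is why low-ranked features remain visible next to a dominant one.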
Figure 3. A summary of study design and results. The workflow and findings of our study are presented in this schematic, which illustrates the steps in discovery and validation of the important gene features associated with insulin transcript in pancreatic single-cell transcriptomes. We further confirm these features to be dysregulated during β-cell dedifferentiation in a T1D mouse model and in individuals with T2D.