Literature DB >> 35399953

Analysis of Half a Billion Datapoints Across Ten Machine-Learning Algorithms Identifies Key Elements Associated With Insulin Transcription in Human Pancreatic Islet Cells.

Wilson K M Wong¹, Vinod Thorat², Mugdha V Joglekar¹, Charlotte X Dong¹, Hugo Lee³, Yi Vee Chew⁴, Adwait Bhave², Wayne J Hawthorne⁴, Feyza Engin^3,5, Aniruddha Pant², Louise T Dalgaard⁶, Sharda Bapat², Anandwardhan A Hardikar^1,6.

Abstract

Machine learning (ML)-workflows enable unprejudiced/robust evaluation of complex datasets. Here, we analyzed over 490,000,000 data points to compare 10 different ML-workflows in a large (N=11,652) training dataset of human pancreatic single-cell (sc-)transcriptomes to identify genes associated with the presence or absence of insulin transcript(s). Prediction accuracy/sensitivity of each ML-workflow was tested in a separate validation dataset (N=2,913). Ensemble ML-workflows, in particular Random Forest ML-algorithm delivered high predictive power (AUC=0.83) and sensitivity (0.98), compared to other algorithms. The transcripts identified through these analyses also demonstrated significant correlation with insulin in bulk RNA-seq data from human islets. The top-10 features, (including IAPP, ADCYAP1, LDHA and SST) common to the three Ensemble ML-workflows were significantly dysregulated in scRNA-seq datasets from Ire-1αβ-/- mice that demonstrate dedifferentiation of pancreatic β-cells in a model of type 1 diabetes (T1D) and in pancreatic single cells from individuals with type 2 Diabetes (T2D). Our findings provide direct comparison of ML-workflows in big data analyses, identify key elements associated with insulin transcription and provide workflows for future analyses.

Entities: Chemical

Keywords: beta-cell; diabetes; human islet; insulin; machine-learning (ML) algorithms; single-cell RNA-sequencing (scRNAseq)

Mesh：

Substances：
Insulin

Year: 2022 PMID： 35399953 PMCID： PMC8986156 DOI： 10.3389/fendo.2022.853863

Source DB: PubMed Journal: Front Endocrinol (Lausanne) ISSN： 1664-2392 Impact factor: 6.055

Introduction

Recent years have witnessed a surge in single-cell transcriptomic technologies; many already generating newer data and insights to address specific biological questions. Machine learning (ML) algorithms offer an unbiased mathematical workflow that facilitates the identification of complex relationships across variables. ML workflows involve an orderly set of instructions using automated, unbiased ‘learning’ processes usually targeted towards developing (training) a model that can be validated in a separate (test) dataset (1). One goal of ML algorithms is to analyze big data to identify variables that cannot be recognized through conventional biostatistical techniques, and enhance development of predictive algorithms (2, 3). Currently, several ML algorithms are available to researchers handling big data in omics-based high content analyses. These can be broadly divided into two categories: supervised and unsupervised algorithms (4). Supervised methods (such as decision tree) derive relationships between one dependent and multiple independent variables using a training set and then apply that knowledge in the testing set for predictive/efficacy analysis. Unsupervised methods derive patterns/data clusters amongst all available variables. ML algorithms have been used to unravel patterns/clustering in high-density transcriptome analyses (5) or to build associations (6) or for predictions in several biological processes such as determining DNA methylation states in single cells (7), identifying signatures of lipid or metabolite species (8) or microRNAs (9) in predicting transition from gestational diabetes to type 2 diabetes as well as in genetic studies (10). There are multiple ML algorithms available and it may present a challenge to select the most appropriate method for a particular dataset to answer a specific question. We, therefore, decided to compare different ML methodologies to (i) rank different ML methods for their performance on a large dataset (of 490,855,065 scRNA-sequencing data points) and (ii) understand the most important variables associated with insulin transcription. Previous studies (11–14) from several laboratories have identified master regulatory transcription factors that regulate the embryonic development of insulin-producing islet β-cells. Although transcription factor-mediated insulin transcription regulation is a well-known mechanism during the development of insulin-producing cells, it is also recognized that active genes localized on different chromosomal regions can dynamically regulate gene transcription in post-natal life (15). One approach to identify genes associated with insulin gene transcription is through single-cell (sc)RNA-seq-based big data analysis. Here, we examined the performance of 10 different ML algorithms in a curated human pancreatic single-cell sequencing dataset of 490,855,065 data points (N=14,565 single cells and 33,701 expressed gene features). The aims of this study were (i) to provide a comparative account of the predictive potential of 10 different commonly used ML workflows ( ), and (ii) to use existing scRNA-seq datasets in identifying genes (variables) associated with or important for determining insulin transcript-containing cells.

Materials and Methods

Pancreatic Single-Cell (sc)RNA Sequencing Datasets and Analyses

Human Pancreatic Single-Cell Sequencing Datasets

The pancreatic single-cell sequencing dataset (N=14,890) was extracted using the Panc8 data (16) containing multiple publicly available scRNA-seq transcriptomes (GSE84133, GSE85241, E-MTAB-5061,GSE81076, GSE86469). The original publications citing the listed GEO datasets (GSE84133, GSE85241, E-MTAB-5061, GSE81076, GSE86469) add up to a total of 31 pancreas samples across all the studies. Clinical and/or donor details are available for 26 of these samples; seven of which were indicated to be from donors with type 2 diabetes. The number of cell types in the combined single cell dataset (Panc8) can be found in the metadata of the SeuratData (version 0.2.1), using the command “panc8@meta.data” (in R studio version 1.2.5033). This panc8 dataset contains scRNA-seq data from acinar (n=1864), activated stellate (n=474), alpha-(n=4615), beta-(n=3679), delta-(n=1013), ductal (n=1954), endothelial (n=296), epsilon (n=30), gamma (n=625), macrophage (n=79), mast (n=56), quiescent stellate (n=180) and schwann (n=25) cell transcriptomes. Analysis was carried out by using R studio version 1.2.5033 as detailed in SOM.

Ire1αβ-/- Mouse Pancreatic Single-Cell Dataset

Single-cell RNA-seq dataset from pancreatic islets of Ire1αfl/fl (N=1,163 single-cell transcriptomes from one mouse) and Ire1αβ–/– (N=1,683 single-cell transcriptomes from two mice) were obtained through GSE144471 (17). The β-cells (Ire1αfl/fl: 830 cells; Ire1αβ–/–: 816 cells) were separated from the dataset and the expression values of selected genes were evaluated in the β-cell population.

T2D Pancreatic Single-Cell (sc)RNA Sequencing Dataset

Pancreatic single cell normalized read dataset of adult ND (with no diabetes; N=4) and T2D (N=10) donors were obtained from GSE154126 (18). The adult ND (N=296) and T2D (N=505) insulin transcribing cells were compared and used for validation. In this dataset, insulin-transcribing cells were identified and defined as any single cell that contained (non-zero) INS transcript.

Human Pancreatic Single-Cell Sequencing Classification and Analyses

Deidentified datasets were shared with data scientists. A random number generator function was used to allocate 80% of samples to a training set. Analyses were carried using Python (Ver:3.4), wherein the data was imported, transposed, edited to delete INS and INS-IGF2 columns from the data frame and labeled (label=0 where INS=0 and label=1 where INS>0). Classifiers were initialized and model trained using the discovery (80%) data set. Predictive analyses were then carried out on the validation (20%) set and the resulting accuracy metrics were saved to compare the feature importance. Selected classifiers (Random Forest, Gradient Boosting, Decision Tree Classifier, Logistic Regression, Multinomial Naive Bayes Classifier, ADA Boost Classifier, Linear Discriminant Analysis, Ridge Classifier, KNeighbors Classifier and Linear Support Vector Classifier) were used on the same set.

Pancreatic Islet RNA Sequencing Dataset and Analysis

Human Pancreatic Islet Bulk RNA Sequencing Dataset

Human pancreatic islet RNA-seq dataset was obtained from GSE152111 (6). RNA-seq dataset contains n=66 human islet samples, across 65 different donors with no diabetes. Two of the 66 RNA-seq samples were duplicates from the same donor and their RNA-seq profile highly correlative (Pearson r=0.99) to each other. The average of the duplicates of this donor was calculated prior to analysis. Data was analyzed in DEseq values. DEseq values are the normalized RNA-seq. DEseq compares the different read depths between samples by estimating the effective library size (using the estimate size factors function). The size factor for each sample is the median raw count of a gene’s geometric mean across all samples. DEseq normalization involves dividing the raw count of a gene in a sample by the size factor. The implementation of DEseq have been described previously (19).

Pathway Analysis

To analyze enrichment for β-cell pathways, lists of pancreatic single-cell features generated/predicted by ML algorithms (Random forest, Gradient boosting, Decision tree classifier and ADA Boost classifier) were compared with β cell-expressed genes (E-GEOD-20966) using Gene Ontology (GO) over-representation analysis on Pantherdb.org (20). Preanalytical workflows included cleaning up entries not mapping to protein-coding gene symbols in E-GEOD-20966. Gene lists for each ML algorithm consisted of up to the top 100 genes as predictors of insulin expression, which were compared against the data set of β-cell expressed genes (N=13,165 from E-GEOD-20966). Overrepresentation analysis using GO categories for biological processes (GO: BP) was performed using binomial testing using false-detection-rate to correct for multiple testing. Lists of significantly enriched pathways associated with each ML algorithm were compared using Venn diagrams (21).

Statistical Analysis

The R software (ver. 3.6.1; R Foundation for Statistical Computing, Vienna, Austria) was used to create the categorical bubble plot using the packages ggplot2 (3.3.3), ggpubr (0.4.0) and proto (1.0.0). Spearman correlation matrix analysis was generated through using R packages corrplot (0.90), Hmisc (4.6.0), dplyr (1.0.7) and readxl (1.3.1) in R and Rstudio software. Statistical software, Microsoft Excel (ver. 2016; Microsoft, Redmond, WA, USA), the R software and/or GraphPad Prism (ver. 8.4.1; GraphPad Software, San Diego, CA, USA) were used for univariate test comparisons and Benjamini-Hochberg method for multiple testing.

Results

Machine Learning (ML) Algorithms Yield Varying Performance Outputs

The scRNA-seq data were obtained from public databanks (GSE84133, GSE85241, E-MTAB-5061, GSE81076, GSE86469) of human pancreatic single-cell transcriptomes. We first randomized this available pancreatic scRNA-seq transcriptomic data and allocated 80% of samples to a discovery/training set (Training; N=11,652 samples) and remaining into a validation/testing set (Test; N=2,913 samples). With the availability of several ML algorithms ( ), we probed the discovery dataset using 10 different ML workflows ( ) to identify features highly associated with the presence of insulin transcripts in a single cell. Genes (features) identified as the most important/predictive variables for each of these ML workflows were used to identify insulin transcript-containing cells from the validation set (remainder 20% of the samples). Validation results of the identified gene features from each of the 10 ML workflows are presented in the form of receiver operator characteristic (ROC) curves ( ). The top three ML algorithms; Gradient boosting, Random Forest and ADA boost (all Ensemble workflows), demonstrated similar performance returning an Area Under Curve (AUC) of between 0.83 – 0.86. A confusion matrix is presented below each ROC curve dataset ( ) to demonstrate the false-positive and false-negative predictions within every workflow. These analyses show that although Ensemble machine learning workflows are the best in predicting insulin-transcribing cells, other workflows, such as logistic regression, also perform closely to the Ensemble methods.

Figure 1

Study design and performance of different ML workflows. A flowchart of our analytical plan is presented in (A). Previously published datasets of single-cell RNA-sequencing analyses from pancreatic islet cell preparations were randomly divided into a training (N = 11,652) and a validation (N = 2,913) set. The learning phase (Training) involved identifying features (genes) and their associated weights/coefficients in each of the 10 machine learning (ML) methods (listed 1-10). Weighted features were used in the prediction of insulin transcription (across 10 ML algorithms) to test the performance of these models in an independent validation set of samples (N = 2,913). ROC curve plots for each ML algorithm using validation set data are presented in (B). The area under the curve (AUC) for the tested workflows are presented along with a confusion matrix below the plot. Percent values are rounded off to the nearest integer (and hence may not sum up to an absolute 100%) and represent true negative (red), true positive (green), false positive (yellow) and false negative (blue) samples identified in the validation set.

Ensemble ML Workflows to Identify Genes Associated With Insulin Transcription

The scRNA-seq datasets obtained from public databanks of human pancreatic single-cell transcriptomes were classified as insulin-transcribing (1) or those with no insulin (0) ( ). As described earlier, all the three Ensemble ML workflows presented with an AUC that was better than any of the other ML workflows tested in our ROC curve analysis. Ensemble workflows also presented with high accuracy (≥87%), precision (≥0.89), and sensitivity (≥0.95), which was comparable to other popular workflows such as logistic regression ( ). As Ensemble ML workflows such as Random Forest use a collection of decision trees (forest), we decided to compare the performance of the top three (Ensemble) workflows to a single (Decision tree) algorithm. The relative contribution of the top 10 features (genes) from each of these ML workflows are presented as radar plots ( ), whilst the longer list of genes ranked by their importance is presented in . IAPP, ADCYAP1, LDHA and SST were common to all three Ensemble workflows. We then examined the pathways targeted by these features (genes) identified through each of the Ensemble and Decision Tree classifier by comparing them to a separate islet β-cell dataset ( ). Number of GO terms enriched across all four ML workflows ( ) suggests several common pathways (including insulin secretion) targeted by the features identified through these analysis ( and ). These genes were also validated in a bulk RNA-seq dataset (GSE152111, n=66) of human islet samples ( ). In this analysis ( ), most of these gene transcripts had significant positive correlation with insulin transcript. While some of the gene transcripts such as LDHA, CRP, RPS15 and RPL35 negatively correlated with insulin transcript in human islets.

Figure 2

Performance and application of learned features in understanding insulin gene transcription. (A) A 2D clustering of pancreatic single cells assessed in this study using UMAP (Uniform Manifold Approximation and Projection plot). Cellular subtypes based on the UMAP clustering algorithm are labeled and graded (scale, inset) as per the level of insulin gene transcripts. (B) The performance of learning models on accurately identifying insulin-positive (1) and insulin-negative (0) single cells from the validation dataset are presented. (C) Relative weighted rank contributions of the top 10 genes in each of the four listed ML algorithms are presented as spider plots plotted in the order of importance (starting clockwise at 12-O’clock position). Percent representation of each of the genes indicates their relative contribution in the set on the spider plot with a logarithmic scale (center=1% and outer circle=100%). A comparison of the gene features identified by the top three ensemble workflows is presented along with those identified by the Decision Tree classifier. (D) Pathways targeted by up to the top 100 features ( ) from each of the four selected ML methods (RF, Random Forest; GB, Gradient Boosting; ADAB, ADA Boost; DT, Decision Tree) identified using gene ontology (GO) function analysis are presented in the Venn diagram. Number of GO terms enriched and common for top features (genes) in each ML method are plotted. (E) All significantly dysregulated genes identified from and common to the four ML algorithms (C) presented herein were assessed in the scRNA-seq dataset from Ire1αβ-/- mice. Bubble plot presenting fold-change and statistical significance (q-value) for each of the genes in Ire1αfl/fl and Ire1αβ–/– mice are shown. Blue color represents downregulation while red color indicates increased abundance of transcripts in Ire1αβ-/- mice compared to control.

Insulin-Associated Genes Are Dysregulated During β-Cell Dedifferentiation

Dedifferentiation of β-cells, characterized by the loss of expression of key β-cell maturation marker genes with an accompanying reduction in insulin secretion, has been observed in mouse models of type 1 (T1D) and type 2 (T2D) diabetes, as well as in individuals with diabetes (22–25). We questioned if the expression of gene variables identified and validated (in silico) as being associated with insulin gene transcription ( ) are dysregulated in a mouse model of T1D with evidence of islet dedifferentiation. Transient dedifferentiation of islet β-cells was recently reported in an established T1D preclinical mouse model upon β-cell-specific deletion of a key stress response gene, Ire1α, (Ire1αβ-/-) (17). These mice also demonstrated reduced β-cell number as well as diminished expression of insulin transcripts in β-cells compared to control (Ire-1αfl/fl) mice. Therefore, we evaluated the expression of the total 25 gene transcripts that made up the top 10 features across the four different ML workflows ( ) in the single cell datasets generated from these (Ire1αβ-/- and Ire1αfl/fl) islets. Twelve of these features were not significantly different between Ire1αβ-/- and Ire1αfl/fl islets. However, the remaining thirteen features were significantly dysregulated in β-cells of Ire-1αβ-/- mice that were undergoing dedifferentiation ( ). Dedifferentiating β-cells showed significant downregulation of five key genes; Iapp, MafA, Pcsk1n, Atp5e and Ldha, whilst all other insulin-associated gene transcripts showed significantly higher levels ( ). In type 2 diabetes (T2D), it is known that INS transcript expression is reduced. Therefore we validated the top, common gene features (IAPP, SST, MAFA, ADCYAP1 and LDHA) from the three ML workflows using a separate publicly available single-cell RNA-seq dataset from non-diabetic (ND) vs T2D adult human pancreas (GSE154126 (18)). Four of the five genes (IAPP, SST, MAFA, ADCYAP1), were significantly lower in T2D insulin-transcribing cells compared to ND insulin-transcribing cells ( ).

Discussion

In this study, we compared the performance characteristics of 10 different ML algorithms, ( ) that are currently used in big data analyses. We analyzed a scRNA-seq dataset that was randomly split to a larger (80%; 392,684,052 data points) training set involving model learning, and then a smaller (20%; 98,171,013 data points) validation set. All algorithms identified a set of genes (features) that associate with insulin-production (1) defined as the presence of one or more transcripts of insulin in a sample, or no insulin production (0) from the 11,652 single cells analyzed in the training test. We validated the predictive features identified through each ML workflow in the validation/test set of 2,913 single cell transcriptomes. ML workflows that returned high performance (based on AUC, sensitivity/specificity) were selected and the top 10 genes (ranked by their importance) in each of those ML methods were re-validated in discrete mouse and human datasets that model beta cell dedifferentiation (summarized in ).

Figure 3

A summary of study design and results. Workflow and findings of our study are presented in this schematic, which illustrates the steps in discovery and validation of the important gene features associated with insulin transcript in pancreatic single cell transcriptomes. We further confirm these features to be dysregulated during β-cell dedifferentiation in a T1D mouse model and in individuals with T2D. Our analysis provides two major outcomes that are of interest to a broad range of data analysts and biologists. First, a comparison of the ML algorithms identified Ensemble-based ML methods as the best performing algorithms in our analyses. Logistic regression performed closest to Ensemble methods, in line with previous reports in clinical datasets (26). We then compared Ensemble methodologies to the Decision tree algorithm. Decision tree offers the often-desired simplistic model generation method as compared to Ensemble methods such as Random Forest. The latter builds multiple decision trees independently and offers an overall learning model that is closest to the best possible prediction. Indeed, Decision tree was determined to be a weaker predictor than the Random Forest as the latter reduces variance using different sample sets (bootstrap) in training, randomizing feature subsets, and combining the predictive learning by building multiple decision trees. Random Forest prediction outcomes were similar to gradient boosting, which also builds a set of decision trees, but one tree at a time. The bagging and boosting approach used in ADA/Gradient boosting methods seems to have offered better accuracy and performance in insulin prediction analysis than those observed using Random Forest, whereas the Random Forest algorithm offered the highest sensitivity ( ) amongst all methodologies tested. The other outcome from this analysis is the identification of genes that are associated with and predictive of insulin gene transcription in single cells. Since bulk RNA-sequencing studies do not offer the desired single-cell resolution to identify transcriptional regulation at a cellular level, our analyses provide a firsthand view of insulin gene transcriptional determinants identified through an unbiased, big data machine learning approach. The top three methodologies (based on high AUC values) belonged to Ensemble machine learning workflow. Weighted relative importance of the top-10 most important features are compared ( ). Interestingly, five genes were common to the top 10 features from all the algorithms compared – IAPP, ADCYAP1, MAFA, SST and LDHA. The top-ranked gene associated with insulin gene transcription across all the Ensemble workflows was IAPP. Islet amyloid polypeptide (IAPP) and insulin are known to be expressed in pancreatic islet β-cells and co-secreted in response to changes in glucose concentration (27, 28). Their mRNA levels are also regulated by glucose. The promoters of both these genes share similar cis-acting sequence elements, and both bind the master regulatory transcription factor PDX1 ( 27). FoxA2 (HNF-3β) negatively regulates IAPP promoter activity (29) and has also been shown to suppress insulin gene expression (30). Although insulin gene expression is known to be regulated by several islet-enriched transcription factors, MafA is the most well recognized β-cell-specific activator of insulin gene expression (31). The selection of MAFA as a key feature by three of the compared ML approaches tested through this analysis is therefore not surprising. The inclusion of SST in the top three gene features is intriguing. Somatostatin expression is known to be important in control of insulin release and ablation of somatostatin-expressing delta cells impairs pancreatic islet function and cause neonatal death in rodents (32). SST analogs were shown to inhibit the release of insulin via the activation of both ATP sensitive K+ channels and G protein-coupled inward rectifier K+ channels (33). Another candidate that was identified through these analyses is MTRNR2L8, a neuroprotective and antiapoptotic peptide derived from a portion of the mitochondrial MT-RNR2 gene and reported in fetal as well as adult beta cells (34). ADCYAP1 stimulates insulin secretion in a glucose-dependent manner (35) and genetic screening in T2D Caucasians indicated the presence of two SNPs in exons 3 and 5 of this gene to be associated with T2D (36). Finally, LDHA, which was also selected through these unbiased analyses across the top-three ML workflows is a pancreatic β-cell disallowed gene (37–39) and human LDHA levels are predictive of insulin transcription (40). Consistent with these previous reports, our validation analysis in human islets RNA-seq data, demonstrated negative correlation of LDHA and positive correlation of ADCYAP1, MAFA and SST transcripts with insulin ( ). Together, these algorithms help in identifying a set of genes expressed in or disallowed from insulin-producing pancreatic β-cells. Mouse models often provide the validation to understand mechanisms that cannot be tested in human studies. The Ire1αβ-/- mouse offers a unique model, wherein pancreatic β-cells transiently dedifferentiate during early post-natal life, allowing these knockout mice to escape immune-mediated β-cell destruction and T1D in later life (17). Analysis of islet single cell sequencing data from this model identified genes that were significantly dysregulated in β-cells of Ire1αβ-/- mice when compared to control (Ire1αfl/fl) mice. Eight of thirteen features (from the top 10 features in each of the four ML workflows, ), which showed significant dysregulation between Ire1αβ-/- and Ire1αβfl/fl mice are upregulated in β-cells of Ire1αβ-/- mice ( ). Analysis of T2D islet single cell data also revealed down-regulation of four common gene features (IAPP, SST, MAFA and ADCYAP1 identified across our three top ML workflows) in T2D compared to ND insulin transcribing cells ( ). Interestingly, Delta Like Non-Canonical Notch Ligand 1 (DLK1) was also significantly downregulated in T2D compared to ND insulin transcribing cells (Mann-Whitney test P-value=0.0017). The imprinted region of chromosome 14q32.2, contains microRNA cluster of DLK1-MEG3 which are highly expressed and more specific in human β-cells compared to α-cells. Previous study had also shown that in T2D human islets, the MEG3-microRNA locus expression levels are significantly lower (41). The 14q32 locus of microRNAs (such as co-expression of miR-376a and miR-432) also have been shown to target and suppress the expression of IAPP ( 41), that was one of the top features in our analyses.

Strength and Limitation

This is a first demonstration comparing multiple ML algorithms to identify key genes associated with insulin transcription using a big dataset of over 490 million data points. As anticipated, Ensemble methods perform better than most other workflows and identified a set of genes that corroborate with previous reports of transcriptional regulation of insulin in mouse and human β-cells. These findings indicate that unbiased ML workflows for big data analyses can generate biologically meaningful results, when applied to large training datasets. Our study provides the codes/scripts for other researchers to use in existing as well as emerging datasets for identification of gene candidates associated with other genetic pathways (e.g., related to GCG or GCK) in future or to genes recognized to be associated with T2D GWAS datasets. We recognize that there are several limitations: we are unsure as to why some other well known candidates (such as PDX1, and NEUROD) were not selected by our top predictive models. An explanation is that we used a whole pancreatic single cell dataset and that the predictive models generated through filtering out β-cells may be more enriched for known pro-endocrine gene regulators such as PDX1. The other explanation is that although PDX1 is a key regulator, the transcript levels in these datasets using multiple scRNA-seq technologies may not be sufficient considering the sequencing depth offered by some of these scRNA-seq workflows. It would be of interest to explore β-cell factors associated with insulin transcript levels through subset analyses in β-cell types. This is becoming increasingly important to the islet community as differences in insulin transcripts across islet β-cells [i.e., β-cell heterogeneity (42)] may drive optimal β-cell function (43) as well as diabetes progression. We recognize that exhaustive [e.g., LOOCV (44)] as well as non-exhaustive cross-validation approaches [such as K-fold cross-validation (45)] were not performed here. Such cross-validation approaches, although useful in assessing how results will generalize to an independent dataset, are mostly used in the validation of much smaller datasets. In big data analyses, the use of such cross-validation methodologies would limit the analyses to only those with an access to high-end cluster computing. The 10 different ML scripts used in these analyses are designed to work on a high-end personal computing device (i7 processor with 4 cores and 32GB RAM or better). We believe that the application of such ML algorithms to the expanding scRNA-seq datasets would lead to the confirmation/validation of current as well as identification of determinants of gene transcription, thereby accelerating innovation in discovery of gene targets in biology and medicine.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: NCBI, GEO, GSE152111 GSE144471, Panc8 (GSE84133, GSE85241, E-MTAB-5061, GSE81076, GSE86469). The data codes/scripts are available through https://github.com/Isletbiology/ML.

Author Contributions

Conceptualization: AH. Methodology: AH, AP, SB, and LD. Software: VT, MJ, WW, CD, HL, FE, and LD. Validation: WW, MJ, VT, CD, and AB. Data curation: WW, MJ, VT, CD, HL, YC, WH, FE, LD, and AH. Writing—original draft: AH, MJ, WW, and LD. Review and editing: all authors. Visualization: AH. Supervision: AH. Project administration: AH. Funding acquisition: AH. All authors contributed to the article and approved the submitted version.

Funding

The research presented herein has been funded through grants from the Australian Research Council Future Fellowship (FT110100254) the Juvenile Diabetes Research Foundation (JDRF) Australia T1D Clinical Research Network (JDRF/4-CDA2016-228-MB) and the Visiting Professorships (2016-18 and 2019-22) from the Danish Diabetes Academy, funded by the Novo Nordisk Foundation, grant number NNF17SA0031406 to AH. WW is supported through the Leona M. and Harry B. Helmsley Charitable Trust (Grant 2018PG-T1D009) in collaboration with the JDRF Australian Type 1 Diabetes Clinical Research Network funding (Grant 3-SRA-2019-694-M-B) to AH. MJ was supported through a JDRF International Advanced Post-doctoral award (3-APF-2016-178-A-N) and currently a transition award from JDRFI (1-FAC-2021-1063-A-N). AH acknowledges the support through the Ainsworth Medical Research Fund, Western Sydney University School of Medicine, Australia. FE was supported by a grant from the JDRF-5-CDA-2014-184-A-N and currently through grants from NIH NIDDK (1R01DK130919-01 and 1R56DK128136-01). HL is supported by NIH National Research Service Award T32 GM007215.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

43 in total

Review 1. The neuropeptide pituitary adenylate cyclase-activating polypeptide and islet function.

Authors: K Filipsson; M Kvist-Reimer; B Ahrén
Journal: Diabetes Date: 2001-09 Impact factor: 9.461

Review 2. On the origin of the beta cell.

Authors: Jennifer M Oliver-Krasinski; Doris A Stoffers
Journal: Genes Dev Date: 2008-08-01 Impact factor: 11.361

3. Glucose regulates islet amyloid polypeptide gene transcription in a PDX1- and calcium-dependent manner.

Authors: W M Macfarlane; S C Campbell; L J Elrick; V Oates; G Bermano; K J Lindley; A Aynsley-Green; M J Dunne; R F James; K Docherty
Journal: J Biol Chem Date: 2000-05-19 Impact factor: 5.157

4. Pancreas dorsal lobe agenesis and abnormal islets of Langerhans in Hlxb9-deficient mice.

Authors: K A Harrison; J Thaler; S L Pfaff; H Gu; J H Kehrl
Journal: Nat Genet Date: 1999-09 Impact factor: 38.330

5. Transcriptional regulation of the IAPP gene in pancreatic beta-cells.

Authors: Louisa M A Shepherd; Susan C Campbell; Wendy M Macfarlane
Journal: Biochim Biophys Acta Date: 2004-11-24

6. The MafA transcription factor appears to be responsible for tissue-specific expression of insulin.

Authors: Taka-aki Matsuoka; Isabella Artner; Eva Henderson; Anna Means; Maike Sander; Roland Stein
Journal: Proc Natl Acad Sci U S A Date: 2004-02-18 Impact factor: 11.205

7. Differential expression analysis for sequence count data.

Authors: Simon Anders; Wolfgang Huber
Journal: Genome Biol Date: 2010-10-27 Impact factor: 13.583

8. Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia.

Authors: Daniel R Schrider; Julien Ayroles; Daniel R Matute; Andrew D Kern
Journal: PLoS Genet Date: 2018-04-23 Impact factor: 5.917

9. Amino acid and lipid metabolism in post-gestational diabetes and progression to type 2 diabetes: A metabolic profiling study.

Authors: Mi Lai; Ying Liu; Gabriele V Ronnett; Anne Wu; Brian J Cox; Feihan F Dai; Hannes L Röst; Erica P Gunderson; Michael B Wheeler
Journal: PLoS Med Date: 2020-05-20 Impact factor: 11.069

10. Iterative transfer learning with neural network for clustering and cell type classification in single-cell RNA-seq analysis.

Authors: Jian Hu; Xiangjie Li; Gang Hu; Yafei Lyu; Katalin Susztak; Mingyao Li
Journal: Nat Mach Intell Date: 2020-10-05