Jarny Choi1, Chris M Pacheco1, Rowland Mosbergen1, Othmar Korn2, Tyrone Chen1, Isha Nagpal1, Steve Englart1, Paul W Angel1, Christine A Wells1,3.
Abstract
Stemformatics is an established gene expression data portal containing over 420 public gene expression datasets derived from microarray, RNA sequencing and single cell profiling technologies. Developed for the stem cell community, it has a major focus on pluripotency, tissue stem cells, and staged differentiation. Stemformatics includes curated 'collections' of data relevant to cell reprogramming, as well as hematopoiesis and leukaemia. Rather than simply rehosting datasets as they appear in public repositories, Stemformatics uses a stringent set of quality control metrics and its own pipelines to process handpicked datasets from raw files. As a result, about 30% of datasets processed by Stemformatics fail the quality control metrics and never make it to the portal, ensuring that Stemformatics data are of high quality and have been processed in a consistent manner. Stemformatics provides easy-to-use and intuitive tools for biologists to visually explore the data, including interactive gene expression profiles, principal component analysis plots and hierarchical clusters, among others. The addition of tools that facilitate cross-dataset comparisons provides users with snapshots of gene expression in multiple cells and tissues, assisting the identification of cell-type restricted genes or potential housekeeping genes. Stemformatics is freely available at stemformatics.org.
Year: 2019 PMID: 30407577 PMCID: PMC6323943 DOI: 10.1093/nar/gky1064
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Public datasets in Stemformatics: (A) number of datasets by platform type and (B) cumulative number of datasets (y-axis) curated over time, in years (x-axis).
Examples of the most common sample types hosted in Stemformatics. See details in Supplementary Data.
| Blood | Samples | Stem Cells/Other | Samples |
|---|---|---|---|
| Monocyte | 1600 | Mesenchymal stromal cell | 820 |
| T-cell | 208 | Induced pluripotent stem cell | 746 |
| Macrophage | 146 | Acute myeloid leukaemia | 553 |
| Dendritic cell | 172 | Embryonic stem cell | 488 |
| Hematopoietic stem cell | 199 | iPSC-derived neuron | 87 |
Figure 2. Overview of the Stemformatics pipeline from the raw file stage to the online version, including common causes of data processing failure.
Figure 3. Percentage of datasets passing quality control, by journal of publication (not all journals are shown). Numbers on each bar indicate the number of passed/total datasets, and colours indicate the impact factor range of the journal.
Figure 4. The 2D distribution of the average gene-gene pair Pearson correlation coefficient. Each point is a gene-gene pair, positioned by its average correlation coefficient within each platform. The x-axis shows the average correlation across microarray datasets, and the y-axis the average across RNA-seq datasets. The colour gradient indicates areas of high point density (in log scale). The dotted black line is a 1:1 relation for reference. Gene pairs with a consistently high correlation in both microarray and RNA-seq data appear in the top right-hand corner of the plot.
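To make the quantity plotted in Figure 4 concrete, the following is a minimal sketch of how an average gene-gene Pearson correlation per platform could be computed. It assumes each dataset is a genes × samples matrix with genes in the same row order, and that the average is a simple unweighted mean over datasets; the record does not state how Stemformatics performs the averaging. The function names `average_pairwise_correlation` and `gene_pair_points` are hypothetical.

```python
import numpy as np

def average_pairwise_correlation(datasets):
    """Mean gene-gene Pearson correlation matrix over a list of datasets.

    Assumption: every dataset is a 2D array (genes x samples) with the
    same genes in the same row order, and datasets are weighted equally.
    Genes with constant expression in a dataset would produce NaN here.
    """
    total = None
    for expr in datasets:
        corr = np.corrcoef(expr)  # rows are treated as variables (genes)
        total = corr if total is None else total + corr
    return total / len(datasets)

def gene_pair_points(microarray_datasets, rnaseq_datasets):
    """Return (x, y) coordinates, one point per unique gene-gene pair.

    x is the average correlation across microarray datasets and
    y is the average correlation across RNA-seq datasets.
    """
    x = average_pairwise_correlation(microarray_datasets)
    y = average_pairwise_correlation(rnaseq_datasets)
    i, j = np.triu_indices(x.shape[0], k=1)  # upper triangle, no diagonal
    return x[i, j], y[i, j]
```

Pairs for which both coordinates are close to 1 correspond to the top right-hand corner described in the caption.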
Figure 5. A screenshot of the Yugene graph for the MINCR gene (top), with a breakdown of cell types for samples with high Yugene values within the selected window (bottom).
Figure 6. A screenshot of the Rohart MSC Test, available for any human microarray dataset in Stemformatics. (A) Predicted Rohart score for each sample in the dataset, where scores above the prediction region indicate that samples are predicted as MSCs. (B) MSC predictions in Stemformatics for the most common sample types. Not all sample types are shown.